On Wed, Nov 4, 2009 at 10:14 AM, Phani Bhushan Tholeti <[email protected]> wrote:
> On Tue, Nov 3, 2009 at 18:33, narendra sisodiya
> <[email protected]> wrote:
>>
>> I have a file downloaded from YouTube and saved as a local file.
>> Now I want to extract the links from it. The links inside the file
>> look like this:
>>
>> href="/watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=19"
>> href="/watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=17"
>>
>> etc.
>> When I grep for --> href="/watch <-- I get the matching lines, but
>> the result also contains a lot of data that I do not need.
>> I want only the strings (not the whole line) that have the format
>> href="/watch?v= ************** "
>> and I want to store all such links in another text file. I have seen
>> some tutorials on regular expressions, but I think it is better to
>> ask, as I am making some basic mistake in forming the regular
>> expression.
>>
>> I have also attached the input.txt file.
>>
>> The output file must look like this:
>>
>> /watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=1
>> /watch?v=m3gMgK7h-BA&feature=PlayList&p=06D0D25CEA35E441&index=2
>> /watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=3
>> .....
>> and so on.
>>
>
>
> I think this should do:
>
> First the regex:
>
> regex:
> href="\/watch\?v=[^"]+
>     href="\/watch\?v=   matches any string that contains href="/watch?v=
>     [^"]+               one or more characters, none of which is a
>                         double quote (")
>
> Now the implementation:
>
> gawk -F'"' '{if(/href="\/watch\?v=[^"]+"/) for(i=1;i<=NF;i++) if($i ~ /^\/watch\?v=/) print $i}' <filename>
>
> I don't know how to print just the matched part in awk.
> There's a match(string, regex) function, but it is restricted in the
> sense that you get info only about the first match (position and
> length, in RSTART and RLENGTH), so to get all the links in a line, one
> has to iterate over the rest of the line until the regex is no longer
> found (match returns zero).
> So if you are sure that there is only one link per line (i.e., up to
> \n), then you can do this:
>
> print substr($0, match($0, /href="\/watch\?v=[^"]+"/), RLENGTH)
>
> You might want to add 6 to the return value of match (to skip the
> leading href=") and subtract 7 from RLENGTH (to also drop the closing
> quote) to get the output you want.
>
> If you are using gawk without compatibility mode, then you might try
> this:
>
> match($0, /href="(\/watch\?v=[^"]+)"/, matchedParts)
> print matchedParts[1]
> # again assuming only one link per line; otherwise you will have to
> # call match() again on the rest of the line
>
>
> Using the Python regex module or Perl, you can directly print the part
> in parentheses from
> href="(\/watch\?v=[^"]+)"
> using group references. That would bring everything down to one line.
> But somehow I prefer awk over Perl, and Python for such small things
> is a waste of time (esp. for one-off scripts like this) ......
>
> So hope this helps..........
>
> --
> Lots o' Luv,
> Phani Bhushan

Thanks a lot. This will surely help me.
Just as a side note, the purpose of the whole exercise is to download all the videos from a YouTube playlist, convert them into ogv files, and finally combine them into a single ogv file for a playlist.
Let me code it tonight.

Thanks again,
--
Narendra Sisodiya ( नरेन्द्र सिसोदिया )
Web : http://narendra.techfandu.org
Twitter : http://tinyurl.com/dz7e4a
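
For comparison, the Python route mentioned in the reply (printing the parenthesised group directly) could look roughly like the sketch below. It is only an illustration, not something posted in the thread: the names LINK_RE, extract_links and sample are made up here, and the sample string merely reuses the two links quoted in the question. It assumes the links appear in the file exactly as shown, with a plain double quote closing each href.

```python
import re

# Same pattern as in the thread, with a capture group around the link itself.
# '/' needs no escaping in Python, so only '?' is escaped.
LINK_RE = re.compile(r'href="(/watch\?v=[^"]+)"')


def extract_links(text):
    """Return every /watch?v=... link found in text, in order of appearance."""
    # With exactly one capture group, findall returns the group contents,
    # and it keeps going after the first match, so several links on one
    # line are all picked up.
    return LINK_RE.findall(text)


# Illustrative input: two links on a single line, taken from the question.
sample = (
    '<a href="/watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=19">a</a>'
    '<a href="/watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=17">b</a>'
)

for link in extract_links(sample):
    print(link)
```

To run this against the saved page, one would read input.txt into a string and pass it to extract_links instead of sample, then redirect the printed lines into the output file.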
