[LUG@IITD:5449] Re: Wnat Help on grep regular expression {sed or awk also welcome}

narendra sisodiya Tue, 03 Nov 2009 21:04:28 -0800

On Wed, Nov 4, 2009 at 10:14 AM, Phani Bhushan Tholeti <[email protected]> wrote:
> On Tue, Nov 3, 2009 at 18:33, narendra sisodiya
> <[email protected]> wrote:
>>
>> I have a file that I downloaded from YouTube and saved as a local file.
>> Now I want to extract the links from it. The links inside the file look
>> like this:
>>
>> href="/watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=19"
>> href="/watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=17"
>>
>> etc.
>> When I grepped for -->  href="/watch <--
>> I got the matching lines, but the result also contains a lot of data that
>> I do not need.
>> I want only the strings (not the whole lines) that have the format
>>          href="/watch?v= ************** "
>> and I want to store all such links in another text file. I have read
>> some tutorials on regular expressions, but I think it is better to ask, as I
>> am making some basic mistake in forming the regular expression.
>>
>> I have also attached the input.txt file.
>>
>> Output file must be like this.
>>
>> /watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=1
>> /watch?v=m3gMgK7h-BA&feature=PlayList&p=06D0D25CEA35E441&index=2
>> /watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=3
>> .....
>> and so on.
>>
>
>
> I think this should do:
>
> First the regex:
>
> regex:
> href="\/watch\?v=[^"]+
> href="\/watch\?v=    matches the literal text href="/watch?v=
> [^"]+                       one or more characters, none of them a double
> quote (")
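
A minimal sketch of the same pattern used with grep alone, assuming a grep that supports -o (GNU and BSD grep both do). The function name extract_links and the file names are only illustrative. The pattern is written as a basic regular expression, where ? is already literal and "one or more" is spelled [^"][^"]*.

```shell
# Hypothetical helper: print every /watch?v=... link found in the file "$1".
# grep -o prints only the matched text, one match per output line;
# sed then drops the leading href=" and the trailing double quote.
extract_links() {
    grep -o 'href="/watch?v=[^"][^"]*"' "$1" |
        sed -e 's/^href="//' -e 's/"$//'
}
```

It could be called as: extract_links input.txt > links.txt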
>
> Now the implementation:
>
> gawk -F'"' '{if(/href="\/watch\?v=[^"]+"/) for(i=1;i<=NF;i++) if($i ~
> /^\/watch\?v=/) print $i}' <filename>
>
> I don't know of a direct way to print just the matched part in awk.
> There's a match(string, regex) function, but it is restricted in the
> sense that it reports only the first match (its position in RSTART and its
> length in RLENGTH), so to get all the links in a line, one has to search the
> rest of the line again until the regex is no longer found (match returns zero).
> So if you are sure that there is only one link per line (i.e., up to \n), then
> you can do this:
>
> print substr($0, match($0, /href="\/watch\?v=[^"]+"/), RLENGTH)
> To get exactly the output you want, add 6 to the value returned by match (to
> skip href=") and subtract 7 from RLENGTH (to also drop the closing quote).
>
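
For lines that may hold more than one link, the repeated-match idea can be sketched as below. This is only an illustration under stated assumptions: the function name extract_all_links is made up, and the script sticks to POSIX awk features (match, RSTART, RLENGTH, substr), so it should not need gawk.

```shell
# Hypothetical helper: print every /watch?v=... link in the file "$1",
# even when a single line holds several of them.
extract_all_links() {
    awk '{
        rest = $0
        # match() reports only the leftmost match, so keep cutting the
        # consumed text off the front until it returns 0.
        while (match(rest, /href="\/watch\?v=[^"]+"/)) {
            # Skip the 6 characters of href=" and drop the closing quote.
            print substr(rest, RSTART + 6, RLENGTH - 7)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }' "$1"
}
```

It could be called as: extract_all_links input.txt > links.txt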
> If you are using gawk without compatibility mode, then you might try this:
>
> match($0, /href="(\/watch\?v=[^"]+)"/, matchedParts)
> print matchedParts[1]
> # again assuming only one link per line; otherwise you will have to call
> # match repeatedly on the rest of the line
>
>
> With Python's re module or with Perl you can directly print the part in
> parentheses from
> href="(\/watch\?v=[^"]+)"
> using capture groups. That would bring everything down to one line.
> But somehow I prefer awk; using Perl or Python for such small things is a
> waste of time (esp. for one-off scripts like this) ......
>
> So hope this helps..........
>
> --
> Lots o' Luv,
> Phani Bhushan
Thanks a lot. This will surely help me.
Just as a side note, the purpose of the whole exercise is to download
all the videos from a YouTube playlist, convert them into ogv
files, and finally combine them into a single ogv file for the playlist.
Let me code it tonight. Thanks again,


-- 
┌─────────────────────────┐
│    Narendra Sisodiya ( नरेन्द्र सिसोदिया )
│    Web : http://narendra.techfandu.org
│    Twitter : http://tinyurl.com/dz7e4a
└─────────────────────────┘

