On Tue, Nov 3, 2009 at 18:33, narendra sisodiya <[email protected]> wrote:

> I have a file downloaded from YouTube and saved locally. Now I want to
> extract the links from it. The links inside the file look like this:
>
> href="/watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=19"
> href="/watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=17"
>
> etc.
> When I grep for -->  href="/watch <--
> I get the matching lines, but the grep output also contains a lot of data
> that I do not need.
> I want all strings (not the whole line) that have the format
>          href="/watch?v= ************** "
> and I want to store all such links in another text file. I have seen
> some tutorials on regular expressions, but I think it is better to ask, as I
> am making some basic mistake in forming the regular expression.
>
> I have also attached the input.txt file.
>
> Output file must be like this.
>
> /watch?v=xyCKsE8D68Q&feature=PlayList&p=06D0D25CEA35E441&index=1
> /watch?v=m3gMgK7h-BA&feature=PlayList&p=06D0D25CEA35E441&index=2
> /watch?v=vjhaSMqmqTo&feature=PlayList&p=06D0D25CEA35E441&index=3
> .....
> and so on.
>
>

I think this should do:

First the regex:

href="\/watch\?v=[^"]+

href="\/watch\?v=    matches the literal text href="/watch?v=
[^"]+                one or more characters that are not a double quote (")

Now the implementation:

gawk -F'"' '/href="\/watch\?v=[^"]+"/ {
    for (i = 1; i <= NF; i++)
        if ($i ~ /^\/watch\?v=/) print $i
}' <filename>

I don't know of a way to print just the matched part directly in awk.
There is a match(string, regex) function, but it is somewhat restricted: it
reports only the first match (its position in RSTART and its length in
RLENGTH). To get all the links in a line, one has to call match() again on
the rest of the line until the regex is no longer found (match() returns
zero).
If you are sure that there is only one link per line (i.e., up to \n), then
you can do this:

if (match($0, /href="\/watch\?v=[^"]+"/)) print substr($0, RSTART, RLENGTH)
This prints the whole match, including href=" and the closing quote. To get
the output you want, use RSTART + 6 and RLENGTH - 7 instead.
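
For lines that may hold several links, the repeated match() call described
above can be sketched as a loop. This is only a minimal sketch: the function
name extract_links is made up for illustration, and it assumes every link is
written exactly as href="/watch?v=..." with double quotes.

```shell
# extract_links is a hypothetical helper name. It reads the files given as
# arguments, or standard input when none are given, and prints every
# /watch?v=... link on its own line.
extract_links() {
    awk '{
        line = $0
        # match() reports only the leftmost match (RSTART, RLENGTH), so
        # print it, cut the line after it, and search the remainder again.
        while (match(line, /href="\/watch\?v=[^"]+"/)) {
            # Skip the 6 characters of href=" and drop the closing quote.
            print substr(line, RSTART + 6, RLENGTH - 7)
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}

# Example: two links on a single input line.
printf '%s\n' '<a href="/watch?v=abc&index=1">x</a> <a href="/watch?v=def&index=2">y</a>' | extract_links
```

With the attached file this could be run as extract_links input.txt > links.txt
(links.txt is just an example name for the output file).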

If you are using gawk without compatibility mode, you can try this:

match($0, /href="(\/watch\?v=[^"]+)"/, matchedParts)
print matchedParts[1]
# Again this assumes only one link per line; the array holds only the groups
# of the first match, so for several links you still have to loop over the line.


Using Python's re module or Perl, you can directly print the part in
parentheses from
href="(\/watch\?v=[^"]+)"
using capture groups. That would bring everything down to one line.
But somehow I prefer awk; using Perl or Python for such small things feels
like a waste of time (especially for one-off scripts like this).

Hope this helps.

-- 
Lots o' Luv,
Phani Bhushan

Let not your sense of morals prevent you from doing what is right - Isaac
Asimov (Salvor Hardin in Foundation and Empire)
