This might be better done on the command line.

$ grep -Po '(?<=href=")[^"]+' [file name]

This will give you the content of every href attribute in the file, and 
nothing else. Just a list of URLs.
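
The -P option is not available in every grep; the BSD grep on newer 
versions of OS X, for one, reportedly rejects it. If that is your situation, 
something like the following should give the same list using only -E, -o 
and sed. This is a rough sketch: the function name extract_hrefs is made up 
for this example, and the sample markup is taken from the question.

```shell
# Hypothetical helper: print the value of every double-quoted href found in
# the named files (or on standard input when no file is given).
# -E extended regex, -o print only the matched text, -h omit file names.
extract_hrefs() {
  grep -Eoh 'href="[^"]+"' "$@" | sed -e 's/^href="//' -e 's/"$//'
}

# Demo on markup like that in the question, fed on standard input.
extract_hrefs <<'EOF'
<ul>
<li><a href="http://www.youtube.com" class="youtube">YouTube</a></li>
<li><a href="http://www.facebook.com" class="facebook">Facebook</a></li>
</ul>
EOF
# prints:
#   http://www.youtube.com
#   http://www.facebook.com
```

To run it on real files, pass their names instead of the here-document, 
for example: extract_hrefs index.html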

If there are any URLs you want to exclude, such as mailto:, javascript: or 
anchors (e.g. href="#name"), you can be more specific.

$ grep -Po '(?<=href=")(/|http)[^"]+' [file name]

This will match only hrefs that start with "http" (which covers "https" 
too) or with "/" (that is, a link relative to the document root).
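
Note that the pattern above is a whitelist, so it also drops links that are 
relative to the current page (href="page.html", say). If you would rather 
keep everything except the kinds of links mentioned earlier, one option is 
to filter the list afterwards. A minimal sketch, assuming one href per line 
on standard input; the name drop_unwanted and the sample hrefs are made up 
for illustration.

```shell
# Hypothetical filter: reads one href per line on standard input and drops
# mailto: links, javascript: links, and same-page anchors (#name).
# -v inverts the match, -i ignores case (so MAILTO: is dropped as well).
drop_unwanted() {
  grep -Eiv '^(mailto:|javascript:|#)'
}

# Demo with a made-up list of hrefs.
printf '%s\n' \
  'http://www.example.com/' \
  '/about.html' \
  'contact.html' \
  'mailto:someone@example.com' \
  'javascript:void(0)' \
  '#top' | drop_unwanted
# prints:
#   http://www.example.com/
#   /about.html
#   contact.html
```

It can sit at the end of the first command: pipe the output of 
grep -Po '(?<=href=")[^"]+' [file name] into drop_unwanted.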

If you want to extract the hrefs from every HTML file in a directory and 
put them into a file, just do this:

$ cd [directory containing the HTML files]
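$ grep -Poh '(?<=href=")[^"]+' *.htm* > my_hrefs.txt

This writes the contents of all the href attributes in all the files whose 
names contain ".htm" (so both .html and .htm) into a new file called 
my_hrefs.txt. The -h option keeps grep from prefixing each match with the 
name of the file it came from, which it otherwise does when given more than 
one file. To append to an existing file, use >> instead of >.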
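
If the HTML files are spread over subdirectories, or the same URL turns up 
in many of them, a variation along these lines may help. It is only a 
sketch: collect_hrefs is a made-up name, it uses grep -E and sed rather than 
-P so that it does not depend on PCRE support, and it assumes the hrefs are 
double-quoted.

```shell
# Hypothetical helper: collect_hrefs DIR OUTFILE
# Searches DIR and its subdirectories for *.html and *.htm files, extracts
# every double-quoted href value, and writes the sorted, de-duplicated list
# to OUTFILE.
collect_hrefs() {
  find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
    -exec grep -Eoh 'href="[^"]+"' {} + |
    sed -e 's/^href="//' -e 's/"$//' |
    sort -u > "$2"
}
```

For example, collect_hrefs . my_hrefs.txt run from the top-level directory 
would produce one my_hrefs.txt covering everything beneath it.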

On Friday, March 1, 2013 6:38:29 PM UTC-5, Nick wrote:
>
> Hi,
>
> I need to extract the URLs from a large number of HTML files. Basically, 
> take something like this:
>
> <ul>
>
> <li><a href="http://www.youtube.com" class="youtube">YouTube</a></li>
> <li><a href="http://www.facebook.com" class="facebook">Facebook</a></li>
> <li><a href="http://www.twitter.com" class="twitter">Twitter</a></li>
> </ul>
>
> And output this:
> http://www.youtube.com
> http://www.facebook.com
> http://www.twitter.com
>
> Any suggestions on how to best approach this?
>
> thanks!
>

