This might be better done on the command line.

$ grep -Po '(?<=href=")[^"]+' [file name]

This will give you the content of every href attribute in the file, and nothing else. Just a list of URLs.

If there are any URLs you want to exclude, such as mailto:, javascript: or anchors (e.g. href="#name"), you can be more specific.

$ grep -Po '(?<=href=")(/|http)[^"]+' [file name]

This will match only hrefs that start with "http", "https", or "/" (a link relative to the document root).

If you want to extract the hrefs from every HTML file in a directory and put them into a file, just do this:

$ cd [directory containing the HTML files]
$ grep -Poh '(?<=href=")[^"]+' *.htm* > my_hrefs.txt

This writes the contents of all the href attributes in all the files ending in .html (or .htm) into a new file called my_hrefs.txt. The -h option stops grep from prefixing each line with the file name, which it otherwise does when it is given more than one file. To append to an existing file, use >> instead of >.

On Friday, March 1, 2013 6:38:29 PM UTC-5, Nick wrote:
> Hi,
>
> I need to extract the URLs from a large number of HTML files. Basically,
> take something like this:
>
> <ul>
> <li><a href="http://www.youtube.com" class="youtube">YouTube</a></li>
> <li><a href="http://www.facebook.com" class="facebook">Facebook</a></li>
> <li><a href="http://www.twitter.com" class="twitter">Twitter</a></li>
> </ul>
>
> And output this:
> http://www.youtube.com
> http://www.facebook.com
> http://www.twitter.com
>
> Any suggestions on how to best approach this?
>
> thanks!
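
For anyone who wants to try the commands from the reply before pointing them at real files, here is a rough sketch that runs them against the sample list from the question. The temporary directory, the file names sample.html and other.htm, and the contents of other.htm are all made up for illustration. It assumes a grep that supports -P (Perl-compatible regular expressions), such as GNU grep; other greps may not have that option.

```shell
#!/bin/sh
# Sketch only: the file names and the second test file are hypothetical.
# Assumes a grep with -P (PCRE) support, e.g. GNU grep.
tmpdir=$(mktemp -d)

# The sample list from the original question.
cat > "$tmpdir/sample.html" <<'EOF'
<ul>
<li><a href="http://www.youtube.com" class="youtube">YouTube</a></li>
<li><a href="http://www.facebook.com" class="facebook">Facebook</a></li>
<li><a href="http://www.twitter.com" class="twitter">Twitter</a></li>
</ul>
EOF

# A second, made-up file with links the stricter pattern should skip.
cat > "$tmpdir/other.htm" <<'EOF'
<p><a href="mailto:someone@example.com">Mail</a> <a href="#top">Top</a> <a href="/about">About</a></p>
EOF

# Every href value in a single file.
all_hrefs=$(grep -Po '(?<=href=")[^"]+' "$tmpdir/sample.html")
printf '%s\n' "$all_hrefs"

# Only hrefs starting with "/" or "http", across both files.
# -h suppresses the "file name:" prefix grep adds for multiple files.
filtered=$(cd "$tmpdir" && grep -Poh '(?<=href=")(/|http)[^"]+' *.htm*)
printf '%s\n' "$filtered"

rm -rf "$tmpdir"
```

The first command should list the three URLs from the question, one per line. The second should add /about from the made-up file while leaving out the mailto: and #top links.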
