On Dec 20, 7:14 pm, "Littlefield, Tyler" <ty...@tysdomain.com> wrote:
> Hello all,
> I have a question. I guess this worked pre-2.6; I don't remember the
> last time I used it, but it was a while ago, and now it's failing.
> Anyone mind looking at it and telling me what's going wrong? Also, is
> there a quick way to match on a certain site, like links from google.com,
> and only output those?
>
> #!/usr/bin/env python
>
> #This program is free software: you can redistribute it and/or modify it
> #under the terms of the GNU General Public License as published by the
> #Free Software Foundation, either version 3 of the License, or (at your
> #option) any later version.
> #
> #This program is distributed in the hope that it will be useful, but
> #WITHOUT ANY WARRANTY; without even the implied warranty of
> #MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> #General Public License for more details.
> #
> #You should have received a copy of the GNU General Public License along
> #with this program. If not, see http://www.gnu.org/licenses/.
>
> """
> This script will parse out all the links in an HTML document and write
> them to a text file.
> """
> import sys, optparse
> import htmllib, formatter
>
> #program class declarations:
> class Links(htmllib.HTMLParser):
>     def __init__(self, formatter):
>         htmllib.HTMLParser.__init__(self, formatter)
>         self.links = []
>
>     def start_a(self, attrs):
>         if len(attrs) > 0:
>             for a in attrs:
>                 if a[0] == "href":
>                     self.links.append(a[1])
>                     print a[1]
>                     break
>
> def main(argv):
>     if len(argv) != 3:
>         print("Error:\n" + argv[0] + " <input> <output>.\n"
>               "Parses <input> for all links and saves them to <output>.")
>         return 1
>     lcount = 0
>     format = formatter.NullFormatter()
>     html = Links(format)
>     print "Retrieving data:"
>     page = open(argv[1], "r")
>     print "Feeding data to parser:"
>     html.feed(page.read())
>     page.close()
>     print "Writing links:"
>     output = open(argv[2], "w")
>     for i in html.links:
>         output.write(i + "\n")
>         lcount += 1
>     output.close()
>     print("Wrote " + str(lcount) + " links to " + argv[2] + ".")
>     print("done.")
>
> if __name__ == "__main__":
>     #we call main with the list of args and exit with the return code
>     #passed back.
>     sys.exit(main(sys.argv))
>
> --
>
> Thanks,
> Ty
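The likely cause of the failure is that `htmllib` and `formatter` were deprecated in Python 2.6 and removed entirely in Python 3, and the script imports both. As a sketch (not the original poster's code), the same link collector can be rebuilt on the stdlib `html.parser` module, which is still present in Python 3; the class and method names below are my own choices:

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)
                    break


parser = LinkParser()
parser.feed('<a href="http://example.com">x</a> <a href="/local">y</a>')
print(parser.links)  # both absolute and relative hrefs are collected
```

Unlike `htmllib.HTMLParser`, `html.parser.HTMLParser` takes no formatter argument and dispatches all start tags through `handle_starttag`, so the tag name has to be checked explicitly rather than defining a `start_a` method.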
This doesn't answer your original question, but excluding the command-line handling, how does this do for you?

import lxml.html
from urlparse import urlsplit

doc = lxml.html.parse('http://www.google.com')
print map(urlsplit, doc.xpath('//a/@href'))

[SplitResult(scheme='http', netloc='www.google.co.uk', path='/imghp',
             query='hl=en&tab=wi', fragment=''),
 SplitResult(scheme='http', netloc='video.google.co.uk', path='/',
             query='hl=en&tab=wv', fragment=''),
 SplitResult(scheme='http', netloc='maps.google.co.uk', path='/maps',
             query='hl=en&tab=wl', fragment=''),
 SplitResult(scheme='http', netloc='news.google.co.uk', path='/nwshp',
             query='hl=en&tab=wn', fragment=''),
 ...]

(Note the import: `import lxml` alone is not enough, since `lxml.html` is a submodule and must be imported explicitly.)

Much nicer IMHO, plus lxml.html has iterlinks() and other convenience
functions for handling HTML.

hth
Jon.
--
http://mail.python.org/mailman/listinfo/python-list
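On the "only output links from google.com" part of the original question: once the hrefs are collected, they can be filtered by host with the stdlib URL parser (`urlparse.urlsplit` in Python 2, `urllib.parse.urlsplit` in Python 3). A hedged sketch, with the helper name `from_site` and the sample data being my own:

```python
from urllib.parse import urlsplit

# Hypothetical sample of collected hrefs, mixing hosts and a relative link.
links = [
    "http://www.google.com/imghp",
    "http://maps.google.com/maps",
    "http://example.org/page",
    "/relative/link",
]


def from_site(links, domain):
    """Keep absolute links whose host is `domain` or a subdomain of it."""
    keep = []
    for link in links:
        host = urlsplit(link).netloc  # empty string for relative links
        if host == domain or host.endswith("." + domain):
            keep.append(link)
    return keep


print(from_site(links, "google.com"))
# the two google.com links survive; example.org and the relative link do not
```

Matching on `netloc` rather than substring-searching the whole URL avoids false positives such as `http://evil.example/google.com/page`.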