http://homepages.inf.ed.ac.uk/wadler/language.pdf
I think sam is a much safer bet than some hideous lib that pretends to be
capable of parsing (pseudo)HTML.

Years ago some people tried to write a web browser in Python... some years
later they gave up; all they had produced was a spec for an XML format to
store bookmarks.

Quoting boyd: "hysterical."

uriel

> On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
>> So I thought, but something's not right.  I can't demonstrate more
>> until I get to work in the morning.
>
> Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them.  Here's a script I posted to
> USENET years ago to extract data from a table.
>
> #!/usr/local/bin/python
>
> import sys
> import htmllib
> import formatter
>
> class MyParser(htmllib.HTMLParser):
>     def __init__(self, format):
>         htmllib.HTMLParser.__init__(self, format)
>         self.state = 0
>
>     def do_tr(self, data):
>         if self.state:
>             print htmllib.HTMLParser.save_end(self)
>         self.state = 0
>
>     def do_td(self, data):
>         if self.state:
>             print "%s, " % htmllib.HTMLParser.save_end(self),
>         self.state = 1
>         htmllib.HTMLParser.save_bgn(self)
>
> parse = MyParser(formatter.NullFormatter())
> for file in sys.argv[1:]:
>     parse.feed(open(file, "r").read())
> parse.close()
>
> I wonder if this even still works.....
>
> - Dan C.
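
On the "does it still work" question: htmllib was removed in Python 3, so
the quoted script will not run on a current interpreter. A rough equivalent
can be sketched with the standard-library html.parser module. This is only
a sketch, not Dan's method: the names TableParser and extract_rows are made
up for illustration, and it assumes simple, non-nested tables where closing
tags may be missing.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _end_cell(self):
        # Close the open cell, if any, collapsing runs of whitespace.
        if self._cell is not None:
            if self._row is None:
                self._row = []
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        # Close the open row, if any; rows with no cells are dropped.
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()  # tolerate a missing </td>
            if self._row is None:
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        # Text outside a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    parser._end_row()  # flush a row left open at end of input
    return parser.rows
```

To get output in the shape of the quoted script, read each file named on the
command line, pass its contents to extract_rows, and print ", ".join(row)
for every row returned.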
