http://homepages.inf.ed.ac.uk/wadler/language.pdf

I think sam is a much safer bet than some hideous lib that pretends to
be capable of parsing (pseudo)HTML.

Years ago some people tried to write a web browser in Python.  Some
years later they gave up; all they had produced was a spec for an XML
format to store bookmarks.  Quoting boyd: "hysterical."

uriel

> On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
>> So I thought, but something's not right.  I can't demonstrate more  
>> until I get to work in the morning.
> 
> Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
> sed or sam, but instead, use a language with an HTML parser available.
> There are some jobs for which regular expressions aren't the best tool;
> I personally think this is one of them.  Here's a script I posted to
> USENET years ago to extract data from a table.
> 
> #!/usr/local/bin/python
> 
> import sys
> import htmllib
> import formatter
> 
> class MyParser(htmllib.HTMLParser):
>         def __init__(self, format):
>                 htmllib.HTMLParser.__init__(self, format)
>                 self.state = 0
> 
>         def do_tr(self, data):
>                 if self.state:
>                         print htmllib.HTMLParser.save_end(self)
>                         self.state = 0
> 
>         def do_td(self, data):
>                 if self.state:
>                         print "%s, " % htmllib.HTMLParser.save_end(self),
>                 self.state = 1
>                 htmllib.HTMLParser.save_bgn(self)
> 
> parse = MyParser(formatter.NullFormatter())
> for file in sys.argv[1:]:
>         parse.feed(open(file,"r").read())
> parse.close()
> 
> I wonder if this even still works.....
> 
>       - Dan C.
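
On the "does this still work" question: htmllib was removed in Python 3,
so the quoted script runs only under Python 2.  Below is a minimal sketch
of the same idea (print each table row as its cell texts) using the
standard library's html.parser instead.  The names TableParser and
table_rows are made up for illustration and are not from the original
post; nested tables are not handled, and unclosed <td>/<tr> tags are only
tolerated in the simple case where the next cell or row begins.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; assumes no nested tables.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text chunks of the cell being read, or None

    def _end_cell(self):
        # Finish the open cell, collapsing runs of whitespace.
        if self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        # Finish the open cell (if any), then the open row (if any).
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()          # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._end_cell()         # tolerate a missing </td>
            if self._row is None:    # cell with no enclosing <tr>
                self._row = []
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        # Text outside any cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._end_row()              # flush a row left open at end of input


def table_rows(html):
    """Return the table contents of an HTML string as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

To mimic the original script's output, something like
`for row in table_rows(open(name).read()): print(", ".join(row))`
over each file name in sys.argv[1:] should do.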
