Re: [9fans] More 'Sam I am'

Dan Cross Wed, 08 Feb 2006 09:35:59 -0800

On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
> So I thought, but something's not right.  I can't demonstrate more  
> until I get to work in the morning.


Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
sed or sam, but instead, use a language with an HTML parser available.
There are some jobs for which regular expressions aren't the best tool;
I personally think this is one of them.  Here's a script I posted to
USENET years ago to extract data from a table.

#!/usr/local/bin/python

import sys
import htmllib
import formatter

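class MyParser(htmllib.HTMLParser):
        def __init__(self, format):
                htmllib.HTMLParser.__init__(self, format)
                # state is 1 while the text of a <td> cell is being saved
                self.state = 0

        def do_tr(self, data):
                # New row: flush the last cell of the previous row
                # and end the output line.
                if self.state:
                        print self.save_end()
                        self.state = 0

        def do_td(self, data):
                # New cell: flush the previous cell of this row, if any,
                # followed by a comma, then start saving the new cell.
                if self.state:
                        print "%s, " % self.save_end(),
                self.state = 1
                self.save_bgn()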

parse = MyParser(formatter.NullFormatter())
for name in sys.argv[1:]:
        # Read each file named on the command line, not just the first.
        parse.feed(open(name, "r").read())
parse.close()

I wonder if this even still works.....
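
On a current interpreter it will not: htmllib was removed in Python 3.
Here is a rough sketch of the same idea using html.parser from the
standard library.  The names TableParser and rows_from_html are made up
for illustration and are not from the original script, and it assumes a
single, non-nested table whose cells may be missing their closing tags.

```python
#!/usr/bin/env python3
# Sketch only: TableParser and rows_from_html are hypothetical names,
# not part of the original script.
import sys
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self.row = None     # cells of the row being read, or None
        self.cell = None    # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.end_row()          # tolerate a missing </tr>
            self.row = []
        elif tag in ("td", "th"):
            self.end_cell()         # tolerate a missing </td>
            if self.row is None:
                self.row = []
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.end_cell()
        elif tag in ("tr", "table"):
            self.end_row()

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def end_cell(self):
        # Close the open cell, collapsing runs of whitespace.
        if self.cell is not None:
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None

    def end_row(self):
        self.end_cell()
        if self.row:
            self.rows.append(self.row)
        self.row = None

    def close(self):
        super().close()
        self.end_row()              # flush a row left open at end of input

def rows_from_html(text):
    parser = TableParser()
    parser.feed(text)
    parser.close()
    return parser.rows

if __name__ == "__main__":
    for name in sys.argv[1:]:
        with open(name, "r") as f:
            for row in rows_from_html(f.read()):
                print(", ".join(row))
```

Run it with one or more HTML files as arguments and it prints one
comma-separated line per table row.  Unlike the original, it flushes the
last cell at end of input, so the final row is not lost when the markup
simply stops.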

        - Dan C.
