On Tue, Feb 07, 2006 at 10:50:22PM -0800, Lyndon Nerenberg wrote:
> So I thought, but something's not right. I can't demonstrate more
> until I get to work in the morning.
Hmm. I'm going to make an unpopular but pragmatic suggestion: don't use
sed or sam; instead, use a language with an HTML parser available.
There are some jobs for which regular expressions aren't the best tool,
and I personally think this is one of them. Here's a script I posted to
USENET years ago to extract data from a table:
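#!/usr/local/bin/python
# Python 2 only: htmllib was removed in Python 3.
import sys
import htmllib
import formatter

class MyParser(htmllib.HTMLParser):
    def __init__(self, format):
        htmllib.HTMLParser.__init__(self, format)
        self.state = 0      # 1 while a cell's text is being saved

    def do_tr(self, data):
        # A new row starts: print the previous row's last cell, if any.
        if self.state:
            print htmllib.HTMLParser.save_end(self)
            self.state = 0

    def do_td(self, data):
        # A new cell starts: print the previous cell, then start saving.
        if self.state:
            print "%s, " % htmllib.HTMLParser.save_end(self),
        self.state = 1
        htmllib.HTMLParser.save_bgn(self)

parse = MyParser(formatter.NullFormatter())
for file in sys.argv[1:]:
    # Read each file named on the command line, not just the first.
    parse.feed(open(file, "r").read())
parse.close()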
I wonder if this even still works.....
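
For anyone without a Python 2 around: htmllib is gone in Python 3, so
the script above won't run there. Below is a rough sketch of the same
idea on top of html.parser from the standard library. The names
TableParser and table_rows are made up for illustration, and it assumes
a simple table with no nesting; it is not a drop-in replacement for the
script above.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self.row = None     # cells of the row being read, or None
        self.cell = None    # text chunks of the cell being read, or None

    def _end_cell(self):
        if self.cell is not None:
            # Join the chunks and collapse runs of whitespace.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None

    def _end_row(self):
        self._end_cell()
        if self.row:
            self.rows.append(self.row)
        self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # Close tags are often omitted, so a new <tr> ends the old row.
            self._end_row()
            self.row = []
        elif tag == "td":
            self._end_cell()
            if self.row is None:
                self.row = []
            self.cell = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def close(self):
        super().close()
        self._end_row()     # flush a row left open at end of input


def table_rows(html):
    """Return the table cells found in html as a list of rows."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

To get the same comma-separated output as the old script, read each
file, pass its text to table_rows(), and print ", ".join(row) for each
row returned.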
- Dan C.