Hmm.  I'm going to make an unpopular but pragmatic suggestion: Don't use
sed or sam, but instead, use a language with an HTML parser available.
There are some jobs for which regular expressions aren't the best tool;
I personally think this is one of them.  Here's a script I posted to
USENET years ago to extract data from a table.

The problem with this is that the data I want is interspersed with data that I don't want. And the bits I don't want are variable-length, inconsistent, multi-line text that is a bitch to filter out of the rendered output stream. It turns out that sam (against the raw HTML) was the only tool able to do the job. I just wish I could wrap it in a shell script that I could throw at the directory containing all the .html files.
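
For illustration only, here is a minimal Python sketch of the parser-based approach suggested in the quoted text, run against the raw HTML rather than the rendered output and applied to every .html file in a directory. It is not the USENET script mentioned above. The names TableCells and extract_rows are made up for this example, and it assumes the wanted data sits in plain, non-nested <td>/<th> cells, which may well not hold for the pages in question.

```python
import sys
from html.parser import HTMLParser
from pathlib import Path


class TableCells(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for this sketch; nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell is simply ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse multi-line cell text into a single line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html_text):
    """Return the table rows found in html_text as lists of cell strings."""
    parser = TableCells()
    parser.feed(html_text)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    # Usage (assumed): python extract.py /path/to/html/dir
    directory = Path(sys.argv[1] if len(sys.argv) > 1 else ".")
    for path in sorted(directory.glob("*.html")):
        for row in extract_rows(path.read_text(errors="replace")):
            print(path.name, *row, sep="\t")
```

Because only text inside table cells is kept, free-form text between the cells is dropped no matter how many lines it spans. Whether that is enough depends on where the unwanted text actually lives in the markup.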

--lyndon
