In article <[EMAIL PROTECTED]> you write:
>On Wed, Feb 08, 2006 at 10:14:53AM -0800, Lyndon Nerenberg wrote:
>> The problem with this is the data I want is interspersed with data that
>> I don't want. And the bits I don't want are variable length
>> inconsistent multi-line text that is a bitch to filter out of the
>> rendered output stream. It turns out that sam (against the raw HTML)
>> was the only tool that was able to do the job. I just wish I could wrap
>> it in a shell script that I could throw at the directory containing all
>> the .html files.
>
>I'm not talking about rendering, just parsing. Well, ultimately,
>what's important is that you get what you need out of the solution, I
>guess. Still, regular expressions alone give you part of the story,
>but not the whole thing. I submit that the power to actually parse
>the tokens in the data as opposed to just matching them (even if the
>regular expression language you're using is powerful enough to match
>the structure of the document) is more powerful. But hey, if sam
>floats your boat, fish on that river!
>
> - Dan C.
Possibly of interest is the xmlgawk project:
http://www.sourceforge.net/projects/xmlgawk
This is an extended version of GNU Awk with an XML parser module add-on.
The idea is that instead of reading lines, you get XML tokens (tags, fields
in the tags, and marked-up data). I am not directly involved in it, but
it looks like a rather promising alternative for people who would like
to process XML-type data in the more traditional Unixy fashion.
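
To make the "tokens instead of lines" idea concrete, here is a rough sketch.
It is NOT xmlgawk's interface; it swaps in Python's standard html.parser
module purely as an illustration, and the names TokenCollector, tokenize and
text_inside are made up for this example. The assumption is that the wanted
data sits inside one known tag and everything else is noise.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Hypothetical helper: records markup as a flat list of tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.tokens.append(("data", text, {}))


def tokenize(markup):
    """Return (kind, value, attributes) tokens for a markup string."""
    parser = TokenCollector()
    parser.feed(markup)
    parser.close()
    return parser.tokens


def text_inside(markup, wanted_tag):
    """Keep only the character data that occurs inside wanted_tag."""
    depth = 0
    kept = []
    for kind, value, _attrs in tokenize(markup):
        if kind == "start" and value == wanted_tag:
            depth += 1
        elif kind == "end" and value == wanted_tag:
            depth = max(depth - 1, 0)
        elif kind == "data" and depth > 0:
            kept.append(value)
    return kept


if __name__ == "__main__":
    sample = (
        "<html><body><p>noise\nmore noise</p>"
        '<td class="keep">wanted</td><p>junk</p>'
        "<td>also wanted</td></body></html>"
    )
    print(text_inside(sample, "td"))
```

The multi-line noise never has to be matched at all: it is simply data that
shows up while we are outside the tag of interest, so it is skipped.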
Arnold
--
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd. arnold AT skeeve DOT com
P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 206 350 8765
Nof Ayalon Cell Phone: +972 50 729-7545
D.N. Shimshon 99785 ISRAEL