In article <[EMAIL PROTECTED]> you write:
>On Wed, Feb 08, 2006 at 10:14:53AM -0800, Lyndon Nerenberg wrote:
>> The problem with this is the data I want is interspersed with data that 
>> I don't want.  And the bits I don't want are variable length 
>> inconsistent multi-line text that is a bitch to filter out of the 
>> rendered output stream.  It turns out that sam (against the raw HTML) 
>> was the only tool that was able to do the job.  I just wish I could wrap 
>> it in a shell script that I could throw at the directory containing all 
>> the .html files.
>
>I'm not talking about rendering, just parsing.  Well, ultimately,
>what's important is that you get what you need out of the solution, I
>guess.  Still, regular expressions alone give you part of the story,
>but not the whole thing.  I submit that the power to actually parse
>the tokens in the data as opposed to just matching them (even if the
>regular expression language you're using is powerful enough to match
>the structure of the document) is more powerful.  But hey, if sam
>floats your boat, fish on that river!
>
>       - Dan C.

Possibly of interest is the xmlgawk project:

        http://www.sourceforge.net/projects/xmlgawk

This is an extended version of GNU Awk with an XML parser module add-on.
The idea is that instead of reading lines, you get XML tokens (tags, fields
in the tags, and marked-up data).  I am not directly involved in it, but
it looks like a rather promising alternative for people who would like
to process XML-type data in the more traditional Unixy fashion.
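
To make the token-at-a-time idea concrete, here is a rough sketch in
Python using the standard expat bindings.  This is not xmlgawk syntax
and says nothing about how xmlgawk is implemented; the function name
xml_tokens, the token layout, and the sample document are all made up
for illustration.

```python
import xml.parsers.expat


def xml_tokens(text):
    """Parse an XML string into a flat list of (kind, value, attrs) tokens.

    kind is "start", "end", or "data".  This tuple layout is a hypothetical
    stand-in for the "records" a token-oriented tool would hand to a script.
    """
    tokens = []
    parser = xml.parsers.expat.ParserCreate()
    # Ask expat to coalesce adjacent character data into one callback.
    parser.buffer_text = True

    def on_start(name, attrs):
        tokens.append(("start", name, dict(attrs)))

    def on_end(name):
        tokens.append(("end", name, {}))

    def on_data(data):
        # Skip whitespace-only text between tags.
        if data.strip():
            tokens.append(("data", data.strip(), {}))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.CharacterDataHandler = on_data
    parser.Parse(text, True)
    return tokens


# Hypothetical sample input, not taken from the thread.
SAMPLE = (
    '<books>'
    '<book id="1"><title>First</title><note>skip me</note></book>'
    '<book id="2"><title>Second</title></book>'
    '</books>'
)

tokens = xml_tokens(SAMPLE)

# Awk-style "pattern { action }": keep the text that follows a <title> tag.
titles = [
    cur[1]
    for prev, cur in zip(tokens, tokens[1:])
    if prev[:2] == ("start", "title") and cur[0] == "data"
]
print(titles)
```

The point is only that the script's pattern tests tokens rather than
raw lines, so unwanted multi-line text between the tags never has to be
matched by a regular expression.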

Arnold
-- 
Aharon (Arnold) Robbins --- Pioneer Consulting Ltd.     arnold AT skeeve DOT com
P.O. Box 354            Home Phone: +972  8 979-0381    Fax: +1 206 350 8765
Nof Ayalon              Cell Phone: +972 50  729-7545
D.N. Shimshon 99785     ISRAEL
