Brent Baccala <[EMAIL PROTECTED]> writes:

> I've got a set of scripts that alter HTML content (expected to be in
> Spanish) by adding a link to every word that triggers a lookup in a
> Spanish/English dictionary.  I use HTML::Parser.
> 
> Anyway, I've come across some documents that don't parse right.  They
> appear to have been generated by Microsoft Office, and include tags like
> this:
> 
> <![if !supportEmptyParas]>&nbsp;<![endif]>
> 
> The "if" and "supportEmptyParas" end up getting flagged as text, even if
> I've called marked_sections(1).

This stuff does not follow the marked_sections syntax so I'm not
surprised.  As a marked section it would have to be expressed
something like:

  <![ %supportEmptyParas; [ &nbsp; ]]>

where the parameter entity %supportEmptyParas; expands to either
"IGNORE" or "INCLUDE".

I don't know SGML well enough to tell if this is something worth
supporting or if this stuff is valid SGML at all.  Does anybody else
know?

A simple hack to avoid this stuff might be to run something like
s/<!\[(?:if|endif)\b.*?\]>//g on the text before feeding it to
HTML::Parser.
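
A minimal sketch of that pre-filter, using only core Perl.  The helper
name strip_office_conditionals is hypothetical and not part of
HTML::Parser, and the pattern puts the "[" directly after "<!" so that
it matches the markers exactly as they appear in the quoted sample.
It only removes the markers themselves; whatever sits between them is
kept.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): strip Office-style
# conditional markers such as <![if !supportEmptyParas]> and <![endif]>,
# leaving the content between them untouched.
sub strip_office_conditionals {
    my ($html) = @_;
    # "<![" followed by "if" or "endif", then anything up to the closing "]>".
    $html =~ s/<!\[(?:if|endif)\b[^\]>]*\]>//gi;
    return $html;
}

my $sample = '<p><![if !supportEmptyParas]>&nbsp;<![endif]></p>';
print strip_office_conditionals($sample), "\n";   # prints <p>&nbsp;</p>
```

The cleaned string can then be handed to the parser as usual.  Standard
marked sections such as <![INCLUDE[ ... ]]> are left alone, since
their keyword is neither "if" nor "endif".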

> Since I don't really know SGML, I'm not sure how this should be handled,
> or even if it can be handled without having the Microsoft schema (which
> I can't find) available to be parsed.  Anyway, I thought I'd let you
> know.  The URL of the original document is:
> 
>       http://www.sgci.mec.es/uk/Pub/Tecla/2001/julio2b.htm
> 
> and the page for my scripts is:
> 
>       http://vyger.freesoft.org/software/spanish
> 
> Thanks for your work with HTML::Parser, it's made this script fairly
> easy to write.

Good to hear!

Regards,
Gisle
