Brent Baccala <[EMAIL PROTECTED]> writes:

> I've got a set of scripts that alter HTML content (expected to be in
> Spanish) by adding a link to every word that triggers a lookup in a
> Spanish/English dictionary. I use HTML::Parser.
>
> Anyway, I've come across some documents that don't parse right. They
> appear to have been generated by Microsoft Office, and include tags like
> this:
>
>     <![if !supportEmptyParas]> <![endif]>
>
> The "if" and "supportEmptyParas" end up getting flagged as text, even if
> I've called marked_sections(1)

This stuff does not follow the marked section syntax, so I'm not
surprised. As a marked section it would have to be expressed something
like:

    <![ &supportEmptyParas; [ ]]>

where &supportEmptyParas; expands to either "IGNORE" or "INCLUDE".

I don't know SGML well enough to tell if this is something worth
supporting or if this stuff is valid SGML at all. Does anybody else
know?

A simple hack to avoid this stuff might be to run something like

    s/<!\[(?:if|endif)\b.*?\]>//g

on the text before feeding it to HTML::Parser.

> Since I don't really know SGML, I'm not sure how this should be handled,
> or even if it can be handled without having the Microsoft schema (which
> I can't find) available to be parsed. Anyway, I thought I'd let you
> know. The URL of the original document is:
>
>     http://www.sgci.mec.es/uk/Pub/Tecla/2001/julio2b.htm
>
> and the page for my scripts is:
>
>     http://vyger.freesoft.org/software/spanish
>
> Thanks for your work with HTML::Parser, it's made this script fairly
> easy to write.

Good to hear!

Regards,
Gisle
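P.S. The pre-filtering hack mentioned above could be sketched in Perl roughly as follows. This is only an illustration. The helper name strip_office_conditionals is made up for this example and is not part of HTML::Parser. It assumes the Office markers always look like <![if ...]> or <![endif]> and never contain a "]" or ">" inside the condition.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Parser): remove Office-style
# markers such as <![if !supportEmptyParas]> and <![endif]>, keeping
# whatever text sits between them.
sub strip_office_conditionals {
    my ($html) = @_;
    # Match "<![" followed by either "if <condition>" or "endif", then "]>".
    # Ordinary comments such as <!--[if gte mso 9]> ... <![endif]--> are
    # left alone because they do not start with "<![" or end with "]>".
    $html =~ s/<!\[(?:if\b[^\]>]*|endif)\]>//gi;
    return $html;
}

my $sample = '<p><![if !supportEmptyParas]>&nbsp;<![endif]></p>';
print strip_office_conditionals($sample), "\n";   # prints <p>&nbsp;</p>
```

The cleaned string can then be handed to the parser as usual, for example with $p->parse(strip_office_conditionals($text)). Note that this drops only the markers themselves and keeps the content between them, which seems to be what a browser that ignores the markers would show.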
