[Pharo-users] How should XMLHTMLParser handle strange HTML?

PBKResearch Thu, 02 Apr 2020 10:18:06 -0700

Hello


I have come across a problem in using XMLHTMLParser to parse some HTML
files that contain unusual constructions. The input files were generated
by using MS Outlook to convert incoming messages, stored in .msg files,
into HTML. The converted files display normally in Firefox, and
XMLHTMLParser appears to generate a normal parse, but examination of the
parse output shows that the structure is distorted, and about half the
input text has been put into one string node.

After some hunting around, I am convinced that the trouble is caused by
pairs of comment-like tags in the HTML source, of this form:

<![if !supportLists]>

<![endif]>

The distorted parse starts at the first occurrence of one of these tags.

I don't know whether these are meant to be a construct in some programming
language; there is no other reference to supportLists anywhere in the
source. When the file is displayed in Firefox, the 'Inspect Element' option
shows that the browser has treated these tags as comments, displaying them
with the necessary dashes, e.g. <!--[if !supportLists]-->. When I edited the
source to insert the dashes, XMLHTMLParser parsed everything correctly.

I have a workaround, therefore: either edit in the dashes to make these
tags into legitimate comments or, equivalently, edit the tags out
completely. The only question of general interest is whether XMLHTMLParser
should be expected to handle them in some other way, rather than silently
producing a distorted parse. The Firefox approach, turning them into
comments, seems sensible. It would also be interesting to hear whether
anyone knows what these tags are meant to do.
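
To make the workaround concrete, here is a minimal sketch of the
pre-processing step, written in Python purely for illustration (it is not
Pharo code and not part of XMLHTMLParser). The function names
comment_out_conditionals and strip_conditionals are hypothetical, and the
pattern assumes the tags always have exactly the two shapes shown above,
<![if ...]> and <![endif]>. The cleaned-up string would then be handed to
the parser as usual.

```python
import re

# Assumption: the troublesome tags look like <![if ...]> or <![endif]>.
# Forms that already carry dashes, such as <!--[if ...]> or <![endif]-->,
# are deliberately not matched, and neither is <![CDATA[ ... ]]>.
_BARE_CONDITIONAL = re.compile(r"<!\[(if\b[^\]]*|endif)\]>", re.IGNORECASE)


def comment_out_conditionals(html: str) -> str:
    """Add the dashes, turning each bare tag into an ordinary comment."""
    return _BARE_CONDITIONAL.sub(lambda m: "<!--[" + m.group(1) + "]-->", html)


def strip_conditionals(html: str) -> str:
    """Alternative workaround: remove the bare tags altogether."""
    return _BARE_CONDITIONAL.sub("", html)
```

Either function can be applied to the whole file contents before parsing;
the first mirrors what Firefox appears to do, the second simply discards
the tags.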

Thanks for any help

Peter Kenny
