Hi Peter,

Just in case it helps you parse the files: I had to parse HTML with an XMLParser (no XMLHTMLParser), so what I did was to pass it first through HTML Tidy [1], converting it to XHTML, which is compatible with XML parsers (it is XML, after all).

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo

On Thu, Apr 2, 2020 at 2:17 PM PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Hello
>
> I have come across a strange problem in using XMLHTMLParser to parse some
> HTML files which use strange constructions. The input files have been
> generated by using MS Outlook to translate incoming messages, stored in .msg
> files, into HTML. The translated files display normally in Firefox, and the
> XMLHTMLParser appears to generate a normal parse, but examination of the
> parse output shows that the structure is distorted, and about half the input
> text has been put into one string node.
>
> Hunting around, I am convinced that the trouble lies in the presence in the
> HTML source of pairs of comment-like tags, with this form:
>
> <![if !supportLists]>
>
> <![endif]>
>
> since the distorted parse starts at the first occurrence of one of these tags.
>
> I don’t know whether these are meant to be a structure in some programming
> language; there is no reference to supportLists anywhere in the source code.
> When it is displayed in Firefox, use of the ‘Inspect Element’ option shows
> that the browser has treated them as comments, displaying them with the
> necessary dashes as e.g. <!--[if !supportLists]-->. I edited the source code
> by inserting the dashes, and XMLHTMLParser parsed everything correctly.
>
> I have a workaround, therefore; either edit in the dashes to make them into
> legitimate comments, or equivalently edit out these tags completely. The only
> question of general interest is whether XMLHTMLParser should be expected to
> handle these in some other way, rather than produce a distorted parse without
> comment. The Firefox approach, turning them into comments, seems sensible. It
> would also be interesting if anyone has any idea what is going on in the
> source code.
>
> Thanks for any help
>
> Peter Kenny
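
For reference, the workaround described in the quoted message (inserting the dashes, or removing the tags altogether) could be done as a pre-processing step before the HTML reaches the parser. The following is only a rough sketch in Python, not part of XMLHTMLParser or HTML Tidy; the function names are hypothetical, and the pattern assumes the tags appear exactly in the bare form quoted above (`<![if ...]>` and `<![endif]>`).

```python
import re

# Assumption: the problem tags look exactly like <![if !supportLists]> and
# <![endif]>, i.e. "<![" + "if ..." or "endif" + "]>", with no dashes.
# Forms that already carry dashes (e.g. <!--[if gte mso 9]>) do not match.
_BARE_CONDITIONAL = re.compile(r"<!\[(if\b[^\]]*|endif)\]>", re.IGNORECASE)


def comment_out_conditionals(html: str) -> str:
    """Insert the dashes so each bare tag becomes a legitimate comment."""
    return _BARE_CONDITIONAL.sub(r"<!--[\1]-->", html)


def strip_conditionals(html: str) -> str:
    """Alternative workaround: remove the bare tags completely."""
    return _BARE_CONDITIONAL.sub("", html)


sample = "<p><![if !supportLists]><span>1.</span><![endif]>First item</p>"
commented = comment_out_conditionals(sample)
stripped = strip_conditionals(sample)
```

Either result can then be handed to the parser in place of the original text. Turning the tags into comments keeps the markers in the output, which may be useful if the document needs to be compared with the source later; stripping them gives a cleaner tree.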