Re: [Pharo-users] How should XMLHTMLParser handle strange HTML?

Esteban Maringolo Thu, 02 Apr 2020 11:54:54 -0700

Hi Peter,


Just in case it helps you parse the files...

I had to parse HTML with an XMLParser (not XMLHTMLParser), so what I did
was pass it through HTML Tidy [1] first, converting it to XHTML,
which is compatible with XML parsers (it is XML, after all).
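
A rough sketch of that preprocessing step, written in Python only for
illustration. It assumes the tidy command-line tool is installed and on the
PATH. The helper names tidy_command and html_to_xhtml are made up for this
example, and the option set is one reasonable choice, not necessarily the
one originally used.

```python
import subprocess


def tidy_command(html_path):
    """Build the argument list for an HTML Tidy run that emits XHTML."""
    return [
        "tidy",
        "-quiet",    # suppress nonessential output
        "-asxhtml",  # convert the input to well-formed XHTML
        "-numeric",  # output numeric rather than named entities
        "-utf8",     # use UTF-8 for both input and output
        html_path,
    ]


def html_to_xhtml(html_path):
    """Return the XHTML text that Tidy produces for html_path."""
    result = subprocess.run(
        tidy_command(html_path),
        capture_output=True,
        text=True,
        encoding="utf-8",
    )
    # Tidy exits with 1 when it only issued warnings and 2 when it hit errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.stdout
```

The returned string can then be handed to whatever XML parser is in use.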

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo

On Thu, Apr 2, 2020 at 2:17 PM PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Hello
>
>
>
> I have come across a strange problem in using XMLHTMLParser to parse some 
> HTML files which use strange constructions. The input files have been 
> generated by using MS Outlook to translate incoming messages, stored in .msg 
> files, into HTML. The translated files display normally in Firefox, and the 
> XMLHTMLParser appears to generate a normal parse, but examination of the 
> parse output shows that the structure is distorted, and about half the input 
> text has been put into one string node.
>
>
>
> Hunting around, I am convinced that the trouble lies in the presence in the 
> HTML source of pairs of comment-like tags, with this form:
>
> <![if !supportLists]>
>
> <![endif]>
>
> since the distorted parse starts at the first occurrence of one of these tags.
>
>
>
> I don’t know whether these are meant to be a structure in some programming 
> language – there is no reference to supportLists anywhere in the source code. 
> When it is displayed in Firefox, use of the ‘Inspect Element’ option shows 
> that the browser has treated them as comments, displaying them with the 
> necessary dashes as e.g. <!--[if !supportLists]-->. I edited the source code 
> by inserting the dashes, and XMLHTMLParser parsed everything correctly.
>
>
>
> I have a workaround, therefore; either edit in the dashes to make them into 
> legitimate comments, or equivalently edit out these tags completely. The only 
> question of general interest is whether XMLHTMLParser should be expected to
> handle these in some other way, rather than silently produce a distorted
> parse. The Firefox approach, turning them into comments, seems sensible. It
> would also be interesting if anyone has any idea what is going on in the 
> source code.
>
>
>
> Thanks for any help
>
>
>
> Peter Kenny
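
The workaround Peter describes (adding the dashes, or dropping the tags
entirely) could be automated before parsing. The tags look like what
Microsoft documents as "downlevel-revealed" conditional comments. Below is a
minimal sketch in Python, used here only for illustration; the function
names are hypothetical, and the pattern covers only the two forms quoted
above, <![if ...]> and <![endif]>.

```python
import re

# Matches markers such as <![if !supportLists]> and <![endif]>.
# It does not match <![CDATA[ ... ]]> sections, nor markers that are
# already written as comments (<!--[if ...]-->).
CONDITIONAL_TAG = re.compile(r"<!\[(if\b[^\]]*|endif)\]>", re.IGNORECASE)


def comment_out_conditionals(html):
    """Rewrite each marker as a real comment, as Firefox displays it."""
    return CONDITIONAL_TAG.sub(r"<!--[\1]-->", html)


def strip_conditionals(html):
    """Remove the markers entirely, keeping the content between them."""
    return CONDITIONAL_TAG.sub("", html)


sample = "<p><![if !supportLists]>1.<![endif]>First item</p>"
print(comment_out_conditionals(sample))
print(strip_conditionals(sample))
```

Either function returns a string that can be passed to XMLHTMLParser in
place of the original source. Whether the parser should do this itself is
still the open question.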
