Hi Esteban

Thanks for the suggestion. I have skimmed through the description of HTML Tidy. 
I think the things it puts right (mismatched tags etc.) are exactly the things 
that XMLHTMLParser looks for and fixes. For example, in my distorted parses, 
the final </body> and </html> tags had been absorbed into the massive string 
node that contains most of the input text; the parser detected this and 
inserted them at the right point to close the parse correctly. 
Since my workaround, of editing out the specific features that cause the parse 
to go wrong, seems to fix the problem, I shall probably continue with it.
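
A minimal sketch of that kind of preprocessing is below. It is written in 
Python rather than Pharo purely for illustration, and the function names and 
the regular expression are assumptions, not part of XMLHTMLParser or taken from 
the actual workaround. It covers both variants mentioned in the original 
message: turning the <![if ...]> / <![endif]> markers into ordinary comments 
(as Firefox does), or removing them altogether.

```python
import re

# Hypothetical pattern: matches the markers Outlook emits without dashes,
# e.g. <![if !supportLists]> and <![endif]>. Markers that are already
# written as comments (<!--[if ...]-->) do not match, because "<!" must be
# followed directly by "[" and "]" directly by ">".
_CONDITIONAL_MARKER = re.compile(r"<!\[(if\b[^\]]*|endif)\]>", re.IGNORECASE)


def comment_out_conditionals(html: str) -> str:
    """Rewrite <![if ...]> and <![endif]> as ordinary HTML comments."""
    return _CONDITIONAL_MARKER.sub(lambda m: f"<!--[{m.group(1)}]-->", html)


def strip_conditionals(html: str) -> str:
    """Remove the <![if ...]> and <![endif]> markers, keeping what is between them."""
    return _CONDITIONAL_MARKER.sub("", html)
```

Either function would be applied to the HTML text before it is handed to the 
parser; the content between the markers is left untouched in both cases.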

Thanks for your help

Peter Kenny

-----Original Message-----
From: Pharo-users <pharo-users-boun...@lists.pharo.org> On Behalf Of Esteban 
Maringolo
Sent: 02 April 2020 19:53
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] How should XMLHTMLParser handle strange HTML?

Hi Peter,


Just in case it helps you parse the files...

I had to parse HTML with an XMLParser (not XMLHTMLParser), so what I did was 
first pass it through HTML Tidy [1], converting it to XHTML, which is 
compatible with XML parsers (it is XML, after all).

Regards,

[1] http://www.html-tidy.org/

Esteban A. Maringolo

On Thu, Apr 2, 2020 at 2:17 PM PBKResearch <pe...@pbkresearch.co.uk> wrote:
>
> Hello
>
>
>
> I have come across a strange problem in using XMLHTMLParser to parse some 
> HTML files which use strange constructions. The input files have been 
> generated by using MS Outlook to translate incoming messages, stored in .msg 
> files, into HTML. The translated files display normally in Firefox, and the 
> XMLHTMLParser appears to generate a normal parse, but examination of the 
> parse output shows that the structure is distorted, and about half the input 
> text has been put into one string node.
>
>
>
> Hunting around, I am convinced that the trouble lies in the presence in the 
> HTML source of pairs of comment-like tags, with this form:
>
> <![if !supportLists]>
>
> <![endif]>
>
> since the distorted parse starts at the first occurrence of one of these tags.
>
>
>
> I don’t know whether these are meant to be a structure in some programming 
> language – there is no reference to supportLists anywhere in the source code. 
> When it is displayed in Firefox, use of the ‘Inspect Element’ option shows 
> that the browser has treated them as comments, displaying them with the 
> necessary dashes as e.g. <!--[if !supportLists]-->. I edited the source code 
> by inserting the dashes, and XMLHTMLParser parsed everything correctly.
>
>
>
> I have a workaround, therefore; either edit in the dashes to make them into 
> legitimate comments, or equivalently edit out these tags completely. The only 
> question of general interest is whether XMLHTMLParser should be expected to 
> handle these in some other way, rather than silently produce a distorted 
> parse. The Firefox approach, turning them into comments, seems sensible. It 
> would also be interesting if anyone has any idea what is going on in the 
> source code.
>
>
>
> Thanks for any help
>
>
>
> Peter Kenny

