I personally find this to be greatly helpful, after having completely
hacked JTidy to take care of most of the 'edge' conditions in malformed
HTML that we could find, just to get a DOM, just to be able to use XSLT.
If this were part of Xerces, or even an add-in, it would be greatly
appreciated by at least one person.

Scott Sanders

> -----Original Message-----
> From: Andy Clark [mailto:[EMAIL PROTECTED]] 
> Sent: Thursday, February 14, 2002 1:53 PM
> To: [EMAIL PROTECTED]; 
> [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: [ANNOUNCE] Xerces HTML Parser
> 
> 
> It was bugging me that the first version of the NekoHTML parser 
> could only handle the character encoding "Cp1252" (which is 
> the basic Windows encoding), so I updated the code to be able 
> to automatically handle UTF-8 (w/ BOM) and UTF-16. In 
> addition, it can detect the presence of a <meta 
> http-equiv='content-type' content='text/html; charset=XXX'> 
> tag and scan the remaining document using charset "XXX", 
> assuming that Java has an appropriate decoder available.
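> 
> For example, pulling the "XXX" value out of the content attribute
> might look something like this (just a sketch with a made-up helper
> name, not the actual NekoHTML code):
> 
>   // Hypothetical helper, not the NekoHTML implementation.
>   public class CharsetSniffer {
>       /** Returns "XXX" from "text/html; charset=XXX", or null. */
>       public static String extractCharset(String content) {
>           if (content == null) return null;
>           int index = content.toLowerCase().indexOf("charset=");
>           if (index == -1) return null;
>           String charset = content.substring(index + 8).trim();
>           // strip an optional leading quote
>           if (charset.startsWith("\"") || charset.startsWith("'")) {
>               charset = charset.substring(1);
>           }
>           // stop at a closing quote or a following parameter
>           int end = 0;
>           while (end < charset.length()
>                  && "\"';".indexOf(charset.charAt(end)) == -1) {
>               end++;
>           }
>           return charset.substring(0, end).trim();
>       }
>       public static void main(String[] args) {
>           // prints "ISO-8859-1"
>           System.out.println(extractCharset("text/html; charset=ISO-8859-1"));
>       }
>   }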
> 
> You can download the latest code from the following URL:
> 
>   http://www.apache.org/~andyc/
> 
> I am very interested in hearing from people to see if the 
> code is useful and if they think it should be a standard part 
> of Xerces-J. 
> 
> Solving the problem of changing the character decoder in the 
> middle of the stream when the <meta> tag is detected was 
> rather interesting. If you want to know the technical 
> details, read on...
> 
> The code isn't that complicated, but it turned out to be
> less straightforward than I thought. First, the Java decoders 
> have a nasty habit of reading 8K of bytes even when you ask 
> for as little as a single character! This is annoying, at 
> best, because you can't switch decoders: the original decoder 
> has already consumed more bytes than it should have.
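> 
> A quick way to see this read-ahead for yourself (a standalone
> demo, not part of the parser):
> 
>   import java.io.ByteArrayInputStream;
>   import java.io.InputStreamReader;
> 
>   public class ReadAheadDemo {
>       public static void main(String[] args) throws Exception {
>           byte[] bytes = new byte[100]; // pretend this is a document
>           ByteArrayInputStream in = new ByteArrayInputStream(bytes);
>           InputStreamReader reader = new InputStreamReader(in, "Cp1252");
>           reader.read(); // ask for a single character...
>           // ...but the decoder has already drained the whole stream
>           System.out.println("bytes left: " + in.available()); // 0, not 99
>       }
>   }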
> 
> Then, even if the Java decoders were written to only consume
> as many bytes as needed to return the requested characters, 
> there's still a problem caused by buffering. Since I buffer a 
> block of characters to improve performance, this again 
> consumes bytes *past* the <meta> tag, which destroys any 
> chance of changing the decoder mid-stream.
> 
> So to solve this problem, I wrote a "playback" input stream 
> which buffers all of the bytes read on the underlying input 
> stream. If the scanner detects a <meta> tag that changes the 
> encoding, then the stream is played back again. And if the 
> <body> tag is found (or a tag whose parent should be the 
> <body> tag), then the buffer is cleared. So at worst, just 
> the beginning of the document is buffered, which isn't 
> too bad.
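> 
> Just to give a flavor of the idea, here is a rough sketch of
> such a playback stream (a simplification for illustration, not
> the actual class in the parser):
> 
>   import java.io.ByteArrayOutputStream;
>   import java.io.IOException;
>   import java.io.InputStream;
> 
>   // Every byte read from the underlying stream is also saved,
>   // so the caller can rewind and read the same bytes again
>   // (e.g. after switching to the charset from the <meta> tag).
>   public class PlaybackInputStream extends InputStream {
>       private final InputStream in;
>       private ByteArrayOutputStream saved = new ByteArrayOutputStream();
>       private byte[] replay;   // non-null while playing back
>       private int replayPos;
> 
>       public PlaybackInputStream(InputStream in) { this.in = in; }
> 
>       public int read() throws IOException {
>           if (replay != null) {
>               if (replayPos < replay.length) {
>                   return replay[replayPos++] & 0xFF;
>               }
>               replay = null; // playback exhausted, back to the real stream
>           }
>           int b = in.read();
>           if (b != -1 && saved != null) saved.write(b);
>           return b;
>       }
> 
>       /** Rewind so that the saved bytes are read again. */
>       public void playback() {
>           if (saved != null) {
>               replay = saved.toByteArray();
>               replayPos = 0;
>           }
>       }
> 
>       /** Stop saving once the <body> tag has been seen. */
>       public void clearBuffer() {
>           saved = null;
>       }
>   }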
> 
> You may notice that if the stream is played back, then the 
> parser will scan document contents that it has already 
> seen. This was simple enough to fix, though. When the
> character encoding is changed, I note how many elements I
> have already seen. Then, when the stream is re-scanned, I 
> ignore events until the element count gets back to where 
> it was when I detected the <meta> tag.
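> 
> In other words, something along these lines (a simplified
> sketch of the skipping logic; the field names are made up):
> 
>   // Suppress events that were already reported before the playback.
>   class SkipAfterPlayback {
>       private int fElementCount; // elements seen so far in this pass
>       private int fSkipUntil;    // count when the <meta> tag was hit
> 
>       // Called when the <meta> tag supplies a new charset.
>       void noteEncodingChange() {
>           fSkipUntil = fElementCount; // remember how far we had gotten
>           fElementCount = 0;          // the stream is about to be replayed
>       }
> 
>       // Called for every start element; true if it should be reported.
>       boolean shouldReport() {
>           fElementCount++;
>           return fElementCount > fSkipUntil;
>       }
>   }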
> 
> So there's got to be an easier way to change the decoder
> of the stream than to go through all of this trouble,
> right? Not unless I want to rewrite every known character 
> decoder. So I'm stuck with this kind of solution. But it 
> seems to work very well.
> 
> -- 
> Andy Clark * [EMAIL PROTECTED]
> 
