Make it two persons!

Paulo Gaspar

> -----Original Message-----
> From: Scott Sanders [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, February 14, 2002 11:12 PM
> To: [EMAIL PROTECTED]
> Subject: RE: [ANNOUNCE] Xerces HTML Parser
>
>
> I personally find this to be greatly helpful, after having completely
> hacked Jtidy to take care of most of the 'edge' conditions in malformed
> HTML that we could find, just to get a DOM, just to be able to use XSLT.
> If this was part of Xerces, or even an add-in, it would be greatly
> appreciated by at least one person.
>
> Scott Sanders
>
> > -----Original Message-----
> > From: Andy Clark [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, February 14, 2002 1:53 PM
> > To: [EMAIL PROTECTED];
> > [EMAIL PROTECTED]; [EMAIL PROTECTED]
> > Subject: Re: [ANNOUNCE] Xerces HTML Parser
> >
> >
> > It was bugging me that the first version of the NekoHTML parser
> > could only handle the character encoding "Cp1252" (which is
> > the basic Windows encoding), so I updated the code to be able
> > to automatically handle UTF-8 (w/ BOM) and UTF-16. In
> > addition, it can detect the presence of a <meta
> > http-equiv='content-type' content='text/html; charset=XXX'>
> > tag and scan the remaining document using charset "XXX",
> > assuming that Java has an appropriate decoder available.
> >
> > You can download the latest code from the following URL:
> >
> > http://www.apache.org/~andyc/
> >
> > I am very interested in hearing from people to see if the
> > code is useful and if they think it should be a standard part
> > of Xerces-J.
> >
> > Solving the problem of changing the character decoder in the
> > middle of the stream when the <meta> tag is detected was
> > rather interesting. If you want to know the technical
> > details, read on...
> >
> > The code isn't that complicated but it turned out to be not
> > as straightforward as I thought. First, the Java decoders
> > have a nasty habit of reading 8K of bytes even when asked for
> > as little as a single character!
> > This is annoying, at best: you can't change the decoder, because
> > the original decoder has already consumed more bytes than it should.
> >
> > Then, even if the Java decoders were written to only consume
> > as many bytes as needed to return the requested characters,
> > there's still a problem caused by buffering. Since I buffer a
> > block of characters to improve performance, this again
> > consumes bytes *past* the <meta> tag, which destroys any
> > chance of changing the decoder mid-stream.
> >
> > So to solve this problem, I wrote a "playback" input stream
> > which buffers all of the bytes read on the underlying input
> > stream. If the scanner detects a <meta> tag that changes the
> > encoding, then the stream is played back again. And if the
> > <body> tag is found (or a tag whose parent should be the
> > <body> tag), then the buffer is cleared. So at worst, just
> > the beginning of the document is buffered, which isn't
> > too bad.
> >
> > You may notice that if the stream is played back, then the
> > parser will scan document contents that it has already
> > seen. This was simple enough to fix, though. When the
> > character encoding is changed, I note how many elements I
> > have already seen. Then, when the stream is re-scanned, I
> > ignore the events until the number of elements is back to
> > where I was when I detected the <meta> tag.
> >
> > So there's got to be an easier way to change the decoder
> > of the stream than to go through all of this trouble,
> > right? Not unless I want to re-write every known character
> > decoder. So I'm stuck with this kind of a solution. But it
> > seems to work very well.
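
For anyone who wants to picture the "playback" stream described above, here
is a minimal sketch in Java. It is not Andy's actual code: the class name
PlaybackInputStream and the methods playback() and clear() are names assumed
for illustration, and the sketch only shows the record / replay / clear idea.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative sketch only (hypothetical names): records every byte read
 * from the wrapped stream so the scanner can start over with a new decoder.
 */
class PlaybackInputStream extends InputStream {
    private final InputStream in;
    // Bytes seen so far; set to null once clear() has been called.
    private ByteArrayOutputStream recorded = new ByteArrayOutputStream();
    // Bytes currently being replayed, and the position within them.
    private byte[] replay = new byte[0];
    private int replayPos = 0;

    PlaybackInputStream(InputStream in) {
        this.in = in;
    }

    @Override
    public int read() throws IOException {
        // Serve replayed bytes first; they are already in the recording.
        if (replayPos < replay.length) {
            return replay[replayPos++] & 0xFF;
        }
        int b = in.read();
        if (b != -1 && recorded != null) {
            recorded.write(b);
        }
        return b;
    }

    /** Rewind: the next reads return the document from its first byte. */
    void playback() {
        if (recorded == null) {
            throw new IllegalStateException("buffer already cleared");
        }
        replay = recorded.toByteArray();
        replayPos = 0;
    }

    /** Stop recording, e.g. once <body> is seen and no <meta> can follow. */
    void clear() {
        recorded = null; // bytes still pending in a replay are kept
    }
}
```

A caller would wrap the raw stream, create a reader with the default
encoding, call playback() and create a fresh reader with the new encoding
when a <meta> charset is found, and call clear() at <body> so that only the
head of the document is ever held in memory.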
> >
> > --
> > Andy Clark * [EMAIL PROTECTED]
> >
> > ---------------------------------------------------------------------
> > In case of troubles, e-mail: [EMAIL PROTECTED]
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
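
The <meta> detection mentioned in Andy's message ("assuming that Java has an
appropriate decoder available") could look roughly like the sketch below. It
is an assumption-laden illustration, not the NekoHTML scanner: the class
MetaCharsetSniffer, the method sniff(), and the regular expression are all
made up here, and it uses the java.nio.charset API rather than whatever the
2002 code used. A real scanner would find the tag while tokenizing instead of
using a regex.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only (hypothetical names). */
final class MetaCharsetSniffer {
    // Loose match for: <meta ... charset=NAME ...>, quotes optional.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:+-]+)",
            Pattern.CASE_INSENSITIVE);

    private MetaCharsetSniffer() {
    }

    /**
     * Returns the charset named in a <meta> tag within the given leading
     * bytes, or null if none is declared or Java has no decoder for it.
     */
    static Charset sniff(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives
        // whatever the real encoding of the document turns out to be.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        String name = m.group(1);
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalCharsetNameException e) {
            return null; // syntactically invalid charset name
        }
    }
}
```

When sniff() returns a charset different from the one in use, that is the
point at which the stream would be played back and re-read with the new
decoder.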
