It was bugging me that the first version of the NekoHTML parser could only handle the character encoding "Cp1252" (the basic Windows encoding), so I updated the code to automatically handle UTF-8 (with a BOM) and UTF-16 as well. In addition, it can detect the presence of a <meta http-equiv='content-type' content='text/html; charset=XXX'> tag and scan the rest of the document using charset "XXX", assuming that Java has an appropriate decoder available.
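If you're curious what the BOM detection amounts to, here is a minimal sketch in Java. This is not the actual NekoHTML code (the class name is invented, and a real version would loop until enough bytes arrive); it just peeks at the first few bytes and pushes back anything that isn't a byte order mark:

  import java.io.IOException;
  import java.io.PushbackInputStream;

  // Sketch only: detect a Unicode byte order mark and push back any
  // bytes that turn out not to be part of one.
  public class BOMSniffer {

      /** Returns the charset name implied by the BOM, or null if none. */
      public static String sniff(PushbackInputStream in) throws IOException {
          byte[] head = new byte[3];
          int count = in.read(head, 0, 3); // simplified: real code loops
          if (count == 3 && (head[0] & 0xFF) == 0xEF
                         && (head[1] & 0xFF) == 0xBB
                         && (head[2] & 0xFF) == 0xBF) {
              return "UTF-8";                         // EF BB BF
          }
          if (count >= 2) {
              int b0 = head[0] & 0xFF, b1 = head[1] & 0xFF;
              if (b0 == 0xFE && b1 == 0xFF) {         // FE FF
                  if (count == 3) in.unread(head[2]); // third byte is data
                  return "UTF-16BE";
              }
              if (b0 == 0xFF && b1 == 0xFE) {         // FF FE
                  if (count == 3) in.unread(head[2]);
                  return "UTF-16LE";
              }
          }
          if (count > 0) in.unread(head, 0, count);   // no BOM: put it back
          return null;
      }
  }

Construct the stream with room for pushback, e.g. new PushbackInputStream(in, 3), and fall back to Cp1252 (or whatever default you prefer) when sniff() returns null.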
You can download the latest code from the following URL:

    http://www.apache.org/~andyc/

I am very interested in hearing from people to see if the code is useful and whether they think it should be a standard part of Xerces-J.

Solving the problem of changing the character decoder in the middle of the stream when the <meta> tag is detected was rather interesting. If you want to know the technical details, read on...

The code isn't that complicated, but it turned out to be less straightforward than I thought. First, the Java decoders have a nasty habit of reading 8K of bytes even when asked for as little as a single character! This is annoying, at best, because once the original decoder has consumed more bytes than it should, you can't simply swap in a new one. And even if the Java decoders were written to consume only as many bytes as needed to return the requested characters, there would still be a problem caused by buffering: since I buffer a block of characters to improve performance, this again consumes bytes *past* the <meta> tag, which destroys any chance of changing the decoder mid-stream.

So to solve this problem, I wrote a "playback" input stream which buffers all of the bytes read from the underlying input stream (see the sketch in the P.S. below). If the scanner detects a <meta> tag that changes the encoding, the stream is played back from the start. And if the <body> tag is found (or a tag whose parent should be the <body> tag), the buffer is cleared, since once the <body> starts no <meta> tag can change the encoding. So at worst, just the beginning of the document is buffered, which isn't too bad.

You may notice that if the stream is played back, the parser will re-scan document content that it has already seen. This was simple enough to fix, though. When the character encoding is changed, I note how many elements I have already seen. Then, when the stream is re-scanned, I ignore the events until the element count is back to where it was when I detected the <meta> tag (see the P.P.S.).

So there's got to be an easier way to change the decoder of the stream than to go through all of this trouble, right? Not unless I want to re-write every known character decoder. So I'm stuck with this kind of solution. But it seems to work very well.

--
Andy Clark * [EMAIL PROTECTED]
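P.S. In case the playback idea is hard to picture, here's a stripped-down sketch in Java. It is not the actual class from the download (the name and details are made up, and a real version would work in blocks rather than byte-at-a-time), but it shows the record/rewind/clear mechanics:

  import java.io.ByteArrayOutputStream;
  import java.io.FilterInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  // Sketch only: record every byte read so the caller can rewind and
  // hand the same bytes to a brand new decoder.
  public class PlaybackInputStream extends FilterInputStream {

      private ByteArrayOutputStream record = new ByteArrayOutputStream();
      private byte[] replay;      // bytes currently being replayed
      private int replayOffset;

      public PlaybackInputStream(InputStream in) {
          super(in);
      }

      public int read() throws IOException {
          // Serve replayed bytes first, if a rewind is in progress.
          if (replay != null) {
              if (replayOffset < replay.length) {
                  return replay[replayOffset++] & 0xFF;
              }
              replay = null; // replay exhausted, back to the real stream
          }
          int b = in.read();
          if (b != -1 && record != null) {
              record.write(b); // remember it in case we have to rewind
          }
          return b;
      }

      // Route array reads through read() so every byte is recorded.
      // (Simplified: a real version would read whole blocks.)
      public int read(byte[] b, int off, int len) throws IOException {
          int i = 0;
          for (; i < len; i++) {
              int c = read();
              if (c == -1) {
                  return i == 0 ? -1 : i;
              }
              b[off + i] = (byte) c;
          }
          return i;
      }

      /** A <meta> tag changed the encoding: rewind to the first byte. */
      public void playback() {
          if (record != null) {
              replay = record.toByteArray();
              replayOffset = 0;
              record = null; // one rewind is enough, stop recording
          }
      }

      /** The <body> tag was seen: the encoding is settled, free memory. */
      public void clear() {
          record = null;
      }
  }

The idea is that you wrap the raw stream once, hang an InputStreamReader off of it with the initial encoding, and when the <meta> tag shows up you call playback() and construct a brand new InputStreamReader (with the new charset) over the same stream. Throwing away the old reader is the whole point: whatever bytes it had hoarded in its 8K buffer simply get re-read from the recording.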
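P.P.S. The event-skipping part is really just a counter. Again, the names here are made up for illustration, but the logic is the whole trick:

  // Sketch only: after a rewind, suppress events until we are past the
  // point where the <meta> tag was detected.
  public class EventSkipper {

      private final int elementsBeforeRewind; // count noted at the <meta> tag
      private int elementsSeen;               // count since the replay began

      public EventSkipper(int elementsBeforeRewind) {
          this.elementsBeforeRewind = elementsBeforeRewind;
      }

      /** Call on each start-element event; true means "deliver it". */
      public boolean startElement() {
          elementsSeen++;
          return elementsSeen > elementsBeforeRewind;
      }

      /** All other events are delivered only once the skipping is done. */
      public boolean otherEvent() {
          return elementsSeen > elementsBeforeRewind;
      }
  }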