On Apr 24, 2008, at 11:11 AM, Vadim Gritsenko wrote:

On Apr 24, 2008, at 10:14 AM, James Cowie wrote:

It all depends on the content you wish to deliver from the transformation. If you know that you will always require UTF-8, then set this as the default; you should be able to detect the browser version and work from there.

A transformer never works with Java character encodings directly. It always receives textual data either as char[] (the ContentHandler#characters method) or as String (attributes in the ContentHandler#startElement method).

*If* a transformer, for its internal needs, has to serialize textual data into binary form (convert a String or char[] to byte[]), then it should almost always use UTF-8. If it interfaces with some legacy system, it could be configured with another encoding. But IIUC that is not the case here.

But whatever a transformer does internally, that does not affect what it produces as output, since its output is the content passed to the ContentHandler#characters and ContentHandler#startElement methods - which take only textual data (no character encoding applies here), not binary data.

What did I miss? :)

Well, AFAIU the problem in NekoHTMLTransformer is that it corrupts text data here:

            ByteArrayInputStream bais =
                new ByteArrayInputStream(text.getBytes());

It should have used:

            ByteArrayInputStream bais =
                new ByteArrayInputStream(text.getBytes("UTF-8"));
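
To make the corruption concrete, here is a minimal, self-contained sketch (not the transformer code itself; the class name and sample string are mine) of why the no-argument getBytes() call is dangerous. It uses the platform default charset, which may not match what the downstream parser expects; I simulate that mismatch by encoding explicitly as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the bug: text.getBytes() with no argument uses the platform
// default encoding, which a UTF-8 parser may then misdecode.
public class EncodingDemo {
    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café" - contains one non-ASCII character

        // Simulate a platform whose default charset is ISO-8859-1:
        byte[] legacyBytes = text.getBytes(StandardCharsets.ISO_8859_1);

        // A UTF-8 parser decoding those bytes sees a malformed sequence:
        String decoded = new String(legacyBytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(text)); // false - text corrupted

        // Encoding explicitly as UTF-8 round-trips correctly:
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals(text)); // true
    }
}
```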

But the best way is to avoid the transcoding step altogether:

            Reader reader =
                new StringReader(text);
            DOMBuilder builder = new DOMBuilder();
            parser.setContentHandler(builder);
            parser.parse(new InputSource(reader));
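
For readers without a Cocoon setup, here is a runnable sketch of the same idea using the JDK's SAX parser in place of the Neko parser and Cocoon's DOMBuilder (those two classes are the only swaps; the sample markup and class name are mine). Feeding the parser a Reader keeps the data as characters end to end, so no byte encoding is ever involved:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Parse from a StringReader: characters in, characters out - the
// platform default charset never gets a chance to corrupt anything.
public class ReaderParseDemo {
    public static void main(String[] args) throws Exception {
        String text = "<p>caf\u00e9</p>";
        StringBuilder seen = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Data arrives as char[] - already text, no decoding step
                seen.append(ch, start, length);
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(text)), handler);

        System.out.println(seen); // café - intact regardless of default charset
    }
}
```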



Vadim
