On Apr 24, 2008, at 11:11 AM, Vadim Gritsenko wrote:

On Apr 24, 2008, at 10:14 AM, James Cowie wrote:

It all depends on the content you wish to deliver from the transformation. If you know that you will always require UTF-8, then set this as the default; you should be able to detect the browser version and work from there.

A transformer never works with Java character encodings directly. It always receives textual data either as char[] (the ContentHandler#characters method) or as String (attributes in the ContentHandler#startElement method).

*If* a transformer, for its internal needs, has to serialize textual data into binary form (convert a String or char[] to byte[]), then it should almost always use UTF-8. If it interfaces with some legacy system, it could be configured with another encoding. But IIUC that is not the case here.

But whatever a transformer does internally, that does not affect what it produces as output, since its output is the content passed to the ContentHandler#characters and ContentHandler#startElement methods - which take only textual data (no character encoding applies here), not binary data.

What did I miss? :)

Well, AFAIU the problem in NekoHTMLTransformer is that it corrupts text data here:

            ByteArrayInputStream bais =
                new ByteArrayInputStream(text.getBytes());

It should have used:

            ByteArrayInputStream bais =
                new ByteArrayInputStream(text.getBytes("UTF-8"));
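
To make the corruption concrete, here is a minimal, self-contained sketch (not the transformer code itself; the class name and sample string are mine) of why the no-argument getBytes() call is dangerous. It uses the platform default charset, which may not match what the downstream parser expects; I simulate that mismatch by encoding explicitly as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the bug: text.getBytes() with no argument uses the platform
// default encoding, which a UTF-8 parser may then misdecode.
public class EncodingDemo {
    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café" - contains one non-ASCII character

        // Simulate a platform whose default charset is ISO-8859-1:
        byte[] legacyBytes = text.getBytes(StandardCharsets.ISO_8859_1);

        // A UTF-8 parser decoding those bytes sees a malformed sequence:
        String decoded = new String(legacyBytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(text)); // false - text corrupted

        // Encoding explicitly as UTF-8 round-trips correctly:
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        String roundTripped = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(roundTripped.equals(text)); // true
    }
}
```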

But the best way is to avoid the transcoding step altogether:

            Reader reader =
                new StringReader(text);
            DOMBuilder builder = new DOMBuilder();
            parser.setContentHandler(builder);
            parser.parse(new InputSource(reader));
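
For readers without a Cocoon setup, here is a runnable sketch of the same idea using the JDK's SAX parser in place of the Neko parser and Cocoon's DOMBuilder (those two classes are the only swaps; the sample markup and class name are mine). Feeding the parser a Reader keeps the data as characters end to end, so no byte encoding is ever involved:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Parse from a StringReader: characters in, characters out - the
// platform default charset never gets a chance to corrupt anything.
public class ReaderParseDemo {
    public static void main(String[] args) throws Exception {
        String text = "<p>caf\u00e9</p>";
        StringBuilder seen = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Data arrives as char[] - already text, no decoding step
                seen.append(ch, start, length);
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(text)), handler);

        System.out.println(seen); // café - intact regardless of default charset
    }
}
```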



Vadim
