So we agree? The bug is in HTMLGenerator, but the encoding it expects isn't UTF-8 (reading from http://www.w3.org/ doesn't work for me; I get a NullPointerException). It expects ISO-8859-1 or maybe the default encoding of the JVM. Can you file a bug in Bugzilla?
Regards,
Joerg
Yury Mikhienko wrote:
Hi Joerg!
Thanks for your reply.
Tidy on its own works properly (the output stream encoding is the same as the input stream encoding).
The problem, from my point of view, is in the input stream encoding of the transformer (or the streamer, if the xpath value is null) in HTMLGenerator,
because the Tidy DOM parser returns a KOI8-R encoded document (the same encoding as the Tidy input document), but HTMLGenerator needs, I guess, a UTF-8 encoded document in the input stream for its transformer or streamer.
What do you think about my guess?
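To illustrate the kind of mismatch described above, here is a minimal Java sketch. It is not Cocoon or Tidy code: the class name EncodingMismatchDemo, the helper decode, and the sample text are made up for illustration, and it assumes the JVM ships the KOI8-R charset (standard JDK builds do). It only shows that bytes written in KOI8-R come back intact when decoded as KOI8-R, and come back damaged when a reader assumes UTF-8 or ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Cocoon or JTidy.
final class EncodingMismatchDemo {

    // Decodes raw bytes using the charset the reader *assumes* they are in.
    static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        Charset koi8r = Charset.forName("KOI8-R");

        // Sample Cyrillic text, written with escapes so the source file
        // encoding does not matter.
        String original = "\u043F\u0440\u0438\u0432\u0435\u0442";

        // This is what a KOI8-R encoded stream would contain.
        byte[] raw = original.getBytes(koi8r);

        // Correct assumption: the text survives the round trip.
        System.out.println("KOI8-R ok:     " + decode(raw, koi8r).equals(original));

        // Wrong assumptions: the same bytes decode to different characters.
        System.out.println("UTF-8 ok:      " + decode(raw, StandardCharsets.UTF_8).equals(original));
        System.out.println("ISO-8859-1 ok: " + decode(raw, StandardCharsets.ISO_8859_1).equals(original));
    }
}
```

If the consumer of the stream assumes the wrong charset at this decoding step, nothing later in the pipeline can repair the text, which would be consistent with the behaviour reported in this thread.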
Hello Yuri,
I can only confirm the bug in the HTML generator. It seems it cannot read the KOI8-R encoded file correctly. I tested it with your HTML snippet saved to a static file.
Calling serializer.setOutputProperty(OutputKeys.ENCODING, "KOI8-R"); of course does not help, because that only sets the output encoding. Configuring the serializer in the sitemap to KOI8-R works correctly if the input file is not encoded in KOI8-R (and, I guess, not in some other more or less exotic encodings either).
If it were a bug in the serializer, a character reference like ð would be OK, because a character that's not directly available in this encoding must be expressed by such a reference.
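The two points above (the output property only affects the output side, and characters missing from the output encoding become character references) can be sketched with the JDK's built-in JAXP identity transformer. This is an assumption-laden illustration, not Cocoon's serializer: the class OutputEncodingDemo, the helper serialize, and the sample text are hypothetical, and the exact form of the reference (decimal or hexadecimal) may differ between XSLT implementations.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of Cocoon.
final class OutputEncodingDemo {

    // Builds <p>text</p> in memory and serializes it in the given encoding.
    // The DOM already holds correctly decoded characters, so only the
    // output side is exercised here.
    static String serialize(String text, String encoding) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            p.setTextContent(text);
            doc.appendChild(p);

            // Identity transformer used purely as a serializer.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, encoding);
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            serializer.transform(new DOMSource(doc), new StreamResult(out));

            // Decode the bytes with the same encoding they were written in.
            return new String(out.toByteArray(), Charset.forName(encoding));
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        String cyrillic = "\u043F\u0440\u0438\u0432\u0435\u0442";

        // UTF-8 can represent Cyrillic, so the characters are written directly.
        System.out.println(serialize(cyrillic, "UTF-8"));

        // ISO-8859-1 cannot, so the serializer falls back to character references.
        System.out.println(serialize(cyrillic, "ISO-8859-1"));
    }
}
```

In both calls the input characters are the same; only the way they are written out changes. That is why setting the output encoding cannot repair text that was already decoded incorrectly on the way in.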
I hope I didn't say anything wrong ;-) Yuri, I think it's best to file a bug in Bugzilla at http://nagoya.apache.org/bugzilla/.
Regards,
Joerg