Hi Pål!
The output of Tidy already contains question marks in place of the M$ (Windows-1252) characters described here:
http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
I tried to add switches to JTidy:
Tidy tidy = new Tidy();
tidy.setXmlOut(true);
tidy.setRawOut(true);
tidy.setTidyMark(false);
tidy.setCharEncoding(3); // 3 = UTF-8 in JTidy R7
Document xmlDocument = tidy.parseDOM(in, null);
But it was not enough. The real fix (perhaps in addition to the switches above)
is to encode the JTidy input string as "UTF-8" and NOT in the encoding of the
HTTP response (which here is ISO-8859-1). The response encoding seems to be
ignored by PDF readers, but it probably has to be set to "UTF-8" as well:
InputStream in = new ByteArrayInputStream(("<title>" + nameOfPage +
"</title>" + htmlOfPage)
.getBytes("UTF-8"));
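To illustrate why the byte encoding matters, here is a minimal standalone sketch. It uses only the JDK, not JTidy, and the class name EncodingDemo and the helper roundTrip are made up for the illustration. It shows that characters such as curly quotes or the euro sign cannot be represented in ISO-8859-1, so Java silently replaces them with "?" when the string is turned into bytes, whereas UTF-8 keeps them:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JTidy or of the patched source.
class EncodingDemo {

    // Encode the text to bytes and decode it back with the same charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement byte, which is '?' for ISO-8859-1.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Left/right curly quotes (U+201C, U+201D) and the euro sign (U+20AC):
        // typical "Windows" characters with no ISO-8859-1 equivalent.
        String html = "<title>\u201CQuoted\u201D \u20AC</title>";

        // The three special characters are lost and come back as '?'.
        System.out.println(roundTrip(html, StandardCharsets.ISO_8859_1));

        // The string survives unchanged; how it is displayed depends on
        // the console encoding.
        System.out.println(roundTrip(html, StandardCharsets.UTF_8));
    }
}
```

This is presumably the same loss that occurs when the input bytes handed to JTidy are produced with the ISO-8859-1 response encoding instead of UTF-8.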
Please find the modified source code attached. I would greatly appreciate it if
you could publish a new JAR, as it would let me return to a standard setup (I
currently patch the JAR with the compiled class!)
Have a nice evening!
Christophe Dupriez