On 11/05/09 06:31, David Lu wrote:
> I have some Java code that converts Microsoft Word documents
> to plain text. It works fine. However, I do have a problem
> with some special characters such as the left and right double
> quotes, the trademark and copyright symbols, etc., that are
> not working as I expect.
>
> Basically, for a Word document, when I use "Save As" in the GUI
> and save the document as "Text (encoded)" with "UTF-8", I get all
> the special characters in the output file.
>
> However, when I use Java and call storeAsURL() with the same
> input file, using "Text (encoded)" for FilterName and "UTF-8"
> for FilterOptions, some of the characters, namely the trademark
> and copyright symbols, and a few others, are saved as question
> marks.
>
> I've also tried using "Windows-1252/WinLatin 1" as the encoding
> with the same results.
>
> The "Save As" from the GUI seems to work "better" than calling
> storeAsURL() in terms of preserving more characters. But the
> documentation for storeAsURL() seems to indicate it's the same
> as "Save As". So do I need to specify additional properties
> for storeAsURL()?

This sounds strange, and I suggest you file a bug for it.
Windows-1252 has additional characters compared to ISO 8859-1, in the
range 0x80--0x9F, and at first it sounded like that fact might somehow
be related to the problem. However, trademark (U+2122) is in that area
(0x99) while copyright (U+00A9) is not (0xA9), yet you say that both
have the problem...
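
The byte values above can be checked outside the office API with a
small stand-alone Java sketch. It is only an illustration: the class
name EncodingCheck and the helper hex() are made up for this example,
and it assumes the JVM provides the "windows-1252" charset (common,
but not one of the charsets Java guarantees). It does not touch
storeAsURL() at all; it only shows which bytes each encoding produces
for the two characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of any office API.
final class EncodingCheck {

    // Encode s with the given charset and return the bytes as upper-case hex.
    // String.getBytes(Charset) substitutes the charset's replacement byte
    // (0x3F, '?', for ISO-8859-1) when a character cannot be mapped.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: this charset is available on the running JVM.
        Charset cp1252 = Charset.forName("windows-1252");
        String trademark = "\u2122";
        String copyright = "\u00A9";

        System.out.println("TM  windows-1252: " + hex(trademark, cp1252));
        System.out.println("TM  ISO-8859-1:   " + hex(trademark, StandardCharsets.ISO_8859_1));
        System.out.println("TM  UTF-8:        " + hex(trademark, StandardCharsets.UTF_8));
        System.out.println("(C) windows-1252: " + hex(copyright, cp1252));
        System.out.println("(C) ISO-8859-1:   " + hex(copyright, StandardCharsets.ISO_8859_1));
        System.out.println("(C) UTF-8:        " + hex(copyright, StandardCharsets.UTF_8));
    }
}
```

Trademark maps to 0x99 only in Windows-1252 and degrades to '?' (0x3F)
in ISO 8859-1, whereas copyright is 0xA9 in both, so a plain fallback
to ISO 8859-1 would not explain a lost copyright symbol.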
-Stephan