character encoding and charsets

Justin Warren Thu, 03 May 2007 07:26:41 -0700

Hi guys,


I have an interesting problem. I am using POI to extract text from a
Word document (usually Word 2000/2003) that is written in Chinese. When
I write the extracted text to a plain-text file, I get garbled
characters instead of Chinese. I want the output to be UTF-8. Is there
any way to determine the document's charset so I can decode it?

In Eclipse, I am calling WordExtractor.getParagraphs(), and if I set a
breakpoint I can see the Chinese characters in the debugger. I also
noticed that HWPFDocument has a property called field_27_cChFtnEdn. Is
that possibly what I should be looking at?
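
For illustration, here is a minimal sketch of the write step only. It assumes the extracted text is already a correctly decoded Java String (which the debugger view suggests) and that the garbling comes from the writer using the platform default charset. Both are assumptions, not something confirmed here. The class name Utf8TextWriter, the method writeUtf8, the output file name, and the sample strings are hypothetical; only JDK classes are used, and the POI call is left out.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8TextWriter {

    // Writes each paragraph to the target file, encoding the characters as
    // UTF-8 explicitly instead of relying on the platform default charset.
    static void writeUtf8(Path target, String[] paragraphs) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String paragraph : paragraphs) {
                out.write(paragraph);
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for the text extracted by POI.
        String[] paragraphs = { "\u4e2d\u6587\u6587\u672c", "second paragraph" };
        writeUtf8(Paths.get("extracted.txt"), paragraphs);
    }
}
```

If the file still looks wrong after this, it may be worth checking that the viewer opening it is set to UTF-8 as well.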

Thanks
