Hi all,
I used WordExtractor to extract text from MS Word documents. The
extracted text has many non-text characters that display as squares,
and sometimes as lines. However, most of the text appears correctly. I
took hex dumps of the text and found that some squares have the value
A0 and some have B7. I tried to remove them using the String method
"String replace(char oldChar, char newChar)", but it does not remove
them.
Any idea how I can remove these lines and squares? It looks like some
of them appear each time a line of text wraps within a table in the MS
Word document.
Please see the piece of code below (I used the version of WordExtractor
from the Nutch project's parse-msword plugin because the one in the main
POI lib throws exceptions when I use it under Tomcat, ...)
Thank you in advance.
Nguessan
.....
WordExtractor wordextractor = new WordExtractor();
try {
    String textContent = wordextractor.extractText(bin);
    textContent = textContent.replace('\u00A0', '\u0000');
    textContent = textContent.replace('\u00B7', '\u0000');
    writer = new BufferedWriter(new FileWriter(filename));
    writer.write(textContent);
....
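
One possible reason the snippet above seems to have no effect:
replace(char, char) always substitutes one character for another, so
mapping to '\u0000' leaves a NUL character in the output, which many
viewers also draw as a square. The sketch below assumes the hex values
A0 and B7 correspond to U+00A0 (no-break space) and U+00B7 (middle dot),
and that turning the first into a plain space and deleting the second is
acceptable. The class and method names are hypothetical and not part of
POI or Nutch, and the String overload used for deletion needs Java 5 or
later.

```java
// Hypothetical helper, not part of POI or Nutch.
final class ExtractedTextCleaner {

    private ExtractedTextCleaner() {
    }

    // Assumption: the "squares" are U+00A0 (no-break space) and
    // U+00B7 (middle dot), as the hex dump values A0 and B7 suggest.
    static String clean(String text) {
        return text
                // char overload: swaps each no-break space for an ordinary space
                .replace('\u00A0', ' ')
                // String overload: replacing with "" deletes the middle dot
                .replace("\u00B7", "");
    }

    public static void main(String[] args) {
        String sample = "cell\u00A0one\u00B7 wraps";
        System.out.println(clean(sample)); // prints "cell one wraps"
    }
}
```

In the original snippet this would amount to calling
ExtractedTextCleaner.clean(textContent) before writer.write(...). If
other code points turn up in the hex dump, they would need their own
replacements.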
---------------------------------------------------------------------
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/