No-break space and middle dot in String produced with WordExtractor

nguessan Thu, 13 Jul 2006 14:41:37 -0700

Hi all,
I used WordExtractor to extract text from MS Word documents. The
extracted text contains many non-text characters that display as squares,
and sometimes as lines, although most of the text appears clearly. I did
hex dumps of the text and found that some squares have the value A0
and some have B7. I tried to remove them using the String method
"String replace(char oldChar, char newChar)", but it does not remove
them.
Any idea how I can remove these lines and squares? It looks like some
appear each time a line of text wraps within a table in the MS Word
document.
Please see the piece of code below (I used the version of WordExtractor in
the Nutch project's parse-msword plugin because the one in the main POI
lib throws exceptions when I use it under Tomcat, ...)
Thank you in advance.


Nguessan


   .....
WordExtractor wordextractor = new WordExtractor();
try{
   String textContent = wordextractor.extractText(bin);
   textContent = textContent.replace('\u00A0','\u0000');
   textContent = textContent.replace('\u00B7','\u0000');
   writer = new BufferedWriter(new FileWriter(filename));
   writer.write(textContent);
   ....
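
A note on the replace calls above: String.replace(char, char) swaps one
character for another, so replacing with '\u0000' leaves a NUL character
in the text rather than deleting anything, and a NUL usually still shows
up as a square. Below is a minimal sketch of one way to clean the string,
assuming the goal is to turn each no-break space (U+00A0) into an ordinary
space and to drop each middle dot (U+00B7). The class and method names are
hypothetical and are not part of POI or Nutch.

```java
// Hypothetical helper for illustration; not part of POI or Nutch.
final class ExtractedTextCleaner {

    // Returns a copy of text with each no-break space (U+00A0) replaced by
    // an ordinary space and each middle dot (U+00B7) removed.
    static String clean(String text) {
        return text
                // char overload: one character is swapped for another
                .replace('\u00A0', ' ')
                // CharSequence overload (Java 5+): an empty replacement
                // deletes every occurrence
                .replace("\u00B7", "");
    }

    public static void main(String[] args) {
        String sample = "cell\u00A0one\u00B7 cell\u00A0two";
        System.out.println(clean(sample));
    }
}
```

With something like this, the two replace lines in the snippet above could
become textContent = ExtractedTextCleaner.clean(textContent); whether the
middle dots should be dropped or mapped to another character depends on
what they stand for in the original document.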


