Hello,

I am using the org.apache.poi.hwpf.extractor.WordExtractor class to extract text from MS Word documents. The problem is that the output includes not only the text of interest but also keywords that indicate formatting or fields, e.g. TOC, HYPERLINK, and REF. Is there any way to recognize and exclude these keywords?

I used the getIstd() function from org.apache.poi.hwpf.model.PAPX to access the sti codes of individual paragraphs. However, I did not find a similar class or function that can be applied to individual words.
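To illustrate the kind of word-level filtering I have in mind, here is a minimal sketch in plain Java. It assumes the extracted string still carries Word's field marker characters (0x13 field begin, 0x14 field separator, 0x15 field end), so that a keyword such as HYPERLINK sits between a begin marker and its separator or end marker. Whether WordExtractor's output actually preserves these markers may depend on the POI version, and the class and method names below (FieldCodeStripper, stripFieldCodes) are hypothetical, not part of POI.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper, not part of POI: drops field instructions, keeps field results. */
public class FieldCodeStripper {
    // Marker characters used by the Word binary format (assumed present in the text).
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while we are still in its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        int hidden = 0; // number of open fields currently in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inInstruction.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Instruction part ends, visible result part begins.
                if (!inInstruction.isEmpty()) {
                    if (inInstruction.pop()) {
                        hidden--;
                    }
                    inInstruction.push(Boolean.FALSE);
                }
            } else if (c == FIELD_END) {
                if (!inInstruction.isEmpty() && inInstruction.pop()) {
                    hidden--; // field had no separator, so no visible result
                }
            } else if (hidden == 0) {
                // Ordinary text, or the result part of a field: keep it.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this approach the marker characters themselves are always dropped, and text nested inside another field's instruction part is dropped with it. It would be applied to the string returned by the extractor, e.g. FieldCodeStripper.stripFieldCodes(extractedText).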

Any help is much appreciated.

Regards,
Leila

