On Mon, 8 Jan 2007, Leila Homaeian wrote:
I am using the org.apache.poi.hwpf.extractor.WordExtractor class to extract the text from MS Word documents. The problem is that the output includes not only the text of interest, but also some keywords indicating the text format, e.g. TOC, HYPERLINK, REF, etc. Is there any way to recognize and exclude these keywords?
In theory, there ought to be. The trouble is that the person who wrote most of HWPF, Ryan Ackley, left to work for a firm that licensed the Microsoft file format documentation, so we no longer have an expert on the Word file format.
If you can figure out how to identify these blocks of text, we'd love a patch!
I used the getIstd() function from org.apache.poi.hwpf.model.PAPX to access the sti codes of individual paragraphs. However, I did not find a similar class or function that can be applied to individual words.
A paragraph is made up of a number of CharacterRuns. You could try looking for a similar function for CharacterRuns?
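
One untested idea, in case it helps as a starting point: keywords like TOC, HYPERLINK and REF are Word field instructions, and in the binary format a field is normally bracketed by three control characters: 0x13 (field begin), 0x14 (separator between the instruction and its displayed result) and 0x15 (field end). Assuming the text you get back from the extractor still contains those characters (I have not checked that it does), a filter along these lines might drop the instruction part and keep the visible result. The FieldStripper class and stripFields method are made-up names for illustration, not part of POI.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class FieldStripper {
    // Control characters assumed to delimit fields in the extracted text.
    private static final char FIELD_BEGIN = 0x13;
    private static final char FIELD_SEPARATOR = 0x14;
    private static final char FIELD_END = 0x15;

    /**
     * Removes field instructions (the text between FIELD_BEGIN and
     * FIELD_SEPARATOR, or FIELD_END when there is no separator) and keeps
     * the field result and all ordinary text. Nested fields are handled.
     */
    public static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while we are in its instruction part.
        Deque<Boolean> openFields = new ArrayDeque<Boolean>();
        int instructionDepth = 0; // number of open fields still in instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                instructionDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    instructionDepth--;
                }
            } else if (c == FIELD_END) {
                if (!openFields.isEmpty() && openFields.pop()) {
                    instructionDepth--;
                }
            } else if (instructionDepth == 0) {
                // Ordinary text, or the result part of a field.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See " + FIELD_BEGIN + " HYPERLINK \"http://example.com\" "
                + FIELD_SEPARATOR + "the site" + FIELD_END + " now.";
        System.out.println(stripFields(raw));
    }
}
```

You would call stripFields on each string the extractor returns. If the markers turn out not to be present in the output, the same begin/separator/end logic would have to be applied one level down, on the CharacterRun text.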
Nick
