Re: Extract pure text from MS Word documents

Nick Burch Tue, 09 Jan 2007 02:54:27 -0800

On Mon, 8 Jan 2007, Leila Homaeian wrote:

I am using the org.apache.poi.hwpf.extractor.WordExtractor class to extract the text from MS Word documents. The problem is that the output includes not only the text of interest, but also some keywords indicating the text format, e.g. TOC, HYPERLINK, REF, etc. Is there any way to recognize and exclude these keywords?

In theory, there ought to be. The trouble is that the person who wrote most of HWPF, Ryan Ackley, left to work for a firm that licensed the Microsoft file format documentation, so we no longer have an expert on the Word file format.

If you can figure out how to identify these blocks of text, we'd love a patch!

I used the getIstd() function from org.apache.poi.hwpf.model.PAPX to access the sti codes of individual paragraphs. However, I did not find a similar class or function that can be applied to individual words.

A paragraph is made up of a number of CharacterRuns. You could try looking for a similar function for CharacterRuns?
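
As a possible starting point, here is a rough sketch that works on the extracted text alone, without touching HWPF. It assumes the string you get back still contains the control characters Word uses to delimit fields: 0x13 (field begin), 0x14 (separator between the field instruction and its displayed result) and 0x15 (field end). Keywords such as TOC, HYPERLINK and REF sit in the instruction part, so dropping everything between 0x13 and 0x14 (or between 0x13 and 0x15 when there is no separator) should leave only the visible text. The FieldStripper class and the stripFieldCodes method are made-up names for illustration, not part of POI, and this has not been checked against real documents.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of POI: removes Word field instructions
// (e.g. HYPERLINK, TOC, REF) from text that still carries field markers.
class FieldStripper {
    private static final char FIELD_BEGIN = '\u0013';     // starts a field
    private static final char FIELD_SEPARATOR = '\u0014'; // instruction | result
    private static final char FIELD_END = '\u0015';       // ends a field

    static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field; true while we are inside its instruction.
        Deque<Boolean> openFields = new ArrayDeque<>();
        int hidden = 0; // number of open fields currently in their instruction part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                hidden++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from instruction to result.
                if (!openFields.isEmpty() && openFields.peek()) {
                    openFields.pop();
                    openFields.push(Boolean.FALSE);
                    hidden--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; it may never have had a separator.
                if (!openFields.isEmpty() && openFields.pop()) {
                    hidden--;
                }
            } else if (hidden == 0) {
                // Ordinary text, or the visible result of a field.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 HYPERLINK \"http://example.com\" \u0014the site\u0015 now";
        System.out.println(stripFieldCodes(raw)); // prints "See the site now"
    }
}
```

The stack is there because fields can be nested, for example a field inside the instruction of another field. Text is kept only when no enclosing field is still in its instruction part.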


Nick

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
