Best way to extract text from a word file

Nick Burch Thu, 09 Feb 2006 05:44:04 -0800

Hi All

I'm thinking about adding a simple text extractor utility to hwpf, since everyone is currently rolling their own, and that's not very programmer-efficient!


When I get text out, I normally use something like:
        // wdoc is the HWPFDocument for the file being read
        StringBuffer text = new StringBuffer();
        Range r = wdoc.getRange();
        for(int i=0; i < r.numParagraphs(); i++) {
                Paragraph p = r.getParagraph(i);
                text.append(p.text());
        }

However, I've also seen people advocate an approach like:
        StringBuffer text = new StringBuffer();
        // Walk the raw text pieces instead of the paragraph ranges
        Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
        while (textPieces.hasNext()) {
                TextPiece piece = (TextPiece) textPieces.next();

                // A piece is stored either as 8-bit Cp1252 or as UTF-16LE
                String encoding = "Cp1252";
                if (piece.usesUnicode()) {
                        encoding = "UTF-16LE";
                }
                text.append(new String(piece.getRawBytes(), encoding));
        }
(normally accompanied by some stripping out of macros)
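
For anyone who wants to see what the per-piece encoding switch in that second version actually does, here is a rough standalone sketch using only the JDK. The class name PieceDecodeSketch and the helper decodePiece are made up for illustration and are not part of hwpf; the byte arrays stand in for what piece.getRawBytes() might return, and the sketch assumes the JDK provides the windows-1252 charset (Cp1252 is an alias for it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PieceDecodeSketch {

    // Hypothetical helper (not an hwpf API): decode one piece's raw bytes,
    // choosing the charset the same way the loop above does.
    static String decodePiece(byte[] raw, boolean unicode) {
        Charset cs = unicode
                ? StandardCharsets.UTF_16LE
                : Charset.forName("windows-1252"); // assumed available in the JDK
        return new String(raw, cs);
    }

    public static void main(String[] args) {
        // "Hi" followed by e-acute, stored one byte per character
        byte[] ansi = { 0x48, 0x69, (byte) 0xE9 };
        // The same three characters stored as UTF-16LE, two bytes each
        byte[] wide = { 0x48, 0x00, 0x69, 0x00, (byte) 0xE9, 0x00 };

        StringBuilder text = new StringBuilder();
        text.append(decodePiece(ansi, false));
        text.append(decodePiece(wide, true));

        // Both pieces decode to the same three characters
        System.out.println(text.length());
    }
}
```

The point is only that the same characters can arrive in two byte layouts, so the second version has to pick a charset per piece, whereas the first version leaves that to Paragraph.text().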

Is there any reason why I shouldn't use the first version?

Nick

