Nick Burch wrote:
You could try using org.apache.poi.hwpf.HWPFDocument, and getting the range, then the paragraphs, and grab the text from each paragraph. If there's interest, I could probably commit an extractor that does this to poi.

Yes, that's exactly what I'm doing. Having this in POI would help me a lot, though, since I barely understand the POI basics, to be honest (my fault, not POI's).

This is my current code (adapted from Aperture code in CVS):

HWPFDocument doc = new HWPFDocument(poiFileSystem);
StringBuffer buffer = new StringBuffer(4096);

Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        // the following is derived from
        // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
                encoding = "UTF-16LE";
        }

        buffer.append(new String(piece.getRawBytes(), encoding));
}

// normalize end-of-line characters and remove any lines
// containing DOCPROPERTY field codes
BufferedReader reader = new BufferedReader(
        new StringReader(buffer.toString()));
buffer.setLength(0);

String line;
while ((line = reader.readLine()) != null) {
        if (line.indexOf("DOCPROPERTY") == -1) {
                buffer.append(line);
                // END_OF_LINE is a String constant defined elsewhere
                // in the class (not shown in this snippet)
                buffer.append(END_OF_LINE);
        }
}

// fetch the extracted full-text
String text = buffer.toString();
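
For anyone who wants to try the two steps above without a Word file at hand, here is a minimal sketch that uses only the JDK and no POI classes. The class name WordTextCleanup and the methods decodePiece and dropLinesContaining are hypothetical names made up for illustration, and "windows-1252" is assumed to be available in the running JDK as the canonical name for Cp1252. It mirrors the encoding choice and the line filter from the code above, nothing more.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of POI or Aperture.
final class WordTextCleanup {

    // Decodes the raw bytes of one text piece. The usesUnicode flag stands in
    // for the value that piece.usesUnicode() returns in the code above.
    static String decodePiece(byte[] rawBytes, boolean usesUnicode) {
        Charset charset = usesUnicode
                ? StandardCharsets.UTF_16LE
                : Charset.forName("windows-1252"); // assumed available; alias of Cp1252
        return new String(rawBytes, charset);
    }

    // Splits text on \n, \r or \r\n, drops every line that contains marker,
    // and terminates each remaining line with eol.
    static String dropLinesContaining(String text, String marker, String eol) {
        StringBuilder out = new StringBuilder(text.length());
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.contains(marker)) {
                    out.append(line).append(eol);
                }
            }
        } catch (IOException e) {
            // Not expected for an in-memory StringReader.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data: two lines separated by bare carriage returns.
        byte[] piece = "DOCPROPERTY Title\rBody text\r"
                .getBytes(StandardCharsets.UTF_16LE);
        String text = decodePiece(piece, true);
        System.out.print(dropLinesContaining(text, "DOCPROPERTY", System.lineSeparator()));
    }
}
```

BufferedReader.readLine() treats a bare carriage return as a line terminator, which is why the filter works on input that uses \r between lines.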


Regards,

Chris