Nick Burch wrote:
You could try using org.apache.poi.hwpf.HWPFDocument, and getting the range, then the paragraphs, and grab the text from each paragraph. If there's interest, I could probably commit an extractor that does this to poi.
Yes, that's exactly what I'm doing. Having this in POI would benefit me a lot though, as I hardly understand the POI basics to be honest (my fault, not POI's).
This is my current code (adapted from Aperture code in CVS): HWPFDocument doc = new HWPFDocument(poiFileSystem); StringBuffer buffer = new StringBuffer(4096); Iterator textPieces = doc.getTextTable().getTextPieces().iterator(); while (textPieces.hasNext()) { TextPiece piece = (TextPiece) textPieces.next(); // the following is derived from // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406 String encoding = "Cp1252"; if (piece.usesUnicode()) { encoding = "UTF-16LE"; } buffer.append(new String(piece.getRawBytes(), encoding)); } // normalize end-of-line characters and remove any lines // containing macros BufferedReader reader = new BufferedReader(new StringReader(buffer.toString())); buffer.setLength(0); String line; while ((line = reader.readLine()) != null) { if (line.indexOf("DOCPROPERTY") == -1) { buffer.append(line); buffer.append(END_OF_LINE); } } // fetch the extracted full-text String text = buffer.toString(); Regards, Chris -- --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]