Nick Burch wrote:
> You could try using org.apache.poi.hwpf.HWPFDocument, and getting the
> range, then the paragraphs, and grab the text from each paragraph. If
> there's interest, I could probably commit an extractor that does this to
> poi.

Yes, that's exactly what I'm doing. Having this in POI would benefit me
a lot though, as I hardly understand the POI basics to be honest (my
fault, not POI's).
This is my current code (adapted from Aperture code in CVS):

    import java.io.BufferedReader;
    import java.io.StringReader;
    import java.util.Iterator;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.model.TextPiece;

    // poiFileSystem is an already opened POIFSFileSystem for the .doc file
    HWPFDocument doc = new HWPFDocument(poiFileSystem);
    StringBuffer buffer = new StringBuffer(4096);

    // concatenate the decoded contents of all text pieces
    Iterator textPieces = doc.getTextTable().getTextPieces().iterator();
    while (textPieces.hasNext()) {
        TextPiece piece = (TextPiece) textPieces.next();

        // the following is derived from
        // http://article.gmane.org/gmane.comp.jakarta.poi.devel/7406
        String encoding = "Cp1252";
        if (piece.usesUnicode()) {
            encoding = "UTF-16LE";
        }
        buffer.append(new String(piece.getRawBytes(), encoding));
    }

    // normalize end-of-line characters and remove any lines
    // containing macros
    BufferedReader reader = new BufferedReader(
            new StringReader(buffer.toString()));
    buffer.setLength(0);
    String line;
    while ((line = reader.readLine()) != null) {
        if (line.indexOf("DOCPROPERTY") == -1) {
            buffer.append(line);
            // END_OF_LINE is a constant defined elsewhere in the class (not shown)
            buffer.append(END_OF_LINE);
        }
    }

    // fetch the extracted full-text
    String text = buffer.toString();

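
The two steps in that code (choosing the charset per text piece, then filtering out the macro lines) can be tried without POI at all. Here is a minimal sketch using only the JDK. The class name TextCleanup, its method names, and the "\n" value for END_OF_LINE are my own placeholders for illustration, not anything from POI or Aperture:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TextCleanup {
    // Hypothetical stand-in for the END_OF_LINE constant used in the code above.
    static final String END_OF_LINE = "\n";

    // Decode the raw bytes of one text piece: UTF-16LE if the piece is
    // flagged as Unicode, Windows code page 1252 otherwise.
    static String decode(byte[] rawBytes, boolean usesUnicode) {
        Charset charset = usesUnicode
                ? StandardCharsets.UTF_16LE
                : Charset.forName("windows-1252");
        return new String(rawBytes, charset);
    }

    // Split the text into lines (readLine accepts \r, \n and \r\n),
    // drop every line containing the marker, and rejoin the rest
    // with a single END_OF_LINE after each kept line.
    static String dropLinesContaining(String text, String marker) {
        StringBuilder out = new StringBuilder(text.length());
        BufferedReader reader = new BufferedReader(new StringReader(text));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.contains(marker)) {
                    out.append(line).append(END_OF_LINE);
                }
            }
        } catch (IOException e) {
            // A StringReader should not fail, but readLine declares IOException.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode(new byte[] {0x48, 0x00, 0x69, 0x00}, true);
        System.out.print(dropLinesContaining(decoded + "\rDOCPROPERTY Title\rBye", "DOCPROPERTY"));
    }
}
```

Since readLine treats a lone \r as a line terminator, the carriage returns Word uses as paragraph marks get normalized as a side effect of the filtering loop.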
Regards,
Chris