WordExtractor doesn't extract text from HWPFDocument
----------------------------------------------------
Key: TIKA-690
URL: https://issues.apache.org/jira/browse/TIKA-690
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.9, 1.0
Reporter: Joseph Vychtrle
If I use Apache POI's HWPF component to create an MS Word .doc file and pass
it to tika.parseToString(is), it returns just a single newline ("\n"). I tested
this with many different input texts. Adding paragraphs doesn't help.
{code}
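private void createDOCDocument(String from, File file) throws Exception {
    HWPFDocument doc;
    // Load the template from the classpath via DOCGenerator.class
    // (not DOCGenerator.class.getClass(), which resolves against java.lang.Class).
    try (InputStream in = DOCGenerator.class.getResourceAsStream("/poi/template.doc")) {
        doc = new HWPFDocument(new POIFSFileSystem(in));
    }

    // Insert the text at the start of the document range.
    Range range = doc.getRange();
    CharacterRun run1 = range.insertBefore(from);
    run1.setFontSize(11);

    // Attach a custom property, creating the container if the template has none.
    DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
    CustomProperties cp = dsi.getCustomProperties();
    if (cp == null) {
        cp = new CustomProperties();
    }
    cp.put("myProperty", "foo bar baz");
    dsi.setCustomProperties(cp);

    // Write the document and close the output stream.
    try (FileOutputStream out = new FileOutputStream(file)) {
        doc.write(out);
    }
}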
{code}
{code}
protected String extractText(InputStream is) throws SystemException {
    Tika tika = new Tika();
    // Cap the extracted text length; maxCharCount is defined elsewhere (not shown).
    tika.setMaxStringLength(Long.valueOf(maxCharCount).intValue());
    try {
        return tika.parseToString(is);
    } catch (IOException | TikaException e) {
        // Wrap parsing failures in the application's own exception type.
        throw new SystemException(e.getMessage(), e);
    }
}
{code}