WordExtractor doesn't extract text from HWPFDocument
----------------------------------------------------

                 Key: TIKA-690
                 URL: https://issues.apache.org/jira/browse/TIKA-690
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.9, 1.0
            Reporter: Joseph Vychtrle


If I use Apache POI's HWPF component to create an MS Word document and pass it to 
tika.parseToString(is), it returns just a newline ("\n"). I tested this with 
many different input strings. Adding paragraphs doesn't help.


{code}
private void createDOCDocument(String from, File file) throws Exception {

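    // Resolve the template against DOCGenerator itself. DOCGenerator.class.getClass()
    // is java.lang.Class, so the lookup would not go through DOCGenerator's class loader.
    POIFSFileSystem fs = new POIFSFileSystem(
            DOCGenerator.class.getResourceAsStream("/poi/template.doc"));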
    HWPFDocument doc = new HWPFDocument(fs);

    Range range = doc.getRange();
    CharacterRun run1 = range.insertBefore(from);
    run1.setFontSize(11);

    DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
    CustomProperties cp = dsi.getCustomProperties();
    if (cp == null)
        cp = new CustomProperties();
    cp.put("myProperty", "foo bar baz");
    dsi.setCustomProperties(cp);

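    // Close the stream so the file is fully written before it is read back.
    FileOutputStream out = new FileOutputStream(file);
    try {
        doc.write(out);
    } finally {
        out.close();
    }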
}
{code}

{code}
protected String extractText(InputStream is) throws SystemException {
    Tika tika = new Tika();
    // Cap the length of the extracted text.
    tika.setMaxStringLength(Long.valueOf(maxCharCount).intValue());
    try {
        return tika.parseToString(is);
    } catch (IOException ioe) {
        throw new SystemException(ioe.getMessage(), ioe);
    } catch (TikaException te) {
        throw new SystemException(te.getMessage(), te);
    }
}
{code}
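
To narrow down whether the problem is in the generated file or in the extraction, two quick checks may help. This is only a sketch using plain Java: the class ExtractionCheck and both method names are hypothetical and not part of Tika or POI. The first check compares the start of the written file with the 8-byte OLE2 compound-file signature that every .doc container begins with. The second treats a whitespace-only result such as "\n" as a failed extraction.

```java
// Hypothetical helper for illustration; not part of Tika or POI.
final class ExtractionCheck {

    // Signature found in the first 8 bytes of an OLE2 compound file,
    // the container format used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns true if the given leading bytes start with the OLE2 signature.
    static boolean looksLikeOle2(byte[] head) {
        if (head == null || head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns true if the extracted text is null or contains only whitespace,
    // which is the symptom described in this report.
    static boolean isBlankExtraction(String text) {
        return text == null || text.trim().isEmpty();
    }
}
```

If looksLikeOle2 passes for the first bytes of the file written by createDOCDocument while isBlankExtraction is true for the result of extractText, the container itself was written and the text is lost somewhere between HWPF and the parser.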
