[jira] [Commented] (TIKA-690) WordExtractor doesn't extract text from HWPFDocument

Joseph Vychtrle (JIRA) Sun, 14 Aug 2011 10:33:50 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084869#comment-13084869
 ]


Joseph Vychtrle commented on TIKA-690:
--------------------------------------

I was using a Tika snapshot, so POI was at 3.8-beta3 ... Anyway, first of all, Tika's 
WordExtractor doesn't extract anything from a .doc unless the text is inserted 
into a paragraph. I finally made it work like this:
{code}
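private void createDOCDocument(String from, File file) throws Exception {
    // Load the template .doc from the classpath
    POIFSFileSystem fs = new POIFSFileSystem(
            DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
    HWPFDocument doc = new HWPFDocument(fs);

    // Insert the text into the first paragraph rather than directly into the range
    Range range = doc.getRange();
    Paragraph par1 = range.getParagraph(0);

    CharacterRun run1 = par1.insertBefore(from, new CharacterProperties());
    run1.setFontSize(11);

    // Close the output stream once the document has been written
    FileOutputStream out = new FileOutputStream(file);
    try {
        doc.write(out);
    } finally {
        out.close();
    }
}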
{code}


So even if you have a .doc that looks exactly the same, but whose text was 
inserted directly with range.insertBefore(), WordExtractor doesn't extract it.

> WordExtractor doesn't extract text from HWPFDocument
> ----------------------------------------------------
>
>                 Key: TIKA-690
>                 URL: https://issues.apache.org/jira/browse/TIKA-690
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9, 1.0
>            Reporter: Joseph Vychtrle
>              Labels: parsing
>
> If I use Apache POI's HWPF component to create an MS .doc and pass it to 
> tika.parseToString(is), it returns just a newline "\n". I tested that 
> with tons of different input text. Adding paragraphs doesn't help.
> {code}
> private void createDOCDocument(String from, File file) throws Exception {
>     POIFSFileSystem fs = new POIFSFileSystem(
>             DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
>     HWPFDocument doc = new HWPFDocument(fs);
>     Range range = doc.getRange();
>     CharacterRun run1 = range.insertBefore(from);
>     run1.setFontSize(11);
>     DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
>     CustomProperties cp = dsi.getCustomProperties();
>     if (cp == null)
>         cp = new CustomProperties();
>     cp.put("myProperty", "foo bar baz");
>     dsi.setCustomProperties(cp);
>     doc.write(new FileOutputStream(file));
> }
> {code}
> {code}
> protected String extractText(InputStream is) throws SystemException {
>     Tika tika = new Tika();
>     tika.setMaxStringLength(new Long(maxCharCount).intValue());
>     String text;
>     try {
>         text = tika.parseToString(is);
>     } catch (IOException ioe) {
>         throw new SystemException(ioe.getMessage(), ioe);
>     } catch (TikaException te) {
>         throw new SystemException(te.getMessage(), te);
>     }
>     return text;
> }
> {code}
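
A side note on the extractText snippet quoted above: new Long(maxCharCount).intValue() is a narrowing conversion, so if maxCharCount is a long larger than Integer.MAX_VALUE the limit passed to tika.setMaxStringLength(int) wraps around silently. The declaration of maxCharCount is not shown in the issue, so treating it as a long is an assumption, and the class and helper names below (MaxLength, clampToInt) are hypothetical. A minimal sketch of a clamping conversion:

```java
class MaxLength {
    // Hypothetical helper: clamp a long limit into the int range
    // instead of letting the narrowing conversion wrap around.
    static int clampToInt(long value) {
        if (value > Integer.MAX_VALUE) {
            return Integer.MAX_VALUE;
        }
        if (value < Integer.MIN_VALUE) {
            return Integer.MIN_VALUE;
        }
        return (int) value;
    }

    public static void main(String[] args) {
        long big = 3000000000L; // assumed example value above Integer.MAX_VALUE

        // Same narrowing as new Long(big).intValue(): the value wraps
        System.out.println(Long.valueOf(big).intValue()); // prints -1294967296

        // Clamped conversion keeps the largest representable limit
        System.out.println(clampToInt(big)); // prints 2147483647
    }
}
```

With such a helper, the call in extractText could read tika.setMaxStringLength(clampToInt(maxCharCount)). This only tidies up the length limit and is unrelated to the missing text described in this issue.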

