[ 
https://issues.apache.org/jira/browse/TIKA-690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084878#comment-13084878
 ] 

Joseph Vychtrle commented on TIKA-690:
--------------------------------------

Thank you Nick, I didn't know that the "Closing Issue" comment goes here :-)

> WordExtractor doesn't extract text from HWPFDocument
> ----------------------------------------------------
>
>                 Key: TIKA-690
>                 URL: https://issues.apache.org/jira/browse/TIKA-690
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9, 1.0
>            Reporter: Joseph Vychtrle
>              Labels: parsing
>
> If I use Apache POI's HWPF component to create an MS Word document and pass it to 
> tika.parseToString(is), it returns just a newline ("\n"). I tested this 
> with many different input texts. Adding paragraphs doesn't help.
> {code}
> private void createDOCDocument(String from, File file) throws Exception {
>     // Load the template from the classpath, relative to DOCGenerator's class loader
>     POIFSFileSystem fs = new POIFSFileSystem(
>             DOCGenerator.class.getResourceAsStream("/poi/template.doc"));
>     HWPFDocument doc = new HWPFDocument(fs);
>     Range range = doc.getRange();
>     CharacterRun run1 = range.insertBefore(from);
>     run1.setFontSize(11);
>     DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
>     CustomProperties cp = dsi.getCustomProperties();
>     if (cp == null)
>         cp = new CustomProperties();
>     cp.put("myProperty", "foo bar baz");
>     dsi.setCustomProperties(cp);
>     // Close the output stream even if write() throws
>     try (FileOutputStream out = new FileOutputStream(file)) {
>         doc.write(out);
>     }
> }
> {code}
> {code}
> protected String extractText(InputStream is) throws SystemException {
>     Tika tika = new Tika();
>     // Cap the amount of extracted text at maxCharCount characters
>     tika.setMaxStringLength(Long.valueOf(maxCharCount).intValue());
>     String text;
>     try {
>         text = tika.parseToString(is);
>     } catch (IOException ioe) {
>         throw new SystemException(ioe.getMessage(), ioe);
>     } catch (TikaException te) {
>         throw new SystemException(te.getMessage(), te);
>     }
>     return text;
> }
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
