[
https://issues.apache.org/jira/browse/TIKA-690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph Vychtrle closed TIKA-690.
--------------------------------
Resolution: Not A Problem
WordExctractor requires HWPF document with paragraphs
> WordExtractor doesn't extract text from HWPFDocument
> ----------------------------------------------------
>
> Key: TIKA-690
> URL: https://issues.apache.org/jira/browse/TIKA-690
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9, 1.0
> Reporter: Joseph Vychtrle
> Labels: parsing
>
> If I use apache poi's HWPF component to create MS doc, and pass it to
> tika.parseToString(is); it returns just carriage return "\n". I tested that
> with tons of different input text. Adding paragraphs doesn't help.
> {code}
> private void createDOCDocument(String from, File file) throws Exception {
> POIFSFileSystem fs = new
> POIFSFileSystem(DOCGenerator.class.getClass().getResourceAsStream("/poi/template.doc"));
> HWPFDocument doc = new HWPFDocument(fs);
> Range range = doc.getRange();
> CharacterRun run1 = range.insertBefore(from);
> run1.setFontSize(11);
> DocumentSummaryInformation dsi = doc.getDocumentSummaryInformation();
> CustomProperties cp = dsi.getCustomProperties();
> if (cp == null)
> cp = new CustomProperties();
> cp.put("myProperty", "foo bar baz");
> dsi.setCustomProperties(cp);
> doc.write(new FileOutputStream(file));
> }
> {code}
> {code}
> protected String extractText(InputStream is) throws SystemException {
> Tika tika = new Tika();
> tika.setMaxStringLength(new Long(maxCharCount).intValue());
> String text;
> try {
> text = tika.parseToString(is);
> } catch (IOException ioe) {
> throw new SystemException(ioe.getMessage(), ioe);
> } catch (TikaException te) {
> throw new SystemException(te.getMessage(), te);
> }
> return text;
> }
> {code}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira