[ https://issues.apache.org/jira/browse/TIKA-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709485#comment-14709485 ]
Tim Allison commented on TIKA-1713:
-----------------------------------

Y. Figured as much. Got it. Oh, Symantec EV...That helps. I might be able to find one of those.

Thank you, again, for raising this issue and submitting a mock test rtf file. The fix is non-trivial so it may take a few weeks, but it will be good to add the ability to handle this type of file. Thank you!

> RTF parser misses text content
> -------------------------------
>
>                 Key: TIKA-1713
>                 URL: https://issues.apache.org/jira/browse/TIKA-1713
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Mike Cantrell
>            Assignee: Tim Allison
>         Attachments: no-text.rtf
>
>
> We have a lot of Outlook msg files that have RTF body content. Tika is not
> finding any text within these messages. It appears to be a mixture of RTF and
> HTML.
> I've extracted an example RTF body (see attachment) for use with the
> following test case:
> {code}
> ByteArrayOutputStream bytes = new ByteArrayOutputStream()
> rtfParser.parse(
>     this.class.getResourceAsStream("/problems/no-text.rtf"),
>     new EmbeddedContentHandler(new BodyContentHandler(bytes)),
>     new Metadata(), new ParseContext()
> );
> assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)