[ https://issues.apache.org/jira/browse/TIKA-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709476#comment-14709476 ]
Mike Cantrell commented on TIKA-1713:
-------------------------------------

Yeah, sorry.. I can't really give out the original msg file due to privacy concerns.

Short story: We aren't really concerned about the embedded attachments/documents.

Long Story: These messages appear to be generated by Symantec Enterprise Vault as a part of an archival process. The strange RTF message body isn't the only odd thing about these files. The attachments listed in the body aren't really there. There are attachments but they are hidden and have filename=@ and contain 0 byte content.

> RTF parser misses text content
> -------------------------------
>
>                 Key: TIKA-1713
>                 URL: https://issues.apache.org/jira/browse/TIKA-1713
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Mike Cantrell
>            Assignee: Tim Allison
>         Attachments: no-text.rtf
>
>
> We have a lot of Outlook msg files that have RTF body content. Tika is not
> finding any text within these messages. It appears to be a mixture of RTF and
> HTML.
> I've extracted an example RTF body (see attachment) for use with the
> following test case:
> {code}
> ByteArrayOutputStream bytes = new ByteArrayOutputStream()
> rtfParser.parse(
>     this.class.getResourceAsStream("/problems/no-text.rtf"),
>     new EmbeddedContentHandler(new BodyContentHandler(bytes)),
>     new Metadata(), new ParseContext()
> );
> assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)