[ 
https://issues.apache.org/jira/browse/TIKA-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709476#comment-14709476
 ] 

Mike Cantrell commented on TIKA-1713:
-------------------------------------

Yeah, sorry. I can't really give out the original msg file due to privacy 
concerns. 

Short story: We aren't really concerned about the embedded 
attachments/documents.

Long story: These messages appear to be generated by Symantec Enterprise Vault 
as part of an archival process. The strange RTF message body isn't the only 
odd thing about these files. The attachments listed in the body aren't really 
there. There are attachments, but they are hidden, have filename=@, and 
contain 0 bytes of content.
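
For anyone who wants to skip those stub attachments, here is a rough sketch of 
the check described above. It does not use the Tika or POI API: the Attachment 
class and its fields are hypothetical stand-ins for whatever the msg reader 
exposes, and the only rule encoded is the pattern seen in these files (hidden, 
filename "@", 0 bytes).

```java
import java.util.ArrayList;
import java.util.List;

public class PlaceholderAttachments {

    // Hypothetical stand-in for an attachment entry read from a msg file.
    // These fields are assumptions for illustration, not a real library type.
    static final class Attachment {
        final String filename;
        final boolean hidden;
        final byte[] content;

        Attachment(String filename, boolean hidden, byte[] content) {
            this.filename = filename;
            this.hidden = hidden;
            this.content = content;
        }
    }

    // True for the stub pattern seen in these files: hidden, named "@", empty.
    static boolean isPlaceholder(Attachment a) {
        return a.hidden
                && "@".equals(a.filename)
                && (a.content == null || a.content.length == 0);
    }

    // Returns a new list holding only the attachments that are not stubs.
    static List<Attachment> withoutPlaceholders(List<Attachment> all) {
        List<Attachment> kept = new ArrayList<>();
        for (Attachment a : all) {
            if (!isPlaceholder(a)) {
                kept.add(a);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Attachment> all = new ArrayList<>();
        all.add(new Attachment("@", true, new byte[0]));
        all.add(new Attachment("report.pdf", false, new byte[] {1, 2, 3}));
        System.out.println(withoutPlaceholders(all).size());
    }
}
```

All three conditions are required together, so a real attachment that merely 
happens to be hidden or empty is still kept.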

> RTF parser misses text content 
> -------------------------------
>
>                 Key: TIKA-1713
>                 URL: https://issues.apache.org/jira/browse/TIKA-1713
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Mike Cantrell
>            Assignee: Tim Allison
>         Attachments: no-text.rtf
>
>
> We have a lot of Outlook msg files that have RTF body content. Tika is not 
> finding any text within these messages. It appears to be a mixture of RTF and 
> HTML.
> I've extracted an example RTF body (see attachment) for use with the 
> following test case:
> {code}
> ByteArrayOutputStream bytes = new ByteArrayOutputStream()
> rtfParser.parse(
>         this.class.getResourceAsStream("/problems/no-text.rtf"),
>         new EmbeddedContentHandler(new BodyContentHandler(bytes)),
>         new Metadata(), new ParseContext()
> );
> assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
