[ https://issues.apache.org/jira/browse/TIKA-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-1713.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.22
Fixed via TIKA-2883
> RTF parser misses text content
> -------------------------------
>
> Key: TIKA-1713
> URL: https://issues.apache.org/jira/browse/TIKA-1713
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.10
> Reporter: Mike Cantrell
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.22
>
> Attachments: no-text.rtf
>
>
> We have many Outlook .msg files with RTF body content, and Tika is not
> finding any text within these messages. The body content appears to be a
> mixture of RTF and HTML.
> I've extracted an example RTF body (see attachment) for use with the
> following test case:
> {code}
> // Parse the extracted RTF body and capture any text the parser emits
> ByteArrayOutputStream bytes = new ByteArrayOutputStream()
> rtfParser.parse(
>     this.class.getResourceAsStream("/problems/no-text.rtf"),
>     new EmbeddedContentHandler(new BodyContentHandler(bytes)),
>     new Metadata(), new ParseContext()
> )
> // Reported to fail on 1.10: no text is extracted from this document
> assertTrue("Document is missing required text", bytes.toByteArray().length > 0)
> {code}
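
One common reason an Outlook RTF body looks like a mixture of RTF and HTML is that the HTML has been encapsulated in RTF, which is signalled by a \fromhtml control word in the RTF header. Whether that applies to no-text.rtf is an assumption, since the attachment is not reproduced here. The sketch below is a rough, standard-library-only heuristic for spotting such bodies; the class name, method name, and 4096-byte scan limit are all hypothetical and not part of Tika.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class EncapsulatedHtmlCheck {
    // Matches the \fromhtml control word with an optional numeric parameter, e.g. \fromhtml1
    private static final Pattern FROM_HTML = Pattern.compile("\\\\fromhtml\\d*");

    // Hypothetical limit: only the start of the document is scanned
    private static final int SCAN_LIMIT = 4096;

    /** Returns true if the bytes start like RTF and the scanned prefix contains \fromhtml. */
    static boolean looksLikeEncapsulatedHtml(byte[] rtf) {
        int limit = Math.min(rtf.length, SCAN_LIMIT);
        String head = new String(rtf, 0, limit, StandardCharsets.US_ASCII);
        return head.startsWith("{\\rtf") && FROM_HTML.matcher(head).find();
    }

    public static void main(String[] args) {
        byte[] encapsulated =
            "{\\rtf1\\ansi\\fromhtml1 {\\*\\htmltag64 <p>}Hello{\\*\\htmltag72 </p>}}"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] plain = "{\\rtf1\\ansi Hello}".getBytes(StandardCharsets.US_ASCII);

        System.out.println(looksLikeEncapsulatedHtml(encapsulated)); // prints true
        System.out.println(looksLikeEncapsulatedHtml(plain));        // prints false
    }
}
```

This is only a string-level heuristic: it does not parse RTF groups, so an escaped backslash followed by the word "fromhtml" in body text within the scanned prefix would be a false positive.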
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)