[ 
https://issues.apache.org/jira/browse/TIKA-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850119#comment-16850119
 ] 

Ross Johnson commented on TIKA-2883:
------------------------------------

I know a bit about these types of files. Outlook / Exchange will often store 
messages as RTF-encapsulated HTML. This is a mixed representation of the text & 
formatting, such that a conforming RTF reader sees it as just a normal RTF file 
and ignores the HTML tags, while a special RTF de-encapsulator reader can still 
read the original HTML tags and ignore the other RTF operators. The actual body 
text / content is only included a single time and is shared between both 
representations. A conforming RTF reader should not have to do anything special 
to get the text or ignore the HTML tags. There is also such a thing as 
RTF-encapsulated plain text, which is similar to RTF-encapsulated HTML.
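
To make the two views concrete, here is a minimal sketch in Python. It is not
Tika's parser and not the rtf-stream-parser code. The names read, want_html and
SAMPLE are made up for illustration, and the sample document is hand-written.
It assumes the usual encapsulation control words: \fromhtml1 in the header,
HTML tags inside ignorable {\*\htmltagN ...} groups, and RTF-only formatting
wrapped between \htmlrtf and \htmlrtf0. It treats \htmlrtf as a global toggle
and does not handle \'hh escapes, font tables, or other header groups.

```python
import re

# Control word (optional numeric parameter, optional delimiting space),
# control symbol, group brace, or a run of plain text.
TOKEN = re.compile(r"\\([a-zA-Z]+)(-?\d+)? ?|\\(.)|([{}])|([^\\{}]+)", re.S)


def read(rtf, want_html):
    """Return the HTML view (want_html=True) or the plain RTF text view."""
    out = []
    saved = []        # "skip" flags of the enclosing groups
    skip = False      # inside a destination this reader ignores
    star = False      # previous token was \* (ignorable destination marker)
    htmlrtf = False   # between \htmlrtf and \htmlrtf0 (RTF-only content)
    for word, param, symbol, brace, text in TOKEN.findall(rtf):
        visible = not skip and not (want_html and htmlrtf)
        if brace == "{":
            saved.append(skip)
        elif brace == "}":
            skip = saved.pop()
        elif symbol == "*":
            star = True
            continue  # keep the marker set for the next control word
        elif symbol in ("\\", "{", "}"):
            if visible:
                out.append(symbol)  # escaped literal character
        elif word == "htmlrtf":
            htmlrtf = param != "0"  # unknown to a plain reader, so harmless
        elif word and star:
            # Ignorable destination: only the de-encapsulator reads \htmltag.
            skip = skip or not (want_html and word == "htmltag")
        elif word == "par" and visible:
            out.append("\n")
        elif text and visible:
            # Raw line breaks in the RTF source carry no meaning.
            out.append(text.replace("\r", "").replace("\n", ""))
        # Any other control word is unknown to this sketch and is ignored.
        star = False
    return "".join(out)


# Hand-written sample (an assumption, not taken from the attached file).
SAMPLE = (
    r"{\rtf1\ansi\fromhtml1 {\*\htmltag64 <p>}\htmlrtf \par\htmlrtf0 "
    r"Hello, {\*\htmltag84 <b>}world{\*\htmltag92 </b>}{\*\htmltag72 </p>}}"
)

print(read(SAMPLE, want_html=True))           # <p>Hello, <b>world</b></p>
print(read(SAMPLE, want_html=False).strip())  # Hello, world
```

The body text ("Hello, world") appears once in the source and comes out in
both views. The plain reader gets there only by skipping the \* groups and
ignoring control words it does not recognize.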

If Tika is not giving any text output for this file, then there is probably a 
bug in the RTF reader that is being used. Perhaps it is getting hung up on the 
various HTML control words that it doesn't know how to handle, when it should 
instead be ignoring them.

Source: I wrote a (non-Java) RTF de-encapsulator for text and HTML 
[https://github.com/mazira/rtf-stream-parser]

> Text not extracted from RTF files
> ---------------------------------
>
>                 Key: TIKA-2883
>                 URL: https://issues.apache.org/jira/browse/TIKA-2883
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1, 1.21
>            Reporter: Luis Filipe Nassif
>            Priority: Major
>         Attachments: Message (5).rtf
>
>
> I have a number of RTF files (extracted from PST email bodies) whose text is 
> not extracted currently. Sample file attached. [~talli...@apache.org], do you 
> have any idea what is going on?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
