[ https://issues.apache.org/jira/browse/TIKA-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16850936#comment-16850936 ]
Tim Allison commented on TIKA-2883:
-----------------------------------

I have a local fix that works for all four issues. I'll push that once I get a clean local build.

There's still one remaining item for future improvement: we're pretty much guessing when we're at the end of the header by whether we see {{par}} or other text-y kinds of things.

According to one RTF spec, this is what a header can look like, with ? marking optional items:

{noformat}
<header> \rtf <charset> <deffont> \deff? <fonttbl> <filetbl>? <colortbl>? <stylesheet>? <listtables>? <revtbl>? <rsidtable>? <generator>?
{noformat}

The obnoxious part is that there can be stuff in between those items, and I'm hesitant to trust that RTFs follow the spec and actually keep to that order, etc...

> Text not extracted from RTF files
> ---------------------------------
>
> Key: TIKA-2883
> URL: https://issues.apache.org/jira/browse/TIKA-2883
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.20, 1.19.1, 1.21
> Reporter: Luis Filipe Nassif
> Assignee: Tim Allison
> Priority: Major
> Attachments: Message (5).rtf
>
>
> I have a number of RTF files (extracted from PST email bodies) whose text is
> not extracted currently. Sample file attached. [~talli...@apache.org], do you
> have any idea what is going on?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
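
To make the header discussion in the comment above concrete, here is a minimal sketch of an order-tolerant header scan. It is not Tika's code (Tika is Java, and the fix mentioned above is not shown here); it is written in Python only for brevity. The names {{scan_header}}, {{HEADER_DESTINATIONS}} and {{BODY_MARKERS}} are hypothetical, and the choice of body markers is an assumption. The idea: record the header tables in whatever order they turn up, skip unknown groups in between, and guess that the body starts at the first {{\par}}-like control word or text at the top level.

```python
import re

# Header destinations from the grammar quoted above, in their RTF
# control-word spellings (assumed mapping, e.g. <rsidtable> -> \rsidtbl).
HEADER_DESTINATIONS = {
    "fonttbl", "filetbl", "colortbl", "stylesheet",
    "listtable", "listoverridetable", "revtbl", "rsidtbl", "generator",
}

# Assumption: these top-level control words signal the start of body content.
BODY_MARKERS = {"par", "pard", "plain", "sectd"}

# One token: a brace, a control word with optional numeric parameter,
# a control symbol such as \* or \', or a run of plain text.
TOKEN = re.compile(r"[{}]|\\([A-Za-z]+)(-?\d+)? ?|\\[^A-Za-z]|[^{}\\]+", re.DOTALL)


def scan_header(rtf):
    """Return (header_groups, body_offset).

    header_groups lists the known header destinations in the order found.
    body_offset is the index of the first top-level token that looks like
    body content, or None if no such token is found.
    """
    depth = 0
    seen = []
    awaiting_name = False  # True right after '{' opens a group at depth 2
    for m in TOKEN.finditer(rtf):
        tok, word = m.group(0), m.group(1)
        if tok == "{":
            depth += 1
            awaiting_name = depth == 2
        elif tok == "}":
            depth -= 1
            awaiting_name = False
        elif awaiting_name:
            if tok == "\\*" or tok.isspace():
                continue  # skip the ignorable-destination marker and whitespace
            awaiting_name = False
            if word in HEADER_DESTINATIONS:
                seen.append(word)
            # Unknown groups are skipped: "stuff in between" is tolerated.
        elif depth == 1:
            # Top level: plain text or a control symbol counts as "text-y".
            is_text = word is None and tok != "\\*" and not tok.isspace()
            if word in BODY_MARKERS or is_text:
                return seen, m.start()
    return seen, None


sample = (
    r"{\rtf1\ansi\ansicpg1252\deff0"
    r"{\fonttbl{\f0\fswiss Arial;}}"
    r"{\*\generator Msftedit 5.41;}"
    r"{\colortbl;\red0\green0\blue0;}"
    r"\pard\f0 Hello\par}"
)
groups, offset = scan_header(sample)
print(groups, sample[offset:])
```

In the sample, {{generator}} appears before {{colortbl}}, which is out of spec order, and the scan still records both. Known limitations of this sketch: it does not handle {{\bin}} data, and body text that sits inside a nested group before the first top-level marker is skipped rather than detected.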