[ https://issues.apache.org/jira/browse/TIKA-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksandr Dubinsky updated TIKA-1309:
-------------------------------------
Description: RTF files (such as those produced by WordPad) typically encode
consecutive linebreaks as simply consecutive \par commands. However,
org.apache.tika.parser.rtf.TextExtractor ignores the second \par. Solution is
very simple. See attached patch. (was: Some RTF files encode consecutive
linebreaks as simply consecutive \par commands. However,
org.apache.tika.parser.rtf.TextExtractor ignores the second \par.

Solution is to replace at line 1158:

    } else if (equals("par")) {
        if (!ignored) {
            endParagraph(true);
        }
    }

with:

    } else if (equals("par")) {
        if (!ignored) {
            lazyStartParagraph();
            endParagraph(true);
        }
    })

> RTF TextExtractor can ignore consecutive linebreaks
> ---------------------------------------------------
>
> Key: TIKA-1309
> URL: https://issues.apache.org/jira/browse/TIKA-1309
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.5, 1.6
> Reporter: Aleksandr Dubinsky
> Attachments: 0001-fix-RTF-ignores-consecutive-newlines.patch, test.rtf
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> RTF files (such as those produced by WordPad) typically encode consecutive
> linebreaks as simply consecutive \par commands. However,
> org.apache.tika.parser.rtf.TextExtractor ignores the second \par. Solution is
> very simple. See attached patch.
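
To illustrate why a second \par can disappear, here is a minimal sketch of the mechanism. It is a toy model, not Tika's actual TextExtractor: the class ConsecutiveParSketch, its Extractor helper, and the openBeforeClose flag are hypothetical names invented for this example, and it assumes that closing a paragraph is a no-op when no paragraph is open. Under that assumption, opening a paragraph before closing it (as the proposed lazyStartParagraph() call does) turns each extra \par into an empty paragraph instead of dropping it.

```java
import java.util.ArrayList;
import java.util.List;

public class ConsecutiveParSketch {

    // Hypothetical stand-in for the extractor's paragraph state; not Tika's real class.
    static class Extractor {
        private final boolean openBeforeClose;               // true = behave like the proposed fix
        private final List<String> paragraphs = new ArrayList<>();
        private StringBuilder current = null;                // null means no paragraph is open

        Extractor(boolean openBeforeClose) {
            this.openBeforeClose = openBeforeClose;
        }

        // Opens a paragraph only if none is open yet.
        void lazyStartParagraph() {
            if (current == null) {
                current = new StringBuilder();
            }
        }

        // Text always lands inside a paragraph, opening one on demand.
        void text(String s) {
            lazyStartParagraph();
            current.append(s);
        }

        // Assumption of this sketch: closing does nothing when no paragraph is open.
        void endParagraph() {
            if (current != null) {
                paragraphs.add(current.toString());
                current = null;
            }
        }

        // Models handling of the \par control word.
        void par() {
            if (openBeforeClose) {
                lazyStartParagraph();                        // ensures even an empty paragraph is emitted
            }
            endParagraph();
        }
    }

    // Feeds the equivalent of: first \par \par second \par
    static List<String> extract(boolean openBeforeClose) {
        Extractor e = new Extractor(openBeforeClose);
        e.text("first");
        e.par();
        e.par();
        e.text("second");
        e.par();
        return e.paragraphs;
    }

    public static void main(String[] args) {
        System.out.println(extract(false)); // prints [first, second]
        System.out.println(extract(true));  // prints [first, , second]
    }
}
```

With openBeforeClose set to false the second \par finds no open paragraph and leaves no trace; with it set to true the blank line survives as an empty paragraph between "first" and "second".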