Aleksandr Dubinsky created TIKA-1309:
----------------------------------------
Summary: RTF TextExtractor can ignore consecutive linebreaks
Key: TIKA-1309
URL: https://issues.apache.org/jira/browse/TIKA-1309
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.5, 1.6
Reporter: Aleksandr Dubinsky
Some RTF files encode consecutive line breaks as consecutive \par control
words. However, org.apache.tika.parser.rtf.TextExtractor ignores the second
\par, so the blank line it represents is dropped.
The solution is to replace the following code at line 1158:
} else if (equals("par")) {
    if (!ignored) {
        endParagraph(true);
    }
}
with:
} else if (equals("par")) {
    if (!ignored) {
        lazyStartParagraph();
        endParagraph(true);
    }
}
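
To illustrate why the extra call matters, here is a minimal sketch of a
lazily opened paragraph. It is not Tika's actual code: the class ParSketch,
the fixApplied flag, the events list, and the bodies of lazyStartParagraph()
and endParagraph() are assumptions made for illustration. It assumes that
endParagraph() does nothing when no paragraph is open, which would explain
why a second \par with no text in between produces no output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a paragraph that is opened lazily.
// None of this is copied from org.apache.tika.parser.rtf.TextExtractor.
final class ParSketch {
    private final List<String> events = new ArrayList<>();
    private final boolean fixApplied;
    private boolean inParagraph = false;

    ParSketch(boolean fixApplied) {
        this.fixApplied = fixApplied;
    }

    // Open a paragraph only if one is not already open.
    void lazyStartParagraph() {
        if (!inParagraph) {
            events.add("<p>");
            inParagraph = true;
        }
    }

    // Close the paragraph; assumed to be a no-op when none is open.
    void endParagraph() {
        if (inParagraph) {
            events.add("</p>");
            inParagraph = false;
        }
    }

    // Text always lands inside a paragraph.
    void text(String s) {
        lazyStartParagraph();
        events.add(s);
    }

    // Handle a \par control word, with or without the proposed change.
    void par() {
        if (fixApplied) {
            lazyStartParagraph();
        }
        endParagraph();
    }

    // Simulate the RTF fragment: a \par \par b \par
    static List<String> run(boolean fixApplied) {
        ParSketch s = new ParSketch(fixApplied);
        s.text("a");
        s.par();
        s.par();
        s.text("b");
        s.par();
        return s.events;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [<p>, a, </p>, <p>, b, </p>]
        System.out.println(run(true));  // [<p>, a, </p>, <p>, </p>, <p>, b, </p>]
    }
}
```

In this model, the unpatched handler emits nothing for the second \par,
while the patched handler emits an empty paragraph for it.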