Aleksandr Dubinsky created TIKA-1309:
----------------------------------------

             Summary: RTF TextExtractor can ignore consecutive linebreaks
                 Key: TIKA-1309
                 URL: https://issues.apache.org/jira/browse/TIKA-1309
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.5, 1.6
            Reporter: Aleksandr Dubinsky


Some RTF files encode consecutive line breaks simply as consecutive \par
control words. However, org.apache.tika.parser.rtf.TextExtractor ignores the
second \par, so the extra line break is lost in the extracted text.

The solution is to replace the following code at line 1158:

        } else if (equals("par")) {
            if (!ignored) {
                endParagraph(true);
            }
        }

with:

        } else if (equals("par")) {
            if (!ignored) {
                lazyStartParagraph();
                endParagraph(true);
            }
        }
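
To illustrate why the extra call might help, here is a minimal, self-contained
sketch. It is a toy model, not Tika's actual code: the class ParSketch, its
event list and the render() helper are hypothetical. It assumes that
lazyStartParagraph() opens a paragraph only when none is open, and that
endParagraph() closes a paragraph only when one is open. Under those
assumptions a second \par directly after the first finds no open paragraph and
emits nothing, unless a paragraph is lazily opened first.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of paragraph handling; NOT the real Tika TextExtractor.
public class ParSketch {
    private final List<String> events = new ArrayList<>();
    private final boolean applyFix;
    private boolean inParagraph = false;

    ParSketch(boolean applyFix) {
        this.applyFix = applyFix;
    }

    // Assumed behaviour: open a paragraph only if none is open.
    void lazyStartParagraph() {
        if (!inParagraph) {
            events.add("<p>");
            inParagraph = true;
        }
    }

    // Assumed behaviour: close a paragraph only if one is open.
    void endParagraph() {
        if (inParagraph) {
            events.add("</p>");
            inParagraph = false;
        }
    }

    // Text always lands inside a paragraph.
    void text(String s) {
        lazyStartParagraph();
        events.add(s);
    }

    // Handling of the \par control word, with or without the proposed change.
    void par() {
        if (applyFix) {
            lazyStartParagraph();
        }
        endParagraph();
    }

    // Simulates the input: a \par \par b \par
    public static List<String> render(boolean applyFix) {
        ParSketch s = new ParSketch(applyFix);
        s.text("a");
        s.par();
        s.par();
        s.text("b");
        s.par();
        return s.events;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + render(false));
        System.out.println("with fix:    " + render(true));
    }
}
```

In this model the second \par produces an empty paragraph only when the fix is
applied; without it, the input yields just two paragraphs.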



--
This message was sent by Atlassian JIRA
(v6.2#6252)
