[ 
https://issues.apache.org/jira/browse/TIKA-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksandr Dubinsky updated TIKA-1309:
-------------------------------------

    Description: RTF files (such as those produced by WordPad) typically encode 
consecutive linebreaks as simply consecutive \par commands. However, 
org.apache.tika.parser.rtf.TextExtractor ignores the second \par. Solution is 
very simple. See attached patch.  (was: Some RTF files encode consecutive 
linebreaks as simply consecutive \par commands. However, 
org.apache.tika.parser.rtf.TextExtractor ignores the second \par.

Solution is to replace at line 1158:

        } else if (equals("par")) {
            if (!ignored) {
                endParagraph(true);
            }
        }

with:


        } else if (equals("par")) {
            if (!ignored) {
                lazyStartParagraph();
                endParagraph(true);
            }
        }
)

> RTF TextExtractor can ignore consecutive linebreaks
> ---------------------------------------------------
>
>                 Key: TIKA-1309
>                 URL: https://issues.apache.org/jira/browse/TIKA-1309
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5, 1.6
>            Reporter: Aleksandr Dubinsky
>         Attachments: 0001-fix-RTF-ignores-consecutive-newlines.patch, test.rtf
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> RTF files (such as those produced by WordPad) typically encode consecutive 
> linebreaks as simply consecutive \par commands. However, 
> org.apache.tika.parser.rtf.TextExtractor ignores the second \par. Solution is 
> very simple. See attached patch.
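
The behaviour described above can be reproduced with a small stand-alone model. This is only a sketch, not Tika's code. It assumes that lazyStartParagraph() opens a paragraph only when none is open, and that endParagraph() closes one only when one is open; the report implies this but does not show it. The class ParSketch, its events list, and the startBeforeEnd flag are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of lazy paragraph handling (not Tika's TextExtractor).
class ParSketch {
    private final boolean startBeforeEnd;   // true = behaviour proposed in the report
    private final List<String> events = new ArrayList<>();
    private boolean inParagraph = false;

    ParSketch(boolean startBeforeEnd) {
        this.startBeforeEnd = startBeforeEnd;
    }

    // Assumed behaviour: open a paragraph only if none is open yet.
    private void lazyStartParagraph() {
        if (!inParagraph) {
            events.add("<p>");
            inParagraph = true;
        }
    }

    // Assumed behaviour: close a paragraph only if one is open.
    private void endParagraph() {
        if (inParagraph) {
            events.add("</p>");
            inParagraph = false;
        }
    }

    // Text always lands inside a paragraph.
    void text(String s) {
        lazyStartParagraph();
        events.add(s);
    }

    // Handles one \par control word.
    void par() {
        if (startBeforeEnd) {
            lazyStartParagraph();   // guarantees there is something to close
        }
        endParagraph();
    }

    // Simulates the input: one \par \par two \par
    static List<String> run(boolean startBeforeEnd) {
        ParSketch m = new ParSketch(startBeforeEnd);
        m.text("one");
        m.par();
        m.par();
        m.text("two");
        m.par();
        return m.events;
    }

    public static void main(String[] args) {
        // Second \par is lost: [<p>, one, </p>, <p>, two, </p>]
        System.out.println(run(false));
        // Second \par yields an empty paragraph: [<p>, one, </p>, <p>, </p>, <p>, two, </p>]
        System.out.println(run(true));
    }
}
```

In this model the second \par finds no open paragraph, so endParagraph() does nothing and the blank line disappears. Calling lazyStartParagraph() first opens an empty paragraph that is then closed, which preserves the blank line.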



--
This message was sent by Atlassian JIRA
(v6.2#6252)
