[jira] [Commented] (PDFBOX-2252) PDFTextStripper has problem with documents with mixed language directions

Tilman Hausherr (JIRA) Wed, 23 Sep 2015 09:44:30 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904787#comment-14904787
 ]


Tilman Hausherr commented on PDFBOX-2252:
-----------------------------------------

The problem is that some old code doesn't follow the conventions. It is best 
not to change the formatting of code that you don't otherwise touch. If you 
did so anyway, try using your IDE to revert the reformatting in the parts that 
you know you didn't change.

The "only" problem left is that nobody (including me) has reviewed your code. 
My own reason is that I know almost nothing about the text extraction code. The 
changes I made to the class a few months ago didn't touch the core :-(

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, atest.pdf, overlap.jpg, 
> test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and uses that count to decide 
> whether the whole content should be reversed. That is not correct: the 
> decision must be made for each word, not for the whole document.
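
To make the difference between the two granularities in the report concrete, here is a minimal sketch that uses only the JDK. It is not the PDFBox code and not the attached patch: the class DirectionSketch and its methods are hypothetical names, only the comparison rtlCount > ltrCount is taken from the description above, and the input is assumed to be a single line in visual order with words separated by spaces. It only illustrates where the decision is made; it does not reorder words within a line.

```java
public class DirectionSketch {

    // True if the character has strong right-to-left directionality
    // (for example Hebrew or Arabic letters).
    public static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // True if the character has strong left-to-right directionality.
    public static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    // Coarse decision as described in the report: a single flag
    // for all of the text that is passed in.
    public static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : text.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Per-word decision: reverse only the words that are RTL-dominant
    // themselves and leave all other words untouched.
    public static String reverseRtlWords(String visualOrderLine) {
        String[] words = visualOrderLine.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            if (isRtlDominant(word)) {
                out.append(new StringBuilder(word).reverse().toString());
            } else {
                out.append(word);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "PDF" followed by a four-letter Hebrew word stored in visual order.
        String mixed = "PDF \u05DD\u05D5\u05DC\u05E9";
        // The coarse flag is true here (4 RTL letters against 3 LTR letters),
        // so reversing the whole line would also reverse "PDF".
        System.out.println(isRtlDominant(mixed));
        System.out.println(new StringBuilder(mixed).reverse());
        // The per-word variant reverses only the Hebrew word.
        System.out.println(reverseRtlWords(mixed));
    }
}
```

With the sample line in main, the single flag is true although the line contains a Latin word, which is the kind of input where one language ends up reversed; the per-word variant keeps "PDF" as it is.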



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

