[jira] [Comment Edited] (PDFBOX-2252) PDFTextStripper has problem with bilingual documents

Andreas Meier (JIRA) Thu, 16 Jul 2015 06:58:03 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629760#comment-14629760
 ]


Andreas Meier edited comment on PDFBOX-2252 at 7/16/15 1:56 PM:
----------------------------------------------------------------

I am currently reworking the PDFTextStripper and have made some changes for the 
whole RTL/LTR problem.

Unfortunately I had some technical problems with my IDE, so my reworked code 
ended up in a different formatting style, and the changes requested in two 
PDFBox tickets ran into each other. For this reason I copied the RTL/LTR 
changes from my workspace into a fresh checkout of PDFTextStripper (from the 
pdfbox svn).

Can you check if the patch I posted resolves this problem?

In my default workspace the extraction of the text works.

Regards

Andreas


was (Author: andreasmeier):
I am currently reworking the PDFTextStripper and have made some changes for the 
whole RTL/LTR problem.

Unfortunately I had some technical problems and had to rework my code. 
Furthermore I had to apply the pdfbox formatting style later on. For this 
reason I copied the RTL/LTR changes from my workspace into a fresh checkout 
of PDFTextStripper (from the pdfbox svn).

Can you check if the patch I posted resolves this problem?

In my default workspace the extraction of the text works.

Regards

Andreas

> PDFTextStripper has problem with bilingual documents
> ----------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> The class counts the number of RTL characters and uses that count to decide 
> whether the whole content should be reversed. This is incorrect: the decision 
> must be made for each word, not for the whole document.
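
The per-word decision asked for in the description can be sketched roughly as 
follows. This is only an illustration, not the attached patch and not the 
actual PDFTextStripper code: the class PerWordDirection and its methods are 
hypothetical names, words are assumed to be separated by single spaces, and 
only the JDK's Character.getDirectionality is used. The rtl > ltr comparison 
mirrors the isRtlDominant check, but it is applied to each word instead of the 
whole page.

```java
// Hypothetical sketch: decide the reversal per word instead of per page.
final class PerWordDirection {

    /** True if the character has strong right-to-left directionality. */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True if the character has strong left-to-right directionality. */
    static boolean isLtr(char c) {
        return Character.getDirectionality(c)
                == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** Reverses a single word only if RTL characters dominate in it. */
    static String fixWord(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : word.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        // Same comparison as isRtlDominant, but scoped to one word.
        boolean isRtlDominant = rtlCount > ltrCount;
        return isRtlDominant
                ? new StringBuilder(word).reverse().toString()
                : word;
    }

    /** Applies fixWord to every space-separated word of a line. */
    static String fixLine(String line) {
        String[] words = line.split(" ", -1); // -1 keeps empty trailing parts
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(fixWord(words[i]));
        }
        return out.toString();
    }
}
```

Calling PerWordDirection.fixLine on a line that mixes an English word with a 
Hebrew word stored in visual order leaves the English word untouched and 
reverses only the Hebrew one. The sketch does not reorder runs of several 
consecutive RTL words, which a complete solution would also have to handle.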



