[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905915#comment-14905915
 ] 

Andreas Meier commented on PDFBOX-2252:
---------------------------------------

Even if the patch doesn't look like it, I had a hard time reverting and merging 
the changes.
It would be great if you could format the old code so that it follows the 
conventions.

Here is a short introduction to the patch I posted:

The reason for the patch is that the old code did not respect the writing 
direction of words when RTL and LTR words were mixed in one line.
The new code tries to address this problem.
Furthermore, PDF does not care about the direction of mirrorable neutrals and 
their counterparts.
For this reason, symbols like "{" or "[" will have the wrong direction if they 
are embedded in RTL text.
Therefore, BidiMirroring.txt (which has to be placed in 
"org.apache.pdfbox.text.bidi") is used to exchange mirrorable characters in RTL 
text so that each neutral gets the correct direction.
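
To illustrate the mirroring step, here is a minimal sketch and not the code 
from the patch. It assumes the data follows the Unicode BidiMirroring.txt 
layout ("0028; 0029 # LEFT PARENTHESIS"). The class and method names 
(MirrorSketch, parseMirroring, mirror) are hypothetical, and only BMP code 
points are handled.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox or of the attached patch.
final class MirrorSketch {

    // Builds a lookup table from text in the BidiMirroring.txt layout.
    // Everything after '#' is a comment; blank lines are skipped.
    static Map<Character, Character> parseMirroring(String data) {
        Map<Character, Character> mirrors = new HashMap<>();
        for (String line : data.split("\n")) {
            int hash = line.indexOf('#');
            String body = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (body.isEmpty()) {
                continue;
            }
            String[] parts = body.split(";");
            if (parts.length < 2) {
                continue;
            }
            int from = Integer.parseInt(parts[0].trim(), 16);
            int to = Integer.parseInt(parts[1].trim(), 16);
            // Assumption: this sketch ignores anything outside the BMP.
            if (from > 0xFFFF || to > 0xFFFF) {
                continue;
            }
            mirrors.put((char) from, (char) to);
        }
        return mirrors;
    }

    // Replaces every mirrorable character of an RTL run with its counterpart.
    // The caller is expected to pass only text already identified as RTL.
    static String mirror(String rtlRun, Map<Character, Character> mirrors) {
        StringBuilder sb = new StringBuilder(rtlRun.length());
        for (int i = 0; i < rtlRun.length(); i++) {
            char c = rtlRun.charAt(i);
            char m = mirrors.getOrDefault(c, c);
            sb.append(m);
        }
        return sb.toString();
    }
}
```

In such a sketch the table would be loaded once, and mirror() would be applied 
only to runs that were classified as RTL, so LTR text keeps its brackets.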


> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, PDFTextStripper.java.patch, 
> PDFTextStripper.java.patch, atest.pdf, overlap.jpg, test.pdf, 
> wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left 
> and left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether 
> the whole content should be reversed or not. This is not correct; it must 
> operate on each word, not the whole document.
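
The per-word decision asked for above can be sketched roughly as follows. This 
is an illustration only, not the PDFTextStripper code: WordDirectionSketch, 
isRtlDominant and fixLine are hypothetical names, the input is assumed to be 
in visual order with single spaces between words, and the sketch only reverses 
characters inside RTL words without reordering the words themselves.

```java
// Hypothetical helper, not part of PDFBox.
final class WordDirectionSketch {

    // Same rtlCount > ltrCount test as in the issue, applied to one word.
    static boolean isRtlDominant(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); i++) {
            byte d = Character.getDirectionality(word.charAt(i));
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Reverses only the words that are RTL dominant; LTR words are kept as-is.
    // Trailing spaces are dropped by split(" ") in this simplified version.
    static String fixLine(String visualLine) {
        String[] words = visualLine.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String w = words[i];
            out.append(isRtlDominant(w) ? new StringBuilder(w).reverse().toString() : w);
        }
        return out.toString();
    }
}
```

With a whole-line count, a line holding as many LTR as RTL characters would be 
left untouched or reversed entirely; deciding per word avoids both outcomes.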



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

