[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905915#comment-14905915 ]
Andreas Meier commented on PDFBOX-2252:
---------------------------------------

Even if the patch doesn't look like it, I had a hard time reverting and merging the changes. It would be great if you could format the old code so that it follows the conventions.

Here is a short introduction to the patch I posted:

The reason for the patch is that the old code did not respect the writing direction of words when RTL and LTR words were mixed in one line. The new code tries to address this problem.

Furthermore, PDF does not care about the direction of mirrorable neutrals and their counterparts. For this reason, symbols like "{" or "[" will have the wrong direction if they are embedded in RTL text. Therefore BidiMirroring.txt (which has to be placed in "org.apache.pdfbox.text.bidi") is used to exchange mirrorable characters in RTL text so that the neutrals get the correct direction.

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, PDFTextStripper.java.patch,
> PDFTextStripper.java.patch, atest.pdf, overlap.jpg, test.pdf,
> wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left
> and left-to-right languages, the output characters of one language are
> reversed.
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function,
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether
> the whole content should be reversed or not. This is not correct; it must
> operate on each word, not the whole document.
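
To illustrate the two ideas discussed above (deciding the direction per word rather than per page, and swapping mirrorable characters inside RTL words), here is a minimal sketch. It is not the attached patch: the class name WordDirectionSketch, its methods, and the hard-coded bracket pairs are assumptions made for this example, and the real BidiMirroring.txt covers many more characters. It only handles characters in the Basic Multilingual Plane and does not reorder the words within a line.

```java
// Hypothetical sketch, not the code from PDFTextStripper.java.patch.
final class WordDirectionSketch {

    // Assumed subset of mirrored pairs; BidiMirroring.txt lists many more.
    static char mirror(char c) {
        switch (c) {
            case '(': return ')';
            case ')': return '(';
            case '[': return ']';
            case ']': return '[';
            case '{': return '}';
            case '}': return '{';
            case '<': return '>';
            case '>': return '<';
            default:  return c;
        }
    }

    // Same rtlCount > ltrCount test as the old code, but applied to one word.
    static boolean isRtlDominant(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); i++) {
            byte d = Character.getDirectionality(word.charAt(i));
            if (d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (d == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Reverses an RTL-dominant word and mirrors its brackets; leaves other words untouched.
    static String fixWord(String word) {
        if (!isRtlDominant(word)) {
            return word;
        }
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = word.length() - 1; i >= 0; i--) {
            sb.append(mirror(word.charAt(i)));
        }
        return sb.toString();
    }

    // Applies fixWord to each space-separated word; word order is left as it is.
    static String fixLine(String line) {
        String[] words = line.split(" ", -1);
        for (int i = 0; i < words.length; i++) {
            words[i] = fixWord(words[i]);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // A Latin word followed by a Hebrew word stored in visual order.
        System.out.println(fixLine("abc (\u05D0\u05D1"));
    }
}
```

With this approach a Latin word next to a Hebrew word is left alone while only the Hebrew word is reversed, which is the behaviour the issue asks for. A fuller solution would probably rely on java.text.Bidi to split a line into directional runs instead of splitting on spaces.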