[ https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944575#comment-14944575 ]
Maruan Sahyoun commented on PDFBOX-2252:
----------------------------------------

[~tilman] [~talli...@apache.org] Thanks for the samples.

I had a quick look at govdocs1/302/302975.pdf. The 1.8.10 extraction has this
{code}
صيخلت Abbreviation ... tnuomA مبلغ, كمية، قيمة
{code}
where the current 2.0.0-SNAPSHOT has this
{code}
تلخيص Abbreviation ... مبلغ, كمية، قيمة Amount
{code}

Although there is still some work to be done for Bidi, it seems that the current snapshot at least has the text right, whereas 1.8.10 gets either the LTR text correct and the RTL text wrong or the other way around, which makes the 1.8.10 text extraction of limited use for Bidi text.

I'll do some more testing later today. What I would probably like to do is keep the functionality in its current state for 2.0.0, create new tickets for possible enhancements, and make no further changes to the extraction before releasing 2.0.0 unless there are regressions. WDYT?

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt,
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch,
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf,
> bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, test.pdf,
> wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left
> and left-to-right languages, the output characters of one language are
> reversed.
> A sample bilingual pdf document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, which
> is defined as follows:
> boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the whole
> content should be reversed or not. That is not correct: it must operate on each
> word, not the whole document.
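
To illustrate the per-word decision the issue description asks for, here is a minimal sketch. It is not the PDFTextStripper code and not any of the attached patches: the class PerWordDirection and its methods are hypothetical names, and it assumes the input line is in visual (left-to-right glyph) order with words separated by single spaces. It applies the same "rtlCount > ltrCount" test, but to each word instead of the whole page, and reverses only the words that are RTL dominant.

```java
import java.util.StringJoiner;

// Hypothetical helper, for illustration only; not part of PDFBox.
class PerWordDirection {

    /** Returns true if the word has more strong RTL than strong LTR characters. */
    static boolean isRtlDominant(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            // Neutral and weak characters (digits, punctuation) are not counted.
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /**
     * Assumes visualLine is in visual order. Reverses the characters of each
     * RTL-dominant word and leaves every other word untouched.
     * The order of the words themselves is not changed.
     */
    static String fixLine(String visualLine) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : visualLine.split(" ")) {
            out.add(isRtlDominant(word)
                    ? new StringBuilder(word).reverse().toString()
                    : word);
        }
        return out.toString();
    }
}
```

With this sketch, a Latin word such as "Abbreviation" passes through unchanged while an Arabic word stored in visual order next to it is reversed on its own. It deliberately ignores word reordering within a line, mirrored characters and combining marks, all of which a full Bidi implementation has to handle.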