[
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-2252:
--------------------------------
Attachment: content_diffs.xlsx
This compares r1702171 (A) against trunk as of my last pull (~6 October)
(B). I ran it against all docs containing RTL text plus additional docs,
totaling ~100k documents.
No diffs in exceptions, metadata or attachments. There are some content
diffs, including some apparent corrections in LTR words. Let me know what you
find and if you have any questions.
> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
> Key: PDFBOX-2252
> URL: https://issues.apache.org/jira/browse/PDFBOX-2252
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.8.6, 2.0.0
> Reporter: Amir
> Assignee: Maruan Sahyoun
> Priority: Critical
> Fix For: 2.1.0
>
> Attachments: BidiMirroring.txt, IsMirroredDeviations.txt,
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch,
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf,
> bugzilla867751.pdf, content_diffs.xlsx, overlap.jpg,
> pdfs_directionality.xlsx, pdfs_directionality3.xlsx, test.pdf,
> wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper is a combination of right-to-left
> and left-to-right languages, the output characters of one language are
> reversed.
> A sample bilingual pdf document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function,
> which is defined as follows: boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether the
> whole content should be reversed or not. That is incorrect: the decision must
> be made for each word, not for the whole document.
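
To illustrate the description above, here is a minimal sketch contrasting a
whole-text majority vote with a per-word decision. It is not the
PDFTextStripper code: the class DirectionSketch and all of its methods are
hypothetical names made up for this example, and only the expression
"rtlCount > ltrCount" comes from the report. It assumes the input is in visual
order, splits words on single spaces, and ignores word reordering, mirrored
characters and the rest of the Unicode bidirectional algorithm.

```java
final class DirectionSketch {

    /** True if the character has strong right-to-left directionality. */
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    /** True if the character has strong left-to-right directionality. */
    static boolean isLtr(char c) {
        return Character.getDirectionality(c) == Character.DIRECTIONALITY_LEFT_TO_RIGHT;
    }

    /** Majority vote over all strong characters in the given text. */
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : text.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (isLtr(c)) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    /** Whole-text decision: either every character is reversed or none is. */
    static String reverseWhole(String visual) {
        return isRtlDominant(visual)
                ? new StringBuilder(visual).reverse().toString()
                : visual;
    }

    /** Per-word decision: only words that are themselves RTL-dominant are reversed. */
    static String reversePerWord(String visual) {
        StringBuilder out = new StringBuilder();
        String[] words = visual.split(" ");
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String word = words[i];
            out.append(isRtlDominant(word)
                    ? new StringBuilder(word).reverse().toString()
                    : word);
        }
        return out.toString();
    }
}
```

For the input "abc \u05DD\u05D5\u05DC\u05E9" (a Latin word followed by a
four-letter Hebrew word stored in visual order), the whole-text vote sees four
RTL characters against three LTR ones and reverses everything, so "abc" comes
out as "cba". The per-word variant leaves "abc" untouched and reverses only the
Hebrew word.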