[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944575#comment-14944575
 ] 

Maruan Sahyoun commented on PDFBOX-2252:
----------------------------------------

[~tilman] [~talli...@apache.org] Thanks for the samples. I had a quick look at 
govdocs1/302/302975.pdf.

The 1.8.10 extraction produces this:
{code}
صيخلت Abbreviation
...
tnuomA مبلغ, كمية، قيمة
{code}

whereas the current 2.0.0-SNAPSHOT produces this:
{code}
تلخيص Abbreviation
...
مبلغ, كمية، قيمة Amount
{code}

Although there is still some work to be done for Bidi, the current snapshot 
at least gets the text right. In 1.8.10, either the LTR text is correct and 
the RTL text is wrong, or the other way around, which makes 1.8.10 text 
extraction of limited use for Bidi text. I'll do some more testing later today.

What I would probably like to do is keep the functionality in its current 
state for 2.0.0, create new tickets for possible enhancements, and, as long as 
there are no regressions, make no further changes to the extraction before 
releasing 2.0.0. WDYT?

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper-201709271718.patch, PDFTextStripper-201709272018.patch, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> bugzilla867751.pdf, overlap.jpg, pdfs_directionality.xlsx, test.pdf, 
> wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class clearly counts the number of RTL characters and decides whether 
> the whole content should be reversed. That is not correct; it must operate 
> on each word, not the whole document.
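
The difference between a whole-content decision and a per-word decision can be 
illustrated with a small standalone sketch. This is not the PDFTextStripper 
code and not the fix applied in 2.0.0; the class PerWordDirection and its 
methods wholeContent and perWord are hypothetical names, and the input is 
assumed to be a line of characters in visual (left-to-right on the page) 
order. Only the expression rtlCount > ltrCount is taken from the description 
above.

```java
class PerWordDirection {

    // True when the text has more strong RTL characters than strong LTR ones,
    // mirroring the "rtlCount > ltrCount" test quoted in the issue.
    static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    static String reverseIfRtl(String text) {
        return isRtlDominant(text) ? new StringBuilder(text).reverse().toString() : text;
    }

    // Whole-content decision: one direction test for the entire line.
    static String wholeContent(String visualLine) {
        return reverseIfRtl(visualLine);
    }

    // Per-word decision (assumption: words are separated by single spaces).
    static String perWord(String visualLine) {
        String[] words = visualLine.split(" ", -1);
        for (int i = 0; i < words.length; i++) {
            words[i] = reverseIfRtl(words[i]);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Arabic word in visual order followed by "Abbreviation"; Unicode
        // escapes are used so the source itself has no Bidi ambiguity.
        String visual = "\u0635\u064A\u062E\u0644\u062A Abbreviation";
        System.out.println(wholeContent(visual)); // line is LTR dominant, Arabic word stays reversed
        System.out.println(perWord(visual));      // Arabic word is put into logical order
    }
}
```

With the mixed line from the sample, the whole-content test leaves the Arabic 
word reversed because the Latin characters outnumber the Arabic ones, while 
the per-word test restores it. The sketch does not reorder the words inside a 
longer RTL run, which a full Bidi implementation would also need to handle.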


