[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632570#comment-14632570
 ] 

John Hewson edited comment on PDFBOX-2252 at 7/18/15 7:06 PM:
--------------------------------------------------------------

Ha, yes, though I wasn't suggesting we start selling something to the general 
public. Look at my sentence from 05/Aug/14: I managed it just fine. It just 
takes a little care, that's all.

To test out the LTR and RTL embedding levels, we really need a carefully 
crafted test PDF which demonstrates all of these cases. It has to be simple, 
or else we won't be able to debug it. There's no need for large amounts of 
text: a few words are sufficient.



> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: PDFTextStripper.java.patch, test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> This class counts the number of RTL characters and decides whether the whole 
> content should be reversed. That is not correct: it must operate on each 
> word, not the whole document.
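
To make the per-word idea from the description concrete, here is a minimal 
sketch. It is not the PDFTextStripper code and not the attached patch: the 
class name PerWordDirection and the methods isRtlWord and fixWords are 
hypothetical, and it assumes that words are separated by single spaces and 
that each RTL word arrives with its characters in visual (reversed) order. It 
applies the same rtlCount > ltrCount comparison to each word instead of to the 
whole page. It does not implement the Unicode Bidirectional Algorithm or 
embedding levels, and it leaves word order untouched.

```java
public class PerWordDirection {

    /** True if the word has more strong RTL characters than strong LTR ones. */
    static boolean isRtlWord(String word) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            byte dir = Character.getDirectionality(cp);
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                rtlCount++;
            } else if (dir == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
            // Digits, punctuation and other weak/neutral characters are ignored.
            i += Character.charCount(cp);
        }
        return rtlCount > ltrCount;
    }

    /**
     * Reverses only the words judged to be RTL; LTR words are copied as-is.
     * Assumption: single-space separators and visual-order input.
     */
    static String fixWords(String line) {
        String[] words = line.split(" ", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            String w = words[i];
            out.append(isRtlWord(w) ? new StringBuilder(w).reverse().toString() : w);
        }
        return out.toString();
    }
}
```

With this sketch, a line holding one Latin word and one Hebrew word keeps the 
Latin word as it is and reverses only the Hebrew one, whereas a page-wide 
isRtlDominant flag would treat both words the same way.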


