[ 
https://issues.apache.org/jira/browse/PDFBOX-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906851#comment-14906851
 ] 

Maruan Sahyoun commented on PDFBOX-2252:
----------------------------------------

[~AndreasMeier] I uploaded a text file listing the characters for which 
Character.isMirrored() disagrees with MIRRORING_PATTERN_ARRAY. Could you have a 
look and check whether the deviations are valid? The list was quickly generated with

{code}
import java.util.regex.Pattern;

Pattern p = Pattern.compile(MIRRORING_PATTERN_ARRAY);

// Walk every code point, including Character.MAX_CODE_POINT itself.
// All values in this range are valid code points, so no extra check is needed.
for (int i = 0; i <= Character.MAX_CODE_POINT; i++)
{
    String s = new String(Character.toChars(i));
    boolean mirrored = Character.isMirrored(i);
    boolean matches = p.matcher(s).matches();
    // Report only the code points where the two sources disagree.
    if (mirrored != matches)
    {
        System.out.printf("Character %s isMirrored %s matches pattern %s%n",
                s, mirrored, matches);
    }
}
{code}

to get an idea of possible differences. 

> PDFTextStripper has problem with documents with mixed language directions
> -------------------------------------------------------------------------
>
>                 Key: PDFBOX-2252
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2252
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.6, 2.0.0
>            Reporter: Amir
>            Assignee: Maruan Sahyoun
>            Priority: Critical
>             Fix For: 2.1.0
>
>         Attachments: BidiMirroring.txt, IsMirroredDeviations.txt, 
> PDFTextStripper.java.patch, PDFTextStripper.java.patch, atest.pdf, 
> overlap.jpg, test.pdf, wikipedia_dl_lyric_test.pdf
>
>
> When the input document of PDFTextStripper combines right-to-left and 
> left-to-right languages, the output characters of one language are 
> reversed. 
> A sample bilingual PDF document is attached.
> PDFTextStripper has a variable "isRtlDominant" in the "writePage" function, 
> which is defined as follows:     boolean isRtlDominant = rtlCount > ltrCount;
> The class counts the RTL characters and uses that count to decide whether the 
> whole content should be reversed. This is incorrect: the decision must be 
> made for each word, not for the whole document.
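
The per-word decision asked for in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: PerWordDirection, isRtlWord and fixLine are hypothetical names and not part of PDFBox, words are assumed to be separated by single spaces, and a simple character count stands in for a proper bidi analysis. Combining marks and mirrored characters such as brackets are not handled.

```java
// Hypothetical helper, not part of PDFBox: decides the direction per word
// instead of once for the whole page.
final class PerWordDirection
{
    // True if the word has more strong RTL characters than strong LTR ones.
    static boolean isRtlWord(String word)
    {
        int rtl = 0;
        int ltr = 0;
        for (int i = 0; i < word.length(); )
        {
            int cp = word.codePointAt(i);
            switch (Character.getDirectionality(cp))
            {
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT:
                case Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC:
                    rtl++;
                    break;
                case Character.DIRECTIONALITY_LEFT_TO_RIGHT:
                    ltr++;
                    break;
                default:
                    break; // digits, punctuation etc. count as neutral here
            }
            i += Character.charCount(cp);
        }
        return rtl > ltr;
    }

    // Reverses only the words classified as RTL; other words stay untouched.
    static String fixLine(String line)
    {
        // Limit -1 keeps trailing empty strings so spacing is preserved.
        String[] words = line.split(" ", -1);
        for (int i = 0; i < words.length; i++)
        {
            if (isRtlWord(words[i]))
            {
                words[i] = new StringBuilder(words[i]).reverse().toString();
            }
        }
        return String.join(" ", words);
    }
}
```

With this sketch, PerWordDirection.fixLine("abc \u05D2\u05D1\u05D0") leaves "abc" alone and reverses only the Hebrew word. A real fix would more likely rely on the Unicode Bidirectional Algorithm (for example through java.text.Bidi) than on a plain count.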



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

