Tilman Hausherr created PDFBOX-4531:
---------------------------------------

             Summary: Extraction of Arabic PDF has incorrect ordering of normalized ligatures
                 Key: PDFBOX-4531
                 URL: https://issues.apache.org/jira/browse/PDFBOX-4531
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.15
            Reporter: Tilman Hausherr


As reported by Elias Peterson on the mailing list:
{quote}
I think I'm seeing some issues concerning the handling of the Arabic 
lam-with-alef ligature.  I'm attempting to process the PDF here:
https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf

When I run the ExtractText command with 2.0.15 I get the following:
$ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
$ head output.txt
C O R P O R A T I O N
منظور تحليلي
رؤى خبير بشأن قضايا السياسات اآلنية
االتفاق مع إيران
األيام التي تلي
...

The issue is in the last two lines of the snippet above: my understanding is
that the ligature لا was normalized, but the two letters that compose it are in
the wrong order. PDFBOX-684 sounded similar to me, and when I run the same PDF
through 1.8.16, the ligature is normalized in the way I think is expected
(although the interspersed English-language words are backwards here).

$ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
...
$ head output.txt
N O I T A R O P R O C
منظور تحليلي
رؤى خبير بشأن قضايا السياسات الآنية
الاتفاق مع إيران
الأيام التي تلي
...
{quote}
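
For reference, a minimal sketch of the logical order the report expects. It uses the JDK's java.text.Normalizer, not PDFBox's own code, and the class and method names (LamAlefOrder, decompose, hex, lamFirst) are hypothetical, chosen only for this illustration. Under Unicode compatibility normalization (NFKC), the lam-with-alef presentation forms decompose to lam (U+0644) followed by the alef variant, which matches the 1.8.16 output rather than the 2.0.15 output.

```java
import java.text.Normalizer;

public class LamAlefOrder {

    // Decompose presentation-form ligatures with NFKC.
    // Assumption: this only illustrates the expected letter order; it is not
    // a description of how PDFBox normalizes text internally.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Render a string as space-separated code points, e.g. "U+0644 U+0627".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // True if the decomposed text starts with lam (U+0644).
    static boolean lamFirst(String ligature) {
        String d = decompose(ligature);
        return !d.isEmpty() && d.charAt(0) == '\u0644';
    }

    public static void main(String[] args) {
        // U+FEFB: ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
        System.out.println(hex(decompose("\uFEFB"))); // U+0644 U+0627
        // U+FEF7: ARABIC LIGATURE LAM WITH ALEF WITH HAMZA ABOVE ISOLATED FORM
        System.out.println(hex(decompose("\uFEF7"))); // U+0644 U+0623
    }
}
```

In the 2.0.15 output above, the same two letters appear as alef followed by lam, i.e. the reverse of what this decomposition yields.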



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

