[ https://issues.apache.org/jira/browse/PDFBOX-4531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved PDFBOX-4531.
-------------------------------------
Fix Version/s: 2.0.28, 3.0.0 PDFBox
Assignee: Tilman Hausherr
Resolution: Fixed
> Extraction of Arabic PDF has incorrect ordering of normalized ligatures
> -----------------------------------------------------------------------
>
> Key: PDFBOX-4531
> URL: https://issues.apache.org/jira/browse/PDFBOX-4531
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.15
> Reporter: Tilman Hausherr
> Assignee: Tilman Hausherr
> Priority: Major
> Labels: Arabic, regression
> Fix For: 2.0.28, 3.0.0 PDFBox
>
> Attachments: FES-GGArabisch-p112.pdf, PDFBOX-4531-reduced.pdf,
> PDFBOX-679-toobig.pdf, RAND_PE122z1.arabic.pdf, artikel1_20_arab.pdf,
> bidi-ligature-1.pdf, bidi-ligature-2.pdf, bidi-ligature.patch, diff-output.zip
>
>
> As reported by Elias Peterson in the mailing list:
> {quote}
> I think I'm seeing some issues concerning the handling of the Arabic
> lam-with-alef ligature. I'm attempting to process the PDF here:
> https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
> When I run the ExtractText command with 2.0.15 I get the following:
> $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8
> RAND_PE122z1.arabic.pdf output.txt
> $ head output.txt
> C O R P O R A T I O N
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات اآلنية
> االتفاق مع إيران
> األيام التي تلي
> ...
> The issue is with the last two lines of the snippet above: my
> understanding is that the ligature لا was normalized, but the two
> letters that compose it are in the wrong order. PDFBOX-684 sounded
> similar, and when I run the same PDF through 1.8.16 the ligature is
> normalized the way I think is expected (although the interspersed
> English-language words are backwards here).
> $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8
> RAND_PE122z1.arabic.pdf output.txt
> ...
> $ head output.txt
> N O I T A R O P R O C
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات الآنية
> الاتفاق مع إيران
> الأيام التي تلي
> ...
> {quote}
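
For readers unfamiliar with the ligature in question, the expected ordering can be sketched with the JDK alone. This is a minimal illustration, not PDFBox's code and not the fix applied for this issue. It assumes the ligature reaches the text stripper as the presentation form U+FEFB (ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM) and that normalization is NFKC; the issue text confirms neither. The class name LamAlefNormalization and its helper methods are hypothetical.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class LamAlefNormalization {
    // Assumption: the PDF text contains the ligature as a single
    // presentation-form code point, U+FEFB.
    static final String LAM_ALEF_LIGATURE = "\uFEFB";

    // Decompose presentation forms into their base letters using NFKC.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Render a string as space-separated code points, so that the logical
    // (storage) order is visible regardless of how a terminal draws
    // right-to-left text.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The ligature decomposes to lam (U+0644) followed by alef (U+0627).
        System.out.println(codePoints(normalize(LAM_ALEF_LIGATURE)));

        // Hypothetical encoding of the word from the report: alef, the
        // lam-alef ligature, then the remaining letters.
        String withLigature = "\u0627\uFEFB\u062A\u0641\u0627\u0642";
        System.out.println(codePoints(normalize(withLigature)));
    }
}
```

Under these assumptions the correct logical order is lam before alef, which matches the 1.8.16 output quoted above; the 2.0.15 output has the same two letters swapped.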