[ 
https://issues.apache.org/jira/browse/PDFBOX-4531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691270#comment-17691270
 ] 

Mohamed M NourElDin commented on PDFBOX-4531:
---------------------------------------------

Hi Tilman and Masaki,
As expected, the extracted text is much better now; however, there are still 
some problems that we can fix later, one by one.

PR#154 is concerned with 2 types of issues only:
 # When a single code point represents multiple letters, like 'ﬁ' 
(U+FB01) in English or 'ﻻ' (U+FEFB) in Arabic, the decomposed letters should be 
expanded in visual order (instead of logical order). In other words, before the 
fix, words like "final" or "office" were extracted as "ifnal" and "oiffce".
 ** This issue was highlighted in red in meld[123].png
 # The second type is when Arabic diacritics are stored with the letter in the 
same TextPosition. By analogy, the letter Á (U+00C1, A with acute) should be 
expanded to the letter A (U+0041) followed by ◌́ (U+0301). Without my fix, it is 
expanded to the acute accent followed by the letter A.
 ** This issue was highlighted in blue in meld[123].png
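
To make the two symptoms concrete, here is a minimal sketch using only 
java.text.Normalizer. It is not the PDFBox code or the PR#154 fix; the class 
and method names are hypothetical. It only shows that reversing the expansion 
of a ligature reproduces the garbled words above, and what the expected 
base-then-diacritic order looks like for Á.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of PDFBox.
class LigatureOrderDemo {

    // Expands one code point with compatibility normalization (NFKC),
    // e.g. U+FB01 -> "fi", U+FEFB -> U+0644 U+0627 (lam, alef).
    static String expand(int codePoint) {
        return Normalizer.normalize(new String(Character.toChars(codePoint)),
                Normalizer.Form.NFKC);
    }

    // Keeps each expansion in the order the normalizer returns it.
    static String expandInOrder(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(expand(cp)));
        return out.toString();
    }

    // Simulates the symptom: every multi-letter expansion is appended reversed.
    static String expandReversed(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            String e = expand(cp);
            if (e.length() > 1) {
                out.append(new StringBuilder(e).reverse());
            } else {
                out.append(e);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expandInOrder("\uFB01nal"));   // final
        System.out.println(expandReversed("\uFB01nal"));  // ifnal
        System.out.println(expandReversed("o\uFB03ce"));  // oiffce
        // Canonical decomposition of U+00C1: base letter first, then U+0301.
        String nfd = Normalizer.normalize("\u00C1", Normalizer.Form.NFD);
        System.out.println(nfd.equals("A\u0301"));        // true
    }
}
```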

Regarding Hebrew, I think it suffers from this issue as well because it is 
an RTL language, but I don't know Hebrew. I can replace 
DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC with DIRECTIONALITY_RIGHT_TO_LEFT, but only 
if you agree.
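
For reference, a small sketch of how the two constants differ in 
java.lang.Character: Hebrew letters report DIRECTIONALITY_RIGHT_TO_LEFT, while 
Arabic letters report DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC, so a check meant to 
cover both scripts would presumably need to test both values. The helper below 
is hypothetical and is not what the PR currently does.

```java
// Hypothetical helper, not part of PDFBox.
class RtlCheckDemo {

    // True for strong right-to-left characters of either bidi class
    // (R for Hebrew and similar scripts, AL for Arabic-type letters).
    static boolean isRightToLeft(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        // U+05D0 is Hebrew alef, U+0644 is Arabic lam.
        System.out.println(Character.getDirectionality(0x05D0)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT);        // true
        System.out.println(Character.getDirectionality(0x0644)
                == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC); // true
        System.out.println(isRightToLeft('A'));                    // false
    }
}
```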



Apart from that, I can easily spot the following issues in the 3 PDFs shared by 
Tilman:
 * letter ك was not extracted correctly in FES-GGArabisch-p112.pdf
 * multiple beads are not detected when sorting is enabled
 * brackets are not mirrored correctly

> Extraction of Arabic PDF has incorrect ordering of normalized ligatures
> -----------------------------------------------------------------------
>
>                 Key: PDFBOX-4531
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4531
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.15
>            Reporter: Tilman Hausherr
>            Priority: Major
>              Labels: Arabic, regression
>         Attachments: FES-GGArabisch-p112.pdf, PDFBOX-4531-reduced.pdf, 
> PDFBOX-679-toobig.pdf, RAND_PE122z1.arabic.pdf, artikel1_20_arab.pdf, 
> bidi-ligature.patch, diff-output.zip
>
>
> As reported by Elias Peterson in the mailing list:
> {quote}
> I think I'm seeing some issues concerning the handling of the Arabic 
> lam-with-alef ligature.  I'm attempting to process the PDF here:
> https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
> When I run the ExtractText command with 2.0.15 I get the following:
> $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 
> RAND_PE122z1.arabic.pdf output.txt
> $ head output.txt
> C O R P O R A T I O N
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات اآلنية
> االتفاق مع إيران
> األيام التي تلي
> ...
> The issue being with the last two lines in the above snippet where my 
> understanding is that the ligature لا  was normalized but that the two 
> letters that compose it are in the wrong order.  I was thinking that 
> PDFBOX-684 sounded similar, and running the same PDF through 1.8.16 I see the 
> ligature is normalized in the way I think is expected (although the 
> interspersed English-language words are backwards here).
> $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 
> RAND_PE122z1.arabic.pdf output.txt
> ...
> $ head output.txt
> N O I T A R O P R O C
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات الآنية
> الاتفاق مع إيران
> الأيام التي تلي
> ...
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
