[jira] [Commented] (PDFBOX-5487) extra whitespaces when extracting Arabic text

Mohamed M NourElDin (Jira) Sun, 19 Feb 2023 02:58:05 -0800


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690866#comment-17690866
 ]


Mohamed M NourElDin commented on PDFBOX-5487:
---------------------------------------------

Hi [~tilman], I have just created a new pull request for this one too:
[PR#155 PDFBOX-5487: Remove all space characters if contained within the 
adjacent letters|https://github.com/apache/pdfbox/pull/155]

I still want to run this fix on the test set that you shared in PDFBOX-4531 but 
meanwhile here is an explanation for the current issue in the attached PDF:
 * There is a space character to the left of the last word in the first line 
(the last word, because Arabic is written from right to left).
 * This space actually overlaps with the adjacent Arabic letter 'ة'.
 * When sorting is enabled, this space gets shifted into the middle of the word, 
between the last two letters (i.e. 'ية' becomes 'ي ة').
 * The same issue occurs again in the first word on the 9th line from the 
bottom of the first page ('فضلا' becomes 'ف ضلا').
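
The points above can be illustrated with a small standalone sketch. It is not the code from PR#155 and does not use PDFBox classes: {{Glyph}}, {{sortRightToLeft}}, {{dropContainedSpaces}} and the coordinates are all hypothetical, and the sort is a deliberate simplification of position-based sorting. It only shows how a space whose box lies inside the neighbouring letter ends up between two letters once glyphs are ordered by position, and how dropping such contained spaces restores the word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OverlapSpaceSketch {

    /** Hypothetical positioned glyph; a stand-in, not PDFBox's TextPosition. */
    static final class Glyph {
        final String text;
        final double x;      // left edge of the glyph box
        final double width;  // horizontal extent of the glyph box

        Glyph(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        double right() { return x + width; }

        boolean isSpace() { return " ".equals(text); }
    }

    /** Simplified position sort: larger x first, i.e. right-to-left reading order. */
    static List<Glyph> sortRightToLeft(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x).reversed());
        return sorted;
    }

    /** True if inner's horizontal extent lies entirely within outer's. */
    static boolean contains(Glyph outer, Glyph inner) {
        return inner.x >= outer.x && inner.right() <= outer.right();
    }

    /** Drops every space whose box is contained in the previous or next letter. */
    static List<Glyph> dropContainedSpaces(List<Glyph> sorted) {
        List<Glyph> kept = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Glyph g = sorted.get(i);
            if (g.isSpace()) {
                Glyph prev = i > 0 ? sorted.get(i - 1) : null;
                Glyph next = i + 1 < sorted.size() ? sorted.get(i + 1) : null;
                boolean inPrev = prev != null && !prev.isSpace() && contains(prev, g);
                boolean inNext = next != null && !next.isSpace() && contains(next, g);
                if (inPrev || inNext) {
                    continue; // overlapping space: skip it
                }
            }
            kept.add(g);
        }
        return kept;
    }

    /** Concatenates the glyph texts in list order. */
    static String render(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed coordinates: the space (103..107) sits inside the last letter (100..110).
        List<Glyph> stream = Arrays.asList(
                new Glyph("\u064A", 110, 10), // ي
                new Glyph("\u0629", 100, 10), // ة
                new Glyph(" ", 103, 4));      // space drawn after the word
        List<Glyph> sorted = sortRightToLeft(stream);
        System.out.println(render(sorted));                      // space lands between the letters
        System.out.println(render(dropContainedSpaces(sorted))); // space removed
    }
}
```

A space that is not contained in either neighbour (a genuine word separator) is left untouched by this sketch, so only the overlapping case described above is affected.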

I have attached the extracted text before and after the fix, as well as 
some screenshots drawn by the {{DrawPrintTextLocations}} utility to illustrate the 
problem. I can also share a Python script that draws directly in the PDF 
file for debugging.



Pre-change: [^Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR 
(withoutFixes).txt]

Post-change: [^Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt]

Regarding the *meld[123].png* images, the issues highlighted in
 * *green* should be fixed by *PR#155*
 * *red* and *blue* should be fixed by *PR#154*

Thanks

> extra whitespaces when extracting Arabic text
> ---------------------------------------------
>
>                 Key: PDFBOX-5487
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5487
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Fatemeh Elyasi
>            Priority: Major
>              Labels: Arabic
>         Attachments: Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR 
> (withoutFixes).txt, Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf, 
> Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt, PDFBOX-5487_ 
> اعلامية.png, PDFBOX-5487_ وفضلا.png, arabtest.pdf, meld1.png, meld2.png, 
> meld3.png, screenshot-1.png
>
>
> When extracting text from an Arabic PDF, some whitespace characters are 
> extracted in the wrong place.
> Example:
> Original word: العالمية
> Extracted word: العالمي ة
>  
> Pdf is attached, the example word is on the first line.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

