[jira] [Commented] (PDFBOX-5487) extra whitespaces when extracting Arabic text

Tilman Hausherr (Jira) Sat, 20 Aug 2022 04:00:29 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17582194#comment-17582194
 ]


Tilman Hausherr commented on PDFBOX-5487:
-----------------------------------------

I tried to look at what's happening in normalizeWord and handleDirection, 
but I have no clue how to interpret what I'm seeing. I also created a Word file 
with "العالمية" and converted it to PDF [^arabtest.pdf], but text extraction 
works properly for that file.

What might help move this forward is if anybody who has a PDF editor could 
reduce the original file to the smallest possible text that still produces the 
error in PDFBox. I'm not saying we'll be able to fix it, but it might make it 
easier for another person who wants to try.
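
While reducing the file, it may help to check mechanically where the stray
whitespace lands instead of comparing right-to-left text by eye. The sketch
below is not part of PDFBox: the class name WhitespaceDiff and its methods are
hypothetical, it uses only the JDK, and it assumes the extracted string is
already available in logical order (for example, pasted from the extraction
output).

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceDiff {

    /**
     * Returns the code-point indexes in 'extracted' that hold whitespace
     * which is not present in 'expected'. Throws if the two strings differ
     * by anything other than inserted whitespace.
     */
    static List<Integer> extraWhitespace(String expected, String extracted) {
        int[] exp = expected.codePoints().toArray();
        int[] ext = extracted.codePoints().toArray();
        List<Integer> extra = new ArrayList<>();
        int i = 0; // position in 'expected'
        for (int j = 0; j < ext.length; j++) {
            if (i < exp.length && exp[i] == ext[j]) {
                i++; // same code point in both strings
            } else if (Character.isWhitespace(ext[j]) || Character.isSpaceChar(ext[j])) {
                extra.add(j); // whitespace that only the extracted text has
            } else {
                throw new IllegalArgumentException(
                        "strings differ by more than whitespace at index " + j);
            }
        }
        if (i != exp.length) {
            throw new IllegalArgumentException("extracted text is shorter than expected");
        }
        return extra;
    }

    /** Renders a string as U+XXXX code points so the order is unambiguous. */
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The word from the issue report, written with escapes to avoid bidi display issues.
        String expected = "\u0627\u0644\u0639\u0627\u0644\u0645\u064A\u0629";
        // The extracted form as shown in the report: a space before the final letter.
        String extracted = "\u0627\u0644\u0639\u0627\u0644\u0645\u064A \u0629";
        System.out.println(dump(extracted));
        System.out.println(extraWhitespace(expected, extracted)); // prints [7]
    }
}
```

For the example word from the report, the space sits at index 7, directly
before the final U+0629. Running the same check on each reduced version of the
PDF would confirm whether the reduction still triggers the problem.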

> extra whitespaces when extracting Arabic text
> ---------------------------------------------
>
>                 Key: PDFBOX-5487
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5487
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Fatemeh Elyasi
>            Priority: Major
>              Labels: Arabic
>         Attachments: Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf, 
> arabtest.pdf, screenshot-1.png
>
>
> I am trying to extract text from an Arabic PDF. Some whitespace characters 
> are extracted in the wrong place.
> Example:
> Original word: العالمية
> Extracted word: العالمي ة
>  
> The PDF is attached; the example word is on the first line.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
