[ https://issues.apache.org/jira/browse/PDFBOX-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905762#comment-17905762 ]
Mohamed M NourElDin commented on PDFBOX-5487:
---------------------------------------------

Yes, that's exactly what I expected. In addition to {*}Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf{*}, I tested the following four files:
 # *artikel1_20_arab.pdf*
 ** The space between the two words "النظامية مفتوحاً" was removed. However, this did not affect extraction accuracy, as the _teh marbuta_ (ة) only appears at the end of a word and is never connected to the letter on its left.
 # *FES-GGArabisch-p112.pdf*
 ** No changes were observed.
 # *PDFBOX-679-toobig.pdf*
 ** No changes were observed.
 # *RAND_PE122z1.arabic.pdf*
 ** One extra space was removed correctly ("مؤك د" becomes "مؤكد").
 ** Eleven other extra spaces were removed, but these changes neither improved nor harmed extraction accuracy.

> extra whitespaces when extracting Arabic text
> ---------------------------------------------
>
>                 Key: PDFBOX-5487
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5487
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Fatemeh Elyasi
>            Priority: Major
>              Labels: Arabic
>         Attachments: Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR
> (withoutFixes).txt, Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf,
> Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt,
> PDFBOX-3774-reduced.pdf-sorted-diff.txt,
> PDFBOX-5487-arabic.pdf-sorted-diff.txt, PDFBOX-5487_ اعلامية.png,
> PDFBOX-5487_ وفضلا.png, arabtest.pdf, meld1.png, meld2.png, meld3.png,
> screenshot-1.png
>
>
> trying to extract text from an arabic PDF. You may notice that some of
> whitespaces are extracted in wrong place.
> Example:
> Original word: العالمية
> Extracted word: العالمي ة
>
> Pdf is attached, the example word is on the first line.
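
The symptom in the original report (a _teh marbuta_ split off from its word, as in "العالمي ة") can be looked for in already-extracted text. The following is only a minimal sketch of such a check, not the fix discussed in this issue and not part of PDFBox: the class name DetachedTehMarbuta and its two methods are hypothetical, and it assumes that a stand-alone ة is never a valid word, so a space between an Arabic letter and a lone ة is treated as an extraction artifact.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a PDFBox API) for spotting a detached teh marbuta. */
final class DetachedTehMarbuta {

    // Spaces or tabs that sit between an Arabic letter (U+0621..U+064A) and a
    // teh marbuta (U+0629) that is not itself followed by another Arabic letter.
    // Assumption: a lone teh marbuta is never a word on its own.
    private static final Pattern DETACHED_GAP = Pattern.compile(
            "(?<=[\\u0621-\\u064A])[ \\t]+(?=\\u0629(?![\\u0621-\\u064A]))");

    private DetachedTehMarbuta() {
    }

    /** Counts the gaps that separate a word from its final teh marbuta. */
    static int countDetached(String text) {
        Matcher matcher = DETACHED_GAP.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    /** Returns the text with those gaps removed; all other spaces are kept. */
    static String rejoin(String text) {
        return DETACHED_GAP.matcher(text).replaceAll("");
    }
}
```

Such a check only covers the teh marbuta case from the report. A split like "مؤك د" cannot be told apart from a real word boundary on the string alone, which is why it has to be handled during extraction, where glyph positions are available.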