[ https://issues.apache.org/jira/browse/PDFBOX-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17905803#comment-17905803 ]
ASF subversion and git services commented on PDFBOX-5487:
---------------------------------------------------------

Commit 1922512 from Tilman Hausherr in branch 'pdfbox/branches/2.0'
[ https://svn.apache.org/r1922512 ]

PDFBOX-5487: Remove all space characters if contained within the adjacent letters, by Mohamed M NourElDin; closes #155

> extra whitespaces when extracting Arabic text
> ---------------------------------------------
>
>                 Key: PDFBOX-5487
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5487
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Fatemeh Elyasi
>            Priority: Major
>              Labels: Arabic
>         Attachments: Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR (withoutFixes).txt,
>                      Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.pdf,
>                      Malpass-at-the-G7-Leaders-Summit-Media-Briefing-AR.txt,
>                      PDFBOX-3774-reduced.pdf-sorted-diff.txt,
>                      PDFBOX-5487-arabic.pdf-sorted-diff.txt,
>                      PDFBOX-5487_ اعلامية.png, PDFBOX-5487_ وفضلا.png,
>                      arabtest.pdf, meld1.png, meld2.png, meld3.png, screenshot-1.png
>
>
> I am trying to extract text from an Arabic PDF. Some whitespace characters
> are extracted in the wrong place.
> Example:
> Original word: العالمية
> Extracted word: العالمي ة
>
> The PDF is attached; the example word is on the first line.
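
For readers who want a concrete picture of the rule named in the commit message ("remove all space characters if contained within the adjacent letters"), here is a minimal sketch of one possible reading of it. It is not the code from r1922512. The `Glyph` class, its fields, and `removeContainedSpaces` are hypothetical names invented for illustration, and the assumption that "contained" means "the space's horizontal extent lies entirely inside the extent of the previous or next letter" is mine, not something the commit message confirms.

```java
import java.util.Arrays;
import java.util.List;

public class ContainedSpaceSketch {

    /** Hypothetical glyph model: extracted text plus its horizontal extent. */
    static final class Glyph {
        final String text;
        final float left;
        final float right;

        Glyph(String text, float left, float width) {
            this.text = text;
            this.left = left;
            this.right = left + width;
        }

        /** True if the other glyph's extent lies entirely inside this one's. */
        boolean covers(Glyph other) {
            return other.left >= left && other.right <= right;
        }
    }

    /** Joins glyph text, skipping spaces covered by either neighbouring glyph. */
    static String removeContainedSpaces(List<Glyph> glyphs) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.size(); i++) {
            Glyph g = glyphs.get(i);
            boolean interior = i > 0 && i < glyphs.size() - 1;
            if (" ".equals(g.text) && interior
                    && (glyphs.get(i - 1).covers(g) || glyphs.get(i + 1).covers(g))) {
                continue; // assumed stray space: it overlaps a letter, so drop it
            }
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: the space sits inside the extent of the first letter.
        List<Glyph> stray = Arrays.asList(
                new Glyph("ي", 20f, 10f),
                new Glyph(" ", 22f, 3f),
                new Glyph("ة", 10f, 10f));
        // Made-up coordinates: the space fills the gap between two letters.
        List<Glyph> real = Arrays.asList(
                new Glyph("a", 0f, 10f),
                new Glyph(" ", 10f, 5f),
                new Glyph("b", 15f, 10f));
        System.out.println(removeContainedSpaces(stray));
        System.out.println(removeContainedSpaces(real));
    }
}
```

The check uses left and right edges only, so it does not depend on whether the glyphs arrive in left-to-right or right-to-left order. A space that occupies the gap between two letters is kept, since neither neighbour covers it.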