[
https://issues.apache.org/jira/browse/PDFBOX-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986058#action_12986058
]
Pavel Pisarev commented on PDFBOX-939:
--------------------------------------
Hi Anton,
I've checked your example and found that each Arabic line loses only its last
two spaces. The cause is in the method
PDFTextStripper#writeLine(List<String> line, boolean isRtlDominant).
The original source code is:
    int numberOfStrings = line.size();
    if (isRtlDominant) {
        for(int i=numberOfStrings-1; i>=0; i--){
            if (i > 1)
                writeWordSeparator();
            writeString(line.get(i));
        }
    }
    else {
        for(int i=0; i<numberOfStrings; i++){
            writeString(line.get(i));
            if (!isRtlDominant && i < numberOfStrings-1)
                writeWordSeparator();
        }
    }
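With the condition "if (i > 1)", no separator is written before the strings at
indices 1 and 0, which are the last two strings written on an RTL line. If you
change the condition "if (i > 1)" to "if (i < numberOfStrings-1)", text
extraction will be correct. I suppose this is a bug.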
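
To make the effect of the two conditions easier to see, here is a minimal
standalone sketch. It is not PDFBox code: the class RtlLineJoinSketch and the
methods joinOriginal and joinFixed are hypothetical names, and a StringBuilder
stands in for writeString() and writeWordSeparator(). Only the RTL loop and
its condition are taken from the source quoted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the RTL branch of writeLine(); not part of PDFBox.
final class RtlLineJoinSketch {

    // Mirrors the quoted RTL loop with the original condition "i > 1".
    static String joinOriginal(List<String> line, String separator) {
        StringBuilder out = new StringBuilder();
        int numberOfStrings = line.size();
        for (int i = numberOfStrings - 1; i >= 0; i--) {
            if (i > 1) {
                out.append(separator); // skipped for i == 1 and i == 0
            }
            out.append(line.get(i));
        }
        return out.toString();
    }

    // Same loop with the proposed condition "i < numberOfStrings - 1".
    static String joinFixed(List<String> line, String separator) {
        StringBuilder out = new StringBuilder();
        int numberOfStrings = line.size();
        for (int i = numberOfStrings - 1; i >= 0; i--) {
            if (i < numberOfStrings - 1) {
                out.append(separator); // skipped only before the first string written
            }
            out.append(line.get(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> line = Arrays.asList("a", "b", "c", "d");
        // Brackets make the stray leading separator visible.
        System.out.println("[" + joinOriginal(line, " ") + "]"); // prints "[ d cba]"
        System.out.println("[" + joinFixed(line, " ") + "]");    // prints "[d c b a]"
    }
}
```

With four strings, the original condition emits one extra separator at the
start and drops the last two, which matches the symptom in the extracted text.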
> Lost whitespaces when extracting Arabic text
> --------------------------------------------
>
> Key: PDFBOX-939
> URL: https://issues.apache.org/jira/browse/PDFBOX-939
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.4.0
> Reporter: Anton
> Attachments: extracted.txt, test.pdf
>
>
> I tried to extract text from an Arabic PDF. The result looks good at first
> glance, but on closer inspection some whitespace is missing compared to
> text copied and pasted from the same PDF.
> Copy/pasted line from attached PDF:
> بعد ما اكتشف حقيقة المثلث الغامض
> Extracted text:
> بعد ما اكتشف حقيقةالمثلثالغامض
>