[
https://issues.apache.org/jira/browse/PDFBOX-4795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17054230#comment-17054230
]
Josh Burchard commented on PDFBOX-4795:
---------------------------------------
[~lehmi], thanks for taking a look at this issue. I'm a native English speaker
so it's also quite foreign for me to look at this text and make sense of what
I'm seeing. Usually I just CTRL+F search for this string "שמגר" in Adobe
Reader to locate my example section. Next time I have a bug report I'll try to
be more descriptive by including some landmarks. ;)
The output certainly does look better using the sort option! My only concern is
that it seems like a side-effect that I might not be able to rely on
forever as it's not called out specifically in the doc. Also the documentation
mentions some performance concerns with sorting
(https://pdfbox.apache.org/docs/2.0.7/javadocs/org/apache/pdfbox/text/PDFTextStripper.html):
"By default PDFBox does *not* sort the text tokens before processing them due
to performance reasons." In some cases I'm processing thousands of PDFs so
that might impact me a lot.
Another thing is that I'm not certain if I can try the option myself because
the application I'm responsible for is using curl to pass PDFs to a resident
Apache Tika server, which is in turn making use of the PDFBox library.
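For anyone in the same curl-to-Tika setup, here is a rough sketch of how the sort option might be requested per call without touching PDFBox directly. It rests on an assumption that should be checked against the Tika version in use: that the Tika server maps request headers prefixed with "X-Tika-PDF" onto its PDF parser configuration, so that "X-Tika-PDFsortByPosition: true" turns on PDFBox's sort-by-position. The host, port, "/tika" endpoint, and function names below are illustrative, not taken from my actual application.

```python
import http.client

# ASSUMPTION: Tika server applies "X-Tika-PDF"-prefixed request headers to its
# PDF parser configuration, so this header would enable sort-by-position.
# Verify against the documentation for your Tika version.
SORT_HEADER = "X-Tika-PDFsortByPosition"


def build_tika_headers(sort_by_position: bool) -> dict:
    """Build request headers for a plain-text extraction call (hypothetical helper)."""
    headers = {
        "Accept": "text/plain",
        "Content-Type": "application/pdf",
    }
    if sort_by_position:
        headers[SORT_HEADER] = "true"
    return headers


def extract_text(pdf_path: str, host: str = "localhost", port: int = 9998,
                 sort_by_position: bool = True) -> str:
    """PUT a PDF to a running Tika server and return the extracted text.

    Host, port, and the "/tika" path are assumptions about a default local server.
    """
    with open(pdf_path, "rb") as f:
        body = f.read()
    # http.client sends header names exactly as written.
    conn = http.client.HTTPConnection(host, port, timeout=60)
    try:
        conn.request("PUT", "/tika", body=body,
                     headers=build_tika_headers(sort_by_position))
        resp = conn.getresponse()
        data = resp.read()
        if resp.status != 200:
            raise RuntimeError(f"Tika returned HTTP {resp.status}")
        return data.decode("utf-8")
    finally:
        conn.close()
```

Calling extract_text("hebrew_newsletter.pdf") once with sort_by_position=True and once with False, and timing both over a batch, would be one way to see whether the performance cost matters at the scale of thousands of PDFs.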
Regardless of that, it seems like the situation I have is approachable with
current tools and I should move my questions to the forums to discuss
technique. You can feel free to close this as "not-a-bug" if you want.
> Hebrew words are extracted with no whitespace between
> -----------------------------------------------------
>
> Key: PDFBOX-4795
> URL: https://issues.apache.org/jira/browse/PDFBOX-4795
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 2.0.19
> Environment: Windows 10
> Reporter: Josh Burchard
> Priority: Major
> Attachments: PDFBOX-4795-hebrew_newsletter_sorted.txt,
> hebrew_newsletter.pdf
>
>
> When I extract Hebrew text from the included PDF, white space delimiting the
> words is not output.
> Example string of text as appears in the PDF:
> מאיר שמגר. ״ההלכות
> And the string as PDFBox extracts it:
> ״ההלכותשמגר.מאיר
> The words themselves are presented LTR, instead of RTL. It would be nice to
> have them RTL, but in my particular use case that doesn't matter as I'm
> creating an index. The spaces between matter a lot, however.