[
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641
]
Tim Allison commented on TIKA-1575:
-----------------------------------
We haven't yet integrated OCR with PDF parsing, though it would make sense to.
When I just ran a JUnit test on this document alone with PDFBox 1.8.8 and
1.8.9, I got complete junk for page 14 with both versions. There were paragraph
markers in our XML markup, but the characters were junk.
The batch runs were multithreaded. I wonder whether a cache in PDFont or some
other static object (?!) happened to hold the information needed to decode that
page during the 1.8.8 run, and we simply didn't get as lucky with caching
during the 1.8.9 run.
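To make that hypothesis concrete, here is a minimal sketch of how a static cache
could make extraction depend on which documents happened to be processed earlier
in the same JVM. It does not use PDFBox at all: the class StaticCacheSketch, its
ENCODING_CACHE map, and the decode/reset methods are hypothetical names invented
for illustration, and nothing here is confirmed to match how PDFont behaves.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; this is NOT PDFBox code.
class StaticCacheSketch {
    // Assumed process-wide cache shared by every document and thread.
    private static final Map<String, String> ENCODING_CACHE = new ConcurrentHashMap<>();

    // Decodes text for a font. embeddedEncoding is null when the document
    // itself carries no usable decoding information for that font.
    static String decode(String fontName, String embeddedEncoding, String rawText) {
        if (embeddedEncoding != null) {
            // A well-formed document populates the shared cache.
            ENCODING_CACHE.putIfAbsent(fontName, embeddedEncoding);
        }
        String encoding = ENCODING_CACHE.get(fontName);
        // Without any encoding, the best we can do is emit junk.
        return encoding == null ? "<junk>" : encoding + ":" + rawText;
    }

    // Clears the shared cache, simulating a fresh JVM.
    static void reset() {
        ENCODING_CACHE.clear();
    }

    public static void main(String[] args) {
        // Run 1: the problem page is processed alone, as in a single JUnit test.
        reset();
        System.out.println(decode("F1", null, "page14"));

        // Run 2: another document using the same font name was processed first,
        // as might happen by chance in a multithreaded batch run.
        reset();
        decode("F1", "enc", "other-doc");
        System.out.println(decode("F1", null, "page14"));
    }
}
```

In this sketch the same page decodes differently depending only on what ran
before it, which is the kind of "lucky" batch result described above. Whether
anything like this actually happens in PDFBox would need to be checked against
its source.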
As I found with "monitoring", when I select and copy the full page in Acrobat
Reader, I get complete junk there as well.
> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json,
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9.