[
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641
]
Tim Allison edited comment on TIKA-1575 at 3/17/15 5:51 PM:
------------------------------------------------------------
We haven't yet integrated OCR with PDF parsing, though it would make sense to.
When I just ran a JUnit test on this document alone with 1.8.8 and 1.8.9, I got
complete junk for p. 14 with both versions. There were paragraph markers in our
XML markup, but the characters were junk.
The batch runs were multithreaded. I wonder if caching in PDFont or another
static object (?!) happened to hold the crucial information for decoding that
page during the batch run with 1.8.8, but we didn't get lucky with caching
during the run with 1.8.9?
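A toy sketch of how that could happen, under stated assumptions: the class
SharedCacheSketch, its static CACHE, and the decode method are all hypothetical
and are not PDFBox or Tika code. It only shows that if decoding information is
kept in a static map, the output for one document can depend on which documents
were processed earlier in the same JVM, which in a multithreaded batch run
comes down to timing.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedCacheSketch {
    // Hypothetical static cache shared by every document in the JVM:
    // font name -> character mapping. This is an assumption for illustration,
    // not how PDFBox is known to store font data.
    private static final Map<String, Map<Character, Character>> CACHE =
            new ConcurrentHashMap<>();

    public static void reset() {
        CACHE.clear();
    }

    // mapping is null when the document carries no usable decoding
    // information for this font.
    public static String decode(String fontName,
                                Map<Character, Character> mapping,
                                String raw) {
        if (mapping != null) {
            // A document that does carry the mapping populates the shared cache.
            CACHE.putIfAbsent(fontName, mapping);
        }
        Map<Character, Character> cached = CACHE.get(fontName);
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            Character mapped = (cached == null) ? null : cached.get(c);
            // Unmapped character codes come out as junk.
            sb.append(mapped == null ? '?' : mapped);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> good =
                Map.of('\u0001', 'T', '\u0002', 'i', '\u0003', 'k', '\u0004', 'a');
        String raw = "\u0001\u0002\u0003\u0004";

        // Processed alone (cold cache): no mapping is available.
        System.out.println(decode("F1", null, raw)); // prints "????"

        // Another document using the same font name warms the cache first.
        decode("F1", good, "");
        System.out.println(decode("F1", null, raw)); // prints "Tika"
    }
}
```

In a single-document JUnit run the cache would always be cold, which would be
consistent with getting junk for both versions there.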
As I found with "monitoring", when I swipe/copy the full page in Acrobat
Reader, I get complete junk.
The multithreading hypothesis might also explain why on this run comparing
1.8.8 against the updated 1.8.9, I found no difference in content in:
147/147012.pdf
223/223704.pdf
I'd have to look more closely to make sure that Maruan's AcroField fixes don't
explain those differences.
> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json,
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9.