[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641
 ] 

Tim Allison commented on TIKA-1575:
-----------------------------------

We haven't yet integrated OCR with PDF parsing; it would make sense to do so.

When I just ran a JUnit test on this document alone with both 1.8.8 and 1.8.9, 
I got complete junk for p. 14.  There were paragraph markers in our XML markup, 
but the characters were junk.

The batch runs were multithreaded.  I wonder if caching in PDFont or another 
static object (?!) happened to hold the crucial information for decoding that 
page during the 1.8.8 run, and we simply didn't get lucky with caching during 
the 1.8.9 run?
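
To make that hypothesis concrete, here is a toy Java sketch of how a static 
cache could make extraction depend on processing order.  This is NOT PDFBox 
code: SharedFontCacheSketch, Doc, CACHE and extract are hypothetical names 
invented for illustration, and the "font table" is a plain character map.  It 
only shows the general mechanism, assuming a cache shared across documents 
and keyed by font name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedFontCacheSketch {

    // Hypothetical process-wide cache keyed by font name (illustration only).
    static final Map<String, Map<Character, Character>> CACHE = new ConcurrentHashMap<>();

    /** Toy document: one font, raw character codes, and an optional decoding table. */
    static final class Doc {
        final String fontName;
        final String rawCodes;
        final Map<Character, Character> toUnicode; // null: the file carries no usable table

        Doc(String fontName, String rawCodes, Map<Character, Character> toUnicode) {
            this.fontName = fontName;
            this.rawCodes = rawCodes;
            this.toUnicode = toUnicode;
        }
    }

    /** Decodes a document, consulting and populating the shared cache. */
    static String extract(Doc doc) {
        if (doc.toUnicode != null) {
            CACHE.putIfAbsent(doc.fontName, doc.toUnicode);
        }
        Map<Character, Character> table = CACHE.get(doc.fontName);
        StringBuilder out = new StringBuilder();
        for (char code : doc.rawCodes.toCharArray()) {
            // With no table available the raw codes pass through unchanged, i.e. junk.
            out.append(table == null ? code : table.getOrDefault(code, code));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> table = new HashMap<>();
        table.put('a', 'T');
        table.put('b', 'i');
        table.put('c', 'k');
        table.put('d', 'a');

        Doc complete = new Doc("F1", "abcd", table); // carries its own table
        Doc damaged = new Doc("F1", "abcd", null);   // same font name, no table

        // Order 1: the damaged document is processed before anything fills the cache.
        CACHE.clear();
        System.out.println("damaged first: " + extract(damaged));

        // Order 2: another document with the same font name was processed earlier.
        CACHE.clear();
        extract(complete);
        System.out.println("damaged after complete: " + extract(damaged));
    }
}
```

In a multithreaded batch run the processing order is not fixed, so if 
something like this were happening, the same file could come out readable in 
one run and as junk in the next, while a single-document JUnit test would 
always start from an empty cache.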

As I found with "monitoring", when I swipe/copy the full page in Acrobat 
Reader, I get complete junk.

> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
