[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365641#comment-14365641
 ] 

Tim Allison edited comment on TIKA-1575 at 3/17/15 5:50 PM:
------------------------------------------------------------

We haven't yet integrated OCR with PDF parsing, but it would make sense.

When I just ran a JUnit test on this document alone with 1.8.8 and 1.8.9, I got 
complete junk for page 14 with both versions.  There were paragraph markers in 
our XML markup, but the characters were junk.

The batch runs were multithreaded.  I wonder if caching in PDFont or another 
static object (?!) happened to hold the crucial information for decoding that 
page during the 1.8.8 run, but we didn't get lucky with caching during the 
1.8.9 run?
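
To make that hypothesis concrete, here is a minimal, self-contained sketch of 
how a static cache shared across documents could make extracted text depend on 
what was processed earlier in the same JVM.  Everything in it is hypothetical 
(the class StaticCacheSketch, the decode method, the font name "F1", and the 
code-to-character mappings); it is not PDFBox code and I have not confirmed 
that PDFBox behaves this way.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaticCacheSketch {

    // Hypothetical stand-in for a static, JVM-wide cache keyed by font name.
    static final Map<String, Map<Integer, Character>> CACHE =
            new ConcurrentHashMap<>();

    // Decode character codes for a font. If the document supplies its own
    // mapping, store it in the shared cache; otherwise fall back to whatever
    // an earlier document left in the cache under the same font name.
    static String decode(String fontName,
                         Map<Integer, Character> ownMapping,
                         int[] codes) {
        if (ownMapping != null) {
            CACHE.put(fontName, ownMapping);
        }
        Map<Integer, Character> mapping = CACHE.get(fontName);
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            Character c = (mapping == null) ? null : mapping.get(code);
            sb.append(c == null ? '?' : c); // '?' marks an undecodable code
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3};

        // Document processed alone: nothing cached, so the text is junk.
        System.out.println(decode("F1", null, codes));

        // An unrelated document with a usable mapping runs first in a batch...
        Map<Integer, Character> good = new HashMap<>();
        good.put(1, 'a');
        good.put(2, 'b');
        good.put(3, 'c');
        decode("F1", good, new int[0]);

        // ...and the same problem document now "decodes" by luck.
        System.out.println(decode("F1", null, codes));
    }
}
```

Run on its own, main prints ??? and then abc: the same input yields different 
text depending only on what touched the static cache beforehand, which is the 
kind of order dependence a multithreaded batch run could expose.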

As I found with "monitoring", when I swipe/copy the full page in Acrobat 
Reader, I get complete junk.

The multithreading hypothesis might also explain why, on this run comparing 
1.8.8 against the updated 1.8.9, I found no difference in content in 
147/147012.pdf or 223/223704.pdf.  I'd have to look more closely to make sure 
that Maruan's AcroField fixes don't explain those differences.



> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
>                 Key: TIKA-1575
>                 URL: https://issues.apache.org/jira/browse/TIKA-1575
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
