In case you're interested, see below. Thank you all for the
improvements in the 1.8.7 release!
Best,
Tim
-----Original Message-----
From: Tim Allison (JIRA) [mailto:[email protected]]
Sent: Monday, September 22, 2014 2:31 PM
To: [email protected]
Subject: [jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7
[
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143588#comment-14143588
]
Tim Allison commented on TIKA-1419:
-----------------------------------
I just finished the run on 50,000 random PDFs from govdocs1. With the move to
PDFBox 1.8.7, we've gone from 53 exceptions down to 32. In manually reviewing
the handful of docs with a token overlap < 0.80, there are quite a few
improvements. It also looks like there may be some regressions in character
mapping in several of the files. I'll submit issues for these over on PDFBox.
Unless there are objections, I'll bump Tika to PDFBox 1.8.7.
Unfortunately, the individual file links don't seem to be working today on the
govdocs1 site.
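
For readers wondering what a "token overlap < 0.80" filter might look like:
the exact metric used in the comparison is not spelled out in this message, so
the following is only a minimal sketch under stated assumptions. It assumes a
Dice-style overlap on bags of lowercased word tokens, computed between the
text extracted by the old and new PDFBox versions. The names tokenize,
token_overlap and flag_for_review are hypothetical and are not part of Tika or
PDFBox.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, split on runs of non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]


def token_overlap(text_a, text_b):
    """Dice-style overlap between two bags of tokens, in [0.0, 1.0]."""
    a, b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 1.0  # two empty extractions are treated as identical
    shared = sum((a & b).values())  # multiset intersection of token counts
    return 2.0 * shared / total


def flag_for_review(extractions, threshold=0.80):
    """Return names of files whose old-vs-new overlap is below the threshold.

    `extractions` maps a file name to a pair (old_text, new_text).
    """
    return sorted(
        name
        for name, (old, new) in extractions.items()
        if token_overlap(old, new) < threshold
    )
```

A low score only says that the two extractions differ. It does not say which
version is better, which is why the flagged files still need manual review.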
> Upgrade to PDFBox 1.8.7
> -----------------------
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major
> regressions are found.