[
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-1575:
------------------------------
Attachment: PDFBox_1_8_8Vs1_8_9_20150316.zip
content_diffs_20150316.xlsx
Results from rerunning after [~msahyoun] made the fixes. It looks like there
may still be a handful of regressions from a different cause (or causes).
Important columns:
TOP_10_UNIQUE_TOKEN_DIFFS_A
TOP_10_UNIQUE_TOKEN_DIFFS_B
These report which tokens are unique to A or B. For example, for
005/005937.pdf, PDFBox 1.8.8's extracted text included these words:
{noformat}
global: 5 | sep: 4 | one: 4 | o: 4 | field: 4 | view: 3 | monitoring: 3 |
support: 2 | real: 2 | pole: 2
{noformat}
but PDFBox 1.8.9-SNAPSHOT's extracted text doesn't contain any of those words.
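For anyone who wants to reproduce this kind of column by hand, here is a minimal sketch of how a "unique to A" list could be computed. The tokenizer (lowercased `\w+` runs) and the helper names `tokenize`, `top_unique_tokens` and `format_diffs` are assumptions for illustration only; the tool that produced the attached spreadsheets may tokenize and rank differently.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase the text and split on runs of word characters.
    # The real comparison tool may use a different analyzer.
    return re.findall(r"\w+", text.lower())


def top_unique_tokens(text_a, text_b, n=10):
    """Return the n most frequent tokens that occur in text_a but never in text_b."""
    counts_a = Counter(tokenize(text_a))
    vocab_b = set(tokenize(text_b))
    unique = Counter({tok: c for tok, c in counts_a.items() if tok not in vocab_b})
    return unique.most_common(n)


def format_diffs(pairs):
    # Render pairs in the "token: count | token: count" style used in the report.
    return " | ".join(f"{tok}: {count}" for tok, count in pairs)
```

Calling `top_unique_tokens(text_a, text_b)` with the 1.8.8 extract as A and the 1.8.9-SNAPSHOT extract as B would give the A column; swapping the arguments gives the B column.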
TOP_10_MORE_IN_A and TOP_10_MORE_IN_B show which tokens appear more commonly in
A than B and vice versa. For example, for 524/524276.pdf, PDFBox 1.8.8 was
able to extract 44 more "the", 35 more "in", 32 more "and" etc. than were
extracted with PDFBox 1.8.9-SNAPSHOT:
{noformat}
the: 44 | in: 35 | and: 32 | of: 32 | tanabe: 28 | a: 19 | tsunami: 18 | p: 14
| 1: 12 | 1700: 12
{noformat}
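The "more in A" numbers can be thought of as per-token count differences. A rough sketch follows, again assuming a simple lowercased `\w+` tokenizer and hypothetical helper names (`token_counts`, `top_more_in_a`). Whether the actual report also counts tokens that are entirely absent from B in this column is not something this sketch can confirm; here they are included.

```python
import re
from collections import Counter


def token_counts(text):
    # Assumption: lowercase and split on runs of word characters.
    return Counter(re.findall(r"\w+", text.lower()))


def top_more_in_a(text_a, text_b, n=10):
    """Return the n tokens with the largest positive count surplus in text_a over text_b."""
    # Counter subtraction keeps only tokens whose resulting count is positive.
    surplus = token_counts(text_a) - token_counts(text_b)
    return surplus.most_common(n)
```

With the 1.8.8 extract as `text_a` and the 1.8.9-SNAPSHOT extract as `text_b`, an entry such as `("the", 44)` would mean "the" was extracted 44 more times by 1.8.8.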
999/999680.pdf shows a clear improvement from PDFBox 1.8.8->PDFBox
1.8.9-SNAPSHOT. Compare 1.8.8's
{noformat}
h: 17 | hfirst: 2 | had: 1 | hanecdotal: 1 | hcapitalize: 1 | hchina: 1 |
hcreate: 1 | hdiversify: 1 | hemerging: 1 | hgenerate: 1
{noformat}
with 1.8.9-SNAPSHOT's:
{noformat}
first: 2 | ad: 1 | anecdotal: 1 | capitalize: 1 | china: 1 | create: 1 |
diversify: 1 | emerging: 1 | generate: 1 | growing: 1
{noformat}
but there are a handful of others that might be regressions.
> Upgrade to PDFBox 1.8.9 when available
> --------------------------------------
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: 10-814_Appendix B_v3.pdf,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx,
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip,
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9. Let's use this issue to
> track discussions before the release and to track Tika's upgrade to PDFBox
> 1.8.9.