[ https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-1352:
------------------------------
Attachment: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130
workarounds removed) on a random selection of 10k PDF files from govdocs1.
Both runs used the older "sequential" parser.
The table file is a tab-delimited UTF-16LE file.
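For anyone who wants to load the table outside a spreadsheet, here is a minimal Python sketch for reading a tab-delimited UTF-16LE file. The function name read_table is hypothetical, and the sketch assumes the first row holds column headers, which is not confirmed above.

```python
import csv
import io


def read_table(path):
    """Read a tab-delimited UTF-16LE file into a list of dict rows.

    Assumes the first row is a header row (an assumption, not confirmed
    by the attachment description).
    """
    # The "utf-16-le" codec does not strip a byte-order mark, so remove
    # one by hand if the file starts with it.
    with open(path, encoding="utf-16-le", newline="") as fh:
        text = fh.read()
    if text.startswith("\ufeff"):
        text = text[1:]
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))
```

Each row comes back as a dict keyed by header name, with all values left as strings.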
This is a first go at the initial/raw output of comparison code for TIKA-1302.
Much more work remains.
The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds.
Other than that, we should probably look at the few hundred files that have
token overlap of < 98%.
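One way to pull out those files is a small filter over the table rows, sketched below in Python. The column name TOKEN_OVERLAP is hypothetical, and the sketch assumes overlap is stored as a fraction between 0 and 1; if the table stores percentages, pass threshold=98 instead.

```python
def low_overlap(rows, column="TOKEN_OVERLAP", threshold=0.98):
    """Return the rows whose overlap score is below the threshold.

    `rows` is any iterable of dicts. The column name and the 0..1 scale
    are assumptions; adjust both to match the real table.
    """
    flagged = []
    for row in rows:
        try:
            value = float(row[column])
        except (KeyError, TypeError, ValueError):
            # Skip rows with a missing or non-numeric score, such as
            # files where one of the runs threw an exception.
            continue
        if value < threshold:
            flagged.append(row)
    return flagged
```

Rows without a usable score are skipped rather than flagged, so files that failed outright need a separate pass.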
To view the original files from govdocs1 (e.g. 765470), navigate to:
http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
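The URL pattern can be sketched as a small Python helper. The directory rule (first three digits of the file number) is inferred from the single example above, and the zero-padding to six digits is an assumption; the function name govdocs1_url is hypothetical.

```python
BASE_URL = "http://digitalcorpora.org/corp/nps/files/govdocs1"


def govdocs1_url(file_id, ext="pdf"):
    """Build the download URL for a govdocs1 file number.

    Assumes file names are six digits, zero-padded, and that the
    directory is the first three digits of the name.
    """
    name = str(file_id).zfill(6)
    return f"{BASE_URL}/{name[:3]}/{name}.{ext}"
```

For example, govdocs1_url(765470) yields the URL shown above.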
> Upgrade to PDFBox 1.8.6
> -----------------------
>
> Key: TIKA-1352
> URL: https://issues.apache.org/jira/browse/TIKA-1352
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
>
>