[
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040934#comment-14040934
]
Tim Allison edited comment on TIKA-1352 at 6/23/14 5:39 PM:
------------------------------------------------------------
Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130
workarounds removed) on a random selection of 10k PDF files from govdocs1.
Both runs used the older "sequential" parser.
The table file is a tab-delimited UTF-16LE file.
This is a first pass: the raw output of the comparison code for TIKA-1302.
Much more work remains.
The ZipBomb exceptions are caused by my incorrect first attempt to remove the
PDFBox-1130 workarounds. These exceptions will go away.
Other than that, we should probably look at the few hundred files that have
a token overlap below 98%.
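
As a rough illustration of how the attached table could be triaged, here is a
minimal Python sketch that reads a tab-delimited UTF-16LE file and keeps the
rows below an overlap threshold. The column names (FILE, TOKEN_OVERLAP) and the
assumption that overlap is stored as a fraction between 0 and 1 are
hypothetical; check the header row of the actual table and adjust.

```python
import csv
import io


def low_overlap_rows(path, overlap_column="TOKEN_OVERLAP", threshold=0.98):
    """Return the rows whose overlap value is below the threshold.

    The column name and the 0-1 scale are assumptions for illustration,
    not taken from the attached table.
    """
    with open(path, encoding="utf-16-le", newline="") as handle:
        # Decoding as utf-16-le leaves a byte order mark in place, so drop it.
        text = handle.read().lstrip("\ufeff")

    selected = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        try:
            overlap = float(row[overlap_column])
        except (KeyError, TypeError, ValueError):
            # Skip rows with a missing or non-numeric overlap value.
            continue
        if overlap < threshold:
            selected.append(row)
    return selected
```

If the table stores percentages rather than fractions, passing threshold=98
should give the same selection.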
To view an original file from govdocs1 (e.g. 765470), navigate to a URL of this form:
http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
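
The URL pattern above can be sketched as a small helper. This assumes the
directory name is always the first three digits of a six-digit, zero-padded
file id, which is inferred from the single example here rather than confirmed
for the whole corpus; the function name is hypothetical.

```python
GOVDOCS1_BASE = "http://digitalcorpora.org/corp/nps/files/govdocs1"


def govdocs1_url(file_id, extension="pdf"):
    """Build the download URL for a govdocs1 file id such as 765470."""
    # Assumption: ids are zero-padded to six digits and the directory
    # is the first three digits of the padded id.
    name = f"{int(file_id):06d}"
    return f"{GOVDOCS1_BASE}/{name[:3]}/{name}.{extension}"


print(govdocs1_url(765470))
```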
was (Author: [email protected]):
Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130
workarounds removed) on a random selection of 10k PDF files from govdocs1.
Both runs used the older "sequential" parser.
The table file is a tab-delimited UTF-16LE file.
This is a first go at the initial/raw output of comparison code for TIKA-1302.
Much more work remains.
The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds.
Other than that, we should probably look at the few hundred files that have
token overlap of < 98%.
To view an original file from govdocs1 (e.g. 765470), navigate to a URL of this form:
http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
> Upgrade to PDFBox 1.8.6
> -----------------------
>
> Key: TIKA-1352
> URL: https://issues.apache.org/jira/browse/TIKA-1352
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Minor
> Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
>
>
> This is to track moving to PDFBox 1.8.6.