[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6

Tim Allison (JIRA) Mon, 23 Jun 2014 09:35:41 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison updated TIKA-1352:
------------------------------

    Attachment: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip

Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with the PDFBox-1130 
workarounds removed) on a random selection of 10k PDF files from govdocs1.

Both runs used the older "sequential" parser.

The table file is a tab-delimited UTF-16LE file.
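
For anyone who wants to load the table programmatically, here is a minimal Python sketch of reading a tab-delimited UTF-16LE file. The function name and the assumption that the first row holds column names are hypothetical; the attachment's actual layout is not described in this message.

```python
import csv
import io


def read_table(path):
    """Read a tab-delimited UTF-16LE file into a list of dicts.

    Assumption (not confirmed by the attachment): the first row
    contains the column names.
    """
    # newline="" keeps line endings intact so the csv module can parse them.
    with open(path, encoding="utf-16-le", newline="") as handle:
        text = handle.read()
    # The "utf-16-le" codec does not strip a byte-order mark, so drop it
    # by hand if the file starts with one.
    if text.startswith("\ufeff"):
        text = text[1:]
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    return list(reader)
```

Decoding explicitly as UTF-16LE matters here because most tools default to UTF-8 and would show the file as text interleaved with NUL characters.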

This is a first pass at the raw output of the comparison code for TIKA-1302.  
Much more work remains.

The ZipBomb exceptions show that we cannot yet remove the PDFBox-1130 workarounds.

Other than that, we should probably look at the few hundred files with a 
token overlap below 98%.
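
To make "token overlap" concrete, here is one possible way to compute it, sketched in Python. This is an assumption for illustration only: the whitespace tokenization, the lowercasing, and the Dice-style formula are not taken from the TIKA-1302 comparison code, which may define the measure differently.

```python
from collections import Counter


def token_overlap(text_a, text_b):
    """Return a Dice-style overlap in [0, 1] between two extracted texts.

    Hypothetical definition: tokens are lowercased, whitespace-separated
    strings, and repeated tokens are counted as many times as they occur.
    """
    tokens_a = Counter(text_a.lower().split())
    tokens_b = Counter(text_b.lower().split())
    total = sum(tokens_a.values()) + sum(tokens_b.values())
    if total == 0:
        # Two empty extractions are treated as identical.
        return 1.0
    # Counter intersection keeps the minimum count of each shared token.
    shared = sum((tokens_a & tokens_b).values())
    return 2.0 * shared / total
```

Under this definition, a file would be flagged for review when `token_overlap(old_text, new_text) < 0.98`.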

To view an original file from govdocs1 (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
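
The URL pattern can be sketched as a small Python helper. The function name is hypothetical, and the zero-padding to six digits is an assumption generalized from the single example above, where the directory is the first three digits of the file number.

```python
GOVDOCS1_BASE = "http://digitalcorpora.org/corp/nps/files/govdocs1"


def govdocs1_url(file_id, extension="pdf"):
    """Build the download URL for a govdocs1 file number.

    Assumption: file numbers are zero-padded to six digits and the
    directory name is the first three of those digits.
    """
    name = f"{int(file_id):06d}"
    return f"{GOVDOCS1_BASE}/{name[:3]}/{name}.{extension}"
```

For the example above, `govdocs1_url(765470)` gives the same address as the link.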

> Upgrade to PDFBox 1.8.6
> -----------------------
>
>                 Key: TIKA-1352
>                 URL: https://issues.apache.org/jira/browse/TIKA-1352
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)
