[ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040934#comment-14040934
 ] 

Tim Allison edited comment on TIKA-1352 at 6/23/14 5:39 PM:
------------------------------------------------------------

Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 
workarounds removed) on a random selection of 10k PDF files from govdocs1.

Both runs used the older "sequential" parser.

The table file is a tab-delimited UTF-16LE file.
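
For anyone who wants to load the table outside a spreadsheet, here is a minimal Python sketch. The function name `read_utf16le_tsv` is hypothetical, and the sketch assumes the first row of the table is a header; the actual column names in the attachment are not stated here, so none are hard-coded.

```python
import csv
import io


def read_utf16le_tsv(path):
    """Read a tab-delimited UTF-16LE file into a list of dicts keyed by header.

    Assumption: the first row holds the column names.
    """
    with open(path, encoding="utf-16-le", newline="") as handle:
        text = handle.read()
    # The "utf-16-le" codec does not strip a byte-order mark, so drop one if present.
    text = text.lstrip("\ufeff")
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter="\t")
    return list(reader)
```

Each returned row maps column name to the cell value as a string, so numeric columns need an explicit `float(...)` before comparison.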

This is a first go at the initial/raw output of comparison code for TIKA-1302.  
Much more work remains.

The ZipBomb exceptions are caused by my incorrect first attempt to remove 
PDFBox-1130 workarounds. These will go away.

Other than that, we should probably look at the few hundred files that have 
a token overlap below 98%.
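
To make the 98% threshold concrete, here is a rough Python sketch of one way a token overlap could be computed. This is an assumption for illustration only: the comparison code for TIKA-1302 may tokenize and score differently. The sketch splits on whitespace, lowercases, and uses a Dice-style overlap over token counts; `token_overlap` and `needs_review` are hypothetical names.

```python
from collections import Counter


def token_overlap(text_a, text_b):
    """Dice-style overlap of whitespace tokens, in the range 0.0 to 1.0.

    Assumption: not necessarily the measure used in the attached table.
    """
    tokens_a = Counter(text_a.lower().split())
    tokens_b = Counter(text_b.lower().split())
    total = sum(tokens_a.values()) + sum(tokens_b.values())
    if total == 0:
        return 1.0  # two empty extractions are treated as identical
    # Counter intersection keeps the minimum count of each shared token.
    shared = sum((tokens_a & tokens_b).values())
    return 2.0 * shared / total


def needs_review(text_a, text_b, threshold=0.98):
    """Flag a file whose two extractions overlap less than the threshold."""
    return token_overlap(text_a, text_b) < threshold
```

With this definition, two extractions that differ in one token out of four on each side score 0.75 and would be flagged for review.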

To view the original files from govdocs1 (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
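
The URL pattern above can be sketched as a small Python helper for looking up many files at once. The function name `govdocs1_url` is hypothetical, and the layout is inferred from the single example: it assumes file ids are zero-padded to six digits and that the directory is the first three digits of the padded id.

```python
BASE_URL = "http://digitalcorpora.org/corp/nps/files/govdocs1"


def govdocs1_url(file_id, extension="pdf"):
    """Build the download URL for a govdocs1 file id.

    Assumptions: ids are zero-padded to six digits, and the directory
    name is the first three digits of the padded id.
    """
    name = f"{int(file_id):06d}"
    return f"{BASE_URL}/{name[:3]}/{name}.{extension}"
```

For the example id, `govdocs1_url(765470)` gives the URL shown above.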


was (Author: [email protected]):
Diffs between PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 
workarounds removed) on a random selection of 10k PDF files from govdocs1.

Both runs used the older "sequential" parser.

The table file is a tab-delimited UTF-16LE file.

This is a first go at the initial/raw output of comparison code for TIKA-1302.  
Much more work remains.

The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds. 

Other than that, we should probably look at the few hundred files that have 
token overlap of < 98%.

To view the original files from gov docs (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf

> Upgrade to PDFBox 1.8.6
> -----------------------
>
>                 Key: TIKA-1352
>                 URL: https://issues.apache.org/jira/browse/TIKA-1352
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip
>
>
> This is to track moving to PDFBox 1.8.6.



