[ 
https://issues.apache.org/jira/browse/TIKA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215850#comment-15215850
 ] 

Tim Allison commented on TIKA-1912:
-----------------------------------

Overall, I see two options:

1. Improve PDFBox 2.0.x's handling of truncated files.
2. Shade 1.8.x and use that as a backoff parser.

As [~jahewson] pointed out, it would be far better to improve PDFBox 2.0.0's 
handling of truncated files, and I agree.  On this 
[thread|http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56F9466D.90305%40lehmi.de%3E],
 it looked like there may be some willingness on the PDFBox team to work on 
this.

For the second option, I've set up a standalone project on GitHub that shades 
PDFBox 1.8.11 [here|https://github.com/tballison/tika-addons] and uses Tika's 
last pre-2.0.0 PDFParser.  It was mildly tricky because TextStripper loads 
classes from a .properties file that wasn't automatically shaded.  I'm sure 
there is a more elegant solution than what I did; advice is welcome!
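
To make the "backoff parser" idea in option 2 concrete, here is a minimal 
sketch of the control flow only.  It assumes nothing about Tika's or PDFBox's 
actual APIs: TextExtractor, extractWithBackoff, and the two lambdas in main 
are hypothetical stand-ins for "the 2.0.x-based parser" and "the shaded 
1.8.x-based parser".

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class BackoffParserSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface TextExtractor {
        String extract(byte[] data) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first successful result.
     * If all of them fail, rethrows the first failure with the later ones
     * attached as suppressed exceptions.
     */
    static String extractWithBackoff(byte[] data, List<TextExtractor> extractors)
            throws Exception {
        Exception firstFailure = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(data);
            } catch (Exception e) {
                if (firstFailure == null) {
                    firstFailure = e;
                } else {
                    firstFailure.addSuppressed(e);
                }
            }
        }
        if (firstFailure == null) {
            throw new IllegalArgumentException("no extractors supplied");
        }
        throw firstFailure;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the primary parser rejects a truncated file outright.
        TextExtractor primary = data -> {
            throw new IOException("truncated file");
        };
        // Assumption: the older, more lenient parser still recovers some text.
        TextExtractor backoff = data -> "partial text";

        System.out.println(
                extractWithBackoff(new byte[0], Arrays.asList(primary, backoff)));
    }
}
```

In a real setup the second extractor would delegate to the shaded 1.8.11 
classes, so the older parser only runs on files the newer one cannot handle.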

> Figure out how to parse truncated PDFs that were handled by PDFBox 1.8.x but 
> not by 2.0.0
> -----------------------------------------------------------------------------------------
>
>                 Key: TIKA-1912
>                 URL: https://issues.apache.org/jira/browse/TIKA-1912
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>
> While working on TIKA-1285, we found that PDFBox 2.0.0 is not able to handle 
> truncated files as well as PDFBox 1.8.11.  Let's figure out how to gain the 
> benefits from 2.0.0 without losing the ability to extract some content from 
> truncated files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
