[
https://issues.apache.org/jira/browse/TIKA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215850#comment-15215850
]
Tim Allison commented on TIKA-1912:
-----------------------------------
Overall, I see two options:
1. Improve PDFBox 2.0.x's handling of truncated files.
2. Shade 1.8.x and use that as a backoff parser.
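To make "backoff parser" in option 2 concrete, here is a minimal sketch of the pattern: try the preferred extractor first and fall back to the next one only if it throws. The TextExtractor interface and the BackoffSketch class are hypothetical stand-ins for illustration; they are not Tika's Parser API, and a real version would wrap the 2.0.x parser and the shaded 1.8.x parser.

```java
import java.util.Arrays;
import java.util.List;

public class BackoffSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface TextExtractor {
        String extract(byte[] data) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first successful result.
     * If all of them fail, rethrows the first failure with the later ones
     * attached as suppressed exceptions.
     */
    static String extractWithBackoff(byte[] data, List<TextExtractor> extractors)
            throws Exception {
        Exception first = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(data);
            } catch (Exception ex) {
                if (first == null) {
                    first = ex;
                } else {
                    first.addSuppressed(ex);
                }
            }
        }
        if (first == null) {
            throw new IllegalArgumentException("no extractors configured");
        }
        throw first;
    }

    public static void main(String[] args) throws Exception {
        // Simulates a strict parser that rejects a truncated file.
        TextExtractor strict = data -> { throw new java.io.IOException("truncated"); };
        // Simulates a lenient parser that recovers some content.
        TextExtractor lenient = data -> "partial text";
        List<TextExtractor> chain = Arrays.asList(strict, lenient);
        System.out.println(extractWithBackoff(new byte[0], chain));
    }
}
```

One design question this leaves open is whether the fallback should run on every exception or only on those that look like truncation; the sketch falls back on any exception.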
As [~jahewson] pointed out, it would be far better to improve PDFBox 2.0.0's
handling of truncated files, and I agree. On this
[thread|http://mail-archives.apache.org/mod_mbox/pdfbox-dev/201603.mbox/%3C56F9466D.90305%40lehmi.de%3E],
it looked like there may be some willingness on the PDFBox team to work on
this.
For the second option, I've set up a standalone project on GitHub that shades
PDFBox 1.8.11 [here|https://github.com/tballison/tika-addons] and uses Tika's
last pre-2.0.0 PDFParser. It was mildly tricky because TextStripper loads
classes from a .properties file that wasn't automatically shaded. I'm sure
there is a more elegant solution than what I did; advice is welcome!
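For anyone hitting the same .properties problem, one possible workaround is to rewrite the class names in the file so they point at the relocated package. The sketch below assumes the property values are fully qualified class names and that the shaded classes live under a "shaded." prefix; both the prefix and the sample entry are illustrative assumptions, not necessarily what the tika-addons project does.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class RelocateProperties {

    /**
     * Returns a copy of props in which every value that starts with
     * fromPrefix starts with toPrefix instead. Other values are copied as-is.
     */
    static Properties relocate(Properties props, String fromPrefix, String toPrefix) {
        Properties out = new Properties();
        for (String key : props.stringPropertyNames()) {
            String value = props.getProperty(key);
            if (value.startsWith(fromPrefix)) {
                value = toPrefix + value.substring(fromPrefix.length());
            }
            out.setProperty(key, value);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        Properties original = new Properties();
        // Illustrative entry only; the real file's contents are not reproduced here.
        original.load(new StringReader("BT=org.apache.pdfbox.util.operator.BeginText\n"));
        // "shaded." is a hypothetical relocation prefix chosen for this example.
        Properties shaded =
                relocate(original, "org.apache.pdfbox.", "shaded.org.apache.pdfbox.");
        System.out.println(shaded.getProperty("BT"));
    }
}
```

The same rewrite could probably be done at build time instead of at runtime, which would keep the shaded jar self-consistent.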
> Figure out how to parse truncated PDFs that were handled by PDFBox 1.8.x but
> not by 2.0.0
> -----------------------------------------------------------------------------------------
>
> Key: TIKA-1912
> URL: https://issues.apache.org/jira/browse/TIKA-1912
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
>
> While working on TIKA-1285, we found that PDFBox 2.0.0 is not able to handle
> truncated files as well as PDFBox 1.8.11. Let's figure out how to gain the
> benefits from 2.0.0 without losing the ability to extract some content from
> truncated files.