On 25.03.2016 at 17:39, John Hewson wrote:
On 23 Mar 2016, at 06:20, Allison, Timothy B. <[email protected]> wrote:
All,
We've upgraded to 2.0.0 on Tika. Many thanks again!
One of our users is interested in continuing to use the
classic/SequentialParser, or at least having it available as a back-off parser
for corrupt pdfs [0].
Using the old parser really isn’t a good idea; it’s known to be pretty broken.
I think that we would be much better off making sure the new parser can handle
truncated files. We already do a lot of repair in the new parser, so this
doesn’t seem like too much work? Maybe Andreas can comment further?
The biggest issue here is a truncated stream or dictionary. The current
version simply throws an exception when it runs into such a case. We
have to implement some algorithm to ignore such incomplete parts of a PDF if
possible.
BR
Andreas
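
To illustrate the idea above of ignoring incomplete parts rather than throwing, here is a minimal sketch in plain Java. It is not PDFBox code: the class LenientStreamReader, its Result type and the readLenient method are hypothetical names invented for this example, and it assumes the stream's declared length is already known. It reads up to the declared length and, if the data ends early, returns what was recovered together with a flag instead of raising an exception.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
public class LenientStreamReader {

    /** Bytes recovered from the input, plus whether the input ended early. */
    public static final class Result {
        public final byte[] data;
        public final boolean truncated;

        Result(byte[] data, boolean truncated) {
            this.data = data;
            this.truncated = truncated;
        }
    }

    /**
     * Reads at most declaredLength bytes. If the input ends before that,
     * the bytes read so far are returned and marked as truncated, so the
     * caller can decide whether to keep or skip the incomplete part.
     */
    public static Result readLenient(InputStream in, int declaredLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = declaredLength;
        while (remaining > 0) {
            int n = in.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n < 0) {
                // Premature end of data: report it instead of throwing.
                return new Result(out.toByteArray(), true);
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return new Result(out.toByteArray(), false);
    }

    public static void main(String[] args) throws IOException {
        // The declared length (100) is larger than the data actually present.
        byte[] available = "BT /F1 12 Tf".getBytes(StandardCharsets.US_ASCII);
        Result r = readLenient(new ByteArrayInputStream(available), 100);
        System.out.println(r.data.length + " bytes recovered, truncated=" + r.truncated);
    }
}
```

Returning a flag rather than throwing leaves the policy to the caller: a parser could keep the partial content, drop the object, or log a warning.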
Do we have any JIRA issues that identify some of these cases?
— John
Would you be willing to distribute a shaded/relocated 1.8.x app so that we
could load both 1.8.x and 2.0.0 in the same jvm without collisions? Or, is
there a better solution?
I wouldn’t recommend doing that, because you’re going to be stuck with using
1.8 for everything, not just parsing, at least as far as corrupt/truncated
files are concerned.
— John
Thank you!
Cheers,
Tim
[0]
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]