> On 29 Mar 2016, at 04:11, Andreas Lehmkühler <[email protected]> wrote:
>
>> On 28 March 2016 at 21:02, "Allison, Timothy B." <[email protected]> wrote:
>>
>> Oh, wow, so it really might be possible without too much work? I'm more than
>> happy to supply examples. :)
>
> Oops, it isn't as simple as it sounds. If we simply swallow the exception,
> PDFBox most likely runs into an NPE. IMHO we have to implement some sort of
> on-demand parser which is able to handle null values for specific parts of a
> PDF without throwing any exception.
One thought: instead of null, it might be possible to return an empty string, empty dictionary, empty array, empty stream, etc. That way we don’t have to check for null everywhere.

-- John

>> Should I open an issue?
>
> Thanks, but I'm going to do that soon, as some other things should be done as
> well.
>
> BR
> Andreas
>
>> -----Original Message-----
>> From: Andreas Lehmkuehler [mailto:[email protected]]
>> Sent: Monday, March 28, 2016 10:58 AM
>> To: [email protected]
>> Subject: Re: shading/relocating 1.8.x?
>>
>> On 25.03.2016 at 17:39, John Hewson wrote:
>>>
>>>> On 23 Mar 2016, at 06:20, Allison, Timothy B. <[email protected]> wrote:
>>>>
>>>> All,
>>>> We've upgraded to 2.0.0 on Tika. Many thanks again!
>>>> One of our users is interested in continuing to use the
>>>> classic/SequentialParser, or at least having it available as a back-off
>>>> parser for corrupt PDFs [0].
>>>
>>> Using the old parser really isn’t a good idea, it’s known to be pretty
>>> broken. I think that we would be much better off making sure the new parser
>>> can handle truncated files. We already do a lot of repair in the new parser,
>>> so this doesn’t seem like too much work? Maybe Andreas can comment further?
>>
>> The biggest issue here is the truncated stream or dictionary. The current
>> version simply throws an exception when running into such cases. We
>> have to implement some algorithm to ignore such incomplete parts of a PDF if
>> possible.
>>
>> BR
>> Andreas
>>
>>>
>>> Do we have some JIRA issues which identify some of these cases?
>>>
>>> -- John
>>>
>>>> Would you be willing to distribute a shaded/relocated 1.8.x app so that
>>>> we could load both 1.8.x and 2.0.0 in the same JVM without collisions? Or,
>>>> is there a better solution?
>>>
>>> I wouldn’t recommend doing that, because you’re going to be stuck with using
>>> 1.8 for everything, not just parsing, at least as far as corrupt/truncated
>>> files are concerned.
>>>
>>> -- John
>>>
>>>> Thank you!
>>>>
>>>> Cheers,
>>>>
>>>> Tim
>>>>
>>>> [0]
>>>> https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208360#comment-15208360
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
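
P.S. To make the "empty object instead of null" idea from the top of this message a bit more concrete, here is a minimal sketch in Java. Everything in it is an assumption made up for illustration: `Dict`, `getDict`, `getString` and `readTitle` are hypothetical names, not PDFBox classes or methods, and this is not how the PDFBox parser actually works. It only shows that chained lookups on a truncated structure can fall through to empty values without any null checks.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class EmptyInsteadOfNull {

    /** Hypothetical stand-in for a parsed PDF dictionary; not a PDFBox class. */
    static final class Dict {
        /** Shared empty instance returned in place of null. */
        static final Dict EMPTY = new Dict(Collections.emptyMap());

        private final Map<String, Object> entries;

        Dict(Map<String, Object> entries) {
            this.entries = entries;
        }

        /** Returns the nested dictionary, or EMPTY if it is missing or not a dictionary. */
        Dict getDict(String key) {
            Object value = entries.get(key);
            return value instanceof Dict ? (Dict) value : EMPTY;
        }

        /** Returns the string value, or "" if it is missing or not a string. */
        String getString(String key) {
            Object value = entries.get(key);
            return value instanceof String ? (String) value : "";
        }

        boolean isEmpty() {
            return entries.isEmpty();
        }
    }

    /** Reads a title through two levels of lookups without any null checks. */
    static String readTitle(Dict trailer) {
        return trailer.getDict("Info").getString("Title");
    }

    public static void main(String[] args) {
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("Title", "Report");
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("Info", new Dict(info));

        Dict intact = new Dict(root);
        // Simulates a file that was truncated before the Info dictionary could be read.
        Dict truncated = new Dict(new LinkedHashMap<>());

        System.out.println("intact: '" + readTitle(intact) + "'");
        System.out.println("truncated: '" + readTitle(truncated) + "'");
    }
}
```

For the truncated case `readTitle` returns an empty string rather than throwing an NPE. The trade-off is that callers can no longer tell "missing" from "present but empty" unless they check something like `isEmpty()`, so a real implementation would probably need a way to flag which parts were recovered.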
