Hi John,
Normally, I'd agree. And, yes, I've been extremely grateful for the effort put
into dealing with noisy PDFs in 2.0.0. However, I think that the Tika user
requesting this is interested in getting what he can from truncated and truly
broken files -- e.g. Common Crawl data, which (I think) truncates files at 1MB,
or files whose download was interrupted. My basic rule for opening an
issue is that if AR or another PDF parser can't parse a file, I'm not going to
ask for help.
I wouldn't want to direct your efforts to dealing with the edge cases
of truncated files. If the old PDFParser is able to get something out because
it parsed sequentially, then it would be neat to be able to have that available
with very little effort. In Tika, we envision allowing users to configure
combinations of parsers for a given file; this would be the perfect case for
the back-off-on-exception strategy -- if there's an exception with 2.0.0, try
again with 1.8.x.
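Roughly, the back-off idea could look like the sketch below. This is only an
illustration: TextExtractor and extractWithFallback are hypothetical names,
not actual Tika or PDFBox APIs, and the two lambdas merely stand in for the
2.0.0 parser and a shaded 1.8.x parser.

```java
import java.util.ArrayList;
import java.util.List;

public class FallbackParserSketch {

    /** Hypothetical stand-in for a PDF parser; not an actual Tika or PDFBox interface. */
    interface TextExtractor {
        String extract(byte[] pdfBytes) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first result that does not throw.
     * If every extractor fails, throws an unchecked exception carrying all failures.
     */
    static String extractWithFallback(byte[] pdfBytes, List<TextExtractor> extractors) {
        IllegalStateException allFailed = new IllegalStateException("all extractors failed");
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(pdfBytes);
            } catch (Exception e) {
                // Remember the failure, then back off to the next extractor.
                allFailed.addSuppressed(e);
            }
        }
        throw allFailed;
    }

    public static void main(String[] args) {
        // Stand-in for the 2.0.0 parser rejecting a truncated file.
        TextExtractor strict = bytes -> { throw new java.io.IOException("truncated file"); };
        // Stand-in for a shaded 1.8.x parser that salvages some text.
        TextExtractor lenient = bytes -> "partial text";

        List<TextExtractor> chain = new ArrayList<>();
        chain.add(strict);
        chain.add(lenient);

        System.out.println(extractWithFallback(new byte[0], chain));
    }
}
```

Keeping every failure as a suppressed exception means that, if both parsers
give up, the caller can still see why each one did.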
I'll try shading/relocating next week, and see whether that works as expected.
Thank you, all, again!
Cheers,
Tim
-----Original Message-----
From: John Hewson [mailto:[email protected]]
Sent: Friday, March 25, 2016 1:03 PM
To: [email protected]
Subject: Re: shading/relocating 1.8.x?
> On 25 Mar 2016, at 09:44, Tilman Hausherr <[email protected]> wrote:
>
> Am 25.03.2016 um 17:39 schrieb John Hewson:
>> Do we have some JIRA issues which identify some of these cases?
>
> https://issues.apache.org/jira/browse/PDFBOX-3265
>
Great! Does anyone else have some others?
— John
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected] For
> additional commands, e-mail: [email protected]
>