Hi John,

  Normally, I'd agree.  And, yes, I've been extremely grateful for the effort
put into dealing with noisy PDFs in 2.0.0.  However, I think that the Tika user
requesting this is interested in getting what he can from truncated and truly
broken files -- e.g. Common Crawl data, which (I think) truncates files at 1MB,
or files whose download was interrupted.  My basic rule for opening an issue is
that if AR or another PDF parser can't parse the file, I'm not going to ask for
help.
 
   I wouldn't want to direct your efforts toward the edge cases of truncated
files.  But if the old PDFParser is able to get something out because it parsed
sequentially, then it would be neat to have that available with very little
effort.  In Tika, we envision allowing users to configure combinations of
parsers for a given file; this would be the perfect case for the
back-off-on-exception strategy -- if there's an exception with 2.0.0, try
again with 1.8.x.
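
   To make the back-off idea concrete, here is a minimal sketch in plain Java.
It is only an illustration under my own assumptions: FallbackParsing,
TextExtractor and extractWithFallback are hypothetical names, not Tika or
PDFBox API, and each extractor simply stands in for "parse with 2.0.0" or
"parse with the relocated 1.8.x".

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
final class FallbackParsing {

    /** Hypothetical stand-in for a parser: turns raw bytes into text, or throws. */
    interface TextExtractor {
        String extract(byte[] data) throws Exception;
    }

    /**
     * Tries each extractor in order and returns the first successful result.
     * If every extractor fails, rethrows the first failure with the later
     * failures attached as suppressed exceptions.
     */
    static String extractWithFallback(byte[] data, List<TextExtractor> extractors)
            throws Exception {
        Exception firstFailure = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(data);
            } catch (Exception e) {
                if (firstFailure == null) {
                    firstFailure = e;
                } else {
                    firstFailure.addSuppressed(e);
                }
            }
        }
        if (firstFailure == null) {
            throw new IllegalArgumentException("no extractors configured");
        }
        throw firstFailure;
    }
}
```

   The list order encodes the preference: the newer parser goes first, and the
older one is consulted only after an exception.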

  I'll try shading/relocating next week, and see whether that works as expected.

  Thank you, all, again!

              Cheers,

                        Tim


-----Original Message-----
From: John Hewson [mailto:[email protected]] 
Sent: Friday, March 25, 2016 1:03 PM
To: [email protected]
Subject: Re: shading/relocating 1.8.x?


> On 25 Mar 2016, at 09:44, Tilman Hausherr <[email protected]> wrote:
> 
> On 25.03.2016 at 17:39, John Hewson wrote:
>> Do we have some JIRA issues which identify some of these cases?
> 
> https://issues.apache.org/jira/browse/PDFBOX-3265
> 

Great! Does anyone else have some others?

— John

> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected] For 
> additional commands, e-mail: [email protected]
> 

