Yes, I think you have it right.

> Tika library has a big problem with crashes and freezes

I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but 
when you are processing millions/billions of files in the wild [1], they will 
happen.  We fix the problems or try to get our dependencies to fix the problems 
when we can, but given our past history, I have no reason to believe that these 
problems won't happen again.

Thank you, again!

Best,

            Tim

[1] Stuff on the internet or ... some of our users are forensics examiners 
dealing with broken/corrupted files

P.S./FTR  😊
1) We've gathered a TB of data from CommonCrawl, and we run regression tests 
against this TB (thank you, Rackspace, for hosting our VM!) to try to identify 
these problems. 
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single-box fileshare-to-fileshare processing 
for our low-volume users.
4) We're trying to get the message out.  Thank you for working with us!!!
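
For illustration, here is a minimal sketch of the kind of robust wrapper discussed in this thread: each document is handled in its own child process with a timeout, and anything that crashes or hangs is skipped and recorded. This is not how tika-batch is implemented. The names run_isolated, process_all and DEMO_JOBS are hypothetical, and the child commands are stand-in Python one-liners rather than a real Tika invocation.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s):
    """Run one parse command in a child process.

    Returns (status, output), where status is "ok", "crashed" or "timeout".
    """
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a
        # permanent hang cannot take the parent down with it.
        return ("timeout", "")
    if done.returncode != 0:
        # Non-zero exit covers ordinary failures as well as hard crashes.
        return ("crashed", "")
    return ("ok", done.stdout)


def process_all(jobs, timeout_s=5.0):
    """Process every job; collect outputs and the reasons for skips."""
    results, skipped = {}, {}
    for name, cmd in jobs.items():
        status, output = run_isolated(cmd, timeout_s)
        if status == "ok":
            results[name] = output
        else:
            skipped[name] = status
    return results, skipped


# Stand-in commands (assumptions for the demo only): one succeeds,
# one exits abnormally, one hangs far longer than the timeout.
DEMO_JOBS = {
    "good.doc": [sys.executable, "-c", "print('extracted text')"],
    "crash.doc": [sys.executable, "-c", "import os; os._exit(3)"],
    "hang.doc": [sys.executable, "-c", "import time; time.sleep(600)"],
}
```

Calling process_all(DEMO_JOBS) returns the text for good.doc and reports the other two as skipped. Because the parent process survives a crash or hang in the child, it can mark a file as unprocessable instead of dying and being retried on the same file.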

-----Original Message-----
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID] 
Sent: Friday, September 22, 2017 12:48 PM
To: dev@beam.apache.org
Cc: d...@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
From what you're saying it sounds like the Tika library has a big problem with 
crashes and freezes, and applying it at scale (e.g. in the context of Beam) 
requires explicitly addressing this problem, e.g. accepting the fact that in 
many realistic applications some documents will just need to be skipped because 
they are unprocessable? This would be the first example of a Beam IO that has this 
concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:

> Reuven,
>
> Thank you!  This suggests to me that it is a good idea to integrate 
> Tika with Beam so that people don't have to 1) (re)discover the need 
> to make their wrappers robust and then 2) have to reinvent these 
> wheels for robustness.
>
> For kicks, see William Palmer's post on his toe-stubbing efforts with 
> Hadoop [1].  He and other Tika users independently have wound up 
> carrying out exactly your recommendation for 1) below.
>
> We have a MockParser that you can get to simulate regular exceptions, 
> OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
>
> > However if processing the document causes the process to crash, then 
> > it will be retried.
> Any ideas on how to get around this?
>
> Thank you again.
>
> Cheers,
>
>            Tim
>
> [1]
> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> [2]
> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
>
