Sure - with hundreds of different file formats and the abundance of weird / malformed / malicious files in the wild, it's expected that the library will sometimes crash.
Some kinds of issues are easier to address than others. We can catch exceptions and return a ParseResult representing a failure to parse the document. Addressing freezes and native JVM process crashes is much harder and probably not necessary in the first version.

Sergey - I think that the moment you introduce ParseResult into the code, the other changes I suggested will follow "by construction":

- There'll be one ParseResult per document, containing filename, content and metadata, since per the discussion above it probably doesn't make sense to deliver these in separate PCollection elements.
- Since you're returning a single value per document, there's no reason to use a BoundedReader.
- Likewise, there's no reason to use asynchronicity, because you're not delivering the result incrementally.

I'd suggest starting the refactoring by removing the asynchronous codepath, then converting from BoundedReader to ParDo or MapElements, then converting from String to ParseResult.

On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:

> Hi Tim, All
>
> On 22/09/17 18:17, Allison, Timothy B. wrote:
> > Y, I think you have it right.
> >
> >> Tika library has a big problem with crashes and freezes
> >
> > I wouldn't want to overstate it. Crashes and freezes are exceedingly
> > rare, but when you are processing millions/billions of files in the
> > wild [1], they will happen. We fix the problems or try to get our
> > dependencies to fix the problems when we can,
>
> I would only like to add that IMHO it would be more correct to say
> it's not the Tika library's 'fault' that the crashes might occur.
> Tika does its best to use the latest libraries that help it parse the
> files, but indeed there will always be some file out there using some
> incomplete format-specific tag etc. which may cause the specific
> parser to spin - but Tika will include the updated parser library asap.
>
> And with Beam's help, the crashes that can kill the Tika jobs
> completely will probably become history...
>
> Cheers, Sergey
>
> > but given our past history, I have no reason to believe that these
> > problems won't happen again.
> >
> > Thank you, again!
> >
> > Best,
> >
> > Tim
> >
> > [1] Stuff on the internet or ... some of our users are forensics
> > examiners dealing with broken/corrupted files
> >
> > P.S./FTR 😊
> > 1) We've gathered a TB of data from CommonCrawl and we run regression
> > tests against this TB (thank you, Rackspace, for hosting our vm!) to
> > try to identify these problems.
> > 2) We've started a fuzzing effort to try to identify problems.
> > 3) We added "tika-batch" for robust single-box fileshare/fileshare
> > processing for our low-volume users.
> > 4) We're trying to get the message out. Thank you for working with us!!!
> >
> > -----Original Message-----
> > From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
> > Sent: Friday, September 22, 2017 12:48 PM
> > To: dev@beam.apache.org
> > Cc: d...@tika.apache.org
> > Subject: Re: TikaIO concerns
> >
> > Hi Tim,
> >
> > From what you're saying, it sounds like the Tika library has a big
> > problem with crashes and freezes, and applying it at scale (e.g. in
> > the context of Beam) requires explicitly addressing this problem,
> > e.g. accepting the fact that in many realistic applications some
> > documents will just need to be skipped because they are
> > unprocessable? This would be the first example of a Beam IO that has
> > this concern, so I'd like to confirm that my understanding is correct.
> >
> > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
> > wrote:
> >
> >> Reuven,
> >>
> >> Thank you! This suggests to me that it is a good idea to integrate
> >> Tika with Beam so that people don't have to 1) (re)discover the need
> >> to make their wrappers robust and then 2) have to reinvent these
> >> wheels for robustness.
> >>
> >> For kicks, see William Palmer's post on his toe-stubbing efforts
> >> with Hadoop [1]. He and other Tika users have independently wound up
> >> carrying out exactly your recommendation for 1) below.
> >>
> >> We have a MockParser that you can use to simulate regular
> >> exceptions, OOMs and permanent hangs by asking Tika to parse a
> >> <mock> xml [2].
> >>
> >>> However if processing the document causes the process to crash,
> >>> then it will be retried.
> >>
> >> Any ideas on how to get around this?
> >>
> >> Thank you again.
> >>
> >> Cheers,
> >>
> >> Tim
> >>
> >> [1]
> >> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> >> [2]
> >> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
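[Editor's note] The "one ParseResult per document" refactoring suggested at the top of this thread can be sketched in plain Java. This is a minimal illustration under assumptions, not TikaIO's actual API: the names (`ParseResult`, `parseSafely`, the `metadata` map) are hypothetical, and in a real pipeline a value like this would be emitted from a ParDo or MapElements, one element per input document, with parse exceptions captured instead of crashing the worker.

```java
import java.util.HashMap;
import java.util.Map;

public class ParseResultSketch {

    // Hypothetical value type: one per document, carrying filename,
    // extracted content, metadata, and the error if parsing failed.
    static final class ParseResult {
        final String filename;
        final String content;              // empty on failure
        final Map<String, String> metadata;
        final Throwable error;             // null on success

        private ParseResult(String filename, String content,
                            Map<String, String> metadata, Throwable error) {
            this.filename = filename;
            this.content = content;
            this.metadata = metadata;
            this.error = error;
        }

        static ParseResult success(String filename, String content,
                                   Map<String, String> metadata) {
            return new ParseResult(filename, content, metadata, null);
        }

        static ParseResult failure(String filename, Throwable error) {
            return new ParseResult(filename, "", new HashMap<>(), error);
        }

        boolean isSuccess() { return error == null; }
    }

    // Stand-in for the actual Tika parse call: exceptions are caught and
    // turned into failed ParseResults rather than propagated, so one
    // malformed document cannot kill the whole job.
    static ParseResult parseSafely(String filename, String text) {
        try {
            if (text == null) {
                throw new IllegalArgumentException("unreadable document");
            }
            Map<String, String> metadata = new HashMap<>();
            metadata.put("resourceName", filename);
            return ParseResult.success(filename, text.trim(), metadata);
        } catch (Exception e) {
            return ParseResult.failure(filename, e);
        }
    }

    public static void main(String[] args) {
        ParseResult ok = parseSafely("a.txt", " hello ");
        ParseResult bad = parseSafely("b.bin", null);
        System.out.println(ok.isSuccess() + " " + ok.content);          // true hello
        System.out.println(bad.isSuccess() + " " + bad.error.getMessage()); // false unreadable document
    }
}
```

Downstream transforms could then partition the output on `isSuccess()`, sending failures to a dead-letter collection. Note this sketch only covers catchable exceptions; as discussed above, freezes and native JVM crashes need separate handling (timeouts, process isolation) and are out of scope here.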