On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> Hi,
> On 22/09/17 22:02, Eugene Kirpichov wrote:
> > Sure - with hundreds of different file formats and the abundance of
> > weird / malformed / malicious files in the wild, it's quite expected
> > that sometimes the library will crash.
> >
> > Some kinds of issues are easier to address than others. We can catch
> > exceptions and return a ParseResult representing a failure to parse
> > this document. Addressing freezes and native JVM process crashes is
> > much harder and probably not necessary in the first version.
> >
> > Sergey - I think the moment you introduce ParseResult into the code,
> > the other changes I suggested will follow "by construction":
> > - There'll be 1 ParseResult per document, containing filename, content
> > and metadata, since per the discussion above it probably doesn't make
> > sense to deliver these in separate PCollection elements.
>
> I was still harboring the hope that maybe using a container bean like
> ParseResult (with the other changes you proposed) can somehow let us
> stream from Tika into the pipeline.
>
> If it is 1 ParseResult per document, then it means that until Tika has
> parsed the whole document, the pipeline will not see it.

This is correct, and this is the API I'm suggesting to start with, because
it's simple and sufficiently useful. I suggest getting into this state
first, and then dealing with creating a separate API that allows not
holding the entire parse result as a single PCollection element in memory.
This should work fine for cases when each document's parse result (not the
input document itself!) is up to a few hundred megabytes in size.

> I'm sorry if I may be starting to go in circles, but let me ask this:
> how can a Beam user write a Beam function which will ensure the Tika
> content pieces are seen ordered by the pipeline, without TikaIO?

To answer this, I'd need you to clarify what you mean by "seen ordered by
the pipeline" - order is a very vague term when it comes to parallel
processing.
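[Editor's illustration] To make the ParseResult idea above concrete, here is a minimal sketch of such a per-document container bean: filename, content, and metadata travel together in one element, and a failure variant turns an unparseable document into data rather than a crashed bundle. All names here are assumptions for illustration, not the actual TikaIO API.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical per-document result container (names are illustrative,
// not the actual TikaIO API). One instance per input document.
class ParseResult {
    private final String fileName;
    private final String content;               // extracted text; null on failure
    private final Map<String, String> metadata; // e.g. Content-Type, author
    private final Throwable error;              // non-null when parsing failed

    private ParseResult(String fileName, String content,
                        Map<String, String> metadata, Throwable error) {
        this.fileName = fileName;
        this.content = content;
        this.metadata = metadata;
        this.error = error;
    }

    // A successful parse: filename, full content, and metadata together,
    // so they never travel as separate PCollection elements.
    static ParseResult success(String fileName, String content,
                               Map<String, String> metadata) {
        return new ParseResult(fileName, content, metadata, null);
    }

    // A failed parse still produces an element the pipeline can log or skip.
    static ParseResult failure(String fileName, Throwable error) {
        return new ParseResult(fileName, null, Collections.emptyMap(), error);
    }

    boolean isSuccess() { return error == null; }
    String getFileName() { return fileName; }
    String getContent()  { return content; }
    Map<String, String> getMetadata() { return metadata; }
    Throwable getError() { return error; }
}
```

The tradeoff discussed in the thread applies: the whole extracted content lives in a single element, so this shape fits documents whose parse output is at most a few hundred megabytes.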
What would you like the pipeline to compute that requires order within a
document, but does NOT require having the contents of a document as a
single String? Or are you asking simply how users can use Tika for
arbitrary use cases without TikaIO?

> Maybe knowing that will help in coming up with an idea of how to
> generalize somehow with the help of TikaIO?
>
> > - Since you're returning a single value per document, there's no
> > reason to use a BoundedReader.
> > - Likewise, there's no reason to use asynchronicity, because you're
> > not delivering the result incrementally.
> >
> > I'd suggest starting the refactoring by removing the asynchronous
> > codepath, then converting from BoundedReader to ParDo or MapElements,
> > then converting from String to ParseResult.
>
> This is a good plan, thanks. I guess at least for small documents it
> should work well (unless I've misunderstood the ParseResult idea).
>
> Thanks, Sergey
>
> > On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com>
> > wrote:
> >
> >> Hi Tim, All
> >> On 22/09/17 18:17, Allison, Timothy B. wrote:
> >>> Y, I think you have it right.
> >>>
> >>>> Tika library has a big problem with crashes and freezes
> >>>
> >>> I wouldn't want to overstate it. Crashes and freezes are exceedingly
> >>> rare, but when you are processing millions/billions of files in the
> >>> wild [1], they will happen. We fix the problems, or try to get our
> >>> dependencies to fix the problems, when we can,
> >>
> >> I would only like to add that IMHO it would be more correct to state
> >> that it's not the Tika library's 'fault' that the crashes might occur.
> >> Tika does its best to use the latest libraries that help it parse the
> >> files, but indeed there will always be some file out there that might
> >> use some incomplete format-specific tag etc. which may cause the
> >> specific parser to spin - but Tika will include the updated parser
> >> library asap.
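[Editor's illustration] The element-wise refactoring suggested above (a synchronous ParDo or MapElements instead of a BoundedReader) boils down to a plain function from one input document to one output element, with exceptions caught per element. A self-contained sketch with a stand-in parser instead of the real Tika AutoDetectParser; the class name and the "OK:"/"FAILED:" markers are assumptions, not Beam or Tika API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the synchronous, element-wise shape a ParDo/MapElements-based
// TikaIO could take. The Function stands in for a real Tika parser call;
// the "OK:"/"FAILED:" tags stand in for a proper ParseResult type.
class SafeParseFn {
    static List<String> parseAll(List<String> docs,
                                 Function<String, String> parser) {
        List<String> out = new ArrayList<>();
        for (String doc : docs) {
            try {
                // One input element -> one output element, no async codepath.
                out.add("OK:" + parser.apply(doc));
            } catch (RuntimeException e) {
                // A malformed document becomes a failure element,
                // not a crashed bundle.
                out.add("FAILED:" + doc);
            }
        }
        return out;
    }
}
```

In a real pipeline, the loop body would be the processElement method of a DoFn, with Beam supplying the per-element iteration and parallelism.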
> >>
> >> And with Beam's help, the crashes that can kill Tika jobs completely
> >> will probably become history...
> >>
> >> Cheers, Sergey
> >>
> >>> but given our past history, I have no reason to believe that these
> >>> problems won't happen again.
> >>>
> >>> Thank you, again!
> >>>
> >>> Best,
> >>>
> >>> Tim
> >>>
> >>> [1] Stuff on the internet, or ... some of our users are forensics
> >>> examiners dealing with broken/corrupted files
> >>>
> >>> P.S./FTR 😊
> >>> 1) We've gathered a TB of data from CommonCrawl, and we run
> >>> regression tests against this TB (thank you, Rackspace, for hosting
> >>> our vm!) to try to identify these problems.
> >>> 2) We've started a fuzzing effort to try to identify problems.
> >>> 3) We added "tika-batch" for robust single box fileshare/fileshare
> >>> processing for our low volume users.
> >>> 4) We're trying to get the message out. Thank you for working with
> >>> us!!!
> >>>
> >>> -----Original Message-----
> >>> From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
> >>> Sent: Friday, September 22, 2017 12:48 PM
> >>> To: dev@beam.apache.org
> >>> Cc: d...@tika.apache.org
> >>> Subject: Re: TikaIO concerns
> >>>
> >>> Hi Tim,
> >>> From what you're saying, it sounds like the Tika library has a big
> >>> problem with crashes and freezes, and applying it at scale (e.g. in
> >>> the context of Beam) requires explicitly addressing this problem,
> >>> e.g. accepting the fact that in many realistic applications some
> >>> documents will just need to be skipped because they are
> >>> unprocessable? This would be the first example of a Beam IO that has
> >>> this concern, so I'd like to confirm that my understanding is
> >>> correct.
> >>>
> >>> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
> >>> wrote:
> >>>
> >>>> Reuven,
> >>>>
> >>>> Thank you!
> >>>> This suggests to me that it is a good idea to integrate Tika with
> >>>> Beam, so that people don't have to 1) (re)discover the need to make
> >>>> their wrappers robust and then 2) reinvent these wheels for
> >>>> robustness.
> >>>>
> >>>> For kicks, see William Palmer's post on his toe-stubbing efforts
> >>>> with Hadoop [1]. He and other Tika users have independently wound up
> >>>> carrying out exactly your recommendation for 1) below.
> >>>>
> >>>> We have a MockParser that you can use to simulate regular
> >>>> exceptions, OOMs and permanent hangs by asking Tika to parse a
> >>>> <mock> xml [2].
> >>>>
> >>>>> However, if processing the document causes the process to crash,
> >>>>> then it will be retried.
> >>>>
> >>>> Any ideas on how to get around this?
> >>>>
> >>>> Thank you again.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Tim
> >>>>
> >>>> [1]
> >>>> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> >>>> [2]
> >>>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
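[Editor's illustration] For the permanent hangs the MockParser can simulate, one common mitigation, similar in spirit to what robust wrappers like tika-batch do, is to run each parse on a separate thread and give up after a timeout. A minimal in-process sketch; the class and method names and the string markers are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Run a potentially hanging parse on a worker thread and give up after a
// timeout, so one pathological document cannot freeze the whole worker.
class TimedParse {
    static String parseWithTimeout(Callable<String> parse, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return executor.submit(parse).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMED_OUT";  // the document is skipped, not retried forever
        } catch (Exception e) {
            return "FAILED";     // a regular parse exception, also just skipped
        } finally {
            executor.shutdownNow(); // interrupts a stuck parse thread
        }
    }
}
```

Note the caveat raised in the thread: this only helps with interruptible hangs. A hard hang or crash in native code still needs process-level isolation, which is the part that is much harder and arguably out of scope for a first version.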