On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin <sberyoz...@gmail.com> wrote:
> Hi,
> On 22/09/17 22:02, Eugene Kirpichov wrote:
> > Sure - with hundreds of different file formats and the abundance of
> > weird / malformed / malicious files in the wild, it's quite expected
> > that sometimes the library will crash.
> >
> > Some kinds of issues are easier to address than others. We can catch
> > exceptions and return a ParseResult representing a failure to parse
> > this document. Addressing freezes and native JVM process crashes is
> > much harder and probably not necessary in the first version.
> >
> > Sergey - I think the moment you introduce ParseResult into the code,
> > the other changes I suggested will follow "by construction":
> > - There'll be 1 ParseResult per document, containing filename, content
> > and metadata, since per the discussion above it probably doesn't make
> > sense to deliver these in separate PCollection elements.
>
> I was still harboring the hope that maybe using a container bean like
> ParseResult (with the other changes you proposed) can somehow let us
> stream from Tika into the pipeline.
>
> If it is 1 ParseResult per document, then it means that until Tika has
> parsed the whole document, the pipeline will not see it.

This is correct, and this is the API I'm suggesting to start with, because
it's simple and sufficiently useful. I suggest getting into this state
first, and then dealing with creating a separate API that allows not
holding the entire parse result as a single PCollection element in memory.
This should work fine for cases when each document's parse result (not the
input document itself!) is up to a few hundred megabytes in size.

> I'm sorry if I may be starting to go in circles, but let me ask this:
> how can a Beam user write a Beam function which will ensure the Tika
> content pieces are seen ordered by the pipeline, without TikaIO?

To answer this, I'd need you to clarify what you mean by "seen ordered by
the pipeline" - order is a very vague term when it comes to parallel
processing.
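[Editor's illustration] To make the ParseResult idea above concrete, here is a minimal sketch of such a per-document container bean: filename, content, and metadata travel together in one element, and a failure variant turns an unparseable document into data rather than a crashed bundle. All names here are assumptions for illustration, not the actual TikaIO API.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical per-document result container (names are illustrative,
// not the actual TikaIO API). One instance per input document.
class ParseResult {
    private final String fileName;
    private final String content;               // extracted text; null on failure
    private final Map<String, String> metadata; // e.g. Content-Type, author
    private final Throwable error;              // non-null when parsing failed

    private ParseResult(String fileName, String content,
                        Map<String, String> metadata, Throwable error) {
        this.fileName = fileName;
        this.content = content;
        this.metadata = metadata;
        this.error = error;
    }

    // A successful parse: filename, full content, and metadata together,
    // so they never travel as separate PCollection elements.
    static ParseResult success(String fileName, String content,
                               Map<String, String> metadata) {
        return new ParseResult(fileName, content, metadata, null);
    }

    // A failed parse still produces an element the pipeline can log or skip.
    static ParseResult failure(String fileName, Throwable error) {
        return new ParseResult(fileName, null, Collections.emptyMap(), error);
    }

    boolean isSuccess() { return error == null; }
    String getFileName() { return fileName; }
    String getContent()  { return content; }
    Map<String, String> getMetadata() { return metadata; }
    Throwable getError() { return error; }
}
```

The tradeoff discussed in the thread applies: the whole extracted content lives in a single element, so this shape fits documents whose parse output is at most a few hundred megabytes.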
What would you like the pipeline to compute that requires order within a
document, but does NOT require having the contents of a document as a
single String? Or are you asking simply how users can use Tika for
arbitrary use cases without TikaIO?

> Maybe knowing that will help in coming up with an idea of how to
> generalize somehow with the help of TikaIO?
>
> > - Since you're returning a single value per document, there's no
> > reason to use a BoundedReader.
> > - Likewise, there's no reason to use asynchronicity, because you're
> > not delivering the result incrementally.
> >
> > I'd suggest starting the refactoring by removing the asynchronous
> > codepath, then converting from BoundedReader to ParDo or MapElements,
> > then converting from String to ParseResult.
>
> This is a good plan, thanks. I guess at least for small documents it
> should work well (unless I've misunderstood the ParseResult idea).
>
> Thanks, Sergey
>
> > On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com>
> > wrote:
> >
> >> Hi Tim, All
> >> On 22/09/17 18:17, Allison, Timothy B. wrote:
> >>> Y, I think you have it right.
> >>>
> >>>> Tika library has a big problem with crashes and freezes
> >>>
> >>> I wouldn't want to overstate it. Crashes and freezes are exceedingly
> >>> rare, but when you are processing millions/billions of files in the
> >>> wild [1], they will happen. We fix the problems, or try to get our
> >>> dependencies to fix the problems, when we can,
> >>
> >> I would only like to add that IMHO it would be more correct to state
> >> that it's not the Tika library's 'fault' that the crashes might occur.
> >> Tika does its best to use the latest libraries that help it parse the
> >> files, but indeed there will always be some file out there that might
> >> use some incomplete format-specific tag etc. which may cause the
> >> specific parser to spin - but Tika will include the updated parser
> >> library asap.
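[Editor's illustration] The element-wise refactoring suggested above (a synchronous ParDo or MapElements instead of a BoundedReader) boils down to a plain function from one input document to one output element, with exceptions caught per element. A self-contained sketch with a stand-in parser instead of the real Tika AutoDetectParser; the class name and the "OK:"/"FAILED:" markers are assumptions, not Beam or Tika API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of the synchronous, element-wise shape a ParDo/MapElements-based
// TikaIO could take. The Function stands in for a real Tika parser call;
// the "OK:"/"FAILED:" tags stand in for a proper ParseResult type.
class SafeParseFn {
    static List<String> parseAll(List<String> docs,
                                 Function<String, String> parser) {
        List<String> out = new ArrayList<>();
        for (String doc : docs) {
            try {
                // One input element -> one output element, no async codepath.
                out.add("OK:" + parser.apply(doc));
            } catch (RuntimeException e) {
                // A malformed document becomes a failure element,
                // not a crashed bundle.
                out.add("FAILED:" + doc);
            }
        }
        return out;
    }
}
```

In a real pipeline, the loop body would be the processElement method of a DoFn, with Beam supplying the per-element iteration and parallelism.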
> >>
> >> And with Beam's help, the crashes that can kill Tika jobs completely
> >> will probably become history...
> >>
> >> Cheers, Sergey
> >>
> >>> but given our past history, I have no reason to believe that these
> >>> problems won't happen again.
> >>>
> >>> Thank you, again!
> >>>
> >>> Best,
> >>>
> >>> Tim
> >>>
> >>> [1] Stuff on the internet, or ... some of our users are forensics
> >>> examiners dealing with broken/corrupted files
> >>>
> >>> P.S./FTR 😊
> >>> 1) We've gathered a TB of data from CommonCrawl, and we run
> >>> regression tests against this TB (thank you, Rackspace, for hosting
> >>> our vm!) to try to identify these problems.
> >>> 2) We've started a fuzzing effort to try to identify problems.
> >>> 3) We added "tika-batch" for robust single box fileshare/fileshare
> >>> processing for our low volume users.
> >>> 4) We're trying to get the message out. Thank you for working with
> >>> us!!!
> >>>
> >>> -----Original Message-----
> >>> From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
> >>> Sent: Friday, September 22, 2017 12:48 PM
> >>> To: dev@beam.apache.org
> >>> Cc: d...@tika.apache.org
> >>> Subject: Re: TikaIO concerns
> >>>
> >>> Hi Tim,
> >>> From what you're saying, it sounds like the Tika library has a big
> >>> problem with crashes and freezes, and applying it at scale (e.g. in
> >>> the context of Beam) requires explicitly addressing this problem,
> >>> e.g. accepting the fact that in many realistic applications some
> >>> documents will just need to be skipped because they are
> >>> unprocessable? This would be the first example of a Beam IO that has
> >>> this concern, so I'd like to confirm that my understanding is
> >>> correct.
> >>>
> >>> On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
> >>> wrote:
> >>>
> >>>> Reuven,
> >>>>
> >>>> Thank you!
> >>>> This suggests to me that it is a good idea to integrate Tika with
> >>>> Beam, so that people don't have to 1) (re)discover the need to make
> >>>> their wrappers robust and then 2) reinvent these wheels for
> >>>> robustness.
> >>>>
> >>>> For kicks, see William Palmer's post on his toe-stubbing efforts
> >>>> with Hadoop [1]. He and other Tika users have independently wound up
> >>>> carrying out exactly your recommendation for 1) below.
> >>>>
> >>>> We have a MockParser that you can use to simulate regular
> >>>> exceptions, OOMs and permanent hangs by asking Tika to parse a
> >>>> <mock> xml [2].
> >>>>
> >>>>> However, if processing the document causes the process to crash,
> >>>>> then it will be retried.
> >>>>
> >>>> Any ideas on how to get around this?
> >>>>
> >>>> Thank you again.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Tim
> >>>>
> >>>> [1]
> >>>> http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> >>>> [2]
> >>>> https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
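[Editor's illustration] For the permanent hangs the MockParser can simulate, one common mitigation, similar in spirit to what robust wrappers like tika-batch do, is to run each parse on a separate thread and give up after a timeout. A minimal in-process sketch; the class and method names and the string markers are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Run a potentially hanging parse on a worker thread and give up after a
// timeout, so one pathological document cannot freeze the whole worker.
class TimedParse {
    static String parseWithTimeout(Callable<String> parse, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            return executor.submit(parse).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMED_OUT";  // the document is skipped, not retried forever
        } catch (Exception e) {
            return "FAILED";     // a regular parse exception, also just skipped
        } finally {
            executor.shutdownNow(); // interrupts a stuck parse thread
        }
    }
}
```

Note the caveat raised in the thread: this only helps with interruptible hangs. A hard hang or crash in native code still needs process-level isolation, which is the part that is much harder and arguably out of scope for a first version.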