This is similar to what I suggested. However, it will not handle crashes
and freezes well.

On Fri, Sep 22, 2017 at 10:24 AM, Ben Chambers <bchamb...@apache.org> wrote:

> BigQueryIO allows a side-output for elements that failed to be inserted
> when using the Streaming BigQuery sink:
>
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92
>
> This follows the pattern of a DoFn with multiple outputs, as described here
> https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
>
> So, the DoFn that runs the Tika code could be configured in terms of how
> different failures should be handled, with the option of just outputting
> them to a different PCollection that is then processed in some other way.
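>
> A minimal sketch of that pattern for a Tika-based DoFn (the element
> types, the tag name and the use of the Tika facade are illustrative
> assumptions, not the actual TikaIO code):
>
>   import java.io.ByteArrayInputStream;
>   import org.apache.beam.sdk.transforms.DoFn;
>   import org.apache.beam.sdk.values.KV;
>   import org.apache.beam.sdk.values.TupleTag;
>   import org.apache.tika.Tika;
>
>   class ParseWithTikaFn extends DoFn<KV<String, byte[]>, KV<String, String>> {
>     // Dead-letter output: (document id, error message) for documents that fail to parse.
>     static final TupleTag<KV<String, String>> FAILED_TAG =
>         new TupleTag<KV<String, String>>() {};
>
>     private transient Tika tika;
>
>     @Setup
>     public void setup() {
>       tika = new Tika();
>     }
>
>     @ProcessElement
>     public void processElement(ProcessContext c) {
>       KV<String, byte[]> doc = c.element();
>       try {
>         String text = tika.parseToString(new ByteArrayInputStream(doc.getValue()));
>         c.output(KV.of(doc.getKey(), text));  // main output: extracted text
>       } catch (Exception e) {
>         // Route the bad document to the dead-letter output instead of failing the bundle.
>         c.output(FAILED_TAG, KV.of(doc.getKey(), e.toString()));
>       }
>     }
>   }
>
> This only catches exceptions thrown on the worker thread; hangs, OOMs
> and process crashes are a separate problem, as discussed below.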
>
> On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <talli...@mitre.org>
> wrote:
>
> > Do tell...
> >
> > Interesting.  Any pointers?
> >
> > -----Original Message-----
> > From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> > Sent: Friday, September 22, 2017 12:50 PM
> > To: dev@beam.apache.org
> > Cc: d...@tika.apache.org
> > Subject: Re: TikaIO concerns
> >
> > Regarding specifically elements that are failing -- I believe some other
> > IO has used the concept of a "Dead Letter" side-output, where documents
> > that failed to process are side-output so the user can handle them
> > appropriately.
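> >
> > As a rough illustration of the wiring (ParseDocumentsFn, the tags and
> > the element types are hypothetical names for this example, and
> > documents is an assumed PCollection<KV<String, byte[]>> of (id, raw
> > bytes)), the dead-letter collection is just a second TupleTag on the
> > ParDo:
> >
> >   TupleTag<KV<String, String>> parsedTag = new TupleTag<KV<String, String>>() {};
> >   TupleTag<KV<String, String>> failedTag = new TupleTag<KV<String, String>>() {};
> >
> >   PCollectionTuple results = documents.apply("ParseDocuments",
> >       ParDo.of(new ParseDocumentsFn(failedTag))
> >            .withOutputTags(parsedTag, TupleTagList.of(failedTag)));
> >
> >   PCollection<KV<String, String>> parsed = results.get(parsedTag);
> >   PCollection<KV<String, String>> failed = results.get(failedTag);  // handle as you see fit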
> >
> > On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov
> > <kirpic...@google.com.invalid> wrote:
> >
> > > Hi Tim,
> > > From what you're saying it sounds like the Tika library has a big
> > > problem with crashes and freezes, and when applying it at scale (eg.
> > > in the context of Beam) requires explicitly addressing this problem,
> > > eg. accepting the fact that in many realistic applications some
> > > documents will just need to be skipped because they are unprocessable?
> > > This would be first example of a Beam IO that has this concern, so I'd
> > > like to confirm that my understanding is correct.
> > >
> > > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B.
> > > <talli...@mitre.org>
> > > wrote:
> > >
> > > > Reuven,
> > > >
> > > > Thank you!  This suggests to me that it is a good idea to integrate
> > > > Tika with Beam so that people don't have to 1) (re)discover the need
> > > > to make their wrappers robust and then 2) reinvent those wheels
> > > > themselves.
> > > >
> > > > For kicks, see William Palmer's post on his toe-stubbing efforts
> > > > with Hadoop [1].  He and other Tika users independently have wound
> > > > up carrying out exactly your recommendation for 1) below.
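> > > >
> > > > One common shape for such a robustness wrapper, purely as an
> > > > illustration (class and method names are invented here): run the
> > > > parse on its own thread and give up after a timeout. This guards
> > > > against permanent hangs, but not against OOMs or JVM crashes, which
> > > > need process-level isolation.
> > > >
> > > >   import java.io.InputStream;
> > > >   import java.util.concurrent.ExecutorService;
> > > >   import java.util.concurrent.Executors;
> > > >   import java.util.concurrent.Future;
> > > >   import java.util.concurrent.TimeUnit;
> > > >   import java.util.concurrent.TimeoutException;
> > > >   import org.apache.tika.Tika;
> > > >
> > > >   class TimeLimitedTikaWrapper {
> > > >     private final ExecutorService executor = Executors.newSingleThreadExecutor();
> > > >     private final Tika tika = new Tika();
> > > >
> > > >     /** Returns extracted text, or null if the parse failed or timed out. */
> > > >     String parseOrSkip(InputStream stream, long timeoutSeconds) {
> > > >       Future<String> future = executor.submit(() -> tika.parseToString(stream));
> > > >       try {
> > > >         return future.get(timeoutSeconds, TimeUnit.SECONDS);
> > > >       } catch (TimeoutException e) {
> > > >         future.cancel(true);  // best effort; a truly stuck parse may ignore interruption
> > > >         return null;          // caller skips or dead-letters the document
> > > >       } catch (Exception e) {
> > > >         return null;          // parse error or interruption: skip or dead-letter
> > > >       }
> > > >     }
> > > >   }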
> > > >
> > > > We have a MockParser that you can use to simulate regular exceptions,
> > > > OOMs and permanent hangs by asking Tika to parse a <mock> XML file [2].
> > > >
> > > > > However if processing the document causes the process to crash,
> > > > > then it will be retried.
> > > > Any ideas on how to get around this?
> > > >
> > > > Thank you again.
> > > >
> > > > Cheers,
> > > >
> > > >            Tim
> > > >
> > > > [1]
> > > > http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > > > [2]
> > > > https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
> > > >
> > >
> >
>
