This is similar to what I suggested. However, it will not handle crashes and freezes well.
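For freezes specifically, the usual mitigation (the same one the Hadoop wrapper referenced at the end of the quoted thread below ended up with) is to run each parse on its own thread and give up after a timeout. A minimal sketch follows, assuming a hypothetical parseWithTika helper; it is an illustration of the general technique, not anything TikaIO does today, and it cannot help with an OOM or a JVM crash.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class GuardedParse {

      private static final ExecutorService EXEC = Executors.newCachedThreadPool();

      // Runs the parse on a worker thread and gives up after timeoutSeconds, so a
      // permanently hung parse does not stall the caller. This only guards against
      // hangs; an OOM or JVM crash still takes down the whole process.
      public static String parseWithTimeout(String path, long timeoutSeconds) throws Exception {
        Future<String> future = EXEC.submit(() -> parseWithTika(path));
        try {
          return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
          // Best effort: interrupt the stuck parse; a truly wedged parser may ignore it.
          future.cancel(true);
          throw e;
        }
      }

      // Hypothetical stand-in for the real Tika parse call.
      private static String parseWithTika(String path) throws Exception {
        throw new UnsupportedOperationException("replace with an actual Tika parse");
      }
    }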
On Fri, Sep 22, 2017 at 10:24 AM, Ben Chambers <bchamb...@apache.org> wrote:

> BigQueryIO allows a side-output for elements that failed to be inserted when using the Streaming BigQuery sink:
>
> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java#L92
>
> This follows the pattern of a DoFn with multiple outputs, as described here:
> https://cloud.google.com/blog/big-data/2016/01/handling-invalid-inputs-in-dataflow
>
> So, the DoFn that runs the Tika code could be configured in terms of how different failures should be handled, with the option of just outputting them to a different PCollection that is then processed in some other way.
>
> On Fri, Sep 22, 2017 at 10:18 AM Allison, Timothy B. <talli...@mitre.org> wrote:
>
> > Do tell...
> >
> > Interesting. Any pointers?
> >
> > -----Original Message-----
> > From: Ben Chambers [mailto:bchamb...@google.com.INVALID]
> > Sent: Friday, September 22, 2017 12:50 PM
> > To: dev@beam.apache.org
> > Cc: d...@tika.apache.org
> > Subject: Re: TikaIO concerns
> >
> > Regarding specifically elements that are failing -- I believe some other IO has used the concept of a "Dead Letter" side-output, where documents that failed to process are side-output so the user can handle them appropriately.
> >
> > On Fri, Sep 22, 2017 at 9:47 AM Eugene Kirpichov <kirpic...@google.com.invalid> wrote:
> >
> > > Hi Tim,
> > > From what you're saying, it sounds like the Tika library has a big problem with crashes and freezes, and applying it at scale (e.g. in the context of Beam) requires explicitly addressing this problem, e.g. accepting the fact that in many realistic applications some documents will just need to be skipped because they are unprocessable? This would be the first example of a Beam IO that has this concern, so I'd like to confirm that my understanding is correct.
> > >
> > > On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org> wrote:
> > >
> > > > Reuven,
> > > >
> > > > Thank you! This suggests to me that it is a good idea to integrate Tika with Beam so that people don't have to 1) (re)discover the need to make their wrappers robust and then 2) reinvent these wheels for robustness.
> > > >
> > > > For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop [1]. He and other Tika users have independently wound up carrying out exactly your recommendation for 1) below.
> > > >
> > > > We have a MockParser that you can use to simulate regular exceptions, OOMs and permanent hangs by asking Tika to parse a <mock> xml [2].
> > > >
> > > > > However if processing the document causes the process to crash, then it will be retried.
> > > >
> > > > Any ideas on how to get around this?
> > > >
> > > > Thank you again.
> > > >
> > > > Cheers,
> > > >
> > > > Tim
> > > >
> > > > [1] http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
> > > > [2] https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
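To make the multi-output ("dead letter") pattern Ben describes concrete, here is a rough sketch of a DoFn that routes parse failures to a side output instead of failing the bundle. The tag names and the parseWithTika helper are hypothetical; this illustrates the pattern from the Dataflow blog post above, not the actual TikaIO code.

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.TupleTag;
    import org.apache.beam.sdk.values.TupleTagList;

    public class DeadLetterSketch {

      // Hypothetical output tags: extracted text on the main output,
      // the paths of documents that failed to parse on the side output.
      static final TupleTag<String> PARSED = new TupleTag<String>() {};
      static final TupleTag<String> DEAD_LETTER = new TupleTag<String>() {};

      static PCollectionTuple parse(PCollection<String> documentPaths) {
        return documentPaths.apply("ParseWithTika",
            ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                try {
                  // parseWithTika is a stand-in for the real Tika call.
                  c.output(parseWithTika(c.element()));
                } catch (Exception e) {
                  // Route the failing document to the dead-letter output
                  // instead of failing (and retrying) the whole bundle.
                  c.output(DEAD_LETTER, c.element());
                }
              }
            }).withOutputTags(PARSED, TupleTagList.of(DEAD_LETTER)));
      }

      // Hypothetical stand-in for the real Tika parse call.
      static String parseWithTika(String path) throws Exception {
        throw new UnsupportedOperationException("replace with an actual Tika parse");
      }
    }

Downstream, the DEAD_LETTER collection from the returned PCollectionTuple could be written to a failures sink (files, a table, etc.) for later inspection, similar in spirit to how the BigQuery streaming sink exposes its failed inserts. Note this only covers exceptions thrown inside the DoFn; OOMs, JVM crashes, and permanent hangs still need separate handling, as discussed above.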