Reuven,

Thank you! This suggests to me that it is a good idea to integrate Tika with 
Beam so that people don't have to 1) (re)discover the need to make their 
wrappers robust and then 2) reinvent the wheels that provide that robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with Hadoop 
[1]. He and other Tika users have independently wound up carrying out exactly 
your recommendation for 1) below.

We have a MockParser that simulates regular exceptions, OOMs and permanent 
hangs when you ask Tika to parse a <mock> XML file [2].
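
To make those three failure modes concrete, here is a rough sketch of the kind 
of robust wrapper under discussion. It is not Tika or Beam code. It assumes the 
risky work is run in a child process with a timeout, so that an exception, a 
crash or a hang in the child cannot take the caller down with it. The function 
name run_isolated and the three one-line stand-in "documents" are hypothetical, 
chosen only for illustration.

```python
import subprocess
import sys


def run_isolated(child_code, timeout_s):
    """Run child_code in a separate Python process and classify the outcome.

    Hypothetical helper for illustration only: child_code stands in for
    "parse this document", and the caller always gets a (status, detail)
    tuple back instead of being taken down by the child.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so a
        # permanent hang does not leave a stray process behind.
        return ("timeout", "")
    if proc.returncode != 0:
        # Covers an uncaught exception in the child as well as a child
        # that died abnormally (for example, killed by the OS).
        return ("failed", proc.stderr.strip())
    return ("ok", proc.stdout.strip())


# Stand-ins for a well-behaved document, one that throws, and one that hangs.
GOOD = "print('extracted text')"
THROWS = "raise ValueError('bad document')"
HANGS = "import time; time.sleep(60)"

if __name__ == "__main__":
    for name, code in [("good", GOOD), ("throws", THROWS), ("hangs", HANGS)]:
        status, _detail = run_isolated(code, timeout_s=2)
        print(name, status)
```

The point of the design is only that the caller records a failure and moves on. 
What to do with a document that fails every time is a separate decision.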

> However if processing the document causes the process to crash, then it will 
> be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

           Tim

[1] 
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2] 
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml
 
