Hi All
This is my first post to the dev list. I work for Talend, I'm a Beam
novice and an Apache Tika fan, and I thought it would be really great to
try and link both projects together, which led me to open [1], where I
typed up some early thoughts, followed by PR [2].
I noticed yesterday that I had newer, robust :-) (but useful and
helpful) review comments from Eugene pending, so I'd like to summarize
why I implemented the TikaIO reader the way I did, and then decide,
based on the feedback from the experts, what to do next.
Apache Tika parsers report the text content in chunks, via SAX parser
events. It is not possible with Tika to take a file and read it bit by
bit, line by line, at the 'initiative' of the Beam reader; the only way
is to handle the SAX callbacks which report the data chunks. Some
parsers may report complete lines, some individual words, and some can
report the data only after they have completely parsed the document. It
all depends on the data format.
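To make this concrete, here is a minimal, throwaway sketch (not the
TikaIO code itself; the class name is made up for the example) showing
how Tika pushes content at the caller - the parser decides when
characters() fires and how big each chunk is, and the caller cannot
pull 'the next line' on demand:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ContentHandlerDecorator;

    public class TikaChunksDemo {
      public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
          new AutoDetectParser().parse(
              in,
              new ContentHandlerDecorator() {
                @Override
                public void characters(char[] ch, int start, int length) {
                  // called whenever the parser decides to emit a chunk: a
                  // full line, a single word, or the whole document at once
                  System.out.println("chunk: " + new String(ch, start, length));
                }
              },
              new Metadata(),
              new ParseContext());
        }
      }
    }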
At the moment TikaIO's TikaReader does not use the Beam threads to
parse the files; the Beam threads only collect the data from an
internal queue into which the TikaReader's own internal thread puts the
data (note the data chunks are ordered, even though the tests might
suggest otherwise).
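In (simplified) code, the current shape is roughly the following - the
names here are illustrative only, not the actual TikaIO classes:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ContentHandlerDecorator;

    public class QueueingReaderSketch {
      public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // internal parser thread: runs the Tika parse and queues the chunks
        Thread parserThread = new Thread(() -> {
          try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, new ContentHandlerDecorator() {
              @Override
              public void characters(char[] ch, int start, int length) {
                try {
                  queue.put(new String(ch, start, length));
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                }
              }
            }, new Metadata(), new ParseContext());
          } catch (Exception e) {
            e.printStackTrace(); // the real reader would record the failure
          }
        });
        parserThread.start();

        // 'Beam reader' side: it only drains the queue, polling with a
        // maximum wait - this is where the timeout concern further below
        // comes from
        String chunk;
        while ((chunk = queue.poll(3, TimeUnit.SECONDS)) != null) {
          System.out.println("got: " + chunk);
        }
      }
    }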
I did it this way because I thought:
1) it would make the individual data chunks available to the pipeline
faster - the parser will keep working through the binary/video/etc.
file while the data will already start flowing - I agree there should
be some test data available confirming it, but at the moment I'm
positive this approach might yield some performance gains with large
sets. If the file is large, or if it has embedded attachments/videos to
deal with, then it may be more effective not to have the Beam thread
deal with it...
2) As I commented at the end of [2], having an option to concatenate
the data chunks before making them available to the pipeline is useful,
and I guess doing the same in a ParDo would introduce some
synchronization issues (though I'm not exactly sure yet).
One of the valid concerns there is that the reader is polling the
internal queue, so, in theory at least, and perhaps in some rare cases
too, we may hit a case where the max polling time has been reached, the
parser is still busy, and TikaIO fails to report all the file data. I
think it can be solved by either 2a) configuring the max polling time
to a very large number which will never be reached in any practical
case, or 2b) simply using a blocking queue without any time limits - in
the worst case, if the Tika parser spins and fails to report the end of
the document, Beam can heal itself if the pipeline blocks.
I propose to follow 2b).
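For 2b) I have something like the following in mind - again just a
sketch, with a made-up class name, a hypothetical end-of-document
marker, and a dummy producer standing in for the Tika parser thread:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BlockingHandoffSketch {
      // hypothetical poison-pill object marking the end of the document;
      // the parser thread puts it on the queue once parse() has returned
      private static final String END_OF_DOCUMENT = new String("<eod>");

      public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // stand-in for the internal Tika parser thread
        new Thread(() -> {
          try {
            queue.put("chunk-1");
            queue.put("chunk-2");
            queue.put(END_OF_DOCUMENT);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }).start();

        // reader side: take() blocks with no time limit, so no chunk can
        // be silently dropped; if the parser hangs, the pipeline simply
        // blocks (identity check against the sentinel is on purpose)
        String chunk;
        while ((chunk = queue.take()) != END_OF_DOCUMENT) {
          System.out.println("got: " + chunk);
        }
      }
    }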
Please let me know what you think.
My plan so far is:
1) start addressing most of Eugene's comments which would require some
minor TikaIO updates
2) at the next stage, work on removing the TikaSource internal code
dealing with file patterns, which I copied from TextIO
3) if needed, mark TikaIO Experimental to give Tika and Beam users some
time to try it with some real, complex files, and also to decide
whether TikaIO can continue to be implemented as a BoundedSource/Reader
or not.
Eugene, all, will it work if I start with 1)?
Thanks, Sergey
[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378