Hi All
This is my first post to the dev list. I work for Talend, I'm a Beam
novice and an Apache Tika fan, and I thought it would be really great to
try and link both projects together, which led me to open [1], where I
typed up some early thoughts, followed by PR [2].
I noticed yesterday that I had newer, robust :-) (but useful and
helpful) review comments from Eugene pending, so I'd like to summarize
why I implemented the TikaIO reader the way I did, and then decide,
based on the feedback from the experts, what to do next.
Apache Tika parsers report the text content in chunks, via SAX parser
events. It is not possible with Tika to take a file and read it bit by
bit, line by line, at the 'initiative' of the Beam reader; the only way
is to handle the SAX callbacks which report the data chunks. Some
parsers may report complete lines, some individual words, and some can
report the data only after they have completely parsed the document. It
all depends on the data format.
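To make this concrete, here is a minimal, throwaway sketch (not the
TikaIO code itself; the class name is made up for the example) showing
how Tika pushes content at the caller - the parser decides when
characters() fires and how big each chunk is, and the caller cannot
pull 'the next line' on demand:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ContentHandlerDecorator;

    public class TikaChunksDemo {
      public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
          new AutoDetectParser().parse(
              in,
              new ContentHandlerDecorator() {
                @Override
                public void characters(char[] ch, int start, int length) {
                  // called whenever the parser decides to emit a chunk: a
                  // full line, a single word, or the whole document at once
                  System.out.println("chunk: " + new String(ch, start, length));
                }
              },
              new Metadata(),
              new ParseContext());
        }
      }
    }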
At the moment TikaIO's TikaReader does not use the Beam threads to
parse the files; the Beam threads only collect the data from an
internal queue into which the TikaReader's own internal thread puts the
data (note the data chunks are ordered, even though the tests might
suggest otherwise).
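In (simplified) code, the current shape is roughly the following - the
names here are illustrative only, not the actual TikaIO classes:

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.ContentHandlerDecorator;

    public class QueueingReaderSketch {
      public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // internal parser thread: runs the Tika parse and queues the chunks
        Thread parserThread = new Thread(() -> {
          try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            new AutoDetectParser().parse(in, new ContentHandlerDecorator() {
              @Override
              public void characters(char[] ch, int start, int length) {
                try {
                  queue.put(new String(ch, start, length));
                } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
                }
              }
            }, new Metadata(), new ParseContext());
          } catch (Exception e) {
            e.printStackTrace(); // the real reader would record the failure
          }
        });
        parserThread.start();

        // 'Beam reader' side: it only drains the queue, polling with a
        // maximum wait - this is where the timeout concern further below
        // comes from
        String chunk;
        while ((chunk = queue.poll(3, TimeUnit.SECONDS)) != null) {
          System.out.println("got: " + chunk);
        }
      }
    }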
I did it this way because I thought:
1) it would make the individual data chunks available to the pipeline
faster - the parser will keep working through the binary/video/etc.
file while the data will already start flowing - I agree there should
be some test data available confirming it, but at the moment I'm
positive this approach might yield some performance gains with large
sets. If the file is large, or if it has embedded attachments/videos to
deal with, then it may be more effective not to have the Beam thread
deal with it...
2) As I commented at the end of [2], having an option to concatenate
the data chunks before making them available to the pipeline is useful,
and I guess doing the same in a ParDo would introduce some
synchronization issues (though I'm not exactly sure yet).
One of the valid concerns there is that the reader is polling the
internal queue, so, in theory at least, and perhaps in some rare cases
too, we may hit a case where the max polling time has been reached, the
parser is still busy, and TikaIO fails to report all the file data. I
think it can be solved by either 2a) configuring the max polling time
to a very large number which will never be reached in any practical
case, or 2b) simply using a blocking queue without any time limits - in
the worst case, if the Tika parser spins and fails to report the end of
the document, Beam can heal itself if the pipeline blocks.
I propose to follow 2b).
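For 2b) I have something like the following in mind - again just a
sketch, with a made-up class name, a hypothetical end-of-document
marker, and a dummy producer standing in for the Tika parser thread:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BlockingHandoffSketch {
      // hypothetical poison-pill object marking the end of the document;
      // the parser thread puts it on the queue once parse() has returned
      private static final String END_OF_DOCUMENT = new String("<eod>");

      public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // stand-in for the internal Tika parser thread
        new Thread(() -> {
          try {
            queue.put("chunk-1");
            queue.put("chunk-2");
            queue.put(END_OF_DOCUMENT);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }).start();

        // reader side: take() blocks with no time limit, so no chunk can
        // be silently dropped; if the parser hangs, the pipeline simply
        // blocks (identity check against the sentinel is on purpose)
        String chunk;
        while ((chunk = queue.take()) != END_OF_DOCUMENT) {
          System.out.println("got: " + chunk);
        }
      }
    }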
Please let me know what you think.
My plan so far is:
1) start addressing most of Eugene's comments which would require some
minor TikaIO updates
2) at the next stage, work on removing the TikaSource internal code
dealing with file patterns, which I copied from TextIO
3) if needed, mark TikaIO Experimental to give Tika and Beam users some
time to try it with some real, complex files, and also to decide
whether TikaIO can continue to be implemented as a BoundedSource/Reader
or not.
Eugene, all, will it work if I start with 1)?
Thanks, Sergey
[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378