I would be happy to help on TikaIO as I did during the first review round
;)
Regards
JB
On 09/19/2017 12:41 PM, Sergey Beryozkin wrote:
Hi All
This is my first post to the dev list. I work for Talend, I'm a Beam novice
and an Apache Tika fan, and I thought it would be really great to try and link
both projects together, which led me to opening [1] where I typed some early
thoughts, followed by PR [2].
I noticed yesterday I had the robust :-) (but useful and helpful) newer
review
comments from Eugene pending, so I'd like to summarize a bit why I did
TikaIO
(reader) the way I did, and then decide, based on the feedback from the
experts,
what to do next.
Apache Tika parsers report the text content in chunks, via SAX parser events.
It's not possible with Tika to take a file and read it bit by bit, line by line,
at the 'initiative' of the Beam reader; the only way is to handle the SAXParser
callbacks which report the data chunks. Some parsers may report complete lines,
some individual words, and some can only report the data after they have
completely parsed the document. It all depends on the data format.
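For anyone not too familiar with Tika, here is a rough sketch of that push model
(the file name and the chunk printing are just for illustration, this is not TikaIO code):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class TikaPushExample {
    public static void main(String[] args) throws Exception {
        try (InputStream is = Files.newInputStream(Paths.get("report.pdf"))) {
            // Tika pushes the text to the handler via SAX callbacks; the
            // caller cannot ask for "the next line" - the parser decides
            // where the chunk boundaries are.
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void characters(char[] ch, int start, int length) {
                    // Depending on the format parser, a chunk may be a full
                    // line, a single word, or the whole document at once.
                    System.out.println("chunk: " + new String(ch, start, length));
                }
            };
            new AutoDetectParser().parse(is, handler, new Metadata(), new ParseContext());
        }
    }
}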
At the moment TikaIO's TikaReader does not use the Beam threads to parse the
files; the Beam threads only collect the data from the internal queue into which
the TikaReader's internal thread puts the data
(note the data chunks are ordered, even though the tests might suggest otherwise).
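To make this a bit more concrete, the current shape is roughly like the sketch below
(heavily simplified, the names are made up for this sketch and it is not the actual
TikaReader code):

import java.io.InputStream;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

class QueueingTikaReaderSketch {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private String current;

    // A dedicated thread runs the Tika parser and pushes every SAX text
    // chunk into the queue, so the Beam thread never runs the parser itself.
    void startParsing(InputStream is) {
        new Thread(() -> {
            try {
                new AutoDetectParser().parse(is, new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        queue.add(new String(ch, start, length));
                    }
                }, new Metadata(), new ParseContext());
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }).start();
    }

    // The Beam reader thread only drains the queue; with a timed poll a
    // slow parser can look like "no more data", which is the concern
    // discussed further below.
    boolean advance() throws InterruptedException {
        current = queue.poll(1, TimeUnit.SECONDS);
        return current != null;
    }

    String getCurrent() {
        return current;
    }
}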
The reason I did it this way was because I thought:
1) it would make the individual data chunks available to the pipeline faster -
the parser keeps working through the binary/video/etc. file while the data
already starts flowing. I agree there should be some test data available
confirming it, but at the moment I'm positive this approach might yield some
performance gains with large sets. If the file is large, or if it has
embedded attachments/videos to deal with, then it may be more effective not to
have the Beam thread deal with it...
2) As I commented at the end of [2], having an option to concatenate the data
chunks before making them available to the pipeline is useful, and I guess
doing the same in a ParDo would introduce some synchronization issues (though
I'm not exactly sure yet).
One of the valid concerns there is that the reader is polling the internal queue,
so, in theory at least, and perhaps in some rare cases too, we may have a case
where the max polling time has been reached while the parser is still busy, and
TikaIO fails to report all the file data. I think that it can be solved by
either 2a) configuring the max polling time to a very large number which will
never be reached in a practical case, or 2b) simply using a blocking queue
without time limits - in the worst case, if the TikaParser spins and fails to
report the end of the document, then Beam can heal itself if the pipeline blocks.
I propose to follow 2b).
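In code, 2b) could look roughly like the sketch below (the END marker and the
method names are just for this sketch, not the actual TikaIO code): the parsing
thread puts chunks and finally an end-of-document marker with a blocking put(),
and the reader side does a blocking take(), so no timeout can cut the data short.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class BlockingQueueSketch {
    // Sentinel marking "the parser reported the end of the document";
    // compared by identity, so it can never clash with a real chunk.
    private static final String END = new String("__END__");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private String current;

    // Called from the Tika parsing thread for each SAX text chunk.
    void onChunk(String chunk) throws InterruptedException {
        queue.put(chunk); // would block if a bounded queue were full
    }

    // Called from the Tika parsing thread when endDocument() fires.
    void onEndOfDocument() throws InterruptedException {
        queue.put(END);
    }

    // Called from the Beam reader thread; blocks until a chunk (or the
    // end marker) is available, so a slow parser can never be mistaken
    // for the end of the file.
    boolean advance() throws InterruptedException {
        String next = queue.take();
        if (next == END) {
            return false;
        }
        current = next;
        return true;
    }
}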
Please let me know what you think.
My plan so far is:
1) start addressing most of Eugene's comments, which would require some minor
TikaIO updates
2) at the next stage, work on removing the TikaSource internal code dealing
with file patterns which I copied from TextIO
3) if needed, mark TikaIO Experimental to give Tika and Beam users some time to
try it with some real complex files, and also decide whether TikaIO should
continue to be implemented as a BoundedSource/Reader or not
Eugene, all, will it work if I start with 1)?
Thanks, Sergey
[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com