Eugene,

As far as I was concerned, I was quite happy with the initial code I wrote:
It does work, or at least I think it does: I ran a TikaIO example (not part of the distro - I can provide a link) and saw a bunch of files created (with TextIO.Write linked to TikaIO), and to me it looked 'perfect'.
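For reference, the example pipeline is roughly of this shape (the paths and the options setup here are illustrative, not the exact example code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.tika.TikaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TikaIOExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // Parse every matching file with Tika and write the extracted text
    // back out as plain text files.
    p.apply(TikaIO.read().from("/path/to/input/*"))
     .apply(TextIO.write().to("/path/to/output/tika"));
    p.run().waitUntilFinish();
  }
}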
Given that, I thought I would see the PR succeed. At no point was there a question of me, as JB's co-worker, relying on JB's friendship just to get it in, even though I was keen to see the PR merged asap :-) - as long as JB had any questions/review comments for me, I'd address them.
I've no problem with cleaning up the code with more PRs, and I'd really prefer that to re-writing it completely. Does the ordering matter? Perhaps for some cases it does, and for some it does not. Maybe it makes sense to support running TikaIO as both a bounded reader/source and a ParDo, with the common code reused.
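To illustrate what I mean, a minimal sketch of the ParDo flavour could look like this (TikaParseFn is a hypothetical name, not actual TikaIO code; for simplicity it emits the whole document text as a single element, which sidesteps the chunk ordering question at the cost of buffering each document in memory):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Input: a file path; output: the text Tika extracts from that file.
class TikaParseFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // -1 disables BodyContentHandler's default 100k character write limit.
    BodyContentHandler handler = new BodyContentHandler(-1);
    try (InputStream stream = Files.newInputStream(Paths.get(c.element()))) {
      new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    }
    c.output(handler.toString());
  }
}

// Usage: PCollection<String> text = fileNames.apply(ParDo.of(new TikaParseFn()));

The bounded source flavour and such a DoFn could then share the actual Tika invocation code.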
At this stage, IMHO, it might make sense to clean up the current code first, before making possibly bigger decisions?
Sergey

On 20/09/17 00:18, Eugene Kirpichov wrote:
On Tue, Sep 19, 2017 at 5:13 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Sergey,

as discussed together during the review, I fully understand the choices you made. Your plan sounds reasonable. Thanks!

Generally speaking, in order to give visibility and encourage contribution, I think it would make sense to accept a PR if it's basically right (even if it's not yet perfect) and doesn't break the build.

This is a wider discussion than the current thread, but I don't think I agree with this approach. We have followed a much stricter standard in the past, and thanks to that, Beam currently has (in my opinion) an extremely high-quality library of IOs, and Beam can take pride in not being one of "those" open-source projects that advertise everything but guarantee nothing and are frustrating to work with, because everything is slightly broken in some way or another.

I can recall at most 1 or 2 cases where a contributor gave up on a PR due to the amount of issues pointed out during review - and in those cases, the PR was usually in a state where Beam would not have benefitted from merging the issue-ridden code anyway. Basically, a thorough review in all the cases I've seen so far has been a good idea in retrospect.

There may be trivial fixups best done by a committer rather than the author (e.g. javadoc typos), but I think nontrivial, high-level issues should be reviewed rigorously. People trust Beam (especially Beam IOs) with their data, and at least the correctness-critical stuff *must* be done right. Beam also generally promises a stable API, so API mistakes are forever and cannot be fixed iteratively [this can be addressed by marking in-progress work as @Experimental] - so APIs must be done right as well. On the other hand, performance, documentation, lack of support for certain features etc. can be fixed iteratively - I agree that we shouldn't push too hard on that during review.

There's also the mentorship aspect: I think it is valuable for new Beam contributors to get a thorough review, especially for their first contributions, as a kick-start to learning the best practices - they are going to need them repeatedly in their future contributions. Merging code without sufficient review gives them the immediate gratification of "having contributed", but denies them the mentorship. Moreover, most contributions are made by a relatively small number of prolific "serial contributors" (you being a prime example!) who are responsive to feedback and eager to learn, so the immediate gratification I think is not very important.

I think the best way to handle code reviews for Beam is to give it our best as reviewers, especially for first-time contributors; and if it feels like the amount of required changes is too large for the contributor to handle, then work with them to prioritize the changes, or start small and decompose the contribution into more manageable pieces - but each merged piece must be high-quality.

I would be happy to help on TikaIO as I did during the first review round ;)

Regards
JB

On 09/19/2017 12:41 PM, Sergey Beryozkin wrote:

Hi All,

This is my first post to the dev list. I work for Talend; I'm a Beam novice, an Apache Tika fan, and I thought it would be really great to try and link both projects together, which led me to opening [1], where I typed some early thoughts, followed by PR [2].
I noticed yesterday I had the robust :-) (but useful and helpful) newer review comments from Eugene pending, so I'd like to summarize a bit why I did TikaIO (reader) the way I did, and then decide, based on the feedback from the experts, what to do next.

Apache Tika parsers report the text content in chunks, via SAXParser events. It's not possible with Tika to take a file and read it bit by bit at the 'initiative' of the Beam reader, line by line; the only way is to handle the SAXParser callbacks which report the data chunks. Some parsers may report complete lines, some individual words, and some can report the data only after they completely parse the document. It all depends on the data format.

At the moment TikaIO's TikaReader does not use the Beam threads to parse the files; Beam threads only collect the data from the internal queue into which the internal TikaReader's thread puts the data (note the data chunks are ordered, even though the tests might suggest otherwise).

The reason I did it this way was because I thought:

1) it would make the individual data chunks available faster to the pipeline - the parser will continue working through the binary/video etc. file while the data will already start flowing. I agree there should be some test data available confirming it, but I'm positive at the moment this approach might yield some performance gains with large sets. If the file is large, or if it has embedded attachments/videos to deal with, then it may be more effective not to have the Beam thread deal with it...

2) As I commented at the end of [2], having an option to concatenate the data chunks first, before making them available to the pipeline, is useful, and I guess doing the same in a ParDo would introduce some synchronization issues (though I'm not exactly sure yet).

One valid concern there is that the reader is polling the internal queue, so, in theory at least, and perhaps in some rare cases too, we may have a case where the max polling time has been reached, the parser is still busy, and TikaIO fails to report all the file data. I think that can be solved by either 2a) configuring the max polling time to a very large number which will never be reached in a practical case, or 2b) simply using a blocking queue without the time limits - in the worst case, if TikaParser spins and fails to report the end of the document, then Beam can heal itself if the pipeline blocks. I propose to follow 2b) (a rough sketch of this hand-off is in the PS at the end of this message). Please let me know what you think.

My plan so far is:

1) start addressing most of Eugene's comments which would require some minor TikaIO updates
2) at the next stage, work on removing the TikaSource internal code dealing with file patterns which I copied from TextIO
3) if needed - mark TikaIO Experimental to give Tika and Beam users some time to try it with some real complex files, and also decide whether TikaIO can continue to be implemented as a BoundedSource/Reader or not

Eugene, all, will it work if I start with 1)?

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
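PS: to make option 2b) above concrete, here is a minimal sketch of the queue hand-off between the internal Tika parsing thread and the reader (class and method names are illustrative, not the actual TikaReader internals):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class QueueingTikaParser {
  // Sentinel marking the end of the document.
  private static final String EOF = "";

  public static BlockingQueue<String> startParsing(String path) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
    new Thread(() -> {
      DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) {
          if (length == 0) {
            return; // keep the EOF sentinel unambiguous
          }
          try {
            // put() blocks when the queue is full: the parser waits for the
            // consumer instead of the consumer timing out on a poll (2b).
            queue.put(new String(ch, start, length));
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      };
      try (InputStream stream = Files.newInputStream(Paths.get(path))) {
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
      } catch (Exception e) {
        // A real implementation would hand the failure over to the reader too.
      } finally {
        try {
          queue.put(EOF);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }).start();
    return queue;
  }

  // Reader side: take() blocks until a chunk (or EOF) arrives, so the
  // "max polling time reached while the parser is still busy" failure
  // mode of option 2a) cannot occur.
  public static void drain(BlockingQueue<String> queue) throws InterruptedException {
    for (String chunk = queue.take(); !EOF.equals(chunk); chunk = queue.take()) {
      System.out.println(chunk);
    }
  }
}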