Eugene,

As far as I was concerned, I was quite happy with the initial code I wrote:
It does work, or at least I think it does: I ran a TikaIO example (not part of the distro - I can provide a link) and saw a bunch of files created (with TextIO.Write linked to TikaIO), and to me it looked 'perfect'.
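For reference, the example pipeline is roughly of this shape (the paths and the options setup here are illustrative, not the exact example code):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.tika.TikaIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class TikaIOExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // Parse every matching file with Tika and write the extracted text
    // back out as plain text files.
    p.apply(TikaIO.read().from("/path/to/input/*"))
     .apply(TextIO.write().to("/path/to/output/tika"));
    p.run().waitUntilFinish();
  }
}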
Given that, I thought I would see the PR succeed. At no point was there a question of me, as JB's co-worker, relying on JB's friendship just to get it in, even though I was keen to see the PR merged asap :-) - as long as JB had any questions/review comments for me, I'd address them.
I've no problem with cleaning up the code with more PRs, and I'd really prefer that to re-writing it completely. Does the ordering matter? Perhaps for some cases it does, and for some it does not. Maybe it makes sense to support running TikaIO as both a bounded reader/source and a ParDo, with the common code reused.
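To illustrate what I mean, a minimal sketch of the ParDo flavour could look like this (TikaParseFn is a hypothetical name, not actual TikaIO code; for simplicity it emits the whole document text as a single element, which sidesteps the chunk ordering question at the cost of buffering each document in memory):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Input: a file path; output: the text Tika extracts from that file.
class TikaParseFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // -1 disables BodyContentHandler's default 100k character write limit.
    BodyContentHandler handler = new BodyContentHandler(-1);
    try (InputStream stream = Files.newInputStream(Paths.get(c.element()))) {
      new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
    }
    c.output(handler.toString());
  }
}

// Usage: PCollection<String> text = fileNames.apply(ParDo.of(new TikaParseFn()));

The bounded source flavour and such a DoFn could then share the actual Tika invocation code.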
At this stage, IMHO, it might make sense to clean up the current code first, before making possibly bigger decisions?
Sergey

On 20/09/17 00:18, Eugene Kirpichov wrote:
On Tue, Sep 19, 2017 at 5:13 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:

Hi Sergey,

as discussed together during the review, I fully understand the choices you made. Your plan sounds reasonable. Thanks!

Generally speaking, in order to give visibility and encourage contribution, I think it would make sense to accept a PR if it's basically right (even if it's not yet perfect) and doesn't break the build.

This is a wider discussion than the current thread, but I don't think I agree with this approach. We have followed a much stricter standard in the past, and thanks to that, Beam currently has (in my opinion) an extremely high-quality library of IOs, and Beam can take pride in not being one of "those" open-source projects that advertise everything but guarantee nothing and are frustrating to work with, because everything is slightly broken in some way or another.

I can recall at most 1 or 2 cases where a contributor gave up on a PR due to the amount of issues pointed out during review - and in those cases, the PR was usually in a state where Beam would not have benefitted from merging the issue-ridden code anyway. Basically, a thorough review in all the cases I've seen so far has been a good idea in retrospect.

There may be trivial fixups best done by a committer rather than the author (e.g. javadoc typos), but I think nontrivial, high-level issues should be reviewed rigorously. People trust Beam (especially Beam IOs) with their data, and at least the correctness-critical stuff *must* be done right. Beam also generally promises a stable API, so API mistakes are forever and cannot be fixed iteratively [this can be addressed by marking in-progress work as @Experimental] - so APIs must be done right as well. On the other hand, performance, documentation, lack of support for certain features etc. can be fixed iteratively - I agree that we shouldn't push too hard on that during review.

There's also the mentorship aspect: I think it is valuable for new Beam contributors to get a thorough review, especially for their first contributions, as a kick-start to learning the best practices - they are going to need them repeatedly in their future contributions. Merging code without sufficient review gives them the immediate gratification of "having contributed", but denies them the mentorship. Moreover, most contributions are made by a relatively small number of prolific "serial contributors" (you being a prime example!) who are responsive to feedback and eager to learn, so the immediate gratification I think is not very important.

I think the best way to handle code reviews for Beam is to give it our best as reviewers, especially for first-time contributors; and if it feels like the amount of required changes is too large for the contributor to handle, then work with them to prioritize the changes, or start small and decompose the contribution into more manageable pieces - but each merged piece must be high-quality.

I would be happy to help on TikaIO as I did during the first review round ;)

Regards
JB

On 09/19/2017 12:41 PM, Sergey Beryozkin wrote:

Hi All,

This is my first post to the dev list. I work for Talend; I'm a Beam novice, an Apache Tika fan, and I thought it would be really great to try and link both projects together, which led me to opening [1], where I typed some early thoughts, followed by PR [2].
I noticed yesterday I had the robust :-) (but useful and helpful) newer review comments from Eugene pending, so I'd like to summarize a bit why I did TikaIO (reader) the way I did, and then decide, based on the feedback from the experts, what to do next.

Apache Tika parsers report the text content in chunks, via SAXParser events. It's not possible with Tika to take a file and read it bit by bit at the 'initiative' of the Beam reader, line by line; the only way is to handle the SAXParser callbacks which report the data chunks. Some parsers may report complete lines, some individual words, and some can report the data only after they completely parse the document. It all depends on the data format.

At the moment TikaIO's TikaReader does not use the Beam threads to parse the files; Beam threads only collect the data from the internal queue into which the internal TikaReader's thread puts the data (note the data chunks are ordered, even though the tests might suggest otherwise).

The reason I did it this way was because I thought:

1) it would make the individual data chunks available faster to the pipeline - the parser will continue working through the binary/video etc. file while the data will already start flowing. I agree there should be some test data available confirming it, but I'm positive at the moment this approach might yield some performance gains with large sets. If the file is large, or if it has embedded attachments/videos to deal with, then it may be more effective not to have the Beam thread deal with it...

2) As I commented at the end of [2], having an option to concatenate the data chunks first, before making them available to the pipeline, is useful, and I guess doing the same in a ParDo would introduce some synchronization issues (though I'm not exactly sure yet).

One valid concern there is that the reader is polling the internal queue, so, in theory at least, and perhaps in some rare cases too, we may have a case where the max polling time has been reached, the parser is still busy, and TikaIO fails to report all the file data. I think that can be solved by either 2a) configuring the max polling time to a very large number which will never be reached in a practical case, or 2b) simply using a blocking queue without the time limits - in the worst case, if TikaParser spins and fails to report the end of the document, then Beam can heal itself if the pipeline blocks. I propose to follow 2b) (a rough sketch of this hand-off is in the PS at the end of this message). Please let me know what you think.

My plan so far is:

1) start addressing most of Eugene's comments which would require some minor TikaIO updates
2) at the next stage, work on removing the TikaSource internal code dealing with file patterns which I copied from TextIO
3) if needed - mark TikaIO Experimental to give Tika and Beam users some time to try it with some real complex files, and also decide whether TikaIO can continue to be implemented as a BoundedSource/Reader or not

Eugene, all, will it work if I start with 1)?

Thanks, Sergey

[1] https://issues.apache.org/jira/browse/BEAM-2328
[2] https://github.com/apache/beam/pull/3378

--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
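PS: to make option 2b) above concrete, here is a minimal sketch of the queue hand-off between the internal Tika parsing thread and the reader (class and method names are illustrative, not the actual TikaReader internals):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.xml.sax.helpers.DefaultHandler;

public class QueueingTikaParser {
  // Sentinel marking the end of the document.
  private static final String EOF = "";

  public static BlockingQueue<String> startParsing(String path) {
    BlockingQueue<String> queue = new ArrayBlockingQueue<>(100);
    new Thread(() -> {
      DefaultHandler handler = new DefaultHandler() {
        @Override
        public void characters(char[] ch, int start, int length) {
          if (length == 0) {
            return; // keep the EOF sentinel unambiguous
          }
          try {
            // put() blocks when the queue is full: the parser waits for the
            // consumer instead of the consumer timing out on a poll (2b).
            queue.put(new String(ch, start, length));
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      };
      try (InputStream stream = Files.newInputStream(Paths.get(path))) {
        new AutoDetectParser().parse(stream, handler, new Metadata(), new ParseContext());
      } catch (Exception e) {
        // A real implementation would hand the failure over to the reader too.
      } finally {
        try {
          queue.put(EOF);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    }).start();
    return queue;
  }

  // Reader side: take() blocks until a chunk (or EOF) arrives, so the
  // "max polling time reached while the parser is still busy" failure
  // mode of option 2a) cannot occur.
  public static void drain(BlockingQueue<String> queue) throws InterruptedException {
    for (String chunk = queue.take(); !EOF.equals(chunk); chunk = queue.take()) {
      System.out.println(chunk);
    }
  }
}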