Thank you Dan. Adding support for unbounded data is on the roadmap and it will be added to Python SDK soon.
Thank you all again, I will start the official voting thread.

Thank you,
Ahmet

On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin <dhalp...@google.com.invalid> wrote:

> I do not think that Python SDK yet meets the bar [1] for implementing the
> Beam model -- supporting Unbounded data is very important. That said, given
> the committed and sustained set of contributors, it generally makes sense
> to me to make an exception in anticipation of these features being fleshed
> out soon; including potentially new users/contributors that would arrive
> once in master.
>
> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Yk0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com
>
> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <al...@google.com.invalid> wrote:
>
> > Thank you all for the comments so far. I would follow the process as
> > suggested by Davor and others in this thread.
> >
> > Ahmet
> >
> > On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org> wrote:
> >
> > > Hi
> > >
> > > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid> wrote:
> > >
> > > > tl;dr: I would like to start a discussion about merging python-sdk branch
> > > > to master branch. Python SDK is mature enough and merging it to master will
> > > > accelerate its development and adoption.
> > >
> > > Good point, Ahmet!
> > >
> > > I've been closely following the development since it was imported in June. For
> > > the prototypes I've implemented so far it works quite well; I guess we'd
> > > just need to focus the next months on bringing more runner support.
> > >
> > > > With a great effort from a lot of contributors(*), Python SDK [1] is now a
> > > > mostly complete, tested, performant Python implementation of the Beam
> > > > model. Since June, when we first started with Python SDK in Apache Beam we
> > > > have been continuously improving it.
> > >
> > > I wouldn't merge during the preparation of 0.5.0 release, but after that
> > > could be a good time to merge back into master.
> > >
> > > > ** Python SDK currently supports:
> > > >
> > > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
> > > > * IO: There are extensible APIs for writing new bounded sources and sinks.
> > > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
> > > > * Runners: Python SDK has an extensible base runner module that allows
> > > > building specific runners on top of it. The SDK comes with two pipeline
> > > > runners: DirectRunner and DataflowRunner; and it is possible to add more.
> > > > The existing runners are currently limited to bounded execution and
> > > > otherwise equivalent to their Java SDK counterparts in functionality.
> > >
> > > What would be the effort of porting, and maintaining, parallel versions of the
> > > Java runners? I guess I'd need to dig deeper in the model, but this may
> > > represent a major effort for the project, right?
> >
> > It is somewhat higher for DirectRunner because DirectRunner also implements
> > the code for execution. It is not that high for DataflowRunner because the
> > base runner module has a lot of helpers with the right hooks for
> > implementing a generic runner. I would _expect_ the experience in general
> > would be similar to the latter.
> >
> > > > * Testing: Python SDK implements ValidatesRunner test framework for
> > > > implementing integration tests for current and future runners. There is unit
> > > > test coverage for all modules, and a number of integration tests for
> > > > validating existing runners.
> > > > * Documentation and examples: Documentation work has started on Python SDK.
> > > > Beam Programming Guide page has been updated to include Python [2].
> > > > The code comes with many ready to use examples and we are in a good place to
> > > > start documenting those on the website.
> > > >
> > > > ** We are not done yet, next on the roadmap we have:
> > > >
> > > > * Streaming: Both of the existing runners lack support for streaming
> > > > execution, and currently there is work going on for adding streaming
> > > > support to DirectRunner [3].
> > > > * Documentation: Filling the rest of the Beam documentation with Python
> > > > SDK specific information and examples.
> > > > * SDK consistency: Making Python SDK consistent with the Java SDK. We have
> > > > come a long way on this and have only a few items left [4].
> > > > * Beamifying: We have been working on removing Dataflow-specific references
> > > > both from the documentation and from the code. There is some work left, and
> > > > we are currently working on those as well [5].
> > > >
> > > > ** Steps and implications of merging to master:
> > > >
> > > > * Master branch is merged to python-sdk branch at regular intervals and the
> > > > last merge was on 12/22. All the past merges were uneventful because there
> > > > is a minimal overlap in modified files between branches. Integrating
> > > > python-sdk to master will similarly touch a small number of existing files.
> > > >
> > > > * Python SDK is using the same tools for building and testing. It is
> > > > already integrated with Maven, Jenkins and Travis. Specifically the impact
> > > > to the testing infrastructure would be:
> > > > - There will be two additional test configurations in Travis. Since Travis
> > > > runs all configurations in parallel there should not be a noticeable change
> > > > in the Travis run time.
> > > > - Jenkins pre-commit test will start running the Python SDK tests. It will
> > > > add an additional 5 minutes to the completion time of pre-commit test.
> > > > Historically Python SDK tests were not flaky and did not cause any random
> > > > failures.
> > > > - Jenkins Python post-commit test is already separated from the other
> > > > post-commit tests and will continue to exist. It would not change the
> > > > testing time for any other test.
> > > >
> > > > * The release process needs to be updated to accommodate releasing Python
> > > > artifacts. Python SDK would fit in the existing release schedule and could
> > > > be released along with the Java SDK. The additional steps would include:
> > > > - Generating Python artifacts. This could be done with a single command
> > > > using Maven today.
> > > > - Publishing the artifacts to a central repository such as PyPI.
> > >
> > > I'm more than happy to help on this. We left on purpose some things open
> > > when we added Maven support to the Python build.
> >
> > That would be awesome. We can coordinate on that post-merge.
> >
> > > > - Updating the release guide to reflect the changes above.
> > > >
> > > > * Users: There are existing users using the Python SDK. To give a rough
> > > > estimate, a distribution of the Beam Python SDK had a total of 23K
> > > > downloads in the past 6 months [6]. Some of those users are already engaged
> > > > with the community (e.g. [7]). There might be an increased amount of
> > > > engagement from the rest of them after the merge.
> > >
> > > Python 3 support is something we definitely need to look ahead to. I'd try
> > > to make the codebase compatible with both 2.7.x and 3.6.x, rather than
> > > using other solutions like 2to3.
> >
> > I agree with you. I think it makes more sense to make the codebase compatible
> > with both. As you mentioned Python 3 support is not a short-term goal in
> > the roadmap, and we can discuss it more as we approach that.
> >
> > > > Looking forward to hearing your thoughts and comments on “graduating”
> > > > python-sdk to the master.
> > > >
> > > > Thank you,
> > > > Ahmet
> > > >
> > > > (*) Python SDK branch currently has a diverse group of contributors.
> > > > Regular contributors include Charles Chen, Chamikara Jayalath, María García
> > > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
> > > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
> > > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
> > > > Younghee Kwon.
> > > >
> > > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
> > > > [2] https://beam.apache.org/documentation/programming-guide/
> > > > [3] https://issues.apache.org/jira/browse/BEAM-1265
> > > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
> > > > [5] https://issues.apache.org/jira/browse/BEAM-1218
> > > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
> > > > [7] https://issues.apache.org/jira/browse/BEAM-1251
> > >
> > > Great summary, Ahmet. Thanks.
> > >
> > > Cheers,
> > >
> > > --
> > > Sergio Fernández
> > > Partner Technology Manager
> > > Redlink GmbH
> > > m: +43 6602747925
> > > e: sergio.fernan...@redlink.co
> > > w: http://redlink.co