On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin
<dhalp...@google.com.invalid> wrote:
> I do not think that Python SDK yet meets the bar [1] for implementing the
> Beam model -- supporting Unbounded data is very important. That said, given
> the committed and sustained set of contributors, it generally makes sense
> to me to make an exception in anticipation of these features being fleshed
> out soon; including potentially new users/contributors that would arrive
> once in master.
>
> [1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
> k0plmm3f5e5bqwjz4+c5doruclnxo...@mail.gmail.com

That is a valid point. The Python SDK supports all the unbounded parts
of the model except for unbounded sources, which was deferred while
seeing how https://s.apache.org/splittable-do-fn played out. I've been
working with the team and merging/reviewing most of their code, and
have full confidence this will be coming (and on that note can vouch
for a healthy community and support which are much harder to add
later).

In short, I think it has the required maturity, and I'm in favor of
merging soonish.

> On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <al...@google.com.invalid>
> wrote:
>
>> Thank you all for the comments so far. I would follow the process as
>> suggested by Davor and others in this thread.
>>
>> Ahmet
>>
>> On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <wik...@apache.org>
>> wrote:
>>
>> > Hi
>> >
>> > On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <al...@google.com.invalid>
>> > wrote:
>> > >
>> > > tl;dr: I would like to start a discussion about merging python-sdk branch
>> > > to master branch. Python SDK is mature enough and merging it to master will
>> > > accelerate its development and adoption.
>> > >
>> >
>> > Good point, Ahmet!
>> >
>> > I've been closely following the development since it was imported in June. For
>> > the prototypes I've implemented so far it works quite well; I guess we'd
>> > just need to focus the next months on bringing more runner support.
>> >
>> > > With a great effort from a lot of contributors(*), Python SDK [1] is now a
>> > > mostly complete, tested, performant Python implementation of the Beam
>> > > model. Since June, when we first started with Python SDK in Apache Beam we
>> > > have been continuously improving it.
>> > >
>> >
>> > I wouldn't merge during the preparation of 0.5.0 release, but after that
>> > could be a good time to merge back into master.
>> >
>> >
>> > > ** Python SDK currently supports:
>> > >
>> > > * Model: All main concepts are present (ParDo, GroupByKey, Windowing etc.).
>> > > * IO: There are extensible APIs for writing new bounded sources and sinks.
>> > > Implementations are provided for Text, Avro, BigQuery, and Datastore.
>> > > * Runners: Python SDK has an extensible base runner module that allows
>> > > building specific runners on top of it. The SDK comes with two pipeline
>> > > runners: DirectRunner and DataflowRunner; and it is possible to add more.
>> > > The existing runners are currently limited to bounded execution and
>> > > otherwise equivalent to their Java SDK counterparts in functionality.
>> > >
>> >
>> > What would be the effort of porting, and maintaining, parallel versions of the
>> > Java runners? I guess I'd need to dig deeper into the model, but this may
>> > represent a major effort for the project, right?
>> >
>>
>> It is somewhat higher for DirectRunner because DirectRunner also implements
>> the code for execution. It is not that high for DataflowRunner because the
>> base runner module has a lot of helpers with the right hooks for
>> implementing a generic runner. I would _expect_ the experience in general
>> would be similar to the latter.
>>
>>
>> >
>> >
>> >
>> > > * Testing: Python SDK implements ValidatesRunner test framework for
>> > > implementing integration tests for current and future runners. There is unit
>> > > test coverage for all modules, and a number of integration tests for
>> > > validating existing runners.
>> > > * Documentation and examples: Documentation work has started on Python SDK.
>> > > Beam Programming Guide page has been updated to include Python [2]. The
>> > > code comes with many ready to use examples and we are in a good place to
>> > > start documenting those on the website.
>> > >
>> > > ** We are not done yet, next on the roadmap we have:
>> > >
>> > > * Streaming: Both of the existing runners lack support for streaming
>> > > execution, and currently there is work going on for adding streaming
>> > > support to DirectRunner [3].
>> > > * Documentation: Filling the rest of the Beam documentation with Python
>> > > SDK specific information and examples.
>> > > * SDK consistency: Making Python SDK consistent with the Java SDK. We have
>> > > come a long way on this and have only a few items left [4].
>> > > * Beamifying: We have been working on removing Dataflow-specific references
>> > > both from the documentation and from the code. There is some work left, and
>> > > we are currently working on those as well [5].
>> > >
>> > > ** Steps and implications of merging to master:
>> > >
>> > > * Master branch is merged to python-sdk branch at regular intervals and the
>> > > last merge was on 12/22. All the past merges were uneventful because there
>> > > is a minimal overlap in modified files between branches. Integrating
>> > > python-sdk to master will similarly touch a small number of existing files.
>> > >
>> > > * Python SDK is using the same tools for building and testing. It is
>> > > already integrated with Maven, Jenkins and Travis. Specifically the impact
>> > > to the testing infrastructure would be:
>> > > - There will be two additional test configurations in Travis. Since Travis
>> > > runs all configurations in parallel there should not be a noticeable change
>> > > in the Travis run time.
>> > > - Jenkins pre-commit test will start running the Python SDK tests. It will
>> > > add an additional 5 minutes to the completion time of pre-commit test.
>> > > Historically Python SDK tests were not flaky and did not cause any random
>> > > failures.
>> > > - Jenkins Python post-commit test is already separated from the other
>> > > post-commit tests and will continue to exist. It would not change the
>> > > testing time for any other test.
>> > >
>> > > * The release process needs to be updated to accommodate releasing Python
>> > > artifacts. Python SDK would fit in the existing release schedule and could
>> > > be released along with the Java SDK. The additional steps would include:
>> > > - Generating Python artifacts. This could be done with a single command
>> > > using Maven today.
>> > > - Publishing the artifacts to a central repository such as PyPI.
>> > >
>> >
>> > I'm more than happy to help on this. We purposely left some things open
>> > when we added Maven support to the Python build.
>> >
>>
>> That would be awesome. We can coordinate on that post-merge.
>>
>>
>> >
>> >
>> >
>> > > - Updating the release guide to reflect the changes above.
>> > >
>> > > * Users: There are existing users using the Python SDK. To give a rough
>> > > estimate, a distribution of the Beam Python SDK had a total of 23K
>> > > downloads in the past 6 months [6]. Some of those users are already engaged
>> > > with the community (e.g. [7]). There might be an increased amount of
>> > > engagement from the rest of them after the merge.
>> > >
>> >
>> > Python 3 support is something we definitely need to plan for. I'd try
>> > to make the codebase compatible with both 2.7.x and 3.6.x, rather than
>> > using other solutions like 2to3.
>> >
>>
>> I agree with you. I think it makes more sense to make the codebase compatible
>> with both. As you mentioned, Python 3 support is not a short-term goal in
>> the roadmap, and we can discuss it more as we approach that.
>>
>>
>> >
>> >
>> > > Looking forward to hearing your thoughts and comments on “graduating”
>> > > python-sdk to the master.
>> > >
>> > > Thank you,
>> > > Ahmet
>> > >
>> > > (*) Python SDK branch currently has a diverse group of contributors.
>> > > Regular contributors include Charles Chen, Chamikara Jayalath, María García
>> > > Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
>> > > Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
>> > > Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
>> > > Younghee Kwon.
>> > >
>> > > [1] https://github.com/apache/beam/tree/python-sdk/sdks/python
>> > > [2] https://beam.apache.org/documentation/programming-guide/
>> > > [3] https://issues.apache.org/jira/browse/BEAM-1265
>> > > [4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
>> > > [5] https://issues.apache.org/jira/browse/BEAM-1218
>> > > [6] https://pypi.python.org/pypi/google-cloud-dataflow/json
>> > > [7] https://issues.apache.org/jira/browse/BEAM-1251
>> > >
>> >
>> >
>> > Great summary, Ahmet. Thanks.
>> >
>> > Cheers,
>> >
>> > --
>> > Sergio Fernández
>> > Partner Technology Manager
>> > Redlink GmbH
>> > m: +43 6602747925
>> > e: sergio.fernan...@redlink.co
>> > w: http://redlink.co
>> >
>>
