It makes sense to merge after 0.5.0 release.

Good point Davor: +1

Regards
JB

On 01/17/2017 03:34 PM, Davor Bonaci wrote:
+1. I think merging to master would be an awesome next step for the Python
SDK.

And, thanks for a great summary of the current state, roadmap, and impact
to the project as a whole -- awesome!

Process-wise, I'd suggest starting a formal vote once this discussion seems
to be trending towards a conclusion, and complete the merge as soon as the
next release (0.5.0) is cut. This would enable additional time before 0.6.0
to figure out compliance, release process impact, etc.

Great work everyone!

On Tue, Jan 17, 2017 at 8:26 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi

I didn't try the Python SDK recently but you provided a clear "state of
the art". Anyway I'm in favor of merging things as quickly as possible
(assuming it's in good shape in terms of build, test, ...): it would
potentially grow the "external" contributions.

So +1 from my side.

Regards
JB

On Jan 17, 2017, 08:22, at 08:22, Ahmet Altay <al...@google.com.INVALID>
wrote:
Hi all,

tl;dr: I would like to start a discussion about merging python-sdk branch
to master branch. Python SDK is mature enough and merging it to master will
accelerate its development and adoption.

With a great effort from a lot of contributors(*), Python SDK [1] is now a
mostly complete, tested, performant Python implementation of the Beam
model. Since June, when we first started with Python SDK in Apache Beam, we
have been continuously improving it.

** Python SDK currently supports:

* Model: All main concepts are present (ParDo, GroupByKey, Windowing, etc.).
* IO: There are extensible APIs for writing new bounded sources and sinks.
Implementations are provided for Text, Avro, BigQuery, and Datastore.
* Runners: Python SDK has an extensible base runner module that allows
building specific runners on top of it. The SDK comes with two pipeline
runners: DirectRunner and DataflowRunner; and it is possible to add more.
The existing runners are currently limited to bounded execution and
otherwise equivalent to their Java SDK counterparts in functionality.
* Testing: Python SDK implements the ValidatesRunner test framework for
implementing integration tests for current and future runners. There is
unit test coverage for all modules, and a number of integration tests for
validating existing runners.
* Documentation and examples: Documentation work has started on Python SDK.
Beam Programming Guide page has been updated to include Python [2]. The
code comes with many ready-to-use examples and we are in a good place to
start documenting those on the website.

** We are not done yet, next on the roadmap we have:

* Streaming: Both of the existing runners lack support for streaming
execution, and currently there is work going on for adding streaming
support to DirectRunner [3].
* Documentation: Filling the rest of the Beam documentation with Python
SDK specific information and examples.
* SDK consistency: Making Python SDK consistent with the Java SDK. We have
come a long way on this and have only a few items left [4].
* Beamifying: We have been working on removing Dataflow-specific references
both from the documentation and from the code. There is some work left, and
we are currently working on those as well [5].

** Steps and implications of merging to master:

* Master branch is merged to python-sdk branch at regular intervals and the
last merge was on 12/22. All the past merges were uneventful because there
is a minimal overlap in modified files between branches. Integrating
python-sdk to master will similarly touch a small number of existing files.

* Python SDK is using the same tools for building and testing. It is
already integrated with Maven, Jenkins and Travis. Specifically the impact
to the testing infrastructure would be:
- There will be two additional test configurations in Travis. Since Travis
runs all configurations in parallel there should not be a noticeable change
in the Travis run time.
- Jenkins pre-commit test will start running the Python SDK tests. It will
add an additional 5 minutes to the completion time of pre-commit test.
Historically Python SDK tests were not flaky and did not cause any random
failures.
- Jenkins Python post-commit test is already separated from the other
post-commit tests and will continue to exist. It would not change the
testing time for any other test.

* The release process needs to be updated to accommodate releasing Python
artifacts. Python SDK would fit in the existing release schedule and could
be released along with the Java SDK. The additional steps would include:
- Generating Python artifacts. This could be done with a single command
using Maven today.
- Publishing the artifacts to a central repository such as PyPI.
- Updating the release guide to reflect the changes above.

* Users: There are existing users using the Python SDK. To give a rough
estimate, a distribution of the Beam Python SDK had a total of 23K
downloads in the past 6 months [6]. Some of those users are already engaged
with the community (e.g. [7]). There might be an increased amount of
engagement from the rest of them after the merge.

Looking forward to hearing your thoughts and comments on “graduating”
python-sdk to the master.

Thank you,
Ahmet

(*) Python SDK branch currently has a diverse group of contributors.
Regular contributors include Charles Chen, Chamikara Jayalath, María García
Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam PMC),
Sourabh Bajaj, and Vikas Kedigehalli. We have also had contributions from
Abdullah Bashir, Marco Buccini, Sergio Fernández, Seunghyun Lee, and
Younghee Kwon.

[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
[2] https://beam.apache.org/documentation/programming-guide/
[3] https://issues.apache.org/jira/browse/BEAM-1265
[4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
[5] https://issues.apache.org/jira/browse/BEAM-1218
[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
[7] https://issues.apache.org/jira/browse/BEAM-1251



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com
