Re: [DISCUSS] Python SDK status and next steps

Jean-Baptiste Onofré Tue, 31 Jan 2017 04:34:58 -0800

No, that's fine as long as we clearly document the prerequisites for the build. IMHO, we should provide quick BUILDING instructions in the README.md.


Regards
JB


On 01/31/2017 01:24 PM, Sergio Fernández wrote:

Originally we integrated the build into Maven under the default profile.
Do you feel it would be better to have it under a separate profile?

On Tue, Jan 31, 2017 at 11:07 AM, Jean-Baptiste Onofré <[email protected]>
wrote:

Just to be clear, the prerequisites for building the Python SDK are:

apt-get install python-setuptools
apt-get install python-pip

They are also required by the default "regular" build.
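
As a rough illustration only, a check for both prerequisites before starting the build could look like the sketch below. The helper name `missing_prerequisites` and the `REQUIRED_MODULES` tuple are made up for this example and are not part of Beam or its build; the check simply tries to import each module with the interpreter that runs it.

```python
import importlib

# Hypothetical list for this sketch: the two packages named above.
REQUIRED_MODULES = ("setuptools", "pip")


def missing_prerequisites(modules=REQUIRED_MODULES):
    """Return the names of the modules that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            # The module is not installed for this interpreter.
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing build prerequisites: " + ", ".join(missing))
    else:
        print("setuptools and pip are both importable.")
```

Run it with the same Python interpreter the build will pick up. Something like this might also serve as a starting point for the BUILDING notes.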

Regards
JB


On 01/31/2017 11:02 AM, Jean-Baptiste Onofré wrote:

Just one thing I noticed (and can be helpful for others): to build Beam
we now need python setuptools installed.

For instance, on Ubuntu, you have to do:

apt-get install python-setuptools

Same for the pip distribution.

I guess (if not already done), we have to update README/Building
instructions.

Correct ?

Regards
JB

On 01/31/2017 08:10 AM, Ahmet Altay wrote:

Hi all,

This merge is completed. Python SDK is now officially part of the master
branch! Thank you all for the support. Please open an issue if you notice
a reference to the now obsolete python-sdk branch in the documentation.

There will not be any more merges to the python-sdk branch. Going forward
please use the master branch for Python SDK development. There are a few
existing open PRs to the python-sdk [1]. If you are the author of one of
those PRs, please rebase them on top of master.

Thank you,
Ahmet

[1] https://github.com/pulls?utf8=✓&q=is%3Aopen+is%3Apr+base%3Apython-sdk+repo%3Aapache%2Fbeam+


On Fri, Jan 20, 2017 at 10:06 AM, Kenneth Knowles <[email protected]> wrote:

To clarify the implied criteria of that last exchange, it is "An SDK
should have at least one runner that can execute the complete model (may
be a direct runner)".

I want to highlight this, because whether an _SDK_ supports unbounded
data is not particularly well-defined, and will evolve:

 - With the Runner API, an SDK will need to support building a graph
with unbounded constructs, as today with probably minimal changes.

 - With the Fn API, if any part of the Fn API is specific to unbounded
data, the SDK will need to implement it. I think right now there is no
such thing, and we don't want such a thing, so SDKs implementing the Fn
API automatically support unbounded data.

 - There will also likely be an SDK-specific shim just as there is
today, to leverage idiomatic deserialized representations. The richness
of this shim will decrease so that it will need to "support" unbounded
data but that will be a ~one liner.

Getting the Python SDK on master will accelerate our progress towards
the Fn API - partly technical, partly community - which is the best path
towards support for unbounded data across multiple runners. I think the
criteria are written with the completed portability framework in mind.
So this exchange makes me actually more convinced we should merge
python-sdk to master.

On Fri, Jan 20, 2017 at 9:53 AM, Robert Bradshaw <
[email protected]> wrote:

On Thu, Jan 19, 2017 at 11:56 PM, Dan Halperin <[email protected]> wrote:

I do not think that Python SDK yet meets the bar [1] for implementing
the Beam model -- supporting Unbounded data is very important. That
said, given the committed and sustained set of contributors, it
generally makes sense to me to make an exception in anticipation of
these features being fleshed out soon; including potentially new
users/contributors that would arrive once in master.


[1] https://lists.apache.org/thread.html/CAAzyFAxcmexUQnbF=Y
[email protected]


That is a valid point. The Python SDK supports all the unbounded parts
of the model except for unbounded sources, which was deferred while
seeing how https://s.apache.org/splittable-do-fn played out. I've been
working with the team and merging/reviewing most of their code, and
have full confidence this will be coming (and on that note can vouch
for a healthy community and support which are much harder to add
later).

In short, I think it has the required maturity, and I'm in favor of
merging soonish.

On Wed, Jan 18, 2017 at 12:24 AM, Ahmet Altay <[email protected]> wrote:

Thank you all for the comments so far. I would follow the process as
suggested by Davor and others in this thread.

Ahmet

On Tue, Jan 17, 2017 at 11:47 PM, Sergio Fernández <[email protected]> wrote:

Hi


On Tue, Jan 17, 2017 at 5:22 PM, Ahmet Altay <[email protected]> wrote:

tl;dr: I would like to start a discussion about merging python-sdk
branch to master branch. Python SDK is mature enough and merging it to
master will accelerate its development and adoption.

Good point, Ahmet!

I've been following the development closely since it was imported in
June. For the prototypes I've implemented so far it works quite well; I
guess we'd just need to focus the next few months on bringing support
for more runners.


With a great effort from a lot of contributors(*), Python SDK [1] is
now a mostly complete, tested, performant Python implementation of the
Beam model. Since June, when we first started with Python SDK in Apache
Beam, we have been continuously improving it.

I wouldn't merge during the preparation of 0.5.0 release, but after
that could be a good time to merge back into master.



** Python SDK currently supports:

* Model: All main concepts are present (ParDo, GroupByKey, Windowing
etc.).

* IO: There are extensible APIs for writing new bounded sources and
sinks. Implementations are provided for Text, Avro, BigQuery, and
Datastore.

* Runners: Python SDK has an extensible base runner module that allows
building specific runners on top of it. The SDK comes with two pipeline
runners: DirectRunner and DataflowRunner; and it is possible to add
more. The existing runners are currently limited to bounded execution
and otherwise equivalent to their Java SDK counterparts in
functionality.

What would be the effort of porting, and maintaining, parallel versions
of the Java runners? I guess I'd need to dig deeper into the model, but
this may represent a major effort for the project, right?

It is somewhat higher for DirectRunner because DirectRunner also
implements the code for execution. It is not that high for
DataflowRunner because the base runner module has a lot of helpers with
the right hooks for implementing a generic runner. I would _expect_ the
experience in general would be similar to the latter.



* Testing: Python SDK implements ValidatesRunner test framework for
implementing integration tests for current and future runners. There is
unit test coverage for all modules, and a number of integration tests
for validating existing runners.

* Documentation and examples: Documentation work has started on Python
SDK. Beam Programming Guide page has been updated to include Python
[2]. The code comes with many ready to use examples and we are in a
good place to start documenting those on the website.
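
For readers who have not met the model concepts named in the Model item above, here is a plain-Python sketch of what ParDo and GroupByKey do to a small bounded collection. It is only an illustration of the semantics and does not use the Beam API: the names `par_do`, `group_by_key` and `split_words` are invented for this example, and windowing, parallelism and runners are ignored entirely.

```python
from collections import defaultdict


def par_do(elements, fn):
    """ParDo-like step: apply fn to each element, emitting 0..n outputs."""
    for element in elements:
        for output in fn(element):
            yield output


def group_by_key(pairs):
    """GroupByKey-like step: collect the values of (key, value) pairs per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def split_words(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)


lines = ["to be or not", "to be"]
grouped = group_by_key(par_do(lines, split_words))
# Sum the ones collected under each word to get a word count.
counts = {word: sum(ones) for word, ones in grouped.items()}
```

In the SDK itself these steps are transforms applied to a pipeline and executed by a runner. The sketch only shows the shape of the data at each step.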


** We are not done yet, next on the roadmap we have:

* Streaming: Both of the existing runners lack support for streaming
execution, and currently there is work going on for adding streaming
support to DirectRunner [3].

* Documentation: Filling the rest of the Beam documentation with Python
SDK specific information and examples.

* SDK consistency: Making Python SDK consistent with the Java SDK. We
have come a long way on this and have only a few items left [4].

* Beamifying: We have been working on removing Dataflow-specific
references both from the documentation and from the code. There is some
work left, and we are currently working on those as well [5].

** Steps and implications of merging to master:

* Master branch is merged to python-sdk branch at regular intervals and
the last merge was on 12/22. All the past merges were uneventful
because there is a minimal overlap in modified files between branches.
Integrating python-sdk to master will similarly touch a small number of
existing files.

* Python SDK is using the same tools for building and testing. It is
already integrated with Maven, Jenkins and Travis. Specifically the
impact to the testing infrastructure would be:

- There will be two additional test configurations in Travis. Since
Travis runs all configurations in parallel there should not be a
noticeable change in the Travis run time.

- Jenkins pre-commit test will start running the Python SDK tests. It
will add an additional 5 minutes to the completion time of pre-commit
test. Historically Python SDK tests were not flaky and did not cause
any random failures.

- Jenkins Python post-commit test is already separated from the other
post-commit tests and will continue to exist. It would not change the
testing time for any other test.


* The release process needs to be updated to accommodate releasing
Python artifacts. Python SDK would fit in the existing release schedule
and could be released along with the Java SDK. The additional steps
would include:

- Generating Python artifacts. This could be done with a single command
using Maven today.

- Publishing the artifacts to a central repository such as PyPI.

I'm more than happy to help on this. We left on purpose some things
open when we added Maven support to the Python build.

That would be awesome. We can coordinate on that post-merge.



- Updating the release guide to reflect the changes above.


* Users: There are existing users using the Python SDK. To give a rough
estimate, a distribution of the Beam Python SDK had a total of 23K
downloads in the past 6 months [6]. Some of those users are already
engaged with the community (e.g. [7]). There might be an increased
amount of engagement from the rest of them after the merge.

Python 3 support is something we definitely need to look ahead to. I'd
try to make the codebase compatible with both 2.7.x and 3.6.x, rather
than using other solutions like 2to3.

I agree with you. I think it makes more sense to make the codebase
compatible with both. As you mentioned Python 3 support is not a
short-term goal in the roadmap, and we can discuss it more as we
approach that.
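
To make the "compatible with both" option concrete, here is a minimal sketch of the single-source style, as opposed to generating a separate Python 3 tree with 2to3. None of this is taken from the Beam codebase: `text_type`, `to_text` and `word_lengths` are names invented for the example, and the same file is meant to run unchanged on 2.7 and 3.x.

```python
import sys

PY2 = sys.version_info[0] == 2

# The text type is `unicode` on Python 2 and `str` on Python 3.
if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)
else:
    text_type = str


def to_text(value, encoding="utf-8"):
    """Return value as a text string on both Python 2 and Python 3."""
    if isinstance(value, bytes):
        # On Python 2, `bytes` is an alias of `str`, so this decodes it too.
        return value.decode(encoding)
    return text_type(value)


def word_lengths(words):
    """Map each word to its length, as a sorted list of (word, length)."""
    lengths = {}
    for word in words:
        word = to_text(word)
        lengths[word] = len(word)
    # dict.items() exists on both versions; iteritems() is Python 2 only.
    return sorted(lengths.items())
```

The trade-off is a little branching like the `PY2` check above in exchange for one codebase and no translation step at build time.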


Looking forward to hearing your thoughts and comments on “graduating”
python-sdk to the master.


Thank you,
Ahmet

(*) Python SDK branch currently has a diverse group of contributors.
Regular contributors include Charles Chen, Chamikara Jayalath, María
García Herrero, Mark Liu, Pablo Estrada, Robert Bradshaw (Apache Beam
PMC), Sourabh Bajaj, and Vikas Kedigehalli. We have also had
contributions from Abdullah Bashir, Marco Buccini, Sergio Fernández,
Seunghyun Lee, and Younghee Kwon.


[1] https://github.com/apache/beam/tree/python-sdk/sdks/python
[2] https://beam.apache.org/documentation/programming-guide/
[3] https://issues.apache.org/jira/browse/BEAM-1265
[4] https://issues.apache.org/jira/issues/?jql=status%20%3D%20Open%20AND%20labels%20%3D%20sdk-consistency
[5] https://issues.apache.org/jira/browse/BEAM-1218
[6] https://pypi.python.org/pypi/google-cloud-dataflow/json
[7] https://issues.apache.org/jira/browse/BEAM-1251


Great summary, Ahmet. Thanks.

Cheers,

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: [email protected]
w: http://redlink.co

--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com


