Hi JB,

Thanks for the welcome! I come from Python land, so I am not very familiar with Maven. What do you mean by a Maven module? Do you mean an artifact, so that people can install things? In Python, people are used to packages downloaded from PyPI (pypi.python.org -- which is sort of a Maven for Python). Whatever the standard way of doing things in Apache is, we'll do it. Just asking for clarification.
By the way, this discussion is very useful, since we will have to iron out several details like this.

Thanks,
Silviu

On Fri, Jun 3, 2016 at 6:19 AM, Jean-Baptiste Onofré <[email protected]> wrote:

> Hi Silviu,
>
> Thanks for the detailed update and great work!
>
> I would advise creating a
>
> sdks/python
>
> Maven module to store the Python SDK.
>
> WDYT?
>
> By the way, welcome aboard, and great to have you all in the team!
>
> Regards
> JB
>
> On 06/03/2016 03:13 PM, Silviu Calinoiu wrote:
>
>> Hi all,
>>
>> My name is Silviu Calinoiu and I am a member of the Cloud Dataflow team
>> working on the Python SDK. As the original Beam proposal
>> (https://wiki.apache.org/incubator/BeamProposal) mentioned, we have been
>> planning to merge the Python SDK into Beam. The Python SDK is in an early
>> stage of development (alpha milestone), so this is a good time to move
>> the code without causing too much disruption to our customers.
>> Additionally, this enables the Beam community to contribute as soon as
>> possible.
>>
>> The current state of the SDK is as follows:
>>
>> - Open-sourced at
>>   https://github.com/GoogleCloudPlatform/DataflowPythonSDK/
>>
>> - Model: All main concepts are present.
>>
>> - I/O: The SDK supports text (Google Cloud Storage) and BigQuery
>>   connectors and has a framework for adding additional sources and
>>   sinks.
>>
>> - Runners: The SDK has two pipeline runners: a direct runner (in-process,
>>   local execution) and a Cloud Dataflow runner for batch pipelines
>>   (which submits jobs to the Google Dataflow service). The current
>>   direct runner is bounded only (batch execution), but there is work in
>>   progress to support unbounded execution (as in Java).
>>
>> - Testing: The code base has unit test coverage for all the modules and
>>   several integration and end-to-end tests (similar in coverage to the
>>   Java SDK). Streaming is not well tested end to end yet, since Cloud
>>   Dataflow focused first on batch.
>>
>> - Docs: We have matching Python documentation for the features currently
>>   supported by Cloud Dataflow. The docs are on cloud.google.com (access
>>   by whitelist only, due to the alpha stage of the project). Devin is
>>   working on the transition of all docs to Apache.
>>
>> In the coming days and weeks we would like to prepare and start
>> migrating the code, so you should start seeing some pull requests. We
>> also hope that the Beam community will shape the SDK going forward. In
>> particular, all the model improvements implemented for Java (Runner
>> API, etc.) will have equivalents in Python once they stabilize. If you
>> have any advice before we start the journey, please let us know.
>>
>> The team that will join the Beam effort consists of me (Silviu
>> Calinoiu), Charles Chen, Ahmet Altay, Chamikara Jayalath, and last but
>> not least Robert Bradshaw (who is already an Apache Beam committer).
>>
>> So let us know what you think!
>>
>> Best regards,
>>
>> Silviu
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
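For readers who want a feel for the SDK described above, a minimal word-count pipeline might look roughly like the sketch below. It uses the apache_beam package name the SDK eventually adopted; at the time of this thread the code still shipped under the Dataflow name, so the import and the exact transform names here are assumptions, not the API of the donated code. With no runner specified, the in-process direct runner mentioned above is the default.

    # A minimal sketch, assuming the post-donation package name
    # `apache_beam`; the SDK discussed in this thread still shipped
    # as the Dataflow Python SDK under a different package name.
    import apache_beam as beam

    # No runner is specified, so the in-process direct runner executes
    # this bounded (batch) pipeline locally.
    with beam.Pipeline() as p:
        (p
         | 'Read' >> beam.io.ReadFromText('input.txt')   # text connector
         | 'Split' >> beam.FlatMap(lambda line: line.split())
         | 'Pair' >> beam.Map(lambda word: (word, 1))
         | 'Count' >> beam.CombinePerKey(sum)
         | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
         | 'Write' >> beam.io.WriteToText('counts'))     # sharded output files

Passing pipeline options (runner name, project, staging location) would select the Cloud Dataflow runner instead, submitting the same pipeline as a job to the Google Dataflow service.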
