Re: Unbounded source translation for portable pipelines

Eugene Kirpichov Mon, 02 Jul 2018 16:40:51 -0700

Updated the Flink section.

To run a basic Python wordcount (sent to you in a separate thread, but
repeating here too for others to play with):


Step 1: Run once to build a container: "./gradlew -p sdks/python/container
docker"
Step 2: ./gradlew :beam-runners-flink_2.11-job-server:runShadow - this
starts up a local Flink portable JobService endpoint on localhost:8099
Step 3: run things using PortableRunner pointed at this endpoint - see e.g.
https://github.com/apache/beam/pull/5824/files

On Thu, Jun 28, 2018 at 1:37 AM Thomas Weise <t...@apache.org> wrote:

> Ankur/Eugene,
>
> When you have a chance, please also update the Flink section of:
> https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0
>
> Thanks!
>
> On Thu, Jun 28, 2018 at 10:29 AM Thomas Weise <t...@apache.org> wrote:
>
>> The command to run the job server appears to be: ./gradlew -p
>> runners/flink/job-server runShadow
>>
>> Can you please provide the equivalent of the super basic Python example
>> from the prototype:
>>
>>
>> https://github.com/bsidhom/beam/blob/hacking-job-server/sdks/python/flink-example.py
>>
>> Looks as if the Python side runner changed:
>>
>> Traceback (most recent call last):
>>   File "flink-example.py", line 7, in <module>
>>     from apache_beam.runners.portability import universal_local_runner
>> ImportError: cannot import name universal_local_runner
>>
>> Thanks,
>> Thomas
>>
>>
>> On Wed, Jun 27, 2018 at 9:34 PM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> Hi!
>>>
>>> Those instructions are not current and I think should be discarded as
>>> they referred to a particular effort that is over - +Ankur Goenka
>>> <goe...@google.com> is, I believe, working on the remaining finishing
>>> touches for running from a clean clone of Beam master and documenting how
>>> to do that; could you help Thomas so we can start looking at what the
>>> streaming runner is missing?
>>>
>>> We'll need to document this in a more prominent place. When we get to a
>>> state where we can run Python WordCount from master, we'll need to document
>>> it somewhere on the main portability page and/or the getting started guide;
>>> when we can run something more serious, e.g. Tensorflow pipelines, that
>>> will be worth a Beam blog post and worth documenting in the TFX
>>> documentation.
>>>
>>> On Wed, Jun 27, 2018 at 5:35 AM Thomas Weise <t...@apache.org> wrote:
>>>
>>>> Hi Eugene,
>>>>
>>>> The basic streaming translation is already in place from the prototype,
>>>> though I have not verified it on the master branch yet.
>>>>
>>>> Are the user instructions for the portable Flink runner at
>>>> https://s.apache.org/beam-portability-team-doc current?
>>>>
>>>> (I don't have a dependency on SDF since we are going to use custom
>>>> native Flink sources/sinks at this time.)
>>>>
>>>> Thanks,
>>>> Thomas
>>>>
>>>>
>>>> On Tue, Jun 26, 2018 at 2:13 AM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>>> Hi!
>>>>>
>>>>> Wanted to let you know that I've just merged the PR that adds
>>>>> checkpointable SDF support to the portable reference runner (ULR) and the
>>>>> Java SDK harness:
>>>>>
>>>>> https://github.com/apache/beam/pull/5566
>>>>>
>>>>> So now we have a reference implementation of SDF support in a portable
>>>>> runner, and a reference implementation of SDF support in a portable SDK
>>>>> harness.
>>>>> From here on, we need to replicate this support in other portable
>>>>> runners and other harnesses. The obvious targets are Flink and Python
>>>>> respectively.
>>>>>
>>>>> Chamikara was going to work on the Python harness. +Thomas Weise
>>>>> <t...@apache.org> Would you be interested in the Flink portable
>>>>> streaming runner side? It is of course blocked by having the rest of that
>>>>> runner working in streaming mode though (the batch mode is practically 
>>>>> done
>>>>> - will send you a separate note about the status of that).
>>>>>
>>>>> On Fri, Mar 23, 2018 at 12:20 PM Eugene Kirpichov <
>>>>> kirpic...@google.com> wrote:
>>>>>
>>>>>> Luke is right - unbounded sources should go through SDF. I am
>>>>>> currently working on adding such support to Fn API.
>>>>>> The relevant document is s.apache.org/beam-breaking-fusion (note: it
>>>>>> focuses on a much more general case, but also considers in detail the
>>>>>> specific case of running unbounded sources on Fn API), and the first
>>>>>> related PR is https://github.com/apache/beam/pull/4743 .
>>>>>>
>>>>>> Ways you can help speed up this effort:
>>>>>> - Make necessary changes to Apex runner per se to support regular
>>>>>> SDFs in streaming (without portability). They will likely largely carry
>>>>>> over to portable world. I recall that the Apex runner had some level of
>>>>>> support of SDFs, but didn't pass the ValidatesRunner tests yet.
>>>>>> - (general to Beam, not Apex-related per se) Implement the
>>>>>> translation of Read.from(UnboundedSource) via impulse, which will require
>>>>>> implementing an SDF that reads from a given UnboundedSource (taking the
>>>>>> UnboundedSource as an element). This should be fairly straightforward and
>>>>>> will allow all portable runners to take advantage of existing
>>>>>> UnboundedSource's.
>>>>>>
>>>>>>
>>>>>> On Fri, Mar 23, 2018 at 3:08 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>>>
>>>>>>> Using impulse is a precursor for both bounded and unbounded SDF.
>>>>>>>
>>>>>>> This JIRA represents the work that would be to add support for
>>>>>>> unbounded SDF using portability APIs:
>>>>>>> https://issues.apache.org/jira/browse/BEAM-2939
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 11:46 AM Thomas Weise <t...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> So for streaming, we will need the Impulse translation for bounded
>>>>>>>> input, identical with batch, and then in addition to that support for 
>>>>>>>> SDF?
>>>>>>>>
>>>>>>>> Any pointers what's involved in adding the SDF support? Is it
>>>>>>>> runner specific? Does the ULR cover it?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 11:26 AM, Lukasz Cwik <lc...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> All "sources" in portability will use splittable DoFns for
>>>>>>>>> execution.
>>>>>>>>>
>>>>>>>>> Specifically, runners will need to be able to checkpoint unbounded
>>>>>>>>> sources to get a minimum viable pipeline working.
>>>>>>>>> For bounded pipelines, a DoFn can read the contents of a bounded
>>>>>>>>> source.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 10:52 AM Thomas Weise <t...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I'm looking at the portable pipeline translation for streaming. I
>>>>>>>>>> understand that for batch pipelines, it is sufficient to translate 
>>>>>>>>>> Impulse.
>>>>>>>>>>
>>>>>>>>>> What is the intended path to support unbounded sources?
>>>>>>>>>>
>>>>>>>>>> The goal here is to get a minimum translation working that will
>>>>>>>>>> allow streaming wordcount execution.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Thomas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>

Re: Unbounded source translation for portable pipelines

Reply via email to