Re: Support non-keyed stateful ParDo

Kenneth Knowles Wed, 25 Apr 2018 18:29:41 -0700

I think if the concept makes sense we can add it and runners can reject
unsupported pipelines. Starting from the use case is a good way to go.


Kenn

On Wed, Apr 25, 2018, 18:23 Reuven Lax <[email protected]> wrote:

> A few questions:
>
> Do you want the state to still be associated with the window? In Beam the
> window is per key only. While some window types (e.g. fixed windows) assign
> the same windows to all keys, giving the illusion that there is a single window
> across keys, in actuality each key has its own separate set of windows.
>
> Given that execution of a ParDo is spread across many workers and every
> worker can read and write state, how would you prevent state from being
> corrupted? Beam runners today ensure that a single key is processed on a
> single worker, so at any point in time there is only one writer.
>
> How do you imagine this to be implemented? The current Beam runners (Spark,
> Flink, Dataflow) all support per-key state natively. Flink has extremely
> limited support for operator state, but it's only useful in certain cases.
> I'm not sure any of the current runners can easily model this.
>
> What exactly are the use cases people are trying to code up? Simply
> porting another system's programming model onto Beam probably won't work
> very well. It would be better to understand what problems people are trying
> to solve, and to understand how to solve those inside the Beam model.
>
> Reuven
>
> On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <[email protected]> wrote:
>
>> Hi,
>>
>> I am working on adding the stateful ParDo to the upcoming BEAM Samza
>> runner, and realized that the state for each ParDo processElement() is not
>> only associated with the window of the element, but also the key of the
>> element. Chatted with Kenneth over email about this design decision, which
>> has the following benefits for keyed state:
>>
>> 1) No synchronization
>> 2) Simple programming model
>> 3) No communication between workers
>>
>> The current design doesn't support accessing the state across different
>> keys, which seems to be a more general use case. This use case is also very
>> common inside LinkedIn, where users have access to the entire state of
>> an operator/task and perform lookups and computations on top of it.
>> It's quite hard to make every user here aware that the state is also
>> tightly associated with the key of the element. From the stateful ParDo API
>> the state looks pretty general too. I am wondering whether it is possible to
>> extend the current API to support both keyed and non-keyed state, even if
>> internally BEAM assigns a dummy key to associate the state with all the
>> elements. It would be very beneficial to existing Samza users and help them
>> adopt BEAM.
>>
>> Thanks,
>> Xinyu
>>
>
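
To make the two ideas in this thread concrete (state scoped per key and window, and the "dummy key" workaround Xinyu mentions), here is a minimal sketch in plain Python. It is a toy model only and does not use the Beam API: StateStore, fixed_window and run_stateful_sum are hypothetical names invented for illustration, and the 60-second window size is an assumption.

```python
from collections import defaultdict


class StateStore:
    """Toy state store (hypothetical, not Beam): one cell per (key, window)."""

    def __init__(self):
        self._cells = defaultdict(int)

    def add(self, key, window, value):
        # Each (key, window) pair owns an independent running sum.
        self._cells[(key, window)] += value
        return self._cells[(key, window)]


def fixed_window(timestamp, size=60):
    """Assign a timestamp to a fixed window [start, start + size)."""
    start = timestamp - timestamp % size
    return (start, start + size)


def run_stateful_sum(elements, key_fn):
    """Emit the running sum seen by each element under the given keying."""
    store = StateStore()
    out = []
    for key, value, ts in elements:
        state_key = key_fn(key)  # decides which state cell the element sees
        out.append(store.add(state_key, fixed_window(ts), value))
    return out


# (key, value, timestamp) triples; the last one falls into the next window.
elements = [("a", 1, 5), ("b", 2, 10), ("a", 3, 20), ("b", 4, 70)]

# Keyed state: "a" and "b" never see each other's sums.
keyed = run_stateful_sum(elements, key_fn=lambda k: k)

# Dummy key: every element maps to the same key, so state is shared per window.
shared = run_stateful_sum(elements, key_fn=lambda k: "dummy")

print(keyed)   # [1, 2, 4, 4]
print(shared)  # [1, 3, 6, 4]
```

Even with the dummy key the state is still scoped to the window, which is Reuven's first question above. And since, as he notes, a single key is processed on a single worker, routing everything through one key would presumably keep the single-writer guarantee only by giving up parallelism for that step.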
