Re: Support non-keyed stateful ParDo

Reuven Lax Wed, 25 Apr 2018 18:24:05 -0700

A few questions:

Do you want the state to still be associated with the window? In Beam the
window is per key only. While some window types (e.g. fixed windows) assign
the same windows to all keys giving the illusion that there a single window
across keys, in actuality each key has its own separate set of windows.

Given that execution of a ParDo is spread across many workers and every
worker can read and write state, how would you prevent state from being
corrupted? Beam runners today ensure that a single key is processed on a
single worker, so at any point in time there is only one writer.

How do you imagine this to be implemented? The current Beam runners (Spark
Flink Dataflow) all support per-key state natively. Flink has extremely
limited support for operator state, but it's only useful in certain cases.
I'm not sure any of current runners can easily model this.

What exactly are the use cases people are trying to code up? Simply porting
another system's programming model onto Beam probably won't work very well.
It would be better to understand what problems people are trying to solve,
and to understand how to solve those inside the Beam model.

Reuven

On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <[email protected]> wrote:

> Hi,
>
> I am working on adding the stateful ParDo to the upcoming BEAM Samza
> runner, and realized that the state for each ParDo processElement() is not
> only associated with the window of the element, but also the key of the
> element. Chatted with Kenneth over email about this design decision, which
> has the following benefits for keyed state:
>
> 1) No synchronization
> 2) Simple programming model
> 3) No communication between works
>
> The current design doesn't support accessing the state across different
> keys, which seems to be a more general use case. This use case is also very
> common inside LinkedIn where the users have access to the entire state of
> an operator/task, and performing lookups and computations on top of it.
> It's quite hard to make every user here aware that the state is also
> tightly associated with key of the element.. From the stateful ParDo API
> the state looks pretty general too. I am wondering is it possible to extend
> the current API to support both keyed and non-keyed state? Even internally
> BEAM assigns a dummy key for to associate the state with all the elements.
> It will be very beneficial to existing Samza users and help them adopt BEAM.
>
> Thanks,
> Xinyu
>

Re: Support non-keyed stateful ParDo

Reply via email to