Re: Support non-keyed stateful ParDo

Xinyu Liu Wed, 25 Apr 2018 18:29:51 -0700

@Robert: for your questions:

1) Side inputs won't work for us since they return the whole collection. We
use RocksDB, and the state is usually too big to fit in memory.


2) One way to achieve our use cases is to assign a single key to all the
elements so they will be associated with the same keyed state. The state
will still belong to the element's window, as it does now. Kenneth mentioned
this solution too. It does meet our use case, but it's not very convenient
for our users.

3) Sorry if I wasn't clear about the use case. For our usage, it's pretty
common to store the elements in state, then look them up later and do some
computation. The elements will be in the same window, but don't need to
have the same key.
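
As a rough illustration of point 2, here is a minimal sketch in plain Python.
It is not the Beam API: KeyedWindowState, SINGLE_KEY, and process are
hypothetical names, and the state backend is just an in-memory dict standing
in for per-(key, window) state. It only shows that elements in the same
window can see each other's stored state once they all share one key:

```python
from collections import defaultdict


class KeyedWindowState:
    """Hypothetical stand-in for a state backend: one cell per (key, window)."""

    def __init__(self):
        self._cells = defaultdict(list)

    def cell(self, key, window):
        return self._cells[(key, window)]


SINGLE_KEY = "all"  # hypothetical constant key assigned to every element


def process(elements, state, key_fn):
    """Store each element in state; report how many earlier elements it could see."""
    visible = []
    for value, window in elements:
        cell = state.cell(key_fn(value), window)
        visible.append(len(cell))  # lookup happens before this element is stored
        cell.append(value)
    return visible


# (value, window) pairs: three elements in window 0, one in window 1.
elements = [("a", 0), ("b", 0), ("c", 0), ("d", 1)]

# Keyed by the element itself: no element sees state written by another key.
per_key = process(elements, KeyedWindowState(), key_fn=lambda v: v)

# Keyed by one constant: state is shared within each window.
single_key = process(elements, KeyedWindowState(), key_fn=lambda v: SINGLE_KEY)

print(per_key)
print(single_key)
```

With the constant key, state is still scoped to the window, so the element in
window 1 does not see anything stored by the elements in window 0.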

Thanks,
Xinyu

On Wed, Apr 25, 2018 at 6:02 PM, Robert Bradshaw <rober...@google.com>
wrote:

> On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <xinyuliu...@gmail.com> wrote:
>
> > Hi,
>
> > I am working on adding the stateful ParDo to the upcoming Beam Samza
> runner, and realized that the state for each ParDo processElement() is not
> only associated with the window of the element, but also with the key of
> the element. I chatted with Kenneth over email about this design decision,
> which has the following benefits for keyed state:
>
> > 1) No synchronization
> > 2) Simple programming model
> > 3) No communication between workers
>
> > The current design doesn't support accessing the state across different
> keys, which seems to be a more general use case. This use case is also very
> common inside LinkedIn, where users have access to the entire state of an
> operator/task and perform lookups and computations on top of it. It's
> quite hard to make every user here aware that the state is also tightly
> associated with the key of the element.
>
> Would side inputs be applicable here? (They're read-only, but other than
> that seem to fit the need.)
>
> > From the stateful ParDo API the state looks pretty general too. I am
> wondering whether it is possible to extend the current API to support both
> keyed and non-keyed state. Even internally Beam assigns a dummy key to
> associate the state with all the elements. It will be very beneficial to
> existing Samza users and help them adopt Beam.
>
> Could you clarify how you would use this dummy key? You could manually add
> a random key, but in that case it's unlikely that any state stored would
> get observed again. Across what scope were you thinking state would be
> stored? The lifetime of the bundle? The worker? The job? How are
> conflicting writes resolved?
>
> Perhaps rather than describing the mechanism (state) that you're trying to
> use, it'd be helpful to describe the kinds of computations you're trying to
> perform, to figure out how the model should be adapted/extended if it
> doesn't meet those needs.
>
