Re: Support non-keyed stateful ParDo

Reuven Lax Wed, 25 Apr 2018 19:07:20 -0700

In that case I agree with Ken. It would be trivial to write a wrapper that
did this.
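
A minimal sketch of the single-key idea discussed in the quoted thread, written in plain Python rather than against the Beam API. The names `KeyedStateStore`, `with_single_key`, `lookup_fn`, and `SINGLE_KEY` are hypothetical and chosen only for illustration; the store is a toy stand-in for runner-managed state scoped per (key, window), not how any runner implements it.

```python
from collections import defaultdict

SINGLE_KEY = "all"  # hypothetical constant key attached to every element


class KeyedStateStore:
    """Toy stand-in for runner-managed state, partitioned by (key, window)."""

    def __init__(self):
        self._cells = defaultdict(list)

    def bag(self, key, window):
        # Each (key, window) pair gets its own independent bag of values.
        return self._cells[(key, window)]


def with_single_key(elements):
    """Pair every (window, value) element with the same constant key."""
    for window, value in elements:
        yield SINGLE_KEY, window, value


def lookup_fn(store, keyed_elements):
    """Store user records in state; join each event against stored records."""
    for key, window, (kind, user_id, payload) in keyed_elements:
        bag = store.bag(key, window)
        if kind == "user":
            bag.append((user_id, payload))
        else:  # kind == "event"
            matches = [data for uid, data in bag if uid == user_id]
            yield user_id, payload, matches


# Usage: elements are (window, (kind, user_id, payload)) tuples.
elements = [
    ("w1", ("user", "u1", "Alice")),
    ("w1", ("user", "u2", "Bob")),
    ("w1", ("event", "u2", "click")),
    ("w2", ("event", "u1", "view")),  # different window, so no stored match
]
results = list(lookup_fn(KeyedStateStore(), with_single_key(elements)))
```

Because every element carries the same key, all elements in a window share one state cell. The flip side, as noted in the quoted discussion, is that such a step would be expected to run on a single worker.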


On Wed, Apr 25, 2018 at 7:01 PM Xinyu Liu <[email protected]> wrote:

> @Reuven: if the state is non-keyed (or assigned to a single key), then I
> would expect it to be executed on a single worker, otherwise there can be
> state corruption as you mentioned. Our use case is to store elements in
> the state regardless of their keys, and then do computations on top of them.
> An example is data lookup: we can store user data elements in the
> state, and look up all the relevant user data needed for an incoming
> event. This seems to be quite a general use case to me.
>
> @Kenneth: it would be great to support it as a convenience composite!
>
> Thanks,
> Xinyu
>
> On Wed, Apr 25, 2018 at 6:31 PM, Kenneth Knowles <[email protected]> wrote:
>
>> #2 could be accomplished with a convenience composite, yes?
>>
>> On Wed, Apr 25, 2018, 18:28 Xinyu Liu <[email protected]> wrote:
>>
>>> @Robert: for your questions:
>>>
>>> 1) Side input won't work for us since it returns the whole collection.
>>> We use rocksDb and usually the state is too big to fit in memory.
>>>
>>> 2) One way to achieve our use cases is to assign a single key to all the
>>> elements so they will be associated with the same keyed state. The state
>>> will still belong to the element's window. Kenneth mentioned this solution
>>> too. It does meet our use case, but it's not very convenient for our users.
>>>
>>> 3) Sorry if I wasn't clear about the use case. For our usage, it's
>>> pretty common to store the elements in the state, and look them up later
>>> and do some computation. The elements will be in the same window, but
>>> don't need to have the same key.
>>>
>>> Thanks,
>>> Xinyu
>>>
>>> On Wed, Apr 25, 2018 at 6:02 PM, Robert Bradshaw <[email protected]> wrote:
>>>
>>>> On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <[email protected]> wrote:
>>>>
>>>> > Hi,
>>>>
>>>> > I am working on adding the stateful ParDo to the upcoming BEAM Samza
>>>> > runner, and realized that the state for each ParDo processElement() is
>>>> > associated not only with the window of the element, but also with the
>>>> > key of the element. I chatted with Kenneth over email about this design
>>>> > decision, which has the following benefits for keyed state:
>>>>
>>>> > 1) No synchronization
>>>> > 2) Simple programming model
>>>> > 3) No communication between workers
>>>>
>>>> > The current design doesn't support accessing the state across
>>>> > different keys, which seems to be a more general use case. This use case
>>>> > is also very common inside LinkedIn, where users have access to the
>>>> > entire state of an operator/task and perform lookups and computations on
>>>> > top of it. It's quite hard to make every user here aware that the state
>>>> > is also tightly associated with the key of the element.
>>>>
>>>> Would side inputs be applicable here? (They're read-only, but other than
>>>> that seem to fit the need.)
>>>>
>>>> > From the stateful ParDo API the state looks pretty general too. I am
>>>> > wondering whether it is possible to extend the current API to support
>>>> > both keyed and non-keyed state. Internally, BEAM could even assign a
>>>> > dummy key to associate the state with all the elements. It would be very
>>>> > beneficial to existing Samza users and help them adopt BEAM.
>>>>
>>>> Could you clarify how you would use this dummy key? You could manually
>>>> add a random key, but in that case it's unlikely that any state stored
>>>> would get observed again. Across what scope were you thinking state would
>>>> be stored? The lifetime of the bundle? The worker? The job? How are
>>>> conflicting writes resolved?
>>>>
>>>> Perhaps rather than describing the mechanism (state) that you're trying
>>>> to use, it'd be helpful to describe the kinds of computations you're
>>>> trying to perform, to figure out how the model should be adapted/extended
>>>> if it doesn't meet those needs.
>>>>
>>>
>>>
>
