In that case I agree with Ken. It would be trivial to write a wrapper that did this.
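
For illustration, the single-key approach from the quoted messages (option 2, the "convenience composite") can be sketched as a toy model in plain Python. This is not Beam's API: `ToyStatefulRunner`, `store_and_lookup`, `with_single_key`, and `SINGLE_KEY` are hypothetical names, and the only behavior modeled is that state is partitioned per (key, window), as described in the thread.

```python
from collections import defaultdict

# Hypothetical constant key; any fixed value would do.
SINGLE_KEY = "all"


class ToyStatefulRunner:
    """Toy model of keyed state: one state cell per (key, window) pair."""

    def __init__(self, process_fn):
        self.process_fn = process_fn
        self.state = defaultdict(list)  # (key, window) -> stored elements

    def process(self, key, window, value):
        # The function only ever sees the cell for this element's key and window.
        return self.process_fn(value, self.state[(key, window)])


def store_and_lookup(value, cell):
    """Store the element, then return everything visible in this state cell."""
    cell.append(value)
    return list(cell)


def with_single_key(elements):
    """Hypothetical wrapper: re-key every element under one constant key."""
    return [(SINGLE_KEY, window, value) for _key, window, value in elements]


def run(elements):
    runner = ToyStatefulRunner(store_and_lookup)
    return [runner.process(k, w, v) for k, w, v in elements]


# (key, window, value) triples; two keys share window 0.
elements = [("alice", 0, "a1"), ("bob", 0, "b1"), ("alice", 1, "a2")]

keyed = run(elements)                    # each key sees only its own state
shared = run(with_single_key(elements))  # one state cell per window
```

With per-key state, the element for `bob` cannot see what was stored for `alice`; after re-keying, both elements in window 0 share one cell, while window 1 stays separate. As noted in the thread, the cost is that all state for a window then lives under a single key, so it would be expected to execute on a single worker.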

On Wed, Apr 25, 2018 at 7:01 PM Xinyu Liu <[email protected]> wrote:

> @Reuven: if the state is non-keyed (or assigned to a single key), then I
> would expect it to be executed in a single worker, otherwise there can be
> state corruptions as you mentioned. Our use case is to store elements in
> the state regardless of the keys, and then do computations on top of them.
> An example can be data lookup: we can store user data elements in the
> state, and do a lookup of all the relevant user data needed for an incoming
> event. This seems to be a quite general use case to me.
>
> @Kenneth: it will be great to support it as a convenience composite!
>
> Thanks,
> Xinyu
>
> On Wed, Apr 25, 2018 at 6:31 PM, Kenneth Knowles <[email protected]> wrote:
>
>> #2 could be accomplished with a convenience composite, yes?
>>
>> On Wed, Apr 25, 2018, 18:28 Xinyu Liu <[email protected]> wrote:
>>
>>> @Robert: for your questions:
>>>
>>> 1) Side input won't work for us since it returns the whole collection.
>>> We use RocksDB and usually the state is too big to fit in memory.
>>>
>>> 2) One way to achieve our use cases is to assign a single key to all the
>>> elements so they will be associated with the same keyed state. The state
>>> will belong to the element window as it is. Kenneth mentioned this solution
>>> too. It does meet our use case, but it's not very convenient for our users.
>>>
>>> 3) Sorry if I wasn't clear about the use case. For our usage, it's
>>> pretty common to store the elements in the states, and look them up later
>>> and do some computation. The elements will be in the same window, but
>>> don't need to be of the same key.
>>>
>>> Thanks,
>>> Xinyu
>>>
>>> On Wed, Apr 25, 2018 at 6:02 PM, Robert Bradshaw <[email protected]>
>>> wrote:
>>>
>>>> On Wed, Apr 25, 2018 at 5:45 PM Xinyu Liu <[email protected]>
>>>> wrote:
>>>>
>>>> > Hi,
>>>>
>>>> > I am working on adding the stateful ParDo to the upcoming BEAM Samza
>>>> > runner, and realized that the state for each ParDo processElement() is
>>>> > not only associated with the window of the element, but also the key of
>>>> > the element. Chatted with Kenneth over email about this design decision,
>>>> > which has the following benefits for keyed state:
>>>>
>>>> > 1) No synchronization
>>>> > 2) Simple programming model
>>>> > 3) No communication between workers
>>>>
>>>> > The current design doesn't support accessing the state across different
>>>> > keys, which seems to be a more general use case. This use case is also
>>>> > very common inside LinkedIn, where the users have access to the entire
>>>> > state of an operator/task, and perform lookups and computations on top
>>>> > of it. It's quite hard to make every user here aware that the state is
>>>> > also tightly associated with the key of the element.
>>>>
>>>> Would side inputs be applicable here? (They're read-only, but other than
>>>> that seem to fit the need.)
>>>>
>>>> > From the stateful ParDo API the state looks pretty general too. I am
>>>> > wondering whether it is possible to extend the current API to support
>>>> > both keyed and non-keyed state? Even internally BEAM assigns a dummy key
>>>> > to associate the state with all the elements. It will be very beneficial
>>>> > to existing Samza users and help them adopt BEAM.
>>>>
>>>> Could you clarify how you would use this dummy key? You could manually
>>>> add a random key, but in that case it's unlikely that any state stored
>>>> would get observed again. Across what scope were you thinking state would
>>>> be stored? The lifetime of the bundle? The worker? The job? How are
>>>> conflicting writes resolved?
>>>>
>>>> Perhaps rather than describing the mechanism (state) that you're trying
>>>> to use, it'd be helpful to describe the kinds of computations you're
>>>> trying to perform, to figure out how the model should be adapted/extended
>>>> if it doesn't meet those needs.
