I think the answer to your questions might be StateNamespace. The lowest level of state is always key-scoped, while the StateNamespace indicates whether it is global to the key, further scoped to a particular window, or even scoped to a particular trigger. When the DoFn needs a side input, the key might actually be gone from the user's point of view. It is up to the StepContext to provide an appropriately scoped StateInternals, usually keyed by some consistent sharding key such as the key from the upstream GBK.
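
A rough sketch of that scoping, written as a toy Python model rather than the SDK's actual API. The names Namespace, KeyedState, ToyStepContext, the tag strings, and the example key are all made up for illustration; the only thing taken from the description above is that state is key-scoped first and then addressed by a namespace that is global, per-window, or per-window-and-trigger.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class Namespace:
    """Hypothetical stand-in for a state namespace.

    window=None, trigger=None  -> global to the key
    window set, trigger=None   -> scoped to one window
    window set, trigger set    -> scoped to one trigger within a window
    """
    window: Optional[str] = None
    trigger: Optional[int] = None


GLOBAL = Namespace()


class KeyedState:
    """Toy per-key state table; each cell is addressed by (namespace, tag)."""

    def __init__(self) -> None:
        self._cells: Dict[Tuple[Namespace, str], Any] = {}

    def write(self, ns: Namespace, tag: str, value: Any) -> None:
        self._cells[(ns, tag)] = value

    def read(self, ns: Namespace, tag: str, default: Any = None) -> Any:
        return self._cells.get((ns, tag), default)


class ToyStepContext:
    """Toy step context: hands out one KeyedState per sharding key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, KeyedState] = defaultdict(KeyedState)

    def state_for(self, key: str) -> KeyedState:
        return self._by_key[key]


ctx = ToyStepContext()
state = ctx.state_for("user-42")  # e.g. the key from an upstream GBK

# Buffered main-input elements live in a window-scoped namespace...
window_ns = Namespace(window="[12:00, 12:10)")
state.write(window_ns, "buffered-main-input", ["a", "b"])

# ...while some other per-key value lives in the global namespace.
state.write(GLOBAL, "user-counter", 7)

# Trigger-scoped cells are separate again, even within the same window.
trigger_ns = Namespace(window="[12:00, 12:10)", trigger=0)
state.write(trigger_ns, "fired", True)
```

Under these assumptions, all three cells belong to the same key and differ only by namespace, which is how buffering state and other per-key state could share one state object without colliding.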

I don't want to go too much into state accessed in the DoFn as I haven't yet got a chance to prepare and publish the design doc for that, and I want everyone to have access to it for any discussion.

Does this help?

On Tue, May 3, 2016 at 1:58 AM, Aljoscha Krettek <[email protected]> wrote:

> I'm afraid I have yet another question. What's the interplay between the
> state that holds the buffered main-input elements and possible per-key
> state that might be used by the DoFn? I guess I'm not seeing all the parts
> but my problem is that one part (the buffering) requires a different type
> of state scope as the other part (key-scoped state access in the DoFn)
> while they both seem to be using the same StateInternals from the step
> context. How does that work?
>
> Cheers,
> Aljoscha
>
> On Thu, 28 Apr 2016 at 20:05 Kenneth Knowles <[email protected]>
> wrote:
>
> > On Thu, Apr 28, 2016 at 10:19 AM, Aljoscha Krettek <[email protected]>
> > wrote:
> >
> > > No worries :-) and thanks for the detailed answers!
> > >
> > > I still have one question, though: you wrote that "The side input is
> > > considered ready when there has been any data output/added to the
> > > PCollection that it is being read as a side input. So the upstream
> > > trigger controls this." How does this work with side inputs that
> > > consist of multiple elements, i.e. ListPCollectionView and
> > > MapPCollectionView. For them, do we also consider the side input as
> > > ready once the first element arrives? That's why I was wondering about
> > > the triggers being responsible for deciding when a side input is ready.
> > >
> >
> > Yes, just as you describe. The side input window becomes ready once it
> > has any data. So, combining your items 2.5 and 3, you have a situation
> > where main input elements may be combined with only a speculative subset
> > of the side input data. They will not be reprocessed once more up-to-date
> > side input values become known. Beyond this initial period of waiting for
> > the very first firing of the side input window, there are no consistency
> > restrictions/guarantees on main input vs side input windows or
> > triggerings. It may be that for a given runner updating the side input
> > with the new value happens at high latency so all the main input elements
> > are processed and gone before the update goes through. It is a bit of a
> > dangerous area for users. I'm pretty interested in ideas in this space.
> >
> > Kenn
