Re: [DISCUSS] Adding Some Sort of SideInputRunner

Aljoscha Krettek Tue, 03 May 2016 12:41:49 -0700

Maybe, I'll try and figure something out. :-)

My problem was that the doc for StateInternals explicitly states that
access to state is always implicitly scoped to the key being processed. In
my understanding this was always the key of an element but it seems that it
can also be a more abstract key, such as the sharding key. The fact that
this could be the case was hidden away in code outside the SDK, it seems.
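
To make the scoping concrete, here is a toy model of how I now picture it. It is only a sketch and not the SDK API: ToyStepContext, ToyStateInternals, the namespace tuples and the tag names are all made up for illustration. The point is that one key-scoped state container (selected by a sharding key) can hold both the buffered main-input elements and the DoFn's own state, kept apart by namespace.

```python
from collections import defaultdict


class ToyStateInternals:
    """Hypothetical stand-in for StateInternals: every cell belongs to one key."""

    def __init__(self, key):
        self.key = key
        # (namespace, tag) -> list of values; the key itself is implicit.
        self._cells = defaultdict(list)

    def state(self, namespace, tag):
        return self._cells[(namespace, tag)]


class ToyStepContext:
    """Hypothetical step context: one ToyStateInternals per sharding key."""

    def __init__(self):
        self._by_key = {}

    def state_internals(self, sharding_key):
        if sharding_key not in self._by_key:
            self._by_key[sharding_key] = ToyStateInternals(sharding_key)
        return self._by_key[sharding_key]


# Made-up namespace encodings: global to the key, or scoped to one window.
GLOBAL = ("global",)


def window_ns(window):
    return ("window", window)


ctx = ToyStepContext()
internals = ctx.state_internals("shard-7")  # e.g. the key from an upstream GBK

# Buffered main input lives in a window namespace ...
internals.state(window_ns("[0,10)"), "buffered-main-input").append("elem-a")
# ... while state used by the DoFn lives in another namespace of the same key.
internals.state(GLOBAL, "user-counter").append(1)

# A different sharding key sees none of this.
other = ctx.state_internals("shard-8")
```

Both kinds of state go through the same container, but they never collide because the namespace is part of the address.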


Thanks for your help!

On Tue, 3 May 2016 at 19:40 Kenneth Knowles <[email protected]> wrote:

> I think the answer to your questions might be StateNamespace.
>
> The lowest level of state is always key-scoped, while the StateNamespace
> indicates whether it is global to the key, further scoped to a particular
> window, or even scoped to a particular trigger. When the DoFn needs a side
> input, the key might actually be gone from the user's point of view. It is
> up to the StepContext to provide an appropriately-scoped StateInternals,
> usually by some consistent sharding key such as the key from the upstream
> GBK.
>
> I don't want to go too much into state accessed in the DoFn as I haven't
> yet got a chance to prepare and publish the design doc for that, and I want
> everyone to have access to it for any discussion.
>
> Does this help?
>
> On Tue, May 3, 2016 at 1:58 AM, Aljoscha Krettek <[email protected]>
> wrote:
>
> > I'm afraid I have yet another question. What's the interplay between the
> > state that holds the buffered main-input elements and possible per-key
> > state that might be used by the DoFn? I guess I'm not seeing all the parts,
> > but my problem is that one part (the buffering) requires a different type
> > of state scope than the other part (key-scoped state access in the DoFn)
> > while they both seem to be using the same StateInternals from the step
> > context. How does that work?
> >
> > Cheers,
> > Aljoscha
> >
> > On Thu, 28 Apr 2016 at 20:05 Kenneth Knowles <[email protected]>
> > wrote:
> >
> > > On Thu, Apr 28, 2016 at 10:19 AM, Aljoscha Krettek <[email protected]>
> > > wrote:
> > >
> > > > No worries :-) and thanks for the detailed answers!
> > > >
> > > > I still have one question, though: you wrote that "The side input is
> > > > considered ready when there has been any data output/added to the
> > > > PCollection that it is being read as a side input. So the upstream
> > > > trigger controls this." How does this work with side inputs that
> > > > consist of multiple elements, i.e. ListPCollectionView and
> > > > MapPCollectionView? For them, do we also consider the side input as
> > > > ready once the first element arrives? That's why I was wondering
> > > > about the triggers being responsible for deciding when a side input
> > > > is ready.
> > > >
> > >
> > > Yes, just as you describe. The side input window becomes ready once it
> > > has any data. So, combining your items 2.5 and 3, you have a situation
> > > where main input elements may be combined with only a speculative
> > > subset of the side input data. They will not be reprocessed once more
> > > up-to-date side input values become known. Beyond this initial period
> > > of waiting for the very first firing of the side input window, there
> > > are no consistency restrictions/guarantees on main input vs side input
> > > windows or triggerings. It may be that for a given runner updating the
> > > side input with the new value happens at high latency so all the main
> > > input elements are processed and gone before the update goes through.
> > > It is a bit of a dangerous area for users. I'm pretty interested in
> > > ideas in this space.
> > >
> > > Kenn
> > >
> >
>
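
For reference, the readiness rule from the oldest message in this thread (buffer main input until the side input window has any data, then never reprocess) can be sketched as a toy model. This is not runner code: ToySideInputGate and its methods are made up, each on_side_input call stands for one upstream trigger firing, and accumulating the fired values into a list is an assumption of the sketch.

```python
class ToySideInputGate:
    """Hypothetical gate: holds main input until the side input has any data."""

    def __init__(self):
        self.side_values = {}  # window -> side input values fired so far
        self.buffered = {}     # window -> main-input elements still waiting
        self.outputs = []      # (element, side input snapshot it was given)

    def on_side_input(self, window, value):
        self.side_values.setdefault(window, []).append(value)
        # The window is now ready: flush whatever was waiting for it.
        for element in self.buffered.pop(window, []):
            self._process(window, element)

    def on_main_input(self, window, element):
        if window in self.side_values:
            self._process(window, element)
        else:
            self.buffered.setdefault(window, []).append(element)

    def _process(self, window, element):
        # Snapshot the side input as it is right now; no later reprocessing.
        self.outputs.append((element, tuple(self.side_values[window])))


gate = ToySideInputGate()
gate.on_main_input("w1", "a")  # side input not ready yet: buffered
gate.on_side_input("w1", 1)    # first firing: "a" is processed with (1,)
gate.on_side_input("w1", 2)    # later firing: "a" is NOT reprocessed
gate.on_main_input("w1", "b")  # sees the more up-to-date (1, 2)
```

Here "a" is combined with only the speculative subset (1,) even though the side input later grows to (1, 2), which is the hazard for users described above.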
