It is a good point that stateful computations are generally conceptualized
as getting events in order. Initially, I intended to rely on the user to
properly deal with out-of-order data in the design of their state machine.
This delivers some value early, and orthogonal features may be able to
provide ordering later.

There was a thread on users@ about the feature, where I mentioned that the
state machine still needs to be commutative, so it is very similar to a
streaming Combine. A user funnel, for example, will need to be essentially
monotonically progressing through a lattice, not proceeding step-by-step.
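
To make the lattice idea concrete, here is a minimal sketch in plain Python
(not the Beam API). The stage names and the helpers `update`, `run`, and
`furthest_stage` are hypothetical, chosen only for illustration. The state is
the set of stages seen so far and the update is set union, so the result is
the same for every arrival order:

```python
from itertools import permutations

# Hypothetical funnel stages, listed in their logical order.
STAGES = ("visit", "signup", "purchase")


def update(state: frozenset, event: str) -> frozenset:
    """Join one event into the state. Set union is commutative and
    idempotent, so arrival order and duplicates do not matter."""
    return state | {event}


def run(events) -> frozenset:
    """Fold a sequence of events into the final state."""
    state = frozenset()
    for event in events:
        state = update(state, event)
    return state


def furthest_stage(state: frozenset):
    """Return the deepest stage for which all earlier stages were seen,
    or None if even the first stage is missing."""
    reached = None
    for stage in STAGES:
        if stage not in state:
            break
        reached = stage
    return reached


if __name__ == "__main__":
    # Every ordering of the same events yields the same state.
    results = {run(order) for order in permutations(STAGES)}
    print(len(results))
    print(furthest_stage(run(["purchase", "visit"])))
```

Note that a "purchase" arriving before "signup" is simply remembered rather
than rejected; the funnel position is derived from the accumulated state, not
from the order of transitions.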

The additional power of the State & Timers API comes from:

 - addition of timers (IMO this is the biggest)
 - no associativity requirement, and ValueState
 - ability to update state cells independently (similar to a composed
Combine, but more efficient)
 - maybe an intuitive way of expressing some computations

To support event time ordered state machines, I think we need:

 - Ordering by timestamp for the iterable coming out of a GBK (there have
been a couple good threads on this, so I won't go into it)
 - The user must choose a trigger that outputs exactly once, preferably at
window expiration time (default trigger with zero allowed lateness does
this)

Then, post-GBK, the user can run their state machine over the elements
in the iterable ("exploding" it will likely work, too, but since the model
does not require order-preserving transport it is technically not to spec).
Today if the user knows that windows are small, they can sort the iterable,
or they may have other approaches specific to their use case (for just a
"start" event, maybe a scan for that is OK since it will be O(lateness)
into the iterable).
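
As a rough illustration of the "sort the iterable" approach, here is a
minimal sketch in plain Python (not the Beam API). It assumes the grouped
values for one key and window fit in memory and are available as
(timestamp, event type) pairs; the function name and the tuple layout are
hypothetical, and the event types are the ones from the example quoted below:

```python
from typing import Iterable, List, Tuple

# Assumed element shape: (event-time timestamp, event type).
Event = Tuple[int, str]


def run_in_event_time_order(events: Iterable[Event]) -> List[str]:
    """Sort one key's grouped events by timestamp, then run a strictly
    ordered state machine over them. Returns the state after each event."""
    ordered = sorted(events, key=lambda e: e[0])
    state = "idle"
    history = []
    for _, kind in ordered:
        if state == "idle" and kind == "start":
            state = "running"
        elif state == "running" and kind == "in-between":
            state = "running"
        elif state == "running" and kind == "end":
            state = "done"
        else:
            # Still invalid after sorting: the data itself is inconsistent.
            raise ValueError(f"unexpected {kind!r} in state {state!r}")
        history.append(state)
    return history


if __name__ == "__main__":
    # Arrival order is scrambled; timestamps restore the logical order.
    arrived = [(3, "end"), (1, "start"), (2, "in-between")]
    print(run_in_event_time_order(arrived))
```

This only makes sense if the iterable is complete when it is processed, which
is why the single firing at window expiration matters.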

Just a quick reply; I'm very open to ideas.

Kenn

On Fri, Jul 29, 2016 at 10:59 AM, Aljoscha Krettek <[email protected]>
wrote:

> +1 Very nice proposal and the API already looks very good. I guess the only
> thing people still like to discuss on this is naming of things. :-)
>
> I just have one general remark about giving users access to state and
> timers. The Beam model takes great care to mostly shield users from the
> reality of out-of-order events. The windowing mostly deals with this
> internally and the watermarks provide some level of completeness
> guarantees. If users directly modify their state based on each arriving
> element they might run into problems if they don't take into account that
> elements can (will) arrive out-of-order. For example, let's say they have
> three types of event: "start", "in-between", and "end". In the state
> machine they probably assume that the "start" event will arrive first and
> that the "end" event will arrive last. Due to slowdowns anywhere in the
> system they might not arrive in that order, however, and the state machine
> will trip up. This is an artificial example but I imagine there could be
> real-world cases where this plays a role.
>
> Do we have any ideas on mitigating those kinds of problems or will we rely
> on users properly understanding that this could happen in their pipeline?
>
> Cheers,
> Aljoscha
>
> On Wed, 27 Jul 2016 at 05:20 Kenneth Knowles <[email protected]>
> wrote:
>
> > Hi everyone,
> >
> >
> > I would like to offer a proposal for a much-requested feature in Beam:
> > Stateful processing in a DoFn. Please check out and comment on the
> > proposal at this URL:
> >
> >
> >   https://s.apache.org/beam-state
> >
> >
> > This proposal includes user-facing APIs for persistent state and timers.
> > Together, these provide rich capabilities that have been called "per-key
> > workflows", the subject of [BEAM-23].
> >
> >
> > Note that this proposal has an important prerequisite: a new design for
> > DoFn. The new DoFn is strongly motivated by this design for state and
> > timers, but we should discuss it separately. I will start a separate
> > thread for that.
> >
> >
> > On this email thread, I'd like to try to focus the discussion on state &
> > timers. And of course, please do comment on the particulars in the
> > document.
> >
> >
> > Kenn
> >
> >
> > [BEAM-23] https://issues.apache.org/jira/browse/BEAM-23
> >
>
