On Tue, May 28, 2019 at 2:59 AM Robert Bradshaw <rober...@google.com> wrote:

> On Fri, May 24, 2019 at 6:57 PM Kenneth Knowles <k...@apache.org> wrote:
> >
> > On Fri, May 24, 2019 at 9:51 AM Kenneth Knowles <k...@apache.org> wrote:
> >>
> >> On Fri, May 24, 2019 at 8:14 AM Reuven Lax <re...@google.com> wrote:
> >>>
> >>> Some great comments!
> >>>
> >>> Aljoscha: Absolutely, this would have to be implemented by runners to
> be efficient. We can of course provide a default (inefficient)
> implementation, but ideally runners would provide better ones.
> >>>
> >>> Jan: Exactly. I think MapState could be dropped or backed by this.
> >>>
> >>> Robert: Great point about standard coders not satisfying this. That's
> why I suggested that we provide a way to tag the coders that do preserve
> order, and only accept those as key coders. Alternatively, we could present
> a more limited API - e.g. only allowing a hard-coded set of types to be
> used as keys - but that seems counter to the direction Beam usually goes.
> >>
> >>
> >> I think we got it right with GroupByKey: the encoded form of a key is
> authoritative/portable.
> >>
> >> Instead of seeing the in-language type as the "real" value and the
> coder a way to serialize it, the portable encoded bytestring is the "real"
> value and the representation in a particular SDK is the responsibility of
> the SDK.
> >>
> >> This is very important, because many in-language representations,
> especially low-level representations, do not have the desired equality. For
> example, Java arrays. Coder.structuralValue(...) is required to have
> equality that matches the equality of the encoded form. It can be a noop if
> the in-language equality already matches. Or it can be a full encoding if
> there is not a more efficient option. I think we could add another method
> "lexicalValue" or add the requirement that structuralValue also sort
> equivalently to the wire format.
> >>
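As a toy illustration (hypothetical Python, not Beam's actual Coder API) of a structuralValue that defers equality to the encoded form:

```python
class ListCoder:
    """Toy coder for lists of ints. Lists (like Java arrays) don't have
    the equality/hashability needed for grouping keys, so the encoded
    bytes are treated as the authoritative value."""

    def encode(self, value):
        # Length prefix, then fixed-width big-endian ints: deterministic.
        out = len(value).to_bytes(4, "big")
        for v in value:
            out += v.to_bytes(8, "big", signed=True)
        return out

    def structural_value(self, value):
        # Equality (and hashability) of the structural value matches
        # equality of the encoded form, as required.
        return self.encode(value)

coder = ListCoder()
a, b = [1, 2, 3], [1, 2, 3]  # distinct objects, equal encodings
assert coder.structural_value(a) == coder.structural_value(b)
```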
> >> Now, since many of our wire formats do not sort in the natural
> mathematical order, SDKs should help users avoid the pitfall of using
> these, as we do for GBK by checking "determinism" of the coder. Note that I
> am *not* referring to the order they would sort in any particular
> programming language implementation.
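To make the pitfall concrete, here is a small self-contained Python sketch: a fixed-width two's-complement encoding does not sort in natural numeric order (negative values compare greater byte-wise), while biasing by 2**63 (equivalently, flipping the sign bit) does:

```python
def encode_naive(n):
    # Fixed-width two's complement: lexicographic byte order puts
    # negative numbers after positive ones.
    return n.to_bytes(8, "big", signed=True)

def encode_order_preserving(n):
    # Adding a bias of 2**63 makes byte order agree with numeric order.
    return (n + 2**63).to_bytes(8, "big")

nums = [-5, -1, 0, 1, 5]
assert sorted(nums, key=encode_naive) != nums           # wrong order
assert sorted(nums, key=encode_order_preserving) == nums  # natural order
```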
> >
> >
> > Another related feature request. Today we do this:
> >
> > (1) infer any coder for a type
> > (2) crash if it is not suitable for GBK
> >
> > I have proposed in the past that instead we:
> >
> > (1) notice that a PCollection is input to a GBK
> > (2) infer an appropriate coder for a type that will work with GBK
> >
> > This generalizes to the idea of registering multiple coders for
> different purposes, particularly inferring a coder that has good lexical
> sorting.
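A sketch of what a purpose-aware registry might look like (names here are hypothetical, not an existing Beam API):

```python
# Hypothetical registry keyed by (type, purpose): the same type can map
# to different coders depending on what the consuming transform needs.
REGISTRY = {
    (int, "default"): "VarIntCoder",        # compact, not order-preserving
    (int, "gbk_key"): "BigEndianIntCoder",  # deterministic, sorts naturally
}

def infer_coder(typ, purpose="default"):
    # Fall back to the default coder when no purpose-specific one exists.
    return REGISTRY.get((typ, purpose), REGISTRY.get((typ, "default")))

assert infer_coder(int) == "VarIntCoder"
assert infer_coder(int, "gbk_key") == "BigEndianIntCoder"
```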
> >
> > I don't recall that there was any objection, but neither I nor anyone
> else has gotten around to working on this. It does have the problem that it
> could change the coders and impact pipeline update. I suggest that update
> is fragile enough that we should develop pipeline options that allow opt-in
> to improvements without a major version bump.
>
> It also has the difficulty that coders cannot be inferred until one
> knows all their consumers (which breaks a transform being able to
> inspect or act on the coder(s) of its input(s)).
>
> Possibly, if we move to something more generic, like specifying
> element types/schemas on collections and then doing
> whole-pipeline-analysis to infer all the coders, this would be
> solvable (but potentially backwards incompatible, and fraught with
> difficulties if users (aka transform authors) specify their own coders
> (e.g. when inference fails, or as part of inference code)).
>
> Another way to look at this is that there are certain transformations
> ("upgrades") one can do to coders when they need certain properties,
> which change the encoded form but preserve the types. Upgrading a
> coder C to its length-prefixed one is one such operation. Upgrading a
> coder to a deterministic version, or a natural-order-preserving one,
> are two other possible transformations (which should be idempotent,
> may be the identity, and may be an error).
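A minimal sketch of one such idempotent "upgrade" in Python (the classes here are toys, not Beam's actual ones):

```python
class Coder:
    """Stand-in for an arbitrary coder."""

class LengthPrefixedCoder(Coder):
    """Wraps a coder so every element is preceded by its byte length."""
    def __init__(self, inner):
        self.inner = inner

def length_prefixed(coder):
    # Idempotent upgrade: an already length-prefixed coder is returned
    # unchanged, so repeated upgrades never nest wrappers.
    if isinstance(coder, LengthPrefixedCoder):
        return coder
    return LengthPrefixedCoder(coder)

c = Coder()
up = length_prefixed(c)
assert length_prefixed(up) is up  # idempotent
assert up.inner is c              # types/inner coder preserved
```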
>

You pose a good problem, and a reasonable solution. Since on most (all?)
runners this coder conversion will occur in the middle of a fused stage, it
should have near-zero cost. And if it has a standard URN + payload it could
even be optimized away entirely and/or inform optimization decisions.

Ideally, there would be fewer transforms acting on the coders of their
inputs. Today I think it is mostly to push coders around when coder
inference fails. In that case, I've hypothesized that we could do
elementary constraint solving, like any type inferencer does, keeping
unspecified coders as "variables" until after construction. But that's a
sizable change now that we've gone down the current path, and I suppose
there isn't enough clear benefit. If we were starting from scratch, I'd
probably start with that.
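For a flavor of what that constraint solving could look like, a toy sketch (purely illustrative, nothing like a real implementation) that keeps unresolved coders as variables and unifies them across pipeline edges:

```python
# Coder "variables" live in an environment mapping variable -> variable
# or variable -> concrete coder, unification-style.
def resolve(env, name):
    # Follow links until a concrete coder or an unbound variable remains.
    while name in env and env[name] != name:
        name = env[name]
    return name

def unify(env, a, b):
    # An identity-like transform constrains its input and output
    # PCollections to share a coder.
    ra, rb = resolve(env, a), resolve(env, b)
    if ra != rb:
        env[ra] = rb

env = {}
unify(env, "pc1.coder", "pc2.coder")  # pass-through transform
env["pc2.coder"] = "VarIntCoder"      # a GBK later pins a concrete coder
assert resolve(env, "pc1.coder") == "VarIntCoder"
```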

Kenn
