Re: Should you always have a separate PTransform class for a new transform?

Eugene Kirpichov Tue, 07 Feb 2017 20:44:30 -0800

I like the idea of "if you're really just a Combine, then expose (only) the
CombineFn".


However, in case of Count, another argument is that Count.perElement()
already returns a PerElement transform, and it'd be awkward if globally()
and perKey() were only exposed as CombineFn's. In a sense, if you are or
might be developing a family of transforms, creating transform wrapper
classes is a common denominator.

So we have 3 options for Count.globally() and .perKey():
1. Return Combine.Globally, Combine.PerKey transforms (status quo)
2. Create wrapper classes Count.Globally, Count.PerKey
3. Expose only the respective CombineFn and have users apply it themselves

It sounds like I am in favor of a mix of 2 and 3; Dan (so far) is in favor
of 1; Kenn and Ben and you in favor of 3? I must admit I didn't quite
understand the option of "implements CombiningTransform".

On Tue, Feb 7, 2017 at 8:38 PM Robert Bradshaw <rober...@google.com.invalid>
wrote:

> On Tue, Feb 7, 2017 at 7:49 PM, Kenneth Knowles <k...@google.com.invalid>
> wrote:
>
> > I am +0.7 on this idea. My rationale is contained in this thread, but I
> > thought I would paraphrase it anyhow:
> >
> > "You automatically get all the features of Combine" / "If you add a
> feature
> > to Combine you have to update all wrappers"
> >
> > 0. I have been in some of the mentioned historical discussions and was
> > swayed enough to not bother trying to change things, but I don't actually
> > think the abstraction breakage is worth the benefits. I think the status
> > quo is hanging on because it is the status quo.
> >
> > 1. Any Combine wrapper should also be providing its CombineFn, for use
> with
> > state. So the user already has a good way to get the same result without
> > the wrapper, e.g. Combine.globally(Mean.of()). There are reasons to
> provide
> > a factory method, such as abstracting away tweaking of defaults, but I
> > don't know of any cases of this happening.
> >
>
> +1, we need to also be exposing pure CombineFns.
>
>
> > 2. Most interesting transforms will not just be Combine wrappers - we
> have
> > a lot of them because we try to write pithy examples so I bet this issue
> is
> > a bigger deal to us than to our users. I don't have data to back this up;
> > maybe users do write a bunch of Combine wrappers, but IMO they should
> spend
> > their time elsewhere. I certainly would. So the burden of people updating
> > their Combine wrappers is not compelling to me.
> >
>
> Agree here too, I don't think it's worth encouraging creating an extra
> suite of classes just to avoid the Combine.perKey(...) and
> Combine.globally(...). Also, as options are added (like hot key fanout)
> it's likely plumbing these options through will be inconsistently done (as
> well as a lot of effort).
>
>
> > 2a. Consider also Count.perElement(). It really is a glorified combine,
> but
> > none of the supposed benefits of automatic propagation of tuning knobs
> can
> > by accrue to it, because it is not _just_ a Combine.
> >
> > 3. Those transforms that really are just Combine wrappers will not
> require
> > update when there is a new feature added to Combine. A new performance
> > tuning knob, perhaps, but as Eugene points out, new features like side
> > inputs or access to pipeline options aren't automatically applicable even
> > if your transform is a glorified CombineFn.
> >
> > 3a. And since the argument is only about universally applicable builder
> > methods on Combine, how many more do you imagine we will be adding?
> >
> > 3b. You can use type hackery to ensure you don't forget a knob, along the
> > lines of `CountPerElement implements HasACombineInsideIt<
> > CountPerElement>`.
> > This would actually be a step towards enabling e.g. hot key counting in
> > Count.PerElement
> >
>
> Yes, more below.
>
>
> > 4. If a transform that is just a wrapper needs to evolve into more, then
> > you have to make a backwards incompatible change because you broke your
> > abstractions, and you incur all the manual programming cost anyhow.
> >
>
> I was thinking that the Java-esque way of doing things is to provide an
> interface, CombiningTransform<PIC, PCO>, which the CombinePerKey (and
> CombineGlobally) implement. CombiningTransform would have methods like
> withFanout that return a CombiningTransform of the same type. PTransforms
> like PerElement and RemoveDuplicates could implement this as well, despite
> not being subclasses of the Combine ones, and delegate to their inner
> Combine(s) as makes sense. Factory methods like Sum.integersGlobally()
> would be declared to return a CombiningTransform of the right type, and the
> default implementation would be to return a CombineGlobally but there is no
> lock in (other than to support the CombiningTransform methods).
>
> If I were to write a style guide, I might summarize as:
> >
> > (a) If you are writing an interesting CombineFn, give it some public
> > factory methods.
> >
> > (b) If you are writing a transform that _by definition, forever and
> always_
> > is a wrapper on Combine, stop after (a)
> >
> > (c) If you are writing a transform that has conceivably multiple
> > implementation choices, you need an abstraction boundary to protect
> against
> > the current decision.
> >
> > (d) Only return a class more specific than PTransform if your transform
> has
> > more-specific methods to call.
> >
> >
> > All that said, I'm not sure how much influence this particular collection
> > of guidelines will have, hence I don't feel a full +1. (except (a) which
> is
> > very important).
> >
>
> Your summary sounds more like a -1 to the proposal.
>
>
> > On Tue, Feb 7, 2017 at 5:59 PM, Eugene Kirpichov <
> > kirpic...@google.com.invalid> wrote:
> >
> > > There's 2 points here
> > >
> > > 1. If Count.globally() is implemented via Combine.globally(), then
> should
> > > Count.globally() return a Combine.Globally, or should it wrap it into a
> > new
> > > class Count.Globally? (that's what I'm wondering in this thread)
> > >
> > > I think the "least visibility" argument here would lead us to saying we
> > > should wrap into a new class, because returning a Combine.Globally
> leaks
> > > the implementation detail that Count is always implemented via Combine.
> > >
> > > 2. If Thumbs.twiddle() transforms Foo to Bar, should it return a
> > > PTransform<Foo, Bar>, or should it return Thumbs.Twiddle (which extends
> > > PTransform<Foo, Bar>)? This is where your "least visibility" argument
> > from
> > > that thread applies more - we're exposing a more specific type - but I
> > > think by exposing this type we are not imposing any additional
> > obligations
> > > on ourselves and not leaking any details: saying "twiddle() returns a
> > > Twiddle" is as good as saying nothing.
> > >
> > > In case of Combine, the ability to additionally configure it for the
> user
> > > is a strong argument. But I think we've been through a similar
> situation
> > > and made a different decision: bounded reads from unbounded sources. We
> > > started off with exposing UnboundedSource because a user might want to
> > > specify .withMaxNumRecords() and .withMaxReadTime() on it, but changed
> > our
> > > minds to saying "if you're developing an unbounded connector, and if
> you
> > > want to make a promise that this is part of your connector's public
> API,
> > > then you have to implement these functions yourself". The reasoning was
> > > that it might not stay an UnboundedSource forever - and indeed, with
> SDF,
> > > it won't, and the implementation of these functions will become custom.
> > >
> > > Also, the choice to return a Count.Globally (and propagate the needed
> > > functions from Combine.Globally) makes the assumption that as a
> developer
> > > of the Count.Globally transform, you fully control its API surface.
> E.g.
> > by
> > > supporting a .withFanout() function, you're saying "When counting
> things
> > > globally, you may need to configure the fanout". Corollary: you also
> > > control what functions its API *doesn't* have - e.g. it probably
> > shouldn't
> > > have .withSideInputs() because side inputs make no sense here.
> > >
> > > The choice to return a Combine.Globally lets you automatically get all
> > new
> > > APIs that Combine.Globally may ever get - which may or may not be a
> good
> > > thing. It is good because these functions may be useful; it is bad
> > because
> > > as a developer of Count.Globally it is now your responsibility to
> > > anticipate all possible interactions of parameters users might set on
> it
> > > with the particular CombineFn you supplied; and it is your
> responsibility
> > > to keep that up-to-date as Combine.Globally evolves (e.g. it might some
> > day
> > > add a parameter that interacts badly with CountFn and then you probably
> > > should warn users not to set it). Granted, in this particular case it's
> > > easy, but we're talking about the general principle.
> > >
> > > On Tue, Feb 7, 2017 at 5:45 PM Dan Halperin
> <dhalp...@google.com.invalid
> > >
> > > wrote:
> > >
> > > > A little bit more inline:
> > > >
> > > > On Tue, Feb 7, 2017 at 5:15 PM, Eugene Kirpichov <
> > > > kirpic...@google.com.invalid> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I was auditing Beam for violations of PTransform style guide
> > > > > https://beam.apache.org/contribute/ptransform-style-guide/ and
> came
> > > > across
> > > > > another style point that deserves discussion.
> > > > >
> > > > > Look at Count transform:
> > > > >
> > > > >   public static <T> Combine.Globally<T, Long> globally() {
> > > > >     return Combine.globally(new CountFn<T>());
> > > > >   }
> > > > >
> > > > >   public static <K, V> Combine.PerKey<K, V, Long> perKey() {
> > > > >     return Combine.<K, V, Long>perKey(new CountFn<V>());
> > > > >   }
> > > > >
> > > > >   public static <T> PerElement<T> perElement() {
> > > > >     return new PerElement<>();
> > > > >   }
> > > > >
> > > > > I asked myself: should globally() and perKey() also define wrapper
> > > > classes
> > > > > - e.g. should it be "public static <T> Globally<T, Long>
> globally()"
> > > > where
> > > > > "Globally" is a new inner class of Count?
> > > > >
> > > > > I argue that the answer is yes, but it's not clear-cut.
> > > > > Cons:
> > > > > - If we return a Combine.Globally, the user can use the features
> > > provided
> > > > > by Combine.Globally - e.g. .withDefaults(), .withFanout(),
> > > > > .asSingletonView().
> > > > >
> > > >
> > > > +1 to this point, which was a conscious decision in the pre-Beam days
> > > > (which of course means it IS worth revisiting ;).
> > > > Trying to replay the reasoning:
> > > >
> > > > * If wrapping, the author of a new Count.Globally can now only make
> the
> > > > extra functionality in Combine available by similarly exposing all
> such
> > > > functions.
> > > >
> > > > * Conversely, the current implementation makes new functionality in
> > > Combine
> > > > available "for free" to users of Count.globally(). Whereas new
> > > > functionality on Combine would otherwise mandate that *all wrappers*
> > > change
> > > > to actually be exposed.
> > > >
> > > > * There's almost no data here, but: we have added new functionality
> to
> > > > Combine (withSideInputs) and have not added new functionality to
> Count.
> > > >
> > > >
> > > > > Pros:
> > > > > - Style consistency with other transforms. Almost all transforms
> have
> > > > their
> > > > > own class, and their factory functions return that class.
> > > > >
> > > >
> > > > IMO this should only happen if the user needs that class. For all
> > > examples
> > > > I'm aware of,
> > > > the returned class has stuff you need to do, like configuring coders
> or
> > > > side inputs or other parameters.
> > > > IMO if the user need not configure, return the least constraining
> thing
> > > you
> > > > can.
> > > >
> > > >
> > > > > - Implementation can evolve. However, in case of Count, that is
> > > unlikely.
> > > > >
> > > >
> > > > +1 to "unlikely"
> > > >
> > > >
> > > > > - ...Or is it? If the transform has a concrete class, then the
> runner
> > > can
> > > > > intercept that class and e.g. provide an (even) more efficient
> > > > > implementation of Count.Globally. This gets much more awkward if
> the
> > > > runner
> > > > > has to intercept every Combine and check whether it's combining
> > using a
> > > > > CountFn.
> > > > >
> > > >
> > > > Post-runner API, this argument goes away. Instead, the runner will
> need
> > > to
> > > > inspect generically the attributes of the transform to do the
> > > replacement.
> > > >
> > > > Summarizing: I currently am a -0.8 on this proposal.
> > > >
> > > >
> > > > >
> > > > > So, I propose to add this as a style guide rule, and solve the
> > problem
> > > in
> > > > > "Cons" by saying "Yeah, if you want the extra features of the
> > transform
> > > > > you're expanding into, you have to propagate them through the API
> of
> > > your
> > > > > transform and delegate to the underlying transform manually".
> > > > >
> > > > > Thoughts?
> > > > >
> > > >
> > >
> >
>

Re: Should you always have a separate PTransform class for a new transform?

Reply via email to