Re: Should you always have a separate PTransform class for a new transform?

Ben Chambers Tue, 07 Feb 2017 19:56:28 -0800

Going back to the beginning, why not "Combine.perKey(Count.fn())" or some
such?


We have a lot of boilerplate already to support "Count.perKey()" and that
will only become worse when we try to have a Count.PerKey class that has
the functionality of Combine.PerKey, why not just get rid of the
boilerplate?

Pithy examples aside, there seems to be value in less implementation
complexity and making it clear that Count is just a CombineFn.

On Tue, Feb 7, 2017, 7:49 PM Kenneth Knowles <[email protected]> wrote:

> I am +0.7 on this idea. My rationale is contained in this thread, but I
> thought I would paraphrase it anyhow:
>
>
> "You automatically get all the features of Combine" / "If you add a feature
> to Combine you have to update all wrappers"
>
> 0. I have been in some of the mentioned historical discussions and was
> swayed enough to not bother trying to change things, but I don't actually
> think the abstraction breakage is worth the benefits. I think the status
> quo is hanging on because it is the status quo.
>
> 1. Any Combine wrapper should also be providing its CombineFn, for use with
> state. So the user already has a good way to get the same result without
> the wrapper, e.g. Combine.globally(Mean.of()). There are reasons to provide
> a factory method, such as abstracting away tweaking of defaults, but I
> don't know of any cases of this happening.
>
> 2. Most interesting transforms will not just be Combine wrappers - we have
> a lot of them because we try to write pithy examples so I bet this issue is
> a bigger deal to us than to our users. I don't have data to back this up;
> maybe users do write a bunch of Combine wrappers, but IMO they should spend
> their time elsewhere. I certainly would. So the burden of people updating
> their Combine wrappers is not compelling to me.
>
> 2a. Consider also Count.perElement(). It really is a glorified combine, but
> none of the supposed benefits of automatic propagation of tuning knobs can
> by accrue to it, because it is not _just_ a Combine.
>
> 3. Those transforms that really are just Combine wrappers will not require
> update when there is a new feature added to Combine. A new performance
> tuning knob, perhaps, but as Eugene points out, new features like side
> inputs or access to pipeline options aren't automatically applicable even
> if your transform is a glorified CombineFn.
>
> 3a. And since the argument is only about universally applicable builder
> methods on Combine, how many more do you imagine we will be adding?
>
> 3b. You can use type hackery to ensure you don't forget a knob, along the
> lines of `CountPerElement implements HasACombineInsideIt<CountPerElement>`.
> This would actually be a step towards enabling e.g. hot key counting in
> Count.PerElement
>
> 4. If a transform that is just a wrapper needs to evolve into more, then
> you have to make a backwards incompatible change because you broke your
> abstractions, and you incur all the manual programming cost anyhow.
>
>
> If I were to write a style guide, I might summarize as:
>
> (a) If you are writing an interesting CombineFn, give it some public
> factory methods.
>
> (b) If you are writing a transform that _by definition, forever and always_
> is a wrapper on Combine, stop after (a)
>
> (c) If you are writing a transform that has conceivably multiple
> implementation choices, you need an abstraction boundary to protect against
> the current decision.
>
> (d) Only return a class more specific than PTransform if your transform has
> more-specific methods to call.
>
>
> All that said, I'm not sure how much influence this particular collection
> of guidelines will have, hence I don't feel a full +1. (except (a) which is
> very important).
>
> Kenn
>
> On Tue, Feb 7, 2017 at 5:59 PM, Eugene Kirpichov <
> [email protected]> wrote:
>
> > There's 2 points here
> >
> > 1. If Count.globally() is implemented via Combine.globally(), then should
> > Count.globally() return a Combine.Globally, or should it wrap it into a
> new
> > class Count.Globally? (that's what I'm wondering in this thread)
> >
> > I think the "least visibility" argument here would lead us to saying we
> > should wrap into a new class, because returning a Combine.Globally leaks
> > the implementation detail that Count is always implemented via Combine.
> >
> > 2. If Thumbs.twiddle() transforms Foo to Bar, should it return a
> > PTransform<Foo, Bar>, or should it return Thumbs.Twiddle (which extends
> > PTransform<Foo, Bar>)? This is where your "least visibility" argument
> from
> > that thread applies more - we're exposing a more specific type - but I
> > think by exposing this type we are not imposing any additional
> obligations
> > on ourselves and not leaking any details: saying "twiddle() returns a
> > Twiddle" is as good as saying nothing.
> >
> > In case of Combine, the ability to additionally configure it for the user
> > is a strong argument. But I think we've been through a similar situation
> > and made a different decision: bounded reads from unbounded sources. We
> > started off with exposing UnboundedSource because a user might want to
> > specify .withMaxNumRecords() and .withMaxReadTime() on it, but changed
> our
> > minds to saying "if you're developing an unbounded connector, and if you
> > want to make a promise that this is part of your connector's public API,
> > then you have to implement these functions yourself". The reasoning was
> > that it might not stay an UnboundedSource forever - and indeed, with SDF,
> > it won't, and the implementation of these functions will become custom.
> >
> > Also, the choice to return a Count.Globally (and propagate the needed
> > functions from Combine.Globally) makes the assumption that as a developer
> > of the Count.Globally transform, you fully control its API surface. E.g.
> by
> > supporting a .withFanout() function, you're saying "When counting things
> > globally, you may need to configure the fanout". Corollary: you also
> > control what functions its API *doesn't* have - e.g. it probably
> shouldn't
> > have .withSideInputs() because side inputs make no sense here.
> >
> > The choice to return a Combine.Globally lets you automatically get all
> new
> > APIs that Combine.Globally may ever get - which may or may not be a good
> > thing. It is good because these functions may be useful; it is bad
> because
> > as a developer of Count.Globally it is now your responsibility to
> > anticipate all possible interactions of parameters users might set on it
> > with the particular CombineFn you supplied; and it is your responsibility
> > to keep that up-to-date as Combine.Globally evolves (e.g. it might some
> day
> > add a parameter that interacts badly with CountFn and then you probably
> > should warn users not to set it). Granted, in this particular case it's
> > easy, but we're talking about the general principle.
> >
> > On Tue, Feb 7, 2017 at 5:45 PM Dan Halperin <[email protected]
> >
> > wrote:
> >
> > > A little bit more inline:
> > >
> > > On Tue, Feb 7, 2017 at 5:15 PM, Eugene Kirpichov <
> > > [email protected]> wrote:
> > >
> > > > Hello,
> > > >
> > > > I was auditing Beam for violations of PTransform style guide
> > > > https://beam.apache.org/contribute/ptransform-style-guide/ and came
> > > across
> > > > another style point that deserves discussion.
> > > >
> > > > Look at Count transform:
> > > >
> > > >   public static <T> Combine.Globally<T, Long> globally() {
> > > >     return Combine.globally(new CountFn<T>());
> > > >   }
> > > >
> > > >   public static <K, V> Combine.PerKey<K, V, Long> perKey() {
> > > >     return Combine.<K, V, Long>perKey(new CountFn<V>());
> > > >   }
> > > >
> > > >   public static <T> PerElement<T> perElement() {
> > > >     return new PerElement<>();
> > > >   }
> > > >
> > > > I asked myself: should globally() and perKey() also define wrapper
> > > classes
> > > > - e.g. should it be "public static <T> Globally<T, Long> globally()"
> > > where
> > > > "Globally" is a new inner class of Count?
> > > >
> > > > I argue that the answer is yes, but it's not clear-cut.
> > > > Cons:
> > > > - If we return a Combine.Globally, the user can use the features
> > provided
> > > > by Combine.Globally - e.g. .withDefaults(), .withFanout(),
> > > > .asSingletonView().
> > > >
> > >
> > > +1 to this point, which was a conscious decision in the pre-Beam days
> > > (which of course means it IS worth revisiting ;).
> > > Trying to replay the reasoning:
> > >
> > > * If wrapping, the author of a new Count.Globally can now only make the
> > > extra functionality in Combine available by similarly exposing all such
> > > functions.
> > >
> > > * Conversely, the current implementation makes new functionality in
> > Combine
> > > available "for free" to users of Count.globally(). Whereas new
> > > functionality on Combine would otherwise mandate that *all wrappers*
> > change
> > > to actually be exposed.
> > >
> > > * There's almost no data here, but: we have added new functionality to
> > > Combine (withSideInputs) and have not added new functionality to Count.
> > >
> > >
> > > > Pros:
> > > > - Style consistency with other transforms. Almost all transforms have
> > > their
> > > > own class, and their factory functions return that class.
> > > >
> > >
> > > IMO this should only happen if the user needs that class. For all
> > examples
> > > I'm aware of,
> > > the returned class has stuff you need to do, like configuring coders or
> > > side inputs or other parameters.
> > > IMO if the user need not configure, return the least constraining thing
> > you
> > > can.
> > >
> > >
> > > > - Implementation can evolve. However, in case of Count, that is
> > unlikely.
> > > >
> > >
> > > +1 to "unlikely"
> > >
> > >
> > > > - ...Or is it? If the transform has a concrete class, then the runner
> > can
> > > > intercept that class and e.g. provide an (even) more efficient
> > > > implementation of Count.Globally. This gets much more awkward if the
> > > runner
> > > > has to intercept every Combine and check whether it's combining
> using a
> > > > CountFn.
> > > >
> > >
> > > Post-runner API, this argument goes away. Instead, the runner will need
> > to
> > > inspect generically the attributes of the transform to do the
> > replacement.
> > >
> > > Summarizing: I currently am a -0.8 on this proposal.
> > >
> > >
> > > >
> > > > So, I propose to add this as a style guide rule, and solve the
> problem
> > in
> > > > "Cons" by saying "Yeah, if you want the extra features of the
> transform
> > > > you're expanding into, you have to propagate them through the API of
> > your
> > > > transform and delegate to the underlying transform manually".
> > > >
> > > > Thoughts?
> > > >
> > >
> >
>

Re: Should you always have a separate PTransform class for a new transform?

Reply via email to