Re: Should you always have a separate PTransform class for a new transform?

Eugene Kirpichov Tue, 07 Feb 2017 18:01:08 -0800

There's 2 points here

1. If Count.globally() is implemented via Combine.globally(), then should
Count.globally() return a Combine.Globally, or should it wrap it into a new
class Count.Globally? (that's what I'm wondering in this thread)

I think the "least visibility" argument here would lead us to saying we
should wrap into a new class, because returning a Combine.Globally leaks
the implementation detail that Count is always implemented via Combine.

2. If Thumbs.twiddle() transforms Foo to Bar, should it return a
PTransform<Foo, Bar>, or should it return Thumbs.Twiddle (which extends
PTransform<Foo, Bar>)? This is where your "least visibility" argument from
that thread applies more - we're exposing a more specific type - but I
think by exposing this type we are not imposing any additional obligations
on ourselves and not leaking any details: saying "twiddle() returns a
Twiddle" is as good as saying nothing.

In case of Combine, the ability to additionally configure it for the user
is a strong argument. But I think we've been through a similar situation
and made a different decision: bounded reads from unbounded sources. We
started off with exposing UnboundedSource because a user might want to
specify .withMaxNumRecords() and .withMaxReadTime() on it, but changed our
minds to saying "if you're developing an unbounded connector, and if you
want to make a promise that this is part of your connector's public API,
then you have to implement these functions yourself". The reasoning was
that it might not stay an UnboundedSource forever - and indeed, with SDF,
it won't, and the implementation of these functions will become custom.

Also, the choice to return a Count.Globally (and propagate the needed
functions from Combine.Globally) makes the assumption that as a developer
of the Count.Globally transform, you fully control its API surface. E.g. by
supporting a .withFanout() function, you're saying "When counting things
globally, you may need to configure the fanout". Corollary: you also
control what functions its API *doesn't* have - e.g. it probably shouldn't
have .withSideInputs() because side inputs make no sense here.

The choice to return a Combine.Globally lets you automatically get all new
APIs that Combine.Globally may ever get - which may or may not be a good
thing. It is good because these functions may be useful; it is bad because
as a developer of Count.Globally it is now your responsibility to
anticipate all possible interactions of parameters users might set on it
with the particular CombineFn you supplied; and it is your responsibility
to keep that up-to-date as Combine.Globally evolves (e.g. it might some day
add a parameter that interacts badly with CountFn and then you probably
should warn users not to set it). Granted, in this particular case it's
easy, but we're talking about the general principle.

On Tue, Feb 7, 2017 at 5:45 PM Dan Halperin <dhalp...@google.com.invalid>
wrote:

> A little bit more inline:
>
> On Tue, Feb 7, 2017 at 5:15 PM, Eugene Kirpichov <
> kirpic...@google.com.invalid> wrote:
>
> > Hello,
> >
> > I was auditing Beam for violations of PTransform style guide
> > https://beam.apache.org/contribute/ptransform-style-guide/ and came
> across
> > another style point that deserves discussion.
> >
> > Look at Count transform:
> >
> >   public static <T> Combine.Globally<T, Long> globally() {
> >     return Combine.globally(new CountFn<T>());
> >   }
> >
> >   public static <K, V> Combine.PerKey<K, V, Long> perKey() {
> >     return Combine.<K, V, Long>perKey(new CountFn<V>());
> >   }
> >
> >   public static <T> PerElement<T> perElement() {
> >     return new PerElement<>();
> >   }
> >
> > I asked myself: should globally() and perKey() also define wrapper
> classes
> > - e.g. should it be "public static <T> Globally<T, Long> globally()"
> where
> > "Globally" is a new inner class of Count?
> >
> > I argue that the answer is yes, but it's not clear-cut.
> > Cons:
> > - If we return a Combine.Globally, the user can use the features provided
> > by Combine.Globally - e.g. .withDefaults(), .withFanout(),
> > .asSingletonView().
> >
>
> +1 to this point, which was a conscious decision in the pre-Beam days
> (which of course means it IS worth revisiting ;).
> Trying to replay the reasoning:
>
> * If wrapping, the author of a new Count.Globally can now only make the
> extra functionality in Combine available by similarly exposing all such
> functions.
>
> * Conversely, the current implementation makes new functionality in Combine
> available "for free" to users of Count.globally(). Whereas new
> functionality on Combine would otherwise mandate that *all wrappers* change
> to actually be exposed.
>
> * There's almost no data here, but: we have added new functionality to
> Combine (withSideInputs) and have not added new functionality to Count.
>
>
> > Pros:
> > - Style consistency with other transforms. Almost all transforms have
> their
> > own class, and their factory functions return that class.
> >
>
> IMO this should only happen if the user needs that class. For all examples
> I'm aware of,
> the returned class has stuff you need to do, like configuring coders or
> side inputs or other parameters.
> IMO if the user need not configure, return the least constraining thing you
> can.
>
>
> > - Implementation can evolve. However, in case of Count, that is unlikely.
> >
>
> +1 to "unlikely"
>
>
> > - ...Or is it? If the transform has a concrete class, then the runner can
> > intercept that class and e.g. provide an (even) more efficient
> > implementation of Count.Globally. This gets much more awkward if the
> runner
> > has to intercept every Combine and check whether it's combining using a
> > CountFn.
> >
>
> Post-runner API, this argument goes away. Instead, the runner will need to
> inspect generically the attributes of the transform to do the replacement.
>
> Summarizing: I currently am a -0.8 on this proposal.
>
>
> >
> > So, I propose to add this as a style guide rule, and solve the problem in
> > "Cons" by saying "Yeah, if you want the extra features of the transform
> > you're expanding into, you have to propagate them through the API of your
> > transform and delegate to the underlying transform manually".
> >
> > Thoughts?
> >
>

Re: Should you always have a separate PTransform class for a new transform?

Reply via email to