Re: Should you always have a separate PTransform class for a new transform?

Dan Halperin Tue, 07 Feb 2017 17:45:59 -0800

A little bit more inline:

On Tue, Feb 7, 2017 at 5:15 PM, Eugene Kirpichov <
[email protected]> wrote:


> Hello,
>
> I was auditing Beam for violations of PTransform style guide
> https://beam.apache.org/contribute/ptransform-style-guide/ and came across
> another style point that deserves discussion.
>
> Look at Count transform:
>
>   public static <T> Combine.Globally<T, Long> globally() {
>     return Combine.globally(new CountFn<T>());
>   }
>
>   public static <K, V> Combine.PerKey<K, V, Long> perKey() {
>     return Combine.<K, V, Long>perKey(new CountFn<V>());
>   }
>
>   public static <T> PerElement<T> perElement() {
>     return new PerElement<>();
>   }
>
> I asked myself: should globally() and perKey() also define wrapper classes
> - e.g. should it be "public static <T> Globally<T, Long> globally()" where
> "Globally" is a new inner class of Count?
>
> I argue that the answer is yes, but it's not clear-cut.
> Cons:
> - If we return a Combine.Globally, the user can use the features provided
> by Combine.Globally - e.g. .withDefaults(), .withFanout(),
> .asSingletonView().
>

+1 to this point, which was a conscious decision in the pre-Beam days
(which of course means it IS worth revisiting ;).
Trying to replay the reasoning:

* If wrapping, the author of a new Count.Globally can now only make the
extra functionality in Combine available by similarly exposing all such
functions.

* Conversely, the current implementation makes new functionality in Combine
available "for free" to users of Count.globally(). Whereas new
functionality on Combine would otherwise mandate that *all wrappers* change
to actually be exposed.

* There's almost no data here, but: we have added new functionality to
Combine (withSideInputs) and have not added new functionality to Count.


> Pros:
> - Style consistency with other transforms. Almost all transforms have their
> own class, and their factory functions return that class.
>

IMO this should only happen if the user needs that class. For all examples
I'm aware of,
the returned class has stuff you need to do, like configuring coders or
side inputs or other parameters.
IMO if the user need not configure, return the least constraining thing you
can.


> - Implementation can evolve. However, in case of Count, that is unlikely.
>

+1 to "unlikely"


> - ...Or is it? If the transform has a concrete class, then the runner can
> intercept that class and e.g. provide an (even) more efficient
> implementation of Count.Globally. This gets much more awkward if the runner
> has to intercept every Combine and check whether it's combining using a
> CountFn.
>

Post-runner API, this argument goes away. Instead, the runner will need to
inspect generically the attributes of the transform to do the replacement.

Summarizing: I currently am a -0.8 on this proposal.


>
> So, I propose to add this as a style guide rule, and solve the problem in
> "Cons" by saying "Yeah, if you want the extra features of the transform
> you're expanding into, you have to propagate them through the API of your
> transform and delegate to the underlying transform manually".
>
> Thoughts?
>

Re: Should you always have a separate PTransform class for a new transform?

Reply via email to