I'm split between our current one pass model of pipeline construction and a
two pass model where all information is gathered and then PTransform
expansions are performed.


On Tue, Jul 25, 2017 at 8:25 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hello,
>
> I've worked on a few different things recently and ran repeatedly into the
> same issue: that we do not have clear guidance on who should set the Coder
> on a PCollection: is it responsibility of the PTransform that outputs it,
> or is it responsibility of the user, or is it sometimes one and sometimes
> the other?
>
> I believe that the answer is "it's responsibility of the transform" and
> moreover that  ideally PCollection.setCoder() should not exist. Instead:
>
> - Require that all transforms set a Coder on the PCollection's they produce
> - i.e. it should never be responsibility of the user to "fix up" a coder on
> a PCollection produced by a transform.
>
> - Since all transforms are composed of primitive transforms, saying
> "transforms must set a Coder" means simply that all *primitive* transforms
> must set a Coder on their output.
>
> - In some cases, a primitive PTransform currently doesn't have enough
> information to infer a coder for its output collection - e.g.
> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
> OutputT. In that case such transforms should allow the user to provide a
> coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs from
> requiring the user to set a coder on the resulting collection]
>
> - Corollary: composite transforms need to only configure their primitive
> transforms (and composite sub-transforms) properly, and give them a Coder
> if needed.
>
> - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs to
> be configurable with coders for all of these, because the implementation of
> the transform may change and it may introduce intermediate collections
> involving these types. However, in many cases, some of these type
> parameters appear in the type of the transform's input, e.g. a
> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always be
> able to extract the coders for FooT and BarT from the input PCollection, so
> the user does not need to provide them. However, a coder for BarT must be
> provided. I think in most cases the transform needs to be configurable only
> with coders for its output.
>
> Here's a smooth migration path to accomplish the above:
> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
> - Make all primitive transforms optionally configurable with a coder for
> their outputs, such as ParDo.of(DoFn).withOutputCoder().
> - By using the above, make all composite transforms shipped with the SDK
> set a Coder on the collections they produce; in some cases, this will
> require adding a withSomethingCoder() option to the transform and
> propagating that coder to its sub-transforms. If the option is unset,
> that's fine for now.
> - As a result of the above, get rid of all setCoder() calls in the Beam
> repo. The call will still be there, but it will just not be used anywhere
> in the SDK or examples, and we can mark it deprecated.
> - Add guidance to PTransform Style Guide in line with the above
>
> Does this sound like a good idea? I'm not sure how urgent it would be to
> actually do this, but I'd like to know whether people agree that this is a
> good goal in general.
>

Reply via email to