On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles <[email protected]> wrote:
> On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <[email protected]>
> wrote:
>>
>> Ken/Robert, I believe users will want the ability to set the output coder
>> because coders may have intrinsic properties where the type isn't enough
>> information to fully specify what I want as a user. Some cases I can see
>> are:
>> 1) I have a cheap and fast non-deterministic coder but need a different,
>> slower coder when I want to use the value as the key to a GBK. For
>> example, a set coder would need to consistently order the values of the
>> set when used as the key.
>> 2) I know a property of the data which allows me to have a cheaper
>> encoding. Imagine I know that all the strings have a common prefix, or
>> that integers are in a certain range, or that a matrix is sparse/dense.
>> Not all PCollections of strings / integers / matrices in the pipeline
>> will have this property, just some.
>> 3) Sorting comes up occasionally; traditionally in Google this was done
>> by sorting the encoded version of the object lexicographically during a
>> GBK. There are good lexicographical byte representations for ASCII
>> strings, integers, and some IEEE number representations, which could be
>> supported by the use of a special coder.
>>
>
> Items (1) and (3) do not require special knowledge from the user. They are
> easily observed properties of a pipeline. My proposal included full
> automation for both. The suggestion is new methods
> .getDeterministicCoder(TypeDescriptor) and
> .getLexicographicCoder(TypeDescriptor).

Completely agree--use cases (1) and (3) are an indirect use of Coders,
employed to achieve an effect that would be better expressed directly.
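For concreteness, a minimal sketch of the lookup surface Ken is describing
(the method names getDeterministicCoder/getLexicographicCoder come from his
proposal; the CoderLookup interface and everything else below are my own
illustration, not an agreed or existing API):

    import org.apache.beam.sdk.coders.CannotProvideCoderException;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.values.TypeDescriptor;

    /** Sketch only: the lookup surface Ken's proposal implies. */
    public interface CoderLookup {
      /** Existing style of lookup: any registered coder for the type. */
      <T> Coder<T> getCoder(TypeDescriptor<T> type)
          throws CannotProvideCoderException;

      /**
       * Proposed for use case (1): a coder that encodes equal values to
       * equal bytes, making the encoded form safe to use as a GBK key.
       */
      <T> Coder<T> getDeterministicCoder(TypeDescriptor<T> type)
          throws CannotProvideCoderException;

      /**
       * Proposed for use case (3): a coder whose encoded bytes sort
       * lexicographically in the same order as the values themselves.
       */
      <T> Coder<T> getLexicographicCoder(TypeDescriptor<T> type)
          throws CannotProvideCoderException;
    }

With something like this, the runner (not the user) can pick the right coder
at the point where determinism or byte-ordering is actually required.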
> (2) is an interesting hypothetical for massive scale where tiny incremental
> optimization represents a lot of cost _and_ your data has sufficient
> structure to realize a benefit _and_ it needs to be pinpointed to just some
> PCollections. I think our experience with coders so far is that their
> existence is almost entirely negative. It would be nice to support this
> vanishingly rare case without inflicting a terrible pain point on the model
> and all other users.

(2) is not just about cheapness; sometimes there's other structure in the
data we can leverage. Consider the UnionCoder used in CoGBK: RawUnionValue
has an integer tag that indicates the type of its raw Object field. Unless
we want to extend the type language, there is no type descriptor sufficient
to infer the coder. I'm dubious that going down the road of adding special
cases is the right thing here.
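To make the UnionCoder point concrete, here is roughly what such a coder has
to do. This is a simplified sketch, not Beam's actual UnionCoder (the class
name SketchUnionCoder is made up), but RawUnionValue, CustomCoder, and
VarIntCoder are real SDK classes:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.List;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.CustomCoder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.transforms.join.RawUnionValue;

    /** Simplified union coder: dispatches on the union tag to a per-tag coder. */
    public class SketchUnionCoder extends CustomCoder<RawUnionValue> {
      private final List<Coder<Object>> elementCoders; // one coder per union tag

      public SketchUnionCoder(List<Coder<Object>> elementCoders) {
        this.elementCoders = elementCoders;
      }

      @Override
      public void encode(RawUnionValue union, OutputStream out) throws IOException {
        // Write the tag, then delegate to the coder registered for that tag.
        VarIntCoder.of().encode(union.getUnionTag(), out);
        elementCoders.get(union.getUnionTag()).encode(union.getValue(), out);
      }

      @Override
      public RawUnionValue decode(InputStream in) throws IOException {
        int tag = VarIntCoder.of().decode(in);
        return new RawUnionValue(tag, elementCoders.get(tag).decode(in));
      }
    }

The per-tag elementCoders list is exactly what a TypeDescriptor<RawUnionValue>
cannot express, since the value field is statically just Object.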
> For example, in those cases you could encode in your
> DoFn so the type descriptor would just be byte[].

As well as being an extremely cumbersome API, this would incur the cost of
coding/decoding at that DoFn boundary even if it is fused away.

>> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > That's an interesting thread, and I have been wondering about the
>> > relationship between type descriptor and coder for a while ;)
>> >
>> > Today, in a PCollection, we can set the coder and we also have a
>> > getTypeDescriptor(). It sounds weird to me: it should be one or the
>> > other.
>> >
>> > Basically, if the Coder is not used to define the type, then I fully
>> > agree with Eugene.
>> >
>> > The PCollection should define only the type descriptor, not the coder
>> > itself: the coder can be found using the type descriptor.
>> >
>> > With both a coder and a type descriptor on the PCollection, it seems a
>> > bit "decoupled" to me, and it would be possible to have a coder on the
>> > PCollection that doesn't match the type descriptor.
>> >
>> > I think the PCollection type descriptor should be defined, and the coder
>> > should be implicit based on this type descriptor.
>> >
>> > Thoughts?
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
>> >
>> >> Hello,
>> >>
>> >> I've worked on a few different things recently and ran repeatedly into
>> >> the same issue: we do not have clear guidance on who should set the
>> >> Coder on a PCollection. Is it the responsibility of the PTransform that
>> >> outputs it, or the responsibility of the user, or sometimes one and
>> >> sometimes the other?
>> >>
>> >> I believe that the answer is "it's the responsibility of the
>> >> transform", and moreover that ideally PCollection.setCoder() should not
>> >> exist. Instead:
>> >>
>> >> - Require that all transforms set a Coder on the PCollections they
>> >> produce - i.e. it should never be the responsibility of the user to
>> >> "fix up" a coder on a PCollection produced by a transform.
>> >>
>> >> - Since all transforms are composed of primitive transforms, saying
>> >> "transforms must set a Coder" means simply that all *primitive*
>> >> transforms must set a Coder on their output.
>> >>
>> >> - In some cases, a primitive PTransform currently doesn't have enough
>> >> information to infer a coder for its output collection - e.g.
>> >> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
>> >> OutputT. In that case such transforms should allow the user to provide
>> >> a coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs
>> >> from requiring the user to set a coder on the resulting collection].
>> >>
>> >> - Corollary: composite transforms need only configure their primitive
>> >> transforms (and composite sub-transforms) properly, giving them a Coder
>> >> if needed.
>> >>
>> >> - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs
>> >> to be configurable with coders for all of these, because the
>> >> implementation of the transform may change and introduce intermediate
>> >> collections involving these types. However, in many cases some of these
>> >> type parameters appear in the type of the transform's input, e.g. a
>> >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always
>> >> be able to extract the coders for FooT and BarT from the input
>> >> PCollection, so the user does not need to provide them. However, a
>> >> coder for MooT must be provided. I think in most cases the transform
>> >> needs to be configurable only with coders for its output.
>> >>
>> >> Here's a smooth migration path to accomplish the above:
>> >> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
>> >> - Make all primitive transforms optionally configurable with a coder
>> >> for their outputs, such as ParDo.of(DoFn).withOutputCoder().
>> >> - By using the above, make all composite transforms shipped with the
>> >> SDK set a Coder on the collections they produce; in some cases this
>> >> will require adding a withSomethingCoder() option to the transform and
>> >> propagating that coder to its sub-transforms. If the option is unset,
>> >> that's fine for now.
>> >> - As a result of the above, get rid of all setCoder() calls in the Beam
>> >> repo. The method will still be there, but it will no longer be used
>> >> anywhere in the SDK or examples, and we can mark it deprecated.
>> >> - Add guidance to the PTransform Style Guide in line with the above.
>> >>
>> >> Does this sound like a good idea? I'm not sure how urgent it would be
>> >> to actually do this, but I'd like to know whether people agree that
>> >> this is a good goal in general.
>> >>
>> >>
>> > --
>> > Jean-Baptiste Onofré
>> > [email protected]
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>>
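P.S. To make Eugene's proposal concrete from a transform author's
perspective, here is a sketch of how a generic composite might use it.
withOutputCoder(...) is the proposed method and does not exist in the SDK
today, and the mapVia helper is hypothetical:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.PCollection;

    public class CoderExample {
      /**
       * A generic composite cannot infer a Coder<OutputT> from the DoFn's
       * type parameters, so it accepts one and hands it to the primitive
       * ParDo via the proposed withOutputCoder(...), instead of the caller
       * fixing up the result with PCollection.setCoder().
       */
      static <InputT, OutputT> PCollection<OutputT> mapVia(
          PCollection<InputT> input,
          final SerializableFunction<InputT, OutputT> fn,
          Coder<OutputT> outputCoder) {
        return input.apply(
            ParDo.of(new DoFn<InputT, OutputT>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    c.output(fn.apply(c.element()));
                  }
                })
                .withOutputCoder(outputCoder)); // proposed; not the current API
      }
    }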
