On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles <[email protected]> wrote:
> On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik <[email protected]>
> wrote:
>>
>> Ken/Robert, I believe users will want the ability to set the output coder
>> because coders may have intrinsic properties where the type isn't enough
>> information to fully specify what I want as a user. Some cases I can see
>> are:
>> 1) I have a cheap and fast non-deterministic coder but need a different,
>> slower coder when I want to use the value as the key to a GBK. For
>> example, a set coder would need to consistently order the values of the
>> set when used as the key.
>> 2) I know a property of the data which allows me to have a cheaper
>> encoding. Imagine I know that all the strings have a common prefix, or
>> that integers are in a certain range, or that a matrix is sparse/dense.
>> Not all PCollections of strings / integers / matrices in the pipeline
>> will have this property, just some.
>> 3) Sorting comes up occasionally; traditionally in Google this was done
>> by sorting the encoded version of the object lexicographically during a
>> GBK. There are good lexicographical byte representations for ASCII
>> strings, integers, and some IEEE number representations, which could be
>> supported by the use of a special coder.
>>
>
> Items (1) and (3) do not require special knowledge from the user. They are
> easily observed properties of a pipeline. My proposal included full
> automation for both. The suggestion is new methods
> .getDeterministicCoder(TypeDescriptor) and
> .getLexicographicCoder(TypeDescriptor).

Completely agree--use cases (1) and (3) are an indirect use of Coders,
employed to achieve an effect that would be better expressed directly.
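For concreteness, a minimal sketch of the lookup surface Ken is describing
(the method names getDeterministicCoder/getLexicographicCoder come from his
proposal; the CoderLookup interface and everything else below are my own
illustration, not an agreed or existing API):

    import org.apache.beam.sdk.coders.CannotProvideCoderException;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.values.TypeDescriptor;

    /** Sketch only: the lookup surface Ken's proposal implies. */
    public interface CoderLookup {
      /** Existing style of lookup: any registered coder for the type. */
      <T> Coder<T> getCoder(TypeDescriptor<T> type)
          throws CannotProvideCoderException;

      /**
       * Proposed for use case (1): a coder that encodes equal values to
       * equal bytes, making the encoded form safe to use as a GBK key.
       */
      <T> Coder<T> getDeterministicCoder(TypeDescriptor<T> type)
          throws CannotProvideCoderException;

      /**
       * Proposed for use case (3): a coder whose encoded bytes sort
       * lexicographically in the same order as the values themselves.
       */
      <T> Coder<T> getLexicographicCoder(TypeDescriptor<T> type)
          throws CannotProvideCoderException;
    }

With something like this, the runner (not the user) can pick the right coder
at the point where determinism or byte-ordering is actually required.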
> (2) is an interesting hypothetical for massive scale where tiny incremental
> optimization represents a lot of cost _and_ your data has sufficient
> structure to realize a benefit _and_ it needs to be pinpointed to just some
> PCollections. I think our experience with coders so far is that their
> existence is almost entirely negative. It would be nice to support this
> vanishingly rare case without inflicting a terrible pain point on the model
> and all other users.

(2) is not just about cheapness; sometimes there's other structure in the
data we can leverage. Consider the UnionCoder used in CoGBK: RawUnionValue
has an integer tag that indicates the type of its raw Object field. Unless
we want to extend the type language, there is no type descriptor sufficient
to infer the coder. I'm dubious that going down the road of adding special
cases is the right thing here.
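To make the UnionCoder point concrete, here is roughly what such a coder has
to do. This is a simplified sketch, not Beam's actual UnionCoder (the class
name SketchUnionCoder is made up), but RawUnionValue, CustomCoder, and
VarIntCoder are real SDK classes:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.List;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.coders.CustomCoder;
    import org.apache.beam.sdk.coders.VarIntCoder;
    import org.apache.beam.sdk.transforms.join.RawUnionValue;

    /** Simplified union coder: dispatches on the union tag to a per-tag coder. */
    public class SketchUnionCoder extends CustomCoder<RawUnionValue> {
      private final List<Coder<Object>> elementCoders; // one coder per union tag

      public SketchUnionCoder(List<Coder<Object>> elementCoders) {
        this.elementCoders = elementCoders;
      }

      @Override
      public void encode(RawUnionValue union, OutputStream out) throws IOException {
        // Write the tag, then delegate to the coder registered for that tag.
        VarIntCoder.of().encode(union.getUnionTag(), out);
        elementCoders.get(union.getUnionTag()).encode(union.getValue(), out);
      }

      @Override
      public RawUnionValue decode(InputStream in) throws IOException {
        int tag = VarIntCoder.of().decode(in);
        return new RawUnionValue(tag, elementCoders.get(tag).decode(in));
      }
    }

The per-tag elementCoders list is exactly what a TypeDescriptor<RawUnionValue>
cannot express, since the value field is statically just Object.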
> For example, in those cases you could encode in your
> DoFn so the type descriptor would just be byte[].

As well as being an extremely cumbersome API, this would incur the cost of
coding/decoding at that DoFn boundary even if it is fused away.

>> On Thu, Jul 27, 2017 at 1:34 AM, Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>> > Hi,
>> >
>> > That's an interesting thread, and I have been wondering about the
>> > relationship between type descriptor and coder for a while ;)
>> >
>> > Today, in a PCollection, we can set the coder and we also have a
>> > getTypeDescriptor(). It sounds weird to me: it should be one or the
>> > other.
>> >
>> > Basically, if the Coder is not used to define the type, then I fully
>> > agree with Eugene.
>> >
>> > The PCollection should define only the type descriptor, not the coder
>> > itself: the coder can be found using the type descriptor.
>> >
>> > With both a coder and a type descriptor on the PCollection, it seems a
>> > bit "decoupled" to me, and it would be possible to have a coder on the
>> > PCollection that doesn't match the type descriptor.
>> >
>> > I think the PCollection type descriptor should be defined, and the coder
>> > should be implicit based on this type descriptor.
>> >
>> > Thoughts?
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 07/26/2017 05:25 AM, Eugene Kirpichov wrote:
>> >
>> >> Hello,
>> >>
>> >> I've worked on a few different things recently and ran repeatedly into
>> >> the same issue: we do not have clear guidance on who should set the
>> >> Coder on a PCollection. Is it the responsibility of the PTransform that
>> >> outputs it, or the responsibility of the user, or sometimes one and
>> >> sometimes the other?
>> >>
>> >> I believe that the answer is "it's the responsibility of the
>> >> transform", and moreover that ideally PCollection.setCoder() should not
>> >> exist. Instead:
>> >>
>> >> - Require that all transforms set a Coder on the PCollections they
>> >> produce - i.e. it should never be the responsibility of the user to
>> >> "fix up" a coder on a PCollection produced by a transform.
>> >>
>> >> - Since all transforms are composed of primitive transforms, saying
>> >> "transforms must set a Coder" means simply that all *primitive*
>> >> transforms must set a Coder on their output.
>> >>
>> >> - In some cases, a primitive PTransform currently doesn't have enough
>> >> information to infer a coder for its output collection - e.g.
>> >> ParDo.of(DoFn<InputT, OutputT>) might be unable to infer a coder for
>> >> OutputT. In that case such transforms should allow the user to provide
>> >> a coder: ParDo.of(DoFn).withOutputCoder(...) [note that this differs
>> >> from requiring the user to set a coder on the resulting collection].
>> >>
>> >> - Corollary: composite transforms need only configure their primitive
>> >> transforms (and composite sub-transforms) properly, giving them a Coder
>> >> if needed.
>> >>
>> >> - Corollary: a PTransform with type parameters <FooT, BarT, ...> needs
>> >> to be configurable with coders for all of these, because the
>> >> implementation of the transform may change and introduce intermediate
>> >> collections involving these types. However, in many cases some of these
>> >> type parameters appear in the type of the transform's input, e.g. a
>> >> PTransform<PCollection<KV<FooT, BarT>>, PCollection<MooT>> will always
>> >> be able to extract the coders for FooT and BarT from the input
>> >> PCollection, so the user does not need to provide them. However, a
>> >> coder for MooT must be provided. I think in most cases the transform
>> >> needs to be configurable only with coders for its output.
>> >>
>> >> Here's a smooth migration path to accomplish the above:
>> >> - Make PCollection.createPrimitiveOutputInternal() take a Coder.
>> >> - Make all primitive transforms optionally configurable with a coder
>> >> for their outputs, such as ParDo.of(DoFn).withOutputCoder().
>> >> - By using the above, make all composite transforms shipped with the
>> >> SDK set a Coder on the collections they produce; in some cases this
>> >> will require adding a withSomethingCoder() option to the transform and
>> >> propagating that coder to its sub-transforms. If the option is unset,
>> >> that's fine for now.
>> >> - As a result of the above, get rid of all setCoder() calls in the Beam
>> >> repo. The method will still be there, but it will no longer be used
>> >> anywhere in the SDK or examples, and we can mark it deprecated.
>> >> - Add guidance to the PTransform Style Guide in line with the above.
>> >>
>> >> Does this sound like a good idea? I'm not sure how urgent it would be
>> >> to actually do this, but I'd like to know whether people agree that
>> >> this is a good goal in general.
>> >>
>> >>
>> > --
>> > Jean-Baptiste Onofré
>> > [email protected]
>> > http://blog.nanthrax.net
>> > Talend - http://www.talend.com
>> >
>>
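P.S. To make Eugene's proposal concrete from a transform author's
perspective, here is a sketch of how a generic composite might use it.
withOutputCoder(...) is the proposed method and does not exist in the SDK
today, and the mapVia helper is hypothetical:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.PCollection;

    public class CoderExample {
      /**
       * A generic composite cannot infer a Coder<OutputT> from the DoFn's
       * type parameters, so it accepts one and hands it to the primitive
       * ParDo via the proposed withOutputCoder(...), instead of the caller
       * fixing up the result with PCollection.setCoder().
       */
      static <InputT, OutputT> PCollection<OutputT> mapVia(
          PCollection<InputT> input,
          final SerializableFunction<InputT, OutputT> fn,
          Coder<OutputT> outputCoder) {
        return input.apply(
            ParDo.of(new DoFn<InputT, OutputT>() {
                  @ProcessElement
                  public void processElement(ProcessContext c) {
                    c.output(fn.apply(c.element()));
                  }
                })
                .withOutputCoder(outputCoder)); // proposed; not the current API
      }
    }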
