Yes, I see your points on why you limited it to the map-style operators. My gripe is that it moves knowledge about the Partition operator into some operators that now need special-case handling, and that these operations are duplicated on PartitionedDataSet. If we had Partition as just another operator, there would be no special-case handling and the integration would scale, in the sense that we wouldn't have to touch other operators, or add partition support whenever we introduce another map-style operator.
My view is that users should know what they are doing if they add a repartition before a reduce. In the future, the optimizer could catch those cases and merge a custom repartition with a reduce if there are no other consumers of the repartitioning.

Cheers,
Aljoscha

On Thu, Sep 25, 2014 at 10:13 AM, Fabian Hueske <[email protected]> wrote:
> Hi,
>
> I limited explicit repartitioning to map-like operators because Map
> operations should be by far the most common use cases (rebalancing for
> expensive mappers, repartition for partitionMap, ...) and I doubt the
> benefits of repartitioning before other operators.
> Right now, partitioning is implemented as its own runtime operator with a
> noop driver. Hence, there would be a serialization overhead if subsequent
> operators cannot be chained (i.e., join).
> Also, having a partitioning in front of a combinable reduce would make
> things complicated, because the partitioning would have to be moved
> between the combiner and the reducer.
> Overall, I think there are not many benefits to offering repartitioning
> for all operators, and if you still want it, you can use an identity
> mapper.
>
> Does that make sense?
> If not, we can also change the implementation without breaking the API
> and offer repartitioning for all operators. ;-)
>
> Anyway, I'm with you that we should think about how to integrate physical
> operators (partition, sort, combine) into the API.
>
> Cheers, Fabian
>
>
> 2014-09-24 17:00 GMT+02:00 Aljoscha Krettek <[email protected]>:
>
>> Hi,
>> (mostly @fabian) why is the repartition operator special-cased like
>> it is now? You can only do map/filter on partitioned data. Wouldn't it
>> be nice if we had a general repartition operator? Operators that
>> normally do their own repartitioning would notice that the data is
>> already partitioned and use that. The way it is now, the
>> implementation relies heavily on special-case handling.
>>
>> In the long run we could even introduce combine and sort as special
>> operators that users could insert themselves. The optimiser would then
>> also insert these before operations when required. This would
>> simplify/generalise things a bit.
>>
>> What do you think?
>>
>> Cheers,
>> Aljoscha
>>
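[Editorial note appended for context, not part of the original thread.] The "identity mapper" workaround Fabian mentions amounts to hash-partitioning the data and then applying a map that passes every record through unchanged. The sketch below is plain Java, not the Flink API: `partitionByHash` and `mapPartitions` are hypothetical helper names chosen to mirror the concepts discussed, and the partitioning logic (key hash modulo parallelism) is an assumption about how a hash repartition operator typically assigns records to parallel instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch (not Flink code): hash-partition records by a key,
// then run a map function over each partition. Using Function.identity()
// as the map function corresponds to the "identity mapper" trick from
// the thread: it forces a repartition without changing the data.
public class PartitionSketch {

    // Assign each record to one of `parallelism` partitions by the hash
    // of its key, mirroring what a hash repartition does at runtime.
    static <T> List<List<T>> partitionByHash(List<T> records,
                                             Function<T, Object> keyExtractor,
                                             int parallelism) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T record : records) {
            int p = Math.floorMod(keyExtractor.apply(record).hashCode(),
                                  parallelism);
            partitions.get(p).add(record);
        }
        return partitions;
    }

    // Apply a map function partition-wise, preserving the partitioning.
    static <T, R> List<List<R>> mapPartitions(List<List<T>> partitions,
                                              Function<T, R> mapFn) {
        List<List<R>> result = new ArrayList<>();
        for (List<T> partition : partitions) {
            List<R> mapped = new ArrayList<>();
            for (T record : partition) {
                mapped.add(mapFn.apply(record));
            }
            result.add(mapped);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "banana", "cherry", "avocado");
        // Partition by first letter, then apply the identity map.
        List<List<String>> parts = partitionByHash(words, w -> w.charAt(0), 2);
        List<List<String>> same = mapPartitions(parts, Function.identity());
        System.out.println(same);
    }
}
```

In a real dataflow system the cost Fabian points out shows up here too: the identity map is a separate operator, so if the downstream operator cannot be chained to it, records are serialized across the boundary even though the map does nothing.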
