Hi, I limited explicit repartitioning to map-like operators because map operations should be by far the most common use case (rebalancing for expensive mappers, repartitioning for partitionMap, ...), and I doubt the benefits of repartitioning before other operators. Right now, partitioning is implemented as its own runtime operator with a no-op driver. Hence, there would be a serialization overhead if subsequent operators cannot be chained (e.g., join). Also, having a partitioning in front of a combinable reduce would complicate things, because the partitioning would have to be moved between the combiner and the reducer. Overall, I think there are not many benefits to offering repartitioning for all operators, and if you still want it, you can use an identity mapper.
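To illustrate what the runtime operator does conceptually, here is a small stdlib-only Java sketch of hash repartitioning followed by a chained map-like operation. This is not Flink code; the class and method names (`PartitionSketch`, `partitionByHash`) are made up for illustration, and a real runtime would of course ship partitions over the network rather than build in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class PartitionSketch {
    // Hash-partition records by a key extractor into n partitions.
    // This mimics the semantics of an explicit repartition operator:
    // all records with the same key end up in the same partition.
    static <T, K> List<List<T>> partitionByHash(List<T> data, Function<T, K> key, int n) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<>());
        }
        for (T record : data) {
            int p = Math.floorMod(key.apply(record).hashCode(), n);
            parts.get(p).add(record);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        // A map-like operator chained after the partitioning can then
        // process each partition locally, e.g. print its contents.
        List<List<String>> parts = partitionByHash(words, w -> w, 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("partition " + i + " -> " + parts.get(i));
        }
    }
}
```

Because a map-like operator can be chained directly onto the partitioning, no extra serialization step is needed between the two, which is the case the current implementation optimizes for.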
Does that make sense? If not, we can also change the implementation without breaking the API and offer repartitioning for all operators. ;-)

Anyway, I'm with you that we should think about how to integrate physical operators (partition, sort, combine) into the API.

Cheers, Fabian

2014-09-24 17:00 GMT+02:00 Aljoscha Krettek <[email protected]>:
> Hi,
> (mostly @fabian) why is the re-partition operator special-cased like
> it is now? You can only do map/filter on partitioned data. Wouldn't it
> be nice if we had a general re-partition operator. Operators that
> normally do their own repartitioning would notice that the data is
> already partitioned and use that. The way it is now, the
> implementation relies heavily on special-case handling.
>
> In the long run we could even introduce combine and sort as special
> operators that users could insert themselves. The optimiser would then
> also insert these before operations when required. This would
> simplify/generalise things a bit.
>
> What do you think?
>
> Cheers,
> Aljoscha
