Hi,

I limited explicit repartitioning to map-like operators because map
operations should be by far the most common use cases (rebalancing for
expensive mappers, repartitioning for partitionMap, ...), and I doubt
there is much benefit in repartitioning before other operators.
Right now, partitioning is implemented as its own runtime operator with a
no-op driver. Hence, there would be serialization overhead if subsequent
operators cannot be chained (e.g., join).
Also, putting a partitioning in front of a combinable reduce would
complicate things, because the partitioning would have to be moved
between the combiner and the reducer.
Overall, I think there are not many benefits to offering repartitioning
for all operators, and if you still want it, you can use an identity mapper.
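To make the identity-mapper workaround concrete, here is a minimal sketch of the idea — this is plain Python, not Flink code, and `hash_partition` / `identity_map` are hypothetical names I'm using for illustration: records are reassigned to partitions by hashing their key, and a no-op mapper chained after the partitioning gives any downstream operator a handle on the freshly partitioned data.

```python
def hash_partition(records, key, num_partitions):
    """Assign each record to a partition by hashing its key
    (conceptually what an explicit repartition operator does)."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[hash(key(rec)) % num_partitions].append(rec)
    return partitions

def identity_map(partition):
    """A no-op mapper: it passes records through unchanged, so the
    operator that follows it runs on the repartitioned data."""
    return [rec for rec in partition]

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = hash_partition(records, key=lambda r: r[0], num_partitions=2)
result = [identity_map(p) for p in parts]

# No records are lost, and records with the same key end up
# in the same partition.
assert sorted(r for p in result for r in p) == sorted(records)
```

The point is only that the identity map costs nothing semantically, so it can serve as the "map-like operator" that the current API requires after an explicit repartition.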

Does that make sense?
If not, we can also change the implementation without breaking the API and
offer repartitioning for all operators. ;-)

Anyway, I'm with you that we should think about how to integrate physical
operators (partition, sort, combine) into the API.

Cheers, Fabian


2014-09-24 17:00 GMT+02:00 Aljoscha Krettek <[email protected]>:

> Hi,
> (mostly @fabian) why is the re-partition operator special-cased like
> it is now? You can only do map/filter on partitioned data. Wouldn't it
> be nice if we had a general re-partition operator. Operators that
> normally do their own repartitioning would notice that the data is
> already partitioned and use that. The way it is now, the
> implementation relies heavily on special-case handling.
>
> In the long run we could even introduce combine and sort as special
> operators that users could insert themselves. The optimiser would then
> also insert these before operations when required. This would
> simplify/generalise things a bit.
>
> What do you think?
>
> Cheers,
> Aljoscha
>
