Yes, I see your points on why you limited it to the map-style operators. My gripe is that it moves knowledge about the Partition operator into some operators that now need special-case handling, and that these operations are duplicated on PartitionedDataSet. If we had Partition as just another operator, there would be no special-case handling and the integration would scale, in the sense that we wouldn't have to touch other operators, or add partition support whenever we introduce another map-style operator.
My view is that users should know what they are doing if they add a repartition before a reduce. In the future, the optimizer could catch those cases and merge a custom repartition with a reduce if there are no other consumers of the repartitioning.

Cheers,
Aljoscha

On Thu, Sep 25, 2014 at 10:13 AM, Fabian Hueske <[email protected]> wrote:
> Hi,
>
> I limited explicit repartitioning to map-like operators because Map
> operations should be by far the most common use cases (rebalancing for
> expensive mappers, repartition for partitionMap, ...) and I doubt the
> benefits of repartitioning before other operators.
> Right now, partitioning is implemented as its own runtime operator with a
> noop driver. Hence, there would be a serialization overhead if subsequent
> operators cannot be chained (i.e., join).
> Also, having a partitioning in front of a combinable reduce would make
> things complicated, because the partitioning would have to be moved
> between the combiner and the reducer.
> Overall, I think there are not many benefits to offering repartitioning
> for all operators, and if you still want it, you can use an identity
> mapper.
>
> Does that make sense?
> If not, we can also change the implementation without breaking the API
> and offer repartitioning for all operators. ;-)
>
> Anyway, I'm with you that we should think about how to integrate physical
> operators (partition, sort, combine) into the API.
>
> Cheers, Fabian
>
>
> 2014-09-24 17:00 GMT+02:00 Aljoscha Krettek <[email protected]>:
>
>> Hi,
>> (mostly @fabian) why is the repartition operator special-cased like
>> it is now? You can only do map/filter on partitioned data. Wouldn't it
>> be nice if we had a general repartition operator? Operators that
>> normally do their own repartitioning would notice that the data is
>> already partitioned and use that. The way it is now, the
>> implementation relies heavily on special-case handling.
>>
>> In the long run we could even introduce combine and sort as special
>> operators that users could insert themselves. The optimiser would then
>> also insert these before operations when required. This would
>> simplify/generalise things a bit.
>>
>> What do you think?
>>
>> Cheers,
>> Aljoscha
>>
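[Editorial note appended for context, not part of the original thread.] The "identity mapper" workaround Fabian mentions amounts to hash-partitioning the data and then applying a map that passes every record through unchanged. The sketch below is plain Java, not the Flink API: `partitionByHash` and `mapPartitions` are hypothetical helper names chosen to mirror the concepts discussed, and the partitioning logic (key hash modulo parallelism) is an assumption about how a hash repartition operator typically assigns records to parallel instances.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Illustrative sketch (not Flink code): hash-partition records by a key,
// then run a map function over each partition. Using Function.identity()
// as the map function corresponds to the "identity mapper" trick from
// the thread: it forces a repartition without changing the data.
public class PartitionSketch {

    // Assign each record to one of `parallelism` partitions by the hash
    // of its key, mirroring what a hash repartition does at runtime.
    static <T> List<List<T>> partitionByHash(List<T> records,
                                             Function<T, Object> keyExtractor,
                                             int parallelism) {
        List<List<T>> partitions = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            partitions.add(new ArrayList<>());
        }
        for (T record : records) {
            int p = Math.floorMod(keyExtractor.apply(record).hashCode(),
                                  parallelism);
            partitions.get(p).add(record);
        }
        return partitions;
    }

    // Apply a map function partition-wise, preserving the partitioning.
    static <T, R> List<List<R>> mapPartitions(List<List<T>> partitions,
                                              Function<T, R> mapFn) {
        List<List<R>> result = new ArrayList<>();
        for (List<T> partition : partitions) {
            List<R> mapped = new ArrayList<>();
            for (T record : partition) {
                mapped.add(mapFn.apply(record));
            }
            result.add(mapped);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = List.of("apple", "banana", "cherry", "avocado");
        // Partition by first letter, then apply the identity map.
        List<List<String>> parts = partitionByHash(words, w -> w.charAt(0), 2);
        List<List<String>> same = mapPartitions(parts, Function.identity());
        System.out.println(same);
    }
}
```

In a real dataflow system the cost Fabian points out shows up here too: the identity map is a separate operator, so if the downstream operator cannot be chained to it, records are serialized across the boundary even though the map does nothing.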
