Decrease shuffle in TreeAggregate with coalesce ?

Guillaume Pitel Wed, 27 Apr 2016 04:47:56 -0700

Hi,

I've been looking at the code of RDD.treeAggregate, because we've seen a huge performance drop between 1.5.2 and 1.6.1 on a treeReduce. I think the treeAggregate code hasn't changed, so my message is not about the performance drop, but a more general remark about treeAggregate.

In treeAggregate, after the aggregate is applied inside the original partitions, we enter the tree loop:

    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
      numPartitions /= scale
      val curNumPartitions = numPartitions
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
        (i, iter) => iter.map((i % curNumPartitions, _))
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
    }
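To get a feel for how many shuffle rounds the loop quoted above performs, here is a minimal stand-alone sketch in plain Scala (no Spark needed) that replays only the loop condition and the integer division. The names TreeLevels and partitionCounts are made up for illustration, and scale is simply passed in as a parameter rather than derived from a tree depth.

```scala
object TreeLevels {
  /** Replays the while-loop condition quoted above and returns the
    * partition count after each level. Each entry corresponds to one
    * reduceByKey round, i.e. one shuffle.
    * Assumption of this sketch: scale is at least 2.
    */
  def partitionCounts(initialPartitions: Int, scale: Int): List[Int] = {
    require(scale >= 2, "this sketch assumes scale >= 2")
    var numPartitions = initialPartitions
    val counts = scala.collection.mutable.ListBuffer.empty[Int]
    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
      numPartitions /= scale
      counts += numPartitions
    }
    counts.toList
  }
}
```

For example, TreeLevels.partitionCounts(1000, 10) yields List(100, 10): two levels, hence two shuffles, before the final reduce.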

The two lines where each partition is keyed by its index modulo curNumPartitions (the mapPartitionsWithIndex call) and then merged with reduceByKey seem suboptimal to me. They incur a huge shuffle cost, while a simple coalesce followed by a partition-level aggregation would probably do the job just as well.
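To make the comparison concrete, here is a small model of one tree level in plain Scala, with no Spark involved. Each element of the input Vector stands for the single partially aggregated value held by one partition. The names MergeStrategies, moduloLevel and coalesceLevel are hypothetical, and modelling a shuffle-free coalesce as "merge contiguous runs of partitions" is a simplifying assumption: the grouping Spark actually picks also depends on data locality. The model says nothing about cost. It only illustrates that the two strategies group partitions differently yet leave the same overall result, provided combOp is associative and commutative.

```scala
object MergeStrategies {
  /** Models the existing level: partition i is sent to key i % target,
    * then all values sharing a key are combined (the reduceByKey step).
    */
  def moduloLevel[U](parts: Vector[U], target: Int)(combOp: (U, U) => U): Vector[U] = {
    require(parts.nonEmpty && target > 0)
    parts.zipWithIndex
      .groupBy { case (_, i) => i % target }
      .toVector
      .sortBy(_._1)
      .map { case (_, group) => group.map(_._1).reduce(combOp) }
  }

  /** Models the proposal: merge contiguous runs of partitions into at most
    * `target` groups, then combine each group locally.
    */
  def coalesceLevel[U](parts: Vector[U], target: Int)(combOp: (U, U) => U): Vector[U] = {
    require(parts.nonEmpty && target > 0)
    val groupSize = math.ceil(parts.size.toDouble / target).toInt
    parts.grouped(groupSize).map(_.reduce(combOp)).toVector
  }
}
```

With eight partitions holding 1 to 8, a target of 2 and addition as combOp, moduloLevel gives Vector(16, 20) and coalesceLevel gives Vector(10, 26). Both sum to 36.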


Have I missed something that makes this reshuffle necessary?

Best regards
Guillaume Pitel
