Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19317#discussion_r144766776
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
 * as in scala.TraversableOnce. The former operation is used for merging values within a
 * partition, and the latter is used for merging values between partitions. To avoid memory
 * allocation, both of these functions are allowed to modify and return their first argument
+ * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
+ * method, it directly returns a map to the driver, rather than a rdd. This will also perform
+ * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff ---
> collect all the `map` directly to driver
Technically it's a shuffle too, and `aggregateByKey` is generally better for the following reasons:
1. The final combine is executed on multiple reducers, which gives better parallelism than doing it on the driver.
2. We should always prefer doing computation on the executors instead of the driver, because the driver is responsible for scheduling and its failure is much more costly to recover from.
3. Using the Spark shuffle is also better for fault tolerance: if one reducer fails, you don't need to rerun all the mappers. See the sketch after this list.
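For contrast, here is a minimal sketch of the recommended `aggregateByKey` path, assuming a toy `(String, Int)` pair RDD and an existing `SparkContext` named `sc` (the data and names are illustrative only, not part of this PR):

```scala
import org.apache.spark.rdd.RDD

// Illustrative toy data; `sc` is assumed to be an existing SparkContext.
val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Both merge steps stay on the executors: seqOp merges values within a
// partition, combOp merges partial results across partitions, and the final
// combine runs in parallel on the reducers through the normal shuffle.
val sums: RDD[(String, Int)] = pairs.aggregateByKey(0)(
  (acc, v) => acc + v, // seqOp: fold a value into the per-partition accumulator
  (a, b) => a + b      // combOp: merge accumulators from different partitions
)
```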
So I'm -1 on this API. If there are special cases where we want to aggregate locally, just call `RDD.map.collect` and do the local aggregation manually, roughly as sketched below.
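A rough sketch of that manual local aggregate, under the same illustrative assumptions as above, pre-aggregating per partition so only small maps travel to the driver:

```scala
import scala.collection.mutable

// Pre-aggregate on the executors so each partition ships one small map.
val perPartition = pairs.mapPartitions { iter =>
  val m = mutable.Map.empty[String, Int]
  iter.foreach { case (k, v) => m(k) = m.getOrElse(k, 0) + v }
  Iterator.single(m)
}

// Merge the per-partition maps on the driver into one local result.
val result: Map[String, Int] =
  perPartition.collect().foldLeft(Map.empty[String, Int]) { (acc, m) =>
    m.foldLeft(acc) { case (a, (k, v)) => a.updated(k, a.getOrElse(k, 0) + v) }
  }
```

This keeps the heavy lifting on the executors and still hands back a local map, without adding a new API for it.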
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]