Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19317#discussion_r144767175
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
  + * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
  + * method, it directly returns a map to the driver, rather than a rdd. This will also perform
  + * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --
BTW, `collect all the map directly to driver` may easily OOM the driver, while
shuffling to multiple reducers spreads out the memory pressure. Even if we
shuffle to only one reducer, it is still better, as executors usually have more
memory than the driver.
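The concern can be sketched without Spark at all: mapper-side (per-partition) combining keeps each intermediate map bounded by that partition's distinct keys, but a driver-side merge must hold every distinct key across all partitions at once. A minimal Scala 2.13 sketch, where the partition data and function names are illustrative and not Spark's API:

```scala
object AggregateLocallySketch {
  type Part = Seq[(String, Int)]

  // Mapper-side combine: builds one small map per partition,
  // bounded by that partition's distinct keys.
  def combinePartition(p: Part): Map[String, Int] =
    p.groupMapReduce(_._1)(_._2)(_ + _)

  // Driver-side merge: the resulting map holds EVERY distinct key
  // at once, which is the OOM risk the comment points out.
  def mergeAll(ms: Seq[Map[String, Int]]): Map[String, Int] =
    ms.reduce { (m1, m2) =>
      m2.foldLeft(m1) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, 0) + v)
      }
    }

  def main(args: Array[String]): Unit = {
    val partitions: Seq[Part] = Seq(
      Seq("a" -> 1, "b" -> 2, "a" -> 3),
      Seq("b" -> 4, "c" -> 5)
    )
    println(mergeAll(partitions.map(combinePartition)))
    // Map(a -> 4, b -> 6, c -> 5)
  }
}
```

Shuffling to reducers instead partitions the key space, so no single JVM ever materializes the full merged map.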
---