Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19317#discussion_r144766776
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala ---
    @@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
        * as in scala.TraversableOnce. The former operation is used for merging values within a
        * partition, and the latter is used for merging values between partitions. To avoid memory
        * allocation, both of these functions are allowed to modify and return their first argument
    +   * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
    +   * method, it directly returns a map to the driver, rather than a rdd. This will also perform
    +   * the merging locally on each mapper before sending results to a reducer, similarly to a
    --- End diff ---
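
    For reference, a minimal sketch of how the documented method might be used. The method name `aggregateByKeyLocally` and its exact signature are assumptions inferred from the doc text above, modeled on the existing `reduceByKeyLocally` (which also returns a map to the driver rather than an RDD):

    ```scala
    // Hypothetical usage (method name and signature assumed, not confirmed by this diff).
    // `sc` is an existing SparkContext.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    val sumAndCount = pairs.aggregateByKeyLocally((0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: merge a value within a partition
      (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge accumulators between partitions
    )
    // sumAndCount would be a plain Map on the driver, not an RDD.
    ```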
    
    > collect all the `map` directly to driver
    
    Technically it's a shuffle too, and generally `aggregateByKey` is better for the following reasons (see the sketch after the list):
    1. The final combine is executed on multiple reducers, which gives better parallelism than doing it on the driver.
    2. We should always prefer doing computation on executors instead of the driver, because the driver is responsible for scheduling and its failure is expensive to recover from.
    3. Using the Spark shuffle is better for fault tolerance: if one reducer fails, you don't need to rerun all the mappers.
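
    A minimal sketch of that recommended path, using the existing `aggregateByKey` and `collectAsMap` (the data and the (sum, count) accumulator are just placeholders):

    ```scala
    // Shuffle-based aggregation: the final combine runs on the reducers in
    // parallel; the driver only receives the already-aggregated result.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // sc: an existing SparkContext

    val sumAndCount = pairs
      .aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),   // merge a value within a partition
        (a, b)   => (a._1 + b._1, a._2 + b._2)  // merge accumulators between partitions
      )
      .collectAsMap()  // only needed if the driver actually wants the result as a map
    ```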
    
    So I'm -1 on this API. If there are special cases where we want to aggregate locally, just call `RDD.map.collect` and do the local aggregation manually, as sketched below.
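
    A sketch of that manual workaround (assuming the collected data is small enough to fit in driver memory):

    ```scala
    // Collect the (key, value) pairs to the driver and fold them into a map
    // locally. Appropriate only when the collected data is small.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))  // sc: an existing SparkContext

    val summed: Map[String, Int] =
      pairs.collect().foldLeft(Map.empty[String, Int]) { case (m, (k, v)) =>
        m.updated(k, m.getOrElse(k, 0) + v)
      }
    ```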

