Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19317#discussion_r144767175
--- Diff: core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
---
@@ -180,6 +180,56 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
   * as in scala.TraversableOnce. The former operation is used for merging values within a
   * partition, and the latter is used for merging values between partitions. To avoid memory
   * allocation, both of these functions are allowed to modify and return their first argument
  + * instead of creating a new U. This method is different from the ordinary "aggregateByKey"
  + * method, it directly returns a map to the driver, rather than a rdd. This will also perform
  + * the merging locally on each mapper before sending results to a reducer, similarly to a
--- End diff --
BTW, `collect all the map directly to driver` may easily OOM the driver, while
shuffling to multiple reducers spreads out the memory pressure. Even if we
shuffle to only one reducer, it is still better, as executors usually have more
memory than the driver.
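The concern can be sketched without Spark at all: mapper-side (per-partition) combining keeps each intermediate map bounded by that partition's distinct keys, but a driver-side merge must hold every distinct key across all partitions at once. A minimal Scala 2.13 sketch, where the partition data and function names are illustrative and not Spark's API:

```scala
object AggregateLocallySketch {
  type Part = Seq[(String, Int)]

  // Mapper-side combine: builds one small map per partition,
  // bounded by that partition's distinct keys.
  def combinePartition(p: Part): Map[String, Int] =
    p.groupMapReduce(_._1)(_._2)(_ + _)

  // Driver-side merge: the resulting map holds EVERY distinct key
  // at once, which is the OOM risk the comment points out.
  def mergeAll(ms: Seq[Map[String, Int]]): Map[String, Int] =
    ms.reduce { (m1, m2) =>
      m2.foldLeft(m1) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, 0) + v)
      }
    }

  def main(args: Array[String]): Unit = {
    val partitions: Seq[Part] = Seq(
      Seq("a" -> 1, "b" -> 2, "a" -> 3),
      Seq("b" -> 4, "c" -> 5)
    )
    println(mergeAll(partitions.map(combinePartition)))
    // Map(a -> 4, b -> 6, c -> 5)
  }
}
```

Shuffling to reducers instead partitions the key space, so no single JVM ever materializes the full merged map.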
---