[jira] [Created] (SPARK-21448) Hi dear guys, I have a question about aggregateByKey of pairrrd.

qihuagao (JIRA) Mon, 17 Jul 2017 18:54:55 -0700

qihuagao created SPARK-21448:
--------------------------------

             Summary: Hi dear guys,  I have a question about aggregateByKey of 
pairrrd.
                 Key: SPARK-21448
                 URL: https://issues.apache.org/jira/browse/SPARK-21448
             Project: Spark
          Issue Type: Question
          Components: Java API
    Affects Versions: 2.0.0
         Environment: Spark 2.0
            Reporter: qihuagao



JavaPairRDD has aggregateByKey, which combines values within each partition 
before shuffling, so it avoids a full shuffle and performs impressively.
The aggregateByKey function requires 3 parameters:
# An initial ‘zero’ value that does not affect the aggregated result.
# A combining function accepting two parameters. The second parameter is merged 
into the first parameter. This function combines/merges values within a 
partition.
# A merging function accepting two parameters. In this case the two 
parameters are merged into one. This step merges values across partitions.
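
To make the three parameters concrete, here is a minimal plain-Java sketch that models the two-stage aggregation on in-memory lists standing in for partitions. It is not Spark code: the class AggregateByKeySketch, the helpers aggregateByKey and entry, and the sample data are all hypothetical, and the sketch only illustrates the semantics listed above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

public class AggregateByKeySketch {

    // Hypothetical helper: builds one (key, value) record.
    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Hypothetical model of aggregateByKey. Each inner list plays the role
    // of one partition; this is an illustration, not Spark's implementation.
    static <K extends Comparable<K>, V, U> Map<K, U> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            U zero,                      // parameter 1: neutral starting value
            BiFunction<U, V, U> seqOp,   // parameter 2: merge a value into an accumulator
            BinaryOperator<U> combOp) {  // parameter 3: merge two accumulators
        Map<K, U> merged = new TreeMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            // Stage 1: within a partition, fold every value into one
            // accumulator per key, starting from the zero value.
            Map<K, U> local = new TreeMap<>();
            for (Map.Entry<K, V> record : partition) {
                U acc = local.getOrDefault(record.getKey(), zero);
                local.put(record.getKey(), seqOp.apply(acc, record.getValue()));
            }
            // Stage 2: across partitions, only the per-key accumulators
            // are combined, not the raw records.
            for (Map.Entry<K, U> partial : local.entrySet()) {
                merged.merge(partial.getKey(), partial.getValue(), combOp);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two made-up partitions of (key, value) records.
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
            Arrays.asList(entry("a", 1), entry("b", 2), entry("a", 3)),
            Arrays.asList(entry("a", 4), entry("b", 5)));

        // Sum per key: zero is 0, and both functions are integer addition.
        Map<String, Integer> sums =
            aggregateByKey(partitions, 0, Integer::sum, Integer::sum);
        System.out.println(sums);
    }
}
```

In this model the zero value is applied once per key in each partition, which is why it has to be neutral for the combining function (0 for a sum).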
For DataFrame, I noticed groupByKey, which can do the same job as 
aggregateByKey but takes no merge functions, so I assume it triggers a full 
shuffle. Is this true? If so, should we have a function for DataFrame with 
performance comparable to aggregateByKey?

Thanks.




