[ 
https://issues.apache.org/jira/browse/SPARK-12524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071504#comment-15071504
 ] 

Shushant Arora commented on SPARK-12524:
----------------------------------------

It's not similar to bucketisation. It's for the scenario where you create an RDD 
from an external source, say Kafka, using the low-level consumer, and the Kafka 
partitions are keyed on some field. When you read the Kafka RDD you get a Spark 
RDD with a one-to-one mapping to the Kafka partitions, so you know all records 
with the same key are in the same partition. Now you want to remove duplicates 
in your data, so you group by key locally instead of using the default 
groupByKey (which causes a shuffle). After dedup you perform any other 
transformations on your RDD and dump it somewhere.

Do you see any need for DataFrames here, or do you agree that this kind of 
operation on an opaque paired RDD is better?
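A minimal sketch of the dedup step described above, assuming all records with the same key already sit in one partition (e.g. a Kafka topic keyed by that field, read 1:1 into an RDD). `dedup_partition` is the kind of function you would pass to `rdd.mapPartitions(...)` so no shuffle is needed; the function and variable names here are illustrative, not Spark API:

```python
def dedup_partition(records):
    """Keep the first record seen for each key within a single partition.

    Works on a plain iterator, which is exactly what mapPartitions hands
    to its function, so no shuffle is required as long as all values for
    a key live in the same partition.
    """
    seen = set()
    for key, value in records:
        if key not in seen:
            seen.add(key)
            yield (key, value)

# Simulated partition: duplicates share a key and sit in the same partition.
partition = [("a", 1), ("b", 2), ("a", 1), ("a", 3), ("b", 2)]
deduped = list(dedup_partition(iter(partition)))
```

With Spark this would be invoked as `rdd.mapPartitions(dedup_partition)`, which runs entirely within each partition.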

> Group by key in a pairrdd without any shuffle
> ---------------------------------------------
>
>                 Key: SPARK-12524
>                 URL: https://issues.apache.org/jira/browse/SPARK-12524
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build, Java API
>    Affects Versions: 1.5.2
>            Reporter: Shushant Arora
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> In a PairRDD<K,V>, when all values for the same key are in the same partition 
> and we want to group by key locally, with no reduce/aggregation operation 
> afterwards, just further transformations on the grouped RDD, there is no 
> facility for that. We have to perform a shuffle, which is costly.
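The shuffle-free local grouping requested in the issue can be sketched the same way: collect values per key inside one partition, again as a function suitable for `rdd.mapPartitions(...)`. This is a sketch under the issue's assumption that all values for a key are co-located in a partition; the names are illustrative, not Spark API:

```python
from collections import defaultdict

def group_partition(records):
    """Group values by key within a single partition, without any shuffle.

    Unlike groupByKey, which repartitions data across the cluster, this
    only touches the records of the one partition it is given.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return iter(groups.items())

# Simulated partition where both values for "a" are already co-located.
partition = [("a", 1), ("b", 2), ("a", 3)]
grouped = dict(group_partition(iter(partition)))
```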



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
