[
https://issues.apache.org/jira/browse/SPARK-12524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15071504#comment-15071504
]
Shushant Arora commented on SPARK-12524:
----------------------------------------
It's not similar to bucketization. It's for the scenario where you create an RDD
from an external source, say Kafka, using the low-level consumer, and the Kafka
partitions are keyed on some field. When you read the Kafka RDD you get a Spark
RDD with a one-to-one mapping to the Kafka partitions, so you know all records
with the same key are in the same partition. Now you want to remove duplicates
from your data, so you group by key locally instead of using the default
groupByKey (which causes a shuffle). After dedup you perform any other
transformations on your RDD and dump it somewhere.
Do you see any requirement for DataFrames here, or do you agree that this kind of
operation on an opaque paired RDD is better?
> Group by key in a pairrdd without any shuffle
> ---------------------------------------------
>
> Key: SPARK-12524
> URL: https://issues.apache.org/jira/browse/SPARK-12524
> Project: Spark
> Issue Type: Improvement
> Components: Build, Java API
> Affects Versions: 1.5.2
> Reporter: Shushant Arora
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> In a PairRDD<K,V>, when all values for the same key are in the same partition
> and we want to group by key locally, with no reduce/aggregation operation
> afterwards, just further transformations on the grouped RDD, there is no
> facility for that. We have to perform a shuffle, which is costly.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]