I would love to hear the answer to this as well.

On Thu, Mar 6, 2014 at 4:09 AM, Manoj Awasthi <awasthi.ma...@gmail.com> wrote:
> Hi All,
>
>
> I have a three-machine cluster. I have two RDDs, each consisting of (K,V)
> pairs. The RDDs have just three keys: 'a', 'b' and 'c'.
>
>     // list1 - List(('a',1), ('b',2), ....
>     val rdd1 = sc.parallelize(list1).groupByKey(new HashPartitioner(3))
>
>     // list2 - List(('a',2), ('b',7), ....
>     val rdd2 = sc.parallelize(list2).groupByKey(new HashPartitioner(3))
>
> By using a HashPartitioner with 3 partitions, I can ensure that each of the
> keys ('a', 'b' and 'c') in each RDD gets partitioned onto a different
> machine in the cluster (based on the hashCode).
>
> The problem is that I cannot deterministically get the same allocation for
> the second RDD (all 'a's from rdd2 going to the same machine that the 'a's
> from the first RDD went to).
>
> Is there a way to achieve this?
>
> Manoj
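
Not a full answer, but one thing that may help while waiting for one: a minimal sketch, assuming that sharing a single partitioner between both RDDs and then combining them with cogroup (or join) is acceptable for the use case. The sample lists, the app name and the local[3] master are placeholders of mine, not taken from the original code. My understanding is that when both inputs already carry an equal partitioner, cogroup can combine matching partitions without another shuffle, but I have not verified that this pins a key to a particular machine.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark releases

object CoPartitionSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration: a local master with 3 threads, not a real cluster
    val conf = new SparkConf().setAppName("co-partition-sketch").setMaster("local[3]")
    val sc = new SparkContext(conf)

    // Hypothetical sample data standing in for list1 and list2
    val list1 = List(('a', 1), ('b', 2), ('c', 3))
    val list2 = List(('a', 2), ('b', 7), ('c', 9))

    // One partitioner instance shared by both RDDs
    val partitioner = new HashPartitioner(3)
    val rdd1 = sc.parallelize(list1).groupByKey(partitioner)
    val rdd2 = sc.parallelize(list2).groupByKey(partitioner)

    // Both RDDs report the same partitioner, so a key has the same
    // partition index in each of them
    println(rdd1.partitioner == rdd2.partitioner)

    // Bring the values for each key from both RDDs together
    val grouped = rdd1.cogroup(rdd2)
    grouped.collect().foreach(println)

    sc.stop()
  }
}
```

This only lines up partition indices across the two RDDs. Whether partition i of rdd1 and partition i of rdd2 are computed on the same machine is still up to the scheduler.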



-- 
Evan Chan
Staff Engineer
e...@ooyala.com
