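I think you can randomly reshuffle your elements by emitting a random key and then partitioning on that key (the map itself does not move any data; the shuffle happens when the resulting pair RDD is repartitioned by key, e.g. with partitionBy).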

yourrdd.map { x => (scala.util.Random.nextInt(), x) }

There is obviously a risk that rand() will produce the same sequence of numbers in each partition, so you may need to use mapPartitionsWithIndex first and seed your random generator with the partition id (or derive the random value from a seed based on x).
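To make the idea concrete, here is a minimal sketch of the seeded variant. The helper name shuffleRandomly and its parameters are mine (hypothetical, not part of Spark), and I have not benchmarked it. It assumes a Spark version where the pair-RDD implicits are picked up automatically (1.3 or later; older versions need import org.apache.spark.SparkContext._).

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.Random

// Hypothetical helper: assigns every element to a pseudo-random partition.
def shuffleRandomly[T: ClassTag](rdd: RDD[T], numPartitions: Int, seed: Long): RDD[T] = {
  rdd
    .mapPartitionsWithIndex { (partitionId, elements) =>
      // One generator per input partition, seeded with the partition id so
      // that partitions do not all draw the same sequence of keys.
      val rng = new Random(seed + partitionId)
      // The key is the target partition, drawn uniformly from [0, numPartitions).
      elements.map(x => (rng.nextInt(numPartitions), x))
    }
    // This is the step that actually shuffles: a non-negative Int key k
    // below numPartitions is sent to partition k by HashPartitioner.
    .partitionBy(new HashPartitioner(numPartitions))
    // Drop the random keys again.
    .values
}
```

You would call it as shuffleRandomly(yourrdd, yourrdd.partitions.length, seed), passing a different seed each time you want a different assignment. Seeding explicitly rather than using an unseeded generator should also mean that a recomputed task draws the same keys again, as long as the input partition yields its elements in the same order.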

Guillaume
Hello,

In the context of a machine learning algorithm, I need to be able to
randomly distribute the elements of a large RDD across partitions (i.e.,
essentially assign each element to a random partition). How could I achieve
this? I have tried to call repartition() with the current number of
partitions - but it seems to me that this moves only some of the elements,
and in a deterministic way.

I know this will be an expensive operation but I only need to perform it
every once in a while.

Thanks a lot!







--
eXenSa

        
*Guillaume PITEL, Président*
+33(0)626 222 431

eXenSa S.A.S. <http://www.exensa.com/>
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)184 163 677 / Fax +33(0)972 283 705
