Re: Memory-efficient successive calls to repartition()

2015-09-08 Thread Aurélien Bellet
grows regularly throughout the execution until no free space is available, despite the call to the GC. Aurelien On 9/8/15 6:22 PM, Aurélien Bellet wrote: Hi, This is what I tried: for i in range(1000): print i data2=data.repartition(50).cache() if (i+1) % 10 == 0

Re: Memory-efficient successive calls to repartition()

2015-09-08 Thread Aurélien Bellet
Aurélien Bellet <aurelien.bel...@telecom-paristech.fr>: Thanks a lot for the useful link and comments Alexis! First of all, the problem occurs without doing anything else in the code (except of course loading my data from HDFS at the beginning) - so it definitely come

Re: Memory-efficient successive calls to repartition()

2015-09-02 Thread Aurélien Bellet
, 2015-09-01 22:48 GMT+08:00 Aurélien Bellet <aurelien.bel...@telecom-paristech.fr>: Dear Alexis, Thanks again for your reply. After reading about checkpointing I have modified my sample code as follows: for i in range(1000): print i data2=data.re

Re: Memory-efficient successive calls to repartition()

2015-09-01 Thread Aurélien Bellet
Dear Alexis, Thanks again for your reply. After reading about checkpointing I have modified my sample code as follows: for i in range(1000): print i data2=data.repartition(50).cache() if (i+1) % 10 == 0: data2.checkpoint() data2.first() # materialize rdd data.unpers
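
The loop quoted in the snippet above is flattened and cut off, so here is a minimal sketch of the pattern it appears to describe. The function name `repartition_loop` and its parameters are hypothetical, and the last two statements of the loop body (unpersisting the previous RDD and rebinding `data`) are an assumption about how the truncated message continues. The sketch assumes `data` is a PySpark RDD and that a checkpoint directory has already been set with `SparkContext.setCheckpointDir`. It only illustrates the structure of the loop and is not a confirmed fix for the memory growth discussed in the thread.

```python
def repartition_loop(data, num_iters=1000, num_partitions=50, checkpoint_every=10):
    """Repeatedly repartition an RDD, checkpointing every few iterations.

    Hypothetical helper: `data` is expected to behave like a PySpark RDD
    (repartition, cache, checkpoint, first, unpersist).
    """
    for i in range(num_iters):
        data2 = data.repartition(num_partitions).cache()
        if (i + 1) % checkpoint_every == 0:
            # checkpoint() has to be requested before the first action on data2
            data2.checkpoint()
        data2.first()     # action that materializes data2 (the RDD must be non-empty)
        data.unpersist()  # assumption: release the previous cached version
        data = data2      # assumption: continue from the new RDD
    return data
```

Checkpointing every `checkpoint_every` iterations truncates the lineage, which otherwise grows by one `repartition` step per iteration.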

Re: Random pairs / RDD order

2015-04-19 Thread Aurélien Bellet
= rdd.sample(true,0.01,42).mapPartitions(scala.util.Random.shuffle) val sample2 = rdd.sample(true,0.01,43).mapPartitions(scala.util.Random.shuffle) ... On Fri, Apr 17, 2015 at 3:05 AM, Aurélien Bellet <aurelien.bel...@telecom-paristech.fr> wrote: Hi Sean, Thanks a lot for your

Re: Random pairs / RDD order

2015-04-17 Thread Aurélien Bellet
Hi Sean, Thanks a lot for your reply. The problem is that I need to sample random *independent* pairs. If I draw two samples and build all n*(n-1) pairs, the resulting pairs are strongly dependent. My current solution is also not satisfactory because some pairs (the closest ones in a partition) have a mu
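
To make the independence requirement concrete, here is a small local sketch in plain Python rather than on an RDD. The function `sample_independent_pairs` is hypothetical and is not the solution discussed in the thread. Each pair is drawn separately by picking two distinct positions uniformly at random, so no pair depends on any other, unlike building all pairs from two fixed samples.

```python
import random


def sample_independent_pairs(items, num_pairs, seed=None):
    """Return num_pairs pairs, each drawn independently of the others.

    Within a pair the two elements come from distinct positions; the same
    element may still appear in several different pairs.
    """
    if len(items) < 2:
        raise ValueError("need at least two items to form a pair")
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Two distinct indices, chosen uniformly without replacement
        i, j = rng.sample(range(len(items)), 2)
        pairs.append((items[i], items[j]))
    return pairs
```

This version assumes the data fits in local memory; a distributed variant would need a way to address elements by index, which this sketch does not cover.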