Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
tition-where
>
> It recommends defining a custom partitioner and (PairRDD) partitionBy
> method to accomplish this.
>
> Xinh
>
> On Tue, May 10, 2016 at 1:15 PM, Ayman Khalil
> wrote:
>
>> And btw, I'm using the Python API if this makes any difference.
>>
>

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
And btw, I'm using the Python API if this makes any difference.

On Tue, May 10, 2016 at 11:14 PM, Ayman Khalil wrote:
> Hi Don,
>
> This didn't help. My original rdd is already created using 10 partitions.
> As a matter of fact, after trying with rdd.coalesce(10, sh

Re: Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
ng on your RDD size.
>
> -Don
>
> On Tue, May 10, 2016 at 12:38 PM, Ayman Khalil
> wrote:
>
>> Hello,
>>
>> I have 50,000 items parallelized into an RDD with 10 partitions, I would
>> like to evenly split the items over the partitions so:
>> 50,000/10

Evenly balance the number of items in each RDD partition

2016-05-10 Thread Ayman Khalil
Hello, I have 50,000 items parallelized into an RDD with 10 partitions. I would like to split the items evenly over the partitions, so 50,000/10 = 5,000 in each RDD partition. What I get instead is the following (partition index, partition count): [(0, 4096), (1, 5120), (2, 5120), (3, 5120), (4,
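For readers following this thread, here is a minimal sketch of the index-based approach suggested in the replies (a custom partition function used with partitionBy). The helper name `balanced_partition` is hypothetical, and the PySpark wiring in the trailing comment assumes an existing `rdd` with a known item count; it is an illustration under those assumptions, not the poster's eventual solution. The runnable part is plain Python and only checks the arithmetic.

```python
from collections import Counter


def balanced_partition(index, num_items, num_partitions):
    """Map a 0-based item index to a partition id.

    Consecutive indices fill one partition after another, and
    partition sizes differ by at most one item.
    """
    return index * num_partitions // num_items


num_items, num_partitions = 50000, 10

# Count how many indices land in each partition.
sizes = Counter(
    balanced_partition(i, num_items, num_partitions) for i in range(num_items)
)
partition_sizes = sorted(sizes.items())
print(partition_sizes)

# Assumed PySpark wiring (not executed here; `rdd` is hypothetical):
#
#   balanced = (rdd.zipWithIndex()                      # (item, index)
#                  .map(lambda pair: (pair[1], pair[0]))  # (index, item)
#                  .partitionBy(
#                      num_partitions,
#                      lambda i: balanced_partition(i, num_items, num_partitions))
#                  .values())
#   balanced.glom().map(len).collect()  # inspect items per partition
```

Keying each item by its index is what makes the split deterministic; the cost is an extra pass for zipWithIndex and a shuffle for partitionBy.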

Locality aware tree reduction

2016-05-07 Thread Ayman Khalil
Hello, Is there a way to instruct treeReduce() to reduce RDD partitions on the same node locally? In my case, I'm using treeReduce() to reduce map results in parallel. My reduce function is just arithmetically adding map results (i.e. no notion of aggregation by key). As far as I understand, a sh
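To make the question concrete, the following is a plain-Python sketch of the behaviour being asked for: combine all partition results that live on the same node first, then combine one value per node. It does not use Spark and does not describe how treeReduce() schedules its work; the function name `locality_aware_reduce`, the node names, and the layout dictionary are all hypothetical.

```python
from functools import reduce
from operator import add


def locality_aware_reduce(partitions_by_node, op):
    """Reduce in two stages.

    partitions_by_node maps a node name to a list of partitions,
    where each partition is a list of values.
    """
    node_totals = []
    for node, partitions in partitions_by_node.items():
        # Stage 1: reduce each non-empty partition, then merge the
        # partial results that sit on the same node.
        partials = [reduce(op, part) for part in partitions if part]
        if partials:
            node_totals.append(reduce(op, partials))
    # Stage 2: only one value per node has to be combined across nodes.
    return reduce(op, node_totals)


# Hypothetical layout: two nodes, two partitions each.
layout = {
    "node-a": [[1, 2], [3, 4]],
    "node-b": [[5], [6, 7, 8]],
}
total = locality_aware_reduce(layout, add)
print(total)
```

Because the reduce function here is plain addition (associative and commutative), grouping by node changes only where the work happens, not the result.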