Update: I'm now using this ghetto function to partition the RDD I get back
when I call textFile() on a gzipped file:
# Python 2.6
def partitionRDD(rdd, numPartitions):
    counter = {'a': 0}
    def count_up(x):
        counter['a'] += 1
        return counter['a']
    return (rdd.keyBy(count_up)
               .partitionBy(numPartitions)
               .values())
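As an aside, the counter = {'a': 0} dict is the usual Python 2 workaround for closures: an inner function can't rebind an outer local (nonlocal only exists in Python 3), but it can mutate one. The same pattern outside Spark, as a minimal standalone sketch:

```python
# Python 2-style closure counter: mutate a dict instead of rebinding a
# local variable, since `nonlocal` is unavailable before Python 3.
def make_counter():
    counter = {'a': 0}
    def count_up(x):
        counter['a'] += 1  # mutating the dict is allowed; rebinding would not be
        return counter['a']
    return count_up

c = make_counter()
print([c(item) for item in ['x', 'y', 'z']])  # -> [1, 2, 3]
```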
There is a repartition method in pyspark master:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128
On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Ah, now I see what Aaron was referring to. So I'm guessing we will get this
in the next release or two. Thank you.
On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra m...@clearstorydata.com wrote:
Will be in 1.0.0
On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
Just an FYI, it's not obvious from the docs (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy) that the following code should fail:
a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
a._jrdd.splits().size()  # 2, as requested
a.count()                # 10
b = a.partitionBy(5)     # fails: partitionBy() expects (key, value) pairs
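The reason it fails is that partitionBy() operates on an RDD of (key, value) pairs and routes each pair by hashing its key. Here's a pure-Python sketch of that idea, for illustration only (the function name is made up; this is not PySpark's actual implementation):

```python
# Illustrative hash partitioning over (key, value) pairs -- a sketch of the
# idea behind partitionBy(), not PySpark's internals.
def hash_partition(pairs, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:  # unpacking fails if elements are not pairs
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

pairs = [(i, i * i) for i in range(10)]
parts = hash_partition(pairs, 5)
```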
Hm, yeah, the docs are not clear on this one. The function you're looking
for to change the number of partitions on any ol' RDD is repartition(),
which is available in master but for some reason doesn't seem to show up in
the latest docs. Sorry about that, I also didn't realize partitionBy() had
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition().
Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.
The approach I'm going with to partition my MappedRDD is to key it by a
random int, and then partition it.
So something like:
import random

# key by a random int, partition on those keys, then drop the keys
rdd = (rdd.keyBy(lambda x: random.randint(0, numPartitions - 1))
          .partitionBy(numPartitions)
          .values())
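A quick pure-Python sanity check of the random-key idea: drawing a uniformly random bucket for each element spreads the data roughly evenly, regardless of what the data looks like. This is only a sketch of the concept, not Spark code (the function name and seed are made up for illustration):

```python
import random

# Simulate random-key partitioning: each element lands in a uniformly
# random bucket, giving roughly even partition sizes.
def random_key_partition(items, num_partitions, seed=42):
    rnd = random.Random(seed)  # seeded so the run is reproducible
    buckets = [[] for _ in range(num_partitions)]
    for item in items:
        buckets[rnd.randrange(num_partitions)].append(item)
    return buckets

buckets = random_key_partition(range(10000), 10)
sizes = [len(b) for b in buckets]
```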