Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file:

    # Python 2.6
    def partitionRDD(rdd, numPartitions):
        counter = {'a': 0}
        def count_up(x):
            counter['a'] += 1
            return counter['a']
        return (rdd.keyBy(count_up)
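The archived message is cut off after the .keyBy() call. A plausible completion, assuming the chain finishes by hash-partitioning on the counter keys and then discarding them (the trailing .partitionBy() and .values() calls are an assumption, not from the original message):

    # Sketch only: the .partitionBy() and .values() calls are assumed,
    # since the archived message is truncated after .keyBy().
    def partitionRDD(rdd, numPartitions):
        counter = {'a': 0}          # Python 2.6 has no nonlocal, hence the dict
        def count_up(x):
            counter['a'] += 1
            return counter['a']
        return (rdd.keyBy(count_up)             # pair each element with a running count
                   .partitionBy(numPartitions)  # hash-partition on those counter keys
                   .values())                   # drop the synthetic keys again

Note that each task gets its own copy of counter, so the counts restart per input partition; for a gzipped file that doesn't matter, since gzip isn't splittable and the RDD from textFile() arrives as a single partition.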

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
There is a repartition method in pyspark master:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128

On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> Update: I'm now using this ghetto function to partition the RDD I get back when I call
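For anyone finding this thread later, a minimal usage sketch of that repartition() method, mirroring the shell session from the original post (the partition-count check reuses the _jrdd idiom the thread already uses):

    a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
    b = a.repartition(5)       # works on a plain RDD; no (key, value) tuples required
    b._jrdd.splits().size()    # 5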

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you.

On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra m...@clearstorydata.com wrote:
> There is a repartition method in pyspark master:

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
Will be in 1.0.0.

On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you.
>
> On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra

PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy) that the following code should fail:

    a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
    a._jrdd.splits().size()
    a.count()
    b = a.partitionBy(5)
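For context, partitionBy() hash-partitions an RDD by key, so it expects an RDD of (key, value) tuples; the plain RDD of ints above has no keys to hash, which is why the last line fails. A minimal workaround sketch for 0.9.0, pairing each element with a dummy value (illustrative, not from the thread):

    b = (a.map(lambda x: (x, None))   # wrap each element as a (key, value) pair
          .partitionBy(5)             # now partitionBy has keys to hash
          .keys())                    # unwrap back to the plain elements
    b._jrdd.splits().size()           # 5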

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is repartition(), which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that, I also didn't realize partitionBy() had

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition(). Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.

The approach I'm going with to partition my MappedRDD is to key it by a random int, and then partition it. So something like:

    rdd =
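The message is truncated here. A minimal sketch of the random-key approach it describes, assuming a standard keyBy/partitionBy/values chain (the exact code the author used is not in the archive):

    import random

    def partition_randomly(rdd, numPartitions):
        # Key each element by a random int so the hash partitioner spreads
        # elements roughly evenly across partitions, then drop the keys again.
        return (rdd.keyBy(lambda x: random.randint(0, numPartitions - 1))
                   .partitionBy(numPartitions)
                   .values())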