Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Update: I'm now using this ghetto function to partition the RDD I get back when I call textFile() on a gzipped file:

    # Python 2.6
    def partitionRDD(rdd, numPartitions):
        counter = {'a': 0}
        def count_up(x):
            counter['a'] += 1
            return counter['a']
        return (rdd.keyBy(count_up)
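The archived message is cut off after the .keyBy() call. A plausible completion, assuming the chain finishes by hash-partitioning on the counter keys and then discarding them (the trailing .partitionBy() and .values() calls are an assumption, not from the original message):

    # Sketch only: the .partitionBy() and .values() calls are assumed,
    # since the archived message is truncated after .keyBy().
    def partitionRDD(rdd, numPartitions):
        counter = {'a': 0}          # Python 2.6 has no nonlocal, hence the dict
        def count_up(x):
            counter['a'] += 1
            return counter['a']
        return (rdd.keyBy(count_up)             # pair each element with a running count
                   .partitionBy(numPartitions)  # hash-partition on those counter keys
                   .values())                   # drop the synthetic keys again

Note that each task gets its own copy of counter, so the counts restart per input partition; for a gzipped file that doesn't matter, since gzip isn't splittable and the RDD from textFile() arrives as a single partition.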

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
There is a repartition method in pyspark master:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128

On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> Update: I'm now using this ghetto function to partition the RDD I get back when I call
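For anyone finding this thread later, a minimal usage sketch of that repartition() method, mirroring the shell session from the original post (the partition-count check reuses the _jrdd idiom the thread already uses):

    a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
    b = a.repartition(5)       # works on a plain RDD; no (key, value) tuples required
    b._jrdd.splits().size()    # 5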

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Nicholas Chammas
Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you.

On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra m...@clearstorydata.com wrote:
> There is a repartition method in pyspark master:

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-02 Thread Mark Hamstra
Will be in 1.0.0.

On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> Ah, now I see what Aaron was referring to. So I'm guessing we will get this in the next release or two. Thank you.
>
> On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra

PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Just an FYI, it's not obvious from the docs (http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy) that the following code should fail:

    a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
    a._jrdd.splits().size()
    a.count()
    b = a.partitionBy(5)
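For context, partitionBy() hash-partitions an RDD by key, so it expects an RDD of (key, value) tuples; the plain RDD of ints above has no keys to hash, which is why the last line fails. A minimal workaround sketch for 0.9.0, pairing each element with a dummy value (illustrative, not from the thread):

    b = (a.map(lambda x: (x, None))   # wrap each element as a (key, value) pair
          .partitionBy(5)             # now partitionBy has keys to hash
          .keys())                    # unwrap back to the plain elements
    b._jrdd.splits().size()           # 5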

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Aaron Davidson
Hm, yeah, the docs are not clear on this one. The function you're looking for to change the number of partitions on any ol' RDD is repartition(), which is available in master but for some reason doesn't seem to show up in the latest docs. Sorry about that, I also didn't realize partitionBy() had

Re: PySpark RDD.partitionBy() requires an RDD of tuples

2014-04-01 Thread Nicholas Chammas
Hmm, doing help(rdd) in PySpark doesn't show a method called repartition(). Trying rdd.repartition() or rdd.repartition(10) also fails. I'm on 0.9.0.

The approach I'm going with to partition my MappedRDD is to key it by a random int, and then partition it. So something like:

    rdd =
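The message is truncated here. A minimal sketch of the random-key approach it describes, assuming a standard keyBy/partitionBy/values chain (the exact code the author used is not in the archive):

    import random

    def partition_randomly(rdd, numPartitions):
        # Key each element by a random int so the hash partitioner spreads
        # elements roughly evenly across partitions, then drop the keys again.
        return (rdd.keyBy(lambda x: random.randint(0, numPartitions - 1))
                   .partitionBy(numPartitions)
                   .values())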