repartition() will be in 1.0.0.
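A minimal sketch of what that will look like, assuming the method from the
master-branch rdd.py linked downthread ships unchanged in 1.0.0 (the S3 path
here is just a placeholder):

    # gzip is unsplittable, so this RDD starts with a single partition
    rdd = sc.textFile('s3n://some-bucket/some_file.gz')
    # repartition() shuffles the data into the requested number of
    # partitions and works on any RDD; no keying into tuples required
    rdd = rdd.repartition(10)

Unlike partitionBy(), repartition() does not require an RDD of (key, value)
pairs, so the keyBy()/partitionBy()/map() workaround below becomes a single
call.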
On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Ah, now I see what Aaron was referring to. So I'm guessing we will get
> this in the next release or two. Thank you.
>
>
> On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>
>> There is a repartition method in pyspark master:
>> https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128
>>
>>
>> On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Update: I'm now using this ghetto function to partition the RDD I get
>>> back when I call textFile() on a gzipped file:
>>>
>>> # Python 2.6
>>> def partitionRDD(rdd, numPartitions):
>>>     counter = {'a': 0}
>>>     def count_up(x):
>>>         counter['a'] += 1
>>>         return counter['a']
>>>     return (rdd.keyBy(count_up)
>>>             .partitionBy(numPartitions)
>>>             .map(lambda (key, data): data))
>>>
>>> If there's supposed to be a built-in Spark method to do this, I'd love
>>> to learn more about it.
>>>
>>> Nick
>>>
>>>
>>> On Tue, Apr 1, 2014 at 7:59 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Hmm, doing help(rdd) in PySpark doesn't show a method called
>>>> repartition(). Trying rdd.repartition() or rdd.repartition(10) also
>>>> fails. I'm on 0.9.0.
>>>>
>>>> The approach I'm going with to partition my MappedRDD is to key it by
>>>> a random int and then partition it. So something like:
>>>>
>>>> from random import randint
>>>>
>>>> # rdd has 1 partition; minSplits is not actionable due to gzip
>>>> rdd = sc.textFile('s3n://gzipped_file_brah.gz')
>>>> # key the RDD so we can partition it
>>>> keyed_rdd = rdd.keyBy(lambda x: randint(1, 100))
>>>> # partitioned_rdd now has 10 partitions
>>>> partitioned_rdd = keyed_rdd.partitionBy(10)
>>>>
>>>> Are you saying I don't have to do this?
>>>>
>>>> Nick
>>>>
>>>>
>>>> On Tue, Apr 1, 2014 at 7:38 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>>>>
>>>>> Hm, yeah, the docs are not clear on this one. The function you're
>>>>> looking for to change the number of partitions on any ol' RDD is
>>>>> repartition(), which is available in master but for some reason
>>>>> doesn't seem to show up in the latest docs. Sorry about that; I also
>>>>> didn't realize partitionBy() had this behavior from reading the
>>>>> Python docs (though it is consistent with the Scala API, just more
>>>>> type-safe there).
>>>>>
>>>>>
>>>>> On Tue, Apr 1, 2014 at 3:01 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> Just an FYI, it's not obvious from the docs
>>>>>> <http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#partitionBy>
>>>>>> that the following code should fail:
>>>>>>
>>>>>> a = sc.parallelize([1,2,3,4,5,6,7,8,9,10], 2)
>>>>>> a._jrdd.splits().size()
>>>>>> a.count()
>>>>>> b = a.partitionBy(5)
>>>>>> b._jrdd.splits().size()
>>>>>> b.count()
>>>>>>
>>>>>> I figured out from the example that if I generated a key first, like this:
>>>>>>
>>>>>> b = a.map(lambda x: (x, x)).partitionBy(5)
>>>>>>
>>>>>> then all would be well.
>>>>>>
>>>>>> In other words, partitionBy() only works on RDDs of tuples. Is that
>>>>>> correct?
>>>>>>
>>>>>> Nick
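For later readers, a minimal sketch of the tuple requirement discussed
above, assuming the 0.9.x behavior shown in the thread (partitionBy()
hashes the key of each (key, value) pair to choose a partition, so every
element must be unpackable as a pair):

    nums = sc.parallelize(range(1, 11), 2)

    # Fails once an action runs: plain ints cannot be unpacked as (key, value)
    # nums.partitionBy(5).count()

    # Works: turn it into an RDD of tuples first
    pairs = nums.map(lambda x: (x, x))
    b = pairs.partitionBy(5)
    b._jrdd.splits().size()  # 5

This matches the conclusion above: partitionBy() is a pair-RDD operation,
while repartition() (available from 1.0.0) is the general-purpose way to
change the partition count of any RDD.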