Github user gatorsmile commented on a diff in the pull request:
https://github.com/apache/spark/pull/20456#discussion_r165205764
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -667,6 +667,55 @@ def repartition(self, numPartitions, *cols):
else:
raise TypeError("numPartitions should be an int or Column")
+ @since("2.3.0")
+ def repartitionByRange(self, numPartitions, *cols, **kwargs):
+ """
+ Returns a new :class:`DataFrame` partitioned by the given partitioning expressions. The
+ resulting DataFrame is range partitioned.
+
+ ``numPartitions`` can be an int to specify the target number of partitions or a Column.
+ If it is a Column, it will be used as the first partitioning column. If not specified,
+ the default number of partitions is used.
+
+ At least one partition-by expression must be specified.
+ When no explicit sort order is specified, "ascending nulls first" is assumed.
+
+ >>> df.repartitionByRange(2, "age").rdd.getNumPartitions()
+ 2
+ >>> data = df.union(df).repartition(1, "age")
+ >>> data.rdd.getNumPartitions()
+ 1
+ >>> data.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 2|Alice|
+ | 5| Bob|
+ | 2|Alice|
+ | 5| Bob|
+ +---+-----+
+ >>> data = data.repartitionByRange(3, "age")
+ >>> data.show()
+ +---+-----+
+ |age| name|
+ +---+-----+
+ | 2|Alice|
+ | 2|Alice|
+ | 5| Bob|
+ | 5| Bob|
+ +---+-----+
+ >>> data.rdd.getNumPartitions()
+ 3
+ """
+ if isinstance(numPartitions, int):
+ if len(cols) == 0:
+ return ValueError("At least one partition-by expression must be specified.")
+ else:
+ return DataFrame(
+ self._jdf.repartitionByRange(numPartitions, self._jcols(*cols)), self.sql_ctx)
+ else:
--- End diff --
It looks like we are missing the case where `numPartitions` is not provided,
i.e. when the first argument is a column rather than an int.
Could you check the implementation of `repartition` above?
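
For illustration only, the argument handling being asked for might look roughly like the sketch below. It is standalone Python rather than the actual PySpark code: `split_partition_args` is a hypothetical helper name, plain `str` stands in for a column name or `Column`, and `None` stands for "use the default number of partitions". It also assumes the intent is to raise the `ValueError` rather than return it.

```python
def split_partition_args(numPartitions, *cols):
    """Hypothetical helper: normalize (numPartitions, *cols) into (count, cols).

    count is None when the caller did not pass a partition count, meaning
    the default number of partitions should be used.
    """
    if isinstance(numPartitions, int):
        if len(cols) == 0:
            # Assumption: the error should be raised, not returned.
            raise ValueError("At least one partition-by expression must be specified.")
        return numPartitions, cols
    elif isinstance(numPartitions, str):
        # No count was given: the first argument is really the first
        # partitioning column, so fold it into cols.
        return None, (numPartitions,) + cols
    else:
        raise TypeError("numPartitions should be an int, string or Column")


# An explicit count and a column-first call take different branches.
print(split_partition_args(2, "age"))
print(split_partition_args("age", "name"))
```

In the real method, the `None` branch would correspond to calling the JVM-side `repartitionByRange` overload that takes only the partitioning expressions, if that overload is available.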
---