This is an automated email from the ASF dual-hosted git repository.
ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new c7ef21a49ac4 [SPARK-51754][PYTHON][DOCS][TESTS] Make `sampleBy` doctest deterministic
c7ef21a49ac4 is described below
commit c7ef21a49ac4a026a74a67b72068f2bb541ab3eb
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Apr 10 09:38:30 2025 +0800
[SPARK-51754][PYTHON][DOCS][TESTS] Make `sampleBy` doctest deterministic
### What changes were proposed in this pull request?
Make the `sampleBy` doctest deterministic by explicitly specifying the number of partitions in `spark.range`.
### Why are the changes needed?
It fails in some environments, e.g. locally (macOS + Python 3.13):
```
In [1]: >>> from pyspark.sql.functions import col
   ...: >>> dataset = spark.range(0, 100).select((col("id") % 3).alias("key"))
   ...: >>> sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
   ...: >>> sampled.groupBy("key").count().orderBy("key").show()
+---+-----+
|key|count|
+---+-----+
| 0| 2|
| 1| 6|
+---+-----+
```
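The instability comes from the fact that the random draws a row sees depend on which partition it lands in, so when the doctest leaves the partition count to the environment's default parallelism, the sampled counts vary across machines. A minimal pure-Python sketch of this effect, with a hypothetical per-partition seeding scheme (not Spark's actual one):

```python
import random

def sample_by(keys, fractions, seed, num_partitions):
    """Stratified Bernoulli sampling with one RNG stream per partition.

    Simplified model: each partition derives its own RNG from the seed and
    its partition index, mirroring the idea (not the exact scheme) behind
    Spark's per-partition samplers.
    """
    sampled = []
    # Split keys into contiguous chunks, like spark.range(0, 100, 1, n).
    per_part = (len(keys) + num_partitions - 1) // num_partitions
    for pid in range(num_partitions):
        rng = random.Random(seed * 1_000_003 + pid)  # partition-specific stream
        for key in keys[pid * per_part:(pid + 1) * per_part]:
            if rng.random() < fractions.get(key, 0.0):
                sampled.append(key)
    return sampled

keys = [i % 3 for i in range(100)]
fractions = {0: 0.1, 1: 0.2}
# Same seed and same partitioning -> identical sample every run; changing
# only the partition count generally shifts the per-key counts.
five = sample_by(keys, fractions, seed=0, num_partitions=5)
assert five == sample_by(keys, fractions, seed=0, num_partitions=5)
```

This is why pinning the fourth argument of `spark.range` (the number of partitions) makes the doctest output reproducible across environments.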
### Does this PR introduce _any_ user-facing change?
No, only a minor doc change.
### How was this patch tested?
CI and manual testing.
### Was this patch authored or co-authored using generative AI tooling?
no
Closes #50547 from zhengruifeng/py_test_sample_by.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
---
python/pyspark/sql/dataframe.py | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index fb34625b71ef..c00c3f484232 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -2127,15 +2127,16 @@ class DataFrame:
Examples
--------
>>> from pyspark.sql.functions import col
- >>> dataset = spark.range(0, 100).select((col("id") % 3).alias("key"))
+ >>> dataset = spark.range(0, 100, 1, 5).select((col("id") % 3).alias("key"))
>>> sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
>>> sampled.groupBy("key").count().orderBy("key").show()
+---+-----+
|key|count|
+---+-----+
- | 0| 3|
- | 1| 6|
+ | 0| 4|
+ | 1| 9|
+---+-----+
+
>>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
33
"""
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]