This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
     new c7ef21a49ac4 [SPARK-51754][PYTHON][DOCS][TESTS] Make `sampleBy` doctest deterministic
c7ef21a49ac4 is described below

commit c7ef21a49ac4a026a74a67b72068f2bb541ab3eb
Author: Ruifeng Zheng <[email protected]>
AuthorDate: Thu Apr 10 09:38:30 2025 +0800

    [SPARK-51754][PYTHON][DOCS][TESTS] Make `sampleBy` doctest deterministic
    
    ### What changes were proposed in this pull request?
    Make the `sampleBy` doctest deterministic by specifying the number of partitions.
    
    ### Why are the changes needed?
    It fails in some environments, e.g. on my local machine (macOS + Python 3.13):
    
    ```
    In [1]:         >>> from pyspark.sql.functions import col
       ...:         >>> dataset = spark.range(0, 100).select((col("id") % 3).alias("key"))
       ...:         >>> sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
       ...:         >>> sampled.groupBy("key").count().orderBy("key").show()
    +---+-----+
    |key|count|
    +---+-----+
    |  0|    2|
    |  1|    6|
    +---+-----+
    
    ```
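    The flakiness arises because the sampled rows depend not only on the seed but also on how `spark.range` splits the ids into partitions, and the default partition count varies by environment. Pinning the partition count (the fourth argument to `spark.range`) makes the draw reproducible. A minimal plain-Python sketch of the idea, using a hypothetical `sample_by` helper (not Spark's actual sampler):

```python
import random

def sample_by(rows, fractions, seed, num_partitions):
    # Toy per-partition stratified sampler (illustrative sketch only,
    # not Spark's implementation): each partition gets its own RNG
    # derived from (seed, partition index), so the result depends on
    # both the seed AND how rows are split into partitions.
    chunk = (len(rows) + num_partitions - 1) // num_partitions
    sampled = []
    for p in range(num_partitions):
        rng = random.Random(seed * 1_000_003 + p)  # per-partition seed
        for row in rows[p * chunk:(p + 1) * chunk]:
            frac = fractions.get(row % 3, 0.0)  # stratum key = id % 3
            if rng.random() < frac:
                sampled.append(row)
    return sampled

rows = list(range(100))
fractions = {0: 0.1, 1: 0.2}

# Same seed and same partition count -> identical sample every run.
a = sample_by(rows, fractions, seed=0, num_partitions=5)
b = sample_by(rows, fractions, seed=0, num_partitions=5)
print(a == b)  # True

# A different partition count re-aligns rows with the RNG draws, so
# the sample (and the per-key counts) can change with the same seed.
c = sample_by(rows, fractions, seed=0, num_partitions=8)
```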
    
    ### Does this PR introduce _any_ user-facing change?
    Minor doc change.
    
    ### How was this patch tested?
    CI and manual tests.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    no
    
    Closes #50547 from zhengruifeng/py_test_sample_by.
    
    Authored-by: Ruifeng Zheng <[email protected]>
    Signed-off-by: Ruifeng Zheng <[email protected]>
---
 python/pyspark/sql/dataframe.py | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index fb34625b71ef..c00c3f484232 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -2127,15 +2127,16 @@ class DataFrame:
         Examples
         --------
         >>> from pyspark.sql.functions import col
-        >>> dataset = spark.range(0, 100).select((col("id") % 3).alias("key"))
+        >>> dataset = spark.range(0, 100, 1, 5).select((col("id") % 3).alias("key"))
         >>> sampled = dataset.sampleBy("key", fractions={0: 0.1, 1: 0.2}, seed=0)
         >>> sampled.groupBy("key").count().orderBy("key").show()
         +---+-----+
         |key|count|
         +---+-----+
-        |  0|    3|
-        |  1|    6|
+        |  0|    4|
+        |  1|    9|
         +---+-----+
+
         >>> dataset.sampleBy(col("key"), fractions={2: 1.0}, seed=0).count()
         33
         """


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
