beliefer opened a new pull request, #39476:
URL: https://github.com/apache/spark/pull/39476
### What changes were proposed in this pull request?
`test_functions.py` has one test case for `stat.sampleBy`:
```
df = self.spark.createDataFrame([Row(a=i, b=(i % 3)) for i in range(100)])
sampled = df.stat.sampleBy("b", fractions={0: 0.5, 1: 0.5}, seed=0)
self.assertTrue(sampled.count() == 35)
```
The Connect Python API cannot pass this test:
```
Traceback (most recent call last):
  File "/Users/s.singh/personal/spark-oss/python/pyspark/sql/tests/test_functions.py", line 202, in test_sampleby
    self.assertTrue(sampled.count() == 35)
AssertionError: False is not true
```
After investigation, the root cause is that the Connect plan differs from the
PySpark plan, so the sampled result is not deterministic across the two.
The plan produced by PySpark is shown below.
```
== Physical Plan ==
* Filter (2)
+- * Scan ExistingRDD (1)
(1) Scan ExistingRDD [codegen id : 1]
Output [2]: [a#4L, b#5L]
Arguments: [a#4L, b#5L], MapPartitionsRDD[9] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
(2) Filter [codegen id : 1]
Input [2]: [a#4L, b#5L]
Condition : UDF(b#5L, rand(0))
```
The plan produced by Connect is shown below.
```
== Physical Plan ==
LocalTableScan (1)
(1) LocalTableScan
Output [2]: [a#5L, b#6L]
Arguments: [a#5L, b#6L]
```
### Why are the changes needed?
The issue is not directly related to `stat.sampleBy`.
This PR only makes the code follow the PySpark API and updates the comment
explaining why the test is skipped.
### Does this PR introduce _any_ user-facing change?
'No'.
New feature.
### How was this patch tested?
N/A