This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
new d18659de626c [SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof
d18659de626c is described below
commit d18659de626cc3743e7f6a5dceca0f2a25b006de
Author: Mark Jarvin <[email protected]>
AuthorDate: Fri Apr 12 09:37:19 2024 +0900
[SPARK-47824][PS] Fix nondeterminism in pyspark.pandas.series.asof
### What changes were proposed in this pull request?
Use the monotonically increasing ID column as the sorting condition for
`max_by` instead of a literal string.
### Why are the changes needed?
https://github.com/apache/spark/pull/35191 had an error: the literal
string `"__monotonically_increasing_id__"` was used as the tie-breaker in
`max_by` instead of the actual ID column.
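
To illustrate why a constant tie-breaker is a problem, here is a minimal pure-Python sketch. It is not the Spark implementation: the `ROWS` data, the `max_by` helper, and both `asof_*` functions are hypothetical stand-ins, and "row order" is used as a proxy for the order in which Spark happens to process rows. With a constant key every matching row ties, so the result depends on which row is seen first; with the real ID as the key, the last matching row always wins.

```python
from itertools import permutations

# Hypothetical rows: (row_id, index_value, value). row_id stands in for the
# monotonically increasing ID column.
ROWS = [(0, 10, "a"), (1, 20, "b"), (2, 30, "c"), (3, 40, "d")]


def max_by(rows, key_fn):
    """Return the value of the first row seen with the largest non-None key."""
    best_key, best_value = None, None
    for row in rows:
        key = key_fn(row)
        if key is not None and (best_key is None or key > best_key):
            best_key, best_value = key, row[2]
    return best_value


def asof_literal_key(rows, where):
    # Buggy variant: every matching row gets the same constant key, so all
    # matching rows tie.
    literal = "__monotonically_increasing_id__"
    return max_by(rows, lambda r: literal if r[1] <= where else None)


def asof_id_key(rows, where):
    # Fixed variant: matching rows are ranked by their actual row ID.
    return max_by(rows, lambda r: r[0] if r[1] <= where else None)


def results_over_orderings(fn, where):
    """Collect every result obtained across all possible row orderings."""
    return {fn(list(p), where) for p in permutations(ROWS)}


print(sorted(results_over_orderings(asof_literal_key, 30)))  # ['a', 'b', 'c']
print(sorted(results_over_orderings(asof_id_key, 30)))       # ['c']
```

In this sketch the constant-key variant can return any of the three matching values depending on ordering, while the ID-key variant always returns the value from the last matching row.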
### Does this PR introduce _any_ user-facing change?
Yes. It fixes nondeterminism in `asof`.
### How was this patch tested?
In some circumstances, running
`//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient
to reproduce the nondeterminism.
### Was this patch authored or co-authored using generative AI tooling?
No
Closes #46018 from markj-db/SPARK-47824.
Authored-by: Mark Jarvin <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit a0ccdf27e5ff30817b8f058f08f98d5b44bad2db)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/pandas/series.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 95ca92e78787..b54ae88616fa 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -5910,7 +5910,7 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
# then return monotonically_increasing_id. This will let max by
# to return last index value, which is the behaviour of pandas
else spark_column.isNotNull(),
- monotonically_increasing_id_column,
+ F.col(monotonically_increasing_id_column),
),
)
for index in where