itholic opened a new pull request, #42788:
URL: https://github.com/apache/spark/pull/42788
### What changes were proposed in this pull request?
This PR adds warning messages throughout the Pandas API on Spark wherever
the `numeric_only` parameter has a different default value than in Pandas.
The warnings inform users that the default behavior in Pandas API on Spark
differs from traditional Pandas.
### Why are the changes needed?
The default behavior of the `numeric_only` parameter in Pandas API on Spark
is different from traditional Pandas due to the inability to support mixed data
types in a single column.
This can lead to unexpected results for users unfamiliar with this
distinction. A clear warning message helps prevent confusion and mistakes.
### Does this PR introduce _any_ user-facing change?
Yes. Users will now see a warning message when they use functions or methods
that involve the `numeric_only` parameter without explicitly setting it. This
message informs them of the different default behaviors between Pandas API on
Spark and traditional Pandas.
```python
>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": ['a', 'b', 'c']})
>>> psdf.cov()
/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/pandas/frame.py:9051:
UserWarning: In Pandas API on Spark, the 'numeric_only' parameter defaults to
'True' if not set, due to the inability to mix different data types in a single
column. This behavior might differ from the traditional Pandas behavior where
'numeric_only' defaults to 'False'. Please ensure to set this parameter
explicitly to avoid unexpected results.
  warnings.warn(
     A
A  1.0
```
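To illustrate the "warn only when the parameter is not set explicitly" behavior described above, here is a minimal, self-contained sketch that uses only the standard library. It is not the code from this PR: `cov_sketch`, the `_UNSET` sentinel, and the shortened warning text are hypothetical names chosen for illustration, and the function only reports which columns would be used rather than computing a covariance.

```python
import warnings

# Hypothetical sentinel meaning "the caller did not pass numeric_only".
_UNSET = object()


def cov_sketch(columns, numeric_only=_UNSET):
    """Hypothetical stand-in for a method that takes `numeric_only`.

    `columns` maps a column name to a list of values. Returns the names of
    the columns that would take part in the computation.
    """
    if numeric_only is _UNSET:
        # Warn only when the caller relied on the default value.
        warnings.warn(
            "'numeric_only' was not set; defaulting to True.",
            UserWarning,
            stacklevel=2,
        )
        numeric_only = True
    if not numeric_only:
        return list(columns)
    # Keep only the columns whose values are all numeric (bool excluded).
    return [
        name
        for name, values in columns.items()
        if all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
    ]


data = {"A": [1, 2, 3], "B": ["a", "b", "c"]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = cov_sketch(data)                     # default used: warns
    explicit = cov_sketch(data, numeric_only=True)  # set explicitly: silent

print(implicit, explicit, len(caught))  # ['A'] ['A'] 1
```

Both calls select the same column, but only the call that leaves `numeric_only` unset emits a `UserWarning`, which matches the advice in the warning message to set the parameter explicitly.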
### How was this patch tested?
Manually tested. The existing CI should pass.
### Was this patch authored or co-authored using generative AI tooling?
No.