[GitHub] [spark] itholic opened a new pull request, #42788: [SPARK-43291][PS] Generate proper warning on different behavior with `numeric_only`

via GitHub Sun, 03 Sep 2023 20:09:09 -0700


itholic opened a new pull request, #42788:
URL: https://github.com/apache/spark/pull/42788


   ### What changes were proposed in this pull request?
   
   This PR adds warning messages throughout the Pandas API on Spark wherever 
the default value of the `numeric_only` parameter differs from the Pandas default.
   
   This is to inform users of the different default behavior in Pandas API on 
Spark compared to traditional Pandas.
   
   ### Why are the changes needed?
   
   The default behavior of the `numeric_only` parameter in Pandas API on Spark 
is different from traditional Pandas due to the inability to support mixed data 
types in a single column.
   
   This could potentially lead to unexpected results for users unfamiliar with 
this distinction. By providing a clear warning message, we aim to prevent 
confusion and potential mistakes.
   
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. Users will now see a warning message when they use functions or methods 
that involve the `numeric_only` parameter without explicitly setting it. This 
message informs them of the different default behaviors between Pandas API on 
Spark and traditional Pandas.
   
   ```python
   >>> import pyspark.pandas as ps
   >>> psdf = ps.DataFrame({"A": [1, 2, 3], "B": ['a', 'b', 'c']})
   >>> psdf.cov()
   
/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/pandas/frame.py:9051: 
UserWarning: In Pandas API on Spark, the 'numeric_only' parameter defaults to 
'True' if not set, due to the inability to mix different data types in a single 
column. This behavior might differ from the traditional Pandas behavior where 
'numeric_only' defaults to 'False'. Please ensure to set this parameter 
explicitly to avoid unexpected results.
     warnings.warn(
        A
   A  1.0
   ```
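
The "warn only when the parameter is left unset" behavior described above can be sketched in plain Python. This is a minimal illustration, not the code added by this PR: `mean_by_column` is a hypothetical helper, and using `None` as the "not set" sentinel is an assumption about how such a check might be written. The warning text is abbreviated from the message shown above.

```python
import warnings
from typing import Dict, List, Optional


def mean_by_column(
    columns: Dict[str, List[object]], numeric_only: Optional[bool] = None
) -> Dict[str, float]:
    """Hypothetical helper: average each column, warning if the default is used."""
    if numeric_only is None:
        # None is a sentinel meaning "the caller did not choose a value".
        warnings.warn(
            "'numeric_only' defaults to 'True' if not set. "
            "Please set this parameter explicitly to avoid unexpected results.",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        numeric_only = True

    result: Dict[str, float] = {}
    for name, values in columns.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        if not is_numeric:
            if numeric_only:
                continue  # drop non-numeric columns, as numeric_only=True implies
            raise TypeError(f"column {name!r} is not numeric")
        result[name] = sum(values) / len(values)
    return result


data = {"A": [1, 2, 3], "B": ["a", "b", "c"]}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = mean_by_column(data)                     # warns: default used
    explicit = mean_by_column(data, numeric_only=True)  # silent: value chosen
```

Under these assumptions, passing `numeric_only` explicitly gives the same result as the default but without the warning, which is the remedy the message recommends.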
   
   
   ### How was this patch tested?
   Manually tested. The existing CI should pass.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


