This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
new 351a5f8c004 [SPARK-46016][DOCS][PS] Fix pandas API support list properly
351a5f8c004 is described below
commit 351a5f8c004a449013ab25acbcfdd85e9e7868b8
Author: Haejoon Lee <[email protected]>
AuthorDate: Fri Nov 24 19:38:31 2023 +0900
[SPARK-46016][DOCS][PS] Fix pandas API support list properly
### What changes were proposed in this pull request?
This PR proposes to fix a critical issue in the [Supported pandas API
documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html)
where many essential APIs such as `DataFrame.max`, `DataFrame.min`,
`DataFrame.mean`, and `DataFrame.median` were incorrectly marked as not
implemented (shown as "N"), as below:
<img width="291" alt="Screenshot 2023-11-24 at 12 37 49 PM"
src="https://github.com/apache/spark/assets/44108233/95c5785c-711c-400c-b2ec-0db034e90fd8">
The root cause of this issue was that the script used to generate the
support list excluded functions inherited from parent classes. For instance,
`CategoricalIndex.max` is actually supported by inheriting the `Index` class
but was not directly implemented in `CategoricalIndex`, leading to it being
marked as unsupported:
<img width="397" alt="Screenshot 2023-11-24 at 12 30 08 PM"
src="https://github.com/apache/spark/assets/44108233/90e92996-a88a-4a20-bb0c-4909097e2688">
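The failure mode can be reproduced in a few lines. This is a minimal sketch, not the Spark script itself; the toy classes below are hypothetical stand-ins for the real `Index`/`CategoricalIndex` hierarchy. `inspect.getmembers` walks the full MRO and therefore sees inherited methods, while the extra `__dict__` membership check keeps only members defined directly on the class:

```python
from inspect import getmembers, isfunction

# Hypothetical stand-ins for the real pandas classes.
class Index:
    def max(self):
        return "max"

class CategoricalIndex(Index):
    def categories(self):
        return "categories"

# getmembers walks the MRO, so the inherited `max` is included:
all_funcs = {name for name, _ in getmembers(CategoricalIndex, isfunction)
             if not name.startswith("_")}

# The old `__dict__` filter keeps only directly defined members,
# so the inherited (but fully supported) `max` is dropped:
own_funcs = {name for name in all_funcs
             if name in CategoricalIndex.__dict__}

print(all_funcs)  # {'max', 'categories'}
print(own_funcs)  # {'categories'} -> 'max' wrongly looks unsupported
```

Dropping the `__dict__` check, as this patch does, makes the generator count inherited members as supported.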
### Why are the changes needed?
The current documentation inaccurately represents the set of supported
pandas APIs, which could significantly hinder user experience and adoption. By
correcting these inaccuracies, we ensure that the documentation reflects the
true capabilities of Pandas API on Spark, providing users with reliable and
accurate information.
### Does this PR introduce _any_ user-facing change?
No. This PR only updates the documentation to accurately reflect the
current state of supported pandas API.
### How was this patch tested?
Manually built the documentation and verified that the supported pandas API
list is correctly generated, as below:
<img width="299" alt="Screenshot 2023-11-24 at 12 36 31 PM"
src="https://github.com/apache/spark/assets/44108233/a2da0f0b-0973-45cb-b22d-9582bbeb51b5">
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43996 from itholic/fix_supported_api_gen.
Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 132bb63a897f4f4049f34deefc065ed3eac6a90f)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/pandas/supported_api_gen.py | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/python/pyspark/pandas/supported_api_gen.py b/python/pyspark/pandas/supported_api_gen.py
index 06591c5b26a..8c3cdec3671 100644
--- a/python/pyspark/pandas/supported_api_gen.py
+++ b/python/pyspark/pandas/supported_api_gen.py
@@ -138,23 +138,11 @@ def _create_supported_by_module(
         # module not implemented
         return {}
 
-    pd_funcs = dict(
-        [
-            m
-            for m in getmembers(pd_module, isfunction)
-            if not m[0].startswith("_") and m[0] in pd_module.__dict__
-        ]
-    )
+    pd_funcs = dict([m for m in getmembers(pd_module, isfunction) if not m[0].startswith("_")])
     if not pd_funcs:
         return {}
 
-    ps_funcs = dict(
-        [
-            m
-            for m in getmembers(ps_module, isfunction)
-            if not m[0].startswith("_") and m[0] in ps_module.__dict__
-        ]
-    )
+    ps_funcs = dict([m for m in getmembers(ps_module, isfunction) if not m[0].startswith("_")])
 
     return _organize_by_implementation_status(
         module_name, pd_funcs, ps_funcs, pd_module_group, ps_module_group
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]