This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 132bb63a897 [SPARK-46016][DOCS][PS] Fix pandas API support list properly
132bb63a897 is described below
commit 132bb63a897f4f4049f34deefc065ed3eac6a90f
Author: Haejoon Lee <[email protected]>
AuthorDate: Fri Nov 24 19:38:31 2023 +0900
[SPARK-46016][DOCS][PS] Fix pandas API support list properly
### What changes were proposed in this pull request?
This PR proposes to fix a critical issue in the [Supported pandas API
documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html)
where many essential APIs such as `DataFrame.max`, `DataFrame.min`,
`DataFrame.mean`, and `DataFrame.median` were incorrectly marked as not
implemented ("N"), as below:
<img width="291" alt="Screenshot 2023-11-24 at 12 37 49 PM"
src="https://github.com/apache/spark/assets/44108233/95c5785c-711c-400c-b2ec-0db034e90fd8">
The root cause of this issue was that the script used to generate the
support list excluded functions inherited from parent classes. For instance,
`CategoricalIndex.max` is supported through inheritance from the `Index`
class, but because it is not defined directly in `CategoricalIndex`, it was
marked as unsupported:
<img width="397" alt="Screenshot 2023-11-24 at 12 30 08 PM"
src="https://github.com/apache/spark/assets/44108233/90e92996-a88a-4a20-bb0c-4909097e2688">
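The inheritance behavior described above can be reproduced with a minimal sketch; the classes here are hypothetical stand-ins that only mirror the naming of the real pandas classes:

```python
import inspect

class Index:
    def max(self):  # implemented on the parent class
        ...

class CategoricalIndex(Index):  # inherits max(), defines nothing of its own
    pass

# inspect.getmembers walks the MRO, so inherited methods are included
members = dict(inspect.getmembers(CategoricalIndex, inspect.isfunction))
print("max" in members)                    # True
# __dict__ only holds attributes defined directly on the class itself
print("max" in CategoricalIndex.__dict__)  # False
```

The old script's extra `m[0] in pd_module.__dict__` check therefore dropped every inherited method, which is exactly why `CategoricalIndex.max` showed up as "N".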
### Why are the changes needed?
The current documentation inaccurately represents the set of supported
pandas APIs, which could significantly hinder user experience and adoption.
Correcting these inaccuracies ensures that the documentation reflects the
true capabilities of Pandas API on Spark, providing users with reliable and
accurate information.
### Does this PR introduce _any_ user-facing change?
No. This PR only updates the documentation to accurately reflect the
current state of supported pandas API.
### How was this patch tested?
Manually built the documentation and verified that the supported pandas API
list is generated correctly, as below:
<img width="299" alt="Screenshot 2023-11-24 at 12 36 31 PM"
src="https://github.com/apache/spark/assets/44108233/a2da0f0b-0973-45cb-b22d-9582bbeb51b5">
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43996 from itholic/fix_supported_api_gen.
Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/pandas/supported_api_gen.py | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/python/pyspark/pandas/supported_api_gen.py b/python/pyspark/pandas/supported_api_gen.py
index a83731db8fc..27d5cd4b37f 100644
--- a/python/pyspark/pandas/supported_api_gen.py
+++ b/python/pyspark/pandas/supported_api_gen.py
@@ -138,23 +138,11 @@ def _create_supported_by_module(
# module not implemented
return {}
- pd_funcs = dict(
- [
- m
- for m in getmembers(pd_module, isfunction)
- if not m[0].startswith("_") and m[0] in pd_module.__dict__
- ]
- )
+ pd_funcs = dict([m for m in getmembers(pd_module, isfunction) if not m[0].startswith("_")])
if not pd_funcs:
return {}
- ps_funcs = dict(
- [
- m
- for m in getmembers(ps_module, isfunction)
- if not m[0].startswith("_") and m[0] in ps_module.__dict__
- ]
- )
+ ps_funcs = dict([m for m in getmembers(ps_module, isfunction) if not m[0].startswith("_")])
return _organize_by_implementation_status(
module_name, pd_funcs, ps_funcs, pd_module_group, ps_module_group
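The corrected filter can be sketched as a standalone helper; `public_functions`, `Base`, and `Child` are hypothetical names for illustration, not part of the PySpark codebase:

```python
import inspect

def public_functions(obj):
    # Collect every public function getmembers can see on a module or
    # class. getmembers walks the MRO, so methods inherited from parent
    # classes are kept; the removed `name in obj.__dict__` check had
    # filtered those out.
    return {
        name: fn
        for name, fn in inspect.getmembers(obj, inspect.isfunction)
        if not name.startswith("_")
    }

class Base:
    def supported(self):
        return True

class Child(Base):  # inherits supported() without redefining it
    pass

print(sorted(public_functions(Child)))  # ['supported']
```

With the old `__dict__` check, `Child` would have reported no public functions at all, even though `supported()` is callable on it.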
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]