This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.5 by this push:
new 351a5f8c004 [SPARK-46016][DOCS][PS] Fix pandas API support list properly
351a5f8c004 is described below
commit 351a5f8c004a449013ab25acbcfdd85e9e7868b8
Author: Haejoon Lee <[email protected]>
AuthorDate: Fri Nov 24 19:38:31 2023 +0900
[SPARK-46016][DOCS][PS] Fix pandas API support list properly
### What changes were proposed in this pull request?
This PR proposes to fix a critical issue in the [Supported pandas API
documentation](https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/supported_pandas_api.html)
where many essential APIs such as `DataFrame.max`, `DataFrame.min`,
`DataFrame.mean`, and `DataFrame.median` were incorrectly marked as not
implemented (shown as "N"), as below:
<img width="291" alt="Screenshot 2023-11-24 at 12 37 49 PM"
src="https://github.com/apache/spark/assets/44108233/95c5785c-711c-400c-b2ec-0db034e90fd8">
The root cause of this issue was that the script used to generate the
support list excluded functions inherited from parent classes. For instance,
`CategoricalIndex.max` is actually supported by inheriting the `Index` class
but was not directly implemented in `CategoricalIndex`, leading to it being
marked as unsupported:
<img width="397" alt="Screenshot 2023-11-24 at 12 30 08 PM"
src="https://github.com/apache/spark/assets/44108233/90e92996-a88a-4a20-bb0c-4909097e2688">
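The failure mode can be reproduced in a few lines. This is a minimal sketch, not the Spark script itself; the toy classes below are hypothetical stand-ins for the real `Index`/`CategoricalIndex` hierarchy. `inspect.getmembers` walks the full MRO and therefore sees inherited methods, while the extra `__dict__` membership check keeps only members defined directly on the class:

```python
from inspect import getmembers, isfunction

# Hypothetical stand-ins for the real pandas classes.
class Index:
    def max(self):
        return "max"

class CategoricalIndex(Index):
    def categories(self):
        return "categories"

# getmembers walks the MRO, so the inherited `max` is included:
all_funcs = {name for name, _ in getmembers(CategoricalIndex, isfunction)
             if not name.startswith("_")}

# The old `__dict__` filter keeps only directly defined members,
# so the inherited (but fully supported) `max` is dropped:
own_funcs = {name for name in all_funcs
             if name in CategoricalIndex.__dict__}

print(all_funcs)  # {'max', 'categories'}
print(own_funcs)  # {'categories'} -> 'max' wrongly looks unsupported
```

Dropping the `__dict__` check, as this patch does, makes the generator count inherited members as supported.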
### Why are the changes needed?
The current documentation inaccurately represents the set of supported
pandas APIs, which could significantly hinder user experience and adoption. By
correcting these inaccuracies, we ensure that the documentation reflects the
true capabilities of Pandas API on Spark, providing users with reliable and
accurate information.
### Does this PR introduce _any_ user-facing change?
No. This PR only updates the documentation to accurately reflect the
current state of supported pandas API.
### How was this patch tested?
Manually built the documentation and verified that the supported pandas API
list is correctly generated, as below:
<img width="299" alt="Screenshot 2023-11-24 at 12 36 31 PM"
src="https://github.com/apache/spark/assets/44108233/a2da0f0b-0973-45cb-b22d-9582bbeb51b5">
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #43996 from itholic/fix_supported_api_gen.
Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 132bb63a897f4f4049f34deefc065ed3eac6a90f)
Signed-off-by: Hyukjin Kwon <[email protected]>
---
python/pyspark/pandas/supported_api_gen.py | 16 ++--------------
1 file changed, 2 insertions(+), 14 deletions(-)
diff --git a/python/pyspark/pandas/supported_api_gen.py b/python/pyspark/pandas/supported_api_gen.py
index 06591c5b26a..8c3cdec3671 100644
--- a/python/pyspark/pandas/supported_api_gen.py
+++ b/python/pyspark/pandas/supported_api_gen.py
@@ -138,23 +138,11 @@ def _create_supported_by_module(
         # module not implemented
         return {}
 
-    pd_funcs = dict(
-        [
-            m
-            for m in getmembers(pd_module, isfunction)
-            if not m[0].startswith("_") and m[0] in pd_module.__dict__
-        ]
-    )
+    pd_funcs = dict([m for m in getmembers(pd_module, isfunction) if not m[0].startswith("_")])
     if not pd_funcs:
         return {}
 
-    ps_funcs = dict(
-        [
-            m
-            for m in getmembers(ps_module, isfunction)
-            if not m[0].startswith("_") and m[0] in ps_module.__dict__
-        ]
-    )
+    ps_funcs = dict([m for m in getmembers(ps_module, isfunction) if not m[0].startswith("_")])
 
     return _organize_by_implementation_status(
         module_name, pd_funcs, ps_funcs, pd_module_group, ps_module_group
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]