Xinrong Meng created SPARK-39076:
------------------------------------

             Summary: Standardize Statistical Functions of pandas API on Spark
                 Key: SPARK-39076
                 URL: https://issues.apache.org/jira/browse/SPARK-39076
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng


Statistical functions are among the most commonly used functions in data 
engineering and data analysis.

Spark and pandas each provide statistical functions, in the contexts of SQL and 
data science respectively.

pandas API on Spark implements the pandas API on top of Apache Spark. Although 
some functions, such as median, may differ semantically because of the high 
cost of big data calculations, we should still try to reach parity at the API 
level.

However, critical parameters of statistical functions, such as `skipna`, are 
missing from the basic objects: DataFrame, Series, and Index.
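
To make concrete what the `skipna` parameter controls, here is a minimal 
pure-Python sketch of pandas-style semantics for a mean. The helper name 
`mean_with_skipna` is hypothetical and is not part of pandas or PySpark; it 
only models the behavior pandas documents for `skipna` (missing values are 
excluded when it is True, and any missing value makes the result NaN when it 
is False).

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing, as pandas does for numeric data."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def mean_with_skipna(values, skipna=True):
    """Hypothetical helper modeling pandas-style `skipna` semantics for a mean."""
    if skipna:
        # Exclude missing values before aggregating.
        kept = [v for v in values if not _is_missing(v)]
    else:
        # Keep everything: a single missing value makes the result NaN.
        if any(_is_missing(v) for v in values):
            return math.nan
        kept = list(values)
    if not kept:
        # Nothing left to average (empty or all-missing input).
        return math.nan
    return sum(kept) / len(kept)


data = [1.0, None, 3.0]
print(mean_with_skipna(data))                # missing value ignored
print(mean_with_skipna(data, skipna=False))  # missing value propagates
```

Without such a parameter, a caller who needs missing values to propagate has 
to detect them separately before aggregating.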

There is an even larger gap between the statistical functions of 
pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In 
addition, test coverage is far from perfect.

With statistical functions standardized, pandas API coverage will increase 
because the missing parameters will be implemented. That should further 
improve user adoption.




--
This message was sent by Atlassian Jira
(v8.20.7#820007)
