Xinrong Meng created SPARK-39076:
------------------------------------
Summary: Standardize Statistical Functions of pandas API on Spark
Key: SPARK-39076
URL: https://issues.apache.org/jira/browse/SPARK-39076
Project: Spark
Issue Type: Umbrella
Components: PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng
Statistical functions are among the most commonly used functions in Data
Engineering and Data Analysis.
Spark and pandas provide statistical functions in the context of SQL and Data
Science, respectively.
pandas API on Spark implements the pandas API on top of Apache Spark. Although
certain functions, for example median, may differ semantically because of the
high cost of big data calculations, we should still try to reach parity at the
API level.
However, critical parameters of statistical functions, such as `skipna`, are
missing from the basic objects: DataFrame, Series, and Index.
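
For reference, here is a minimal sketch of what `skipna` controls in plain
pandas. It only illustrates the pandas behaviour that the pandas-on-Spark
objects would be expected to match once the parameter is supported. The sample
values and variable names are made up for illustration and are not taken from
this ticket.

```python
import math

import pandas as pd

# Reference behaviour in plain pandas: a Series with one missing value.
s = pd.Series([1.0, None, 3.0])

# skipna=True (the pandas default) ignores the missing value.
mean_skip = s.mean()
sum_skip = s.sum(skipna=True)

# skipna=False propagates the missing value, so the result is NaN.
mean_keep = s.mean(skipna=False)
sum_keep = s.sum(skipna=False)

print(mean_skip, sum_skip, mean_keep, sum_keep)  # 2.0 4.0 nan nan
```

Parity at the API level would mean that the same calls on the corresponding
pandas-on-Spark objects accept `skipna` and return the same results.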
There is an even larger gap between the statistical functions of
pandas-on-Spark GroupBy objects and those of pandas GroupBy objects. In
addition, test coverage is far from complete.
With statistical functions standardized, pandas API coverage will increase
because the missing parameters will be implemented. That should further
improve user adoption.