Sougata Pandit created SPARK-56286:
--------------------------------------
Summary: Add DataFrame.dataQuality API for column profiling in PySpark
Key: SPARK-56286
URL: https://issues.apache.org/jira/browse/SPARK-56286
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 4.2.0
Reporter: Sougata Pandit

This improvement proposes a new PySpark DataFrame API, `dataQuality()`, for
exploratory dataset profiling.

The API would return a DataFrame with one row per input column and one
synthetic dataset-level row. It would include metrics such as row count, column
count, total cells, non-null count, null count, null ratio, distinct count,
min, max, and mode. For numeric columns, it would also include mean, standard
deviation, and median.

This is intended to simplify data quality inspection workflows that currently
require users to compose multiple custom aggregations manually.
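
To make the proposed output shape concrete, here is a minimal pure-Python
sketch of the metrics listed above, computed over a list of dict rows. It is
not the Spark implementation and does not use any PySpark API. The helper name
`profile_rows`, the `__dataset__` label for the synthetic row, the field names,
and the assignment of row count, column count, and total cells to the
dataset-level row are all assumptions made for illustration; the issue does not
specify them.

```python
import statistics
from collections import Counter


def profile_rows(rows, columns):
    """Hypothetical helper: one report entry per column plus one dataset-level entry."""
    row_count = len(rows)
    # Synthetic dataset-level row; the "__dataset__" label is an assumption.
    report = [{
        "column": "__dataset__",
        "row_count": row_count,
        "column_count": len(columns),
        "total_cells": row_count * len(columns),
    }]
    for name in columns:
        # Treat None (or a missing key) as null.
        present = [row.get(name) for row in rows if row.get(name) is not None]
        null_count = row_count - len(present)
        entry = {
            "column": name,
            "non_null_count": len(present),
            "null_count": null_count,
            "null_ratio": null_count / row_count if row_count else None,
            "distinct_count": len(set(present)),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            # Ties are broken by first occurrence; the issue does not say how
            # the proposed API would break them.
            "mode": Counter(present).most_common(1)[0][0] if present else None,
        }
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            entry["mean"] = statistics.mean(present)
            # Sample standard deviation; the issue does not say which variant
            # the proposed API would report.
            entry["stddev"] = statistics.stdev(present) if len(present) > 1 else None
            entry["median"] = statistics.median(present)
        report.append(entry)
    return report


# Made-up sample data for illustration only.
rows = [
    {"id": 1, "city": "Pune", "score": 10.0},
    {"id": 2, "city": "Pune", "score": None},
    {"id": 3, "city": None, "score": 20.0},
    {"id": 4, "city": "Kochi", "score": 30.0},
]
report = profile_rows(rows, ["id", "city", "score"])
for entry in report:
    print(entry)
```

In Spark today, each of these fields would typically be a separate aggregate
expression per column, which is the manual composition the proposal aims to
replace.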