Haejoon Lee created SPARK-37711:
-----------------------------------
Summary: Create plans for top & frequency for pandas-on-Spark
optimization
Key: SPARK-37711
URL: https://issues.apache.org/jira/browse/SPARK-37711
Project: Spark
Issue Type: Improvement
Components: PySpark
Affects Versions: 3.3.0
Reporter: Haejoon Lee
When invoking DataFrame.describe() in the pandas API on Spark, one Spark job
is run per column to retrieve the `top` and `freq` statistics.
`top` is the most common value in a column, and `freq` is the count of that
most common value.
We should write a utility on the Scala side that returns the key and its
count in a single Spark job, e.g. via Dataset.mapPartitions, summing the
per-key counts across partitions. Such an API is missing in PySpark, so we
would have to write one on the Scala side.
See [https://github.com/apache/spark/pull/34931#discussion_r772220260] for
more details.
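The count-per-partition-then-merge idea above can be sketched in plain Python, without Spark: each "partition" produces a local counter (analogous to Dataset.mapPartitions), the counters are summed, and the single most common key gives `top` and `freq`. This is only an illustrative sketch of the proposed approach, not the actual Scala utility, and the function names are hypothetical.

```python
from collections import Counter
from functools import reduce

def partition_counts(rows):
    # Per-partition step (analogous to Dataset.mapPartitions):
    # count every value seen in this partition.
    return Counter(rows)

def top_and_freq(partitions):
    # Merge the per-partition counters by summing per-key counts,
    # then take the single most common value -- one pass over the
    # data instead of one job per column.
    total = reduce(lambda a, b: a + b,
                   (partition_counts(p) for p in partitions))
    (value, count), = total.most_common(1)
    return value, count

# Example: three "partitions" of a single column.
parts = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
print(top_and_freq(parts))  # -> ('a', 4)
```

In Spark, the per-partition counters would be produced inside mapPartitions and merged with a reduce, so the whole column is summarized in one job.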
--
This message was sent by Atlassian Jira
(v8.20.1#820001)