Haejoon Lee created SPARK-37711:
-----------------------------------

             Summary: Create plans for top & frequency for pandas-on-Spark 
optimization
                 Key: SPARK-37711
                 URL: https://issues.apache.org/jira/browse/SPARK-37711
             Project: Spark
          Issue Type: Improvement
          Components: PySpark
    Affects Versions: 3.3.0
            Reporter: Haejoon Lee


When invoking DataFrame.describe() in the pandas API on Spark, multiple 
Spark jobs are run, one per column, to retrieve the `top` and `freq` 
statistics.

Top is the most common value in a column, and freq is the count of that most 
common value.

We should write a utility on the Scala side that returns the key and value 
(count) in one Spark job, e.g. via Dataset.mapPartitions, summing the counts 
per key. Such an API is missing in PySpark, so we would have to write one on 
the Scala side.
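The single-job idea above can be sketched in plain Python (no Spark required): each "partition" emits per-column value counts, the counts are merged by key, and top/freq for every column falls out of one pass instead of one job per column. The function names and the partition layout are illustrative, not the actual Spark or PySpark API.

```python
from collections import Counter

def count_partition(rows, columns):
    """Emit (column, value) -> count pairs for one partition,
    analogous to what each mapPartitions task would produce."""
    counts = Counter()
    for row in rows:
        for col, value in zip(columns, row):
            counts[(col, value)] += 1
    return counts

def top_and_freq(partitions, columns):
    """Merge per-partition counts and pick the most common value
    (top) and its count (freq) for every column in a single pass."""
    merged = Counter()
    for part in partitions:
        merged.update(count_partition(part, columns))
    result = {}
    for (col, value), count in merged.items():
        # keep the highest count seen so far for each column
        if col not in result or count > result[col][1]:
            result[col] = (value, count)
    return result  # {column: (top, freq)}

# Two hypothetical partitions of a two-column dataset.
partitions = [
    [("a", 1), ("b", 1)],
    [("a", 2), ("a", 1)],
]
print(top_and_freq(partitions, ["x", "y"]))
```

In real Spark the per-partition counting would run inside Dataset.mapPartitions and the merge step would be a reduce/groupBy over the emitted (column, value, count) triples, so the whole computation stays within one job.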

See the [https://github.com/apache/spark/pull/34931#discussion_r772220260] for 
more detail.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
