Devin Petersohn created SPARK-55664:
---------------------------------------

             Summary: Explore bounded collection types for pandas aggregations
                 Key: SPARK-55664
                 URL: https://issues.apache.org/jira/browse/SPARK-55664
             Project: Spark
          Issue Type: Bug
          Components: Pandas API on Spark, PySpark
    Affects Versions: 4.1.1
            Reporter: Devin Petersohn


From [https://github.com/apache/spark/pull/54370#discussion_r2824804015]

Right now, pandas API on Spark aggregations (like describe) collect full 
intermediate results back to the driver.

The idea is to use bounded/fixed-size summary structures during aggregations on 
the executors. Instead of collecting raw data, each executor would produce a 
compact summary (e.g., counts, min/max, approximate quantiles) that has a known 
upper bound on size. These summaries would then be merged (similar to a 
treeReduce pattern), so the final result sent to the driver is smaller.
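As a rough illustration of the idea (not a proposed implementation; the class and function names here are hypothetical), each partition could build a fixed-size summary and the summaries could then be merged pairwise, so the driver only ever receives one small object:

```python
# Hypothetical sketch of a bounded summary structure. Each instance has a
# constant size regardless of how many rows it has absorbed.
import math
from dataclasses import dataclass


@dataclass
class BoundedSummary:
    count: int = 0
    total: float = 0.0
    min_val: float = math.inf
    max_val: float = -math.inf

    def add(self, x: float) -> None:
        # Absorb one raw value; the summary's size does not grow.
        self.count += 1
        self.total += x
        self.min_val = min(self.min_val, x)
        self.max_val = max(self.max_val, x)

    def merge(self, other: "BoundedSummary") -> "BoundedSummary":
        # Merging two summaries is O(1) and stays bounded in size.
        return BoundedSummary(
            count=self.count + other.count,
            total=self.total + other.total,
            min_val=min(self.min_val, other.min_val),
            max_val=max(self.max_val, other.max_val),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else float("nan")


def tree_merge(summaries):
    # Pairwise merging, mimicking a treeReduce: log(n) merge rounds
    # instead of funneling every partition's result to one place at once.
    while len(summaries) > 1:
        summaries = [
            summaries[i].merge(summaries[i + 1])
            if i + 1 < len(summaries)
            else summaries[i]
            for i in range(0, len(summaries), 2)
        ]
    return summaries[0]
```

Approximate quantiles would need a slightly richer bounded sketch (e.g. a fixed-size quantile digest) rather than the min/max/sum shown here, but the merge-and-reduce shape would be the same.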

This is a more general optimization that could apply beyond describe to other 
Spark aggregations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]