[GitHub] [spark] cloud-fan commented on pull request #28804: [SPARK-31973][SQL] Skip partial aggregates if grouping keys have high cardinality

GitBox Mon, 17 Aug 2020 07:39:41 -0700


cloud-fan commented on pull request #28804:
URL: https://github.com/apache/spark/pull/28804#issuecomment-674921032



   AQE (adaptive query execution) doesn't provide column stats, and column 
stats propagation can become inaccurate when the plan has many operators.
   
   IIUC the current approach is: sample the first 100000 rows, and if 
partially aggregating them can't reduce the data by half (i.e. each key has 
fewer than 2 rows on average), then we skip the partial aggregate.
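
   To make the heuristic concrete, here is a minimal sketch in plain Python. 
It is not the code in this PR: the function name, the parameter names, and 
the choice to keep partial aggregation when fewer rows than the sample size 
are available are all assumptions made for illustration.

```python
from typing import Hashable, Iterable


def should_skip_partial_aggregate(
    keys: Iterable[Hashable],
    sample_size: int = 100_000,       # hypothetical name for the row threshold
    max_output_ratio: float = 0.5,    # hypothetical name for the ratio threshold
) -> bool:
    """Decide whether to skip partial aggregation based on a sampled prefix.

    Counts distinct grouping keys among the first `sample_size` rows. If the
    number of distinct keys (the partial aggregate's output rows) is more than
    `max_output_ratio` of the sampled rows, the partial aggregate did not
    reduce the data by half, so it is skipped.
    """
    distinct_keys = set()
    rows_seen = 0
    for key in keys:
        if rows_seen >= sample_size:
            break
        distinct_keys.add(key)
        rows_seen += 1

    # Assumption: with fewer rows than the sample size, keep partial aggregation.
    if rows_seen < sample_size:
        return False

    return len(distinct_keys) / rows_seen > max_output_ratio
```

   With the defaults above, a key column in which every key is unique gives a 
ratio of 1.0 and skips the partial aggregate, while a column in which every 
key appears exactly twice gives a ratio of 0.5 and keeps it.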
   
   This sounds reasonable, but it's hard to tell how to pick the config values. 
@karuppayya do you have any experience using it in practice?


