Andy Grove created SPARK-39991:
----------------------------------
Summary: AQE should use available column statistics from completed
query stages
Key: SPARK-39991
URL: https://issues.apache.org/jira/browse/SPARK-39991
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.3.0
Reporter: Andy Grove
In QueryStageExec.computeStats we copy partial statistics from materialized
query stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls
ShuffleExchangeLike#runtimeStatistics or
BroadcastExchangeLike#runtimeStatistics.
Only dataSize and numOutputRows are copied into the new Statistics object:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  Some(Statistics(dataSize, numOutputRows, isRuntime = true))
} else {
  None
}
{code}
I would like to also copy over the column statistics stored in
Statistics.attributeStats so that they can be fed back into the logical plan
optimization phase. This is a small change as shown below:
{code:scala}
def computeStats(): Option[Statistics] = if (isMaterialized) {
  val runtimeStats = getRuntimeStatistics
  val dataSize = runtimeStats.sizeInBytes.max(0)
  val numOutputRows = runtimeStats.rowCount.map(_.max(0))
  val attributeStats = runtimeStats.attributeStats
  Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
} else {
  None
}
{code}
The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do
not currently provide such column statistics, but other custom implementations
can.
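
To illustrate how a custom implementation might benefit, here is a rough, self-contained sketch. It does not use the real Spark classes: ColumnStat, Statistics, ExchangeLike, ColumnStatsExchange and QueryStage are simplified, hypothetical stand-ins (for example, attributeStats is keyed by column name here instead of using Spark's AttributeMap), and the numbers are made up. It only shows that column statistics reported by an exchange would survive computeStats once the proposed change is applied.

```scala
// Simplified stand-ins for Spark's ColumnStat and Statistics (NOT the real classes).
final case class ColumnStat(
    distinctCount: Option[BigInt] = None,
    nullCount: Option[BigInt] = None)

final case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    attributeStats: Map[String, ColumnStat] = Map.empty,
    isRuntime: Boolean = false)

// Hypothetical stand-in for ShuffleExchangeLike / BroadcastExchangeLike.
trait ExchangeLike {
  def runtimeStatistics: Statistics
}

// Hypothetical custom exchange that reports per-column statistics.
// The values are illustrative only.
final class ColumnStatsExchange extends ExchangeLike {
  override def runtimeStatistics: Statistics = Statistics(
    sizeInBytes = BigInt(4096),
    rowCount = Some(BigInt(100)),
    attributeStats = Map(
      "id" -> ColumnStat(
        distinctCount = Some(BigInt(100)),
        nullCount = Some(BigInt(0)))))
}

// Hypothetical stand-in for QueryStageExec, using the proposed computeStats.
final class QueryStage(exchange: ExchangeLike, val isMaterialized: Boolean) {
  def getRuntimeStatistics: Statistics = exchange.runtimeStatistics

  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    // Proposed change: carry the column statistics over as well.
    val attributeStats = runtimeStats.attributeStats
    Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
  } else {
    None
  }
}
```

With these stand-ins, a materialized stage returns statistics whose attributeStats still contain the entry for "id", while a stage that is not yet materialized returns None, as before.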