Andy Grove created SPARK-39991:
----------------------------------

             Summary: AQE should use available column statistics from completed 
query stages
                 Key: SPARK-39991
                 URL: https://issues.apache.org/jira/browse/SPARK-39991
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Andy Grove


In QueryStageExec.computeStats we copy partial statistics from materialized query 
stages by calling QueryStageExec#getRuntimeStatistics, which in turn calls 
ShuffleExchangeLike#runtimeStatistics or 
BroadcastExchangeLike#runtimeStatistics.

Only dataSize and numOutputRows are copied into the new Statistics object:

{code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    Some(Statistics(dataSize, numOutputRows, isRuntime = true))
  } else {
    None
  }
{code}

I would also like to copy over the column statistics stored in 
Statistics.attributeStats so that they can be fed back into the logical plan 
optimization phase. This is a small change, as shown below:

{code:scala}
  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    val attributeStats = runtimeStats.attributeStats
    Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
  } else {
    None
  }
{code}

The Spark implementations of ShuffleExchangeLike and BroadcastExchangeLike do 
not currently provide such column statistics, but other custom implementations 
can.
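
To illustrate the intended effect, here is a minimal, self-contained sketch that does not depend on Spark. The Statistics and ColumnStat case classes are simplified stand-ins for the real Spark classes, and StageLike and CustomStage are hypothetical names introduced only for this example. It assumes a custom exchange that reports per-column statistics at runtime and shows them surviving the copy in computeStats.

```scala
// Simplified stand-ins for Spark's ColumnStat and Statistics (NOT the real classes).
final case class ColumnStat(
    distinctCount: Option[BigInt] = None,
    nullCount: Option[BigInt] = None)

final case class Statistics(
    sizeInBytes: BigInt,
    rowCount: Option[BigInt] = None,
    attributeStats: Map[String, ColumnStat] = Map.empty,
    isRuntime: Boolean = false)

// Hypothetical stand-in for a query stage; mirrors the proposed computeStats.
trait StageLike {
  def isMaterialized: Boolean
  def getRuntimeStatistics: Statistics

  def computeStats(): Option[Statistics] = if (isMaterialized) {
    val runtimeStats = getRuntimeStatistics
    val dataSize = runtimeStats.sizeInBytes.max(0)
    val numOutputRows = runtimeStats.rowCount.map(_.max(0))
    // Proposed change: carry the column statistics over as well.
    val attributeStats = runtimeStats.attributeStats
    Some(Statistics(dataSize, numOutputRows, attributeStats, isRuntime = true))
  } else {
    None
  }
}

// Hypothetical custom exchange that tracks column statistics at runtime.
final class CustomStage(val isMaterialized: Boolean) extends StageLike {
  def getRuntimeStatistics: Statistics = Statistics(
    sizeInBytes = BigInt(1024),
    rowCount = Some(BigInt(100)),
    attributeStats = Map(
      "id" -> ColumnStat(distinctCount = Some(BigInt(100)), nullCount = Some(BigInt(0)))))
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Materialized stage: column statistics are preserved in the result.
    println(new CustomStage(isMaterialized = true).computeStats())
    // Non-materialized stage: no statistics are reported.
    println(new CustomStage(isMaterialized = false).computeStats())
  }
}
```

With the current behavior, the "id" entry would be dropped and only the size and row count would reach the optimizer.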



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
