[ 
https://issues.apache.org/jira/browse/SPARK-57805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57805:
-----------------------------
    Affects Version/s: 4.3.0
                           (was: 5.0.0)

> CBO filter/join estimation throws MatchError for TimestampNTZ, ANSI interval, 
> and TIME columns
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-57805
>                 URL: https://issues.apache.org/jira/browse/SPARK-57805
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> h3. Problem
> When cost-based optimization is enabled (spark.sql.cbo.enabled=true) and 
> column
> statistics have been collected, a filter/equality/IN predicate or an 
> equi-join on a
> column of type TIMESTAMP_NTZ, an ANSI interval (YEAR MONTH / DAY TIME), or 
> TIME throws
> {{scala.MatchError}} and crashes query optimization.
> The optimizer's statistics-estimation layer only enumerates
> NUMERIC / DATE / TIMESTAMP / BOOLEAN. These three type families are 
> collectable (their
> catalog->plan stats conversion is implemented) but are not handled on the 
> consumption
> side, so once their stats exist they reach an unhandled branch:
> * {{EstimationUtils.toDouble}} / {{fromDouble}} - matches only
>   {{_: NumericType | DateType | TimestampType}} (+ {{BooleanType}}).
> * {{FilterEstimation.evaluateBinary}} (non-exhaustive), {{evaluateEquality}} 
> (via
>   {{ValueInterval}} -> {{toDouble}}), and {{evaluateInSet}}.
> * {{JoinEstimation}} - {{ValueInterval(leftKeyStat.min, ...)}} on join keys, 
> no type guard.
> The error is not swallowed: {{BasicStatsPlanVisitor.visitFilter}} uses
> {{FilterEstimation(p).estimate.getOrElse(fallback(p))}}, which rescues a 
> {{None}} result,
> not a thrown exception - so the {{MatchError}} propagates out.
> h3. Affected types
> * TIMESTAMP_NTZ - collection shipped in SPARK-42777, which touched only
>   {{CatalogColumnStat}} + a collection test and never the estimation side. 
> Latent since ~3.4.
> * ANSI intervals (YearMonth / DayTime) - same gap.
> * TIME - newly reachable via SPARK-54582 (PR apache/spark#53312), which adds 
> TIME statistics
>   collection.
> Contrast: {{UnionEstimation.isTypeSupported}} *was* extended per-type for 
> TIMESTAMP_NTZ and
> ANSI intervals in SPARK-37468 (it uses {{PhysicalDataType.ordering}}, so it 
> degrades gracefully
> rather than crashing) - but TIME is still missing there too and should be 
> added.
> h3. Reproduction (unit-test form, FilterEstimationSuite)
> {code:scala}
> val tMin = DateTimeUtils.localTimeToNanos(LocalTime.parse("08:00:00"))
> val t12  = DateTimeUtils.localTimeToNanos(LocalTime.parse("12:00:00"))
> val attrTime = AttributeReference("ctime", TimeType(6))()
> val colStatTime = ColumnStat(distinctCount = Some(10), min = Some(tMin), max 
> = Some(tMax),
>   nullCount = Some(0), avgLen = Some(8), maxLen = Some(8))
> // Under CBO this reaches FilterEstimation -> EstimationUtils.toDouble -> 
> scala.MatchError.
> Filter(LessThan(attrTime, Literal(t12, TimeType(6))),
>   childStatsTestPlan(Seq(attrTime), 10L, AttributeMap(Seq(attrTime -> 
> colStatTime)))).stats
> {code}
> h3. Proposed fix
> Extend the numeric-value contract to cover all three families at once:
> * {{EstimationUtils.toDouble}}: add {{TimestampNTZType | _: TimeType | _: 
> AnsiIntervalType}} to the
>   {{value.toString.toDouble}} branch (their internal values are already 
> numeric Long/Int).
> * {{EstimationUtils.fromDouble}}: add {{TimestampNTZType | _: TimeType | _: 
> DayTimeIntervalType => double.toLong}}
>   and {{_: YearMonthIntervalType => double.toInt}}.
> * {{FilterEstimation.evaluateBinary}} and {{evaluateInSet}}: add the same 
> types to the numeric branch.
> * {{UnionEstimation.isTypeSupported}}: add {{_: TimeType}} (TIMESTAMP_NTZ / 
> intervals already present).
> * Add {{FilterEstimationSuite}} / {{JoinEstimationSuite}} cases for TIME, 
> TIMESTAMP_NTZ, and an interval type.
> h3. Notes
> CBO is off by default, so this is latent, but it is a hard crash once enabled 
> with stats collected on
> one of these columns.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to