MaxGekk commented on code in PR #56842:
URL: https://github.com/apache/spark/pull/56842#discussion_r3487492208
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala:
##########
@@ -326,6 +326,21 @@ private[columnar] final class IntervalColumnStats extends ColumnStats {
Array[Any](null, null, nullCount, count, sizeInBytes)
}
+private[columnar] final class TimestampNanosColumnStats extends ColumnStats {
Review Comment:
`TimestampNanosColumnStats` emits `null`/`null` for lower/upper (the
`CalendarInterval` / `IntervalColumnStats` pattern), so cached
nanosecond-timestamp columns get no batch-level partition pruning.
The same logical type at micro precision takes a different path:
`TimestampType`/`TimestampNTZType` -> `LongColumnBuilder` -> `LongColumnStats`,
which collects min/max. So a range filter (`WHERE ts > '...'`) over a cached
`TIMESTAMP_NTZ(6)` column skips non-matching batches, while the same filter
over a cached `TIMESTAMP_NTZ(9)` column scans every batch.
`TimestampNanosVal` is `Comparable` (its natural ordering is chronological), and
ordered non-primitive cache types already keep bounds: `DecimalColumnStats`
collects `Decimal` min/max. So tracking `upper`/`lower` as `TimestampNanosVal`
here (modeled on `DecimalColumnStats` rather than `IntervalColumnStats`) would
preserve the pruning that the micro path provides.
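
To make the suggestion concrete, here is a minimal standalone sketch of the bounds-keeping pattern. It does not use the real Spark classes: `NanosTs` is a hypothetical stand-in for `TimestampNanosVal` (its field names are assumptions), and `NanosBoundsStats` is a simplified analogue of a `ColumnStats` subclass, not the actual API.

```scala
// Hypothetical stand-in for TimestampNanosVal; the field layout is an assumption.
final case class NanosTs(epochMicros: Long, nanosWithinMicro: Int)
    extends Comparable[NanosTs] {
  // Order by microseconds first, then by the sub-microsecond remainder.
  override def compareTo(that: NanosTs): Int = {
    val c = java.lang.Long.compare(epochMicros, that.epochMicros)
    if (c != 0) c
    else java.lang.Integer.compare(nanosWithinMicro, that.nanosWithinMicro)
  }
}

// Simplified analogue of a bounds-keeping stats collector (not the Spark class).
final class NanosBoundsStats {
  private var lower: NanosTs = null
  private var upper: NanosTs = null
  private var nullCount = 0
  private var count = 0

  // Record one value; nulls are counted but never affect the bounds.
  def gather(value: NanosTs): Unit = {
    count += 1
    if (value == null) {
      nullCount += 1
    } else {
      if (lower == null || value.compareTo(lower) < 0) lower = value
      if (upper == null || value.compareTo(upper) > 0) upper = value
    }
  }

  // None when the batch holds no non-null values.
  def bounds: Option[(NanosTs, NanosTs)] =
    if (lower == null) None else Some((lower, upper))

  // For a filter `ts > threshold`, the batch can be skipped unless its
  // upper bound is strictly greater than the threshold.
  def mightContainGreaterThan(threshold: NanosTs): Boolean =
    bounds.exists { case (_, hi) => hi.compareTo(threshold) > 0 }
}
```

With bounds like these, a batch whose `upper` is at or below the filter threshold can be skipped, which is the behavior the micro-precision path already gets from `LongColumnStats`.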
This is not a correctness issue; the feature works. Is the bounds-less choice
intentional (following `CalendarInterval`), or is it worth collecting min/max so
that cached nanos timestamps prune like micro timestamps?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]