viirya commented on code in PR #56842:
URL: https://github.com/apache/spark/pull/56842#discussion_r3488257027
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala:
##########
@@ -326,6 +326,21 @@ private[columnar] final class IntervalColumnStats extends ColumnStats {
Array[Any](null, null, nullCount, count, sizeInBytes)
}
+private[columnar] final class TimestampNanosColumnStats extends ColumnStats {
Review Comment:
Good point -- collecting min/max is the right call, thanks. You're right
that the bounds-less version was a regression relative to the microsecond path:
`TIMESTAMP_NTZ(6)` prunes via `LongColumnStats`, while `TIMESTAMP_NTZ(9)`
scanned every batch.
Following your suggestion, `TimestampNanosColumnStats` now tracks
`upper`/`lower` bounds as `TimestampNanosVal` values (modeled on
`DecimalColumnStats` rather than `IntervalColumnStats`), comparing them with
its `compareTo`, which follows calendar order.
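
For readers following along, here is a minimal, self-contained sketch of the bounds-collection idea. It is not the code in this PR: `NanosTs`, its two fields, and `BoundsCollector` are hypothetical stand-ins for `TimestampNanosVal` and the stats class, and the field layout is an assumption made only for illustration.

```scala
object NanosStatsSketch {
  // Hypothetical stand-in for TimestampNanosVal; the field layout is an assumption.
  final case class NanosTs(epochMicros: Long, nanosWithinMicro: Int)
      extends Comparable[NanosTs] {
    // Compare whole microseconds first, then the sub-microsecond remainder.
    override def compareTo(that: NanosTs): Int = {
      val c = java.lang.Long.compare(epochMicros, that.epochMicros)
      if (c != 0) c
      else java.lang.Integer.compare(nanosWithinMicro, that.nanosWithinMicro)
    }
  }

  // Simplified per-batch collector; it does not implement Spark's ColumnStats.
  final class BoundsCollector {
    var lower: NanosTs = null
    var upper: NanosTs = null
    var nullCount: Int = 0
    var count: Int = 0

    // Nulls are counted but never affect the bounds.
    def gather(value: NanosTs): Unit = {
      count += 1
      if (value == null) {
        nullCount += 1
      } else {
        if (lower == null || value.compareTo(lower) < 0) lower = value
        if (upper == null || value.compareTo(upper) > 0) upper = value
      }
    }
  }

  // Convenience helper: run the collector over one batch of values.
  def collect(values: Seq[NanosTs]): BoundsCollector = {
    val stats = new BoundsCollector
    values.foreach(stats.gather)
    stats
  }
}
```

The point of the sketch is only that bounds are kept as the value type itself and updated through its own comparison, rather than being left as `null` the way the interval stats do.
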
The pruning path is already wired for it: `TimestampNTZNanosType` is an
`AtomicType`, so `ExtractableLiteral` extracts the literal, and
`PhysicalTimestampNTZNanosType` defines an ordering, so the bound comparisons
that `buildFilter` generates are valid. As a result, cached nanos timestamps
now prune like micro timestamps.
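
As a rough illustration of why an ordering plus min/max bounds is enough to prune, the sketch below skips any batch whose `[lower, upper]` range cannot overlap a range filter. It is a simplified model, not Spark's `buildFilter`: `Batch`, `makeBatch`, `mightMatch`, and `scan` are hypothetical names, and the usage note uses plain `Long` values in place of nanos timestamps.

```scala
object BatchPruningSketch {
  // Hypothetical cached batch: its rows plus precomputed min/max bounds.
  final case class Batch[T](rows: Vector[T], lower: T, upper: T)

  // Builds a batch from a non-empty vector, deriving bounds from the ordering.
  def makeBatch[T](rows: Vector[T])(implicit ord: Ordering[T]): Batch[T] =
    Batch(rows, rows.min, rows.max)

  // A filter lo <= col <= hi can match only if [lower, upper] overlaps [lo, hi].
  def mightMatch[T](b: Batch[T], lo: T, hi: T)(implicit ord: Ordering[T]): Boolean =
    ord.gteq(b.upper, lo) && ord.lteq(b.lower, hi)

  // Returns (number of batches read, rows matching the filter).
  def scan[T](batches: Seq[Batch[T]], lo: T, hi: T, pruning: Boolean)(
      implicit ord: Ordering[T]): (Int, Vector[T]) = {
    val toRead = if (pruning) batches.filter(b => mightMatch(b, lo, hi)) else batches
    val rows = toRead
      .flatMap(_.rows)
      .filter(v => ord.gteq(v, lo) && ord.lteq(v, hi))
      .toVector
    (toRead.size, rows)
  }
}
```

With three batches built from `Vector(1L, 2L, 3L)`, `Vector(10L, 11L, 12L)`, and `Vector(20L, 21L)`, calling `scan(batches, 10L, 11L, pruning = true)` reads one batch while `pruning = false` reads all three, and both return the same rows.
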
Added coverage: `ColumnStatsSuite` asserts the min/max bounds for both NTZ
and LTZ, and `PartitionBatchPruningSuite` verifies that a range filter over a
cached nanos column reads fewer batches with in-memory partition pruning
enabled than with it disabled (and returns the same rows as a pre-cache
evaluation).