viirya commented on code in PR #56842:
URL: https://github.com/apache/spark/pull/56842#discussion_r3488257027


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala:
##########
@@ -326,6 +326,21 @@ private[columnar] final class IntervalColumnStats extends ColumnStats {
     Array[Any](null, null, nullCount, count, sizeInBytes)
 }
 
+private[columnar] final class TimestampNanosColumnStats extends ColumnStats {

Review Comment:
   Good point -- collecting min/max is the right call, thanks. You're right 
that the bounds-less version was a regression relative to the microsecond path: 
`TIMESTAMP_NTZ(6)` prunes via `LongColumnStats`, while `TIMESTAMP_NTZ(9)` 
scanned every batch.
   
   Following your suggestion, `TimestampNanosColumnStats` now collects 
`upper`/`lower` as `TimestampNanosVal` (modeled on `DecimalColumnStats` rather 
than `IntervalColumnStats`) and compares values with its `compareTo`, which 
follows calendar order. The pruning path is already wired for it: 
`TimestampNTZNanosType` is an `AtomicType`, so `ExtractableLiteral` extracts 
the literal, and `PhysicalTimestampNTZNanosType` defines an ordering, so the 
bound comparisons that `buildFilter` generates are valid. As a result, cached 
nanosecond timestamps now prune like microsecond timestamps.
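   
   For readers following along, here is a minimal, self-contained sketch of 
the idea, not the code in this PR. `NanosTs`, `MinMaxStats`, `PruningSketch` 
and `mightContainAtLeast` are hypothetical names invented for illustration, 
and the two-field layout of `NanosTs` is an assumption rather than the actual 
`TimestampNanosVal` representation. It shows per-batch min/max collection and 
how an upper bound lets a `col >= bound` filter skip a batch:

```scala
// Hypothetical stand-in for a nanosecond timestamp value: microseconds since
// the epoch plus a sub-microsecond remainder. Not Spark's actual class.
final case class NanosTs(epochMicros: Long, nanosWithinMicro: Int)
    extends Ordered[NanosTs] {
  // Calendar order: compare the microsecond part first, then the remainder.
  def compare(that: NanosTs): Int = {
    val c = java.lang.Long.compare(epochMicros, that.epochMicros)
    if (c != 0) c
    else java.lang.Integer.compare(nanosWithinMicro, that.nanosWithinMicro)
  }
}

// Hypothetical per-batch statistics: smallest and largest non-null value seen.
final class MinMaxStats {
  var lower: Option[NanosTs] = None
  var upper: Option[NanosTs] = None
  var nullCount: Int = 0
  var count: Int = 0

  def gather(value: Option[NanosTs]): Unit = {
    count += 1
    value match {
      case None => nullCount += 1
      case Some(v) =>
        // forall is true on None, so the first non-null value sets both bounds.
        if (lower.forall(v < _)) lower = Some(v)
        if (upper.forall(v > _)) upper = Some(v)
    }
  }

  // For a filter `col >= bound`, a batch may match only if upper >= bound.
  // A batch with no non-null values has no upper bound and is skipped.
  def mightContainAtLeast(bound: NanosTs): Boolean = upper.exists(_ >= bound)
}

object PruningSketch {
  // Split the values into fixed-size batches and collect stats for each one.
  def buildBatches(values: Seq[Option[NanosTs]], batchSize: Int): Seq[MinMaxStats] =
    values.grouped(batchSize).map { batch =>
      val stats = new MinMaxStats
      batch.foreach(stats.gather)
      stats
    }.toSeq
}
```

   Without the bounds, every batch would have to be treated as a possible 
match, which is the "scanned every batch" behavior described above.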
   
   Added coverage: `ColumnStatsSuite` asserts the min/max bounds for both NTZ 
and LTZ, and `PartitionBatchPruningSuite` verifies that a range filter over a 
cached nanos column reads fewer batches with in-memory partition pruning 
enabled than with it disabled, and returns the same rows as a pre-cache 
evaluation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

