davidm-db commented on code in PR #56661:
URL: https://github.com/apache/spark/pull/56661#discussion_r3453233043
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/types/ops/TimeTypeParquetOps.scala:
##########
@@ -88,6 +90,11 @@ case class TimeTypeParquetOps(t: TimeType) extends
ParquetTypeOps {
// ==================== Vectorized Read ====================
override def isBatchReadSupported(sqlConf: SQLConf): Boolean = true
+
+ // Spark internal: nanos-of-day; Parquet storage: INT64 micros-of-day. The
updater ignores
+ // `descriptor` (the micros -> nanos conversion is the same for every
TimeType precision).
+ override def getVectorUpdater(descriptor: ColumnDescriptor):
Option[ParquetVectorUpdater] =
Review Comment:
Two correctness concerns, both from `getVectorUpdater` returning `Some`
unconditionally and ignoring `descriptor`:
1. **Non-INT64 regression.** The framework dispatch in
`ParquetVectorUpdaterFactory.getUpdater` runs before `switch (typeName)`, and
this returns an updater for any `TimeType` request regardless of physical type.
Previously the `instanceof TimeType` arm was inside `case INT64`, so e.g. an
`INT32 TIME(MILLIS)` column read as `TimeType` fell through to the clean
`SchemaColumnConvertNotSupportedException`. Now it returns
`TimeMicrosToNanosUpdater` and `readLongs` over an INT32 column - a silent
mis-read / low-level error instead of a clear type mismatch.
2. **The row path's MICROS guard isn't carried over.** Within INT64,
`TIME(NANOS)` / raw INT64 / `TIMESTAMP` read as `TimeType` are still accepted
here, whereas `requireCompatibleParquetType` rejects them on the row path. So
this widens the row-vs-vectorized divergence instead of closing it.
Both are fixed by one change: validate `descriptor` here with the same guard
the row path uses - `requireCompatibleParquetType(t,
descriptor.getPrimitiveType)` - returning `Some(updater)` only for canonical
INT64 TIME(MICROS), else `None` (falling through to the existing clean throw).
That also unifies the read guard across both readers, which is really the prize
of this routing.
Do you think this PR is the right place to address both, or would you rather
keep it purely about the framework plumbing and scope the guard + regression
fix separately?
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/types/ops/TimeTypeParquetOpsSuite.scala:
##########
@@ -118,6 +118,19 @@ class TimeTypeParquetOpsSuite extends SparkFunSuite {
// the TIME annotation; the raw-INT64 / TIMESTAMP / DECIMAL / group tests
// above already exercise the !isPrimitive and "non-TIME annotation"
branches.
+ // ---------- vectorized read updater ----------
+
+ test("getVectorUpdater returns a framework updater for TimeType") {
Review Comment:
These cover the plumbing (hook returns Some / null) but not behavior. The
happy path is covered by the `withAllParquetReaders` test in `ParquetIOSuite`
from SPARK-57416; what's missing - and what would catch the two issues above -
is a vectorized-reader reject test: `INT32 TIME(MILLIS)` / `INT64 TIME(NANOS)`
/ raw INT64 read as `TimeType` should throw, not silently decode. Worth adding
with the guard (mirrors the row-path reject tests here, or extend the
`ParquetIOSuite` test under `withAllParquetReaders`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]