Re: [PR] [WIP][SPARK-55444][SQL] Introduce and Route TimeType to Parquet vectorized read through the Types Framework [spark]

via GitHub Mon, 22 Jun 2026 08:01:10 -0700


davidm-db commented on code in PR #56661:
URL: https://github.com/apache/spark/pull/56661#discussion_r3453233043



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/types/ops/TimeTypeParquetOps.scala:
##########
@@ -88,6 +90,11 @@ case class TimeTypeParquetOps(t: TimeType) extends 
ParquetTypeOps {
   // ==================== Vectorized Read ====================
 
   override def isBatchReadSupported(sqlConf: SQLConf): Boolean = true
+
+  // Spark internal: nanos-of-day; Parquet storage: INT64 micros-of-day. The 
updater ignores
+  // `descriptor` (the micros -> nanos conversion is the same for every 
TimeType precision).
+  override def getVectorUpdater(descriptor: ColumnDescriptor): 
Option[ParquetVectorUpdater] =

Review Comment:
   Two correctness concerns, both from `getVectorUpdater` returning `Some` 
unconditionally and ignoring `descriptor`:
   
   1. **Non-INT64 regression.** The framework dispatch in 
`ParquetVectorUpdaterFactory.getUpdater` runs before `switch (typeName)`, and 
this returns an updater for any `TimeType` request regardless of physical type. 
Previously the `instanceof TimeType` arm was inside `case INT64`, so e.g. an 
`INT32 TIME(MILLIS)` column read as `TimeType` fell through to the clean 
`SchemaColumnConvertNotSupportedException`. Now it returns 
`TimeMicrosToNanosUpdater` and `readLongs` over an INT32 column - a silent 
mis-read / low-level error instead of a clear type mismatch.
   2. **The row path's MICROS guard isn't carried over.** Within INT64, 
`TIME(NANOS)` / raw INT64 / `TIMESTAMP` read as `TimeType` are still accepted 
here, whereas `requireCompatibleParquetType` rejects them on the row path. So 
this widens the row-vs-vectorized divergence instead of closing it.
   
   Both are fixed by one change: validate `descriptor` here with the same guard 
the row path uses - `requireCompatibleParquetType(t, 
descriptor.getPrimitiveType)` - returning `Some(updater)` only for canonical 
INT64 TIME(MICROS), else `None` (falling through to the existing clean throw). 
That also unifies the read guard across both readers, which is really the prize 
of this routing.
   
   Do you think this PR is the right place to address both, or would you rather 
keep it purely about the framework plumbing and scope the guard + regression 
fix separately?



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/types/ops/TimeTypeParquetOpsSuite.scala:
##########
@@ -118,6 +118,19 @@ class TimeTypeParquetOpsSuite extends SparkFunSuite {
   // the TIME annotation; the raw-INT64 / TIMESTAMP / DECIMAL / group tests
   // above already exercise the !isPrimitive and "non-TIME annotation" 
branches.
 
+  // ---------- vectorized read updater ----------
+
+  test("getVectorUpdater returns a framework updater for TimeType") {

Review Comment:
   These cover the plumbing (hook returns Some / null) but not behavior. The 
happy path is covered by the `withAllParquetReaders` test in `ParquetIOSuite` 
from SPARK-57416; what's missing - and what would catch the two issues above - 
is a vectorized-reader reject test: `INT32 TIME(MILLIS)` / `INT64 TIME(NANOS)` 
/ raw INT64 read as `TimeType` should throw, not silently decode. Worth adding 
with the guard (mirrors the row-path reject tests here, or extend the 
`ParquetIOSuite` test under `withAllParquetReaders`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [WIP][SPARK-55444][SQL] Introduce and Route TimeType to Parquet vectorized read through the Types Framework [spark]

Reply via email to