MaxGekk commented on code in PR #56622:
URL: https://github.com/apache/spark/pull/56622#discussion_r3449088420


##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -165,8 +165,17 @@ public ParquetVectorUpdater getUpdater(ColumnDescriptor descriptor, DataType spa
           return new LongUpdater();
         } else if (canReadAsDecimal(descriptor, sparkType)) {
           return new LongToDecimalUpdater(descriptor, (DecimalType) sparkType);
-        } else if (sparkType instanceof TimeType) {
-          return new LongAsNanosUpdater();
+        } else if (sparkType instanceof TimeType &&
+          isTimeTypeMatched(LogicalTypeAnnotation.TimeUnit.NANOS)) {

Review Comment:
   Thanks @cloud-fan -- agreed the tightening is a user-visible change in 
principle, and you're right that the code did ship in 4.1.0/4.1.1/4.1.2.
   
   One thing I'd like your read on before adding the flag: the entire TIME type 
is gated behind `spark.sql.timeType.enabled`, which is `.internal()` and 
`createWithDefault(Utils.isTesting)` -- i.e. off by default in production -- 
since it was introduced in 4.1.0:
   
   ```scala
   val TIME_TYPE_ENABLED =
     buildConf("spark.sql.timeType.enabled")
       .internal()
       .doc("When true, the TIME data type is supported.")
       .version("4.1.0")
       .booleanConf
       .createWithDefault(Utils.isTesting)
   ```
   
   So in the released versions the lenient vectorized path was never reachable 
on default settings -- hitting it requires flipping this internal/undocumented 
flag and then reading a non-`TIME`-annotated INT64 column with an explicit 
`TIME` read schema (a preview/test scenario). My read is that the legacy escape 
hatch + migration-guide entry are aimed at GA, default-on, user-visible 
behavior, so they may not be warranted for a change confined to a 
still-internal, default-off preview type -- similar to the reasoning on the 
Avro follow-up (SPARK-57581). The change itself is also strictly safer: it 
turns a silent mis-decode, in which microsecond values are read as nanoseconds 
and come out wrong by a factor of 1000, into a loud failure that matches the 
row-based reader's `requireCompatibleParquetType`.
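
   To make the factor-of-1000 error concrete, here is a minimal sketch in plain 
`java.time` arithmetic. It does not use Spark's reader; the class and method 
names are hypothetical and only illustrate what happens when a value stored as 
microseconds since midnight is interpreted as nanoseconds:

```java
import java.time.LocalTime;

// Hypothetical illustration only; not part of Spark or of this PR.
final class TimeUnitMisdecodeSketch {
    // Interprets a raw INT64 value as nanoseconds since midnight.
    static LocalTime readAsNanos(long raw) {
        return LocalTime.ofNanoOfDay(raw);
    }

    // Interprets a raw INT64 value as microseconds since midnight.
    static LocalTime readAsMicros(long raw) {
        return LocalTime.ofNanoOfDay(Math.multiplyExact(raw, 1_000L));
    }

    public static void main(String[] args) {
        // 12:00:00 stored as microseconds since midnight.
        long storedMicros = 12L * 60 * 60 * 1_000_000L;
        System.out.println(readAsMicros(storedMicros)); // 12:00
        System.out.println(readAsNanos(storedMicros));  // 00:00:43.200
    }
}
```

   Both reads succeed without error, which is why the mismatch stays silent 
unless the annotation is checked before the updater is chosen.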
   
   That said, I'm happy to add it if you'd prefer the belt-and-suspenders 
option -- it's a small `.internal()` 
`spark.sql.legacy.parquet.lenientTimeInt64Read` (default `false`, 
`version("4.3.0")`) that falls back to the pre-PR `LongAsNanosUpdater` 
regardless of annotation, plus a `docs/sql-migration-guide.md` note. Let me 
know which way you'd like to go.
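
   For reference, the behavior of the optional flag could be sketched roughly as 
follows. This is a simplified, self-contained model, not the PR's code: the 
class, the method, and the two boolean parameters are hypothetical stand-ins 
for the annotation check and the proposed config value, and the strict path is 
modeled as an exception even though the real factory has further branches.

```java
// Hypothetical model of the dispatch; not the actual ParquetVectorUpdaterFactory.
final class TimeUpdaterDispatchSketch {
    /**
     * @param annotatedAsTimeNanos whether the INT64 column carries a TIME(NANOS) annotation
     * @param legacyLenientRead    assumed value of the proposed legacy flag (default false)
     * @return the name of the updater that would be selected
     */
    static String chooseUpdater(boolean annotatedAsTimeNanos, boolean legacyLenientRead) {
        if (annotatedAsTimeNanos || legacyLenientRead) {
            // With the flag on, the annotation is ignored, as before the PR.
            return "LongAsNanosUpdater";
        }
        // Strict path: refuse to decode a column whose annotation does not match.
        throw new IllegalStateException("INT64 column is not annotated as TIME(NANOS)");
    }

    public static void main(String[] args) {
        System.out.println(chooseUpdater(true, false));  // LongAsNanosUpdater
        System.out.println(chooseUpdater(false, true));  // LongAsNanosUpdater
    }
}
```

   With both arguments `false`, the sketch throws instead of returning an 
updater, which corresponds to the loud failure described above.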



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

