MaxGekk commented on code in PR #56622:
URL: https://github.com/apache/spark/pull/56622#discussion_r3449088420
##########
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java:
##########
@@ -165,8 +165,17 @@ public ParquetVectorUpdater getUpdater(ColumnDescriptor descriptor, DataType spa
return new LongUpdater();
} else if (canReadAsDecimal(descriptor, sparkType)) {
return new LongToDecimalUpdater(descriptor, (DecimalType) sparkType);
- } else if (sparkType instanceof TimeType) {
- return new LongAsNanosUpdater();
+ } else if (sparkType instanceof TimeType &&
+ isTimeTypeMatched(LogicalTypeAnnotation.TimeUnit.NANOS)) {
Review Comment:
Thanks @cloud-fan -- agreed the tightening is a user-visible change in
principle, and you're right that the code did ship in 4.1.0/4.1.1/4.1.2.
One thing I'd like your read on before adding the flag: the entire TIME type
is gated behind `spark.sql.timeType.enabled`, which is `.internal()` and
`createWithDefault(Utils.isTesting)` -- i.e. off by default in production --
since it was introduced in 4.1.0:
```scala
val TIME_TYPE_ENABLED =
buildConf("spark.sql.timeType.enabled")
.internal()
.doc("When true, the TIME data type is supported.")
.version("4.1.0")
.booleanConf
.createWithDefault(Utils.isTesting)
```
So in the released versions the lenient vectorized path was never reachable
on default settings -- hitting it requires flipping this internal/undocumented
flag and then reading a non-`TIME`-annotated INT64 column with an explicit
`TIME` read schema (a preview/test scenario). My read is that the legacy escape
hatch + migration-guide entry are aimed at GA, default-on, user-visible
behavior, so they may not be warranted for a change confined to a
still-internal, default-off preview type -- similar to the reasoning on the
Avro follow-up (SPARK-57581). The change itself is also strictly safer: it
turns a silent mis-decode (micros values read as nanos, so off by a factor of
1000) into a loud failure that matches the row-based reader's
`requireCompatibleParquetType` check.
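
To make the 1000x point concrete, here is a minimal sketch using plain `java.time` arithmetic. It is not Spark's reader code: the class and method names are made up for illustration, and it assumes the INT64 value actually holds microseconds since midnight.

```java
import java.time.LocalTime;

// Illustrative only: hypothetical names, plain java.time, no Spark or Parquet classes.
final class TimeMisdecodeSketch {
    // Encode a time of day as microseconds since midnight,
    // the way a micros-based INT64 column would store it.
    static long toMicrosOfDay(LocalTime t) {
        return t.toNanoOfDay() / 1_000L;
    }

    // Lenient read: treat the raw INT64 as nanoseconds, whatever its real unit is.
    static LocalTime readAsNanos(long raw) {
        return LocalTime.ofNanoOfDay(raw);
    }

    // Unit-aware read: scale micros up to nanos before decoding.
    static LocalTime readAsMicros(long raw) {
        return LocalTime.ofNanoOfDay(Math.multiplyExact(raw, 1_000L));
    }

    public static void main(String[] args) {
        long raw = toMicrosOfDay(LocalTime.of(12, 34, 56));
        System.out.println(raw);               // 45296000000
        System.out.println(readAsNanos(raw));  // 00:00:45.296 (silently wrong)
        System.out.println(readAsMicros(raw)); // 12:34:56
    }
}
```

The lenient read does not fail: it returns a valid-looking time that is 1000x too small, which is why the mismatch goes unnoticed without a unit check.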
That said, I'm happy to add it if you'd prefer the belt-and-suspenders
option -- it would be a small `.internal()` conf,
`spark.sql.legacy.parquet.lenientTimeInt64Read` (default `false`,
`version("4.3.0")`), that falls back to the pre-PR `LongAsNanosUpdater`
regardless of annotation, plus a `docs/sql-migration-guide.md` note. Let me
know which way you'd like to go.
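
For reference, the selection logic with the proposed flag could look roughly like the sketch below. It is self-contained and does not use Spark or Parquet classes: the `Unit` enum stands in for `LogicalTypeAnnotation.TimeUnit`, updater names are returned as strings, and the `legacyLenient` parameter stands in for the proposed conf. How the real factory handles non-NANOS units is not shown in the hunk above, so this sketch simply fails for them.

```java
// Illustrative only: hypothetical stand-ins, not the actual ParquetVectorUpdaterFactory code.
final class TimeUpdaterChoiceSketch {
    // Stand-in for LogicalTypeAnnotation.TimeUnit.
    enum Unit { MILLIS, MICROS, NANOS }

    // annotationUnit == null models an INT64 column with no TIME annotation.
    // legacyLenient models the proposed legacy flag (assumed default: false).
    static String choose(Unit annotationUnit, boolean legacyLenient) {
        if (legacyLenient) {
            // Pre-PR behavior: always decode as nanos, regardless of annotation.
            return "LongAsNanosUpdater";
        }
        if (annotationUnit == Unit.NANOS) {
            // Post-PR behavior: decode as nanos only when the annotation says nanos.
            return "LongAsNanosUpdater";
        }
        // Assumption: anything else is rejected loudly instead of being mis-decoded.
        throw new IllegalStateException(
            "INT64 column is not compatible with a TIME read: unit=" + annotationUnit);
    }

    public static void main(String[] args) {
        System.out.println(choose(Unit.NANOS, false)); // LongAsNanosUpdater
        System.out.println(choose(null, true));        // LongAsNanosUpdater
    }
}
```

With the flag off, an unannotated or micros-annotated column raises an error; with it on, the old lenient decode is restored.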
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]