MaxGekk commented on code in PR #56422:
URL: https://github.com/apache/spark/pull/56422#discussion_r3387610929
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -694,7 +704,8 @@ class SparkToParquetSchemaConverter(
       case _: TimeType =>
         Types.primitive(INT64, repetition)
-          .as(LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS)).named(field.name)
+          .as(LogicalTypeAnnotation.timeType(timeIsAdjustedToUTC, TimeUnit.MICROS))
Review Comment:
This write-path change introduces a silent predicate-pushdown regression
when `timeIsAdjustedToUTC=true`.
`ParquetScanBuilder.scala:68` calls
`new SparkToParquetSchemaConverter(sparkSession.sessionState.conf)` to build the
Parquet schema for `ParquetFilters`. With `timeIsAdjustedToUTC=true`, this now
emits `timeType(true, MICROS)` for `TimeType` columns. However,
`ParquetFilters.scala:154` has:
```scala
private val ParquetTimeMicrosType =
  ParquetSchemaType(LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS), INT64, 0)
```
Parquet's annotation equality includes `isAdjustedToUTC`, so
`timeType(true, MICROS) != timeType(false, MICROS)`. Every
`case ParquetTimeMicrosType` branch in `ParquetFilters` (equality, range, IN;
eight call sites) silently falls through to `None`, so no TIME predicate is
pushed down and Parquet cannot use row-group statistics or bloom filters to
skip data for those columns.
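The equality mismatch described above can be checked in isolation. The following is a minimal sketch, assuming `parquet-column` is on the classpath; the object name `TimeAnnotationEquality` and its field names are hypothetical and exist only for this illustration.

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit

// Hypothetical helper object, not part of Spark or Parquet.
object TimeAnnotationEquality {
  // Annotation that the hardcoded ParquetTimeMicrosType constant is built from.
  val localMicros: LogicalTypeAnnotation =
    LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS)

  // Annotation the converter emits once timeIsAdjustedToUTC is true.
  val utcMicros: LogicalTypeAnnotation =
    LogicalTypeAnnotation.timeType(true, TimeUnit.MICROS)

  def main(args: Array[String]): Unit = {
    // Scala's == delegates to equals, which compares both the
    // isAdjustedToUTC flag and the time unit.
    val sameFlag = localMicros == LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS)
    val differentFlag = localMicros == utcMicros
    println(s"same flag equal:      $sameFlag")
    println(s"different flag equal: $differentFlag")
  }
}
```

Because a case class such as `ParquetSchemaType` compares its fields with `equals`, any constant pattern built from `localMicros` inherits this mismatch.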
Fix: change `ParquetFilters` to match `TimeLogicalTypeAnnotation` regardless
of `isAdjustedToUTC`. The cleanest approach is a wildcard pattern instead of
the hardcoded constant:
```scala
// in all makeEq/makeLt/etc. partial functions, replace:
case ParquetTimeMicrosType =>
// with:
case ParquetSchemaType(_: TimeLogicalTypeAnnotation, INT64, 0) =>
```
And add a test in `ParquetFilterSuite` verifying that TIME predicates push
down when `spark.sql.parquet.timeIsAdjustedToUTC=true`.
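To illustrate the difference between the constant pattern and the suggested wildcard pattern, here is a rough sketch, assuming `parquet-column` is on the classpath. The `ParquetSchemaType` case class below is a hypothetical stand-in for the private one in `ParquetFilters`, and the object and method names are invented for this example. The `getUnit` guard is an addition not present in the suggestion above: it assumes that INT64 TIME columns with a non-MICROS unit (such as NANOS) should not be treated as microseconds.

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimeLogicalTypeAnnotation, TimeUnit}
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName.INT64

// Hypothetical sketch, not the actual ParquetFilters code.
object TimePatternSketch {
  // Stand-in for the private case class used inside ParquetFilters.
  final case class ParquetSchemaType(
      logicalTypeAnnotation: LogicalTypeAnnotation,
      primitiveTypeName: PrimitiveTypeName,
      length: Int)

  private val ParquetTimeMicrosType =
    ParquetSchemaType(LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS), INT64, 0)

  // Constant pattern: matches only when isAdjustedToUTC is false.
  def matchesConstant(t: ParquetSchemaType): Boolean = t match {
    case ParquetTimeMicrosType => true
    case _ => false
  }

  // Type pattern: ignores isAdjustedToUTC; the guard (an assumption of
  // this sketch) still restricts the match to microsecond precision.
  def matchesAnyTimeMicros(t: ParquetSchemaType): Boolean = t match {
    case ParquetSchemaType(a: TimeLogicalTypeAnnotation, INT64, 0)
        if a.getUnit == TimeUnit.MICROS => true
    case _ => false
  }

  def timeColumn(adjustedToUTC: Boolean, unit: TimeUnit): ParquetSchemaType =
    ParquetSchemaType(LogicalTypeAnnotation.timeType(adjustedToUTC, unit), INT64, 0)
}
```

With these definitions, `matchesConstant` rejects `timeColumn(true, TimeUnit.MICROS)` while `matchesAnyTimeMicros` accepts it; whether the unit guard is needed in practice depends on which annotations can actually reach `ParquetFilters`.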