MaxGekk commented on code in PR #56422:
URL: https://github.com/apache/spark/pull/56422#discussion_r3387610929


##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala:
##########
@@ -694,7 +704,8 @@ class SparkToParquetSchemaConverter(
 
       case _: TimeType =>
         Types.primitive(INT64, repetition)
-          .as(LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS)).named(field.name)
+          .as(LogicalTypeAnnotation.timeType(timeIsAdjustedToUTC, TimeUnit.MICROS))

Review Comment:
   This write-path change introduces a silent predicate-pushdown regression 
when `timeIsAdjustedToUTC=true`.
   
   `ParquetScanBuilder.scala:68` calls
   `new SparkToParquetSchemaConverter(sparkSession.sessionState.conf)` to build
   the Parquet schema for `ParquetFilters`. With `timeIsAdjustedToUTC=true`, the
   converter now emits `timeType(true, MICROS)` for `TimeType` columns. However,
   `ParquetFilters.scala:154` still defines:
   
   ```scala
   private val ParquetTimeMicrosType =
     ParquetSchemaType(LogicalTypeAnnotation.timeType(false, TimeUnit.MICROS), INT64, 0)
   ```
   
   Parquet's annotation equality includes `isAdjustedToUTC`, so
   `timeType(true, MICROS) != timeType(false, MICROS)`. With the flag enabled,
   every `case ParquetTimeMicrosType` branch in `ParquetFilters` (equality,
   range, and IN; eight call sites) silently falls through to `None`, so no TIME
   predicate is pushed down for evaluation against Parquet row-group statistics
   or bloom filters.
   
   Fix: make `ParquetFilters` match any `TimeLogicalTypeAnnotation`, regardless
   of `isAdjustedToUTC`. The cleanest approach is a typed wildcard pattern in
   place of the hardcoded constant:
   
   ```scala
   // in all makeEq/makeLt/etc. partial functions, replace:
   case ParquetTimeMicrosType =>
   // with:
   case ParquetSchemaType(_: TimeLogicalTypeAnnotation, INT64, 0) =>
   ```
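
   To make the difference between the two patterns concrete, here is a minimal,
   self-contained sketch. It does not use the real Parquet or Spark classes:
   `Annotation`, `TimeAnnotation`, `IntAnnotation`, `SchemaType`, and
   `TimePushdownSketch` are hypothetical stand-ins invented for illustration,
   and the string `"pushed down"` only marks which branch was taken. The sketch
   assumes the stand-in equality behaves like the Parquet annotation equality
   described above, i.e. it includes `isAdjustedToUTC`.

   ```scala
   // Simplified stand-ins for illustration only; NOT the real Parquet/Spark types.
   sealed trait Annotation
   final case class TimeAnnotation(isAdjustedToUTC: Boolean, unit: String) extends Annotation
   final case class IntAnnotation(bitWidth: Int) extends Annotation

   final case class SchemaType(annotation: Annotation, primitive: String, length: Int)

   object TimePushdownSketch {
     // Mirrors the hardcoded constant: it only describes isAdjustedToUTC = false.
     val TimeMicrosType: SchemaType =
       SchemaType(TimeAnnotation(isAdjustedToUTC = false, "MICROS"), "INT64", 0)

     // Stable-identifier pattern: matches only values that are == the constant.
     def constantMatch(t: SchemaType): Option[String] = t match {
       case TimeMicrosType => Some("pushed down")
       case _ => None
     }

     // Type pattern on the annotation: the isAdjustedToUTC flag is ignored.
     def typeMatch(t: SchemaType): Option[String] = t match {
       case SchemaType(_: TimeAnnotation, "INT64", 0) => Some("pushed down")
       case _ => None
     }

     // Variant with a guard, for the case where only microsecond precision
     // should qualify (an assumption about intent, not something the PR states).
     def typeMatchMicrosOnly(t: SchemaType): Option[String] = t match {
       case SchemaType(a: TimeAnnotation, "INT64", 0) if a.unit == "MICROS" =>
         Some("pushed down")
       case _ => None
     }

     def main(args: Array[String]): Unit = {
       val local = SchemaType(TimeAnnotation(isAdjustedToUTC = false, "MICROS"), "INT64", 0)
       val utc = SchemaType(TimeAnnotation(isAdjustedToUTC = true, "MICROS"), "INT64", 0)

       println(constantMatch(local)) // Some(pushed down)
       println(constantMatch(utc))   // None: the silent fall-through
       println(typeMatch(local))     // Some(pushed down)
       println(typeMatch(utc))       // Some(pushed down)
     }
   }
   ```

   One design point the sketch surfaces: a pure type pattern ignores the time
   unit as well as the flag. Parquet also stores `TIME(NANOS)` as `INT64`, so if
   only microsecond columns should be eligible, a guard like the one in
   `typeMatchMicrosOnly` may be worth considering.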
   
   Also add a test in `ParquetFilterSuite` verifying that TIME predicates are
   pushed down when `spark.sql.parquet.timeIsAdjustedToUTC=true`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.


