Re: [PR] [SPARK-55444][SQL] Introduce and Route TimeType to Parquet vectorized read through the Types Framework [spark]

via GitHub Thu, 25 Jun 2026 07:45:46 -0700


uros-b commented on code in PR #56661:
URL: https://github.com/apache/spark/pull/56661#discussion_r3475290421



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/types/ops/TimeTypeParquetOps.scala:
##########
@@ -122,33 +139,82 @@ private[ops] object TimeTypeParquetOps {
     }
 
   /**
-   * Validates that a Parquet field can be decoded as TimeType. TimeType is 
written as INT64 with
-   * TIME(MICROS, isAdjustedToUTC=false) for precision 0..6 and TIME(NANOS, 
isAdjustedToUTC=false)
-   * for precision 7..9. On read, any INT64 TIME(MICROS) or TIME(NANOS) column 
is accepted
-   * regardless of the isAdjustedToUTC flag: Spark's zone-less TimeType 
decodes the raw
-   * time-of-day identically either way, matching the legacy 
ParquetRowConverter guard (see
-   * SPARK-57416). Any other encoding (raw INT64, INT32 TIME(MILLIS), INT64 
TIMESTAMP(_),
-   * decimal-annotated, etc.) cannot be decoded as TimeType - throw the same 
error as the legacy
-   * ParquetRowConverter path so reads fail loudly instead of silently 
misinterpreting bytes.
+   * Whether a Parquet field is a canonical INT64 TIME(MICROS) (precision 
0..6) or TIME(NANOS)
+   * (precision 7..9) column - the only encodings Spark can decode as 
TimeType. The isAdjustedToUTC
+   * flag is intentionally ignored: Spark's TimeType is zone-less local time, 
so the raw
+   * time-of-day decodes identically either way, matching the legacy 
ParquetRowConverter guard
+   * (see SPARK-57416). Any other encoding (raw INT64, INT32 TIME(MILLIS), 
INT64 TIMESTAMP(_),
+   * decimal-annotated, etc.) is incompatible. Shared by both the row-based 
and vectorized read
+   * paths so they accept/reject the same set.
    */
-  private[ops] def requireCompatibleParquetType(
-      sparkType: TimeType, parquetType: Type): Unit = {
-    val ok = parquetType.isPrimitive &&
+  private[ops] def isCompatibleParquetType(parquetType: Type): Boolean =
+    parquetType.isPrimitive &&
       parquetType.asPrimitiveType.getPrimitiveTypeName == INT64 &&
       (parquetType.getLogicalTypeAnnotation match {
         case t: LogicalTypeAnnotation.TimeLogicalTypeAnnotation =>
-          // Accept both MICROS (precision 0..6) and NANOS (precision 7..9), 
and both
-          // isAdjustedToUTC=false and =true. Spark's TimeType is zone-less 
local time, so the
-          // UTC-adjustment flag carries no extra information on read: the raw 
time-of-day value
-          // decodes identically either way. Mirroring the legacy 
ParquetRowConverter guard keeps
-          // the framework row-based read path consistent with the legacy and 
vectorized readers.
-          // SPARK-57416.
           t.getUnit == TimeUnit.MICROS || t.getUnit == TimeUnit.NANOS
         case _ => false
       })
-    if (!ok) {
+
+  /**
+   * Validates that a Parquet field can be decoded as TimeType on the 
row-based path, throwing the
+   * same error as the legacy ParquetRowConverter so incompatible reads fail 
loudly instead of
+   * silently misinterpreting bytes. See [[isCompatibleParquetType]] for the 
accepted encodings.
+   */
+  private[ops] def requireCompatibleParquetType(
+      sparkType: TimeType, parquetType: Type): Unit = {
+    if (!isCompatibleParquetType(parquetType)) {
       throw QueryExecutionErrors.cannotCreateParquetConverterForDataTypeError(
         sparkType, parquetType.toString)
     }
   }
 }
+
+/**
+ * Vectorized (batch) updater for TimeType: reads an INT64 TIME column into 
the internal
+ * nanos-of-day representation and truncates it to the requested precision. 
`fileStoresNanos`
+ * selects the on-disk unit - TIME(NANOS) stores nanos (identity), 
TIME(MICROS) stores micros
+ * (converted to nanos). Mirrors the row-based newConverter path so the 
vectorized and row-based
+ * readers agree; replaces the former 
`ParquetVectorUpdaterFactory.TimeUpdater`, now owned by the
+ * type's ops.
+ */
+private[ops] class TimeVectorUpdater(precision: Int, fileStoresNanos: Boolean)
+    extends ParquetVectorUpdater {
+  private def toTruncatedNanos(value: Long): Long = {

Review Comment:
   The new TimeVectorUpdater.toTruncatedNanos calls the shared 
DateTimeUtils.truncateTimeToPrecision per value; but that function is not 
table-backed, it recomputes math.pow on every call. Suggested resolution: make 
truncateTimeToPrecision itself table-backed (a 10^k lookup for k in 0..9) 
instead of math.pow. Then the row path, the vectorized path, and Cast all share 
one source of truth and the per-value cost drops to a table lookup; and the 
"table-backed" comment becomes accurate.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-55444][SQL] Introduce and Route TimeType to Parquet vectorized read through the Types Framework [spark]

Reply via email to