uros-b commented on code in PR #56661:
URL: https://github.com/apache/spark/pull/56661#discussion_r3475290421
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/types/ops/TimeTypeParquetOps.scala:
##########
@@ -122,33 +139,82 @@ private[ops] object TimeTypeParquetOps {
}
/**
- * Validates that a Parquet field can be decoded as TimeType. TimeType is
written as INT64 with
- * TIME(MICROS, isAdjustedToUTC=false) for precision 0..6 and TIME(NANOS,
isAdjustedToUTC=false)
- * for precision 7..9. On read, any INT64 TIME(MICROS) or TIME(NANOS) column
is accepted
- * regardless of the isAdjustedToUTC flag: Spark's zone-less TimeType
decodes the raw
- * time-of-day identically either way, matching the legacy
ParquetRowConverter guard (see
- * SPARK-57416). Any other encoding (raw INT64, INT32 TIME(MILLIS), INT64
TIMESTAMP(_),
- * decimal-annotated, etc.) cannot be decoded as TimeType - throw the same
error as the legacy
- * ParquetRowConverter path so reads fail loudly instead of silently
misinterpreting bytes.
+ * Whether a Parquet field is a canonical INT64 TIME(MICROS) (precision
0..6) or TIME(NANOS)
+ * (precision 7..9) column - the only encodings Spark can decode as
TimeType. The isAdjustedToUTC
+ * flag is intentionally ignored: Spark's TimeType is zone-less local time,
so the raw
+ * time-of-day decodes identically either way, matching the legacy
ParquetRowConverter guard
+ * (see SPARK-57416). Any other encoding (raw INT64, INT32 TIME(MILLIS),
INT64 TIMESTAMP(_),
+ * decimal-annotated, etc.) is incompatible. Shared by both the row-based
and vectorized read
+ * paths so they accept/reject the same set.
*/
- private[ops] def requireCompatibleParquetType(
- sparkType: TimeType, parquetType: Type): Unit = {
- val ok = parquetType.isPrimitive &&
+ private[ops] def isCompatibleParquetType(parquetType: Type): Boolean =
+ parquetType.isPrimitive &&
parquetType.asPrimitiveType.getPrimitiveTypeName == INT64 &&
(parquetType.getLogicalTypeAnnotation match {
case t: LogicalTypeAnnotation.TimeLogicalTypeAnnotation =>
- // Accept both MICROS (precision 0..6) and NANOS (precision 7..9),
and both
- // isAdjustedToUTC=false and =true. Spark's TimeType is zone-less
local time, so the
- // UTC-adjustment flag carries no extra information on read: the raw
time-of-day value
- // decodes identically either way. Mirroring the legacy
ParquetRowConverter guard keeps
- // the framework row-based read path consistent with the legacy and
vectorized readers.
- // SPARK-57416.
t.getUnit == TimeUnit.MICROS || t.getUnit == TimeUnit.NANOS
case _ => false
})
- if (!ok) {
+
+ /**
+ * Validates that a Parquet field can be decoded as TimeType on the
row-based path, throwing the
+ * same error as the legacy ParquetRowConverter so incompatible reads fail
loudly instead of
+ * silently misinterpreting bytes. See [[isCompatibleParquetType]] for the
accepted encodings.
+ */
+ private[ops] def requireCompatibleParquetType(
+ sparkType: TimeType, parquetType: Type): Unit = {
+ if (!isCompatibleParquetType(parquetType)) {
throw QueryExecutionErrors.cannotCreateParquetConverterForDataTypeError(
sparkType, parquetType.toString)
}
}
}
+
+/**
+ * Vectorized (batch) updater for TimeType: reads an INT64 TIME column into
the internal
+ * nanos-of-day representation and truncates it to the requested precision.
`fileStoresNanos`
+ * selects the on-disk unit - TIME(NANOS) stores nanos (identity),
TIME(MICROS) stores micros
+ * (converted to nanos). Mirrors the row-based newConverter path so the
vectorized and row-based
+ * readers agree; replaces the former
`ParquetVectorUpdaterFactory.TimeUpdater`, now owned by the
+ * type's ops.
+ */
+private[ops] class TimeVectorUpdater(precision: Int, fileStoresNanos: Boolean)
+ extends ParquetVectorUpdater {
+ private def toTruncatedNanos(value: Long): Long = {
Review Comment:
The new TimeVectorUpdater.toTruncatedNanos calls the shared
DateTimeUtils.truncateTimeToPrecision per value; but that function is not
table-backed, it recomputes math.pow on every call. Suggested resolution: make
truncateTimeToPrecision itself table-backed (a 10^k lookup for k in 0..9)
instead of math.pow. Then the row path, the vectorized path, and Cast all share
one source of truth and the per-value cost drops to a table lookup; and the
"table-backed" comment becomes accurate.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]