[
https://issues.apache.org/jira/browse/SPARK-57583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Max Gekk updated SPARK-57583:
-----------------------------
Description:
h2. Background
SPARK-57551 added read-time precision truncation for {{TIME}} columns in the
vectorized Parquet reader: a {{TIME(NANOS)}} / {{TIME(MICROS)}} column read
with an explicit lower precision (e.g. {{TIME(7)}}) must drop the sub-precision
digits. The decoded value now passes through a precision-aware {{TimeUpdater}}
(in {{ParquetVectorUpdaterFactory}}).
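For illustration, the truncation step could look roughly like the sketch below. It is not the code merged in SPARK-57551. It assumes the decoded value is a non-negative count of nanoseconds since midnight and that precision is the number of fractional-second digits (0..9). The class name and method bodies are hypothetical; only the name {{truncateTimeToPrecision}} comes from this ticket.

```java
// Hypothetical sketch: precision truncation for a TIME value held as nanoseconds.
final class TimeTruncationSketch {
  // Assumption: nanosecond resolution, i.e. at most 9 fractional digits.
  static final int MAX_PRECISION = 9;

  // Drops every fractional digit below the requested precision.
  static long truncateTimeToPrecision(long nanos, int precision) {
    if (precision < 0 || precision > MAX_PRECISION) {
      throw new IllegalArgumentException("precision out of range: " + precision);
    }
    long factor = 1L;
    for (int i = precision; i < MAX_PRECISION; i++) {
      factor *= 10L; // ends as 10^(9 - precision)
    }
    // floorMod keeps the result rounded towards negative infinity even for
    // unexpected negative inputs.
    return nanos - Math.floorMod(nanos, factor);
  }

  // TIME(MICROS) values are widened to nanoseconds before truncation.
  static long microsToNanos(long micros) {
    return Math.multiplyExact(micros, 1000L);
  }

  public static void main(String[] args) {
    long t = 45296123456789L; // 12:34:56.123456789 as nanoseconds since midnight
    System.out.println(truncateTimeToPrecision(t, 7));
  }
}
```

With these assumptions, reading 12:34:56.123456789 as {{TIME(7)}} keeps 12:34:56.1234567 and zeroes the last two digits.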
For dictionary-encoded columns the vectorized reader has two paths (see
{{VectorizedColumnReader.readBatch}}):
* *eager* decode via {{updater.decodeDictionaryIds(...)}} (runs the updater, so
truncation applies);
* *lazy* decode, where the column keeps the dictionary IDs and a
{{ParquetDictionary}} resolves values on demand via {{decodeToLong}},
*bypassing the updater*.
{{ParquetDictionary}} only supports fixed reinterpretations (unsigned int32 ->
long, unsigned int64 -> decimal bytes) toggled by a single {{needTransform}}
boolean; it carries no unit or precision, so it cannot apply micros->nanos
conversion or precision truncation.
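The difference between the two paths can be shown with a toy model. This is a simplified sketch, not the actual {{VectorizedColumnReader}} or {{ParquetDictionary}} code: the dictionary is a plain {{long[]}}, the updater is a {{LongUnaryOperator}}, and all class and method names are hypothetical.

```java
import java.util.function.LongUnaryOperator;

// Hypothetical toy model of eager vs. lazy dictionary decoding.
final class DictionaryDecodeSketch {

  // Eager path: every dictionary ID is resolved immediately and the value
  // goes through the updater, so any transform (e.g. truncation) is applied.
  static long[] decodeEagerly(long[] dictionary, int[] ids, LongUnaryOperator updater) {
    long[] out = new long[ids.length];
    for (int i = 0; i < ids.length; i++) {
      out[i] = updater.applyAsLong(dictionary[ids[i]]);
    }
    return out;
  }

  // Lazy path: the column keeps the IDs and resolves values on demand
  // straight from the dictionary. The updater is never consulted.
  static final class LazyColumn {
    private final long[] dictionary;
    private final int[] ids;

    LazyColumn(long[] dictionary, int[] ids) {
      this.dictionary = dictionary;
      this.ids = ids;
    }

    long getLong(int rowId) {
      return dictionary[ids[rowId]]; // raw on-disk value, no transform
    }
  }

  public static void main(String[] args) {
    long[] dictionary = {45296123456789L};
    int[] ids = {0, 0};
    LongUnaryOperator truncateTo7Digits = v -> v - v % 100L;
    System.out.println(decodeEagerly(dictionary, ids, truncateTo7Digits)[0]);
    System.out.println(new LazyColumn(dictionary, ids).getLong(0));
  }
}
```

In this model the eager path returns the truncated value, while the lazy path returns the stored value unchanged. That is the gap described above.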
h2. Decision in SPARK-57551 (option A)
To keep the fix consistent with every other arithmetic INT64 transform
({{TIMESTAMP_MILLIS}}, {{TIME(MICROS)}}, timestamp rebase) -- all of which
already opt out of lazy decoding -- SPARK-57551 disabled lazy dictionary
decoding for {{TIME(NANOS)}} in
{{VectorizedColumnReader.isLazyDecodingSupported}}. Dictionary-encoded
{{TIME(NANOS)}} columns therefore eager-decode through the truncating updater.
This is correct and matches the existing behavior of {{TIME(MICROS)}}.
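As a rough illustration of option A, the opt-out can be thought of as a gate like the one below. This is a deliberately simplified, hypothetical model: the real {{isLazyDecodingSupported}} takes the Parquet primitive type and the Spark type, and the enum here only lists the INT64 cases named in this ticket plus a plain INT64 column.

```java
// Hypothetical, simplified model of the lazy-decoding gate for INT64 columns.
final class LazyDecodingGateSketch {

  // Assumption: one enum value per INT64 case mentioned in this ticket.
  enum Int64Kind { PLAIN_INT64, TIMESTAMP_MILLIS, TIMESTAMP_REBASE, TIME_MICROS, TIME_NANOS }

  // Returns false for every case that needs arithmetic on the decoded value,
  // because the lazy path would return the raw dictionary entry.
  static boolean isLazyDecodingSupported(Int64Kind kind) {
    switch (kind) {
      case TIMESTAMP_MILLIS: // millis -> micros conversion
      case TIMESTAMP_REBASE: // calendar rebase
      case TIME_MICROS:      // micros -> nanos conversion
      case TIME_NANOS:       // precision truncation (option A)
        return false;
      default:
        return true;         // value is used exactly as stored
    }
  }

  public static void main(String[] args) {
    for (Int64Kind kind : Int64Kind.values()) {
      System.out.println(kind + " lazy=" + isLazyDecodingSupported(kind));
    }
  }
}
```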
h2. Proposed improvement (option B)
Preserve the lazy dictionary-decoding optimization for {{TIME}} columns by
making the lazy path itself precision/unit aware, rather than disabling it:
* extend {{ParquetDictionary}} (or introduce a TIME-specific dictionary
wrapper) to carry the on-disk unit and the requested precision, and apply
micros->nanos conversion plus {{truncateTimeToPrecision}} in {{decodeToLong}};
* re-enable lazy decoding for {{TIME}} in
{{VectorizedColumnReader.isLazyDecodingSupported}};
* ideally do this uniformly for both {{TIME(MICROS)}} and {{TIME(NANOS)}} so
the two units behave the same.
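One possible shape for option B is sketched below, under stated assumptions. The wrapper class, the {{StoredUnit}} enum, and the constructor are hypothetical; the real {{ParquetDictionary}} wraps a Parquet {{Dictionary}} rather than a {{long[]}}. Only {{decodeToLong}} and {{truncateTimeToPrecision}} are names taken from this ticket, and their bodies here are illustrative.

```java
// Hypothetical TIME-specific dictionary wrapper that is unit/precision aware.
final class TimeAwareDictionarySketch {

  // Assumption: the on-disk unit of the Parquet TIME column.
  enum StoredUnit { MICROS, NANOS }

  private final long[] rawValues;   // stand-in for the Parquet dictionary page
  private final StoredUnit unit;    // on-disk unit
  private final int precision;      // requested fractional digits, 0..9

  TimeAwareDictionarySketch(long[] rawValues, StoredUnit unit, int precision) {
    if (precision < 0 || precision > 9) {
      throw new IllegalArgumentException("precision out of range: " + precision);
    }
    this.rawValues = rawValues;
    this.unit = unit;
    this.precision = precision;
  }

  // Lazy lookup that applies the same transforms the eager updater would.
  long decodeToLong(int id) {
    long raw = rawValues[id];
    long nanos = (unit == StoredUnit.MICROS) ? Math.multiplyExact(raw, 1000L) : raw;
    return truncateTimeToPrecision(nanos, precision);
  }

  private static long truncateTimeToPrecision(long nanos, int precision) {
    long factor = 1L;
    for (int i = precision; i < 9; i++) {
      factor *= 10L; // ends as 10^(9 - precision)
    }
    return nanos - Math.floorMod(nanos, factor);
  }

  public static void main(String[] args) {
    TimeAwareDictionarySketch dict =
        new TimeAwareDictionarySketch(new long[] {45296123456789L}, StoredUnit.NANOS, 7);
    System.out.println(dict.decodeToLong(0));
  }
}
```

A variant of the same idea would transform each dictionary entry once when the wrapper is built, so that repeated lookups of the same ID do no arithmetic. Whether that is worthwhile depends on dictionary size and access pattern and is left open here.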
h2. Notes
This is a niche read-path optimization: it concerns dictionary-encoded {{TIME}}
columns read at a lower precision than stored. Correctness is already handled
by option A, so there is no user-facing behavior change. Found during review of
SPARK-57551 (PR https://github.com/apache/spark/pull/56622).
> Support lazy dictionary decoding for nanosecond TIME in the vectorized
> Parquet reader
> -------------------------------------------------------------------------------------
>
> Key: SPARK-57583
> URL: https://issues.apache.org/jira/browse/SPARK-57583
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 5.0.0
> Reporter: Max Gekk
> Priority: Major