[ 
https://issues.apache.org/jira/browse/SPARK-57583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-57583:
-----------------------------
    Description: 
h2. Background

SPARK-57551 added read-time precision truncation for {{TIME}} columns in the 
vectorized Parquet reader: a {{TIME(NANOS)}} / {{TIME(MICROS)}} column read 
with an explicit lower precision (e.g. {{TIME(7)}}) must drop the sub-precision 
digits. The decoded value now passes through a precision-aware {{TimeUpdater}} 
(in {{ParquetVectorUpdaterFactory}}).
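
The truncation itself is plain integer arithmetic. Below is a minimal sketch, assuming the in-memory {{TIME}} value is a non-negative count of nanoseconds since midnight and that precision is the number of fractional-second digits (0 to 9). The class and method names are hypothetical stand-ins for illustration, not Spark's actual {{TimeUpdater}} or {{truncateTimeToPrecision}} code.

```java
final class TimeTruncationSketch {
    private TimeTruncationSketch() {}

    /**
     * Hypothetical helper: drops fractional-second digits beyond `precision`.
     * Assumes `nanosOfDay` counts nanoseconds since midnight.
     */
    static long truncateNanosToPrecision(long nanosOfDay, int precision) {
        if (precision < 0 || precision > 9) {
            throw new IllegalArgumentException("precision must be in [0, 9]: " + precision);
        }
        // factor = 10^(9 - precision): the size of one unit at the requested precision.
        long factor = 1L;
        for (int i = precision; i < 9; i++) {
            factor *= 10L;
        }
        // Subtract the remainder so only whole units of `factor` survive.
        return nanosOfDay - Math.floorMod(nanosOfDay, factor);
    }

    public static void main(String[] args) {
        long t = 45_296_123_456_789L; // 12:34:56.123456789 as nanoseconds since midnight
        System.out.println(truncateNanosToPrecision(t, 7)); // 45296123456700
        System.out.println(truncateNanosToPrecision(t, 6)); // 45296123456000
    }
}
```

For example, reading 12:34:56.123456789 as {{TIME(7)}} keeps seven fractional digits and zeroes the last two.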

For dictionary-encoded columns the vectorized reader has two paths (see 
{{VectorizedColumnReader.readBatch}}):
* *eager* decode via {{updater.decodeDictionaryIds(...)}} (runs the updater, so 
truncation applies);
* *lazy* decode, where the column keeps the dictionary IDs and a 
{{ParquetDictionary}} resolves values on demand via {{decodeToLong}}, 
*bypassing the updater*.

{{ParquetDictionary}} only supports fixed reinterpretations (unsigned int32 -> 
long, unsigned int64 -> decimal bytes) toggled by a single {{needTransform}} 
boolean; it carries no unit or precision, so it cannot apply micros->nanos 
conversion or precision truncation.

h2. Decision in SPARK-57551 (option A)

To keep the fix consistent with every other arithmetic INT64 transform 
({{TIMESTAMP_MILLIS}}, {{TIME(MICROS)}}, timestamp rebase) -- all of which 
already opt out of lazy decoding -- SPARK-57551 disabled lazy dictionary 
decoding for {{TIME(NANOS)}} in 
{{VectorizedColumnReader.isLazyDecodingSupported}}. Dictionary-encoded 
{{TIME(NANOS)}} columns therefore eager-decode through the truncating updater. 
This is correct and matches the existing {{TIME(MICROS)}} behavior, but it 
gives up the lazy-decoding optimization for these columns.

h2. Proposed improvement (option B)

Preserve the lazy dictionary-decoding optimization for {{TIME}} columns by 
making the lazy path itself precision/unit aware, rather than disabling it:
* extend {{ParquetDictionary}} (or introduce a TIME-specific dictionary 
wrapper) to carry the on-disk unit and the requested precision, and apply 
micros->nanos conversion plus {{truncateTimeToPrecision}} in {{decodeToLong}};
* re-enable lazy decoding for {{TIME}} in 
{{VectorizedColumnReader.isLazyDecodingSupported}};
* ideally do this uniformly for both {{TIME(MICROS)}} and {{TIME(NANOS)}} so 
the two units behave the same.
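
The first bullet could look roughly like the sketch below: a dictionary wrapper that remembers the on-disk unit and the requested precision and applies both steps when a value is resolved. This is only an illustration under stated assumptions. {{LazyTimeDictionarySketch}}, its {{Unit}} enum, and the {{long[]}} standing in for the Parquet dictionary page are all hypothetical and do not reflect the real {{ParquetDictionary}} API, and the in-memory value is assumed to be nanoseconds since midnight.

```java
final class LazyTimeDictionarySketch {
    /** On-disk unit of the INT64 TIME values (hypothetical enum, not a Spark type). */
    enum Unit { MICROS, NANOS }

    private final long[] rawValues;      // stand-in for the Parquet dictionary page
    private final Unit unit;             // unit the values were written with
    private final long truncationFactor; // 10^(9 - requested precision)

    LazyTimeDictionarySketch(long[] rawValues, Unit unit, int precision) {
        if (precision < 0 || precision > 9) {
            throw new IllegalArgumentException("precision must be in [0, 9]: " + precision);
        }
        this.rawValues = rawValues.clone();
        this.unit = unit;
        long factor = 1L;
        for (int i = precision; i < 9; i++) {
            factor *= 10L;
        }
        this.truncationFactor = factor;
    }

    /** Resolves a dictionary ID on demand, converting and truncating at lookup time. */
    long decodeToLong(int id) {
        long raw = rawValues[id];
        // Step 1: normalize to nanoseconds (micros -> nanos conversion).
        long nanos = (unit == Unit.MICROS) ? Math.multiplyExact(raw, 1_000L) : raw;
        // Step 2: drop digits below the requested precision.
        return nanos - Math.floorMod(nanos, truncationFactor);
    }

    public static void main(String[] args) {
        LazyTimeDictionarySketch nanosDict =
            new LazyTimeDictionarySketch(new long[] {45_296_123_456_789L}, Unit.NANOS, 7);
        System.out.println(nanosDict.decodeToLong(0)); // 45296123456700

        LazyTimeDictionarySketch microsDict =
            new LazyTimeDictionarySketch(new long[] {45_296_123_456L}, Unit.MICROS, 3);
        System.out.println(microsDict.decodeToLong(0)); // 45296123000000
    }
}
```

Computing the truncation factor once in the constructor keeps the per-lookup cost to a multiply and a modulo. Whether that per-lookup work still beats eager decoding would need to be measured.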

h2. Notes

This is a niche read-path optimization for dictionary-encoded {{TIME}} columns 
read at a lower precision than stored. Correctness is already handled by 
option A, so there is no user-facing behavior change. Found during review of 
SPARK-57551 (PR https://github.com/apache/spark/pull/56622).


> Support lazy dictionary decoding for nanosecond TIME in the vectorized 
> Parquet reader
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-57583
>                 URL: https://issues.apache.org/jira/browse/SPARK-57583
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 5.0.0
>            Reporter: Max Gekk
>            Priority: Major


