[PR] [SPARK-56872][SQL] Fix NPE in DowncastLongUpdater.decodeSingleDictionaryId [spark]

via GitHub Thu, 14 May 2026 22:15:50 -0700


LuciferYang opened a new pull request, #55890:
URL: https://github.com/apache/spark/pull/55890


   ### What changes were proposed in this pull request?
   
   `DowncastLongUpdater` (selected for reading INT64 DECIMAL columns into a 
Spark target whose precision is `<= 9`) targets a 32-bit decimal column vector, 
which is backed by `intData[]`; `longData[]` is unallocated. Its 
`decodeSingleDictionaryId` previously called `values.putLong(...)`, which NPE'd 
as soon as that path was actually exercised.
   
   The fix narrows the dictionary's long value to int with the same 
`(int) longValue` cast already used by `readValue` and `readValues`:
   
   ```java
   values.putInt(offset, (int) dictionary.decodeToLong(dictionaryIds.getDictId(offset)));
   ```
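
To make the failure mode concrete, here is a minimal, self-contained sketch of the same shape. It does not use Spark or Parquet classes: `ToyDecimalVector`, `ToyDictionary`, `decodeBuggy`, and `decodeFixed` are hypothetical stand-ins that only mirror the description above (an int-backed vector whose long buffer is never allocated). The narrowing cast loses nothing as long as the stored values honor the declared precision, since an unscaled DECIMAL with precision `<= 9` has magnitude at most 999,999,999, which is below `Integer.MAX_VALUE` (2,147,483,647).

```java
// Toy model only: ToyDecimalVector and ToyDictionary are hypothetical stand-ins,
// not Spark's column vector or Parquet's dictionary classes.
final class DowncastSketch {

    /** Vector for a decimal with precision <= 9: only the int buffer exists. */
    static final class ToyDecimalVector {
        final int[] intData;
        final long[] longData = null; // never allocated for a 32-bit decimal

        ToyDecimalVector(int capacity) {
            this.intData = new int[capacity];
        }

        void putInt(int rowId, int value) {
            intData[rowId] = value;
        }

        void putLong(int rowId, long value) {
            longData[rowId] = value; // throws NullPointerException: buffer is null
        }
    }

    /** Dictionary page holding INT64 unscaled decimal values. */
    static final class ToyDictionary {
        private final long[] entries;

        ToyDictionary(long... entries) {
            this.entries = entries;
        }

        long decodeToLong(int id) {
            return entries[id];
        }
    }

    // Shape of the call before the fix: writes to the unallocated long buffer.
    static void decodeBuggy(int offset, ToyDecimalVector values, int[] dictIds, ToyDictionary dict) {
        values.putLong(offset, dict.decodeToLong(dictIds[offset]));
    }

    // Shape of the call after the fix: narrow to int, write to the int buffer.
    static void decodeFixed(int offset, ToyDecimalVector values, int[] dictIds, ToyDictionary dict) {
        values.putInt(offset, (int) dict.decodeToLong(dictIds[offset]));
    }

    public static void main(String[] args) {
        ToyDictionary dict = new ToyDictionary(999_999_999L, -12_345L);
        int[] dictIds = {1, 0};
        ToyDecimalVector values = new ToyDecimalVector(dictIds.length);

        for (int i = 0; i < dictIds.length; i++) {
            decodeFixed(i, values, dictIds, dict);
        }
        System.out.println(values.intData[0] + " " + values.intData[1]);

        try {
            decodeBuggy(0, values, dictIds, dict);
        } catch (NullPointerException e) {
            System.out.println("putLong on an int-backed vector: NPE");
        }
    }
}
```

In this toy, the largest 9-digit unscaled value and a negative value both survive the `(int)` cast unchanged, while the `putLong` variant fails on the first call.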
   
   ### Why are the changes needed?
   
   This is a latent bug going back to SPARK-35640 (Jun 2021). It went 
undetected because the path is only reachable when:
   
   1. The Parquet column is stored as **INT64** with logical type 
**DECIMAL(precision <= 9)** — which Spark's own writer never produces (it emits 
INT32 for `DECIMAL(p<=9)`); only external writers (Hive, Impala, ...) emit this 
form.
   2. The Spark read schema targets a **DecimalType with precision <= 9**, so 
the factory routes to `DowncastLongUpdater`.
   3. The vectorized reader has to **eagerly drain** dictionary IDs — for 
example when parquet-mr starts dictionary-encoded and then falls back to PLAIN 
mid-column. The normal lazy-dictionary path (where decoding happens at row read 
time via `ParquetDictionary`) bypasses this updater method entirely, which is 
why everyday workloads never hit it.
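
The difference between the lazy and eager paths in condition 3 can be illustrated with a toy model. This is a sketch under assumptions, not the vectorized reader's code: `ToyColumn`, `getInt`, `drainDictionary`, and `readWithFallback` are hypothetical names. While the dictionary is attached, rows are decoded only when read; once a PLAIN page follows, the pending dictionary IDs have to be materialized into the value buffer first, and that materialization step is where the updater method from this PR is involved.

```java
// Toy model only: ToyColumn is a hypothetical stand-in, not Spark's reader.
final class LazyVsEagerSketch {

    static final class ToyColumn {
        final int[] intData;   // materialized values
        final int[] dictIds;   // dictionary ids for rows not yet materialized
        long[] dictionary;     // non-null while rows are still dictionary-encoded

        ToyColumn(int capacity) {
            intData = new int[capacity];
            dictIds = new int[capacity];
        }

        // Lazy path: a row is decoded only when it is read.
        int getInt(int rowId) {
            return dictionary != null ? (int) dictionary[dictIds[rowId]] : intData[rowId];
        }

        // Eager path: materialize rows [0, count) so PLAIN values can follow them.
        void drainDictionary(int count) {
            for (int i = 0; i < count; i++) {
                intData[i] = (int) dictionary[dictIds[i]];
            }
            dictionary = null; // the column is now plain values only
        }
    }

    // Three dictionary-encoded rows followed by one PLAIN row.
    static ToyColumn readWithFallback() {
        ToyColumn col = new ToyColumn(4);
        col.dictionary = new long[] {1050L, 2599L};
        col.dictIds[0] = 0;
        col.dictIds[1] = 1;
        col.dictIds[2] = 0;

        // The next page is PLAIN: drain the pending ids before appending to it.
        col.drainDictionary(3);
        col.intData[3] = 777;
        return col;
    }

    public static void main(String[] args) {
        ToyColumn col = readWithFallback();
        for (int i = 0; i < 4; i++) {
            System.out.println(col.getInt(i));
        }
    }
}
```

If every page in the toy stayed dictionary-encoded, `drainDictionary` would never run, which matches the explanation of why the bug stayed hidden.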
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes — reads that previously failed with a `NullPointerException` now succeed 
and return the correct values.
   
   ### How was this patch tested?
   
   Added a regression test in `ParquetIOSuite` that writes INT64 DECIMAL(9, 2) 
via parquet-mr's low-level writer with a mixed-cardinality pattern (80% from a 
4-value pool, 20% unique-per-row, 5000 rows). This forces the 
dictionary-to-PLAIN fallback that triggers the eager-decode path. The test 
NPE'd on master without this fix and now passes.
   
   ```
   build/sbt 'sql/testOnly *ParquetIOSuite'
   ...
   [info] Tests: succeeded 92, failed 0, canceled 0, ignored 0, pending 0
   ```
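
For readers who want to picture the test data, the following sketch generates a value pattern with the proportions described above. It is an illustration only, not the test's code: the pool values, the choice of making every fifth row unique, and the names `MixedCardinalitySketch` and `generate` are assumptions. Whether such a pattern actually makes a writer abandon dictionary encoding depends on its dictionary page size settings, which this description does not state.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: pool values and row ordering are assumptions.
final class MixedCardinalitySketch {
    static final int ROWS = 5000;
    static final long[] POOL = {100L, 200L, 300L, 400L}; // hypothetical unscaled values

    // 80% of rows come from the 4-value pool, 20% get a value unique to the row.
    static long[] generate() {
        long[] out = new long[ROWS];
        for (int i = 0; i < ROWS; i++) {
            if (i % 5 == 4) {
                out[i] = 1_000_000L + i;        // unique per row, still within 9 digits
            } else {
                out[i] = POOL[i % POOL.length]; // low-cardinality part
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> distinct = new HashSet<>();
        for (long v : generate()) {
            distinct.add(v);
        }
        System.out.println("distinct values: " + distinct.size()); // 4 pool + 1000 unique = 1004
    }
}
```

The low-cardinality rows favor dictionary encoding, while the unique rows keep adding dictionary entries as the column grows.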
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

