Re: [PR] [SPARK-56893][SQL] Optimize Parquet dictionary decoding with hasNull fast path and per-class updater overrides [spark]

via GitHub Mon, 22 Jun 2026 02:42:20 -0700


MaxGekk commented on PR #55920:
URL: https://github.com/apache/spark/pull/55920#issuecomment-4766943956


   Following up on the benchmark note — concretely, here's the 
precursor-baseline flow (the same one your #55922 / #55924 already get for free 
by regenerating an existing benchmark file):
   
   **Step 1 — land the benchmark against `master` first (separate, small PR):**
   1. Open a benchmark-only PR off current `master` containing just 
`ParquetDictionaryDecodeBenchmark.scala` and **no production change**, so that 
it measures today's unoptimized `decodeDictionaryIds`.
   2. Generate the result files for the three JDKs. Locally for the host JDK:
      ```
      SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt \
        "sql/Test/runMain org.apache.spark.sql.execution.datasources.parquet.ParquetDictionaryDecodeBenchmark"
      ```
      and the JDK 21 / 25 files via the `Run benchmarks` GitHub Actions 
workflow on your fork (the same one you linked in the description).
   3. Commit `ParquetDictionaryDecodeBenchmark-results.txt`, 
`-jdk21-results.txt`, `-jdk25-results.txt`. These are the **baseline**. Merge 
this PR.
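   
   A quick pre-commit check for item 3 could look like the sketch below. Both 
helper names are hypothetical; only the three file-name suffixes come from the 
list above, and the directory you pass in is whatever location the benchmark 
writes its results to.
   
   ```shell
   # Hypothetical helper: print the three result-file names for a benchmark class.
   # The suffixes ("", "-jdk21", "-jdk25") are the ones listed in item 3.
   expected_result_files() {
     bench="$1"
     for suffix in "" "-jdk21" "-jdk25"; do
       printf '%s%s-results.txt\n' "$bench" "$suffix"
     done
   }

   # Hypothetical helper: print the expected files that are missing under a directory.
   # Empty output means all three baseline files are present.
   missing_result_files() {
     dir="$1"
     bench="$2"
     expected_result_files "$bench" | while IFS= read -r name; do
       [ -f "$dir/$name" ] || printf '%s\n' "$name"
     done
   }
   ```
   
   For example, `missing_result_files <results-dir> 
ParquetDictionaryDecodeBenchmark` should print nothing once all three files 
have been generated.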
   
   **Step 2 — rebase this PR on top:**
   4. Rebase #55920 onto the updated `master`. The benchmark source now already 
exists there, so drop it from this PR's diff — keep only the 
`ParquetVectorUpdater` / `ParquetVectorUpdaterFactory` production change.
   5. Regenerate the same three result files with the command above (now 
running the optimized code) and commit them.
   
   **Result:** this PR's diff to the three `.txt` files becomes `baseline -> 
optimized` — the speedup is visible right in the PR, reviewable, and guarded 
against regression going forward. It also avoids trying to A/B the inlining 
effect inside one process, which is unreliable since both class shapes load and 
get profiled together.
   
   If a separate precursor PR is too much ceremony, the alternative is to use 
an existing dictionary-exercising benchmark that's already committed on 
`master` as this PR's benchmark and regenerate its result files instead: same 
outcome (a before/after diff), no second PR. Either way, no production-code 
change is needed; this is purely about making the gain reproducible from the 
tree.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

