MaxGekk commented on PR #55920:
URL: https://github.com/apache/spark/pull/55920#issuecomment-4766943956
Following up on the benchmark note — concretely, here's the
precursor-baseline flow (the same one your #55922 / #55924 already get for free
by regenerating an existing benchmark file):
**Step 1 — land the benchmark against `master` first (separate, small PR):**
1. Open a benchmark-only PR off current `master` containing just
`ParquetDictionaryDecodeBenchmark.scala`, with **no production change**, so that it
measures today's unoptimized `decodeDictionaryIds`.
2. Generate the result files for the three JDKs. Locally for the host JDK:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt \
  "sql/Test/runMain org.apache.spark.sql.execution.datasources.parquet.ParquetDictionaryDecodeBenchmark"
```
and the JDK 21 / 25 files via the `Run benchmarks` GitHub Actions
workflow on your fork (the same one you linked in the description).
3. Commit `ParquetDictionaryDecodeBenchmark-results.txt`,
`-jdk21-results.txt`, `-jdk25-results.txt`. These are the **baseline**. Merge
that benchmark-only PR.
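Before committing in step 3, a quick sanity check that all three result files were actually produced can save a round trip. The helper below is only a sketch: `check_result_files` is a hypothetical name, and the directory holding the results is passed in by the caller (for `sql` benchmarks that is usually `sql/core/benchmarks`, but treat that as an assumption).

```shell
#!/bin/sh
# Sketch: report which of the three expected result files are present.
# Usage: check_result_files <results-dir> [benchmark-name]
# The default benchmark name is the one discussed in this comment; the
# results directory is an assumption supplied by the caller.
check_result_files() {
  dir=$1
  name=${2:-ParquetDictionaryDecodeBenchmark}
  missing=0
  # "" is the host-JDK file; the other two come from the GitHub Actions runs.
  for suffix in "" "-jdk21" "-jdk25"; do
    file="${dir}/${name}${suffix}-results.txt"
    if [ -s "$file" ]; then
      echo "ok       $file"
    else
      echo "missing  $file"
      missing=1
    fi
  done
  # Non-zero exit status if any file is absent or empty.
  return "$missing"
}
```

For example, `check_result_files sql/core/benchmarks` run from the repository root would list each file as `ok` or `missing`.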
**Step 2 — rebase this PR on top:**
4. Rebase #55920 onto the updated `master`. The benchmark source now already
exists there, so drop it from this PR's diff — keep only the
`ParquetVectorUpdater` / `ParquetVectorUpdaterFactory` production change.
5. Regenerate the same three result files with the command above (now
running the optimized code) and commit them.
**Result:** this PR's diff to the three `.txt` files becomes `baseline ->
optimized` — the speedup is visible right in the PR, reviewable, and guarded
against regression going forward. It also avoids trying to A/B the inlining
effect inside one process, which is unreliable since both class shapes load and
get profiled together.
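To read the `baseline -> optimized` change without eyeballing the raw diff, something like the following could summarize it per benchmark case. This is a rough sketch, not part of Spark's tooling: `extract_best_times` and `compare_results` are hypothetical helpers, and the parsing assumes each data row ends with the six usual columns (Best Time, Avg Time, Stdev, Rate, Per Row, Relative such as `1.0X`).

```shell
#!/bin/sh
# Sketch: print "<case name>\t<best time ms>" for every data row in a
# benchmark results file. Assumes rows end with six columns:
#   Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
extract_best_times() {
  awk '
    NF >= 7 && $NF ~ /^[0-9.]+X$/ {
      # Case names may contain spaces, so rebuild the name from the
      # fields that precede the six trailing numeric columns.
      name = $1
      for (i = 2; i <= NF - 6; i++) name = name " " $i
      printf "%s\t%s\n", name, $(NF - 5)
    }
  ' "$1"
}

# Sketch: print case, baseline best time, optimized best time, and speedup.
# Usage: compare_results <baseline-results.txt> <optimized-results.txt>
# Limitation: if a case name repeats within a file, the last row wins.
compare_results() {
  baseline_tsv=$(mktemp)
  optimized_tsv=$(mktemp)
  extract_best_times "$1" > "$baseline_tsv"
  extract_best_times "$2" > "$optimized_tsv"
  awk -F '\t' '
    NR == FNR { base[$1] = $2; next }
    ($1 in base) && $2 > 0 {
      printf "%s\t%s\t%s\t%.2fx\n", $1, base[$1], $2, base[$1] / $2
    }
  ' "$baseline_tsv" "$optimized_tsv"
  rm -f "$baseline_tsv" "$optimized_tsv"
}
```

The baseline copy can be pulled out of `master` with `git show master:<path-to-results-file> > baseline.txt` and compared against the regenerated file in the working tree. Numbers from different machines are not comparable, so both files should come from the same environment (the host JDK locally, or the same GitHub Actions workflow for JDK 21 / 25).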
If a separate precursor PR is too much ceremony, the alternative is to reuse an
existing benchmark that already exercises dictionary decoding and is committed
on `master`, and regenerate its result files in this PR instead: same outcome (a
before/after diff), no second PR. Either way, no additional production-code
change is needed; this is purely about making the gain reproducible from the tree.