luis4a0 opened a new pull request, #12107:
URL: https://github.com/apache/gluten/pull/12107
## What changes are proposed in this pull request?
Companion to #12083 (`SCOPED_TIMER` fix). With the per-stage timers in
`cpuWallTimingList_` now recording real numbers, the natural complement
is a counter that tells the reader what shape the writer actually sees
— i.e. for each `write(cb, ...)` call, how many of the input children
arrive as FLAT / DICTIONARY / CONSTANT / LAZY / other encoding (the
values of `facebook::velox::VectorEncoding::Simple`).
A new `inputEncodingCounts_` member (`std::array<int64_t, 5>`) is
maintained on `VeloxHashShuffleWriter`. The counts are taken *before*
any `getFlattenedRowVector()` invocation in `write(cb, ...)`, so they
reflect the encoding the writer actually receives — not the post-flatten
shape (which is always FLAT by construction). Output is gated through
the existing `VELOX_SHUFFLE_WRITER_LOG_FLAG` log channel and uses the
same line shape as the existing `cpuWallTimingList_` lines, so behavior
is unchanged when the flag is off (default).
This information is useful for at least three things:
- Verifying whether a workload would benefit from writer-side
dictionary / constant pass-through optimizations *before* writing
the pass-through code (does the writer ever actually see those
encodings?). Recent OSS PR #9727 (writer-side dictionary rebuild)
and several proposed constant pass-through patches all share the
same assumption that the writer sees DICT/CONST inputs — this
counter makes that assumption directly checkable on any workload.
- Spotting unexpected upstream operator changes that collapse /
preserve encodings differently across Velox versions (e.g., a
change in HashAggregate that starts emitting CONSTANT children
when it didn't before).
- Correlating a wall-time spike in a particular stage with the
encoding mix that drove it.
The counter is always-on (one branch + one increment per input
child, swamped by the surrounding shuffle work). Scope is
deliberately narrow: only the 5-bucket encoding mix at the
hash-shuffle writer entry, surfaced via the existing log channel. A
per-type breakdown (BIGINT / DECIMAL / VARCHAR / ...) and exposure
through `ShuffleWriterMetrics` to the JVM SparkListener are both
plausible follow-ups but kept out of this PR.
Diff: +277 / -0 over 4 files (3 in `cpp/velox/`, 1 new test).
## How was this patch tested?
This PR adds `cpp/velox/tests/VeloxHashShuffleWriterInputEncodingTest.cc`
(second commit) with three gtest cases:
- `allFlat` — two batches with 3 flat children each; asserts
`FLAT` bucket accumulates to 6 and the other 4 buckets stay 0.
- `mixedFlatDictConst` — one batch with FLAT + FLAT + DICTIONARY +
CONSTANT children; asserts FLAT=2, DICTIONARY=1, CONSTANT=1, and
the remaining 2 buckets stay 0.
- `lazy` — wraps a child in a `LazyVector`; asserts the `LAZY`
bucket increments (the encoding is captured BEFORE any
`loadedVector()` call inside the writer).
The new test is a standalone `add_velox_test(...)` entry rather than
a new `TEST_P` case in `VeloxShuffleWriterTest.cc`, so it isolates
the counter behavior from the parameterised round-trip matrix —
which keeps the failure mode obvious if a future change quietly
breaks one of the buckets.
End-to-end runtime validation in a real Spark/Gluten run was not
possible in my development environment for the same JNI-symbol-resolution
reason documented in #12083 — the production gluten-velox JAR takes
precedence on the JNI path, shadowing the locally-built OSS bundle.
The CI-built bundle will not have that issue. Likewise, the local
Velox ep build here is configured with `VELOX_BUILD_TEST_UTILS=OFF`,
so the new gtest binary was syntax-validated locally (via
`g++ -fsyntax-only` against the in-tree compile flags) but not
executed; Gluten CI will run it.
Marked as draft for one more review pass before flipping to ready.
## Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]