luis4a0 opened a new pull request, #12108:
URL: https://github.com/apache/gluten/pull/12108

   ## What changes are proposed in this pull request?
   
   > **Stacked on top of #12107.** Until that PR merges this PR's diff will
   > appear larger (it carries both #12107's commits and this PR's commits).
   > Once #12107 lands in `main`, the diff here will narrow to just the
   > SplitRV sub-stage changes (2 commits, ~232 lines). Reviewers may wait
   > for #12107 to merge if they prefer to look at this PR's contribution
   > in isolation.
   
   The existing `CpuWallTimingSplitRV` bucket wraps the entire
   `splitRowVector()` call, so when it dominates wall time in a
   `VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell which
   of the four sub-paths is responsible: fixed-width scatter, validity
   copy, binary split, or complex (Presto-serialized) split.
   
   This PR subdivides `SplitRV` into four new buckets, one per helper
   inside `splitRowVector()`:
   
   | New enum                              | Helper                            |
   |---------------------------------------|-----------------------------------|
   | `CpuWallTimingSplitFixedWidth`        | `splitFixedWidthValueBuffer()`    |
   | `CpuWallTimingSplitValidity`          | `splitValidityBuffer()`           |
   | `CpuWallTimingSplitBinary`            | `splitBinaryArray()`              |
   | `CpuWallTimingSplitComplex`           | `splitComplexType()`              |
   
   The outer `SplitRV` bucket is unchanged: it continues to count once per
   call, and its wall/cpu measurement now happens to overlap the four
   inner buckets (the outer timer is still ticking while each inner one
   is). The outer **minus** the sum-of-inners is therefore the time spent
   in `splitRowVector()` outside the four sub-paths (mostly the post-split
   `partitionBufferBase_` update loop). This is the same parent/child
   pattern most profiling tools use for nested timers, and it's documented
   in a comment next to the enum.
   
   To make the new sub-stage data accessible to anything other than the
   existing compile-time log path (tests, future `ShuffleWriterMetrics`
   exposure, profilers), three things are hoisted from `protected` to
   `public` in `VeloxShuffleWriter`:
   
   - The `CpuWallTimingType` enum
   - The `CpuWallTimingName()` helper (already de-facto public, since
     its output is printed by the `VELOX_SHUFFLE_WRITER_LOG_FLAG` log)
   - A new single-value accessor `cpuWallTiming(CpuWallTimingType)`
   
   The underlying `cpuWallTimingList_` storage stays `protected`.
   
   ### Why now
   
   This is the natural follow-up to #12083 (which made the per-stage
   timers actually record real numbers) and to #12107 (which adds a
   per-batch input-encoding counter). Together, the three changes let a
   reviewer answer **"what shape did the writer see, where did the time
   go, and what inner path drove it"** from a single
   `VELOX_SHUFFLE_WRITER_LOG_FLAG` run — instead of being stuck at
   "`SplitRV` was 65 % of wall time, *somewhere in there*".
   
   A per-query TPC-DS/TPC-H SF10 profile sweep on a closely-related
   internal branch shows `SplitRV` is the #1 or #2 bucket on 10/12
   queries surveyed (46 – 71 % of wall), with substantially different
   sub-stage breakdowns: bigint-heavy queries (q17, q14a) are
   overwhelmingly `splitFixedWidthValueBuffer`, while VARCHAR-heavy q67
   shifts toward `splitBinaryArray`. Without the sub-stage timers, those
   breakdowns are invisible.
   
   ## How was this patch tested?
   
   This PR adds `cpp/velox/tests/VeloxHashShuffleWriterSplitRvSubStagesTest.cc`
   (second commit) with five gtest cases:
   
   - `enumNames` — validates the strings emitted under
     `VELOX_SHUFFLE_WRITER_LOG_FLAG` for the 4 new enum entries (these
     strings are stable observable surface).
   - `fixedWidthBumpsItsBucket` — single fixed-width batch; asserts each
     of the 4 inner buckets has `count == 1` (every helper inside
     `splitRowVector()` is called once per batch regardless of column-type
     mix, since the new sub-stage `SCOPED_TIMER` calls live at function
     entry).
   - `binaryBumpsItsBucket` — VARCHAR batch; asserts the binary
     sub-stage's `wallNanos > 0`, confirming real work is being measured
     (not just function-entry overhead).
   - `complexBumpsItsBucket` — ARRAY batch; same `wallNanos > 0`
     assertion against the complex sub-stage (Presto-serialized
     round-trip per partition).
   - `outerSplitRvStillCounted` — sanity: two batches in, outer `SplitRV`
     bucket has `count == 2`. Confirms the new inner timers refine, not
     replace, the existing outer timer.
   
   End-to-end runtime validation in a real Spark/Gluten run was not
   possible in my development environment for the same JNI-symbol-resolution
   reason documented in #12083 and #12107. The local Velox ep build here
   is configured with `VELOX_BUILD_TEST_UTILS=OFF`, so the new gtest
   binary was syntax-validated locally (`g++ -fsyntax-only` against the
   in-tree compile flags) but not executed; Gluten CI will run it.
   
   Marked as draft until #12107 merges (so reviewers see the focused
   SplitRV diff).
   
   ## Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to