[D] RFC: Workflow Performance Profiler - full design & implementation walkthrough [texera]

via GitHub Mon, 25 May 2026 22:22:06 -0700


GitHub user PG1204 created a discussion: RFC: Workflow Performance Profiler - 
full design & implementation walkthrough


Following up on @Yicong-Huang's suggestion on PR #5098. This post documents the 
entire profiler implementation from the hackathon submission, at the level of 
detail needed to decide what to land and in what order.

### What it does
A feedback layer over Texera's existing per-operator runtime stats. While a 
workflow runs:

- Canvas heatmap colors operators cold -> hot under one of three views (Runtime 
/ Throughput / I/O imbalance).
- Rule-based hints in the property panel ("filter passes 2% of rows - push it 
upstream").
- Compare to past run - pick any completed execution; canvas recolors green/red 
by delta.
- Downloadable report (Markdown + JSON).
- (Optional) Ghost suggestions on canvas for two structural rewrites.

Design principle: read-only consumer of data we already produce. No new event 
types, no engine changes, no new HTTP endpoints in the critical path. Removing 
the profiler touches zero backend code paths.

### Architecture

OperatorStatisticsUpdateEvent (existing)
        │
        ▼
WorkflowStatusService (existing)
        │
        ▼
ProfilerService ── reactive state (enabled, view, threshold, scores, baseline)
        │
        ├── profiler-config     (round-trip via WorkflowContent.profilerConfig)
        ├── profiler-hints      (pure 6-rule engine)
        ├── profiler-delta      (baseline diffing)
        ├── profiler-history    (server-fetched past runs → BaselineReport)
        ├── profiler-report     (MD + JSON export)
        └── profiler-suggestions (optional ghost overlays)
                │
                ▼
   joint-ui (canvas color) · menu (controls) · property-panel (metrics + hints)

### Components

Module | Purpose
-- | --
ProfilerService | Throttled (500ms) subscriber to status stream; pure 
computeScores() per view; normalizes to [0,1]; resets on run restart
profiler-config | WorkflowContent.profilerConfig round-trip; defensive parser; 
equality guard against write-loop
profiler-hints | 6 rules: IDLE_HEAVY, JOIN_HIGH_FANIN_LOW_FANOUT, 
LOW_PARALLELISM_HOT_OP, RUNTIME_OUTLIER, SCAN_FULL_TABLE_NO_FILTER, 
UPSTREAM_OVERPRODUCTION
profiler-delta | Per-op delta vs baseline; ±5% / <1ms counts as unchanged; 
surfaces matched / new / removed
profiler-history | Reuses existing /executions/{wid}/stats/{eid}; converts 
persisted rows → BaselineReport; memoized per (wid, eid)
profiler-report | MD + JSON; JSON is identical to the upload schema (no 
separate format)
profiler-suggestions (optional) | Structural rewrites (INSERT_FILTER, 
BUMP_WORKERS); per-workflow dismissed set in localStorage
ProfilerScoring.scala (optional) | Pure object mirroring TS scoring; zero call 
sites today; only useful if backend ever needs scoring

Test totals on the branch: ~250 Vitest specs (all green), tsc --noEmit clean, 
ScalaTest covers the scoring helper.

### Things I'd like input on

Per-workflow config storage: one optional field on WorkflowContent. OK, or 
prefer a side table / user setting?
Backend scoring helper: include ProfilerScoring.scala now for parity, or drop 
until there's a call site?
Ghost suggestions vs agent: they cover overlapping use cases (the agent 
integration, deferred to a separate RFC, has similar Apply/Reject cards). Land 
both as complementary, or pick one?
Schema versioning: baseline JSON = exported report JSON. Add an explicit 
schemaVersion before merging, or trust the defensive parser?
Hint i18n: messages are English-only. Route through i18n now or defer?
Recompute scale: tested up to ~80 operators with no issue, no formal benchmark. 
Upper bound we care about?

### Proposed merge order
7 independent PRs, each useful on its own:

Heatmap foundation: ProfilerService + config + canvas coloring + 
Execution-Settings controls + legend. (~1.2k LOC w/ tests)
Hints + property-panel metrics + threshold slider.
UX polish: displayName threading, hover tooltip, MD/JSON report download.
Per-workflow config round-trip.
Compare-across-runs + delta heatmap.
(optional) Ghost suggestions.
(optional, backend) ProfilerScoring.scala.
Agent integration -> separate RFC once 1–5 are in.

Happy to open PR 1 once there's directional agreement here. Specific yes/no on 
"Things I'd like input on" would unblock the most.


GitHub link: https://github.com/apache/texera/discussions/5216

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

[D] RFC: Workflow Performance Profiler - full design & implementation walkthrough [texera]

Reply via email to