Xiao-zhen-Liu opened a new pull request, #5115:
URL: https://github.com/apache/texera/pull/5115
# PR Description: AI-Augmented Macro Operators
## What changes were proposed in this PR?
Introduces **macro operators** — a logical-plan-level abstraction for
encapsulating reusable sub-DAGs as single nodes on the canvas — together with
the AI surfaces that make them practical: a suggestion panel that scans the
current workflow for refactor opportunities, a one-click "fuse for performance"
path that collapses a macro body into a single Python UDF, and a drill-down
editor for inspecting / editing a macro body while live execution stats keep
flowing.
### Surface
* **Right-click → "Create macro"** replaces a selected sub-DAG with a single
`Macro` op on the parent canvas. The body is persisted as a separate
`MACRO`-kind workflow row, and the `Macro` op references it via `(macroId,
macroVersion)` in `LIVE` link mode (`SNAPSHOT` mode is also supported for
portable bodies).
* **"Your Macros" palette section** lists every macro the user has saved;
click → instantiate, hover → preview, right-click → export/import as portable
JSON bundles (transitive — nested macros travel with their parent).
* **"Suggest Macros (AI)" panel** (`MacroSuggestionService`) ranks two
heuristic candidate sources side-by-side: linear chains, and recurring
`(opType₁, opType₂, …)` patterns across the workflow. Recurring patterns
auto-tier as `✓ recommended` because duplicated logic is the strongest "extract
as macro" signal. Names are domain-aware (`csv_preprocessing`,
`text_filtering`, `metric_summary`, `joined_enrichment`, `ml_train_eval`, …)
rather than underscore-joined op types.
* **Right-click → "fuse for performance (AI)"** (`MacroFusionService`) emits
a `PythonUDFOpDescV2`-compatible `process_tuple` function from the macro body.
Fusion covers Filter, Projection, Regex, Limit, Distinct, and inlined
PythonUDFV2 (yield-rewritten). It sets `fusion.verified = true` so that
`MacroExpander` substitutes a single UDF for the inlined body at compile time.
On the canvas, a fused macro shows a solid gold stroke and a
`⚡ FUSED · 1.6×` badge. The speedup figure is estimated from the
handoff-removal model: the N−1 internal actor boundaries are collapsed, each
credited a conservative ×0.30, and the result is capped at ×4.
* **Drill-down editor** — double-clicking a macro op routes to
`/dashboard/user/workflow/{wid}/macro/{macroId}?instance=…` and renders the
body on a child canvas. The body is laid out with dagre using the same settings
as the main canvas's auto-layout. Live execution stats (incl. row counts per
port) flow into the drill-down view via a `(body-op-id → runtime-UUID)` map
sourced from `MacroService`'s runtime-mapping cache.
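The handoff-removal estimate behind the fused badge can be sketched as follows. This is only one plausible reading of "×0.30 per handoff, capped at ×4": the additive form `1 + 0.30 · (N−1)` and the names `estimated_fusion_speedup` and `badge_label` are assumptions for illustration, not code taken from `MacroFusionService`.

```python
def estimated_fusion_speedup(num_body_ops: int,
                             gain_per_handoff: float = 0.30,
                             cap: float = 4.0) -> float:
    """Hypothetical handoff-removal model (assumed additive form).

    Fusing N body operators removes N-1 internal actor boundaries; each
    removed boundary is credited `gain_per_handoff`, and the total is capped.
    """
    if num_body_ops < 1:
        raise ValueError("a macro body needs at least one operator")
    handoffs_removed = num_body_ops - 1
    return min(cap, 1.0 + gain_per_handoff * handoffs_removed)


def badge_label(speedup: float) -> str:
    """Format the speedup the way the canvas badge displays it."""
    return f"⚡ FUSED · {speedup:.1f}×"


# Under this reading, a three-operator body (two handoffs removed)
# yields the 1.6× figure shown in the badge example.
label = badge_label(estimated_fusion_speedup(3))
```

Under this reading the cap is reached at eleven body operators, so longer chains all report ×4.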
### Architecture
* **Macros live at the logical-plan layer only.** `MacroExpander` (mirrored
in `amber/` and `workflow-compiling-service/`) runs a pre-compile pass that
inlines every `MacroOpDesc` into its body operators and rewrites edges, so the
physical-plan layer never sees a macro. `WorkflowCompiler` calls
`MacroExpander.expand` before `expandLogicalPlan`.
* **Deterministic UUIDs for inner ops.** The expander assigns each inner op
a new but reproducible ID via
`UUID.nameUUIDFromBytes("${macroInstanceId}|${originalBodyOpId}")`. This is
required because (a) the long `${instanceId}--${innerOp}` prefix scheme
produced 170+ char IDs that caused Iceberg commit thrash on HashJoin's internal
build-side port, and (b) Texera has two compilers (frontend validation and
execution-time) that must produce bit-identical plans; otherwise the
macro-mapping side-table written by one would not match the runtime stats
emitted by the other.
* **Provenance side-table.** `MacroExpander` populates a `Map[runtimeOpId →
MacroProvenance(macroChain, bodyOpId)]` during expansion; `WorkflowCompiler`
drains it after compile and stores it in `MacroMappingCache` (file-backed at
`/tmp/texera-macro-mappings/wid-{wid}.json` for cross-JVM visibility between
`ComputingUnitMaster` and `TexeraWebApplication`). Exposed via `GET
/api/workflow/{wid}/macro-mapping`. Frontend
`WorkflowStatusService.withMacroAggregates` walks the chain to roll inner-op
stats up to every macro level (parent canvas + nested drill-downs).
* **Nested macros are fully supported** — the chain stored per runtime op is
`[outerInstanceId, innerInstanceId, …]`; the resolver suffix-matches the chain
so a stats-roll-up rooted at an inner macro still finds its runtime ops.
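The deterministic-ID and provenance ideas above can be illustrated with a small Python sketch. It mimics `java.util.UUID.nameUUIDFromBytes` (an MD5 digest with the version-3 and variant bits set) and keeps a provenance table keyed by runtime op ID. Everything else is an assumption for illustration: the helper names, the choice to key each ID on the innermost instance ID, and the sample instance and op IDs are hypothetical and not taken from `MacroExpander`.

```python
import hashlib
import uuid
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


def name_uuid_from_bytes(name: bytes) -> uuid.UUID:
    """Mimic java.util.UUID.nameUUIDFromBytes: MD5 digest, version 3."""
    return uuid.UUID(bytes=hashlib.md5(name).digest(), version=3)


def runtime_op_id(macro_instance_id: str, body_op_id: str) -> str:
    """Same inputs always give the same ID, regardless of which compiler runs."""
    name = f"{macro_instance_id}|{body_op_id}".encode("utf-8")
    return str(name_uuid_from_bytes(name))


@dataclass(frozen=True)
class MacroProvenance:
    macro_chain: Tuple[str, ...]  # outermost macro instance first
    body_op_id: str


def record_expansion(table: Dict[str, MacroProvenance],
                     chain: Sequence[str],
                     body_op_ids: Sequence[str]) -> None:
    """Register the inlined body ops of the macro instance at the end of `chain`.

    Assumption: the ID is derived from the innermost instance ID only.
    """
    for body_op_id in body_op_ids:
        rid = runtime_op_id(chain[-1], body_op_id)
        table[rid] = MacroProvenance(tuple(chain), body_op_id)


def runtime_ops_for(table: Dict[str, MacroProvenance],
                    chain_suffix: Sequence[str]) -> List[str]:
    """Return runtime op IDs whose macro chain ends with `chain_suffix`."""
    suffix = tuple(chain_suffix)
    if not suffix:
        return []
    n = len(suffix)
    return sorted(rid for rid, prov in table.items()
                  if prov.macro_chain[-n:] == suffix)


# Hypothetical workflow: one outer macro containing a filter and a nested macro.
table: Dict[str, MacroProvenance] = {}
record_expansion(table, ["outer-1"], ["filter"])
record_expansion(table, ["outer-1", "inner-1"], ["regex", "limit"])

# A drill-down rooted at the inner macro knows only its own instance ID,
# yet suffix matching still finds both of its runtime ops.
inner_ops = runtime_ops_for(table, ["inner-1"])
```

Because the ID is a pure function of its inputs, two independent compiler passes over the same plan produce the same keys, which is what lets a mapping written by one be read against stats emitted by the other.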
### What this PR also fixes (along the way)
* **View-result inside a macro** — drill-down result lookups now go
body-relative-id → runtime-UUID via
`MacroService.buildBodyOpIdToRuntimeUuidMap` (replaces the obsolete
prefix-based alias). Mega-macros with 0 external outputs alias the canvas op to
the first body sink, so the auto-stored terminal output is reachable without
drilling.
* **Back-to-parent stats** — `WorkflowStatusService` re-aggregates the
cached raw status on every `runtimeMacroMappingTick`, and its emission Subject
becomes `ReplaySubject(1)` so the canvas remount after navigation gets the
latest snapshot immediately.
* **Jackson `UnrecognizedPropertyException` on `macroSyncedAt`** at execute
time: `MacroOpDesc` is now annotated with
`@JsonIgnoreProperties(ignoreUnknown = true)` so UI-only fields the frontend
stamps onto operatorProperties don't break deserialization.
* **`/api/macro/*` HTTP storm** — lazy fetches inside template bindings
caused an infinite loop; reverted to a flat palette and removed the lazy
fetches.
* **Engine error visibility** — phase-transition errors and
missing-schema-port errors now propagate out of `RegionExecutionCoordinator`
instead of stalling silently.
## Any related issues, documentation, discussions?
Hackathon submission. Builds on §9.2 of the macro design doc (AI fusion
substitution path).
## How was this PR tested?
* **`MacroExpanderSpec`** (~694 lines) covers the expander on its own:
single-macro expansion, nested expansion (outer chain + inner chain), input
fan-out (single external port → multiple inner consumers), output fan-in
detection (raises), cycle detection across nested macros, depth-limit guard,
deterministic-UUID property (same input → same output across compiler
instances), and the provenance side-table population.
* **`MacroOpDescSpec`** covers Jackson serialization round-trip (incl.
tolerance of unknown frontend-only fields like `macroSyncedAt`).
* **Demo runbook** — exercised the full path end-to-end on a real
multi-macro workflow including nested macros, view-result on inner sinks, fuse
+ unfuse, drill-down navigation in/out, and a "mega-macro" (entire workflow
wrapped). Stats roll up correctly at every nesting level and the canvas remount
after navigation no longer wipes non-macro op states.
```
sbt "WorkflowExecutionService/testOnly *MacroExpanderSpec"
sbt "WorkflowOperator/testOnly *MacroOpDescSpec"
```
## Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)