GitHub user Xiao-zhen-Liu added a comment to the discussion: Task ideas for the 
dkNet-AI · Apache Texera Agent Hackathon

# Texera Workflow Completion Forecasting

**What exists today.** Each operator box already displays cumulative *processed 
input* and *produced output* tuple counts (rendered by `JointUIService` via 
`.texera-operator-processed-count` and `.texera-operator-output-count`). The 
toolbar shows an elapsed-time counter, and the workflow state is conveyed via 
color badges. Users can already see *what has happened* and *for how long*.

**The actual gap.** What they cannot see:
- **Rate** — is the operator moving 230k rows/sec, slowing down, or stalled?
- **How much is left** — counts have no denominator, so "halfway done" is not 
expressible.
- **When will it finish** — no ETA, per operator or workflow-wide.
- **One-glance workflow health** — elapsed time keeps climbing whether work is 
happening or not.

**Proposal.** Turn the existing tuple counters into a *forecasting layer*. 
Build on the per-operator `ExecutionStatsUpdate` already streamed every ~500 
ms; derive rate, predict remaining work, and surface ETA. No new transport, no 
engine rework.

**Scope (four small, independently shippable phases):**
- **Throughput and stall detection** — compute rows/sec from successive count 
snapshots (exponentially smoothed). Render a rate label next to each operator's 
existing counters (e.g. `230 k/s`); surface a stall warning when rate → 0 while 
state is still Running. Side panel adds a rate sparkline so users can see 
trends, not just instantaneous values.
- **Source cardinality estimation** — cheap pre-execution estimates of total 
input size for bounded sources: exact for Arrow/Parquet (`numRows` read from file
metadata, which Parquet keeps in its footer), 
`FileLister` (file count), `URLFetcher`; file-size-based estimates for CSV / 
JSONL / text scans. Unknown for Python/R UDF sources and streaming APIs — those 
keep showing counts + rate, no fake estimate. No SQL `COUNT(*)` probes (Texera 
is not a database engine).
- **Per-operator ETA** — divide remaining work by smoothed rate; show "~14 s 
remaining" next to the operator. Each downstream operator's remaining work is 
bounded by its slowest input — no selectivity heuristics (which produce 
confidently-wrong forecasts).
- **Workflow-level ETA** — toolbar gains a "~ETA" label beside the existing 
elapsed counter, taken as the max over operator ETAs (longest-remaining 
operator gates the workflow). Hidden when no operator has an estimate — the 
elapsed counter remains the only indicator, same as today.
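
As a rough illustration of the four phases above, here is a minimal TypeScript sketch of the derivation logic. Everything in it is an assumption made for illustration: the `CountSnapshot` shape, the smoothing factor `ALPHA`, the stall threshold `epsilon`, and all function names are hypothetical and do not reflect the actual `ExecutionStatsUpdate` payload or any existing Texera service.

```typescript
// Hypothetical snapshot shape; the real ExecutionStatsUpdate payload may differ.
interface CountSnapshot {
  count: number;       // cumulative tuples processed so far
  timestampMs: number; // time the snapshot was received
}

interface RateState {
  last: CountSnapshot;
  smoothedRate: number | undefined; // rows/sec; undefined until two snapshots exist
}

// Assumed smoothing factor: higher reacts faster but is noisier.
const ALPHA = 0.3;

// Fold one more snapshot into the exponentially smoothed rate.
function updateRate(state: RateState | undefined, snap: CountSnapshot): RateState {
  if (state === undefined) {
    return { last: snap, smoothedRate: undefined };
  }
  const dtSec = (snap.timestampMs - state.last.timestampMs) / 1000;
  if (dtSec <= 0) {
    return state; // ignore duplicate or out-of-order snapshots
  }
  const instant = (snap.count - state.last.count) / dtSec;
  const smoothed =
    state.smoothedRate === undefined
      ? instant
      : ALPHA * instant + (1 - ALPHA) * state.smoothedRate;
  return { last: snap, smoothedRate: smoothed };
}

// Stalled: the operator claims to be running but the smoothed rate is near zero.
function isStalled(state: RateState, running: boolean, epsilon: number = 1): boolean {
  return running && state.smoothedRate !== undefined && state.smoothedRate < epsilon;
}

// Per-operator ETA in seconds; undefined when no honest estimate is possible.
function etaSeconds(state: RateState, estimatedTotal: number | undefined): number | undefined {
  if (
    estimatedTotal === undefined ||
    state.smoothedRate === undefined ||
    state.smoothedRate <= 0
  ) {
    return undefined;
  }
  const remaining = Math.max(0, estimatedTotal - state.last.count);
  return remaining / state.smoothedRate;
}

// Workflow ETA: the longest known operator ETA, or undefined if none is known.
function workflowEta(etas: Array<number | undefined>): number | undefined {
  const known = etas.filter((e): e is number => e !== undefined);
  return known.length === 0 ? undefined : Math.max(...known);
}
```

With snapshots arriving every ~500 ms, `updateRate` would be called once per update per operator, and the UI would hide the ETA label whenever `etaSeconds` or `workflowEta` returns `undefined`.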

**Honesty principle.** Operators rooted in unbounded sources keep showing 
counts + rate only — no fabricated forecast. Where the system genuinely cannot 
predict, it stays silent rather than misleading.
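
One way to read the "bounded by its slowest input" rule together with this principle is sketched below. It is only an interpretation under stated assumptions: the graph representation, the `propagateEta` name, and the simplification that a downstream operator's ETA equals the largest ETA among its inputs (ignoring its own processing lag) are all hypothetical. The graph is assumed to be acyclic.

```typescript
// Hypothetical graph: operator id -> ids of its direct upstream operators.
type Upstreams = Record<string, string[]>;

// ETA in seconds per source; undefined marks an unbounded or unknown source.
type SourceEtas = Record<string, number | undefined>;

// An operator's ETA is the largest ETA among its inputs, and it is undefined
// as soon as any input (transitively) is rooted in an unbounded source.
function propagateEta(
  opId: string,
  upstreams: Upstreams,
  sourceEtas: SourceEtas,
  memo: Map<string, number | undefined> = new Map<string, number | undefined>()
): number | undefined {
  if (memo.has(opId)) {
    return memo.get(opId);
  }
  const inputs = upstreams[opId] ?? [];
  let result: number | undefined;
  if (inputs.length === 0) {
    // Source operator: use its own estimate, if it has one.
    result = sourceEtas[opId];
  } else {
    const etas = inputs.map((i) => propagateEta(i, upstreams, sourceEtas, memo));
    const known = etas.filter((e): e is number => e !== undefined);
    // Stay silent unless every input has an estimate.
    result = known.length === etas.length ? Math.max(...known) : undefined;
  }
  memo.set(opId, result);
  return result;
}
```

In this sketch a join fed by one bounded file scan and one streaming API gets no ETA at all, which matches the intent of showing counts and rate only.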

**Deliverables.**
- One protobuf field (`estimated_output_count`) + a `CardinalityEstimable` 
trait implemented by ~8 source operators.
- Frontend additions in `WorkflowStatusService`, `JointUIService`, the operator 
property panel, and the workflow menu toolbar.
- Unit tests for each estimator and the derivation service; demo on a CSV → 
Filter → Aggregate → Sink workflow with before/after screenshots.
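
To make the estimator deliverable more concrete, here is a hedged sketch of a file-size-based CSV estimate. It is written in TypeScript only for illustration; the proposed `CardinalityEstimable` trait would presumably live in the backend, and the interface shape, `estimateCsvRows`, `csvSource`, and `unboundedSource` are hypothetical names. The approach (average bytes per line in a leading sample, scaled to the file size) is one plausible technique, not necessarily the one intended, and it miscounts quoted fields that contain newlines.

```typescript
// Hypothetical mirror of the proposed trait; undefined means "cannot estimate".
interface CardinalityEstimable {
  estimateOutputCount(): number | undefined;
}

// Estimate data rows from the file size and a leading byte sample of the file.
function estimateCsvRows(
  fileSizeBytes: number,
  sample: Uint8Array,
  hasHeader: boolean = true
): number | undefined {
  const NEWLINE = 0x0a;
  let lines = 0;
  let lastNewline = -1;
  for (let i = 0; i < sample.length; i++) {
    if (sample[i] === NEWLINE) {
      lines++;
      lastNewline = i;
    }
  }
  if (lines === 0) {
    return undefined; // sample holds no complete line, so stay silent
  }
  // Only count bytes belonging to complete lines; drop the trailing partial line.
  const avgBytesPerLine = (lastNewline + 1) / lines;
  const totalLines = Math.round(fileSizeBytes / avgBytesPerLine);
  return Math.max(0, totalLines - (hasHeader ? 1 : 0));
}

// A bounded CSV source exposes an approximate estimate.
function csvSource(fileSizeBytes: number, sample: Uint8Array): CardinalityEstimable {
  return { estimateOutputCount: () => estimateCsvRows(fileSizeBytes, sample) };
}

// UDF sources and streaming APIs report no estimate at all.
const unboundedSource: CardinalityEstimable = {
  estimateOutputCount: () => undefined,
};
```

Because this estimate is approximate, a UI built on it would likely need to clamp or hide the ETA once the observed count exceeds the estimated total.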

**Impact.** The existing counters answer *what happened*. This adds the answers 
users actually ask out loud: *how fast is it going*, *how much is left*, *when 
will it finish* — while being explicit when prediction is genuinely impossible.

GitHub link: 
https://github.com/apache/texera/discussions/5059#discussioncomment-16924286
