GitHub user Xiao-zhen-Liu added a comment to the discussion: Task ideas for the dkNet-AI · Apache Texera Agent Hackathon
# Texera Workflow Completion Forecasting

**What exists today.** Each operator box already displays cumulative *processed input* and *produced output* tuple counts (rendered by `JointUIService` via `.texera-operator-processed-count` and `.texera-operator-output-count`). The toolbar shows an elapsed-time counter, and the workflow state is conveyed via color badges. Users can already see *what has happened* and *for how long*.

**The actual gap.** What they cannot see:

- **Rate**: is the operator moving 230k rows/sec, slowing down, or stalled?
- **How much is left**: counts have no denominator, so "halfway done" is not expressible.
- **When will it finish**: no ETA, per operator or workflow-wide.
- **One-glance workflow health**: elapsed time keeps climbing whether work is happening or not.

**Proposal.** Turn the existing tuple counters into a *forecasting layer*. Build on the per-operator `ExecutionStatsUpdate` already streamed every ~500 ms; derive rate, predict remaining work, and surface ETA. No new transport, no engine rework.

**Scope (four small, independently shippable phases):**

- **Throughput and stall detection**: compute rows/sec from successive count snapshots (exponentially smoothed). Render a rate label next to each operator's existing counters (e.g. `230 k/s`); surface a stall warning when rate → 0 while state is still Running. The side panel adds a rate sparkline so users can see trends, not just instantaneous values.
- **Source cardinality estimation**: cheap pre-execution estimates of total input size for bounded sources: exact for Arrow/Parquet (header `numRows`), `FileLister` (file count), `URLFetcher`; file-size-based estimates for CSV / JSONL / text scans. Unknown for Python/R UDF sources and streaming APIs; those keep showing counts + rate, no fake estimate. No SQL `COUNT(*)` probes (Texera is not a database engine).
- **Per-operator ETA**: divide remaining work by smoothed rate; show "~14 s remaining" next to the operator. Each downstream operator's remaining work is bounded by its slowest input, with no selectivity heuristics (which produce confidently-wrong forecasts).
- **Workflow-level ETA**: the toolbar gains a "~ETA" label beside the existing elapsed counter, taken as the max over operator ETAs (the longest-remaining operator gates the workflow). Hidden when no operator has an estimate; the elapsed counter then remains the only indicator, same as today.

**Honesty principle.** Operators rooted in unbounded sources keep showing counts + rate only, with no fabricated forecast. Where the system genuinely cannot predict, it stays silent rather than misleading.

**Deliverables.**

- One protobuf field (`estimated_output_count`) + a `CardinalityEstimable` trait implemented by ~8 source operators.
- Frontend additions in `WorkflowStatusService`, `JointUIService`, the operator property panel, and the workflow menu toolbar.
- Unit tests for each estimator and the derivation service; demo on a CSV → Filter → Aggregate → Sink workflow with before/after screenshots.

**Impact.** The existing counters answer *what happened*. This adds the answers users actually ask out loud: *how fast is it going*, *how much is left*, *when will it finish*, while being explicit when prediction is genuinely impossible.

GitHub link: https://github.com/apache/texera/discussions/5059#discussioncomment-16924286
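
To make the rate, stall, and ETA derivations in the scope list concrete, here is a minimal sketch in Python. It is an illustration only, not Texera code: `RateTracker`, `operator_eta`, `workflow_eta`, the smoothing factor `alpha`, and the `stall_below` threshold are all hypothetical names and values. The real derivation would presumably live in the frontend next to `WorkflowStatusService`.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RateTracker:
    """Smoothed rows/sec for one operator, fed by successive count snapshots."""
    alpha: float = 0.3                 # assumed smoothing factor, not from the proposal
    rate: Optional[float] = None       # None until two snapshots have been seen
    last_count: Optional[int] = None
    last_time: Optional[float] = None

    def update(self, count: int, now: float) -> Optional[float]:
        """Fold one cumulative-count snapshot (taken at `now` seconds) into the rate."""
        if self.last_time is not None and now > self.last_time:
            instant = (count - self.last_count) / (now - self.last_time)
            if self.rate is None:
                self.rate = instant
            else:
                # Exponential smoothing: new = alpha * instant + (1 - alpha) * old
                self.rate = self.alpha * instant + (1 - self.alpha) * self.rate
        self.last_count, self.last_time = count, now
        return self.rate

    def is_stalled(self, running: bool, stall_below: float = 1.0) -> bool:
        """Stall = still Running, but the smoothed rate fell under a threshold."""
        return running and self.rate is not None and self.rate < stall_below


def operator_eta(processed: int, estimated_total: Optional[int],
                 rate: Optional[float]) -> Optional[float]:
    """Seconds remaining, or None when there is no honest estimate."""
    if estimated_total is None or rate is None or rate <= 0:
        return None
    return max(estimated_total - processed, 0) / rate


def workflow_eta(operator_etas: Iterable[Optional[float]]) -> Optional[float]:
    """Max over the known operator ETAs; None hides the toolbar label."""
    known = [eta for eta in operator_etas if eta is not None]
    return max(known) if known else None
```

For example, an operator that has processed 1,000 of an estimated 15,000 rows at a smoothed 1,000 rows/sec gets `operator_eta(1000, 15000, 1000.0) == 14.0`, which matches the "~14 s remaining" label above. Returning `None` rather than a guess is how the sketch reflects the honesty principle.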

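
The file-size-based estimate proposed for CSV scans could work roughly as follows: read a bounded sample from the start of the file, measure the average bytes per line, and extrapolate to the full file size. This is a sketch under stated assumptions, not the proposed `CardinalityEstimable` implementation. `estimate_csv_rows` and its parameters are hypothetical, and the sketch assumes one record per line (quoted fields containing newlines would skew it).

```python
import os
from typing import Optional


def estimate_csv_rows(path: str, sample_bytes: int = 1 << 20,
                      has_header: bool = True) -> Optional[int]:
    """Estimate the data-row count of a CSV file from its size and a leading sample.

    Returns an exact count when the whole file fits in the sample, an
    extrapolated count otherwise, and None when the sample has no complete line.
    """
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)

    newlines = sample.count(b"\n")
    if len(sample) == size:
        # Whole file was read: count lines exactly, including an unterminated last line.
        total_lines = newlines + (0 if sample.endswith(b"\n") else 1)
    elif newlines == 0:
        return None  # no complete line in the sample: stay silent rather than guess
    else:
        avg_line_bytes = len(sample) / newlines
        total_lines = round(size / avg_line_bytes)

    return max(total_lines - (1 if has_header else 0), 0)
```

Sampling only the head of the file keeps the probe cheap enough to run before execution, at the cost of accuracy when row widths vary a lot across the file.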