Hi Beam community (and Danny),
Following the recent merge of my PR on dynamic batching parameters (
https://github.com/apache/beam/pull/37428), I’ve been profiling Beam’s
RunInference on LLM-style workloads to identify opportunities where we can
adapt efficiency patterns from dedicated serving engines.
Beam’s unified processing model is a big advantage, but for production
GenAI pipelines there are two bottlenecks that show up quickly in cost and
user-perceived latency:
- *Padding waste (Throughput / GPU cost):* Current RunInference batching
relies on arrival order. In mixed-workload scenarios, a single long prompt
forces the entire batch to be padded to its length, wasting significant
compute. My previous PR optimized batch timing, but it does not address
batch composition: "Smart Bucketing" is needed to group similar-length
inputs and minimize this waste (a toy calculation below this list
illustrates the scale of the problem).
- *No incremental output (TTFT / Latency):* The current one-shot
generation pattern blocks until completion, which makes interactive GenAI
flows feel unresponsive compared to streaming gRPC/HTTP endpoints.
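To make the padding cost concrete, here is a toy back-of-the-envelope
calculation (the numbers are made up for illustration, not measured);
grouping the long prompt with other long prompts instead would shrink the
waste to roughly the per-bucket length spread:

# Toy example: one long prompt in an arrival-order batch of 8.
prompt_lengths = [128, 96, 110, 2048, 140, 105, 120, 98]  # tokens per prompt

padded_length = max(prompt_lengths)                 # whole batch padded to 2048
useful_tokens = sum(prompt_lengths)                 # 2845 real tokens
total_tokens = padded_length * len(prompt_lengths)  # 16384 tokens computed

print('padding waste: %.0f%%' % (100 * (1 - useful_tokens / total_tokens)))
# -> padding waste: 83%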
I’d like to propose a phased roadmap to make production-ready GenAI
workloads first-class on Beam ML.
------------------------------
*Phase 1 — Efficiency Core: Smart Bucketing*
*Goal: Reduce padding overhead significantly (and improve throughput) for
variable-length inputs without changing existing behavior.*
*Approach* (building on the element_size_fn infrastructure):
- Use the element_size_fn mechanism to bucket elements by length/size
before they reach the batching transform.
- Implement a stateful bucketing transform using BagState + Timers (a
rough sketch follows below this list):
  - Maintain per-bucket buffers (e.g., token-length ranges / configurable
  boundaries).
  - Emit when a specific bucket reaches its target batch size.
  - Use timer-based flushing to bound waiting time and avoid starving
  smaller buckets (handling the "straggler" problem).
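To make the state machine concrete, here is a rough, non-binding sketch of
what such a transform could look like. Everything in it is a placeholder
still up for discussion: BucketingDoFn, bucket_for, the boundaries, and the
flush policy are illustrative, not proposed APIs; the only things it relies
on are the existing BagState/Timer primitives in the Python SDK.

import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer
from apache_beam.utils.timestamp import Duration, Timestamp

# Illustrative knobs only; the real values would be user-configurable.
BUCKET_BOUNDARIES = (64, 256, 1024)  # token-length bucket boundaries
TARGET_BATCH_SIZE = 8                # flush a bucket once it holds this many
MAX_WAIT = Duration(seconds=5)       # timer-based bound on buffering latency


def bucket_for(num_tokens):
  """Maps an element size (e.g. from element_size_fn) to a bucket key."""
  for boundary in BUCKET_BOUNDARIES:
    if num_tokens <= boundary:
      return 'le_%d' % boundary
  return 'overflow'


class BucketingDoFn(beam.DoFn):
  """Buffers elements per bucket key; flushes on size or on a timer."""

  BUFFER = BagStateSpec('buffer', StrUtf8Coder())
  FLUSH = TimerSpec('flush', TimeDomain.REAL_TIME)

  def process(self,
              keyed_element,
              buffer=beam.DoFn.StateParam(BUFFER),
              flush=beam.DoFn.TimerParam(FLUSH)):
    _, prompt = keyed_element
    buffer.add(prompt)
    buffered = list(buffer.read())
    if len(buffered) >= TARGET_BATCH_SIZE:
      buffer.clear()
      yield buffered
    else:
      # Bound how long a sparse bucket can sit waiting (straggler handling).
      flush.set(Timestamp.now() + MAX_WAIT)

  @on_timer(FLUSH)
  def on_flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
    leftover = list(buffer.read())
    if leftover:
      buffer.clear()
      yield leftover


# Usage sketch: key each prompt by its bucket, then batch within buckets.
#   prompts | beam.Map(lambda p: (bucket_for(len(p.split())), p))
#           | beam.ParDo(BucketingDoFn())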
*Engineering Principles:*
- Opt-in transform / feature flag so existing pipelines remain unchanged.
- Clear semantics around flushing (latency bounds, watermark
considerations).
- Benchmarks + tests as part of the initial PR series (padding ratio,
throughput, p95 latency).
------------------------------
*Phase 2 — Latency Breakthrough: Iterative / Streaming Inference*
*Goal: Improve Time-To-First-Token and enable real-time GenAI patterns
downstream.*
*Approach:*
- Extend ModelHandler/RunInference with an experimental iterative output
mode (token- or chunk-level), where the DoFn can emit partial results as
they’re produced (a rough sketch of the shape follows below).
- Start behind a feature flag with a conservative compatibility story.
(Note: This isn’t trying to turn Beam into a serving engine; it’s about
making Beam viable for interactive GenAI pipelines where incremental
outputs are part of the dataflow.)
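To make the shape concrete, here is a rough, non-authoritative sketch of
chunk-level emission at the DoFn level. Nothing in it is an existing or
proposed RunInference API: IterativeInferenceDoFn, generate_fn, and the
output dict are placeholders for whatever the design doc settles on.

import apache_beam as beam


class IterativeInferenceDoFn(beam.DoFn):
  """Emits partial generations as they are produced instead of one result.

  `generate_fn` is any callable that yields tokens for a prompt; it stands
  in for whatever streaming hook the ModelHandler extension would expose.
  """

  def __init__(self, generate_fn, chunk_size=16):
    self._generate_fn = generate_fn
    self._chunk_size = chunk_size  # tokens per emitted chunk

  def process(self, prompt):
    buffer = []
    for token in self._generate_fn(prompt):
      buffer.append(token)
      if len(buffer) >= self._chunk_size:
        # Partial result: downstream transforms see output incrementally,
        # which is what improves time-to-first-token for interactive flows.
        yield {'prompt': prompt, 'text': ''.join(buffer), 'final': False}
        buffer = []
    # Flush whatever remains and mark the generation complete.
    yield {'prompt': prompt, 'text': ''.join(buffer), 'final': True}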
------------------------------
*Next Steps & Process*
I’m currently drafting a design doc for Phase 1 (focusing on the state
machine, timer policies, and portability notes). If the direction
resonates, I’d love to share it early for feedback and iterate before
coding.
A note on process: I may package this roadmap as a GSoC proposal, but I’m
happy to drive it as incremental upstream improvements either way — Phase 1
in particular should land as a sequence of reviewable, measurable PRs.
*Feedback I’d value:*
1. Does “bucketing first, streaming next” match Beam ML’s direction for
GenAI workloads?
2. Any runner/portability constraints you’d want explicitly addressed in
the bucketing design (state/timers semantics, watermark behavior, etc.)?
3. Would you prefer Phase 2 to target chunk-level streaming first (lower
risk) before token-level?
Best regards,
*Elia* https://github.com/Eliaaazzz