+1, a weighted BatchElements would help this case a lot.
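
Something like the sketch below is what I have in mind (a minimal,
hypothetical sketch: TokenBudgetBatcher, max_tokens, and token_count_fn
are illustrative names of mine, not an existing Beam API). It emits the
current batch whenever adding the next element would exceed a per-batch
token budget:

import apache_beam as beam
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.windowed_value import WindowedValue


class TokenBudgetBatcher(beam.DoFn):
    """Batches elements by summed token count instead of element count."""

    def __init__(self, max_tokens, token_count_fn=len):
        self._max_tokens = max_tokens
        self._token_count_fn = token_count_fn

    def start_bundle(self):
        self._batch = []
        self._batch_tokens = 0

    def process(self, element):
        cost = self._token_count_fn(element)
        # Emit the buffered batch before this element would blow the budget.
        # (An element costing more than the budget still gets its own batch.)
        if self._batch and self._batch_tokens + cost > self._max_tokens:
            yield self._batch
            self._batch, self._batch_tokens = [], 0
        self._batch.append(element)
        self._batch_tokens += cost

    def finish_bundle(self):
        # Flush whatever is left over at the end of the bundle.
        if self._batch:
            yield WindowedValue(
                self._batch, GlobalWindow().max_timestamp(), [GlobalWindow()])

Usage would be along the lines of

    batches = prompts | beam.ParDo(TokenBudgetBatcher(
        max_tokens=8192, token_count_fn=lambda s: len(s.split())))

A real weighted BatchElements would of course also keep the min/max
batch sizes and adaptive sizing the current transform has; this only
shows the budget check.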

On Sun, Jan 25, 2026 at 1:23 AM Elia LIU <[email protected]> wrote:

> Dear Beam Community,
>
> My name is Elia, and I am a final-year student interested in contributing
> to Apache Beam's AI/ML infrastructure for GSoC 2026.
>
> I have been exploring RunInference for variable-length workloads,
> particularly NLP and LLM inference. The current batching strategy in
> BatchElements is primarily count-based, which for variable-length
> sequences can lead to inefficient padding, wasted GPU cycles, and
> unpredictable memory usage (including OOMs). For example, a count-based
> batch of 32 sequences in which one is 512 tokens and the rest are ~20
> must pad every sequence to 512, so most of the compute is spent on pad
> tokens.
>
> I propose introducing Content-Aware Batching (or Token-Based Batching) to
> the ML transform. This would allow batching based on a computational cost
> metric, such as total tokens, rather than element count. I intend to
> integrate this with dynamic padding in ModelHandler.
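>
> As a rough sketch of the shape such an API could take (the parameter
> names below are placeholders of mine, not the settled design from the
> issue):
>
>     batched = sequences | beam.BatchElements(
>         max_batch_size=64,                 # still cap the element count
>         element_size_fn=lambda s: len(s),  # per-element cost in tokens
>         max_batch_weight=8192)             # token budget per batch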
>
> I have opened a Feature Request with a conceptual API design for
> further context: "[Feature Request]: RunInference: Content-Aware
> Dynamic Batching for NLP/LLM Workloads" (apache/beam issue #37414)
> <https://github.com/apache/beam/issues/37414>
>
> I am planning to draft a design document for this feature and would
> appreciate any feedback on the approach, or pointers to existing
> efforts in this direction.
>
> Best regards,
>
> Elia
>
