Dear Beam Community,

My name is Elia, and I am a final-year student interested in contributing
to Apache Beam's AI/ML infrastructure for GSoC 2026.

I have been exploring RunInference for variable-length workloads,
specifically NLP and LLM inference. I noticed that the current batching
strategy in BatchElements is primarily count-based, which can lead to
excessive padding, wasted GPU cycles, and unpredictable memory usage
(including OOMs) when processing variable-length sequences.
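To illustrate the padding cost with rough, purely illustrative numbers:
in a fixed batch of 32 sequences where one sequence has 512 tokens and
the rest average 20, padding to the longest sequence means the model
processes 32 x 512 = 16,384 token positions for roughly 1,130 real
tokens, so more than 90% of the compute is spent on padding.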

I propose introducing Content-Aware Batching (or Token-Based Batching) to
the ML transform. This would allow batching based on a computational cost
metric, such as total tokens, rather than element count. I intend to
integrate this with dynamic padding in ModelHandler.
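To make the proposal concrete, here is a minimal, bundle-scoped sketch
of the idea (not a proposed final API; BatchByTokens, max_batch_tokens
and token_count_fn are illustrative names, not existing Beam API): a
DoFn that flushes its current batch whenever adding the next element
would exceed a token budget, so each batch's total cost stays bounded
regardless of how many elements it contains.

    # Conceptual sketch only; names are placeholders for discussion.
    import apache_beam as beam
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.windowed_value import WindowedValue

    class BatchByTokens(beam.DoFn):
        """Groups elements by a token budget, not a fixed element count."""

        def __init__(self, max_batch_tokens,
                     token_count_fn=lambda s: len(s.split())):
            self._max_batch_tokens = max_batch_tokens
            # Default is a crude whitespace proxy; a real tokenizer's
            # length function could be passed in instead.
            self._token_count_fn = token_count_fn

        def start_bundle(self):
            self._batch = []
            self._batch_tokens = 0

        def process(self, element):
            cost = self._token_count_fn(element)
            # Flush if this element would push the batch over budget.
            if (self._batch and
                    self._batch_tokens + cost > self._max_batch_tokens):
                yield self._batch
                self._batch, self._batch_tokens = [], 0
            self._batch.append(element)
            self._batch_tokens += cost

        def finish_bundle(self):
            # Emit whatever is left at the end of the bundle.
            if self._batch:
                yield WindowedValue(
                    self._batch,
                    GlobalWindow().max_timestamp(),
                    [GlobalWindow()])

Usage would look something like
batches = lines | beam.ParDo(BatchByTokens(max_batch_tokens=4096)), with
each emitted batch then flowing into RunInference, where the ModelHandler
pads only to the longest sequence in that batch. The actual design would
integrate this with the existing BatchElements/RunInference batching path
rather than a standalone DoFn.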

I have opened a Feature Request with a conceptual API design for further
context: "[Feature Request]: RunInference: Content-Aware Dynamic Batching
for NLP/LLM Workloads" (Issue #37414)
<https://github.com/apache/beam/issues/37414>

I am planning to draft a design document for this feature and would
appreciate any feedback on the approach, as well as pointers to any
existing efforts in this direction.

Best regards,

Elia
