+1, a weighted BatchElements would help this case a lot.

On Sun, Jan 25, 2026 at 1:23 AM Elia LIU <[email protected]> wrote:
> Dear Beam Community,
>
> My name is Elia, and I am a final-year student interested in contributing
> to Apache Beam's AI/ML infrastructure for GSoC 2026.
>
> I have been exploring RunInference for variable-length workloads,
> specifically within NLP and LLMs. I noticed that the current batching
> strategy in BatchElements is primarily count-based, which can lead to
> inefficient padding, wasted GPU cycles, and unpredictable memory usage
> (OOMs) when processing variable-length sequences.
>
> I propose introducing Content-Aware Batching (or Token-Based Batching) to
> the ML transform. This would allow batching based on a computational cost
> metric, such as total tokens, rather than element count. I intend to
> integrate this with dynamic padding in ModelHandler.
>
> I have opened a feature request with a conceptual API design for further
> context: [Feature Request]: RunInference: Content-Aware Dynamic Batching
> for NLP/LLM Workloads · Issue #37414 · apache/beam
> <https://github.com/apache/beam/issues/37414>
>
> I am planning to draft a design document for this feature and would
> appreciate any feedback on this approach or on any existing efforts in
> this direction.
>
> Best regards,
>
> Elia
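
For what it's worth, here is a rough sketch of the kind of thing I mean: a plain
DoFn that flushes a batch once a cumulative token budget is reached, rather than
after a fixed element count. The names (TokenBudgetBatcher, max_tokens, size_fn)
are placeholders and not an existing Beam API; the real feature would presumably
live inside BatchElements / RunInference rather than as a separate transform.

import apache_beam as beam
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.timestamp import MIN_TIMESTAMP
from apache_beam.utils.windowed_value import WindowedValue


class TokenBudgetBatcher(beam.DoFn):
    # Buffers elements within a bundle and flushes a batch whenever adding
    # the next element would push the cumulative cost past max_tokens.
    def __init__(self, max_tokens, size_fn=len):
        self._max_tokens = max_tokens
        self._size_fn = size_fn  # maps an element to its cost, e.g. token count

    def start_bundle(self):
        self._batch = []
        self._tokens = 0

    def process(self, element):
        cost = self._size_fn(element)
        if self._batch and self._tokens + cost > self._max_tokens:
            yield self._batch
            self._batch, self._tokens = [], 0
        self._batch.append(element)
        self._tokens += cost

    def finish_bundle(self):
        # Flush whatever is left when the bundle ends.
        if self._batch:
            yield WindowedValue(self._batch, MIN_TIMESTAMP, [GlobalWindow()])


with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(["a short prompt", "a much longer prompt " * 50])
        | beam.ParDo(TokenBudgetBatcher(max_tokens=512,
                                        size_fn=lambda s: len(s.split())))
        | beam.Map(print))

Per-bundle buffering keeps this stateless across bundles; a production version
would also want an element-count cap and probably byte-size weighting, which is
exactly what a weighted BatchElements would give us.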
