Dear Beam Community,

My name is Elia, and I am a final-year student interested in contributing to Apache Beam's AI/ML infrastructure for GSoC 2026.
I have been exploring RunInference for variable-length workloads, specifically NLP and LLM inference. The current batching strategy in BatchElements is primarily count-based, which for variable-length sequences can lead to inefficient padding, GPU cycles wasted on padding tokens, and unpredictable memory usage (OOMs).

I propose introducing Content-Aware Batching (or Token-Based Batching) to the ML transform. This would allow batches to be formed against a computational cost budget, such as total tokens, rather than a fixed element count. I intend to integrate this with dynamic padding in ModelHandler.

For further context, I have opened a feature request with a conceptual API design: [Feature Request]: RunInference: Content-Aware Dynamic Batching for NLP/LLM Workloads · Issue #37414 · apache/beam <https://github.com/apache/beam/issues/37414>

I am planning to draft a design document for this feature and would appreciate any feedback on this approach, as well as pointers to any existing efforts in this direction.
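To make the idea concrete, here is a minimal, non-authoritative sketch of the core batching logic, written as a per-bundle DoFn on top of the existing Python SDK. The names BatchByTokenBudget, max_tokens, and cost_fn are illustrative assumptions only; they are not existing Beam parameters and not the final API. The real proposal would extend BatchElements / ModelHandler rather than add a standalone DoFn.

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.utils.windowed_value import WindowedValue


class BatchByTokenBudget(beam.DoFn):
    """Buffers elements within a bundle and emits a batch whenever adding
    the next element would push the accumulated cost over max_tokens."""

    def __init__(self, max_tokens, cost_fn):
        self._max_tokens = max_tokens
        self._cost_fn = cost_fn  # e.g. lambda text: len(tokenizer.encode(text))

    def start_bundle(self):
        self._batch = []
        self._cost = 0

    def process(self, element):
        cost = self._cost_fn(element)
        # Emit the current batch if adding this element would exceed the budget.
        if self._batch and self._cost + cost > self._max_tokens:
            yield self._batch
            self._batch, self._cost = [], 0
        self._batch.append(element)
        self._cost += cost

    def finish_bundle(self):
        # Flush whatever is left at the end of the bundle.
        if self._batch:
            yield WindowedValue(
                self._batch,
                window.GlobalWindow().max_timestamp(),
                [window.GlobalWindow()])

Usage would look something like:

    batches = texts | beam.ParDo(
        BatchByTokenBudget(max_tokens=4096, cost_fn=lambda s: len(s.split())))

This sketch only batches within a bundle and assumes the global window; the design document will cover how the same cost-budget idea fits into BatchElements and the ModelHandler batching/padding path.

Best regards,
Elia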
