Hi Beam Community, I would like to propose the inclusion of a new feature in the upcoming 2.72.0 release: Sharded Parallel Reading for SparkReceiverIO (Java SDK).
Currently, SparkReceiverIO is restricted to a single worker because its read is initiated from a single Impulse. I have submitted a PR that adds horizontal scalability through a sharding mechanism.

Key enhancements:

* Configurable parallelism: users can now set .withNumReaders(n) to distribute the read across n workers.
* True sharding: rather than having every worker redundantly read the same data, a new HasOffset.setShard(shardId, numShards) interface lets the underlying Receiver partition the stream (e.g., via modulo arithmetic or topic-partition mapping), ensuring no record is duplicated across workers.
* Modernized architecture: the legacy Impulse logic is replaced with a Create + Reshuffle pattern to ensure proper fusion-breaking across the different runners.

PR #37411: https://github.com/apache/beam/pull/37411
Issue #37410: https://github.com/apache/beam/issues/37410

All pre-commit tests are passing, and the PR includes unit tests verifying that each record is consumed by exactly one of the parallel shards.

I would appreciate feedback from the community and the release managers on including this in 2.72.0.

Best regards,
Atharva Ralegankar
https://www.linkedin.com/in/atharvaralegankar/
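P.S. To make the sharding semantics concrete, here is a minimal, Beam-independent sketch of the modulo partitioning a receiver could apply once it has been assigned (shardId, numShards). The class and helper names here are hypothetical illustrations of the idea, not code from the PR:

```java
import java.util.ArrayList;
import java.util.List;

public class ShardingSketch {

    /**
     * Returns the offsets this shard is responsible for under modulo
     * partitioning: a receiver assigned (shardId, numShards) keeps only
     * records whose offset % numShards == shardId, so the shards are
     * pairwise disjoint and together cover every offset exactly once.
     */
    static List<Long> offsetsForShard(List<Long> allOffsets, int shardId, int numShards) {
        List<Long> mine = new ArrayList<>();
        for (long offset : allOffsets) {
            if (offset % numShards == shardId) {
                mine.add(offset);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        // Offsets 0..9 split across 3 shards.
        List<Long> offsets = new ArrayList<>();
        for (long i = 0; i < 10; i++) {
            offsets.add(i);
        }
        int numShards = 3;
        int total = 0;
        for (int shard = 0; shard < numShards; shard++) {
            List<Long> mine = offsetsForShard(offsets, shard, numShards);
            System.out.println("shard " + shard + " -> " + mine);
            total += mine.size();
        }
        // Every offset is consumed exactly once across all shards.
        System.out.println("total=" + total); // prints "total=10"
    }
}
```

The same disjoint-coverage property is what the PR's unit tests check across parallel readers; for partitioned sources such as Kafka-style topics, a topic-partition mapping can replace the modulo rule.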
