Hi Beam Community,

I would like to propose the inclusion of a new feature in the upcoming
2.72.0 release: Sharded Parallel Reading for SparkReceiverIO (Java
SDK).

Currently, SparkReceiverIO is restricted to a single worker because the read is
seeded by a single Impulse element. I have submitted a PR that adds horizontal
scalability through a sharding mechanism.

Key Enhancements:

Configurable Parallelism: Users can now set .withNumReaders(n) to distribute
the read across n parallel readers.
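
For illustration, a rough sketch of how the new option might be used. Everything
except .withNumReaders(n) follows my reading of the existing SparkReceiverIO
builder API, and CustomReceiver is a hypothetical receiver (sketched under
"True Sharding" below) that emits records of the form "<offset>,record-<offset>":

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.sparkreceiver.ReceiverBuilder;
import org.apache.beam.sdk.io.sparkreceiver.SparkReceiverIO;
import org.apache.beam.sdk.values.PCollection;

public class ShardedReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    PCollection<String> records =
        p.apply(
            "ReadFromReceiver",
            SparkReceiverIO.<String>read()
                .withSparkReceiverBuilder(
                    new ReceiverBuilder<String, CustomReceiver>(CustomReceiver.class))
                // Offsets are parsed out of the record payload.
                .withGetOffsetFn(record -> Long.parseLong(record.split(",")[0]))
                // New in PR #37411: distribute the read across 4 parallel sharded readers.
                .withNumReaders(4));

    p.run().waitUntilFinish();
  }
}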

True Sharding: Rather than having every reader redundantly consume the same
stream, I have introduced a setShard(shardId, numShards) method on the
HasOffset interface. This allows the underlying Receiver to partition data
(e.g., via modulo or topic-partition mapping), ensuring no data is duplicated
across workers.
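
A minimal sketch of a receiver implementing the proposed method, assuming int
parameters for setShard and a simple modulo partitioning scheme (the
data-generation loop is only a placeholder):

import org.apache.beam.sdk.io.sparkreceiver.HasOffset;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.receiver.Receiver;

public class CustomReceiver extends Receiver<String> implements HasOffset {

  private long startOffset = 0L;
  private int shardId = 0;
  private int numShards = 1;

  public CustomReceiver() {
    super(StorageLevel.MEMORY_AND_DISK_2());
  }

  @Override
  public void setStartOffset(Long startOffset) {
    this.startOffset = startOffset == null ? 0L : startOffset;
  }

  // Proposed in PR #37411 (signature assumed here): tells the receiver which
  // shard of the data it is responsible for.
  @Override
  public void setShard(int shardId, int numShards) {
    this.shardId = shardId;
    this.numShards = numShards;
  }

  @Override
  public Long getEndOffset() {
    return Long.MAX_VALUE; // Unbounded stream.
  }

  @Override
  public void onStart() {
    new Thread(
            () -> {
              // Placeholder source loop: keep only records whose offset maps to this shard.
              for (long offset = startOffset; !isStopped(); offset++) {
                if (offset % numShards == shardId) {
                  store(offset + ",record-" + offset);
                }
              }
            })
        .start();
  }

  @Override
  public void onStop() {}
}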

Modernized Architecture: Replaced legacy Impulse logic with a Create +
Reshuffle pattern to ensure proper fusion-breaking across different
runners.
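
The general shape of that expansion, as a standalone sketch (simplified, not
the exact code in the PR):

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

public class FusionBreakSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    int numReaders = 4;

    // One element per shard, instead of the single element produced by Impulse.
    List<Integer> shardIds =
        IntStream.range(0, numReaders).boxed().collect(Collectors.toList());

    PCollection<Integer> shards =
        p.apply("CreateShards", Create.of(shardIds))
            // Reshuffle breaks fusion so each shard element can be scheduled on a
            // different worker rather than being fused onto a single one.
            .apply("BreakFusion", Reshuffle.viaRandomKey());

    // In the connector, each shard element then seeds one sharded reader.
    p.run().waitUntilFinish();
  }
}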

PR #37411: https://github.com/apache/beam/pull/37411
Issue #37410: https://github.com/apache/beam/issues/37410

All pre-commit tests are passing, and I have included unit tests verifying that
records are consumed without duplication across multiple parallel shards.

I would appreciate feedback from the community and the release
managers on including this in 2.72.0.

Best regards,
Atharva Ralegankar
https://www.linkedin.com/in/atharvaralegankar/
