sjvanrossum commented on PR #31608: URL: https://github.com/apache/beam/pull/31608#issuecomment-2173567583
@iht and I have been looking into this today for a Dataflow customer and we came across a few details that seem to be missing from this PR:

1. This does not work with `DataflowRunner` unless `--experiments=enable_custom_pubsub_sink` is specified, since Dataflow's native implementation of `PubsubUnboundedSink` omits the ordering key before publishing.
2. `PubsubUnboundedSink.PubsubSink` and `PubsubUnboundedSink.PubsubDynamicSink` do not group on the ordering key property, so multiple ordering keys can end up in the same batch for publishing.
3. `PubsubUnboundedSink` sets a fixed number of shards (100) on both `PubsubUnboundedSink.PubsubSink` and `PubsubUnboundedSink.PubsubDynamicSink` to improve latency within the sink. Simply adding the ordering key as an additional property may result in many small batches being produced, which can have a negative impact due to the per-call overhead of batch publishing.

The issue we're working on is time-sensitive, so we're trying to wrap up our patches today.

To avoid user confusion, this PR must incorporate changes to `PubsubUnboundedSink.ShardFn` to avoid triggering this error in Pub/Sub:

```
In a single publish request, all messages must have no ordering key or they must all have the same ordering key. [code=539b]
```

A nice-to-have would be enabling users to customize the output sharding range based on ordering keys. Given that throughput per ordering key is capped at 1 MBps, I'd almost be inclined to say the ordering key should replace the output shard entirely.

@ahmedabu98 I'm happy to share our changes in a bit, and I'll set up a PR against the source branch of this PR.
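
To make the batching constraint concrete, here is a minimal, self-contained sketch of the idea floated above: messages that carry an ordering key are grouped by that key instead of by a numeric shard, so no publish batch ever mixes ordering keys. This is not the code in `PubsubUnboundedSink.ShardFn` or in this PR. `OrderedBatcher`, `Msg`, `shardKey`, and `batches` are hypothetical names used only for illustration, and hashing the payload stands in for whatever shard assignment the real sink uses.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of Apache Beam. */
final class OrderedBatcher {

  /** Minimal stand-in for a Pub/Sub message; an empty ordering key means "unordered". */
  static final class Msg {
    final String payload;
    final String orderingKey;

    Msg(String payload, String orderingKey) {
      this.payload = payload;
      this.orderingKey = orderingKey;
    }
  }

  /**
   * Grouping key for a message. Assumption: a message with an ordering key is
   * grouped by that key alone (the key replaces the shard), while an unordered
   * message falls back to one of numShards hash-based shards.
   */
  static String shardKey(Msg m, int numShards) {
    if (!m.orderingKey.isEmpty()) {
      return "key:" + m.orderingKey;
    }
    return "shard:" + Math.floorMod(m.payload.hashCode(), numShards);
  }

  /**
   * Splits messages into publish batches of at most maxBatchSize. Every batch
   * holds messages from a single group, so it never mixes ordering keys, and
   * input order is preserved within each group.
   */
  static List<List<Msg>> batches(List<Msg> messages, int numShards, int maxBatchSize) {
    Map<String, List<Msg>> groups = new LinkedHashMap<>();
    for (Msg m : messages) {
      groups.computeIfAbsent(shardKey(m, numShards), k -> new ArrayList<>()).add(m);
    }
    List<List<Msg>> out = new ArrayList<>();
    for (List<Msg> group : groups.values()) {
      for (int i = 0; i < group.size(); i += maxBatchSize) {
        int end = Math.min(i + maxBatchSize, group.size());
        out.add(new ArrayList<>(group.subList(i, end)));
      }
    }
    return out;
  }
}
```

The trade-off from point 3 shows up directly in this sketch: with many distinct ordering keys, each group is small, so more publish calls carry fewer messages each.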