sjvanrossum commented on PR #31608:
URL: https://github.com/apache/beam/pull/31608#issuecomment-2173567583

   @iht and I have been looking into this today for a Dataflow customer and we 
came across a few details that seem to be missing from this PR:
   1. This does not work with `DataflowRunner` unless 
`--experiments=enable_custom_pubsub_sink` is specified, since Dataflow's native 
implementation for `PubsubUnboundedSink` omits the ordering key before 
publishing.
   2. `PubsubUnboundedSink.PubsubSink` and 
`PubsubUnboundedSink.PubsubDynamicSink` do not group on the ordering key 
property, which will cause multiple ordering keys to end up in the same batch 
for publishing.
   3. `PubsubUnboundedSink` sets a fixed number of shards (100) on both 
`PubsubUnboundedSink.PubsubSink` and `PubsubUnboundedSink.PubsubDynamicSink` to 
improve latency within the sink. Simply adding the ordering key as an 
additional property may produce many small batches, which can hurt performance 
because of the per-call overhead of batch publishing.
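
   To illustrate point 2, here is a minimal sketch of splitting the contents 
of one shard into batches that each hold a single ordering key. It is plain 
Java and not the code in `PubsubUnboundedSink`; the `Msg` class, the 
`batchByOrderingKey` method and the `maxBatchSize` parameter are hypothetical 
names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OrderingKeyBatcher {

  /** Hypothetical stand-in for a Pub/Sub message; not Beam's PubsubMessage. */
  public static final class Msg {
    public final String orderingKey; // empty string means "no ordering key"
    public final String payload;

    public Msg(String orderingKey, String payload) {
      this.orderingKey = orderingKey;
      this.payload = payload;
    }
  }

  /**
   * Groups messages by ordering key, then cuts each group into batches of at
   * most maxBatchSize messages. Every returned batch holds exactly one
   * ordering key (or only messages without one), and the relative order of
   * messages that share a key is preserved.
   */
  public static List<List<Msg>> batchByOrderingKey(List<Msg> messages, int maxBatchSize) {
    Map<String, List<Msg>> byKey = new LinkedHashMap<>();
    for (Msg m : messages) {
      byKey.computeIfAbsent(m.orderingKey, k -> new ArrayList<>()).add(m);
    }
    List<List<Msg>> batches = new ArrayList<>();
    for (List<Msg> group : byKey.values()) {
      for (int i = 0; i < group.size(); i += maxBatchSize) {
        int end = Math.min(i + maxBatchSize, group.size());
        batches.add(new ArrayList<>(group.subList(i, end)));
      }
    }
    return batches;
  }
}
```

   With many distinct ordering keys per shard this yields many small batches, 
which is the overhead concern raised in point 3.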
   
   The issue we're working on is time-sensitive so we're trying to wrap up our 
patches today.
   To avoid user confusion, this PR must incorporate changes to 
`PubsubUnboundedSink.ShardFn` to avoid triggering this error in Pub/Sub:
   ```
   In a single publish request, all messages must have no ordering key or they 
must all have the same ordering key. [code=539b]
   ```
   A nice-to-have would be enabling users to customize the output sharding 
range based on ordering keys. Given that throughput per ordering key is capped 
at 1 MBps, I'd almost be inclined to say the ordering key should replace the 
output shard entirely.
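
   As a rough sketch of the sharding idea, assuming messages without an 
ordering key keep a random shard while keyed messages are mapped 
deterministically into a configurable range: the class below is plain Java, 
not the actual `PubsubUnboundedSink.ShardFn`, and `shardFor` and `numShards` 
are hypothetical names.

```java
import java.util.concurrent.ThreadLocalRandom;

public class OrderingKeyShards {

  /**
   * Picks an output shard in [0, numShards).
   * Messages with an ordering key always map to the same shard, so all
   * messages for that key are batched together. Messages without an ordering
   * key are spread randomly across the shards.
   */
  public static int shardFor(String orderingKey, int numShards) {
    if (orderingKey == null || orderingKey.isEmpty()) {
      return ThreadLocalRandom.current().nextInt(numShards);
    }
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(orderingKey.hashCode(), numShards);
  }
}
```

   Several ordering keys can still hash to the same shard under this scheme, 
so each shard's contents would still need to be split by ordering key before 
publishing. Using the ordering key itself as the shard avoids that step.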
   
   @ahmedabu98 I'm happy to share our changes in a bit and I'll set up a PR 
against the source branch of this PR.


