Hmm, I'm not sure how that will help. I understand how to batch up the data, but it's the triggering part that I don't see how to do. For example, in Spark Structured Streaming you can set a time trigger that fires at a fixed interval and propagates all the way up to the source, so the source can even throttle how much data it reads.
Here is my use case more thoroughly explained: I have a Kafka topic (with
multiple partitions) that I'm reading from, and I need to aggregate batches of
up to 500 messages before sending a single batch off in an RPC call. However,
the vendor specified a rate limit, so if there are more than 500 unread
messages in the topic, I must wait 1 second before issuing another RPC call.
When searching on Stack Overflow I found this answer:
https://stackoverflow.com/a/57275557/25658 that makes it seem challenging, but
I wasn't sure if things had changed since then or you had better ideas.

*~Vincent*

On Thu, Oct 1, 2020 at 2:57 PM Luke Cwik <[email protected]> wrote:

> Look at the GroupIntoBatches[1] transform. It will buffer "batches" of
> size X for you.
>
> 1:
> https://beam.apache.org/documentation/transforms/java/aggregation/groupintobatches/
>
> On Thu, Oct 1, 2020 at 2:51 PM Vincent Marquez <[email protected]>
> wrote:
>
>> The downstream consumer has these requirements.
>>
>> *~Vincent*
>>
>> On Thu, Oct 1, 2020 at 2:29 PM Luke Cwik <[email protected]> wrote:
>>
>>> Why do you want to only emit X? (e.g. running out of memory in the
>>> runner)
>>>
>>> On Thu, Oct 1, 2020 at 2:08 PM Vincent Marquez <
>>> [email protected]> wrote:
>>>
>>>> Hello all. If I want to 'throttle' the number of messages I pull off,
>>>> say, Kafka or some other queue, in order to make sure I only emit X
>>>> amount per trigger, is there a way to do that and ensure that I get
>>>> 'at least once' delivery guarantees? If this isn't supported, would
>>>> the better way be to pull a limited amount as opposed to doing it on
>>>> the output side?
>>>>
>>>> *~Vincent*
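For what it's worth, the batch-then-throttle behavior described above (batches of up to 500, at most one RPC per second) can be sketched independently of Beam to make the requirement concrete. This is an illustrative stdlib-Python sketch, not Beam code; `rpc_call`, `max_batch`, and `min_interval` are hypothetical names, and in an actual Beam pipeline the batching would come from GroupIntoBatches while the pacing would need stateful processing with timers:

```python
import time
from typing import Callable, Iterable, Iterator, List

def batches(messages: Iterable[str], max_batch: int = 500) -> Iterator[List[str]]:
    """Group an incoming message stream into batches of at most max_batch."""
    batch: List[str] = []
    for msg in messages:
        batch.append(msg)
        if len(batch) == max_batch:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

def send_with_throttle(
    messages: Iterable[str],
    rpc_call: Callable[[List[str]], None],
    max_batch: int = 500,
    min_interval: float = 1.0,
) -> None:
    """Issue one RPC per batch, waiting at least min_interval seconds
    between consecutive calls (the vendor's rate limit)."""
    last_call = float("-inf")
    for batch in batches(messages, max_batch):
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        rpc_call(batch)
        last_call = time.monotonic()
```

Note that this single-threaded sketch sidesteps the hard part raised in the linked Stack Overflow answer: in a distributed runner with multiple workers reading Kafka partitions, enforcing a global one-call-per-second limit requires coordination (e.g. keying everything to one key, or an external rate limiter), not just local sleeping.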
