Thanks for the response.  Is my understanding correct that SplittableDoFns
are only applicable to batch pipelines?  I'm wondering if there are any
proposals to address backpressure needs?
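In the meantime, the vendor constraint from the thread below (batches of at
most 500 messages, at least one second between RPC calls) can be enforced
client-side. Here is a minimal plain-Java sketch, independent of Beam; the
names (RateLimitedBatcher, tryTakeBatch) are illustrative assumptions, and
the clock is passed in explicitly so the pacing logic is easy to see:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only (plain Java, not Beam): enforce the vendor limit from
// the thread -- at most 500 records per RPC, at least one second between
// calls. The class/method names and the explicit nanosecond clock are
// assumptions for this sketch.
class RateLimitedBatcher {
    private final int maxBatchSize;
    private final long minIntervalNanos;
    private final List<String> buffer = new ArrayList<>();
    private boolean sentBefore = false;
    private long lastSendNanos = 0L;

    RateLimitedBatcher(int maxBatchSize, long minIntervalMillis) {
        this.maxBatchSize = maxBatchSize;
        this.minIntervalNanos = minIntervalMillis * 1_000_000L;
    }

    void add(String record) {
        buffer.add(record);
    }

    // Returns the next batch to send at time nowNanos, or null if the
    // rate limit says we must still wait (or nothing is buffered).
    List<String> tryTakeBatch(long nowNanos) {
        if (buffer.isEmpty()) {
            return null;
        }
        if (sentBefore && nowNanos - lastSendNanos < minIntervalNanos) {
            return null;
        }
        int n = Math.min(maxBatchSize, buffer.size());
        List<String> batch = new ArrayList<>(buffer.subList(0, n));
        buffer.subList(0, n).clear();
        lastSendNanos = nowNanos;
        sentBefore = true;
        return batch;
    }
}
```

In a real pipeline the delay would have to come from the runner (e.g. via a
resume delay as described below) rather than from sleeping in user code.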
*~Vincent*


On Tue, Oct 6, 2020 at 1:37 PM Luke Cwik <[email protected]> wrote:

> There is no general back-pressure mechanism within Apache Beam. Runners
> should be intelligent about this, but there is currently no way for a
> pipeline to signal "I'm being throttled," so runners don't know that
> throwing more CPUs at a problem won't make it go faster.
>
> You can control how quickly you ingest data for runners that support
> splittable DoFns with SDK-initiated checkpoints with resume delays. A
> splittable DoFn can return resume().withDelay(Duration.seconds(10))
> from the @ProcessElement method. See Watch[1] for an example.
>
> The 2.25.0 release enables more splittable DoFn features on more runners.
> I'm working on a blog post (initial draft[2], still mostly empty) to update
> the old post from 2017.
>
> 1:
> https://github.com/apache/beam/blob/9c239ac93b40e911f03bec5da3c58a07fdceb245/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Watch.java#L908
> 2:
> https://docs.google.com/document/d/1kpn0RxqZaoacUPVSMYhhnfmlo8fGT-p50fEblaFr2HE/edit#
>
>
> On Tue, Oct 6, 2020 at 10:39 AM Vincent Marquez <[email protected]>
> wrote:
>
>> Hmm, I'm not sure how that will help. I understand how to batch up the
>> data, but it is the triggering part that I don't see how to do.  For
>> example, in Spark Structured Streaming, you can set a time trigger that
>> fires at a fixed interval and propagates all the way up to the source, so
>> the source can even throttle how much data to read.
>>
>> Here is my use case more thoroughly explained:
>>
>> I have a Kafka topic (with multiple partitions) that I'm reading from,
>> and I need to aggregate batches of up to 500 messages before sending each
>> batch off in a single RPC call.  However, the vendor specified a rate
>> limit, so if there are more than 500 unread messages in the topic, I must
>> wait 1 second before issuing another RPC call. When searching on Stack
>> Overflow I found this answer: https://stackoverflow.com/a/57275557/25658
>> that makes it seem challenging, but I wasn't sure whether things had
>> changed since then or whether you had better ideas.
>>
>> *~Vincent*
>>
>>
>> On Thu, Oct 1, 2020 at 2:57 PM Luke Cwik <[email protected]> wrote:
>>
>>> Look at the GroupIntoBatches[1] transform. It will buffer "batches" of
>>> size X for you.
>>>
>>> 1:
>>> https://beam.apache.org/documentation/transforms/java/aggregation/groupintobatches/
>>>
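Conceptually, the buffering that GroupIntoBatches performs per key can be
pictured with a plain-Java sketch (this is not the Beam implementation,
which buffers in per-key, per-window state; the helper name here is an
assumption):

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only, not Beam internals: split an element stream into
// consecutive batches of at most batchSize elements, with any leftover
// elements emitted as a final, smaller batch.
class Batches {
    static <T> List<List<T>> groupIntoBatches(List<T> input, int batchSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < input.size(); i += batchSize) {
            int end = Math.min(i + batchSize, input.size());
            out.add(new ArrayList<>(input.subList(i, end)));
        }
        return out;
    }
}
```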
>>> On Thu, Oct 1, 2020 at 2:51 PM Vincent Marquez <
>>> [email protected]> wrote:
>>>
>>>> the downstream consumer has these requirements.
>>>>
>>>> *~Vincent*
>>>>
>>>>
>>>> On Thu, Oct 1, 2020 at 2:29 PM Luke Cwik <[email protected]> wrote:
>>>>
>>>>> Why do you want to only emit X? (e.g. running out of memory in the
>>>>> runner)
>>>>>
>>>>> On Thu, Oct 1, 2020 at 2:08 PM Vincent Marquez <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello all.  If I want to 'throttle' the number of messages I pull off
>>>>>> of, say, Kafka or some other queue, in order to make sure I only emit
>>>>>> X messages per trigger, is there a way to do that and still ensure
>>>>>> 'at least once' delivery guarantees?  If this isn't supported, would
>>>>>> it be better to limit the amount pulled from the source as opposed to
>>>>>> doing it on the output side?
>>>>>>
>>>>>>
>>>>>> *~Vincent*
>>>>>>
>>>>>
