(I notice that you replied only to yourself, but there has been a whole
thread of discussion on this - are you subscribed to dev@beam?
https://lists.apache.org/thread/k81fq301ypwmjowknzyqq2qc63844rbd)

It sounds like you want what everyone wants: to have the biggest bundles
possible.

So for bounded data, you basically make even splits of the data, with each
split as one bundle, and then use dynamic splitting to redistribute work and
eliminate stragglers, if your engine has that capability.

For unbounded data, you more-or-less bundle as much as you can without
waiting too long, like Jan described.
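The "bundle as much as you can without waiting too long" idea can be sketched
as a batcher that closes a bundle on whichever limit is hit first, size or
elapsed time. This is just an illustration, not Beam's actual implementation,
and the names and parameters are mine:

```python
import time

def batch_stream(source, max_size, max_wait_s):
    """Yield bundles capped by element count or elapsed time, whichever hits first."""
    bundle, deadline = [], time.monotonic() + max_wait_s
    for element in source:
        bundle.append(element)
        if len(bundle) >= max_size or time.monotonic() >= deadline:
            yield bundle
            bundle, deadline = [], time.monotonic() + max_wait_s
    if bundle:
        # Flush whatever is left when the source (temporarily) runs dry.
        yield bundle

print(list(batch_stream(range(5), max_size=2, max_wait_s=10.0)))
# → [[0, 1], [2, 3], [4]]
```

With an unbounded source the time cap is what keeps latency bounded; with a
fast source the size cap is what keeps bundles from growing without limit.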

Users know to put their high fixed costs in @StartBundle, and then it is the
runner's job to pack as many @ProcessElement calls as possible into each
bundle to amortize that cost.
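To make the amortization concrete, here is a toy, SDK-free sketch (the class
and runner-loop names are purely illustrative, not Beam APIs): the expensive
setup runs once per bundle, and every element in the bundle shares it.

```python
class ExpensiveDoFn:
    """Mimics a DoFn whose high fixed cost is paid in start_bundle."""

    def start_bundle(self):
        # Stand-in for an expensive per-bundle setup, e.g. loading a model.
        self.resource = object()

    def process(self, element):
        # Cheap per-element work once the fixed cost has been paid.
        yield element * 2

def run_bundle(dofn, bundle):
    """Toy runner loop: one start_bundle call amortized over all elements."""
    dofn.start_bundle()
    out = []
    for element in bundle:
        out.extend(dofn.process(element))
    return out

print(run_bundle(ExpensiveDoFn(), [1, 2, 3]))  # → [2, 4, 6]
```

The per-element cost of the setup is (fixed cost) / (bundle size), which is
why runners want bundles as large as latency constraints allow.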

Kenn

On Fri, Sep 22, 2023 at 9:39 AM Joey Tran <joey.t...@schrodinger.com> wrote:

> Whoops, I typoed my last email. I meant to write "this isn't the
> greatest strategy for high *fixed* cost transforms", e.g. a transform
> that takes 5 minutes to get set up and then maybe a microsecond per input
>
> I suppose one solution is to move the responsibility for handling this
> kind of situation to the user and expect users to use a bundling transform
> (e.g. BatchElements [1]) followed by a Reshuffle+FlatMap. Is this what
> other runners expect? Just want to make sure I'm not missing some smart
> generic bundling strategy that might handle this for users.
>
> [1]
> https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements
>
>
> On Thu, Sep 21, 2023 at 7:23 PM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> I'm writing a runner, and the first strategy for determining bundle size was
>> to just start with a bundle size of one and double it until we reach a size
>> that we expect to take some target per-bundle runtime (e.g. maybe 10
>> minutes). I realize that this isn't the greatest strategy for high sized
>> cost transforms. I'm curious what kind of strategies other runners take?
>>
>
