Whoops, I typoed my last email. I meant to write "this isn't the
greatest strategy for high *fixed* cost transforms", e.g. a transform that
takes 5 minutes to set up and then maybe a microsecond per input.
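
To make the problem concrete, here is a rough back-of-the-envelope sketch
(not taken from any real runner). It simulates the doubling strategy under
the assumption that the fixed setup cost is paid again on every bundle. The
function name and the cost model are made up for illustration; the numbers
are the ones from the example above plus the 10 minute target.

```python
def doubling_warmup(fixed_cost_s, per_element_s, target_s):
    """Double the bundle size from 1 until one bundle takes >= target_s.

    Assumes (hypothetically) that every bundle pays the full fixed cost.
    Returns (final_bundle_size, bundles_run, total_seconds_spent).
    """
    size, bundles, total = 1, 0, 0.0
    while True:
        runtime = fixed_cost_s + per_element_s * size
        bundles += 1
        total += runtime
        if runtime >= target_s:
            return size, bundles, total
        size *= 2


# 5 min setup, 1 microsecond per input, 10 min target per-bundle runtime.
size, bundles, total = doubling_warmup(300.0, 1e-6, 600.0)
setup_share = bundles * 300.0 / total
print(f"final size={size}, bundles={bundles}, "
      f"total={total / 3600:.1f}h, setup share={setup_share:.0%}")
```

Under these assumptions the loop runs 30 bundles and only reaches the target
at a bundle size of 2**29, after roughly 2.8 hours, most of which is repeated
setup. An input with fewer than about a billion elements would never reach
the target size at all.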

I suppose one solution is to move the responsibility for handling this kind
of situation to the user and expect users to use a bundling transform (e.g.
BatchElements [1]) followed by a Reshuffle+FlatMap. Is this what other
runners expect? Just want to make sure I'm not missing some smart generic
bundling strategy that might handle this for users.

[1]
https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements


On Thu, Sep 21, 2023 at 7:23 PM Joey Tran <joey.t...@schrodinger.com> wrote:

> Writing a runner and the first strategy for determining bundling size was
> to just start with a bundle size of one and double it until we reach a size
> that we expect to take some target per-bundle runtime (e.g. maybe 10
> minutes). I realize that this isn't the greatest strategy for high sized
> cost transforms. I'm curious what kind of strategies other runners take?
>
