Whoops, I typoed my last email. I meant to write "this isn't the greatest strategy for high *fixed* cost transforms", e.g. a transform that takes 5 minutes to set up and then maybe a microsecond per input.
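To make the problem concrete, here is a rough back-of-the-envelope sketch in plain Python (no Beam or runner APIs). It assumes a simple linear cost model, bundle runtime = fixed cost + per-element cost * bundle size, which is my own simplification; the function name and parameters are hypothetical.

```python
def doubling_probe(fixed_cost_s, per_element_s, target_s):
    """Simulate doubling the bundle size from 1 until a bundle is expected
    to take at least target_s seconds.

    Assumes a hypothetical linear cost model (not taken from any real runner):
        bundle_runtime = fixed_cost_s + per_element_s * bundle_size

    Returns (final_bundle_size, undersized_bundles_run, seconds_spent_probing).
    """
    size = 1
    probes = 0
    elapsed = 0.0
    while fixed_cost_s + per_element_s * size < target_s:
        # This bundle is still too small, but it pays the full fixed cost anyway.
        elapsed += fixed_cost_s + per_element_s * size
        probes += 1
        size *= 2
    return size, probes, elapsed


# 5 minutes of setup, 1 microsecond per input, 10 minute target per bundle.
size, probes, elapsed = doubling_probe(300.0, 1e-6, 600.0)
print(size, probes, round(elapsed))
```

Under these assumptions the runner executes 29 undersized bundles before settling on a bundle size of 2**29, and 29 * 300 s (about 2.4 hours) of that time goes to repeated setup rather than to processing inputs.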
I suppose one solution is to move the responsibility for handling this kind of situation to the user and expect users to use a bundling transform (e.g. BatchElements [1]) followed by a Reshuffle+FlatMap. Is this what other runners expect? Just want to make sure I'm not missing some smart generic bundling strategy that might handle this for users.

[1] https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements

On Thu, Sep 21, 2023 at 7:23 PM Joey Tran <joey.t...@schrodinger.com> wrote:

> Writing a runner and the first strategy for determining bundling size was
> to just start with a bundle size of one and double it until we reach a size
> that we expect to take some targets per-bundle runtime (e.g. maybe 10
> minutes). I realize that this isn't the greatest strategy for high sized
> cost transforms. I'm curious what kind of strategies other runners take?