Depends entirely on the use case really.

Currently, the strategy for the Prism runner I'm working on for the Go SDK is
"bundles are the size of the ready data", which does reasonably well at keeping
latency low for downstream transforms. The runner will also tell the SDK to
split a bundle if an element takes longer than 200 milliseconds to process.

Dataflow Batch jobs will generally start with extremely large bundle sizes and
then use channel splitting and sub-element splitting to divide work further
than the initial splits. This is basically the opposite of your initial
strategy.
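
Here's a hedged sketch of that "start big, split the remainder dynamically"
idea, heavily simplified from what Dataflow actually does (real channel and
sub-element splitting go through the runner/SDK split protocol; the types
below are invented for illustration).

  package main

  import "fmt"

  // workRange is a half-open range [start, end) of element indices that one
  // worker is currently responsible for.
  type workRange struct{ start, end int }

  // splitRemainder splits the unprocessed part of r (everything at or after
  // `done`) roughly in half, returning the retained primary range and the
  // residual range to hand to another worker. ok is false if there is
  // nothing left worth splitting.
  func splitRemainder(r workRange, done int) (primary, residual workRange, ok bool) {
  	if r.end-done < 2 {
  		return r, workRange{}, false
  	}
  	mid := done + (r.end-done)/2
  	return workRange{r.start, mid}, workRange{mid, r.end}, true
  }

  func main() {
  	r := workRange{0, 1_000_000}
  	// Worker has finished the first 250k elements; split what remains.
  	primary, residual, _ := splitRemainder(r, 250_000)
  	fmt.Println(primary, residual) // {0 625000} {625000 1000000}
  }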

Dataflow streaming tends to do hundreds of single-element bundles per worker to
reduce processing latency.

I can't speak to the Flink and Spark strategies.

Robert Burke
Beam Go Busybody

On Thu, Sep 21, 2023, 4:24 PM Joey Tran <joey.t...@schrodinger.com> wrote:

> I'm writing a runner, and my first strategy for determining bundle size was
> to just start with a bundle size of one and double it until we reach a size
> that we expect to take some target per-bundle runtime (e.g. maybe 10
> minutes). I realize that this isn't the greatest strategy for transforms
> with a high per-element cost. I'm curious what kinds of strategies other
> runners take?
>
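
For reference, a minimal sketch of the doubling heuristic described in the
quoted message above. The function names and the 10 minute target are
illustrative assumptions, not Beam APIs.

  package main

  import (
  	"fmt"
  	"time"
  )

  // targetBundleRuntime is the per-bundle runtime the sizing loop aims for
  // (illustrative; the question suggested something like 10 minutes).
  const targetBundleRuntime = 10 * time.Minute

  // nextBundleSize doubles the bundle size until the measured runtime of the
  // last bundle reaches the target, then holds the size steady.
  func nextBundleSize(lastSize int, lastRuntime time.Duration) int {
  	if lastRuntime >= targetBundleRuntime {
  		return lastSize
  	}
  	return lastSize * 2
  }

  func main() {
  	size := 1
  	perElement := 30 * time.Second // pretend each element takes ~30s
  	for i := 0; i < 6; i++ {
  		runtime := time.Duration(size) * perElement
  		fmt.Printf("bundle of %d elements took %v\n", size, runtime)
  		size = nextBundleSize(size, runtime)
  	}
  }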
