You can consider using dynamic destinations [1] and providing a destination
function [2] that keeps track of the sizes of elements already written to a
given destination, switching to a new destination once a size threshold is
reached. Note that this might have performance implications (due to the
extra computation needed to track element sizes).
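For illustration, here is a rough sketch of what such a destination function
could look like with FileIO.writeDynamic() and ParquetIO.sink(). The
SizeBucketingFn class, the estimateBytes() heuristic, the 512 MB threshold,
and the output path are all invented for this example. Also note that the
byte counter lives in the function instance, so on a distributed runner the
cap is enforced approximately, per worker, rather than globally.

  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.beam.sdk.coders.StringUtf8Coder;
  import org.apache.beam.sdk.io.FileIO;
  import org.apache.beam.sdk.io.parquet.ParquetIO;
  import org.apache.beam.sdk.transforms.SerializableFunction;
  import org.apache.beam.sdk.values.PCollection;

  // Hypothetical destination function: routes records to destinations
  // "part-0", "part-1", ... and moves on to the next destination once the
  // running byte estimate for the current one crosses maxBytes. The counter
  // is plain instance state, so each worker tracks sizes independently.
  class SizeBucketingFn implements SerializableFunction<GenericRecord, String> {
    private final long maxBytes;
    private long bytesInCurrentBucket = 0;
    private int bucket = 0;

    SizeBucketingFn(long maxBytes) {
      this.maxBytes = maxBytes;
    }

    @Override
    public String apply(GenericRecord record) {
      long estimate = estimateBytes(record);
      if (bytesInCurrentBucket + estimate > maxBytes && bytesInCurrentBucket > 0) {
        bucket++;                     // current destination is "full"; open a new one
        bytesInCurrentBucket = 0;
      }
      bytesInCurrentBucket += estimate;
      return "part-" + bucket;
    }

    // Crude stand-in for a real size estimate; replace with something that
    // reflects your schema (e.g. the record's encoded size).
    private long estimateBytes(GenericRecord record) {
      return record.toString().length();
    }
  }

You could then wire it into a write roughly like this:

  static void writeWithSizeCap(PCollection<GenericRecord> records, Schema schema) {
    records.apply(
        FileIO.<String, GenericRecord>writeDynamic()
            .by(new SizeBucketingFn(512 * 1024 * 1024L)) // ~512 MB per destination
            .via(ParquetIO.sink(schema))
            .to("gs://my-bucket/output/")                // hypothetical path
            .withDestinationCoder(StringUtf8Coder.of())
            .withNaming(dest -> FileIO.Write.defaultNaming(dest, ".parquet")));
  }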

You are correct about the default behaviour: the number of output shards is
determined by the runner, based on the parallelism of the corresponding step.
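
If a fixed number of output files would be enough (rather than a strict size
cap), you can also override the runner-chosen sharding explicitly, for
example (path and shard count invented):

  static void writeFixedShards(PCollection<GenericRecord> records, Schema schema) {
    records.apply(
        FileIO.<GenericRecord>write()
            .via(ParquetIO.sink(schema))
            .to("gs://my-bucket/output/") // hypothetical path
            .withNumShards(10));          // fixed shard count instead of runner default
  }

Keep in mind that pinning the shard count can limit the parallelism of the
write step.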

Thanks,
Cham

[1] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L222
[2] https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileIO.java#L988

On Wed, Jul 29, 2020 at 12:28 PM [email protected] <[email protected]>
wrote:

> We would like to use ParquetIO but limit individual files written out to a
> maximum size. We don't see any easy way to do this, and it seems like the
> default behavior is to split based on parallelism? Does anyone have any
> guidance on this?
>
