So when Beam writes out to a specific number of shards, as I understand it, it does a few things (sketched below):

- Assigns a shard key to each record (which caps parallelism at the number of shards)
- Shuffles and groups by the shard key to colocate all records for a shard
- Writes out each shard file within a single DoFn invocation per key
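
In pipeline form that's roughly the following (a minimal sketch of my reading of the current behavior; assign_shard, write_shard, and NUM_SHARDS are hypothetical names, not Beam's internals):

    import apache_beam as beam

    NUM_SHARDS = 10  # hypothetical fixed shard count

    def assign_shard(record):
        # Step 1: assign a shard key to each record.
        return (hash(record) % NUM_SHARDS, record)

    def write_shard(shard_and_records):
        # Step 3: one invocation per key writes that whole shard.
        shard, records = shard_and_records
        print('shard %d: %d records' % (shard, len(list(records))))

    with beam.Pipeline() as p:
        (p
         | beam.Create(['a', 'b', 'c'])
         | beam.Map(assign_shard)   # step 1: assign shard key
         | beam.GroupByKey()        # step 2: shuffle + group by shard key
         | beam.Map(write_shard))   # step 3: write out each shard file
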

Thinking about this, I believe we might be able to eliminate the GroupByKey and write each file's records with only a DoFn once the shard key is assigned.

As long as the shard key is the actual key of the PCollection, could we use a state variable so that all elements with the same key share state with each other?
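
Something like this stateful DoFn is what I have in mind (just a sketch; KeyedBufferFn and BATCH_SIZE are made-up names, and a real version would also need a timer or similar to flush whatever is left below the threshold):

    import apache_beam as beam
    from apache_beam.coders import StrUtf8Coder, VarIntCoder
    from apache_beam.transforms.userstate import (
        BagStateSpec, CombiningValueStateSpec)

    class KeyedBufferFn(beam.DoFn):
        # State cells are scoped per key (and window), so every element
        # with the same shard key reads and writes the same buffer,
        # without any GroupByKey.
        BUFFER = BagStateSpec('buffer', StrUtf8Coder())
        COUNT = CombiningValueStateSpec('count', VarIntCoder(), sum)

        BATCH_SIZE = 500  # hypothetical flush threshold

        def process(self, element,
                    buffer=beam.DoFn.StateParam(BUFFER),
                    count=beam.DoFn.StateParam(COUNT)):
            shard, record = element
            buffer.add(record)
            count.add(1)
            if count.read() >= self.BATCH_SIZE:
                # Emit the accumulated batch for this shard key.
                yield shard, list(buffer.read())
                buffer.clear()
                count.clear()
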

In a DoFn, can we use setup() to hold a Map of the files being written to across bundles on that instance, and then close all files in the map on teardown()?
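
Concretely, something like the sketch below (ShardedWriteFn and the output prefix are hypothetical). One caveat I'm aware of: Beam documents teardown() as best-effort, so a robust version would probably also flush in finish_bundle():

    import apache_beam as beam
    from apache_beam.io.filesystems import FileSystems

    class ShardedWriteFn(beam.DoFn):
        def __init__(self, output_prefix):
            # Hypothetical prefix, e.g. 'gs://my-bucket/out/shard'.
            self._output_prefix = output_prefix

        def setup(self):
            # One map of open writers per DoFn instance, shared
            # across all bundles that instance processes.
            self._writers = {}

        def process(self, element):
            shard, record = element
            writer = self._writers.get(shard)
            if writer is None:
                writer = FileSystems.create(
                    '%s-%05d' % (self._output_prefix, shard))
                self._writers[shard] = writer
            writer.write(record.encode('utf-8') + b'\n')

        def teardown(self):
            # Close every file this instance opened.
            for writer in self._writers.values():
                writer.close()
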

If this works, does it remove the need for a shuffle and allow a DoFn to safely write out in append mode to a file, a batch held in state, etc.?

It doesn't really decrease parallelism after the key is assigned, since the runner can still parallelize over each key within its state window, which is the same level of parallelism we achieve by doing a GroupByKey and a for loop over the result. So performance shouldn't be impacted if this holds true.

It's kind of like combining the shuffle and the data write into the same step?

This does, however, offer a significant cost reduction: it eliminates a compute-based shuffle, and also eliminates a Dataflow Shuffle service call if the shuffle service is enabled.

Thoughts?

Thanks,
Shannon Duncan
