nehsyc commented on a change in pull request #14164:
URL: https://github.com/apache/beam/pull/14164#discussion_r593385604
##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java
##########
@@ -301,6 +321,21 @@
return toBuilder().setMaxNumWritersPerBundle(-1).build();
}
+ /**
+ * Returns a new {@link WriteFiles} that will write to the current {@link FileBasedSink} with
+ * runner-determined sharding for unbounded data specifically. Currently manual sharding is
+ * required for writing unbounded data with a fixed number of shards or a predefined sharding
+ * function. This option allows the runners to get around that requirement and perform automatic
+ * sharding.
+ *
+ * <p>Intended to only be used by runners. Users should use {@link
Review comment:
How does a runner using FnAPI typically override a non-standard transform?
Or does it always require the transform to be added to FnAPI before a runner
can do something different?
This is how I plan to set this in the Dataflow runner:
https://github.com/apache/beam/pull/14164/commits/3382d706ff62518fa3c8f450faa5fafc2d534d5c.
The main reason I added this is that `WriteFiles` already has a
`withRunnerDeterminedSharding` method, but it is disabled for streaming.
Removing that condition to allow `withRunnerDeterminedSharding` for streaming
would enable the new implementation for every runner, and for runners that
don't support dynamic sharding the default implementation might perform badly.
Is there a better way to let runners choose whether they support this option?
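
To make the trade-off concrete, here is a rough sketch of the opt-in behavior I
have in mind. It is a simplified, hypothetical model and not the actual
`WriteFiles` code: `WriteConfig`, `withRunnerOptIn`, and `resolveSharding` are
illustrative names only, and the rejection of unbounded input without an
explicit shard count is an assumption about the existing check.

```java
/** Hypothetical model of a runner-only opt-in; none of this is Beam API. */
public class ShardingOptInSketch {

  enum Sharding { FIXED, RUNNER_DETERMINED }

  /** Simplified stand-in for the sharding-related state of a write transform. */
  static final class WriteConfig {
    final boolean unboundedInput;
    final int numShards;          // 0 means the user did not set a shard count
    final boolean runnerOptedIn;  // meant to be set only by a runner override

    WriteConfig(boolean unboundedInput, int numShards, boolean runnerOptedIn) {
      this.unboundedInput = unboundedInput;
      this.numShards = numShards;
      this.runnerOptedIn = runnerOptedIn;
    }

    /** Returns a copy with the runner-only opt-in enabled. */
    WriteConfig withRunnerOptIn() {
      return new WriteConfig(unboundedInput, numShards, true);
    }

    /** Decides which sharding strategy applies to this configuration. */
    Sharding resolveSharding() {
      if (numShards > 0) {
        return Sharding.FIXED; // an explicit shard count always wins
      }
      if (!unboundedInput || runnerOptedIn) {
        return Sharding.RUNNER_DETERMINED;
      }
      // Assumed behavior: streaming without opt-in keeps requiring manual sharding.
      throw new IllegalArgumentException(
          "Unbounded input needs an explicit shard count unless the runner opts in");
    }
  }

  public static void main(String[] args) {
    WriteConfig streaming = new WriteConfig(true, 0, false);
    // A runner that supports dynamic sharding would flip the flag in its override.
    System.out.println(streaming.withRunnerOptIn().resolveSharding());
  }
}
```

With a separate flag like this, runners that never set it keep the current
behavior for streaming, which is the property I want to preserve.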