[GitHub] [beam] nehsyc commented on a change in pull request #14164: [BEAM-11934] Add runner determined sharding option for unbounded data to WriteFiles (Java)

GitBox Fri, 12 Mar 2021 11:04:32 -0800


nehsyc commented on a change in pull request #14164:
URL: https://github.com/apache/beam/pull/14164#discussion_r593385604




##########
File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java
##########
@@ -301,6 +321,21 @@
     return toBuilder().setMaxNumWritersPerBundle(-1).build();
   }
 
+  /**
+   * Returns a new {@link WriteFiles} that will write to the current {@link FileBasedSink} with
+   * runner-determined sharding for unbounded data specifically. Currently manual sharding is
+   * required for writing unbounded data with a fixed number of shards or a predefined sharding
+   * function. This option allows the runners to get around that requirement and perform automatic
+   * sharding.
+   *
+   * <p>Intended to only be used by runners. Users should use {@link

Review comment:
   How does a runner using the FnAPI typically override a non-standard transform? Or does it always require the transform to be added to the FnAPI before a runner can do something different?

   This is how I plan to set this in the Dataflow runner: https://github.com/apache/beam/pull/14164/commits/3382d706ff62518fa3c8f450faa5fafc2d534d5c.

   The main reason I added this is that `WriteFiles` already has a `withRunnerDeterminedSharding` method, but it is disabled for streaming. Removing that condition to allow `withRunnerDeterminedSharding` for streaming would enable the new implementation for every runner, and for runners that don't support dynamic sharding the default implementation might perform badly. Is there a better way to let runners choose whether they support this option?
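
   To make the opt-in idea concrete, here is a rough sketch in plain Java. It is not Beam code: `ShardingConfig`, `RunnerOverride`, and every field and method name below are hypothetical stand-ins, and the predefined sharding function is left out for brevity. It only illustrates the intent that the check for unbounded input stays in place and a runner flips a separate flag when it knows it supports dynamic sharding.

```java
// Hypothetical stand-in for a write transform's sharding settings (not Beam's WriteFiles).
final class ShardingConfig {
  final boolean unboundedInput;
  final Integer fixedNumShards;            // null when the user gave no fixed shard count
  final boolean autoShardingForUnbounded;  // opt-in flag meant to be set by runners only

  ShardingConfig(boolean unboundedInput, Integer fixedNumShards, boolean autoShardingForUnbounded) {
    this.unboundedInput = unboundedInput;
    this.fixedNumShards = fixedNumShards;
    this.autoShardingForUnbounded = autoShardingForUnbounded;
  }

  // Returns a copy with the runner opt-in enabled; the original is left unchanged.
  ShardingConfig withAutoShardingForUnbounded() {
    return new ShardingConfig(unboundedInput, fixedNumShards, true);
  }

  // Unbounded input still needs manual sharding unless a runner has opted in.
  void validate() {
    if (unboundedInput && fixedNumShards == null && !autoShardingForUnbounded) {
      throw new IllegalArgumentException(
          "Unbounded input requires manual sharding unless the runner opts in");
    }
  }
}

// Hypothetical runner-side override: only a runner that supports dynamic sharding sets the flag.
final class RunnerOverride {
  static ShardingConfig apply(ShardingConfig config, boolean runnerSupportsDynamicSharding) {
    if (runnerSupportsDynamicSharding && config.unboundedInput && config.fixedNumShards == null) {
      return config.withAutoShardingForUnbounded();
    }
    return config;
  }
}
```

   With this shape, a runner that never applies the override keeps today's behavior (validation fails for unbounded input without manual sharding), so the new path is not enabled for every runner by default.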




