scwhittle commented on code in PR #35110:
URL: https://github.com/apache/beam/pull/35110#discussion_r2120811649


##########
sdks/python/apache_beam/io/textio.py:
##########
@@ -1056,6 +1056,8 @@ def WriteToJson(
       kwargs['num_shards'] = num_shards
     if file_naming is not None:
       kwargs['file_naming'] = file_naming
+    if 'max_writers_per_bundle' not in kwargs:
+      kwargs['max_writers_per_bundle'] = 0

Review Comment:
   I believe that for batch pipelines the existing default makes sense (as the 
batch bundles are large to process this will avoid shuffling data and still 
generate large files).  It may also make sense to keep the default for other 
runners.
   
   So I think that we might want to modify the dataflow runner so that it 
changes the default to zero if it was otherwise unspecified.  I think that 
would be done by modifying by visiting the graph something like 
   
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L334
  
   
   @tvalentyn for thoughts



##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -668,6 +668,9 @@ def expand(self, pcoll):
     return pcoll | fileio.WriteToFiles(
         path=dir,
         shards=self.kwargs.pop('num_shards', None),
+        max_writers_per_bundle=self.kwargs.pop(
+            'max_writers_per_bundle',
+            fileio.WriteToFiles.MAX_NUM_WRITERS_PER_BUNDLE),

Review Comment:
   passing None seems preferrable? Then I think WriteToFiles would use it's 
default without respecifying it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to