scwhittle commented on code in PR #35110:
URL: https://github.com/apache/beam/pull/35110#discussion_r2120811649
##########
sdks/python/apache_beam/io/textio.py:
##########
@@ -1056,6 +1056,8 @@ def WriteToJson(
kwargs['num_shards'] = num_shards
if file_naming is not None:
kwargs['file_naming'] = file_naming
+ if 'max_writers_per_bundle' not in kwargs:
+ kwargs['max_writers_per_bundle'] = 0
Review Comment:
I believe that for batch pipelines the existing default makes sense (as the
batch bundles are large to process this will avoid shuffling data and still
generate large files). It may also make sense to keep the default for other
runners.
So I think that we might want to modify the dataflow runner so that it
changes the default to zero if it was otherwise unspecified. I think that
would be done by modifying by visiting the graph something like
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L334
@tvalentyn for thoughts
##########
sdks/python/apache_beam/dataframe/io.py:
##########
@@ -668,6 +668,9 @@ def expand(self, pcoll):
return pcoll | fileio.WriteToFiles(
path=dir,
shards=self.kwargs.pop('num_shards', None),
+ max_writers_per_bundle=self.kwargs.pop(
+ 'max_writers_per_bundle',
+ fileio.WriteToFiles.MAX_NUM_WRITERS_PER_BUNDLE),
Review Comment:
passing None seems preferrable? Then I think WriteToFiles would use it's
default without respecifying it.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]