[ https://issues.apache.org/jira/browse/BEAM-10075?focusedWorklogId=437394&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-437394 ]
ASF GitHub Bot logged work on BEAM-10075:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 27/May/20 03:43
Start Date: 27/May/20 03:43
Worklog Time Spent: 10m
Work Description: steveniemitz commented on a change in pull request #11811:
URL: https://github.com/apache/beam/pull/11811#discussion_r430834337
##########
File path: runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/options/DataflowPipelineDebugOptions.java
##########
@@ -177,6 +177,20 @@ public Dataflow create(PipelineOptions options) {
void setNumberOfWorkerHarnessThreads(int value);
+ /**
+ * Size (in MB) of each grouping table used to pre-combine elements. If unset, defaults to 100 MB.
+ *
+ * <p>CAUTION: If set too large, workers may run into OOM conditions more easily, each worker may
+ * have many grouping tables in-memory concurrently.
+ */
+ @Description(
+ "The size (in MB) of the grouping tables used to pre-combine elements before "
+ + "shuffling. Larger values may reduce the amount of data shuffled.")
+ @Default.Long(100)
+ Long getGroupingTableMaxSizeMb();
+
+ void setGroupingTableMaxSizeMb(Long value);
Review comment:
👍 done!
Issue Time Tracking
-------------------
Worklog Id: (was: 437394)
Time Spent: 20m (was: 10m)
> Allow users to tune the grouping table size in batch dataflow pipelines
> -----------------------------------------------------------------------
>
> Key: BEAM-10075
> URL: https://issues.apache.org/jira/browse/BEAM-10075
> Project: Beam
> Issue Type: Improvement
> Components: runner-dataflow
> Reporter: Steve Niemitz
> Assignee: Steve Niemitz
> Priority: P2
> Fix For: 2.23.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The dataflow worker hard-codes the grouping table size to 100 MB. We should
> allow users to specify this as a pipeline parameter.
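
As a rough illustration of the change described above, the sketch below shows how a worker might turn the optional megabyte setting into a byte limit, falling back to the 100 MB default when nothing is set. The class name, the helper `resolveMaxSizeBytes`, and the choice of 1 MB = 1024 * 1024 bytes are assumptions made for this example; they are not taken from the Beam code base or from the pull request.

```java
public final class GroupingTableSizeSketch {
    // Default stated in the issue description and in @Default.Long(100).
    static final long DEFAULT_MAX_SIZE_MB = 100L;

    // Assumption: binary megabytes. The real worker may use a different factor.
    static final long BYTES_PER_MB = 1024L * 1024L;

    private GroupingTableSizeSketch() {}

    /**
     * Hypothetical helper (not a Beam API): converts the value a user would
     * supply through getGroupingTableMaxSizeMb() into a byte limit.
     *
     * @param configuredMb the configured size in MB, or null if unset
     * @return the grouping table limit in bytes
     */
    static long resolveMaxSizeBytes(Long configuredMb) {
        long mb = (configuredMb == null) ? DEFAULT_MAX_SIZE_MB : configuredMb;
        if (mb <= 0) {
            throw new IllegalArgumentException(
                "Grouping table size must be positive, got " + mb + " MB");
        }
        // multiplyExact throws ArithmeticException instead of overflowing silently.
        return Math.multiplyExact(mb, BYTES_PER_MB);
    }

    public static void main(String[] args) {
        System.out.println("unset  -> " + resolveMaxSizeBytes(null) + " bytes");
        System.out.println("250 MB -> " + resolveMaxSizeBytes(250L) + " bytes");
    }
}
```

Since Beam derives command-line flag names from the option getters, the new setting would presumably be passed as `--groupingTableMaxSizeMb=250`. As the Javadoc in the diff cautions, each worker may hold many grouping tables in memory at once, so larger values increase the risk of OOM conditions.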