[ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329835#comment-16329835 ]
ASF GitHub Bot commented on DRILL-6071:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1091#discussion_r162225383

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractBase.java ---
    @@ -29,9 +28,12 @@
       public static long INIT_ALLOCATION = 1_000_000L;
       public static long MAX_ALLOCATION = 10_000_000_000L;
    +  // Default output batch size, 512MB
    +  public static long OUTPUT_BATCH_SIZE = 512 * 1024 * 1024L;
    --- End diff --

    Too large. The sort and hash aggregate operators often receive just
    20-40 MB on a large cluster. (That is, itself, an issue, but one that
    has proven very difficult to resolve.) So the output batch size must
    be no larger than 1/3 of the operator's memory allocation (for sort).
    Some team discussion is probably required to agree on a good number,
    and on the work needed to ensure that sort, hash aggregate, and hash
    join are given sufficient memory for the selected batch size.

> Limit batch size for flatten operator
> -------------------------------------
>
>                 Key: DRILL-6071
>                 URL: https://issues.apache.org/jira/browse/DRILL-6071
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.12.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Flatten currently uses an adaptive algorithm to control the outgoing
> batch size. While processing the input batch, it adjusts the number of
> records in the outgoing batch based on memory usage so far. Once memory
> usage exceeds the configured limit for a batch, the algorithm becomes
> more proactive and adjusts the limit halfway through and at the end of
> every batch. All this periodic checking of memory usage is unnecessary
> overhead and impacts performance; also, we learn that the limit was
> exceeded only after the fact.
>
> Instead, figure out how many rows should go into the outgoing batch
> from the incoming batch.
> The way to do that is to compute the average row size of the outgoing
> batch and, from that, how many rows fit in a given amount of memory.
> Value vectors provide the information needed to figure this out.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
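The review comment's sizing concern can be sketched as follows. This is an illustrative example, not Drill source: the 1/3-of-budget rule for sort and the 20-40 MB figure come from the comment above, while the class and method names are hypothetical.

```java
// Illustrative only, not Drill code: relates an operator's memory
// budget to the largest output batch it can safely accept, using the
// reviewer's rule that sort tolerates a batch no larger than 1/3 of
// its memory allocation.
public class BatchBudget {

    // Largest batch (in bytes) a sort with the given budget can accept.
    static long maxBatchBytes(long operatorMemoryBytes) {
        return operatorMemoryBytes / 3;
    }

    public static void main(String[] args) {
        long proposedDefault = 512L * 1024 * 1024; // 512 MB from the patch
        long sortBudget = 30L * 1024 * 1024;       // 20-40 MB typical per the comment
        long safeBatch = maxBatchBytes(sortBudget);
        System.out.println(safeBatch);                   // safe batch in bytes (10 MB)
        System.out.println(proposedDefault / safeBatch); // how far the 512 MB default overshoots
    }
}
```

With a 30 MB budget, the safe batch is about 10 MB, roughly 50x smaller than the proposed 512 MB default, which is the reviewer's point.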
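The sizing the description proposes — derive the outgoing row count from the average row size up front, rather than checking memory usage mid-batch — can be sketched like this. A minimal, hypothetical example; it is not Drill's actual sizing API, though the 64K-rows-per-batch cap is a real Drill value-vector limit.

```java
// Illustrative only, not Drill's actual API: pick the outgoing row
// count for flatten from the average row size observed in the value
// vectors, so the batch fits a memory budget without mid-batch checks.
public class FlattenBatchSizer {

    // Drill value vectors hold at most 64K rows per batch; used as a hard cap.
    static final int MAX_ROWS = 64 * 1024;

    static int rowsForBudget(long batchBudgetBytes, long avgRowSizeBytes) {
        if (avgRowSizeBytes <= 0) {
            return MAX_ROWS; // no size estimate yet; fall back to the row cap
        }
        long rows = batchBudgetBytes / avgRowSizeBytes;
        return (int) Math.max(1, Math.min(rows, MAX_ROWS));
    }

    public static void main(String[] args) {
        // 16 MB budget with 1 KB average rows -> 16384 rows per batch.
        System.out.println(rowsForBudget(16L * 1024 * 1024, 1024));
        // Tiny rows hit the 64K vector limit before the memory limit.
        System.out.println(rowsForBudget(16L * 1024 * 1024, 1));
    }
}
```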