[ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329835#comment-16329835 ]

ASF GitHub Bot commented on DRILL-6071:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1091#discussion_r162225383
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractBase.java
 ---
    @@ -29,9 +28,12 @@
     
       public static long INIT_ALLOCATION = 1_000_000L;
       public static long MAX_ALLOCATION = 10_000_000_000L;
    +  // Default output batch size, 512MB
    +  public static long OUTPUT_BATCH_SIZE = 512 * 1024 * 1024L;
    --- End diff --
    
    Too large. The sort and hash agg operators often receive just 20-40 MB on a 
large cluster. (That is, itself, an issue, but one that has proven very 
difficult to resolve.) So, the output batch size must be no larger than 1/3 of 
this size (for sort). Some team discussion is probably required to agree on a 
good number, and on the work needed to ensure that sort, hash agg, and hash 
join are given sufficient memory for the selected batch size.
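    The constraint above can be sketched as a small piece of arithmetic. This is an illustrative sketch, not Drill code: the class name, the divisor of 3 (sort buffering roughly three batches' worth of data), and the 30 MB grant are assumptions for the example.

    ```java
    // Illustrative sketch of the reviewer's sizing rule: if sort may need
    // roughly three batches' worth of memory at once, the output batch must
    // be no more than one third of the operator's memory grant. The names
    // and numbers here are assumptions, not Drill's actual API.
    public class BatchSizeBudget {
      // Assumed worst case: sort holds ~3 batches of data simultaneously.
      static final int SORT_BUFFERED_BATCHES = 3;

      public static long maxOutputBatchBytes(long operatorMemoryBytes) {
        return operatorMemoryBytes / SORT_BUFFERED_BATCHES;
      }

      public static void main(String[] args) {
        long memory = 30L * 1024 * 1024; // a small per-operator grant (~30 MB)
        // A 30 MB grant caps the output batch at 10 MB under this rule.
        System.out.println(maxOutputBatchBytes(memory));
      }
    }
    ```

    Under this rule, a 512 MB default output batch would far exceed what sort can afford on a memory-constrained cluster.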


> Limit batch size for flatten operator
> -------------------------------------
>
>                 Key: DRILL-6071
>                 URL: https://issues.apache.org/jira/browse/DRILL-6071
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.12.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Flatten currently uses an adaptive algorithm to control the outgoing batch 
> size. While processing the input batch, it adjusts the number of records in 
> the outgoing batch based on memory usage so far. Once memory usage exceeds 
> the configured limit for a batch, the algorithm becomes more proactive and 
> adjusts the limit halfway through and at the end of every batch. All this 
> periodic checking of memory usage is unnecessary overhead and hurts 
> performance. Also, it detects an oversized batch only after the fact.
> Instead, determine up front how many rows the outgoing batch should hold, 
> based on the incoming batch. The way to do that is to estimate the average 
> row size of the outgoing batch and, from that and a given amount of memory, 
> derive how many rows fit. Value vectors provide the information needed to 
> compute this.
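The up-front sizing described above can be sketched as follows. This is a hypothetical illustration, not Drill's implementation: the class and method names, and the use of total batch bytes and row count (both obtainable from value vectors) plus an average array cardinality, are assumptions made for the example.

```java
// Hypothetical sketch of the proposed approach: derive the outgoing
// row limit from the incoming batch's statistics before processing,
// instead of checking memory usage periodically while filling the batch.
// Names and parameters are illustrative, not Drill's actual API.
public class FlattenBatchSizer {

  /**
   * Estimate how many rows the outgoing batch should hold.
   *
   * @param incomingBatchBytes  total bytes held by the incoming batch's
   *                            value vectors
   * @param incomingRowCount    rows in the incoming batch
   * @param avgArrayCardinality average number of elements the flattened
   *                            column produces per incoming row
   * @param outputBatchBytes    configured outgoing batch memory budget
   * @return row limit for the outgoing batch, at least 1
   */
  public static int rowsPerOutputBatch(long incomingBatchBytes,
                                       int incomingRowCount,
                                       double avgArrayCardinality,
                                       long outputBatchBytes) {
    if (incomingRowCount <= 0) {
      return 1; // nothing to measure; emit at least one row
    }
    // Average size of one incoming row.
    double avgIncomingRowSize = (double) incomingBatchBytes / incomingRowCount;
    // Flatten emits one outgoing row per array element; approximate the
    // outgoing row size by spreading the incoming row across its elements.
    double avgOutgoingRowSize =
        avgIncomingRowSize / Math.max(1.0, avgArrayCardinality);
    int rows = (int) (outputBatchBytes / Math.max(1.0, avgOutgoingRowSize));
    return Math.max(1, rows);
  }

  public static void main(String[] args) {
    // Example: 16 MB incoming batch of 4096 rows, arrays of ~10 elements,
    // and an 8 MB outgoing budget.
    int rows = rowsPerOutputBatch(16L * 1024 * 1024, 4096, 10.0,
                                  8L * 1024 * 1024);
    System.out.println("Outgoing row limit: " + rows);
  }
}
```

Because the limit is computed once per incoming batch from value-vector statistics, the operator avoids the mid-batch memory checks that the adaptive scheme required.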



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
