[jira] [Commented] (DRILL-6307) Handle empty batches in record batch sizer correctly

ASF GitHub Bot (JIRA) Sun, 22 Apr 2018 19:30:49 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16447476#comment-16447476
 ]


ASF GitHub Bot commented on DRILL-6307:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1228#discussion_r183264996
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
    @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
        */
       private int netRowWidth;
       private int netRowWidthCap50;
    +
    +  /**
    +   * actual row size if input is not empty. Otherwise, standard size.
    +   */
    +  private int rowAllocSize;
    --- End diff --
    
    I wonder if this all would be clearer if we handled it at size estimation 
time. If the row count is 0, set up everything using the standard sizes. (Note: 
the whole reason this class exists is that the standard sizes turned out to be 
*very* poor estimators of actual size.)
    
    So, if we have no data, guess the same size as `AllocationHelper`, else use 
real sizes.
    
    And, again the question: under what situation do we want to use the sizer 
if we don't actually have any data? For the first batch, if no data, just throw 
away the empty batch and don't size it. Turn around and get another until we 
receive a non-empty batch.
    
    If we've already received at least one non-empty batch and then receive an 
empty batch, we should just retain the estimates from the non-empty batch, since 
they will be much better than just making up numbers.
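
    A minimal sketch of that policy, assuming a hypothetical `RowWidthEstimator` 
helper (not the real `RecordBatchSizer` API) and a caller-supplied standard row 
width: empty batches never overwrite an estimate, and the standard size is used 
only until the first non-empty batch arrives.

```java
// Hypothetical helper for illustration only; names and signatures are
// assumptions and do not come from the Drill code base.
final class RowWidthEstimator {
    // Fallback width used only while no data has been seen yet.
    private final int stdRowWidth;
    // Width measured from the most recent non-empty batch; -1 means "none yet".
    private int lastObservedRowWidth = -1;

    RowWidthEstimator(int stdRowWidth) {
        this.stdRowWidth = stdRowWidth;
    }

    // Feed one incoming batch: its row count and total data size in bytes.
    void update(int rowCount, long dataBytes) {
        if (rowCount <= 0) {
            return; // empty batch: keep whatever estimate we already have
        }
        // Round up so a partially filled byte still counts toward the width.
        long width = (dataBytes + rowCount - 1) / rowCount;
        lastObservedRowWidth = (int) Math.max(1L, width);
    }

    // Best available estimate: real data if we have any, else the standard size.
    int rowWidth() {
        return lastObservedRowWidth > 0 ? lastObservedRowWidth : stdRowWidth;
    }
}
```

    With a standard width of 50, an initial empty batch leaves the estimate at 
50; a later batch of 100 rows and 1200 bytes moves it to 12, and a subsequent 
empty batch leaves it at 12.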


> Handle empty batches in record batch sizer correctly
> ----------------------------------------------------
>
>                 Key: DRILL-6307
>                 URL: https://issues.apache.org/jira/browse/DRILL-6307
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.13.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.14.0
>
>
> When we get an empty batch, the record batch sizer calculates the row width 
> as zero. In that case, we do not do accounting and memory allocation 
> correctly for outgoing batches. 
> For example, in merge join, for a left outer join, if the right side batch is 
> empty, we still have to include the right side columns as null in the 
> outgoing batch. 
> Say the first batch is empty. Then, for the outgoing batch, we allocate empty 
> vectors with zero capacity. When we read the next batch with data, we will 
> end up going through the realloc loop. If we use a right side row width of 0 
> in the outgoing row width calculation, the number of rows we calculate will 
> be higher, and later, when we get a non-empty batch, we might exceed the 
> memory limits. 
> One possible workaround/solution: allocate memory based on the std size for 
> an empty input batch. Use the allocation width as the width of the batch in 
> the number-of-rows calculation. 
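
The row-count overestimate described in the issue can be illustrated with a 
small arithmetic sketch. The class, method, budget, and widths below are 
made-up values for illustration, not Drill's actual code or defaults.

```java
// Hypothetical illustration; all names and numbers here are assumptions.
final class OutgoingRowCount {
    // Rows that fit in the output batch for a given memory budget and the
    // combined width of the left and right side columns.
    static int rowsThatFit(long budgetBytes, int leftWidth, int rightWidth) {
        int outgoingWidth = Math.max(1, leftWidth + rightWidth);
        return (int) (budgetBytes / outgoingWidth);
    }

    public static void main(String[] args) {
        long budget = 1_048_576L; // 1 MiB output batch budget (assumed)
        int leftWidth = 24;       // measured from a non-empty left batch (assumed)

        // Right side empty, width taken as 0: the row count is too optimistic.
        System.out.println(rowsThatFit(budget, leftWidth, 0));
        // Right side empty, width taken from an assumed standard size of 40.
        System.out.println(rowsThatFit(budget, leftWidth, 40));
    }
}
```

With these assumed numbers, a right side width of 0 yields 43690 rows, while 
an allocation width of 40 yields 16384 rows, so the zero-width estimate would 
plan for well over twice as many rows as the budget can hold once real right 
side data arrives.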



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
