[jira] [Commented] (DRILL-6307) Handle empty batches in record batch sizer correctly

ASF GitHub Bot (JIRA) Wed, 25 Apr 2018 20:56:49 -0700

    [ 
https://issues.apache.org/jira/browse/DRILL-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453462#comment-16453462
 ]


ASF GitHub Bot commented on DRILL-6307:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1228#discussion_r184264961
  
    --- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
    @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
        */
       private int netRowWidth;
       private int netRowWidthCap50;
    +
    +  /**
    +   * actual row size if input is not empty. Otherwise, standard size.
    +   */
    +  private int rowAllocSize;
    --- End diff --
    
    I see. In this case, however, arrays (repeated values) will be empty. If we 
have 10 such rows, there is no reason to have 50 "inner" values. Also, for 
VarChar, no values will be stored; all columns will be null. (If we are 
handling non-null columns, then the non-null VarChar will be an empty string.)
    
    So, we probably need a bit of a special case: prepare data for a run of 
null rows (with arrays and VarChar of length 0) vs. take our best guess with no 
knowledge at all about lengths (which may be non-empty.)
    
    Probably not a huge issue if you only need to handle a single row. But, 
creating a batch with only one row will cause all kinds of performance issues 
downstream. (I found that out the hard way when a bug in sort produced a series 
of one-row batches...)
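
A minimal sketch of the two cases described above, for illustration only. The class, enum, and constant names are hypothetical and are not part of RecordBatchSizer. The standard sizes (50 bytes per VarChar, 5 elements per array, chosen to match the "10 rows, 50 inner values" example) and the 4-byte offset width are assumptions.

```java
// Hypothetical sketch: sizing variable-width data for a known run of null
// rows versus a best guess made with no knowledge of the lengths.
class NullRunAllocSketch {
    // Assumed defaults for illustration, not taken from the Drill source.
    static final int ASSUMED_STD_VARCHAR_BYTES = 50;
    static final int ASSUMED_STD_ARRAY_ELEMENTS = 5;
    static final int ASSUMED_OFFSET_WIDTH = 4;

    enum Knowledge {
        NULL_RUN, // rows are known to be null: arrays and VarChars have length 0
        UNKNOWN   // nothing is known about lengths: fall back to standard sizes
    }

    // Bytes to reserve in the VarChar data buffer.
    static int varCharDataBytes(int rowCount, Knowledge knowledge) {
        int perRow = (knowledge == Knowledge.NULL_RUN) ? 0 : ASSUMED_STD_VARCHAR_BYTES;
        return rowCount * perRow;
    }

    // Number of "inner" values to reserve for a repeated (array) column.
    static int innerValueCount(int rowCount, Knowledge knowledge) {
        int perRow = (knowledge == Knowledge.NULL_RUN) ? 0 : ASSUMED_STD_ARRAY_ELEMENTS;
        return rowCount * perRow;
    }

    // Offsets are needed in both cases: one entry per row plus a leading entry.
    static int offsetBytes(int rowCount) {
        return (rowCount + 1) * ASSUMED_OFFSET_WIDTH;
    }

    public static void main(String[] args) {
        int rows = 10;
        System.out.println("null run inner values: " + innerValueCount(rows, Knowledge.NULL_RUN));
        System.out.println("unknown inner values:  " + innerValueCount(rows, Knowledge.UNKNOWN));
        System.out.println("offset bytes:          " + offsetBytes(rows));
    }
}
```

Under these assumptions, only the offsets need space for a run of null rows, while the unknown case still reserves the standard data sizes.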


> Handle empty batches in record batch sizer correctly
> ----------------------------------------------------
>
>                 Key: DRILL-6307
>                 URL: https://issues.apache.org/jira/browse/DRILL-6307
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.13.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.14.0
>
>
> When we get an empty batch, the record batch sizer calculates the row width 
> as zero. In that case, we do not do accounting and memory allocation 
> correctly for outgoing batches. 
> For example, in merge join, for a left outer join, if the right-side batch is 
> empty, we still have to include the right-side columns as null in the 
> outgoing batch. 
> Say the first batch is empty. Then, for the outgoing batch, we allocate empty 
> vectors with zero capacity. When we read the next batch with data, we will 
> end up going through the realloc loop. If we use a right-side row width of 0 
> in the outgoing row width calculation, the number of rows we calculate will 
> be higher, and later, when we get a non-empty batch, we might exceed the 
> memory limits. 
> One possible workaround/solution: allocate memory based on the standard size 
> for an empty input batch, and use the allocation width as the width of the 
> batch in the number-of-rows calculation. 
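
The effect described above can be sketched with a little arithmetic. This is an illustration under stated assumptions, not the Drill implementation: the class and method names are hypothetical, and the 65536-row cap, the 1 MiB memory limit, and the 64-byte widths are assumed values chosen to make the numbers easy to follow.

```java
// Hypothetical sketch: how a zero row width for an empty input inflates the
// outgoing row count, and how using an allocation width avoids it.
class OutputRowCountSketch {
    // Assumed upper bound on rows per batch, for illustration.
    static final int ASSUMED_MAX_ROWS = 65536;

    // Mirrors the idea of rowAllocSize in the diff: the actual row width if
    // the input is not empty, otherwise a standard (estimated) width.
    static int rowAllocWidth(int rowCount, int netRowWidth, int stdRowWidth) {
        return rowCount == 0 ? stdRowWidth : netRowWidth;
    }

    // Rows that fit in the outgoing batch for the given memory limit.
    static int outputRowCount(long memoryLimit, int leftWidth, int rightWidth) {
        int rowWidth = leftWidth + rightWidth;
        if (rowWidth <= 0) {
            return ASSUMED_MAX_ROWS;
        }
        long rows = memoryLimit / rowWidth;
        return (int) Math.max(1, Math.min(rows, ASSUMED_MAX_ROWS));
    }

    public static void main(String[] args) {
        long memoryLimit = 1L << 20; // 1 MiB, assumed
        int leftWidth = 64;          // assumed left-side row width
        int stdRightWidth = 64;      // assumed standard width for the right side

        // Right side is empty: the net width is 0, so too many rows are planned.
        int naive = outputRowCount(memoryLimit, leftWidth, 0);

        // Use the allocation width for the empty right side instead.
        int rightAlloc = rowAllocWidth(0, 0, stdRightWidth);
        int adjusted = outputRowCount(memoryLimit, leftWidth, rightAlloc);

        System.out.println("rows with zero right width:  " + naive);
        System.out.println("rows with allocation width:  " + adjusted);
    }
}
```

With these numbers the zero-width calculation plans twice as many rows as the allocation-width calculation, which is the overshoot that can exceed the memory limit once a non-empty right-side batch arrives.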



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
