[ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341663#comment-16341663 ]

ASF GitHub Bot commented on DRILL-6071:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1091#discussion_r164233329
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/unit/TestOutputBatchSize.java ---
    @@ -0,0 +1,498 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + * <p/>
    + * http://www.apache.org/licenses/LICENSE-2.0
    + * <p/>
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.drill.exec.physical.unit;
    +
    +import com.google.common.collect.Lists;
    +import org.apache.drill.common.expression.SchemaPath;
    +
    +import org.apache.drill.exec.physical.base.AbstractBase;
    +import org.apache.drill.exec.physical.base.PhysicalOperator;
    +import org.apache.drill.exec.physical.config.FlattenPOP;
    +import org.apache.drill.exec.physical.impl.ScanBatch;
    +import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
    +import org.apache.drill.exec.record.RecordBatch;
    +import org.apache.drill.exec.record.VectorAccessible;
    +import org.apache.drill.exec.util.JsonStringArrayList;
    +import org.apache.drill.exec.util.JsonStringHashMap;
    +import org.apache.drill.exec.util.Text;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.List;
    +
    +public class TestOutputBatchSize extends PhysicalOpUnitTestBase {
    --- End diff --
    
    I added a test case like this, testFlattenLargeRecords, and there are a
    bunch of other test cases as well.
    All the tests verify the batch sizes and the number of batches.


> Limit batch size for flatten operator
> -------------------------------------
>
>                 Key: DRILL-6071
>                 URL: https://issues.apache.org/jira/browse/DRILL-6071
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.12.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.13.0
>
>
> The flatten operator currently uses an adaptive algorithm to control the
> outgoing batch size.
>  While processing the input batch, it adjusts the number of records in the
> outgoing batch based on memory usage so far. Once memory usage exceeds the
> configured limit for a batch, the algorithm becomes more proactive and
> adjusts the limit halfway through and at the end of every batch. All this
> periodic checking of memory usage is unnecessary overhead and hurts
> performance. Also, we find out that a batch exceeded the limit only after
> the fact.
> Instead, figure out from the incoming batch how many rows the outgoing
> batch should hold.
>  The way to do that is to compute the average row size of the outgoing
> batch and, based on that, determine how many rows fit in a given amount of
> memory. Value vectors provide the information needed to compute this.
> The row count of the output batch should be decided based on memory (with
> min 1 and max 64k rows) and not hard coded (to 4K) in the code. The memory
> for the output batch should be a configurable system option.
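
The sizing rule described in the issue can be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: the class name, method name, and parameters are hypothetical, "64k" is assumed to mean 64 * 1024 rows, and the average output row size is taken as an input here instead of being derived from value vectors.

```java
// Hypothetical sketch of memory-based row-count sizing; not Drill's actual code.
class OutputRowCountSketch {
    // Bounds taken from the issue text: min 1 row, max 64k rows.
    // Assumption: "64k" is read as 64 * 1024.
    static final int MIN_ROWS = 1;
    static final int MAX_ROWS = 64 * 1024;

    /**
     * Returns how many rows fit in the output batch for a given memory budget.
     *
     * @param outputBatchMemoryBytes memory budget for one output batch (assumed configurable)
     * @param avgOutputRowBytes      estimated average size of one output row
     */
    static int outputRowCount(long outputBatchMemoryBytes, long avgOutputRowBytes) {
        if (avgOutputRowBytes <= 0) {
            throw new IllegalArgumentException("average row size must be positive");
        }
        long rows = outputBatchMemoryBytes / avgOutputRowBytes;
        // Clamp so a very wide row still yields one row per batch and
        // very narrow rows never exceed the maximum row count.
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
    }

    public static void main(String[] args) {
        long budget = 16L * 1024 * 1024; // 16 MiB, an arbitrary example budget
        System.out.println(outputRowCount(budget, 1024));
        System.out.println(outputRowCount(budget, 100L * 1024 * 1024));
        System.out.println(outputRowCount(budget, 8));
    }
}
```

Because the row count is computed once per incoming batch, no periodic memory checks are needed while the output batch is being filled.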



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
