[ https://issues.apache.org/jira/browse/DRILL-6071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16341663#comment-16341663 ]
ASF GitHub Bot commented on DRILL-6071:
---------------------------------------

Github user ppadma commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1091#discussion_r164233329

    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/unit/TestOutputBatchSize.java ---
    @@ -0,0 +1,498 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + * <p/>
    + * http://www.apache.org/licenses/LICENSE-2.0
    + * <p/>
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.drill.exec.physical.unit;
    +
    +import com.google.common.collect.Lists;
    +import org.apache.drill.common.expression.SchemaPath;
    +
    +import org.apache.drill.exec.physical.base.AbstractBase;
    +import org.apache.drill.exec.physical.base.PhysicalOperator;
    +import org.apache.drill.exec.physical.config.FlattenPOP;
    +import org.apache.drill.exec.physical.impl.ScanBatch;
    +import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
    +import org.apache.drill.exec.record.RecordBatch;
    +import org.apache.drill.exec.record.VectorAccessible;
    +import org.apache.drill.exec.util.JsonStringArrayList;
    +import org.apache.drill.exec.util.JsonStringHashMap;
    +import org.apache.drill.exec.util.Text;
    +import org.junit.Ignore;
    +import org.junit.Test;
    +
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.List;
    +
    +public class TestOutputBatchSize extends PhysicalOpUnitTestBase {
    --- End diff --

    I added a test case like this, testFlattenLargeRecords, and there are a bunch of other test cases as well. All the tests verify the batch sizes and the number of batches.


> Limit batch size for flatten operator
> -------------------------------------
>
>                 Key: DRILL-6071
>                 URL: https://issues.apache.org/jira/browse/DRILL-6071
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow
>    Affects Versions: 1.12.0
>            Reporter: Padma Penumarthy
>            Assignee: Padma Penumarthy
>            Priority: Major
>             Fix For: 1.13.0
>
>
> flatten currently uses an adaptive algorithm to control the outgoing batch
> size.
> While processing the input batch, it adjusts the number of records in the
> outgoing batch based on memory usage so far. Once memory usage exceeds the
> configured limit for a batch, the algorithm becomes more proactive and
> adjusts the limit halfway through and at the end of every batch. All this
> periodic checking of memory usage is unnecessary overhead and hurts
> performance. Also, we find out that the limit was exceeded only after the
> fact.
> Instead, figure out how many rows should be in the outgoing batch from the
> incoming batch.
> The way to do that is to figure out the average row size of the outgoing
> batch and, based on that, how many rows fit in a given amount of memory.
> Value vectors provide the information needed to figure this out.
> Row count in the output batch should be decided based on memory (with min 1
> and max 64K rows) and not hard-coded (to 4K) in code. Memory for the output
> batch should be a configurable system option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
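
The sizing rule in the quoted issue description can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class `OutputBatchSizer`, its methods, and its constants are hypothetical names, and the sketch assumes the average outgoing row width has already been estimated (how Drill derives it from value vectors is not shown here). Only the bounds of 1 and 64K rows come from the description.

```java
// Hypothetical sketch: names are illustrative and are not Drill's API.
final class OutputBatchSizer {
    // Bounds taken from the issue description: min 1 row, max 64K rows.
    static final int MIN_ROWS = 1;
    static final int MAX_ROWS = 64 * 1024;

    private OutputBatchSizer() {}

    /** Average bytes per row, rounded up so the width is never underestimated. */
    static long averageRowWidth(long batchBytes, int rowCount) {
        if (rowCount <= 0) {
            return 0; // empty batch: nothing to measure
        }
        return (batchBytes + rowCount - 1) / rowCount;
    }

    /** Rows that fit in the memory limit, clamped to [MIN_ROWS, MAX_ROWS]. */
    static int rowsPerBatch(long memoryLimitBytes, long avgRowWidthBytes) {
        if (avgRowWidthBytes <= 0) {
            return MAX_ROWS; // assumption: with no width estimate, fall back to the cap
        }
        long rows = memoryLimitBytes / avgRowWidthBytes;
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
    }
}
```

For example, with an arbitrary 16 MiB limit and 1 KiB rows, `rowsPerBatch(16L * 1024 * 1024, 1024)` gives 16384 rows; very narrow rows hit the 64K cap, and a row wider than the whole limit still yields one row per batch. The row count is computed once per incoming batch, which avoids the periodic memory checks the description objects to.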