[
https://issues.apache.org/jira/browse/DRILL-5209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Rogers reassigned DRILL-5209:
----------------------------------
Assignee: Paul Rogers
> Standardize Drill's batch size
> ------------------------------
>
> Key: DRILL-5209
> URL: https://issues.apache.org/jira/browse/DRILL-5209
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.9.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Minor
>
> Drill is columnar, implemented as a set of value vectors. Value vectors
> consume memory, which is a fixed resource on each Drillbit. Effective
> resource management requires the ability to control (or at least predict)
> resource usage.
> Most data consists of more than one column. A collection of columns (or rows,
> depending on your perspective) is a record batch.
> Many parts of Drill use 64K rows as the target size of a record batch. The
> Flatten operator targets batch sizes of 512 MB. The text scan operator
> appears to target batch sizes of 128 MB. Other operators may use other sizes.
> Operators that target 64K rows therefore consume an unknown and potentially
> unbounded amount of memory. A batch of 64K rows of a single INT column is
> fine (about 256 KB of data), but 64K rows of Varchar columns of 50K bytes
> each produce a batch of roughly 3.2 GB, which is far too large.
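A quick back-of-the-envelope check of the figure above (illustrative only; the 50K-byte Varchar is the hypothetical worst case from the description, not a measured Drill workload):

```java
// Worst-case batch size: 64K rows, each holding a 50K-byte Varchar value.
public class BatchSizeEstimate {
    public static void main(String[] args) {
        long rows = 64 * 1024;            // 64K-row batch target
        long bytesPerValue = 50 * 1000;   // 50K-byte Varchar per row
        long batchBytes = rows * bytesPerValue;
        // Prints the batch size in decimal gigabytes.
        System.out.printf("%.2f GB%n", batchBytes / 1e9);
    }
}
```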
> This ticket requests three improvements.
> 1. Define a preferred batch size that balances competing needs: memory use,
> network efficiency, the benefits of vector operations, etc.
> 2. Provide a reliable way to learn the size of each row as it is added to a
> batch.
> 3. Use the above to limit batches to the preferred batch size.
> The above will go a long way toward easing the task of managing memory,
> because the planner will then have some hope of predicting how much memory
> to allocate to each operation.
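Improvements 2 and 3 might be sketched as below. This is a minimal illustration, not Drill code: the class name, the 16 MB budget, and the String[] row representation are all hypothetical, standing in for whatever per-vector accounting the real implementation would use.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch: track the size of each row as it is added (#2) and cap the batch
// at a preferred byte budget in addition to the legacy row cap (#3).
public class SizedBatchBuilder {
    private static final long PREFERRED_BATCH_BYTES = 16 * 1024 * 1024; // hypothetical budget
    private static final int MAX_ROWS = 64 * 1024;                      // legacy row limit

    private final List<String[]> rows = new ArrayList<>();
    private long batchBytes;

    /** Returns false when the batch is full; the caller should flush and retry. */
    public boolean addRow(String[] row) {
        long rowBytes = 0;
        for (String col : row) {
            rowBytes += col.getBytes(StandardCharsets.UTF_8).length;
        }
        // Never reject the first row, even if it alone exceeds the budget.
        if (!rows.isEmpty()
                && (batchBytes + rowBytes > PREFERRED_BATCH_BYTES
                    || rows.size() >= MAX_ROWS)) {
            return false;
        }
        rows.add(row);
        batchBytes += rowBytes;
        return true;
    }

    public int rowCount()     { return rows.size(); }
    public long sizeInBytes() { return batchBytes; }
}
```

With per-row accounting like this, a downstream operator (or the planner) can treat a batch as "at most PREFERRED_BATCH_BYTES" rather than "64K rows of unknown width".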
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)