Paul Rogers created DRILL-5282:
----------------------------------
Summary: Rationalize record batch sizes in all readers and
operators
Key: DRILL-5282
URL: https://issues.apache.org/jira/browse/DRILL-5282
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.10.0
Reporter: Paul Rogers
Drill uses record batches to process data. A record batch consists of a
"bundle" of vectors that, combined, hold the data for some number of records.
The key consideration for a record batch is the memory it consumes. Various operators
and readers have vastly different ideas of how large a batch should be. The text reader
can produce batches of a few hundred KB, while the flatten operator produces batches
of half a GB. Other operators fall somewhere in between, with no consistent target.
Some readers produce batches of effectively unbounded size, driven only by average row width.
Another key consideration is record count. Batches have a hard physical limit
of 64K records (the largest count addressable by a two-byte selection vector).
Some operators produce batches at that limit, others far fewer. In one case, we
saw a reader that produced 64K + 1 records.
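As a quick illustration of where that ceiling comes from (the class and constant below
are hypothetical, not existing Drill code):
{code:java}
// Sketch only: the hard row-count ceiling implied by a two-byte (16-bit)
// selection vector. 2^16 = 65,536, so a batch holding more rows cannot be
// addressed by SV2-based operators such as filter or sort.
public final class BatchLimitsSketch {
  public static final int MAX_ROWS_PER_BATCH = 1 << 16;  // 65,536 rows
}
{code}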
A final consideration is the size of individual vectors. Drill incurs severe
memory fragmentation when vectors grow above 16 MB.
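As a hedged sketch of how a reader might respect that threshold (the 16 MB figure is
taken from the paragraph above; the helper and its names are hypothetical):
{code:java}
// Sketch only: cap the rows written into a single vector so its backing
// buffer stays at or below a 16 MB ceiling, given an estimated value width.
public final class VectorSizeSketch {
  private static final int MAX_VECTOR_BYTES = 16 * 1024 * 1024;  // 16 MB

  public static int maxRowsForWidth(int estimatedValueWidthBytes) {
    if (estimatedValueWidthBytes <= 0) {
      throw new IllegalArgumentException("Value width must be positive");
    }
    // Also respect the 64K row limit noted above.
    return Math.min(1 << 16, MAX_VECTOR_BYTES / estimatedValueWidthBytes);
  }
}
{code}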
In some cases, operators (such as the Parquet reader) allocate large batches
but fill them only partially, creating a large amount of wasted space. That
wasted space adds up when we must buffer batches during a sort.
This ticket asks that we research an optimal batch size, create a framework to build
batches of that size, and retrofit all operators that produce batches to use that
framework so that they produce uniformly sized batches.
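A minimal sketch of what such a framework might look like appears below. It assumes a
per-batch memory budget and simply answers "should the in-flight batch be flushed?"
after each row; every class and method name here is hypothetical, not existing Drill code:
{code:java}
// Sketch only: a tiny batch sizer that enforces the three limits discussed in
// this ticket: a row-count cap (64K), a per-vector byte cap (16 MB), and an
// overall per-batch memory budget.
public final class BatchSizerSketch {
  private static final int MAX_ROWS = 1 << 16;             // 65,536 rows
  private static final long MAX_VECTOR_BYTES = 16L << 20;  // 16 MB
  private final long batchMemoryBudget;                    // e.g. 8 MB

  private int rowCount;
  private long totalBytes;
  private long widestVectorBytes;  // rough size of the largest single vector

  public BatchSizerSketch(long batchMemoryBudget) {
    this.batchMemoryBudget = batchMemoryBudget;
  }

  /** Record one more row: its total width and the width of its widest column. */
  public void addRow(long rowBytes, long widestColumnBytes) {
    rowCount++;
    totalBytes += rowBytes;
    widestVectorBytes += widestColumnBytes;
  }

  /** True when the writer should send this batch downstream and start a new one. */
  public boolean shouldFlush() {
    return rowCount >= MAX_ROWS
        || widestVectorBytes >= MAX_VECTOR_BYTES
        || totalBytes >= batchMemoryBudget;
  }

  /** Reset the counters after the batch has been flushed. */
  public void startNewBatch() {
    rowCount = 0;
    totalBytes = 0;
    widestVectorBytes = 0;
  }
}
{code}
A real implementation would also need to account for power-of-two rounding in vector
allocation and for variable-width columns, but the shape of the checks would be similar.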
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)