Paul Rogers created DRILL-5282:
----------------------------------

             Summary: Rationalize record batch sizes in all readers and operators
                 Key: DRILL-5282
                 URL: https://issues.apache.org/jira/browse/DRILL-5282
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.10.0
            Reporter: Paul Rogers


Drill uses record batches to process data. A record batch consists of a 
"bundle" of vectors that, combined, hold the data for some number of records.

The key consideration for a record batch is the memory it consumes. At 
present, operators and readers have vastly different notions of batch size: 
the text reader produces batches of a few hundred KB, the flatten operator 
produces batches of half a GB, and other operators fall arbitrarily in 
between. Some readers target a fixed record count, so their batch memory is 
unbounded and grows with the average row width.
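
As a back-of-the-envelope illustration of that last point (the row widths
below are made up): a reader that targets a fixed record count has no bound
on batch memory once rows get wide.

{code:java}
public class SizeByRowCount {
  public static void main(String[] args) {
    int targetRows = 4096;                           // fixed row target
    long[] avgRowWidths = {100, 10_000, 1_000_000};  // bytes per row
    for (long width : avgRowWidths) {
      long batchBytes = targetRows * width;
      System.out.printf("row width %,d B -> batch %.1f MB%n",
          width, batchBytes / (1024.0 * 1024.0));
    }
  }
}
{code}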

Another key consideration is record count. Batches have a hard physical limit 
of 64K records (the number indexable by a two-byte selection vector). Some 
operators produce that many records, others far fewer. In one case, we saw a 
reader that produced 64K+1 records.
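
The 64K figure is just unsigned 16-bit arithmetic: a two-byte selection
vector can address 2^16 = 65,536 positions. A sketch of the corresponding
sanity check (MAX_ROW_COUNT here is a stand-in, not a reference to a
specific Drill constant):

{code:java}
public class RowCountLimit {
  static final int MAX_ROW_COUNT = 1 << 16;  // 65,536 = 2-byte SV capacity

  static void checkRowCount(int rows) {
    if (rows > MAX_ROW_COUNT) {
      throw new IllegalStateException(
          "Batch of " + rows + " rows exceeds the 64K limit");
    }
  }

  public static void main(String[] args) {
    checkRowCount(65_536);  // OK: exactly at the limit
    checkRowCount(65_537);  // throws: the 64K+1 case seen in one reader
  }
}
{code}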

A final consideration is the size of individual vectors. Drill incurs severe 
memory fragmentation when vectors grow above 16 MB.
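
Treating 16 MB as a hard per-vector budget, the safe row count for a batch
follows from its widest column; the column widths below are illustrative.

{code:java}
public class VectorSizeCap {
  static final long MAX_VECTOR_BYTES = 16 * 1024 * 1024;  // 16 MB budget

  public static void main(String[] args) {
    System.out.println("BIGINT (8 B/row):       " + MAX_VECTOR_BYTES / 8 + " rows");
    System.out.println("VARCHAR (avg 50 B/row): " + MAX_VECTOR_BYTES / 50 + " rows");
    // The widest column dictates the row count for the whole batch.
  }
}
{code}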

In some cases, operators (such as the Parquet reader) allocate large batches 
but only partially fill them, creating a large amount of wasted space. That 
waste adds up when batches must be buffered, as during a sort.
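
One way to quantify the waste is a "density" metric: payload bytes actually
written over bytes allocated. The numbers below are illustrative, not
measured.

{code:java}
public class BatchDensity {
  public static void main(String[] args) {
    long allocated = 32L * 1024 * 1024;  // batch allocated at 32 MB
    long written = 2L * 1024 * 1024;     // only 2 MB actually filled
    double density = (double) written / allocated;
    System.out.printf("density = %.1f%%, wasted = %,d bytes%n",
        density * 100, allocated - written);
  }
}
{code}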

This ticket asks that we research an optimal batch size, create a framework 
for building batches of that size, and retrofit all operators that produce 
batches to use the framework, so that every operator emits uniform batches.
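
A minimal sketch of what such a framework might enforce, assuming two
hypothetical limits (a row ceiling and a per-batch memory budget):
producers ask the sizer before writing each row and flush when the batch
is full.

{code:java}
public class BatchSizer {
  private final int maxRows;    // e.g. the 64K ceiling, or a lower target
  private final long maxBytes;  // e.g. a 16 MB per-batch budget
  private int rows;
  private long bytes;

  BatchSizer(int maxRows, long maxBytes) {
    this.maxRows = maxRows;
    this.maxBytes = maxBytes;
  }

  /** Returns false when the current batch must be flushed first. */
  boolean tryAddRow(long rowWidthBytes) {
    if (rows + 1 > maxRows || bytes + rowWidthBytes > maxBytes) {
      return false;
    }
    rows++;
    bytes += rowWidthBytes;
    return true;
  }

  void startNewBatch() { rows = 0; bytes = 0; }

  public static void main(String[] args) {
    BatchSizer sizer = new BatchSizer(65_536, 16 * 1024 * 1024);
    int flushes = 0;
    for (int i = 0; i < 1_000_000; i++) {
      if (!sizer.tryAddRow(100)) {  // 100-byte rows
        flushes++;
        sizer.startNewBatch();
        sizer.tryAddRow(100);
      }
    }
    System.out.println("batches flushed mid-stream: " + flushes);
  }
}
{code}

Centralizing both checks in one class is what would let every reader and
operator emit batches of the same shape, regardless of row width.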


