Hi All,

Those of you who were able to join the Drill Hangout today got a brief introduction to the memory fragmentation issue we wish to resolve. For everyone else, below is a very brief overview of the issue; please consult the documents in DRILL-5211 for more information. Since we are proposing a number of changes, it would be great to get many eyes on both the problem and the proposed solution.
Drill uses a two-tier memory allocator. Netty handles allocations up to 16 MB; Java Unsafe handles allocations of 32 MB or larger directly from native memory. (All allocations are done in power-of-two sizes.) When blocks are freed, those of 16 MB and smaller go onto the Netty free list, while those of 32 MB and larger go back to the native memory pool. Eventually, all memory ends up free in the form of 16 MB (and smaller) Netty blocks. A 32 MB allocation then fails, because all native memory is held by Netty (sitting on Netty's free list). The result is many GB of free memory, yet an OOM error.

Many solutions are possible: extend the Netty block size, force Netty to release memory back to the native pool, and so on. It turns out, however, that the Netty allocator is 1000x faster than the native allocator, so we would prefer to use the Netty allocator for most allocations, which rules out many possible solutions. We have therefore concluded that our best path forward is to limit individual value vectors to 16 MB in size. Various low-level changes enable this limit (see PR 840, DRILL-5517). On top of that, we created a modified, size-aware version of the "vector writers" (two rough sketches below illustrate the idea). Finally, we created a new scan "mutator" that handles the limits. (This structure follows the existing structures already in Drill.)

In the general case, scanners can read data only once. So how do we handle the case where we have 20 columns, we've copied the first 10 columns into vectors, but the 11th column overflows? The new mutator implements an "overflow row" by creating a new "look-ahead" batch, moving the partially written overflow row to that batch, and letting the reader finish adding columns to the overflow row. The reader then sends the full batch downstream. On the next call to read a batch, reading starts with the row already sitting in the new batch. This change lets readers handle vector limits transparently.

But each reader has implemented vector writing in its own way: some use the vector writers, Parquet has its own vector writers, some write to vectors directly (without the vector writers), and some bypass vectors entirely to write straight into the underlying direct memory. So we need to standardize on a single size-aware mechanism. In addition, readers need to handle "missing" columns, implicit and partition columns, and so on; this common logic should also be standardized. The resulting refactoring leaves Drill readers with only the task of reading data from a data source and loading it into vectors using the new vector writers. A nice side effect is that readers become very simple, easy to write, and easy to test, which in turn should encourage more people to contribute storage plugins.

Specs for all of the above are posted to DRILL-5211; please review at your convenience. Working code also exists, and PRs will be issued one after another, as each depends on code in a previous one. The specs point to my working branch for those who want an early peek at the code without waiting for the PRs.

Once the readers limit vector sizes, we'll need a solution for other operators, such as flatten and project, that can potentially create large vectors. That is an open topic for which we have only a very general outline of a solution.
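To make "size aware" concrete, here is a minimal sketch of the idea, assuming a hypothetical writer class and method names (not Drill's actual writer API; see the specs in DRILL-5211 for the real interfaces). The point is simply that the writer tracks how much of the 16 MB vector budget is used and reports overflow instead of letting the vector grow past the limit:

import java.nio.ByteBuffer;

/** Illustrative sketch only: hypothetical class, not Drill's actual writer API. */
public class SizeAwareVarCharWriter {
  // Hard cap per value vector, matching Netty's largest pooled block size.
  private static final int VECTOR_LIMIT = 16 * 1024 * 1024;

  // Stand-in for the vector's underlying buffer.
  private final ByteBuffer buf = ByteBuffer.allocate(VECTOR_LIMIT);

  /**
   * Write one value if it fits. Returns false (overflow) when the value would
   * push the vector past the 16 MB limit, so the caller can close out the
   * current batch instead of letting the vector grow without bound.
   */
  public boolean setSafe(byte[] value) {
    if (buf.remaining() < value.length) {
      return false;               // signal overflow to the mutator
    }
    buf.put(value);
    return true;
  }
}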
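And a rough sketch of the overflow-row mechanics described above, again purely illustrative (the Batch type and method names are hypothetical stand-ins, not the real mutator classes): when a column write overflows mid-row, the partially written row is redirected to a fresh look-ahead batch, the full batch is harvested and sent downstream, and the next batch begins with that carried-over row.

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: hypothetical types, not Drill's actual scan mutator. */
public class OverflowRowSketch {

  /** Minimal stand-in for a record batch: a list of rows, one Object per column. */
  static class Batch {
    final List<Object[]> rows = new ArrayList<>();
  }

  private Batch current = new Batch();
  private Batch lookAhead;       // created only when a row overflows mid-write
  private Object[] partialRow;   // the row currently being assembled

  public void startRow(int columnCount) {
    partialRow = new Object[columnCount];
  }

  /** The reader calls this per column; `overflowed` models hitting the 16 MB vector limit. */
  public void writeColumn(int col, Object value, boolean overflowed) {
    if (overflowed && lookAhead == null) {
      lookAhead = new Batch();   // start the look-ahead batch for the overflow row
    }
    partialRow[col] = value;     // remaining columns complete the row as usual
  }

  /** End of row: an overflowed row lands in the look-ahead batch, not the full one. */
  public void endRow() {
    (lookAhead != null ? lookAhead : current).rows.add(partialRow);
  }

  /** Send the full batch downstream; the next batch starts with the overflow row. */
  public Batch harvest() {
    Batch full = current;
    current = (lookAhead != null) ? lookAhead : new Batch();
    lookAhead = null;
    return full;
  }
}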
An advantage of the vector-size-limit approach is that we can extend it to limit overall batch size, improving Drill's ability to manage memory. For example, receivers must accept three incoming batches before back pressure kicks in, but because batches are currently of unlimited size, receivers don't know how much memory to allocate to buffer those three required batches. A standard batch size will resolve this issue, among others.

The above is the proposal in a nutshell; please consult the documents for details. To help us track your comments, please post them to DRILL-5211 instead of replying here.

Thanks,
- Paul
