Hi All,

Those of you who were able to join the Drill Hangout today got a brief introduction to the memory fragmentation issue we wish to resolve. For everyone else, below is a very brief overview of the issue; please consult the documents in DRILL-5211 for more information. Since we are proposing a number of changes, it would be great to get many eyes on both the problem and the proposed solution.
Drill uses a two-tier memory allocator. Netty handles allocations up to 16 MB; Java Unsafe handles allocations of 32 MB or larger directly from native memory. (All allocations are done in power-of-two sizes.) When blocks are freed, those of 16 MB and smaller go onto the Netty free list, while those of 32 MB and larger go back to the native memory pool. Eventually, all memory ends up free in the form of 16 MB (and smaller) Netty blocks. A 32 MB allocation then fails, because all native memory is held by Netty (sitting on Netty's free list). The result is many GB of free memory, yet an OOM error.

Many solutions are possible: extend the Netty block size, force Netty to release memory back to the native pool, and so on. It turns out, however, that the Netty allocator is 1000x faster than the native allocator, so we would prefer to use the Netty allocator for most allocations, which rules out many possible solutions. We have therefore concluded that our best path forward is to limit individual value vectors to 16 MB in size. Various low-level changes enable this limit (see PR 840, DRILL-5517). On top of that, we created a modified, size-aware version of the "vector writers" (two rough sketches below illustrate the idea). Finally, we created a new scan "mutator" that handles the limits. (This structure follows the existing structures already in Drill.)

In the general case, scanners can read data only once. So how do we handle the case where we have 20 columns, we've copied the first 10 columns into vectors, but the 11th column overflows? The new mutator implements an "overflow row" by creating a new "look-ahead" batch, moving the partially written overflow row to that batch, and letting the reader finish adding columns to the overflow row. The reader then sends the full batch downstream. On the next call to read a batch, reading starts with the row already sitting in the new batch. This change lets readers handle vector limits transparently.

But each reader has implemented vector writing in its own way: some use the vector writers, Parquet has its own vector writers, some write to vectors directly (without the vector writers), and some bypass vectors entirely to write straight into the underlying direct memory. So we need to standardize on a single size-aware mechanism. In addition, readers need to handle "missing" columns, implicit and partition columns, and so on; this common logic should also be standardized. The resulting refactoring leaves Drill readers with only the task of reading data from a data source and loading it into vectors using the new vector writers. A nice side effect is that readers become very simple, easy to write, and easy to test, which in turn should encourage more people to contribute storage plugins.

Specs for all of the above are posted to DRILL-5211; please review at your convenience. Working code also exists, and PRs will be issued one after another, as each depends on code in a previous one. The specs point to my working branch for those who want an early peek at the code without waiting for the PRs.

Once the readers limit vector sizes, we'll need a solution for other operators, such as flatten and project, that can potentially create large vectors. That is an open topic for which we have only a very general outline of a solution.
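To make "size aware" concrete, here is a minimal sketch of the idea, assuming a hypothetical writer class and method names (not Drill's actual writer API; see the specs in DRILL-5211 for the real interfaces). The point is simply that the writer tracks how much of the 16 MB vector budget is used and reports overflow instead of letting the vector grow past the limit:

import java.nio.ByteBuffer;

/** Illustrative sketch only: hypothetical class, not Drill's actual writer API. */
public class SizeAwareVarCharWriter {
  // Hard cap per value vector, matching Netty's largest pooled block size.
  private static final int VECTOR_LIMIT = 16 * 1024 * 1024;

  // Stand-in for the vector's underlying buffer.
  private final ByteBuffer buf = ByteBuffer.allocate(VECTOR_LIMIT);

  /**
   * Write one value if it fits. Returns false (overflow) when the value would
   * push the vector past the 16 MB limit, so the caller can close out the
   * current batch instead of letting the vector grow without bound.
   */
  public boolean setSafe(byte[] value) {
    if (buf.remaining() < value.length) {
      return false;               // signal overflow to the mutator
    }
    buf.put(value);
    return true;
  }
}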
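And a rough sketch of the overflow-row mechanics described above, again purely illustrative (the Batch type and method names are hypothetical stand-ins, not the real mutator classes): when a column write overflows mid-row, the partially written row is redirected to a fresh look-ahead batch, the full batch is harvested and sent downstream, and the next batch begins with that carried-over row.

import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: hypothetical types, not Drill's actual scan mutator. */
public class OverflowRowSketch {

  /** Minimal stand-in for a record batch: a list of rows, one Object per column. */
  static class Batch {
    final List<Object[]> rows = new ArrayList<>();
  }

  private Batch current = new Batch();
  private Batch lookAhead;       // created only when a row overflows mid-write
  private Object[] partialRow;   // the row currently being assembled

  public void startRow(int columnCount) {
    partialRow = new Object[columnCount];
  }

  /** The reader calls this per column; `overflowed` models hitting the 16 MB vector limit. */
  public void writeColumn(int col, Object value, boolean overflowed) {
    if (overflowed && lookAhead == null) {
      lookAhead = new Batch();   // start the look-ahead batch for the overflow row
    }
    partialRow[col] = value;     // remaining columns complete the row as usual
  }

  /** End of row: an overflowed row lands in the look-ahead batch, not the full one. */
  public void endRow() {
    (lookAhead != null ? lookAhead : current).rows.add(partialRow);
  }

  /** Send the full batch downstream; the next batch starts with the overflow row. */
  public Batch harvest() {
    Batch full = current;
    current = (lookAhead != null) ? lookAhead : new Batch();
    lookAhead = null;
    return full;
  }
}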
An advantage of the vector-size-limit approach is that we can extend it to limit overall batch size, improving Drill's ability to manage memory. For example, receivers must accept three incoming batches before back pressure kicks in, but because batches are currently of unlimited size, receivers don't know how much memory to allocate to buffer those three required batches. A standard batch size will resolve this issue, among others.

The above is the proposal in a nutshell; please consult the documents for details. To help us track your comments, please post them to DRILL-5211 instead of replying here.

Thanks,
- Paul
