Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/914
  
    Finally, a note on the fragmentation issue. As you noted, this is a subtle 
issue. It is true that Netty maintains a memory pool, based on binary 
allocations, that minimizes the normal kind of fragmentation that results from 
randomly sized allocations from a common pool.
    
    The cost of the binary structure is _internal_ fragmentation. Today, Drill 
vectors have, on average, 25% internal fragmentation. This PR does not address 
this issue per se, but sets us on the road toward a solution.
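
    To see where a figure of that size can come from, here is a rough Python 
sketch (not Drill or Netty code). It assumes every request is rounded up to the 
next power of two and that request sizes are spread evenly between two powers 
of two. Both are simplifying assumptions, and `round_up_pow2` and 
`internal_fragmentation` are hypothetical helpers.

```python
def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (n must be positive)."""
    if n <= 0:
        raise ValueError("size must be positive")
    return 1 << (n - 1).bit_length()


def internal_fragmentation(requested_sizes) -> float:
    """Fraction of allocated bytes that the requests do not actually use."""
    sizes = list(requested_sizes)
    allocated = sum(round_up_pow2(n) for n in sizes)
    used = sum(sizes)
    return (allocated - used) / allocated


# Assumed workload: one request of every size from 4097 to 8192 bytes.
# Each of these is rounded up to an 8192-byte block.
frac = internal_fragmentation(range(4097, 8193))
print(f"internal fragmentation: {frac:.1%}")
```

    Under those assumptions the wasted share of allocated memory comes out 
very close to one quarter. Real vector sizes are not uniformly distributed, so 
treat this only as an intuition for the number.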
    
    The key fragmentation issue that this PR _does_ deal with is that which 
occurs when allocations exceed the 16 MB (default) Netty block size. In that 
case, Netty does, in fact, go to the OS. The OS does a fine job of coalescing 
large blocks to prevent fragmentation. The problem, however, is that, over 
time, more and more memory resides in the Netty free list. Eventually, there 
simply is not enough memory left outside of Netty to service a jumbo (> 16 MB) 
block. Drill gets an OOM error even though Netty has many GB of memory free; 
none of it is available in the 32+ MB size we want.
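
    The failure mode can be reproduced with a toy model, sketched here in 
Python. `ToyAllocator` is hypothetical and is not how Netty is implemented: it 
only tracks byte counts, assumes a 16 MB chunk size, serves requests larger 
than a chunk directly from the OS, and never returns pooled chunks to the OS. 
The 64 MB total is an arbitrary number chosen for the example.

```python
MB = 1 << 20
CHUNK = 16 * MB  # assumed chunk size, matching the default mentioned above


class ToyAllocator:
    """Byte-counting model: pooled chunks are never returned to the OS."""

    def __init__(self, total: int):
        self.os_free = total   # memory still outside the pool
        self.pool_free = 0     # memory held by the pool but not in use

    def allocate(self, size: int) -> None:
        if size > CHUNK:
            # Jumbo request: bypasses the pool and goes straight to the OS.
            if size > self.os_free:
                raise MemoryError("no jumbo block available")
            self.os_free -= size
            return
        # Normal request: grow the pool one chunk at a time if needed.
        while self.pool_free < size:
            if self.os_free < CHUNK:
                raise MemoryError("cannot grow pool")
            self.os_free -= CHUNK
            self.pool_free += CHUNK
        self.pool_free -= size

    def free(self, size: int) -> None:
        if size > CHUNK:
            self.os_free += size    # jumbo blocks go back to the OS
        else:
            self.pool_free += size  # pooled memory stays in the pool


alloc = ToyAllocator(total=64 * MB)
for _ in range(12):          # 12 x 4 MB = 48 MB, which pulls in 3 chunks
    alloc.allocate(4 * MB)
for _ in range(12):          # release everything back to the pool
    alloc.free(4 * MB)

try:
    alloc.allocate(32 * MB)  # jumbo request
    jumbo_failed = False
except MemoryError:
    jumbo_failed = True

print(alloc.pool_free // MB, "MB free in pool;",
      alloc.os_free // MB, "MB left outside; jumbo failed:", jumbo_failed)
```

    In this run 48 MB sits idle in the pool, only 16 MB remains outside it, 
and the 32 MB request fails even though most of the memory is unused.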
    
    We could force Netty to release unused memory. In fact, the original 
[JE-Malloc 
paper](https://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf) 
(that you provided way back when, thanks) points out that the allocator should 
monitor its pools and release memory back to the system when a pool's usage drops 
to zero. It does not appear that `PooledByteBufAllocatorL` implemented this 
feature, so the allocator never releases memory once it lands in the 
allocator's free list. We could certainly fix this; the JE-Malloc paper 
provides suggestions.
    
    Still, however, we could end up with usage patterns in which some slice of 
memory is used from each chunk, blocking any chunk from being released to the 
OS, and thereby blocking a "jumbo" block allocation, even though much memory 
is free on the free list. This is yet another form of fragmentation.
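
    A small Python sketch of that pattern, assuming a hypothetical policy that 
returns a chunk to the OS only when nothing in it is in use. The chunk size and 
the usage numbers are made up for illustration.

```python
MB = 1 << 20
CHUNK = 16 * MB  # assumed chunk size


def releasable_chunks(chunk_usage):
    """Indexes of chunks with no live bytes, i.e. safe to return to the OS."""
    return [i for i, used in enumerate(chunk_usage) if used == 0]


# Three chunks, each pinned by a single live 1 MB buffer.
usage = [1 * MB, 1 * MB, 1 * MB]

free_in_pool = sum(CHUNK - used for used in usage)
can_release = releasable_chunks(usage)

print(free_in_pool // MB, "MB free in pool; releasable chunks:", can_release)
```

    Here 45 MB of the 48 MB held by the pool is free, yet no chunk can be 
released, so none of that memory becomes available for a jumbo block.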
    
    Finally, as you point out, all of this assumes that we want to continue to 
allocate "jumbo" blocks. But, as we discovered in the managed sort work, and 
the hash agg spill work, Drill has two conflicting tendencies. On the one hand, 
"managed" operators wish to operate within a constrained memory footprint. 
(For the sort, this often ends up being on the order of 30 MB, for various 
reasons.) On the other hand, upstream operators are free to produce batches of 
any size. If the scan operator, say, decides to allocate a batch that contains 
32 MB vectors, then the sort can't accept even one of those batches and an OOM 
ensues.
    
    So, rather than solve our memory fragmentation issues by mucking with Netty 
(force free of unused chunks, increase chunk size, etc.), the preferred solution 
is to live within a budget: both the constraints of the Netty chunk size *and* 
the constraints placed on Drill operator memory usage.
    
    In short, we started by wanting to solve the fragmentation issue, but we 
realized that the best solution is to also solve the unlimited-batch-size 
issue, hence this PR.

