[
https://issues.apache.org/jira/browse/DRILL-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15833200#comment-15833200
]
Paul Rogers edited comment on DRILL-5211 at 1/22/17 12:44 AM:
--------------------------------------------------------------
Actually, the problem appears to be related to the cache of allocated memory
chunks. Consider this code:
{code}
private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCapacity) {
  PoolThreadCache cache = threadCache.get();
  PoolArena<ByteBuffer> directArena = cache.directArena;
  if (directArena != null) {
    if (initialCapacity > directArena.chunkSize) {
      // This is beyond chunk size so we'll allocate separately.
      ByteBuf buf =
          UnpooledByteBufAllocator.DEFAULT.directBuffer(initialCapacity, maxCapacity); // <-- FAILED HERE
{code}
When the OOM occurs, the call stack shows that we are at the line indicated
above. This says that the allocation request is larger than any chunk in the
cache (I think). Dumping the cache, we see:
{code}
Chunk(s) at 0~25%:
none
Chunk(s) at 0~50%:
Chunk(12122cd0: 1%, 40960/16777216)
Chunk(432d1a8f: 12%, 1998848/16777216)
Chunk(6bc20246: 0%, 0/16777216)
Chunk(5b40b4e5: 0%, 0/16777216)
Chunk(58b777f1: 0%, 0/16777216)
Chunk(73e5e70a: 0%, 0/16777216)
Chunk(84e1b02: 0%, 0/16777216)
Chunk(56777172: 0%, 0/16777216)
Chunk(359c8cb2: 0%, 0/16777216)
Chunk(699df0bc: 0%, 0/16777216)
Chunk(11f36086: 0%, 0/16777216)
Chunk(7ce26f2b: 0%, 0/16777216)
Chunk(2d4a8519: 0%, 0/16777216)
Chunk(2bd4881c: 0%, 0/16777216)
Chunk(21293ab0: 2%, 237568/16777216)
Chunk(4edd8289: 1%, 8192/16777216)
Chunk(37c6b406: 17%, 2744320/16777216)
Chunk(385d5e8a: 1%, 32768/16777216)
Chunk(50490f8b: 0%, 0/16777216)
Chunk(72a206c1: 0%, 0/16777216)
Chunk(7046ea17: 0%, 0/16777216)
Chunk(22bd539b: 0%, 0/16777216)
Chunk(3a902510: 0%, 0/16777216)
Chunk(5866a88d: 0%, 0/16777216)
Chunk(1fb7f7c4: 0%, 0/16777216)
Chunk(57de5e22: 0%, 0/16777216)
Chunk(6c5d496c: 0%, 0/16777216)
Chunk(192a6aa: 0%, 0/16777216)
Chunk(213b688b: 0%, 0/16777216)
Chunk(4b10dc0: 0%, 0/16777216)
Chunk(2212213: 0%, 0/16777216)
Chunk(1692730b: 0%, 0/16777216)
Chunk(6c173e62: 0%, 0/16777216)
Chunk(60c4f12d: 0%, 0/16777216)
Chunk(s) at 25~75%:
Chunk(6bfe669c: 0%, 0/16777216)
Chunk(6e715ac3: 0%, 0/16777216)
Chunk(3bc09d41: 0%, 0/16777216)
Chunk(7c4a4e8d: 0%, 0/16777216)
Chunk(64981d1e: 0%, 0/16777216)
Chunk(dbe40c: 0%, 0/16777216)
Chunk(3fce5bc3: 0%, 0/16777216)
Chunk(s) at 50~100%:
none
Chunk(s) at 75~100%:
Chunk(115e4491: 0%, 0/16777216)
Chunk(350acb49: 0%, 0/16777216)
Chunk(6a2ea260: 0%, 0/16777216)
Chunk(2773fca5: 0%, 0/16777216)
Chunk(446a4e16: 0%, 0/16777216)
Chunk(27d99551: 0%, 0/16777216)
Chunk(38fb1e68: 0%, 0/16777216)
Chunk(d54b06: 0%, 0/16777216)
Chunk(16d9aff4: 0%, 0/16777216)
Chunk(7dc1c363: 0%, 0/16777216)
Chunk(1da99aed: 0%, 0/16777216)
Chunk(378e6f25: 0%, 0/16777216)
Chunk(6cf3d02f: 0%, 0/16777216)
Chunk(1f5adc09: 0%, 0/16777216)
Chunk(4e7553fd: 0%, 0/16777216)
Chunk(a46ea51: 0%, 0/16777216)
Chunk(78c6219e: 0%, 0/16777216)
Chunk(31b5001b: 0%, 0/16777216)
Chunk(55bb476b: 0%, 0/16777216)
Chunk(68123bef: 0%, 0/16777216)
Chunk(21913da2: 0%, 0/16777216)
Chunk(383d4453: 0%, 0/16777216)
Chunk(3732cc20: 0%, 0/16777216)
Chunk(4e86446a: 0%, 0/16777216)
Chunk(66d21c35: 0%, 0/16777216)
Chunk(349fd360: 0%, 0/16777216)
Chunk(156d4a1f: 0%, 0/16777216)
Chunk(69b4e9cc: 0%, 0/16777216)
Chunk(1f71737b: 0%, 0/16777216)
Chunk(55bfa726: 0%, 0/16777216)
Chunk(2a7d323c: 0%, 0/16777216)
Chunk(64c94436: 0%, 0/16777216)
Chunk(70b7097f: 0%, 0/16777216)
Chunk(581906d8: 0%, 0/16777216)
Chunk(1b362335: 0%, 0/16777216)
Chunk(35f03c91: 0%, 0/16777216)
Chunk(7d4437a1: 0%, 0/16777216)
Chunk(6d7bd117: 0%, 0/16777216)
Chunk(47fe7806: 0%, 0/16777216)
Chunk(735ec0dc: 0%, 0/16777216)
Chunk(2ffb0829: 0%, 0/16777216)
Chunk(1cbb97a8: 0%, 0/16777216)
Chunk(28b1f271: 0%, 0/16777216)
Chunk(2d6c9f9b: 0%, 0/16777216)
Chunk(5a21605f: 0%, 0/16777216)
Chunk(1a67aa64: 0%, 0/16777216)
Chunk(3d62e123: 0%, 0/16777216)
Chunk(74bb2153: 0%, 0/16777216)
Chunk(25498403: 0%, 0/16777216)
Chunk(2da3e44: 0%, 0/16777216)
Chunk(281bbcc5: 0%, 0/16777216)
Chunk(587b12c: 0%, 0/16777216)
Chunk(6c874403: 0%, 0/16777216)
Chunk(3ffc7fc9: 0%, 0/16777216)
Chunk(4af41167: 0%, 0/16777216)
Chunk(72c2d7c4: 0%, 0/16777216)
Chunk(243332c3: 0%, 0/16777216)
Chunk(78ed13bb: 0%, 0/16777216)
Chunk(12f84ae8: 0%, 0/16777216)
Chunk(7660c384: 0%, 0/16777216)
Chunk(4bf852a1: 0%, 0/16777216)
Chunk(5b98f0ae: 0%, 0/16777216)
Chunk(be74e3f: 0%, 0/16777216)
Chunk(7b6bd024: 0%, 0/16777216)
Chunk(720ff8b2: 0%, 0/16777216)
Chunk(6e0e7bdd: 0%, 0/16777216)
Chunk(5fa94695: 0%, 0/16777216)
Chunk(7ae647b4: 0%, 0/16777216)
Chunk(77a1ea32: 0%, 0/16777216)
Chunk(6aecb788: 0%, 0/16777216)
Chunk(7fe4c9ae: 0%, 0/16777216)
Chunk(3777ea01: 0%, 0/16777216)
Chunk(4f7f76a7: 0%, 0/16777216)
Chunk(4020d837: 0%, 0/16777216)
Chunk(1950c024: 0%, 0/16777216)
Chunk(117f16ed: 0%, 0/16777216)
Chunk(2501802b: 0%, 0/16777216)
Chunk(63a605dc: 0%, 0/16777216)
Chunk(7ce8b86c: 0%, 0/16777216)
Chunk(15490162: 0%, 0/16777216)
Chunk(3c60db38: 0%, 0/16777216)
Chunk(6fbbb18d: 0%, 0/16777216)
Chunk(56a94fce: 0%, 0/16777216)
Chunk(bb61668: 0%, 0/16777216)
Chunk(3135b53d: 0%, 0/16777216)
Chunk(3b05d4f: 0%, 0/16777216)
Chunk(1f7ba5c8: 0%, 0/16777216)
Chunk(24c5e519: 0%, 0/16777216)
Chunk(38c520e1: 0%, 0/16777216)
Chunk(399e4893: 0%, 0/16777216)
Chunk(7b89ef8d: 0%, 0/16777216)
Chunk(706f30c8: 0%, 0/16777216)
Chunk(613cc40c: 0%, 0/16777216)
Chunk(2aadc268: 0%, 0/16777216)
Chunk(1eecb537: 0%, 0/16777216)
Chunk(178c3f52: 0%, 0/16777216)
Chunk(1017850b: 0%, 0/16777216)
Chunk(54edabe3: 0%, 0/16777216)
Chunk(2f53f944: 0%, 0/16777216)
Chunk(59532553: 0%, 0/16777216)
Chunk(7540ccaf: 0%, 0/16777216)
Chunk(4c4bc357: 0%, 0/16777216)
Chunk(7c629a43: 0%, 0/16777216)
Chunk(3cdb5121: 0%, 0/16777216)
Chunk(4f8dd7a1: 0%, 0/16777216)
Chunk(5d4ee47c: 0%, 0/16777216)
Chunk(3596dd14: 0%, 0/16777216)
Chunk(53a2d0de: 0%, 0/16777216)
Chunk(s) at 100%:
none
tiny subpages:
1: (2052: 2/512, offset: 32768, length: 8192, elemSize: 16)
2: (2915: 1/256, offset: 7102464, length: 8192, elemSize: 32)
4: (2473: 2/128, offset: 3481600, length: 8192, elemSize: 64)
8: (2049: 2/64, offset: 8192, length: 8192, elemSize: 128)
16: (2053: 2/32, offset: 40960, length: 8192, elemSize: 256)
small subpages:
1: (2048: 1/8, offset: 0, length: 8192, elemSize: 1024)
3: (2096: 1/2, offset: 393216, length: 8192, elemSize: 4096)
{code}
This allocator has chunks of 16,777,216 bytes (0x100_0000) each. Every one is
smaller than the 58,257,868 bytes we want to allocate, so the request bypasses
the pool and we ask the system for more memory. But we may have done that too
many times, and the system has nothing more to give. Result: OOM when plenty of
memory is available.
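The two size checks involved can be sketched with the figures quoted in this issue. This is only an illustration: {{fits_in_direct_memory}} is a simplified, assumed model of a direct-memory limit check, not the JDK's actual code, and all names are hypothetical.

```python
# Figures quoted in this comment and in the issue description.
CHUNK_SIZE = 16_777_216          # size of each pooled chunk (0x100_0000)
REQUEST = 58_257_868             # bytes the sort tried to allocate
MAX_MEMORY = 3_817_865_216       # maxMemory reported by Bits
TOTAL_CAPACITY = 3_800_769_206   # totalCapacity reported by Bits before the request


def bypasses_pool(request: int, chunk_size: int) -> bool:
    """Mirror of the `initialCapacity > directArena.chunkSize` test in the excerpt."""
    return request > chunk_size


def fits_in_direct_memory(request: int, total_capacity: int, max_memory: int) -> bool:
    """Assumed, simplified limit check: the request must fit in the remaining headroom."""
    return total_capacity + request <= max_memory


headroom = MAX_MEMORY - TOTAL_CAPACITY

print("bypasses pool:", bypasses_pool(REQUEST, CHUNK_SIZE))
print("headroom (bytes):", f"{headroom:,}")
print("fits:", fits_in_direct_memory(REQUEST, TOTAL_CAPACITY, MAX_MEMORY))
```

With these numbers the request is larger than a chunk, so it goes to the unpooled path, and the remaining headroom is smaller than the request, so the unpooled allocation cannot succeed.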
Doing a bit of Python on the cache dump above, we find we have:
* 149 chunks
* 5,062,656 bytes allocated(?)
* 2,499,805,184 bytes total in the chunks
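A minimal sketch of how such a tally could be computed, assuming the dump is available as text. This is not the script actually used; {{summarize}} and {{SAMPLE}} are illustrative names, and {{SAMPLE}} is a short excerpt of the dump rather than the full listing.

```python
import re

# Matches lines such as "Chunk(12122cd0: 1%, 40960/16777216)".
# Section headers like "Chunk(s) at 0~50%:" do not match, because "s)" is not a hex id.
CHUNK_RE = re.compile(r"Chunk\(([0-9a-f]+): (\d+)%, (\d+)/(\d+)\)")


def summarize(dump: str):
    """Return (chunk count, bytes used, total capacity) for a cache dump."""
    count = used = capacity = 0
    for match in CHUNK_RE.finditer(dump):
        count += 1
        used += int(match.group(3))      # bytes allocated within the chunk
        capacity += int(match.group(4))  # size of the chunk itself
    return count, used, capacity


# Short excerpt of the dump, for illustration only.
SAMPLE = """\
Chunk(s) at 0~50%:
Chunk(12122cd0: 1%, 40960/16777216)
Chunk(432d1a8f: 12%, 1998848/16777216)
Chunk(6bc20246: 0%, 0/16777216)
Chunk(s) at 100%:
none
"""

count, used, capacity = summarize(SAMPLE)
print(f"{count} chunks, {used:,} bytes used, {capacity:,} bytes total")
```

Running {{summarize}} over the full dump text instead of the excerpt gives the totals for the whole cache.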
The total count agrees, more-or-less, with the peak memory observed in the run.
The 5 MB allocated does *not*, however, agree with the 1,192,350,366 reported
by {{allocator.getAllocatedMemory()}}.
Total memory in use seems to be:
* 2,499,805,184 total in the chunks
* 1,192,350,366 reported by allocator, less
* 5,062,656 allocated in the chunks
* = 3,687,092,894 total.
This is quite close to the 3,800,769,206 reported in use by {{Bits}}, assuming
various other allocations in Drill.
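The estimate can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are the ones quoted in this issue.

```python
chunk_capacity = 2_499_805_184       # total size of all chunks in the cache
allocator_reported = 1_192_350_366   # allocator.getAllocatedMemory()
allocated_in_chunks = 5_062_656      # bytes in use inside the chunks
bits_total_capacity = 3_800_769_206  # totalCapacity reported by Bits

# Subtract the in-chunk allocations, as in the tally, so they are not counted twice.
estimated_in_use = chunk_capacity + allocator_reported - allocated_in_chunks

# Whatever is left is attributed to other allocations in Drill.
unaccounted = bits_total_capacity - estimated_in_use

print(f"estimated in use: {estimated_in_use:,}")
print(f"unaccounted:      {unaccounted:,}")
```

The gap between the estimate and the {{Bits}} figure comes to roughly 114 MB, about 3% of the total.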
So, our problem is memory fragmentation: once we fragment memory, we just can't
allocate any more large chunks and Bad Things Happen (TM).
> External sort fails to allocate merge memory when plenty is free
> ----------------------------------------------------------------
>
> Key: DRILL-5211
> URL: https://issues.apache.org/jira/browse/DRILL-5211
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Fix For: 1.9.0
>
>
> Consider a test of the external sort as follows:
> * Direct memory: 3GB
> * Input file: 18 GB, with one Varchar column of 8K width
> The sort runs, spilling to disk. Once all data arrives, the sort begins to
> merge the results. But, to do that, it must first do an intermediate merge.
> For example, in this sort, there are 190 spill files, but only 19 can be
> merged at a time. (Each merge file contains 128 MB batches, and only 19 can
> fit in memory, giving a total footprint of 2.5 GB, well below the 3 GB limit.)
> Yet, when loading batch xx, Drill fails with an OOM error. At that point,
> total available direct memory is 3,817,865,216. (Obtained from {{maxMemory}}
> in the {{Bits}} class in the JDK.)
> It appears that Drill wants to allocate 58,257,868 bytes, but the
> {{totalCapacity}} (again in {{Bits}}) is already 3,800,769,206, causing an
> OOM.
> The problem is that, at this point, the external sort should not ask the
> system for more memory. The allocator for the external sort is at just
> 1,192,350,366 before the allocation request. Plenty of spare memory should be
> available, released when the in-memory batches were spilled to disk prior to
> merging. Indeed, earlier in the run, the sort had reached a peak memory usage
> of 2,710,716,416 bytes. This memory should be available for reuse during
> merging, and is more than sufficient to fill the particular request in question.