Do you have a trace, i.e., what is the source of the `io.netty.*` calls?

And have you tried bumping `-XX:MaxDirectMemorySize`?
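
For example (the `2g` below is only a placeholder; the right value depends on your workload), assuming the allocations are failing on the driver:

    spark-submit \
      --conf "spark.driver.extraJavaOptions=-XX:MaxDirectMemorySize=2g" \
      ...

Note that this has to go on the `spark-submit` command line or into `spark-defaults.conf`; setting it programmatically in `SparkConf` is too late, since the driver JVM is already running by then.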

On Tue, Mar 6, 2018 at 12:45 AM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:

> Hi All
>
> I have a job which processes a large dataset. All items in the dataset
> are unrelated. To save on cluster resources, I process these items in
> chunks. Since the chunks are independent of each other, I start and shut
> down the Spark context for each chunk. This keeps the DAG small and
> avoids retrying the entire DAG in case of failures. This mechanism used
> to work fine with Spark 1.6. Now that we have moved to 2.2, the job has
> started failing with OutOfDirectMemoryError.
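>
> In outline, the pattern looks like this (a minimal Scala sketch; the
> chunking and the per-chunk work are illustrative stand-ins):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     // Hypothetical chunking: each chunk gets a fresh context so its
>     // DAG and lineage never accumulate across chunks.
>     val chunks: Seq[Seq[Int]] = (1 to 60).grouped(10).toSeq
>     for (chunk <- chunks) {
>       val sc = new SparkContext(new SparkConf().setAppName("chunk-job"))
>       try {
>         sc.parallelize(chunk).map(_ * 2).collect() // stand-in for real work
>       } finally {
>         sc.stop() // shut the context down before starting the next chunk
>       }
>     }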
>
> 2018-03-03 22:00:59,687 WARN [rpc-server-48-1] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(78)) - Exception in connection from /10.66.73.27:60374
>
> io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 8388608 byte(s) of direct memory (used: 1023410176, max: 1029177344)
>
>   at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:506)
>   at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:460)
>   at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:701)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:690)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
>   at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
>   at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
>   at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
>   at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:564)
>
> I got some clue about what is causing this from
> https://github.com/netty/netty/issues/6343; however, I am not able to
> make the numbers add up for what is filling 1 GB of direct memory.
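>
> The arithmetic of the failing allocation itself is straightforward; it
> is the ~976 MiB already in use that I cannot account for:
>
>     max:       1029177344 B  = 981.5 MiB  (netty's direct-memory limit)
>     used:      1023410176 B  = 976.0 MiB
>     requested:    8388608 B  =   8.0 MiB
>     976.0 MiB + 8.0 MiB > 981.5 MiB, hence the failure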
>
> Output from jmap (instance counts and total bytes per class):
>
>      num   #instances    #bytes   class name
>        7:       22230   1422720   io.netty.buffer.PoolSubpage
>       12:        1370    804640   io.netty.buffer.PoolSubpage[]
>       41:        3600    144000   io.netty.buffer.PoolChunkList
>       98:        1440     46080   io.netty.buffer.PoolThreadCache$SubPageMemoryRegionCache
>      113:         300     40800   io.netty.buffer.PoolArena$HeapArena
>      114:         300     40800   io.netty.buffer.PoolArena$DirectArena
>      192:         198     15840   io.netty.buffer.PoolChunk
>      274:         120      8320   io.netty.buffer.PoolThreadCache$MemoryRegionCache[]
>      406:         120      3840   io.netty.buffer.PoolThreadCache$NormalMemoryRegionCache
>      422:          72      3552   io.netty.buffer.PoolArena[]
>      458:          30      2640   io.netty.buffer.PooledUnsafeDirectByteBuf
>      500:          36      2016   io.netty.buffer.PooledByteBufAllocator
>      529:          32      1792   io.netty.buffer.UnpooledUnsafeHeapByteBuf
>      589:          20      1440   io.netty.buffer.PoolThreadCache
>      630:          37      1184   io.netty.buffer.EmptyByteBuf
>      703:          36       864   io.netty.buffer.PooledByteBufAllocator$PoolThreadLocalCache
>      852:          22       528   io.netty.buffer.AdvancedLeakAwareByteBuf
>      889:          10       480   io.netty.buffer.SlicedAbstractByteBuf
>      917:           8       448   io.netty.buffer.UnpooledHeapByteBuf
>     1018:          20       320   io.netty.buffer.PoolThreadCache$1
>     1305:           4       128   io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
>     1404:           1        80   io.netty.buffer.PooledUnsafeHeapByteBuf
>     1473:           3        72   io.netty.buffer.PoolArena$SizeClass
>     1529:           1        64   io.netty.buffer.AdvancedLeakAwareCompositeByteBuf
>     1541:           2        64   io.netty.buffer.CompositeByteBuf$Component
>     1568:           1        56   io.netty.buffer.CompositeByteBuf
>     1896:           1        32   io.netty.buffer.PoolArena$SizeClass[]
>     2042:           1        24   io.netty.buffer.PooledUnsafeDirectByteBuf$1
>     2046:           1        24   io.netty.buffer.UnpooledByteBufAllocator
>     2051:           1        24   io.netty.buffer.PoolThreadCache$MemoryRegionCache$1
>     2078:           1        24   io.netty.buffer.PooledHeapByteBuf$1
>     2135:           1        24   io.netty.buffer.PooledUnsafeHeapByteBuf$1
>     2302:           1        16   io.netty.buffer.ByteBufUtil$1
>     2769:           1        16   io.netty.util.internal.__matchers__.io.netty.buffer.ByteBufMatcher
>
>
>
> My driver machine has 32 CPUs, and the cluster currently has 15
> machines. At the moment the error happens while processing the 5th or
> 6th chunk. I suspect the error depends on the number of executors and
> would happen earlier if we added more executors.
>
>
> I am trying to come up with an explanation of what is filling up the
> direct memory and how to quantify it as a function of the number of
> executors. Ours is a shared cluster, and we need to understand how much
> driver memory to allocate for most of the jobs.
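>
> One way to measure this per chunk would be to log the JVM's direct-buffer
> pool between chunks (a small Scala sketch using the standard
> BufferPoolMXBean; note that buffers netty allocates through its
> no-cleaner path bypass this counter, so the figure may undercount, but
> the growth trend across chunks should still show):
>
>     import java.lang.management.{BufferPoolMXBean, ManagementFactory}
>     import scala.collection.JavaConverters._
>
>     // Print the JVM "direct" buffer pool: live buffer count and bytes used.
>     def logDirectMemory(tag: String): Unit =
>       ManagementFactory.getPlatformMXBeans(classOf[BufferPoolMXBean])
>         .asScala
>         .filter(_.getName == "direct")
>         .foreach(p => println(s"[$tag] direct buffers: count=${p.getCount}, used=${p.getMemoryUsed} B"))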
>
> Regards
> Sumit Chawla
>
>
