[jira] [Created] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory

Sumanth Pasupuleti (JIRA) Sun, 28 Oct 2018 22:51:01 -0700

Sumanth Pasupuleti created CASSANDRA-14855:
----------------------------------------------


             Summary: Message Flusher scheduling fell off the event loop, 
resulting in out of memory
                 Key: CASSANDRA-14855
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Sumanth Pasupuleti
             Fix For: 3.0.17
         Attachments: blocked_thread_pool.png, cpu.png, 
eventloop_scheduledtasks.png, flusher running state.png, heap.png, 
heap_dump.png, read_latency.png

We recently had a production issue where about 10 nodes in a 96-node cluster 
ran out of heap.

From heap dump analysis, I believe there is enough evidence to indicate that 
the `queued` data member of the Flusher got too big, resulting in out of memory.
Below are specifics on what we found from the heap dump (relevant screenshots 
attached):
* A non-empty "queued" data member of the Flusher with a retained heap of 
0.5GB, and multiple such instances.
* The "running" data member of the Flusher had the value "true".
* Size of scheduledTasks on the eventloop was 0.

We suspect something (maybe an exception) left the Flusher's running state set 
to true while the Flusher was no longer able to schedule itself with the event 
loop. We could not find any ERROR in the system.log, only the following INFO 
logs around the incident time.


{code:java}
INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
 at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
 at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
{code}
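
To make the suspected failure mode concrete, here is a minimal sketch of a 
self-rescheduling flusher whose reschedule gets lost. It is a simplified, 
hypothetical model and not Cassandra's actual Flusher: the FlusherSketch and 
FakeEventLoop classes, the dropSchedules switch and the idle rule are all 
assumptions made for illustration. Only the "queued" and "running" names and 
the scheduledTasks count mirror what we saw in the heap dump.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical, simplified model of a self-rescheduling flusher. */
class FlusherSketch implements Runnable {

    /** Stand-in for the event loop: holds tasks until runPending() is called. */
    static class FakeEventLoop {
        final Queue<Runnable> scheduledTasks = new ArrayDeque<>();
        // Assumption for the sketch: when true, schedule() silently loses the task.
        boolean dropSchedules = false;

        void schedule(Runnable task) {
            if (!dropSchedules) {
                scheduledTasks.add(task);
            }
        }

        /** Runs the tasks that were pending when the call started. */
        void runPending() {
            int pending = scheduledTasks.size();
            for (int i = 0; i < pending; i++) {
                scheduledTasks.poll().run();
            }
        }
    }

    final Queue<String> queued = new ConcurrentLinkedQueue<>();
    final AtomicBoolean running = new AtomicBoolean(false);
    final FakeEventLoop loop;
    int flushed = 0;

    FlusherSketch(FakeEventLoop loop) {
        this.loop = loop;
    }

    /** Producer side: enqueue, and schedule only if no run is already in flight. */
    void enqueue(String item) {
        queued.add(item);
        if (running.compareAndSet(false, true)) {
            loop.schedule(this);
        }
    }

    @Override
    public void run() {
        boolean drainedAny = false;
        while (queued.poll() != null) {
            flushed++;
            drainedAny = true;
        }
        if (drainedAny) {
            // Stay "running" and rely on the reschedule actually taking effect.
            loop.schedule(this);
        } else {
            // Nothing to do: go idle so the next enqueue() schedules us again.
            running.set(false);
        }
    }
}
```

In this model, if one item is enqueued, dropSchedules is then set to true, and 
runPending() is called, the flusher drains that item but its reschedule is 
lost. Every later enqueue() sees running == true and schedules nothing, so 
"queued" keeps growing while scheduledTasks stays empty, which is the same 
combination of symptoms as in the heap dump.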

I would like to pursue the following proposals to fix this issue:
# ImmediateFlusher: Backport trunk's ImmediateFlusher to 3.0.x, and maybe to 
other versions as well, since ImmediateFlusher seems to be more robust than the 
existing Flusher: it does not depend on any running state or scheduling.
# Bounded queue: Make the "queued" data member of the Flusher bounded, so that 
it cannot grow without limit and cause an out of memory condition.
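
As a rough illustration of the second proposal, the sketch below wraps a 
fixed-capacity queue and rejects items once it is full instead of growing. The 
BoundedFlushQueue class, its capacity parameter and the "reject and count" 
policy are assumptions for illustration only; what the Flusher should actually 
do on rejection (for example, apply back pressure or fail the request) is left 
open here.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical bounded replacement for an unbounded "queued" collection. */
class BoundedFlushQueue<T> {
    private final BlockingQueue<T> queued;
    private final AtomicLong rejected = new AtomicLong();

    BoundedFlushQueue(int capacity) {
        // ArrayBlockingQueue has a fixed capacity chosen at construction time.
        this.queued = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Tries to enqueue without blocking. Returns false when the queue is full,
     * so the caller has to decide how to shed load.
     */
    boolean tryEnqueue(T item) {
        boolean accepted = queued.offer(item);
        if (!accepted) {
            rejected.incrementAndGet();
        }
        return accepted;
    }

    /** Returns the next item, or null if the queue is empty. */
    T poll() {
        return queued.poll();
    }

    int size() {
        return queued.size();
    }

    long rejectedCount() {
        return rejected.get();
    }
}
```

With a capacity of 2, for example, a third tryEnqueue() call returns false and 
the queue size stays at 2, so even a flusher that has stopped draining can 
retain only a bounded amount of heap.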





