[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692447#comment-16692447 ]

Sumanth Pasupuleti edited comment on CASSANDRA-14855 at 11/20/18 12:38 AM:
---

Appreciate your thoughts, [~benedict]. I am trying to figure out a way forward, since there has been no input from anyone else. I also like the suggestion of keeping the existing Flusher on by default and making ImmediateFlusher optional (through a yaml property such as native_transport_flush_immediate, set to false by default). I can work on a patch for that. Let me know.

was (Author: sumanth.pasupuleti):
Appreciate your thoughts [~benedict] . Trying to figure out a way forward since there have not been inputs from anyone else. I also like the suggestion of keeping the existing flusher ON by default, and making ImmediateFlusher usage optional (through yaml property like native_transport_flush_in_batches_immediate which is set to false by default) - I can work on a patch for that. Let me know.

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> --
>
> Key: CASSANDRA-14855
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sumanth Pasupuleti
> Assignee: Sumanth Pasupuleti
> Priority: Major
> Fix For: 3.0.17
>
> Attachments: blocked_thread_pool.png, cpu.png, eventloop_scheduledtasks.png, flusher running state.png, heap.png, heap_dump.png, read_latency.png
>
>
> We recently had a production issue where about 10 nodes in a 96-node cluster ran out of heap.
> From heap dump analysis, I believe there is enough evidence to indicate that the `queued` data member of the Flusher grew too large, resulting in out of memory.
> Below are the specifics of what we found in the heap dump (relevant screenshots attached):
> * A non-empty "queued" data member of Flusher with a retained heap of 0.5GB, and multiple such instances.
> * The "running" data member of Flusher had the value "true".
> * The size of scheduledTasks on the event loop was 0.
> We suspect something (maybe an exception) caused the Flusher running state to remain true while the Flusher was unable to schedule itself with the event loop.
> We could not find any ERROR in system.log, only the following INFO log around the incident time.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
> io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
>     at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>     at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: Backport trunk's ImmediateFlusher ([CASSANDRA-13651|https://issues.apache.org/jira/browse/CASSANDRA-13651] https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec) to 3.0.x and maybe to other versions as well, since ImmediateFlusher seems to be more robust than the existing Flusher as it does not depend on any running state/scheduling.
> # Make the "queued" data member of the Flusher bounded, to avoid any potential for out of memory caused by its otherwise unbounded nature.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
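The failure mode suspected in the description, and the second proposal (a bounded queue), can be illustrated with a small model. This is only a sketch under stated assumptions, not Cassandra's Flusher: the class `SketchFlusher`, its fields, and the "lossy" executor that silently drops the task are all hypothetical names chosen for illustration, and a plain `Executor` stands in for the Netty event loop.

```java
import java.util.Queue;
import java.util.concurrent.Executor;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class StuckFlusherSketch {

    /** Simplified, hypothetical model of a flag-guarded flusher; not Cassandra's class. */
    static final class SketchFlusher implements Runnable {
        final Queue<String> queued;                              // pending responses
        final AtomicBoolean running = new AtomicBoolean(false);  // "a drain task is pending"
        final Executor loop;                                     // stands in for the event loop
        int written = 0;                                         // responses drained so far

        SketchFlusher(Executor loop, int capacity) {
            this.loop = loop;
            this.queued = new LinkedBlockingQueue<>(capacity);
        }

        /** Returns false when the bounded queue is full, so the caller can shed load. */
        boolean enqueue(String response) {
            boolean accepted = queued.offer(response);
            start();
            return accepted;
        }

        void start() {
            // Only the caller that flips running from false to true submits the task.
            if (running.compareAndSet(false, true))
                loop.execute(this);
        }

        @Override
        public void run() {
            while (queued.poll() != null)
                written++;
            running.set(false);
            // Re-check so items added between the last poll and the reset are not stranded.
            if (!queued.isEmpty())
                start();
        }
    }

    public static void main(String[] args) {
        Executor healthy = Runnable::run;  // runs the drain task inline
        Executor lossy = task -> { };      // assumption: the task is lost without any error

        SketchFlusher ok = new SketchFlusher(healthy, Integer.MAX_VALUE);
        SketchFlusher stuck = new SketchFlusher(lossy, Integer.MAX_VALUE);
        SketchFlusher bounded = new SketchFlusher(lossy, 100);
        int rejected = 0;
        for (int i = 0; i < 1000; i++) {
            ok.enqueue("r" + i);
            stuck.enqueue("r" + i);
            if (!bounded.enqueue("r" + i))
                rejected++;
        }
        System.out.println("healthy: queued=" + ok.queued.size() + " running=" + ok.running.get());
        System.out.println("stuck:   queued=" + stuck.queued.size() + " running=" + stuck.running.get());
        System.out.println("bounded: queued=" + bounded.queued.size() + " rejected=" + rejected);
    }
}
```

In this model, once the drain task is lost while `running` is still true, no later `enqueue` call resubmits it, so the queue only grows. That matches the shape of the heap dump findings (non-empty `queued`, `running` true, nothing scheduled) without claiming it is the actual root cause. The bounded variant caps memory, but the caller then has to decide what to do with rejected responses.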
[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692447#comment-16692447 ]

Sumanth Pasupuleti edited comment on CASSANDRA-14855 at 11/20/18 12:29 AM:
---

Appreciate your thoughts [~benedict] . Trying to figure out a way forward since there have not been inputs from anyone else. I also like the suggestion of keeping the existing flusher ON by default, and making ImmediateFlusher usage optional (through yaml property like native_transport_flush_in_batches_immediate which is set to false by default) - I can work on a patch for that. Let me know.

was (Author: sumanth.pasupuleti):
Appreciate your thoughts [~benedict] . Trying to figure out a way forward since there have not been inputs from anyone else. I also like the suggestion of keeping the existing flusher ON by default, and making ImmediateFlusher usage optional (through yaml property like use_immediate_flusher which is set to false by default) - I can work on a patch for that. Let me know.
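The opt-in idea in this comment (existing behaviour by default, immediate flushing only when a yaml property is set) could look roughly like the sketch below. Everything here is an assumption made for illustration: `SketchConfig` stands in for the parsed cassandra.yaml, the field reuses the property name proposed in the comment (which was still changing between edits), and `BatchingFlusher` and `WriteThroughFlusher` are hypothetical toy classes, not the Flusher or ImmediateFlusher from the Cassandra code base.

```java
import java.util.ArrayList;
import java.util.List;

class FlusherSelectionSketch {

    /** Hypothetical stand-in for the parsed yaml; the name is the one proposed in the comment. */
    static final class SketchConfig {
        boolean native_transport_flush_in_batches_immediate = false; // proposed default: off
    }

    interface ResponseFlusher {
        void enqueue(String response);
        List<String> flushedSoFar();
    }

    /** Toy batching strategy: holds responses until batchSize of them are pending. */
    static final class BatchingFlusher implements ResponseFlusher {
        private final int batchSize;
        private final List<String> pending = new ArrayList<>();
        private final List<String> flushed = new ArrayList<>();

        BatchingFlusher(int batchSize) { this.batchSize = batchSize; }

        @Override
        public void enqueue(String response) {
            pending.add(response);
            if (pending.size() >= batchSize) {
                flushed.addAll(pending);
                pending.clear();
            }
        }

        @Override
        public List<String> flushedSoFar() { return flushed; }
    }

    /** Toy write-through strategy: every response is flushed as soon as it is enqueued. */
    static final class WriteThroughFlusher implements ResponseFlusher {
        private final List<String> flushed = new ArrayList<>();

        @Override
        public void enqueue(String response) { flushed.add(response); }

        @Override
        public List<String> flushedSoFar() { return flushed; }
    }

    /** Picks the strategy from the flag; the batching path stays the default. */
    static ResponseFlusher create(SketchConfig config) {
        return config.native_transport_flush_in_batches_immediate
             ? new WriteThroughFlusher()
             : new BatchingFlusher(3);
    }

    public static void main(String[] args) {
        SketchConfig defaults = new SketchConfig();
        SketchConfig optedIn = new SketchConfig();
        optedIn.native_transport_flush_in_batches_immediate = true;

        ResponseFlusher batching = create(defaults);
        ResponseFlusher immediate = create(optedIn);
        batching.enqueue("r1");
        immediate.enqueue("r1");
        System.out.println("default flushed:  " + batching.flushedSoFar().size());
        System.out.println("opted-in flushed: " + immediate.flushedSoFar().size());
    }
}
```

Keeping the flag false by default means existing clusters see no behaviour change on upgrade, which appears to be the point of the suggestion; operators would opt in per cluster.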
[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16692447#comment-16692447 ]

Sumanth Pasupuleti edited comment on CASSANDRA-14855 at 11/20/18 12:07 AM:
---

Appreciate your thoughts [~benedict] . Trying to figure out a way forward since there have not been inputs from anyone else. I also like the suggestion of keeping the existing flusher ON by default, and making ImmediateFlusher usage optional (through yaml property like use_immediate_flusher which is set to false by default) - I can work on a patch for that. Let me know.

was (Author: sumanth.pasupuleti):
Appreciate your thoughts [~benedict] . Trying to figure out a way forward since there have not been inputs from anyone else. I also like the suggestion of keeping the existing flusher ON by default, and making immediate_flusher usage optional (through yaml property like use_immediate_flusher which is set to false by default) - I can work on a patch for that. Let me know.
[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681949#comment-16681949 ]

Sumanth Pasupuleti edited comment on CASSANDRA-14855 at 11/9/18 9:26 PM:
-

[~zznate] Yes, this was the first time we saw this (and there was a second incident that followed with similar characteristics on the same cluster). This happened only on one cluster (this cluster is our most read heavy 3.0 CQL cluster), and following are the characteristics:
* 3.0.17 C* version
* CQL
* Relatively high read traffic (~60k rps at peak at coordinator level)
* Has client side wire compression (LZ4) enabled
* Total outbound traffic of ~4Gbps across the cluster

was (Author: sumanth.pasupuleti):
[~zznate] Yes, this was the first time we saw this (and there was a second incident with similar characteristics on the same cluster). This happened only on one cluster (this cluster is our most read heavy 3.0 CQL cluster), and following are the characteristics:
* 3.0.17 C* version
* CQL
* Relatively high read traffic (~60k rps at peak at coordinator level)
* Has client side wire compression (LZ4) enabled
* Total outbound traffic of ~4Gbps across the cluster
[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681949#comment-16681949 ]

Sumanth Pasupuleti edited comment on CASSANDRA-14855 at 11/9/18 9:27 PM:
-

[~zznate] Yes, this was the first time we saw this (and there was a second incident that followed with similar characteristics on the same cluster). This happened only on one cluster (this is our most read heavy 3.0 CQL cluster), and following are the characteristics:
* 3.0.17 C* version
* CQL
* Relatively high read traffic (~60k rps at peak at coordinator level)
* Has client side wire compression (LZ4) enabled
* Total outbound traffic of ~4Gbps across the cluster

was (Author: sumanth.pasupuleti):
[~zznate] Yes, this was the first time we saw this (and there was a second incident that followed with similar characteristics on the same cluster). This happened only on one cluster (this cluster is our most read heavy 3.0 CQL cluster), and following are the characteristics:
* 3.0.17 C* version
* CQL
* Relatively high read traffic (~60k rps at peak at coordinator level)
* Has client side wire compression (LZ4) enabled
* Total outbound traffic of ~4Gbps across the cluster
[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16681689#comment-16681689 ]

Nate McCall edited comment on CASSANDRA-14855 at 11/9/18 4:46 PM:
--

[~sumanth.pasupuleti] Thanks for the detailed analysis and patch. Was this the first time your team has seen this or has it potentially manifested in other clusters previously?

was (Author: zznate):
[~sumanth.pasupuleti] Thanks for the detailed analysis. Was this the first time your team has seen this or has it potentially manifested in other clusters previously?
[jira] [Comment Edited] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory
[ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16671132#comment-16671132 ]

Sumanth Pasupuleti edited comment on CASSANDRA-14855 at 11/1/18 5:36 AM:
-

Backport of ImmediateFlusher to 3.0
https://github.com/sumanth-pasupuleti/cassandra/tree/3.0_backport_immediate_flusher

Passing UTs: https://circleci.com/gh/sumanth-pasupuleti/cassandra/139

was (Author: sumanth.pasupuleti):
Backport of ImmediateFlusher to 3.0
https://github.com/sumanth-pasupuleti/cassandra/tree/3.0_backport_immediate_flusher