[
https://issues.apache.org/jira/browse/CASSANDRA-14355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992973#comment-16992973
]
Chris Kistner commented on CASSANDRA-14355:
-------------------------------------------
[~benedict], thank you for the feedback.
After some extensive investigation on our end, we traced our issue to Cassandra
Reaper terminating Cassandra repair sessions that don't finish within 30
minutes (the "hangingRepairTimeoutMins" default). Cassandra 3.11.4 then doesn't
close those repair threads, so they retain ~200MB of data on heap in our case.
The other annoying part is that when this happens, both Cassandra and Cassandra
Reaper log the repair session as if it were successful, so you only notice the
problem if you go and look at the logs in detail.
I think I should rather create a new bug for Cassandra not closing the Repair
session threads correctly, since in the original issue the
"io.netty.util.concurrent.FastThreadLocalThread" class was referenced from
"Native-Transport-Requests" and "ReadStage" threads, whereas ours were all
from "Repair" threads.
We may be able to try reproducing the Cassandra issue where Repair threads are
not closed properly, first on 3.11.4 and then on 3.11.5.
So far we've only experienced these OOME issues in client production
environments and right now all our clients are in code/change freeze, so we
won't be able to test 3.11.5 for about a month in a large production
environment.
So for now we'll just schedule per-host, per-table repairs using cron scripts,
repairing a different host each day of the week, instead of using Cassandra
Reaper, which might terminate our repair sessions.
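As a rough illustration of that workaround, here is a minimal sketch of a
script that cron could run once a day. It picks one host based on the day of
the week and builds one "nodetool repair" command per table. The host names,
keyspace, and table names are hypothetical placeholders, not taken from our
environment, and the script only prints the commands rather than running them.

```python
from datetime import date

# Hypothetical cluster layout: these hosts, the keyspace, and the tables are
# placeholders for illustration only.
HOSTS = ["cass-node-1", "cass-node-2", "cass-node-3"]
KEYSPACE = "example_ks"
TABLES = ["events", "users"]


def host_for_day(day, hosts):
    """Pick the host to repair on the given date, keyed on the weekday."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return hosts[day.weekday() % len(hosts)]


def repair_commands(day, hosts, keyspace, tables):
    """Build one nodetool repair command per table for that day's host."""
    host = host_for_day(day, hosts)
    # One invocation per table keeps each repair small and makes a failure
    # easy to attribute to a specific table.
    return [["nodetool", "-h", host, "repair", keyspace, table]
            for table in tables]


if __name__ == "__main__":
    # Print the commands only; a real script would run them one at a time
    # and check each exit code.
    for cmd in repair_commands(date.today(), HOSTS, KEYSPACE, TABLES):
        print(" ".join(cmd))
```

With seven or fewer hosts, each weekday maps to a single host. With fewer than
seven, some hosts are repaired more than once per week, and with more than
seven, a different rotation key (for example the day of the year) would be
needed.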
> Memory leak
> -----------
>
> Key: CASSANDRA-14355
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14355
> Project: Cassandra
> Issue Type: Bug
> Components: Legacy/Core
> Environment: Debian Jessie, OpenJDK 1.8.0_151
> Reporter: Eric Evans
> Priority: Normal
> Fix For: 3.11.x
>
> Attachments: 01_Screenshot from 2018-04-04 14-24-00.png,
> 02_Screenshot from 2018-04-04 14-28-33.png, 03_Screenshot from 2018-04-04
> 14-24-50.png, LongGC_Dominator-Tree.png, LongGC_Histogram.png,
> LongGC_Problem-Suspect-1_FastThreadLocalThread.png, LongGC_nodetool_info.txt
>
>
> We're seeing regular, frequent {{OutOfMemoryError}} exceptions. Similar to
> CASSANDRA-13754, an analysis of the heap dumps shows the heap consumed by the
> {{threadLocals}} member of the instances of
> {{io.netty.util.concurrent.FastThreadLocalThread}}.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)