[ https://issues.apache.org/jira/browse/CASSANDRA-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180704#comment-14180704 ]

Oleg Kibirev commented on CASSANDRA-5039:
-----------------------------------------

I no longer work on this particular project, but basically the problem happened 
when I ran 3 nodes on different disks of the same machine, loaded them heavily 
with inserts, and then pulled out one of the disks. The remaining nodes would 
run out of memory because they queued many operations before discovering that 
the destination had died.

Setting a limit on the corresponding queue allowed the system to remain 
operational.
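
For illustration, a minimal Java sketch of that kind of fix (the class name, 
queue capacity, and offer timeout are assumptions for the example, not 
Cassandra's actual internals):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class BoundedOutboundQueue {
        // Cap the backlog so an unresponsive peer cannot consume unbounded heap.
        private final BlockingQueue<byte[]> pending = new ArrayBlockingQueue<>(1024);

        // Fails fast (returns false) instead of queueing forever when full.
        public boolean enqueue(byte[] message) throws InterruptedException {
            return pending.offer(message, 100, TimeUnit.MILLISECONDS);
        }
    }

With an unbounded queue, enqueue would always succeed and the heap would grow 
until OOM; with the bound, producers see back-pressure as soon as the 
destination stops draining.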

We have seen similar OOMs in production. That's about all the details I 
remember about this experiment.

> Make sure all instances of BlockingQueue have configurable and sane limits
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5039
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5039
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.1.7
>            Reporter: Oleg Kibirev
>            Priority: Minor
>              Labels: performance
>
> Currently, most BlockingQueues in Cassandra are created without any limits 
> (execution stages) or with limits high enough to consume gigabytes of heap 
> (PeriodicCommitLogExecutorService). I have observed many cases where a single 
> unresponsive node can bring down the entire cluster because the others 
> accumulate huge backlogs of operations.
> We need to make sure each queue is configurable through a yaml entry or a 
> system property, and that defaults are chosen so that any given queue doesn't 
> consume more than 100 MB of heap. I have successfully tested that adding these 
> limits makes the cluster resistant to heavy load or a bad node.
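
For illustration only, a hedged sketch of what such a configurable limit might 
look like (the system property name, default capacity, and factory class are 
hypothetical, not part of any actual Cassandra patch):

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public final class QueueFactory {
        // Hypothetical property; a real patch would also expose a yaml entry.
        private static final int CAPACITY =
            Integer.getInteger("cassandra.stage_queue_capacity", 4096);

        // A full queue produces back-pressure instead of heap growth.
        public static <T> BlockingQueue<T> newStageQueue() {
            return new LinkedBlockingQueue<>(CAPACITY);
        }
    }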


