Darla Baker created CASSANDRA-7552:
--------------------------------------

             Summary: Compactions Pending build up when using LCS
                 Key: CASSANDRA-7552
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7552
             Project: Cassandra
          Issue Type: Bug
          Components: Core
            Reporter: Darla Baker


We seem to be hitting an issue with LeveledCompactionStrategy while running
performance tests on a 4-node Cassandra installation. We are currently using
Cassandra 2.0.7.

In summary, we run a test consisting of approximately 8,000 inserts/sec,
16,000 gets/sec, and 8,000 deletes/sec. We have a grace period of 12 hours on
our column families.
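
To give a sense of scale for this workload, here is a rough back-of-the-envelope
sketch. It assumes (none of this is stated in the report) that the rates are
cluster-wide totals, that the "grace period" is the tables' gc_grace_seconds,
and that each delete writes exactly one tombstone. Replication is ignored. The
names below are illustrative only.

```python
# Rates and grace period as stated in the report.
INSERTS_PER_SEC = 8_000
DELETES_PER_SEC = 8_000
GRACE_PERIOD_HOURS = 12


def ops_in_window(rate_per_sec, hours):
    """Number of operations issued at a constant rate over `hours` hours."""
    return rate_per_sec * hours * 3600


# ASSUMPTION: one tombstone per delete, none purgeable before the grace
# period has elapsed.
tombstones_in_grace = ops_in_window(DELETES_PER_SEC, GRACE_PERIOD_HOURS)
inserts_in_grace = ops_in_window(INSERTS_PER_SEC, GRACE_PERIOD_HOURS)

print(tombstones_in_grace)  # 345600000
print(inserts_in_grace)     # 345600000
```

Under those assumptions, roughly 345.6 million deletes are issued within one
12-hour grace window. Whether this is related to the behaviour described below
is not established.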

At this rate, we observe a stable number of pending compaction tasks for about
22 to 26 hours. After that period, something happens and the number of pending
compaction tasks starts to increase rapidly, sometimes on one or two servers,
but sometimes on all four of them. This goes on until the uncompacted SSTables
start consuming all the disk space, after which the Cassandra cluster
generally fails.

When this occurs, the rate of completed compaction tasks usually decreases
over time, which seems to indicate that the existing compaction tasks take
more and more time to run.
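
One way to track the two figures discussed above (pending tasks and the rate
of completed tasks) is sketched below. It is a minimal illustration, not the
tooling used for these tests. It assumes the text being parsed contains a line
of the form "pending tasks: N", as printed by `nodetool compactionstats`. The
helper names and sample values are made up for illustration.

```python
import re


def parse_pending_tasks(compactionstats_output):
    """Return N from a 'pending tasks: N' line, or None if no such line."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    return int(match.group(1)) if match else None


def per_minute_deltas(samples):
    """Change per minute between consecutive (epoch_seconds, value) samples.

    `samples` must be in time order; pairs with no elapsed time are skipped.
    """
    deltas = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed_min = (t1 - t0) / 60.0
        if elapsed_min > 0:
            deltas.append((v1 - v0) / elapsed_min)
    return deltas


# Made-up values for illustration only; NOT measurements from this report.
example_output = "pending tasks: 42\n"
example_samples = [(0, 10), (60, 12), (120, 20)]

print(parse_pending_tasks(example_output))
print(per_minute_deltas(example_samples))
```

Fed with periodic samples of the pending count, a sustained positive delta
would correspond to the "falling behind" phase. Fed with a completed-tasks
counter, a shrinking delta would correspond to the slowdown described above.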

On other occasions, I could reproduce a similar issue in less than 12 hours.
Although the traffic rate remains constant, we seem to hit this at varying
intervals. Yesterday I could reproduce it in less than 6 hours.

We have two different deployments on which we have tested this issue: 
1. 4x IBM HS22, using a RAM disk as the Cassandra data directory (thus
eliminating disk I/O)
2. 8x IBM HS23, with SSD disks, deployed in two "geo-redundant" data centers of 
4 nodes each, and a latency of 50ms between the data centers.

I can reproduce the "compaction tasks falling behind" behaviour on both of
these setups, although it could be occurring for different reasons. Because of
#1, I do not believe we are hitting an I/O bottleneck just yet.

As an additional interesting note, if I artificially pause the traffic when I
see the pending compaction task issue occurring, then:

1. The number of pending compaction tasks obviously stops increasing, but
stays the same for 15 minutes (as if nothing is running).
2. The number of completed compaction tasks falls to 0 for 15 minutes.
3. After 15 to 20 minutes, out of the blue, all compactions complete in less
than 2 minutes.

If I restart the traffic after that, the system is stable for a few hours, but 
the issue always comes back.

We have written a small test tool that reproduces our application's Cassandra
interaction.

We have not managed to run a test for more than 30 hours under load, and
every failure has followed a similar pattern.




--
This message was sent by Atlassian JIRA
(v6.2#6252)