[ https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Thompson updated CASSANDRA-9033:
---------------------------------------
    Assignee: Marcus Eriksson

> Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes 
> unresponsive
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9033
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9033
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: * Ubuntu 14.04.2 - Linux ip-10-0-2-122 3.13.0-46-generic 
> #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> * EC2 m2.xlarge instances [4 CPU, 16GB RAM, 1TB storage on 3 platters]
> * 12 nodes running a mix of 2.1.1 and 2.1.3
> * 8GB heap size with offheap objects
>            Reporter: Brent Haines
>            Assignee: Marcus Eriksson
>            Priority: Blocker
>         Attachments: cassandra-env.sh, cassandra.yaml, system.log.1.zip
>
>
> We have an Event Log table using LCS that has grown quickly; there are now 
> more than 100K sstable files of roughly 1KB each. Increasing the number of 
> compactors and raising the compaction throughput limit doesn't make a 
> difference. The table had been running fine until we upgraded to 2.1.3. 
> Those nodes needed more heap (12 GB) to even have a prayer of responding to 
> queries. They bog down and become unresponsive. I see no GC messages and no 
> compaction activity either. 
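> For reference, this is roughly what I tried; the throughput value is just an 
> example, and concurrent_compactors is a cassandra.yaml setting that needs a 
> restart on 2.1: 
> {code}
> # remove the compaction throughput cap entirely (0 = unthrottled)
> nodetool setcompactionthroughput 0
>
> # cassandra.yaml (example value; requires a node restart):
> #   concurrent_compactors: 4
>
> # check whether compactions are actually making progress
> nodetool compactionstats
> {code}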
> The only work-around I have found is to decommission the node, blow away the 
> big CF, and rejoin. That takes about 20 minutes and everything is freaking 
> happy again. The sstable sizes afterwards are more like what I'd expect as 
> well. 
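> Concretely, the procedure looks like this (paths assume the default 
> /var/lib/cassandra layout on our boxes; the stories-<id> directory suffix 
> varies per table): 
> {code}
> # remove the node from the ring, then stop it
> nodetool decommission
> sudo service cassandra stop
>
> # wipe system state plus the offending CF so the node rejoins fresh
> rm -rf /var/lib/cassandra/data/system /var/lib/cassandra/data/system_traces
> rm -rf /var/lib/cassandra/data/OpsCenter
> rm -rf /var/lib/cassandra/data/data/stories-*
>
> # restart; the node bootstraps and rejoins
> sudo service cassandra start
> {code}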
> Our schema: 
> {code}
> cqlsh> describe columnfamily data.stories
> CREATE TABLE data.stories (
>     id timeuuid PRIMARY KEY,
>     action_data timeuuid,
>     action_name text,
>     app_id timeuuid,
>     app_instance_id timeuuid,
>     data map<text, text>,
>     objects set<timeuuid>,
>     time_stamp timestamp,
>     user_id timeuuid
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = 'Stories represent the timeline and are placed in the dashboard for the brand manager to see'
>     AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
> cqlsh> 
> {code}
> No log entries stood out; the log pretty much consisted of "x is down" / 
> "x is up" repeated ad infinitum. I have attached the zipped system.log, 
> which covers the situation after the upgrade and then after I stopped the 
> node, removed system, system_traces, OpsCenter, and data/stories-/*, and 
> restarted. The node has rejoined the cluster now and is busy read-repairing 
> to recover its data.
> On another note, we see a lot of this during repair now (on all the nodes): 
> {code}
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,207 RepairSession.java:303 - [repair #c5043c40-d260-11e4-a2f2-8bb3e2bbdb35] session completed with the following error
> java.io.IOException: Failed during snapshot creation.
>         at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,208 CassandraDaemon.java:167 - Exception in thread Thread[AntiEntropySessions:5,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
>         at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> Caused by: java.io.IOException: Failed during snapshot creation.
>         at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>         ... 3 common frames omitted
> ERROR [RepairJobTask:2] 2015-03-24 20:03:20,227 RepairJob.java:145 - Error occurred during snapshot phase
> java.lang.RuntimeException: Could not create snapshot at /10.0.2.144
>         at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:349) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code} 
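> These errors come from the snapshot phase of sequential repair, which is the 
> 2.1 default. As a sanity check I can run the same repair in parallel mode, 
> which skips the per-CF snapshots (-pr limits it to the node's primary 
> range): 
> {code}
> nodetool repair -par -pr data stories
> {code}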
> I am thinking this means my work-around of blowing away and rebuilding the 
> CF may not be working anymore. I don't know of another way to force LCS 
> compaction, and the node never seems to recover enough to compact on its 
> own.
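> For what it's worth, the only other lever I know of is a manual major 
> compaction, though I'm not sure it does anything useful for LCS on 2.1 (it 
> may be a no-op there): 
> {code}
> # manual (major) compaction of just this table
> nodetool compact data stories
> {code}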



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
