[jira] [Updated] (CASSANDRA-9033) Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive

2016-09-05 Thread Wei Deng (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Deng updated CASSANDRA-9033:

Labels:   (was: lcs)

> Upgrading from 2.1.1 to 2.1.3 with LCS  and many sstable files makes nodes 
> unresponsive
> ---
>
> Key: CASSANDRA-9033
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9033
> Project: Cassandra
>  Issue Type: Bug
> Environment: * Ubuntu 14.04.2 - Linux ip-10-0-2-122 3.13.0-46-generic 
> #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> * EC2 m2-xlarge instances [4cpu, 16GB RAM, 1TB storage on 3 platters]
> * 12 nodes running a mix of 2.1.1 and 2.1.3
> * 8GB heap with offheap objects
>Reporter: Brent Haines
>Assignee: Marcus Eriksson
> Attachments: cassandra-env.sh, cassandra.yaml, system.log.1.zip
>
>
> We have an Event Log table using LCS that has grown fast. There are more than 
> 100K sstable files of around 1KB each. Increasing the number of compactors and 
> raising the compaction throughput limit doesn't make a difference. It had been 
> running great until we upgraded to 2.1.3. Those nodes needed more RAM for the 
> heap (12 GB) to even have a prayer of responding to queries. They bog down 
> and become unresponsive. There are no GC messages that I can see, and no 
> compaction either. 
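> Roughly, the knobs we were adjusting were along these lines (an illustrative 
> sketch rather than the exact values used; concurrent_compactors lives in 
> cassandra.yaml and needs a restart to take effect):
> {code}
> # in cassandra.yaml (restart required):
> #   concurrent_compactors: 4
> 
> # raise or remove the compaction throughput cap at runtime (0 = unthrottled)
> nodetool setcompactionthroughput 0
> 
> # check whether compactions are actually making progress
> nodetool compactionstats
> {code}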
> The only workaround I have found is to decommission, blow away the big CF, 
> and rejoin; roughly the sequence sketched below. That takes about 20 minutes 
> and everything is freaking happy again. The size of the files afterwards is 
> also more like what I'd expect. 
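> As commands, the workaround looks roughly like this (paths and service names 
> are assumptions based on a stock Ubuntu package install; the table directory 
> suffix is left out here, as elsewhere in this report):
> {code}
> nodetool decommission        # stream this node's data away and leave the ring
> sudo service cassandra stop
> # wipe the problem CF plus the local system keyspaces, as described further down
> sudo rm -rf /var/lib/cassandra/data/data/stories-*/
> sudo rm -rf /var/lib/cassandra/data/system /var/lib/cassandra/data/system_traces
> sudo service cassandra start # the node bootstraps back into the ring
> {code}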
> Our schema: 
> {code}
> cqlsh> describe columnfamily data.stories
> CREATE TABLE data.stories (
> id timeuuid PRIMARY KEY,
> action_data timeuuid,
> action_name text,
> app_id timeuuid,
> app_instance_id timeuuid,
> data map<text, text>,
> objects set<timeuuid>,
> time_stamp timestamp,
> user_id timeuuid
> ) WITH bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = 'Stories represent the timeline and are placed in the 
> dashboard for the brand manager to see'
> AND compaction = {'min_threshold': '4', 'class': 
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
> 'max_threshold': '32'}
> AND compression = {'sstable_compression': 
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99.0PERCENTILE';
> cqlsh> 
> {code}
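> (Note: the dump above shows SizeTieredCompactionStrategy rather than LCS; for 
> reference, moving a table between the two strategies is a single ALTER along 
> these lines, shown here purely as an illustration:)
> {code}
> cqlsh -e "ALTER TABLE data.stories WITH compaction = {'class': 'LeveledCompactionStrategy'};"
> {code}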
> There were no log entries that stood out; it was pretty much "x is down" / 
> "x is up" repeated ad infinitum. I have attached the zipped system.log that 
> captures the situation after the upgrade, and then after I stopped the node, 
> removed system, system_traces, OpsCenter, and data/stories-/*, and restarted. 
> It has rejoined the cluster now and is busy read-repairing to recover its 
> data.
> On another note, we see a lot of this during repair now (on all the nodes): 
> {code}
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,207 RepairSession.java:303 
> - [repair #c5043c40-d260-11e4-a2f2-8bb3e2bbdb35] session completed with the 
> following error
> java.io.IOException: Failed during snapshot creation.
> at 
> org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344)
>  ~[apache-cassandra-2.1.3.jar:2.1.3]
> at 
> org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) 
> ~[apache-cassandra-2.1.3.jar:2.1.3]
> at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) 
> ~[guava-16.0.jar:na]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  [na:1.7.0_55]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,208 
> CassandraDaemon.java:167 - Exception in thread 
> Thread[AntiEntropySessions:5,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Failed during snapshot 
> creation.
> at com.google.common.base.Throwables.propagate(Throwables.java:160) 
> ~[guava-16.0.jar:na]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) 
> ~[apache-cassandra-2.1.3.jar:2.1.3]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
> ~[na:1.7.0_55]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) 
> ~[na:1.7.0_55]
> ...
> {code}

[jira] [Updated] (CASSANDRA-9033) Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive

2016-07-21 Thread Wei Deng (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Deng updated CASSANDRA-9033:

Labels: lcs  (was: )


[jira] [Updated] (CASSANDRA-9033) Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive

2015-03-26 Thread Marcus Eriksson (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Eriksson updated CASSANDRA-9033:
---
Priority: Major  (was: Blocker)

Lowering the priority, as the actual problem is that you have that many tiny 
files on your node. The question is how you ended up with that many files.

Did you run repairs before the number of files exploded?

Do you have graphs of how many files you have on the node? Is there a gradual 
increase over time, or did it happen overnight?
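(For the graphs, even something simple run periodically would do; the data 
directory path below is an assumption based on a stock package install:)

{code}
# count live sstables for the table in question (run from cron to build a trend)
find /var/lib/cassandra/data/data/stories-*/ -name '*-Data.db' | wc -l

# the per-table sstable count is also reported by nodetool
nodetool cfstats data.stories | grep -i 'SSTable count'
{code}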


[jira] [Updated] (CASSANDRA-9033) Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive

2015-03-25 Thread Philip Thompson (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Thompson updated CASSANDRA-9033:
---
Assignee: Marcus Eriksson
