[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause
[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959057#comment-14959057 ]

Robbie Strickland edited comment on CASSANDRA-10449 at 10/15/15 3:24 PM:
-------------------------------------------------------------------------

I discovered that an index on one of the tables has a wide row, and I'm wondering if that could be the root of the issue:

Example:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model may be well distributed but the index may have unpredictable distribution.

was (Author: rstrickland):
I discovered that an index on one of the tables has a wide row, and I'm assuming that to be the root of the issue:

Example:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model may be well distributed but the index may have unpredictable distribution.

> OOM on bootstrap due to long GC pause
> -------------------------------------
>
> Key: CASSANDRA-10449
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10449
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Ubuntu 14.04, AWS
> Reporter: Robbie Strickland
> Labels: gc
> Fix For: 2.1.x
>
> Attachments: system.log.10-05, thread_dump.log
>
>
> I have a 20-node cluster (i2.4xlarge) with vnodes (default of 256) and
> 500-700GB per node. SSTable counts are <10 per table. I am attempting to
> provision additional nodes, but bootstrapping OOMs every time after about 10
> hours with a sudden long GC pause:
> {noformat}
> INFO [Service Thread] 2015-10-05 23:33:33,373 GCInspector.java:252 - G1 Old Generation GC in 1586126ms. G1 Old Gen: 49213756976 -> 49072277176;
> ...
> ERROR [MemtableFlushWriter:454] 2015-10-05 23:33:33,380 CassandraDaemon.java:223 - Exception in thread Thread[MemtableFlushWriter:454,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {noformat}
> I have tried increasing max heap to 48G just to get through the bootstrap, to
> no avail.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
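The raw numbers in the GCInspector line quoted in the issue description are easier to judge after unit conversion. The following is a minimal sketch, not part of the original report: it parses only the fragment of the log line that the report shows (the rest is elided there and is not reconstructed), and the function name summarize_gc and the dictionary keys are illustrative choices. It suggests a single old-generation collection of roughly 26 minutes that reclaimed only about 135 MiB out of roughly 45.8 GiB.

```python
import re

# Fragment of the GCInspector line exactly as quoted in the issue description.
# The remainder of the line is elided in the report and is left out here.
LOG_FRAGMENT = (
    "G1 Old Generation GC in 1586126ms. "
    "G1 Old Gen: 49213756976 -> 49072277176;"
)

GIB = 1024 ** 3
MIB = 1024 ** 2


def summarize_gc(fragment: str) -> dict:
    """Extract the pause length and old-gen occupancy from a log fragment."""
    pause = re.search(r"GC in (\d+)ms", fragment)
    old_gen = re.search(r"G1 Old Gen: (\d+) -> (\d+)", fragment)
    if pause is None or old_gen is None:
        raise ValueError("fragment does not look like a GCInspector line")
    pause_ms = int(pause.group(1))
    before, after = int(old_gen.group(1)), int(old_gen.group(2))
    return {
        "pause_minutes": pause_ms / 1000 / 60,
        "old_gen_before_gib": before / GIB,
        "old_gen_after_gib": after / GIB,
        "reclaimed_mib": (before - after) / MIB,
    }


if __name__ == "__main__":
    for key, value in summarize_gc(LOG_FRAGMENT).items():
        print(f"{key}: {value:.2f}")
```

A collection that runs this long and frees this little is consistent with a heap that is almost entirely live data, which fits the report that raising the max heap to 48G did not help.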
[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause
[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959057#comment-14959057 ]

Robbie Strickland edited comment on CASSANDRA-10449 at 10/15/15 3:25 PM:
-------------------------------------------------------------------------

I discovered that an index on one of the tables has a wide row, and I'm wondering if that could be the root of the issue:

Example from one node:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model may be well distributed but the index may have unpredictable distribution.
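To make the skew in the partition statistics from the comment above concrete, here is a small sketch that converts the three byte counts to binary units and computes the ratio of the largest partition to the mean. The figures are copied from the {noformat} block; the helper names human and skew are illustrative and not from the thread.

```python
# Byte counts copied from the partition statistics in the comment.
STATS = {
    "min": 125,
    "max": 10_299_432_635,
    "mean": 253_692_309,
}


def human(n_bytes: int) -> str:
    """Format a byte count using binary units, capped at GiB."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.2f} {unit}"
        size /= 1024
    raise AssertionError("unreachable: the loop always returns at GiB")


def skew(stats: dict) -> float:
    """Ratio of the largest partition to the mean partition size."""
    return stats["max"] / stats["mean"]


if __name__ == "__main__":
    for name, value in STATS.items():
        print(f"{name}: {human(value)}")
    print(f"max/mean ratio: {skew(STATS):.1f}")
```

On these figures the largest index partition is about 9.59 GiB against a mean of about 241.94 MiB, so the widest partition is roughly 40 times the mean.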
[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause
[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948566#comment-14948566 ]

Robbie Strickland edited comment on CASSANDRA-10449 at 10/8/15 2:00 PM:
------------------------------------------------------------------------

Unfortunately increasing streaming_socket_timeout_in_ms and memtable_flush_writers resulted in OOMing again instead of hanging. It seems to be hanging/OOMing when it gets to larger sstables (30GB+). I will poke around some more today.

was (Author: rstrickland):
Unfortunately increasing streaming_socket_timeout_in_ms and memtable_flush_writers resulted in OOMing again instead of hanging. It seems to be hanging when it gets to larger sstables (30GB+). I will poke around some more today.
[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause
[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947793#comment-14947793 ]

Yuki Morishita edited comment on CASSANDRA-10449 at 10/7/15 11:25 PM:
----------------------------------------------------------------------

There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping node is experiencing a TombstoneOverwhelmingException. This can make streaming hang, as it never completes the receiving task.
Streaming should tolerate a secondary index build failure: instead of failing the entire stream session, it should just warn the user and go on, so that the user can manually trigger a secondary index rebuild later.

I'm not sure the above relates to the OOM. From StatusLogger, the FlushWriter task is growing, and that is the cause of the OOM.
-If you can capture a stack using jstack, that would be a great help.- Missed attachment, sorry.

-I will create a separate JIRA for the former.- Created CASSANDRA-10474.
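The behaviour proposed in the comment above (warn and continue when a secondary index build fails, rather than failing the whole stream session) can be sketched as follows. This is an illustration of the idea only and is not Cassandra's code: IndexBuildError, finish_receive, add_sstable and rebuild_index are all hypothetical names invented for this sketch, with IndexBuildError standing in for a failure such as TombstoneOverwhelmingException.

```python
import logging
from typing import Callable, Iterable, List

log = logging.getLogger("stream_receive_sketch")


class IndexBuildError(Exception):
    """Hypothetical stand-in for a failure such as TombstoneOverwhelmingException."""


def finish_receive(
    sstables: Iterable[str],
    add_sstable: Callable[[str], None],
    rebuild_index: Callable[[str], None],
) -> List[str]:
    """Add received sstables, tolerating index-build failures.

    Returns the sstables whose index rebuild failed, so that the indexes
    can be rebuilt manually later instead of aborting the session.
    """
    needs_rebuild: List[str] = []
    for sstable in sstables:
        # The streamed data is kept regardless of the index outcome.
        add_sstable(sstable)
        try:
            rebuild_index(sstable)
        except IndexBuildError as exc:
            # Warn and carry on; the receive task still completes.
            log.warning(
                "index build failed for %s: %s; rebuild manually", sstable, exc
            )
            needs_rebuild.append(sstable)
    return needs_rebuild
```

Only the hypothetical index-build error is caught here, so unrelated failures would still propagate and fail the session. Whether the actual change should be equally narrow is a design decision for CASSANDRA-10474.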
[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause
[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947793#comment-14947793 ]

Yuki Morishita edited comment on CASSANDRA-10449 at 10/7/15 11:23 PM:
----------------------------------------------------------------------

There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping node is experiencing a TombstoneOverwhelmingException. This can make streaming hang, as it never completes the receiving task.
Streaming should tolerate a secondary index build failure: instead of failing the entire stream session, it should just warn the user and go on, so that the user can manually trigger a secondary index rebuild later.

I'm not sure the above relates to the OOM. From StatusLogger, the FlushWriter task is growing, and that is the cause of the OOM.
If you can capture a stack using jstack, that would be a great help.

-I will create a separate JIRA for the former.- Created CASSANDRA-10474.

was (Author: yukim):
There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping node is experiencing a TombstoneOverwhelmingException. This can make streaming hang, as it never completes the receiving task.
Streaming should tolerate a secondary index build failure: instead of failing the entire stream session, it should just warn the user and go on, so that the user can manually trigger a secondary index rebuild later.

I'm not sure the above relates to the OOM. From StatusLogger, the FlushWriter task is growing, and that is the cause of the OOM.
If you can capture a stack using jstack, that would be a great help.

I will create a separate JIRA for the former.
[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause
[ https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1497#comment-1497 ]

Philip Thompson edited comment on CASSANDRA-10449 at 10/6/15 3:18 AM:
----------------------------------------------------------------------

Can you attach a system.log from a node that fails to bootstrap?

[~JoshuaMcKenzie], who should this be assigned to?

was (Author: philipthompson):
Can you attach a system.log from a node that fails to bootstrap?