[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause

2015-10-15 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959057#comment-14959057
 ] 

Robbie Strickland edited comment on CASSANDRA-10449 at 10/15/15 3:24 PM:
-

I discovered that an index on one of the tables has a wide row, and I'm 
wondering if that could be the root of the issue:

Example:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model 
may be well distributed but the index may have unpredictable distribution.
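
To put those figures in more familiar units, here is a small sketch that converts the three byte counts quoted above and compares the largest partition with the mean. The helper name `human_size` is just illustrative, and the numbers are copied from the stats for this one index, so treat it as a quick sanity check rather than a diagnosis.

```python
# Figures copied from the {noformat} block above (one index, one node).
PARTITION_STATS = {
    "minimum": 125,
    "maximum": 10299432635,
    "mean": 253692309,
}


def human_size(n_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB)."""
    size = float(n_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that keeps the value below 1024.
        if size < 1024 or unit == "TiB":
            return f"{size:.1f} {unit}"
        size /= 1024


for name, value in PARTITION_STATS.items():
    print(f"{name:>7}: {human_size(value)}")

# How much larger the widest partition is than the average one.
ratio = PARTITION_STATS["maximum"] / PARTITION_STATS["mean"]
print(f"max/mean ratio: {ratio:.1f}")
```

The largest index partition comes out at roughly 9.6 GiB against a mean of about 242 MiB, which is the skew the comment is pointing at.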


was (Author: rstrickland):
I discovered that an index on one of the tables has a wide row, and I'm 
assuming that to be the root of the issue:

Example:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model 
may be well distributed but the index may have unpredictable distribution.

> OOM on bootstrap due to long GC pause
> -
>
> Key: CASSANDRA-10449
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10449
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Ubuntu 14.04, AWS
> Reporter: Robbie Strickland
> Labels: gc
> Fix For: 2.1.x
>
> Attachments: system.log.10-05, thread_dump.log
>
>
> I have a 20-node cluster (i2.4xlarge) with vnodes (default of 256) and 
> 500-700GB per node.  SSTable counts are <10 per table.  I am attempting to 
> provision additional nodes, but bootstrapping OOMs every time after about 10 
> hours with a sudden long GC pause:
> {noformat}
> INFO  [Service Thread] 2015-10-05 23:33:33,373 GCInspector.java:252 - G1 Old 
> Generation GC in 1586126ms.  G1 Old Gen: 49213756976 -> 49072277176;
> ...
> ERROR [MemtableFlushWriter:454] 2015-10-05 23:33:33,380 
> CassandraDaemon.java:223 - Exception in thread 
> Thread[MemtableFlushWriter:454,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {noformat}
> I have tried increasing max heap to 48G just to get through the bootstrap, to 
> no avail.
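
The GCInspector line in the description is easier to judge once the raw numbers are converted. The sketch below parses that single line (rejoined from the wrapped log output) and reports the pause length and how much old-gen memory the collection actually reclaimed. The regular expression is written for this one line only and is an assumption, not a general GCInspector parser.

```python
import re

# The GCInspector line from the report, rejoined onto a single line.
GC_LINE = (
    "INFO  [Service Thread] 2015-10-05 23:33:33,373 GCInspector.java:252 - "
    "G1 Old Generation GC in 1586126ms.  G1 Old Gen: 49213756976 -> 49072277176;"
)

# Pattern tailored to the line above; other log formats may differ.
GC_PATTERN = re.compile(r"GC in (\d+)ms\.\s+G1 Old Gen: (\d+) -> (\d+);")


def summarize_gc(line):
    """Return pause length and reclaimed memory for one GCInspector line."""
    match = GC_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like the GCInspector entry above")
    pause_ms, before, after = (int(group) for group in match.groups())
    freed = before - after
    return {
        "pause_minutes": pause_ms / 60000,
        "freed_mib": freed / 2**20,
        "freed_percent": 100 * freed / before,
    }


summary = summarize_gc(GC_LINE)
print(summary)
```

For this line the pause is about 26 minutes and the collection frees roughly 135 MiB, well under 1% of the old generation, so the heap was already almost entirely live data when the OOM hit.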



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause

2015-10-15 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959057#comment-14959057
 ] 

Robbie Strickland edited comment on CASSANDRA-10449 at 10/15/15 3:25 PM:
-

I discovered that an index on one of the tables has a wide row, and I'm 
wondering if that could be the root of the issue:

Example from one node:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model 
may be well distributed but the index may have unpredictable distribution.


was (Author: rstrickland):
I discovered that an index on one of the tables has a wide row, and I'm 
wondering if that could be the root of the issue:

Example:
{noformat}
Compacted partition minimum bytes: 125
Compacted partition maximum bytes: 10299432635
Compacted partition mean bytes: 253692309
{noformat}

This seems like a problem in general for indexes, where the original data model 
may be well distributed but the index may have unpredictable distribution.






[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause

2015-10-08 Thread Robbie Strickland (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948566#comment-14948566
 ] 

Robbie Strickland edited comment on CASSANDRA-10449 at 10/8/15 2:00 PM:


Unfortunately, increasing streaming_socket_timeout_in_ms and 
memtable_flush_writers resulted in an OOM again instead of a hang.  It seems 
to hang or OOM when it gets to larger sstables (30GB+).  I will poke 
around some more today.


was (Author: rstrickland):
Unfortunately increasing streaming_socket_timeout_in_ms and 
memtable_flush_writers resulted in OOMing again instead of hanging.  It seems 
to be hanging when it gets to larger sstables (30GB+).  I will poke around some 
more today.






[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause

2015-10-07 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947793#comment-14947793
 ] 

Yuki Morishita edited comment on CASSANDRA-10449 at 10/7/15 11:25 PM:
--

There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - 
Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping 
node is hitting a TombstoneOverwhelmingException.
This can make streaming hang, as the receiving task never completes.
Streaming should tolerate a secondary index build failure: instead of failing 
the entire stream session, it should just warn the user and go on, so that the 
user can manually trigger a secondary index rebuild later.
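
The "warn and go on" behaviour proposed above can be sketched as follows. This is not Cassandra's streaming code: it is a small Python illustration of the control flow only, and every name in it (`IndexBuildError`, `finish_receive_task`, `flaky_rebuild`, the file names) is hypothetical.

```python
import logging

logger = logging.getLogger("stream_receive_sketch")


class IndexBuildError(Exception):
    """Hypothetical stand-in for a failure such as TombstoneOverwhelmingException."""


def finish_receive_task(received_files, rebuild_index):
    """Complete a receive task even if an index rebuild fails.

    Returns the files whose index still needs a manual rebuild.
    """
    needs_rebuild = []
    for path in received_files:
        try:
            rebuild_index(path)
        except IndexBuildError as exc:
            # Warn and continue instead of failing the whole stream session.
            logger.warning("index build failed for %s: %s", path, exc)
            needs_rebuild.append(path)
    return needs_rebuild


def flaky_rebuild(path):
    """Hypothetical rebuild that fails for one problematic file."""
    if "wide" in path:
        raise IndexBuildError("too many tombstones")


pending = finish_receive_task(
    ["a-Data.db", "wide-Data.db", "b-Data.db"], flaky_rebuild
)
print(pending)
```

The point of the design is that the receive task always returns, so the session can complete, and the list of failed files tells the operator which index rebuilds to trigger by hand afterwards.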

I'm not sure the above relates to the OOM. From StatusLogger, the number of 
FlushWriter tasks is growing, and that is the cause of the OOM.
-If you can capture a stack trace using jstack, that would be a great help.- 
Missed the attachment, sorry.

-I will create a separate JIRA for the former.- Created CASSANDRA-10474.


was (Author: yukim):
There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - 
Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping 
node is hitting a TombstoneOverwhelmingException.
This can make streaming hang, as the receiving task never completes.
Streaming should tolerate a secondary index build failure: instead of failing 
the entire stream session, it should just warn the user and go on, so that the 
user can manually trigger a secondary index rebuild later.

I'm not sure the above relates to the OOM. From StatusLogger, the number of 
FlushWriter tasks is growing, and that is the cause of the OOM.
If you can capture a stack trace using jstack, that would be a great help.

-I will create a separate JIRA for the former.- Created CASSANDRA-10474.






[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause

2015-10-07 Thread Yuki Morishita (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14947793#comment-14947793
 ] 

Yuki Morishita edited comment on CASSANDRA-10449 at 10/7/15 11:23 PM:
--

There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - 
Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping 
node is hitting a TombstoneOverwhelmingException.
This can make streaming hang, as the receiving task never completes.
Streaming should tolerate a secondary index build failure: instead of failing 
the entire stream session, it should just warn the user and go on, so that the 
user can manually trigger a secondary index rebuild later.

I'm not sure the above relates to the OOM. From StatusLogger, the number of 
FlushWriter tasks is growing, and that is the cause of the OOM.
If you can capture a stack trace using jstack, that would be a great help.

-I will create a separate JIRA for the former.- Created CASSANDRA-10474.


was (Author: yukim):
There are a couple of things going on.

{code}
ERROR [StreamReceiveTask:29] 2015-10-05 14:46:17,090 CassandraDaemon.java:223 - 
Exception in thread Thread[StreamReceiveTask:29,5,main]
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: 
org.apache.cassandra.db.filter.TombstoneOverwhelmingException
{code}

When rebuilding the secondary index after receiving files, the bootstrapping 
node is hitting a TombstoneOverwhelmingException.
This can make streaming hang, as the receiving task never completes.
Streaming should tolerate a secondary index build failure: instead of failing 
the entire stream session, it should just warn the user and go on, so that the 
user can manually trigger a secondary index rebuild later.

I'm not sure the above relates to the OOM. From StatusLogger, the number of 
FlushWriter tasks is growing, and that is the cause of the OOM.
If you can capture a stack trace using jstack, that would be a great help.

I will create a separate JIRA for the former.






[jira] [Comment Edited] (CASSANDRA-10449) OOM on bootstrap due to long GC pause

2015-10-05 Thread Philip Thompson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1497#comment-1497
 ] 

Philip Thompson edited comment on CASSANDRA-10449 at 10/6/15 3:18 AM:
--

Can you attach a system.log from a node that fails to bootstrap? 

[~JoshuaMcKenzie], who should this be assigned to?


was (Author: philipthompson):
Can you attach a system.log from a node that fails to bootstrap? 

> OOM on bootstrap due to long GC pause
> -
>
> Key: CASSANDRA-10449
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10449
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Ubuntu 14.04, AWS
> Reporter: Robbie Strickland
> Labels: gc
> Fix For: 2.1.x
>
>
> I have a 20-node cluster (i2.4xlarge) with vnodes (default of 256) and 
> 500-700GB per node.  SSTable counts are <10 per table.  I am attempting to 
> provision additional nodes, but bootstrapping OOMs every time after about 10 
> hours with a sudden long GC pause:
> {noformat}
> INFO  [Service Thread] 2015-10-05 23:33:33,373 GCInspector.java:252 - G1 Old 
> Generation GC in 1586126ms.  G1 Old Gen: 49213756976 -> 49072277176;
> ...
> ERROR [MemtableFlushWriter:454] 2015-10-05 23:33:33,380 
> CassandraDaemon.java:223 - Exception in thread 
> Thread[MemtableFlushWriter:454,5,main]
> java.lang.OutOfMemoryError: Java heap space
> {noformat}
> I have tried increasing max heap to 48G just to get through the bootstrap, to 
> no avail.


