[jira] [Commented] (CASSANDRA-11117) ColUpdateTimeDeltaHistogram histogram overflow

2016-07-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380172#comment-15380172
 ] 

Jeff Griffith commented on CASSANDRA-11117:
---

Yes, same here. Upgraded from 2.1.

> ColUpdateTimeDeltaHistogram histogram overflow
> --
>
> Key: CASSANDRA-11117
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11117
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Chris Lohfink
>Assignee: Joel Knighton
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x
>
>
> {code}
> getting attribute Mean of 
> org.apache.cassandra.metrics:type=ColumnFamily,name=ColUpdateTimeDeltaHistogram
>  threw an exception: javax.management.RuntimeMBeanException: 
> java.lang.IllegalStateException: Unable to compute ceiling for max when 
> histogram overflowed
> {code}
> Given that this histogram already has 164 buckets, I wonder if there is 
> something weird in the computation that's causing this to be so large? It 
> appears to be coming from updates to system.local:
> {code}
> org.apache.cassandra.metrics:type=Table,keyspace=system,scope=local,name=ColUpdateTimeDeltaHistogram
> {code}





[jira] [Commented] (CASSANDRA-11117) ColUpdateTimeDeltaHistogram histogram overflow

2016-05-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280185#comment-15280185
 ] 

Jeff Griffith commented on CASSANDRA-11117:
---

The code that updates this is here in ColumnFamilyStore.java:

{code}
public void apply(DecoratedKey key, ColumnFamily columnFamily,
                  SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
                  ReplayPosition replayPosition)
{
    long start = System.nanoTime();
    Memtable mt = data.getMemtableFor(opGroup, replayPosition);
    final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
    maybeUpdateRowCache(key);
    metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
    metric.writeLatency.addNano(System.nanoTime() - start);
    if (timeDelta < Long.MAX_VALUE)
        metric.colUpdateTimeDeltaHistogram.update(timeDelta);
}
{code}

That "if (timeDelta < Long.MAX_VALUE)" looks ill-conceived since there are no 
longs > max long, but i don't really know what exactly is overflowing in the 
histogram.
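
If I had to guess, Long.MAX_VALUE is acting as a sentinel there rather than an 
overflow guard: put() seems to fold a minimum timestamp delta over the columns 
it replaced, seeded with Long.MAX_VALUE, so the check just skips recording when 
nothing was replaced. A hypothetical sketch of that assumed convention (names 
invented for illustration, not the actual Memtable code):

{code}
// Hypothetical sketch of the assumed sentinel convention -- invented names,
// not the actual Memtable code.
public class SentinelSketch
{
    // put() would fold the minimum timestamp delta over the columns it
    // replaced, seeded with Long.MAX_VALUE.
    static long minTimeDelta(long[] oldTimestamps, long[] newTimestamps)
    {
        long timeDelta = Long.MAX_VALUE; // sentinel: "nothing replaced"
        for (int i = 0; i < oldTimestamps.length; i++)
            timeDelta = Math.min(timeDelta, newTimestamps[i] - oldTimestamps[i]);
        return timeDelta;
    }

    public static void main(String[] args)
    {
        // No columns replaced: the sentinel survives, so the caller's
        // "timeDelta < Long.MAX_VALUE" test skips the histogram update.
        long delta = minTimeDelta(new long[0], new long[0]);
        System.out.println(delta < Long.MAX_VALUE
                           ? "would record " + delta
                           : "sentinel survived, histogram not updated");
    }
}
{code}

Read that way, the check filters the sentinel rather than guarding against 
overflow, so whatever is overflowing must be inside the histogram itself.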


> ColUpdateTimeDeltaHistogram histogram overflow
> --
>
> Key: CASSANDRA-11117
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11117
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Chris Lohfink
>Assignee: Joel Knighton
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x
>
>
> {code}
> getting attribute Mean of 
> org.apache.cassandra.metrics:type=ColumnFamily,name=ColUpdateTimeDeltaHistogram
>  threw an exception: javax.management.RuntimeMBeanException: 
> java.lang.IllegalStateException: Unable to compute ceiling for max when 
> histogram overflowed
> {code}
> Given that this histogram already has 164 buckets, I wonder if there is 
> something weird in the computation that's causing this to be so large? It 
> appears to be coming from updates to system.local:
> {code}
> org.apache.cassandra.metrics:type=Table,keyspace=system,scope=local,name=ColUpdateTimeDeltaHistogram
> {code}





[jira] [Issue Comment Deleted] (CASSANDRA-11117) ColUpdateTimeDeltaHistogram histogram overflow

2016-05-11 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-11117:
--
Comment: was deleted

(was: The code that updates this is here in ColumnFamilyStore.java:
{code}
public void apply(DecoratedKey key, ColumnFamily columnFamily,
                  SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
                  ReplayPosition replayPosition)
{
    long start = System.nanoTime();
    Memtable mt = data.getMemtableFor(opGroup, replayPosition);
    final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
    maybeUpdateRowCache(key);
    metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
    metric.writeLatency.addNano(System.nanoTime() - start);
    if (timeDelta < Long.MAX_VALUE)
        metric.colUpdateTimeDeltaHistogram.update(timeDelta);
}
{code}

That "if (timeDelta < Long.MAX_VALUE)" looks ill-conceived since there are no 
longs > max long, but i don't really know what exactly is overflowing in the 
histogram.

)

> ColUpdateTimeDeltaHistogram histogram overflow
> --
>
> Key: CASSANDRA-11117
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11117
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Chris Lohfink
>Assignee: Joel Knighton
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x
>
>
> {code}
> getting attribute Mean of 
> org.apache.cassandra.metrics:type=ColumnFamily,name=ColUpdateTimeDeltaHistogram
>  threw an exception: javax.management.RuntimeMBeanException: 
> java.lang.IllegalStateException: Unable to compute ceiling for max when 
> histogram overflowed
> {code}
> Given that this histogram already has 164 buckets, I wonder if there is 
> something weird in the computation that's causing this to be so large? It 
> appears to be coming from updates to system.local:
> {code}
> org.apache.cassandra.metrics:type=Table,keyspace=system,scope=local,name=ColUpdateTimeDeltaHistogram
> {code}





[jira] [Comment Edited] (CASSANDRA-11117) ColUpdateTimeDeltaHistogram histogram overflow

2016-05-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280152#comment-15280152
 ] 

Jeff Griffith edited comment on CASSANDRA-11117 at 5/11/16 1:59 PM:


The code that updates this is here in ColumnFamilyStore.java:
{code}
public void apply(DecoratedKey key, ColumnFamily columnFamily,
                  SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
                  ReplayPosition replayPosition)
{
    long start = System.nanoTime();
    Memtable mt = data.getMemtableFor(opGroup, replayPosition);
    final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
    maybeUpdateRowCache(key);
    metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
    metric.writeLatency.addNano(System.nanoTime() - start);
    if (timeDelta < Long.MAX_VALUE)
        metric.colUpdateTimeDeltaHistogram.update(timeDelta);
}
{code}

That "if (timeDelta < Long.MAX_VALUE)" looks ill-conceived since there are no 
longs > max long, but i don't really know what exactly is overflowing in the 
histogram.




was (Author: jeffery.griffith):
The code that updates this is here:
{code}
public void apply(DecoratedKey key, ColumnFamily columnFamily,
                  SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
                  ReplayPosition replayPosition)
{
    long start = System.nanoTime();
    Memtable mt = data.getMemtableFor(opGroup, replayPosition);
    final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
    maybeUpdateRowCache(key);
    metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
    metric.writeLatency.addNano(System.nanoTime() - start);
    if (timeDelta < Long.MAX_VALUE)
        metric.colUpdateTimeDeltaHistogram.update(timeDelta);
}
{code}

That "if (timeDelta < Long.MAX_VALUE)" looks ill-conceived since there are no 
longs > max long, but i don't really know what exactly is overflowing in the 
histogram.



> ColUpdateTimeDeltaHistogram histogram overflow
> --
>
> Key: CASSANDRA-11117
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11117
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Chris Lohfink
>Assignee: Joel Knighton
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x
>
>
> {code}
> getting attribute Mean of 
> org.apache.cassandra.metrics:type=ColumnFamily,name=ColUpdateTimeDeltaHistogram
>  threw an exception: javax.management.RuntimeMBeanException: 
> java.lang.IllegalStateException: Unable to compute ceiling for max when 
> histogram overflowed
> {code}
> Given that this histogram already has 164 buckets, I wonder if there is 
> something weird in the computation that's causing this to be so large? It 
> appears to be coming from updates to system.local:
> {code}
> org.apache.cassandra.metrics:type=Table,keyspace=system,scope=local,name=ColUpdateTimeDeltaHistogram
> {code}





[jira] [Commented] (CASSANDRA-11117) ColUpdateTimeDeltaHistogram histogram overflow

2016-05-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280152#comment-15280152
 ] 

Jeff Griffith commented on CASSANDRA-11117:
---

The code that updates this is here:
{code}
public void apply(DecoratedKey key, ColumnFamily columnFamily,
                  SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
                  ReplayPosition replayPosition)
{
    long start = System.nanoTime();
    Memtable mt = data.getMemtableFor(opGroup, replayPosition);
    final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
    maybeUpdateRowCache(key);
    metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
    metric.writeLatency.addNano(System.nanoTime() - start);
    if (timeDelta < Long.MAX_VALUE)
        metric.colUpdateTimeDeltaHistogram.update(timeDelta);
}
{code}

That "if (timeDelta < Long.MAX_VALUE)" looks ill-conceived since there are no 
longs > max long, but i don't really know what exactly is overflowing in the 
histogram.



> ColUpdateTimeDeltaHistogram histogram overflow
> --
>
> Key: CASSANDRA-11117
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11117
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Chris Lohfink
>Assignee: Joel Knighton
>Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x
>
>
> {code}
> getting attribute Mean of 
> org.apache.cassandra.metrics:type=ColumnFamily,name=ColUpdateTimeDeltaHistogram
>  threw an exception: javax.management.RuntimeMBeanException: 
> java.lang.IllegalStateException: Unable to compute ceiling for max when 
> histogram overflowed
> {code}
> Given that this histogram already has 164 buckets, I wonder if there is 
> something weird in the computation that's causing this to be so large? It 
> appears to be coming from updates to system.local:
> {code}
> org.apache.cassandra.metrics:type=Table,keyspace=system,scope=local,name=ColUpdateTimeDeltaHistogram
> {code}





[jira] [Commented] (CASSANDRA-11751) Histogram overflow in metrics

2016-05-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280143#comment-15280143
 ] 

Jeff Griffith commented on CASSANDRA-11751:
---

Thanks [~tjake].  Sorry for the duplicate.

> Histogram overflow in metrics
> -
>
> Key: CASSANDRA-11751
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11751
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Cassandra 2.2.6 on Linux
>Reporter: Jeff Griffith
>
> One particular histogram in the Cassandra metrics seems to overflow, 
> preventing the calculation of the mean on the Dropwizard "Snapshot". Here is 
> the exception that comes from the metrics library:
> {code}
> java.lang.IllegalStateException: Unable to compute ceiling for max when 
> histogram overflowed
> at 
> org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:232)
>  ~[apache-cassandra-2.2.6.jar:2.2.6-SNAPSHOT]
> at 
> org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103)
>  ~[apache-cassandra-2.2.6.jar:2.2.6-SNAPSHOT]
> at 
> com.addthis.metrics3.reporter.config.SplunkReporter.reportHistogram(SplunkReporter.java:155)
>  ~[reporter-config3-3.0.0.jar:3.0.0]
> at 
> com.addthis.metrics3.reporter.config.SplunkReporter.report(SplunkReporter.java:101)
>  ~[reporter-config3-3.0.0.jar:3.0.0]
> at 
> com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) 
> ~[metrics-core-3.1.0.jar:3.1.0]
> at 
> com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) 
> ~[metrics-core-3.1.0.jar:3.1.0]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> [na:1.8.0_72]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
> [na:1.8.0_72]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  [na:1.8.0_72]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  [na:1.8.0_72]
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  [na:1.8.0_72]
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  [na:1.8.0_72]
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
> {code}
> On deeper analysis, it seems like this is happening specifically on this 
> metric:
> {code}
> ColUpdateTimeDeltaHistogram
> {code}
> I think this is where it is updated, in ColumnFamilyStore.java:
> {code}
> public void apply(DecoratedKey key, ColumnFamily columnFamily,
>                   SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
>                   ReplayPosition replayPosition)
> {
>     long start = System.nanoTime();
>     Memtable mt = data.getMemtableFor(opGroup, replayPosition);
>     final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
>     maybeUpdateRowCache(key);
>     metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
>     metric.writeLatency.addNano(System.nanoTime() - start);
>     if (timeDelta < Long.MAX_VALUE)
>         metric.colUpdateTimeDeltaHistogram.update(timeDelta);
> }
> {code}
> Considering it's calculating a mean, I wonder if perhaps a large sum might 
> be overflowing? But that "if (timeDelta < Long.MAX_VALUE)" looks suspect, 
> doesn't it?





[jira] [Created] (CASSANDRA-11751) Histogram overflow in metrics

2016-05-11 Thread Jeff Griffith (JIRA)
Jeff Griffith created CASSANDRA-11751:
-

 Summary: Histogram overflow in metrics
 Key: CASSANDRA-11751
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11751
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.2.6 on Linux
Reporter: Jeff Griffith


One particular histogram in the Cassandra metrics seems to overflow, preventing 
the calculation of the mean on the Dropwizard "Snapshot". Here is the exception 
that comes from the metrics library:

{code}
java.lang.IllegalStateException: Unable to compute ceiling for max when 
histogram overflowed
at 
org.apache.cassandra.utils.EstimatedHistogram.rawMean(EstimatedHistogram.java:232)
 ~[apache-cassandra-2.2.6.jar:2.2.6-SNAPSHOT]
at 
org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getMean(EstimatedHistogramReservoir.java:103)
 ~[apache-cassandra-2.2.6.jar:2.2.6-SNAPSHOT]
at 
com.addthis.metrics3.reporter.config.SplunkReporter.reportHistogram(SplunkReporter.java:155)
 ~[reporter-config3-3.0.0.jar:3.0.0]
at 
com.addthis.metrics3.reporter.config.SplunkReporter.report(SplunkReporter.java:101)
 ~[reporter-config3-3.0.0.jar:3.0.0]
at 
com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) 
~[metrics-core-3.1.0.jar:3.1.0]
at 
com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) 
~[metrics-core-3.1.0.jar:3.1.0]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
[na:1.8.0_72]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) 
[na:1.8.0_72]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 [na:1.8.0_72]
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 [na:1.8.0_72]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_72]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
[na:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
{code}

On deeper analysis, it seems like this is happening specifically on this metric:
{code}
ColUpdateTimeDeltaHistogram
{code}

I think this is where it is updated, in ColumnFamilyStore.java:
{code}
public void apply(DecoratedKey key, ColumnFamily columnFamily,
                  SecondaryIndexManager.Updater indexer, OpOrder.Group opGroup,
                  ReplayPosition replayPosition)
{
    long start = System.nanoTime();
    Memtable mt = data.getMemtableFor(opGroup, replayPosition);
    final long timeDelta = mt.put(key, columnFamily, indexer, opGroup);
    maybeUpdateRowCache(key);
    metric.samplers.get(Sampler.WRITES).addSample(key.getKey(), key.hashCode(), 1);
    metric.writeLatency.addNano(System.nanoTime() - start);
    if (timeDelta < Long.MAX_VALUE)
        metric.colUpdateTimeDeltaHistogram.update(timeDelta);
}
{code}

Considering it's calculating a mean, I wonder if perhaps a large sum might be 
overflowing? But that "if (timeDelta < Long.MAX_VALUE)" looks suspect, doesn't 
it?
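
Answering my own question a bit: the exception message suggests the "overflow" 
is an overflowed bucket rather than an overflowing sum. Here is a minimal 
sketch of an EstimatedHistogram-style reservoir (simplified and assumed, not 
Cassandra's actual implementation): values above the largest bucket boundary 
land in an overflow bucket, and once that bucket is non-empty there is no 
finite ceiling for the max, so the mean cannot be computed.

{code}
// Minimal sketch of an EstimatedHistogram-style reservoir -- simplified,
// not Cassandra's actual EstimatedHistogram.
public class HistogramSketch
{
    private final long[] offsets;  // bucket boundaries, e.g. 1, 2, 3, 4, 5, 7, ...
    private final long[] buckets;  // buckets[i] counts values <= offsets[i];
                                   // the extra last slot counts overflows

    public HistogramSketch(long[] offsets)
    {
        this.offsets = offsets;
        this.buckets = new long[offsets.length + 1];
    }

    public void update(long value)
    {
        for (int i = 0; i < offsets.length; i++)
        {
            if (value <= offsets[i])
            {
                buckets[i]++;
                return;
            }
        }
        buckets[buckets.length - 1]++; // beyond the largest boundary: overflow
    }

    public double mean()
    {
        // Once anything has overflowed, the max has no finite ceiling, so the
        // mean is undefined -- hence the exception above.
        if (buckets[buckets.length - 1] > 0)
            throw new IllegalStateException(
                "Unable to compute ceiling for max when histogram overflowed");
        long count = 0, sum = 0;
        for (int i = 0; i < offsets.length; i++)
        {
            count += buckets[i];
            sum += buckets[i] * offsets[i]; // each bucket contributes its ceiling
        }
        return count == 0 ? 0 : (double) sum / count;
    }
}
{code}

If that is roughly what is going on, the real question is why 
colUpdateTimeDeltaHistogram is being fed values past its largest bucket in the 
first place.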






[jira] [Updated] (CASSANDRA-11504) Slow inter-node network growth & gc issues with uptime

2016-04-05 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-11504:
--
Environment: Cassandra 2.1.13 & 2.2.5  (was: Cassandra 2.1.13)

> Slow inter-node network growth & gc issues with uptime
> --
>
> Key: CASSANDRA-11504
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11504
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 2.1.13 & 2.2.5
>Reporter: Jeff Griffith
> Attachments: InterNodeTraffic.jpg
>
>
> We are looking for help troubleshooting our production environment where we 
> are experiencing GC problems. After much experimentation and troubleshooting 
> with various settings, the only correlation that we can find with a slow 
> growth in GC is a slow growth in network traffic BETWEEN cassandra nodes in 
> our cluster. As an example, I have attached a graph from a cluster of 24 
> nodes where I restarted 23 of them. Note that the outgoing rate for that 
> 24th node remains high while all others drop after the restart. Also note 
> that this graph shows ONLY traffic between cassandra nodes. Traffic from the 
> clients remains FLAT throughout. Analyzing column family stats shows they 
> are flat throughout. Cache hit rates are also consistent across nodes. GC is 
> of course its own can of worms, so we are hoping this considerable increase 
> in traffic (more than double over the course of 6 hrs) between nodes 
> explains it. We would greatly appreciate any ideas as to why this extra 
> network output correlates to uptime, or ideas on what to "diff" between the 
> nodes.





[jira] [Updated] (CASSANDRA-11504) Slow inter-node network growth & gc issues with uptime

2016-04-05 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-11504:
--
Description: We are looking for help troubleshooting our production 
environment where we are experiencing GC problems. After much experimentation 
and troubleshooting with various settings, the only correlation that we can 
find with a slow growth in GC is a slow growth in network traffic BETWEEN 
cassandra nodes in our cluster. As an example, I have attached a graph from a 
cluster of 24 nodes where I restarted 23 of them. Note that the outgoing rate 
for that 24th node remains high while all others drop after the restart. Also 
note that this graph shows ONLY traffic between cassandra nodes. Traffic from 
the clients remains FLAT throughout. Analyzing column family stats shows they 
are flat throughout. Cache hit rates are also consistent across nodes. GC is 
of course its own can of worms, so we are hoping this considerable increase in 
traffic (more than double over the course of 6 hrs) between nodes explains it. 
We would greatly appreciate any ideas as to why this extra network output 
correlates to uptime, or ideas on what to "diff" between the nodes.  (was: We 
are looking for help troubleshooting our production environment where we are 
experiencing GC problems. After much experimentation and troubleshooting with 
various settings, the only correlation that we can find with a slow growth in 
GC a slow growth in network traffic BETWEEN cassandra nodes in our cluster. As 
an example, I have attached a graph from a cluster of 24 nodes where I 
restarted 23 of them. Note that the outgoing rate for that 24th node remains 
high while all others drop after the restart. Also note that this graph shows 
ONLY traffic between cassandra nodes. Traffic from the clients remains FLAT 
throughout. Analyzing column family stats shows they are flat throughout. 
Cache hit rates are also consistent across nodes. GC is of course its own can 
of worms, so we are hoping this considerable increase in traffic (more than 
double over the course of 6 hrs) between nodes explains it. We would greatly 
appreciate any ideas as to why this extra network output correlates to uptime, 
or ideas on what to "diff" between the nodes.)

> Slow inter-node network growth & gc issues with uptime
> --
>
> Key: CASSANDRA-11504
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11504
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 2.1.13
>Reporter: Jeff Griffith
> Attachments: InterNodeTraffic.jpg
>
>
> We are looking for help troubleshooting our production environment where we 
> are experiencing GC problems. After much experimentation and troubleshooting 
> with various settings, the only correlation that we can find with a slow 
> growth in GC is a slow growth in network traffic BETWEEN cassandra nodes in 
> our cluster. As an example, I have attached a graph from a cluster of 24 
> nodes where I restarted 23 of them. Note that the outgoing rate for that 
> 24th node remains high while all others drop after the restart. Also note 
> that this graph shows ONLY traffic between cassandra nodes. Traffic from the 
> clients remains FLAT throughout. Analyzing column family stats shows they 
> are flat throughout. Cache hit rates are also consistent across nodes. GC is 
> of course its own can of worms, so we are hoping this considerable increase 
> in traffic (more than double over the course of 6 hrs) between nodes 
> explains it. We would greatly appreciate any ideas as to why this extra 
> network output correlates to uptime, or ideas on what to "diff" between the 
> nodes.





[jira] [Updated] (CASSANDRA-11504) Slow inter-node network growth & gc issues with uptime

2016-04-05 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-11504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-11504:
--
Description: We are looking for help troubleshooting our production 
environment where we are experiencing GC problems. After much experimentation 
and troubleshooting with various settings, the only correlation that we can 
find with a slow growth in GC a slow growth in network traffic BETWEEN 
cassandra nodes in our cluster. As an example, I have attached a graph from a 
cluster of 24 nodes where I restarted 23 of them. Note that the outgoing rate 
for that 24th node remains high while all others drop after the restart. Also 
note that this graph shows ONLY traffic between cassandra nodes. Traffic from 
the clients remains FLAT throughout. Analyzing column family stats shows they 
are flat throughout. Cache hit rates are also consistent across nodes. GC is 
of course its own can of worms, so we are hoping this considerable increase in 
traffic (more than double over the course of 6 hrs) between nodes explains it. 
We would greatly appreciate any ideas as to why this extra network output 
correlates to uptime, or ideas on what to "diff" between the nodes.  (was: We 
are looking for help troubleshooting our production environment where we are 
experiencing GC problems. After much experimentation and troubleshooting with 
various settings, the only correlation that we can find with a slow growth in 
GC a slow growth in network traffic BETWEEN cassandra nodes in our cluster. As 
an example, I have attached a graph from a cluster of 24 nodes where I 
restarted 23 of them. Note that the outgoing rate for that 24th node remains 
high while all others drop after the restart. Also note that this graph shows 
ONLY traffic between cassandra nodes. Traffic from the clients remains FLAT 
throughout. Analyzing column family stats shows they are flat throughout. 
Cache hit rates are also consistent across nodes. GC is of course its own can 
of worms, so we are hoping this considerable increase in traffic (more than 
double over the course of 6 hrs) between nodes explains it. We would greatly 
appreciate any ideas as to why this extra network output correlates to 
uptime.)

> Slow inter-node network growth & gc issues with uptime
> --
>
> Key: CASSANDRA-11504
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11504
> Project: Cassandra
>  Issue Type: Bug
> Environment: Cassandra 2.1.13
>Reporter: Jeff Griffith
> Attachments: InterNodeTraffic.jpg
>
>
> We are looking for help troubleshooting our production environment where we 
> are experiencing GC problems. After much experimentation and troubleshooting 
> with various settings, the only correlation that we can find with a slow 
> growth in GC a slow growth in network traffic BETWEEN cassandra nodes in our 
> cluster. As an example, I have attached a graph from a cluster of 24 nodes 
> where I restarted 23 of them. Note that the outgoing rate for that 24th node 
> remains high while all others drop after the restart. Also note that this 
> graph shows ONLY traffic between cassandra nodes. Traffic from the clients 
> remains FLAT throughout. Analyzing column family stats shows they are flat 
> throughout. Cache hit rates are also consistent across nodes. GC is of course 
> its own can of worms, so we are hoping this considerable increase in traffic 
> (more than double over the course of 6 hrs) between nodes explains it. We 
> would greatly appreciate any ideas as to why this extra network output 
> correlates to uptime, or ideas on what to "diff" between the nodes.





[jira] [Created] (CASSANDRA-11504) Slow inter-node network growth & gc issues with uptime

2016-04-05 Thread Jeff Griffith (JIRA)
Jeff Griffith created CASSANDRA-11504:
-

 Summary: Slow inter-node network growth & gc issues with uptime
 Key: CASSANDRA-11504
 URL: https://issues.apache.org/jira/browse/CASSANDRA-11504
 Project: Cassandra
  Issue Type: Bug
 Environment: Cassandra 2.1.13
Reporter: Jeff Griffith
 Attachments: InterNodeTraffic.jpg

We are looking for help troubleshooting our production environment where we are 
experiencing GC problems. After much experimentation and troubleshooting with 
various settings, the only correlation that we can find with a slow growth in 
GC a slow growth in network traffic BETWEEN cassandra nodes in our cluster. As 
an example, I have attached a graph from a cluster of 24 nodes where I 
restarted 23 of them. Note that the outgoing rate for that 24th node remains 
high while all others drop after the restart. Also note that this graph shows 
ONLY traffic between cassandra nodes. Traffic from the clients remains FLAT 
throughout. Analyzing column family stats shows they are flat throughout. Cache 
hit rates are also consistent across nodes. GC is of course its own can of 
worms, so we are hoping this considerable increase in traffic (more than double 
over the course of 6 hrs) between nodes explains it. We would greatly 
appreciate any ideas as to why this extra network output correlates to uptime.





[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-12-02 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035856#comment-15035856
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

That's correct. Two problems led independently to the build-up:

Cause 1, fixed by Marcus: sstable leveling info was lost during sstable 
upgrade, leading to thread contention due to the large # of tables at L0.

Cause 2, fixed by Benedict: an index out of bounds exception caused by integer 
overflow.


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>  Labels: commitlog, triage
> Fix For: 3.0.1, 3.1, 2.1.x, 2.2.x
>
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.





[jira] [Commented] (CASSANDRA-10692) Don't remove level info when doing upgradesstables

2015-11-20 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018277#comment-15018277
 ] 

Jeff Griffith commented on CASSANDRA-10692:
---

[~krummas]

Can you confirm when the last commit was for this fix on the cassandra-2.1 
branch? In the comments it looks like you pushed something else 2 days ago (Nov 
18?), but all I see is this, back on Nov 12:

commit 246cb883ab09bc69e842b8124c1537b38bb54335
Author: Marcus Eriksson 
Date:   Thu Nov 12 08:12:01 2015 +0100

Thanks.

> Don't remove level info when doing upgradesstables
> --
>
> Key: CASSANDRA-10692
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10692
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
> Fix For: 2.1.12, 2.2.4
>
>
> Seems we blow away the level info when doing upgradesstables. Introduced in  
> CASSANDRA-8004





[jira] [Comment Edited] (CASSANDRA-10692) Don't remove level info when doing upgradesstables

2015-11-20 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018277#comment-15018277
 ] 

Jeff Griffith edited comment on CASSANDRA-10692 at 11/20/15 4:41 PM:
-

[~krummas]

Can you confirm when the last commit was for this fix on the cassandra-2.1 
branch? In the comments it looks like you pushed something else 2 days ago (Nov 
18?), but all I see is this, back on Nov 12:

commit 246cb883ab09bc69e842b8124c1537b38bb54335
Author: Marcus Eriksson 
Date:   Thu Nov 12 08:12:01 2015 +0100

(asking because I produced my own build after the 12th but before the 18th)

Thanks.



was (Author: jeffery.griffith):
[~krummas]

Can you confirm when the last commit was for this fix on the cassandra-2.1 
branch? In the comments it looks like you pushed something else 2 days ago (Nov 
18?), but all I see is this, back on Nov 12:

commit 246cb883ab09bc69e842b8124c1537b38bb54335
Author: Marcus Eriksson 
Date:   Thu Nov 12 08:12:01 2015 +0100

Thanks.

> Don't remove level info when doing upgradesstables
> --
>
> Key: CASSANDRA-10692
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10692
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
> Fix For: 2.1.12, 2.2.4
>
>
> Seems we blow away the level info when doing upgradesstables. Introduced in  
> CASSANDRA-8004





[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-11-12 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002048#comment-15002048
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Thanks [~krummas], I assume you mean this explains the large number of sstables 
(55k) we experienced? I see you've fixed it. I have moved to the latest 2.1, so 
this should help with our rollout.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>  Labels: commitlog, triage
> Fix For: 3.1, 2.1.x, 2.2.x
>
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.





[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-11-12 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002085#comment-15002085
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Good to know. We'll watch out for it and use the offline leveling trick you 
suggested.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Streaming and Messaging
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>  Labels: commitlog, triage
> Fix For: 3.1, 2.1.x, 2.2.x
>
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.





[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-11-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000585#comment-15000585
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

(I'll try to move things to 2.1.11 to simplify this)

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.12, 2.2.4
>
>
> Sometimes we have problems at startup where a memtable flush fails with an 
> index out of bounds exception, as seen below. Cassandra is then dead in the 
> water until we track down the corresponding commit log via the segment ID 
> and remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-11-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000519#comment-15000519
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

Hi again [~benedict],
Sorry to bug you with this again, but could you please confirm what is on your 
10579-fix branch? I'm trying to merge a few patches and it looks like there are 
several other things mixed in now. At one point, it was strictly based on 
2.1.10.
Thx,
--jg

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.12, 2.2.4
>
>
> Sometimes we have problems at startup where a memtable flush fails with an 
> index out of bounds exception, as seen below. Cassandra is then dead in the 
> water until we track down the corresponding commit log via the segment ID 
> and remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> 

[jira] [Comment Edited] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-11-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000585#comment-15000585
 ] 

Jeff Griffith edited comment on CASSANDRA-10579 at 11/11/15 4:29 PM:
-

(I'll try to move things to 2.1.11 to simplify this. Looks like it's based on 
the 2.1 branch though.)


was (Author: jeffery.griffith):
(I'll try to move things to 2.1.11 to simplify this)

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.12, 2.2.4
>
>
> Sometimes we have problems at startup where a memtable flush fails with an 
> index out of bounds exception, as seen below. Cassandra is then dead in the 
> water until we track down the corresponding commit log via the segment ID 
> and remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> 

[jira] [Comment Edited] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-11-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000585#comment-15000585
 ] 

Jeff Griffith edited comment on CASSANDRA-10579 at 11/11/15 4:29 PM:
-

(I'll try to move things to 2.1.11 to simplify this. Looks like it's based on 
the main 2.1 branch though.)


was (Author: jeffery.griffith):
(I'll try to move things to 2.1.11 to simplify this. Looks like it's based on 
the 2.1 branch though.)

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.12, 2.2.4
>
>
> Sometimes we have problems at startup where a memtable flush fails with an 
> index out of bounds exception, as seen below. Cassandra is then dead in the 
> water until we track down the corresponding commit log via the segment ID 
> and remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> 
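
(A note for readers following this workaround: the "segment ID" is the long 
embedded in the commit log file name, visible in the replay lines above as 
CommitLog-<version>-<segmentId>.log. Below is a minimal, hypothetical sketch 
of locating the offending file; it assumes the same directory as the log 
output above and is not part of Cassandra itself.)

{code}
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FindCommitLogSegment
{
    public static void main(String[] args)
    {
        // Segment ID taken from the replay lines above; the "4" matches
        // "CL version 4" in the same output.
        long segmentId = 1445474832694L;
        Path dir = Paths.get("/home/y/var/cassandra/commitlog");
        Path segment = dir.resolve("CommitLog-4-" + segmentId + ".log");

        // The workaround described above is to move this file aside and
        // restart, accepting the loss of any mutations it contains.
        System.out.println(Files.exists(segment)
                           ? "offending segment: " + segment
                           : "segment not found: " + segment);
    }
}
{code}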

[jira] [Commented] (CASSANDRA-7408) System hints corruption - dataSize ... would be larger than file

2015-11-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000950#comment-15000950
 ] 

Jeff Griffith commented on CASSANDRA-7408:
--

No problem, [~iamaleksey]. I seem to recall this being related to an issue I 
reported separately where a short integer was overflowing. Pretty sure it's 
all good now.
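
As a sketch of the failure mode described here (this is not the actual 
Cassandra serialization code, and the names below are hypothetical): a length 
stored in a 16-bit field silently wraps once a record grows past 
Short.MAX_VALUE, after which a reader trusting that length loses frame 
alignment and decodes arbitrary payload bytes as the next record's dataSize, 
producing impossible values like the 144115248479299639 above.

{code}
public class ShortLengthOverflow
{
    public static void main(String[] args)
    {
        int realLength = 70000;            // a record larger than 32767 bytes
        short stored = (short) realLength; // silently wraps: 70000 - 65536

        System.out.println(stored);        // prints 4464, not 70000

        // A deserializer that advances by 'stored' bytes stops ~64 KB short
        // of the real record boundary, so the next 8 bytes it interprets as
        // a dataSize are actually the middle of some row's data.
    }
}
{code}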

> System hints corruption - dataSize ... would be larger than file
> 
>
> Key: CASSANDRA-7408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7408
> Project: Cassandra
>  Issue Type: Bug
> Environment: RHEL 6.5
> Cassandra 1.2.16
> RF=3
> Thrift
>Reporter: Jeff Griffith
>
> I've found several unresolved JIRA tickets related to SSTable corruption but 
> not sure if they apply to the case we are seeing in system/hints. We see 
> periodic exceptions such as:
> {noformat}
> dataSize of 144115248479299639 starting at 17209 would be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> {noformat}
> Is there something we could possibly be doing from the application to cause 
> this sort of corruption? We also see it on some of our own column families 
> also some *negative* lengths which are presumably a similar corruption.
> {noformat}
> ERROR [HintedHandoff:57] 2014-06-17 17:08:04,690 CassandraDaemon.java (line 
> 191) Exception in thread Thread[HintedHandoff:57,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> dataSize of 144115248479299639 starting at 17209 would be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> at 
> org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:441)
> at 
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:282)
> at 
> org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:90)
> at 
> org.apache.cassandra.db.HintedHandOffManager$4.run(HintedHandOffManager.java:508)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> dataSize of 144115248479299639 starting at 17209 would be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:188)
> at 
> org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:437)
> ... 6 more
> Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
> java.io.IOException: dataSize of 144115248479299639 starting at 17209 would 
> be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:167)
> at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:83)
> at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:69)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:180)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:155)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:142)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)
> at 
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:145)
> at 
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
> at 
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
> at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> at 
> org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:145)
> at 
> org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> at 
> 

[jira] [Comment Edited] (CASSANDRA-7408) System hints corruption - dataSize ... would be larger than file

2015-11-11 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000950#comment-15000950
 ] 

Jeff Griffith edited comment on CASSANDRA-7408 at 11/11/15 7:46 PM:


No problem, [~iamaleksey]. I seem to recall this being related to an issue I 
reported separately, since fixed, where a short integer was overflowing. 
Pretty sure it's all good now.


was (Author: jeffery.griffith):
no problem [~iamaleksey]  i seem to recall this being related to an issue i 
reported separately where a short integer was overflowing. pretty sure it's all 
good now.

> System hints corruption - dataSize ... would be larger than file
> 
>
> Key: CASSANDRA-7408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7408
> Project: Cassandra
>  Issue Type: Bug
> Environment: RHEL 6.5
> Cassandra 1.2.16
> RF=3
> Thrift
>Reporter: Jeff Griffith
>
> I've found several unresolved JIRA tickets related to SSTable corruption but 
> not sure if they apply to the case we are seeing in system/hints. We see 
> periodic exceptions such as:
> {noformat}
> dataSize of 144115248479299639 starting at 17209 would be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> {noformat}
> Is there something we could possibly be doing from the application to cause 
> this sort of corruption? We also see it on some of our own column families 
> also some *negative* lengths which are presumably a similar corruption.
> {noformat}
> ERROR [HintedHandoff:57] 2014-06-17 17:08:04,690 CassandraDaemon.java (line 
> 191) Exception in thread Thread[HintedHandoff:57,1,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> dataSize of 144115248479299639 starting at 17209 would be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> at 
> org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:441)
> at 
> org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:282)
> at 
> org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:90)
> at 
> org.apache.cassandra.db.HintedHandOffManager$4.run(HintedHandOffManager.java:508)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: 
> dataSize of 144115248479299639 starting at 17209 would be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.util.concurrent.FutureTask.get(FutureTask.java:188)
> at 
> org.apache.cassandra.db.HintedHandOffManager.doDeliverHintsToEndpoint(HintedHandOffManager.java:437)
> ... 6 more
> Caused by: org.apache.cassandra.io.sstable.CorruptSSTableException: 
> java.io.IOException: dataSize of 144115248479299639 starting at 17209 would 
> be larger than file 
> /home/y/var/cassandra/data/system/hints/system-hints-ic-219-Data.db length 
> 35542
> at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:167)
> at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:83)
> at 
> org.apache.cassandra.io.sstable.SSTableIdentityIterator.(SSTableIdentityIterator.java:69)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:180)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:155)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:142)
> at 
> org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)
> at 
> org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:145)
> at 
> org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:122)
> at 
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:96)
> at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
> at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
> at 
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-11-02 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985217#comment-14985217
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

Does the NodeBuilder issue prevent me from going to prod with your branch, 
[~benedict]?

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-29 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980375#comment-14980375
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/29/15 12:58 PM:
--

[~blambov] [~benedict]

See image: Killed two birds with one stone here, it seems. Looking at the logs 
before the commit log growth, it looks like the IndexOutOfBounds exceptions 
affected all nodes in this small cluster of 3 at the same time; with RF=3 that 
probably makes sense, doesn't it?

https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg


was (Author: jeffery.griffith):
[~blambov] [~benedict]

Killed two birds with one stone here it seems. Looking at he logs before the 
commit log growth, it looks like the IndexOutOfBounds exceptions affected all 
nodes in this small cluster of 3 at the same time, with with RF=3 that probably 
makes sense, doesn't it?


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-29 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: CASSANDRA-19579.jpg

[~blambov] [~benedict]

Killed two birds with one stone here, it seems. Looking at the logs before the 
commit log growth, it looks like the IndexOutOfBounds exceptions affected all 
nodes in this small cluster of 3 at the same time; with RF=3 that probably 
makes sense, doesn't it?


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-29 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14980375#comment-14980375
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/29/15 12:59 PM:
--

[~blambov] [~benedict]

See image: Killed two birds with one stone here, it seems. Looking at the logs 
before the commit log growth, it looks like the IndexOutOfBounds exceptions 
affected all nodes in this small cluster of 3 at the same time; with RF=3 that 
probably makes sense, doesn't it?

https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg


was (Author: jeffery.griffith):
[~blambov] [~benedict]

See image: Killed two birds with one stone here it seems. Looking at he logs 
before the commit log growth, it looks like the IndexOutOfBounds exceptions 
affected all nodes in this small cluster of 3 at the same time, with with RF=3 
that probably makes sense, doesn't it?

https://issues.apache.org/jira/secure/attachment/12769525/CASSANDRA-19579.jpg

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CASSANDRA-19579.jpg, 
> CommitLogProblem.jpg, CommitLogSize.jpg, 
> MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, cassandra.yaml, 
> cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-28 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978368#comment-14978368
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] I have the test running now. It normally happens a couple of times 
a day, so I should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe our 3 remaining problematic 
nodes gradually took longer to reduce the # of L0 files than the rest of our 
clusters did. It took weeks rather than days, and coincidentally I got 
involved near the end. From what I saw, though, the symptoms seemed to match 
exactly what [~krummas] described.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-28 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978368#comment-14978368
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/28/15 12:58 PM:
--

Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] I have the test running now. It normally happens a couple of times 
a day, so I should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe our 3 remaining problematic 
nodes gradually took longer to reduce the # of L0 files than the rest of our 
clusters did. It took weeks rather than days, and coincidentally I got 
involved near the end. From what I saw, though, the symptoms seemed to match 
exactly what [~krummas] described.


was (Author: jeffery.griffith):
Great info, thanks [~benedict] ! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] i have the rest running now. normally happens a couple of times a 
day so i should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe that gradually our 3 remaining 
problematic nodes too longer to reduce the # of L0 files than did the rest of 
our clusters. It took weeks rather than days and coincidentally I got involved 
near the end. From what saw, though, the symptoms seemed to match exactly what 
[~krummas] described.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-28 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14978368#comment-14978368
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/28/15 1:00 PM:
-

Great info, thanks [~benedict]! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] I have the test running now. It normally happens a couple of times 
a day, so I should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe our 3 remaining problematic 
nodes gradually took longer to reduce the # of L0 files than the rest of our 
clusters did. It took weeks rather than days, and coincidentally I got 
involved near the end. From what I saw, though, the symptoms seemed to match 
exactly what [~krummas] described.


was (Author: jeffery.griffith):
Great info, thanks [~benedict] ! I guess my "all roads lead to Rome" analogy 
was a good one :-)

[~blambov] i have the test running now. normally happens a couple of times a 
day so i should know by this evening.

On that first form of growth, I unfortunately did not get a chance to try it 
before the problem corrected itself. I believe that gradually our 3 remaining 
problematic nodes too longer to reduce the # of L0 files than did the rest of 
our clusters. It took weeks rather than days and coincidentally I got involved 
near the end. From what saw, though, the symptoms seemed to match exactly what 
[~krummas] described.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976555#comment-14976555
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

So [~benedict], I rebuilt 2.1.10 with a merge of your diagnostic patch plus 
the changes you mention above for integer overflow. I tried this on a node 
where I had re-enabled assertions. I THINK, but am not certain, that the 
assertions suppress the commit log IndexOutOfBounds exception; I will confirm 
this. But the GOOD news is that this version DOES seem to fix the startup 
problem! I will confirm this on the next node that fails where assertions are 
off.
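
(For context, a sketch of the general integer-overflow pattern, since the 
actual patch is not reproduced in this thread; the numbers below are made up 
to show the wraparound and are not taken from the Cassandra code.)

{code}
public class IntOffsetOverflow
{
    public static void main(String[] args)
    {
        int cellCount = 70_000_000;
        int bytesPerCell = 40;

        int badOffset = cellCount * bytesPerCell;           // wraps to -1494967296
        long goodOffset = (long) cellCount * bytesPerCell;  // 2800000000, as intended

        System.out.println(badOffset);
        System.out.println(goodOffset);

        // Reading offheap memory at a wrapped offset lands on unrelated
        // bytes, which is one way a decoded field can surface as a nonsense
        // value like "ArrayIndexOutOfBoundsException: 6" far from the bug.
    }
}
{code}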


> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> 

[jira] [Comment Edited] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976555#comment-14976555
 ] 

Jeff Griffith edited comment on CASSANDRA-10579 at 10/27/15 3:34 PM:
-

So [~benedict], I rebuilt 2.1.10 with a merge of your diagnostic patch plus 
the changes you mention above for integer overflow. I tried this on a node 
where I had re-enabled assertions. I THINK, but am not certain, that the 
assertions suppress the commit log IndexOutOfBounds exception; I will confirm 
this. But the GOOD news is that this version DOES seem to fix the startup 
problem! I will confirm this on the next node that fails where assertions are 
off. By the way, it seems this may also be leading to sstable corruption 
(probably not surprising, since it's flushing sstables when the IOOB exception 
happens?)




was (Author: jeffery.griffith):
So [~benedict] I rebuilt 2.1.10 with a merge of your diagnostic patch plus the 
changes you mention above for integer overflow. I tried this on a node where i 
had re-enabled assertions. i THINK but i am not certain that the assertions 
suppress seeing the commit log IndexOutOfBounds exception, i will confirm this. 
but the GOOD news is that this version DOES seem to fix the startup problem! I 
will confirm this on the next node that fails where assertions are off.


> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976664#comment-14976664
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

Yes, we are also seeing sstable corruption, which we scrub. Not 100% certain 
it results from this index-out-of-bounds problem, though.

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> 

[jira] [Comment Edited] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976864#comment-14976864
 ] 

Jeff Griffith edited comment on CASSANDRA-10579 at 10/27/15 6:09 PM:
-

Perfect. Thanks again.


was (Author: jeffery.griffith):
perfect. thanks.

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976703#comment-14976703
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

My pleasure, thanks for the patch! We are running on 2.1.10. Is the patch only 
for 2.1.11?

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14976864#comment-14976864
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

perfect. thanks.

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977052#comment-14977052
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

[~krummas] [~tjake] something interesting on this second form of commit log 
growth where all nodes had uncontrolled commit log growth unlike the first 
example (many files in L0) where it was isolated nodes. for this latter case, I 
think i'm able to relate this to a separate problem with an index out of bounds 
exception. working with [~benedict] it seems like we have that one solved. i'm 
hopeful that patch will solve this growing commit log problem as well. it seems 
like all roads lead to rome where rome is commit log growth :-)

here is the other JIRA identifying an integer overflow in 
AbstractNativeCell.java
https://issues.apache.org/jira/browse/CASSANDRA-10579

Still uncertain how to proceed with the first form that seems to be starvation 
as you have described.
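
To make that overflow concrete: the class of bug named above is an offset or size accumulated in a 32-bit int wrapping negative once it passes Integer.MAX_VALUE, after which the value indexes outside the backing memory. A schematic sketch only (the names here are illustrative, not the actual AbstractNativeCell fields):

{code}
// Illustrative Java only -- not Cassandra's real code path.
public class IntOverflowSketch
{
    public static void main(String[] args)
    {
        int offset = Integer.MAX_VALUE - 10; // running offset kept in an int
        int cellSize = 100;                  // size of the next cell to place

        offset += cellSize;                  // silently wraps negative
        System.out.println("offset after overflow: " + offset); // -2147483559

        // Doing the arithmetic in a long (or via Math.addExact) avoids the wrap:
        long safe = (long) (Integer.MAX_VALUE - 10) + cellSize;
        System.out.println("offset kept in a long:  " + safe);
    }
}
{code}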


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace   table   completed     total          unit    progress
>         Compaction   SyncCore   *cf1*   61251208033   170643574558   bytes     35.89%
>         Compaction   SyncCore   *cf2*   19262483904   19266079916    bytes     99.98%
>         Compaction   SyncCore   *cf3*   6592197093    6592316682     bytes    100.00%
>         Compaction   SyncCore   *cf4*   3411039555    3411039557     bytes    100.00%
>         Compaction   SyncCore   *cf5*   2879241009    2879487621     bytes     99.99%
>         Compaction   SyncCore   *cf6*   21252493623   21252635196    bytes    100.00%
>         Compaction   SyncCore   *cf7*   81009853587   81009854438    bytes    100.00%
>         Compaction   SyncCore   *cf8*   3005734580    3005768582     bytes    100.00%
> Active compaction remaining time :   n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-27 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14977052#comment-14977052
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/27/15 7:59 PM:
-

[~krummas] [~tjake] something interesting on this second form of commit log 
growth where all nodes had uncontrolled commit log growth unlike the first 
example (many files in L0) where it was isolated nodes. for this latter case, I 
think i'm able to relate this to a separate problem with an index out of bounds 
exception. working with [~benedict] it seems like we have that one solved. i'm 
hopeful that patch will solve this growing commit log problem as well. it seems 
like all roads lead to rome where rome is commit log growth :-)

here is the other JIRA identifying an integer overflow in 
AbstractNativeCell.java
https://issues.apache.org/jira/browse/CASSANDRA-10579

Still uncertain how to proceed with the first form that seems to be starvation 
as you have described.



was (Author: jeffery.griffith):
[~krummas] [~tjake] something interesting on this second form of commit log 
growth where all nodes had uncontrolled commit log growth unless the first 
example (many files in L0) where it was isolated nodes. for this latter case, I 
think i'm able to relate this to a separate problem with an index out of bounds 
exception. working with [~benedict] it seems like we have that one solved. i'm 
hopeful that patch will solve this growing commit log problem as well. it seems 
like all roads lead to rome where rome is commit log growth :-)

here is the other JIRA identifying an integer overflow in 
AbstractNativeCell.java
https://issues.apache.org/jira/browse/CASSANDRA-10579

Still uncertain how to proceed with the first form that seems to be starvation 
as you have described.


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>    compaction type   keyspace   table   completed     total          unit    progress
>         Compaction   SyncCore   *cf1*   61251208033   170643574558   bytes     35.89%
>         Compaction   SyncCore   *cf2*   19262483904   19266079916    bytes     99.98%
>         Compaction   SyncCore   *cf3*   6592197093    6592316682     bytes    100.00%
>         Compaction   SyncCore   *cf4*   3411039555    3411039557     bytes    100.00%
>         Compaction   SyncCore   *cf5*   2879241009    2879487621     bytes     99.99%
>         Compaction   SyncCore   *cf6*   21252493623   21252635196    bytes    100.00%
>         Compaction   SyncCore   *cf7*   81009853587   81009854438    bytes    100.00%
>         Compaction   SyncCore   *cf8*   3005734580    3005768582     bytes    100.00%
> Active compaction remaining time :   n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974292#comment-14974292
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

thanks [~benedict]  i'll try to capture those. it seems to be tricky to 
identify the specific commit log causing the problem. i'm trying to do some 
math with the segment ID but haven't quite figured out how to isolate it. 
either way i'll try to attach something useful shortly.

re previous version, we have seen this before but since we just upgraded to 
2.1.10 it does seem to be becoming a more frequent occurrence.
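
For the segment-id math mentioned above: the trailing number in a commit log file name (e.g. CommitLog-4-1445474832692.log in the replay output quoted below) is the segment id, and in 2.1 the id base appears to be seeded from System.currentTimeMillis() at startup, so the id roughly decodes to the wall-clock time the segment pool was created. A minimal sketch under that assumption (class and method names are mine, not Cassandra's):

{code}
import java.time.Instant;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: parse "CommitLog-<version>-<id>.log" and, assuming the
// id is near a millisecond epoch (2.1 seeds the id base from currentTimeMillis),
// print an approximate creation time for the segment.
public class SegmentIdSketch
{
    private static final Pattern NAME = Pattern.compile("CommitLog-(\\d+)-(\\d+)\\.log");

    public static void main(String[] args)
    {
        String name = "CommitLog-4-1445474832692.log";
        Matcher m = NAME.matcher(name);
        if (m.matches())
        {
            long id = Long.parseLong(m.group(2));
            System.out.println("CL version " + m.group(1) + ", segment id " + id
                               + " ~ created near " + Instant.ofEpochMilli(id));
        }
    }
}
{code}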

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974323#comment-14974323
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

Thanks for the refinement. I'll check on the assertions.

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974335#comment-14974335
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

Yes, it doesn't look like we have -ea in our jvm opts.
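
For anyone checking the same thing: assertions are off by default in HotSpot, and -ea (long form -enableassertions) in the JVM_OPTS set by cassandra-env.sh is what turns them on. A standalone way to confirm what a running JVM was started with is the standard assert-assignment idiom (nothing Cassandra-specific here):

{code}
// Prints whether the JVM was launched with -ea; the assignment inside the
// assert statement only executes when assertions are enabled.
public class AssertionStatus
{
    public static void main(String[] args)
    {
        boolean enabled = false;
        assert enabled = true; // no-op without -ea
        System.out.println("assertions are " + (enabled ? "enabled" : "disabled"));
    }
}
{code}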

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974357#comment-14974357
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

I re-enabled assertions on this node and here is the first:

{code}
WARN  [SharedPool-Worker-7] 2015-10-26 15:10:44,777 
AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
Thread[SharedPool-Worker-7,5,main]: {}
java.lang.AssertionError: null
at 
org.apache.cassandra.db.AbstractNativeCell.checkPosition(AbstractNativeCell.java:585)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.getByteBuffer(AbstractNativeCell.java:657)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.get(AbstractNativeCell.java:304) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.get(AbstractNativeCell.java:291) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.sizeOf(AbstractNativeCell.java:132) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.&lt;init&gt;(AbstractNativeCell.java:120) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.NativeCell.&lt;init&gt;(NativeCell.java:40) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.memory.NativeAllocator.clone(NativeAllocator.java:72)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.NativeCell.localCopy(NativeCell.java:64) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:445)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:418)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.btree.NodeBuilder.addNewKey(NodeBuilder.java:322) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:190) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[na:1.8.0_31]
at 
org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) 
[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
{code}


> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> 

[jira] [Comment Edited] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974357#comment-14974357
 ] 

Jeff Griffith edited comment on CASSANDRA-10579 at 10/26/15 3:13 PM:
-

I re-enabled assertions on this node and here is the first:
{code}
WARN  [SharedPool-Worker-7] 2015-10-26 15:10:44,777 
AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
Thread[SharedPool-Worker-7,5,main]: {}
java.lang.AssertionError: null
at 
org.apache.cassandra.db.AbstractNativeCell.checkPosition(AbstractNativeCell.java:585)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.getByteBuffer(AbstractNativeCell.java:657)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.get(AbstractNativeCell.java:304) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.get(AbstractNativeCell.java:291) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.sizeOf(AbstractNativeCell.java:132) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.&lt;init&gt;(AbstractNativeCell.java:120) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.NativeCell.&lt;init&gt;(NativeCell.java:40) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.memory.NativeAllocator.clone(NativeAllocator.java:72)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.NativeCell.localCopy(NativeCell.java:64) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:445)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns$ColumnUpdater.apply(AtomicBTreeColumns.java:418)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.btree.NodeBuilder.addNewKey(NodeBuilder.java:322) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:190) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[na:1.8.0_31]
at 
org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) 
[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
{code}


was (Author: jeffery.griffith):
I re-enabled assertions on this node and here is the first:

WARN  [SharedPool-Worker-7] 2015-10-26 15:10:44,777 
AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
Thread[SharedPool-Worker-7,5,main]: {}
java.lang.AssertionError: null
at 
org.apache.cassandra.db.AbstractNativeCell.checkPosition(AbstractNativeCell.java:585)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.getByteBuffer(AbstractNativeCell.java:657)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.get(AbstractNativeCell.java:304) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.get(AbstractNativeCell.java:291) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.sizeOf(AbstractNativeCell.java:132) 

[jira] [Comment Edited] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974335#comment-14974335
 ] 

Jeff Griffith edited comment on CASSANDRA-10579 at 10/26/15 2:57 PM:
-

Yes i think they are disabled. It doesn't look like we have -ea in our jvm opts.


was (Author: jeffery.griffith):
Yes, it doesn't look like we have -ea in our jvm opts.

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> 

[jira] [Commented] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup (with offheap_objects)

2015-10-26 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14974622#comment-14974622
 ] 

Jeff Griffith commented on CASSANDRA-10579:
---

Great thanks [~benedict]. i'll merge both changes in and give it a try.

> IndexOutOfBoundsException during memtable flushing at startup (with 
> offheap_objects)
> 
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>Assignee: Benedict
> Fix For: 2.1.x
>
>
> Sometimes we have problems at startup where memtable flushes with an index 
> out of bounds exception as seen below. Cassandra is then dead in the water 
> until we track down the corresponding commit log via the segment ID and 
> remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> 

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971417#comment-14971417
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:34 PM:
-

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid 
changed because our monitoring detected and restarted the node.

{code}
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}
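
As a sketch of why a stuck MemtablePostFlush looks exactly like this: the post-flush executor is single-threaded, so one task parked on a latch that never counts down queues every later task behind it, and the pending number climbs just as in the file names above. The following is a plain CountDownLatch illustration, not the real ColumnFamilyStore internals:

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative names only; not Cassandra's actual post-flush code path.
public class PostFlushStallSketch
{
    public static void main(String[] args) throws InterruptedException
    {
        ExecutorService postFlush = Executors.newSingleThreadExecutor();
        CountDownLatch flushWritten = new CountDownLatch(1);

        // The first task blocks awaiting a signal that never arrives here...
        postFlush.submit(() -> {
            try
            {
                flushWritten.await();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
            }
        });

        // ...so everything submitted afterwards just accumulates as "pending".
        for (int i = 0; i < 5; i++)
            postFlush.submit(() -> {});

        TimeUnit.SECONDS.sleep(1);
        System.out.println("5 tasks now queued behind the stuck one");
        postFlush.shutdownNow(); // interrupts the await so the JVM can exit
    }
}
{code}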



was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid 
changed because our monitoring detected and restarted the node.

{code}
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:16 
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:18 
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:20 
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:22 
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:24 
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:26 
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:28 
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:30 
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:32 
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:34 
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:36 
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:38 
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:40 
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:43 
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r--  1 

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971417#comment-14971417
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:33 PM:
-

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

This happened twice during this period and here is the first one. Note the pid 
changed because our monitoring detected and restarted the node.

{code}
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:16 
tpstats_20151023-00:16:02_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:18 
tpstats_20151023-00:18:08_pid_37996_postpend_1.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:20 
tpstats_20151023-00:20:14_pid_37996_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:22 
tpstats_20151023-00:22:19_pid_37996_postpend_3.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:24 
tpstats_20151023-00:24:25_pid_37996_postpend_133.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:26 
tpstats_20151023-00:26:30_pid_37996_postpend_809.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:28 
tpstats_20151023-00:28:35_pid_37996_postpend_1596.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:30 
tpstats_20151023-00:30:39_pid_37996_postpend_2258.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:32 
tpstats_20151023-00:32:42_pid_37996_postpend_3095.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:34 
tpstats_20151023-00:34:45_pid_37996_postpend_3822.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:36 
tpstats_20151023-00:36:48_pid_37996_postpend_4593.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:38 
tpstats_20151023-00:38:52_pid_37996_postpend_5363.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:40 
tpstats_20151023-00:40:55_pid_37996_postpend_6212.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:43 
tpstats_20151023-00:42:59_pid_37996_postpend_7137.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:45 
tpstats_20151023-00:45:03_pid_37996_postpend_8559.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:47 
tpstats_20151023-00:47:06_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:49 
tpstats_20151023-00:49:09_pid_37996_postpend_9060.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:51 
tpstats_20151023-00:51:11_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2002 Oct 22 20:53 
tpstats_20151023-00:53:13_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:55 
tpstats_20151023-00:55:16_pid_48196_postpend_0.txt
-rw-r--r--  1 jgriffith  Y\Domain Users  2180 Oct 22 20:57 
tpstats_20151023-00:57:21_pid_48196_postpend_0.txt

{code}



was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G 
setting but with different behavior. I have captured the whole thing with 
thread dumps and tpstats every two minutes. I've embedded pending numbers in 
the filenames for your convenience to make it easy to see where the backup 
starts. *-node1.tar.gz is the only one i uploaded since the files were so 
large, but note in the Dashboard.jpg file that all three nodes break the limit 
at about the same time. I can upload the others if it is useful. This case 
seems different from the previous case where there were lots of L0 files 
causing thread blocking, but even here it seems like the MemtablePostFlush is 
stopping on a countdownlatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971417#comment-14971417
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/23/15 5:31 PM:
-

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting, but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded the pending numbers in the filenames for your convenience, to make it easy to see where the backup starts. *-node1.tar.gz is the only one I uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous one, where lots of L0 files caused thread blocking, but even here it looks like MemtablePostFlush is blocking on a CountDownLatch.

https://issues.apache.org/jira/secure/attachment/12768344/MultinodeCommitLogGrowth-node1.tar.gz


was (Author: jeffery.griffith):
[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting, but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded the pending numbers in the filenames for your convenience, to make it easy to see where the backup starts. *-node1.tar.gz is the only one I uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous one, where lots of L0 files caused thread blocking, but even here it looks like MemtablePostFlush is blocking on a CountDownLatch.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-23 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: MultinodeCommitLogGrowth-node1.tar.gz

[~krummas] [~tjake] Here is a separate instance of commit logs breaking our 12G setting, but with different behavior. I have captured the whole thing with thread dumps and tpstats every two minutes. I've embedded the pending numbers in the filenames for your convenience, to make it easy to see where the backup starts. *-node1.tar.gz is the only one I uploaded since the files were so large, but note in the Dashboard.jpg file that all three nodes break the limit at about the same time. I can upload the others if it is useful. This case seems different from the previous one, where lots of L0 files caused thread blocking, but even here it looks like MemtablePostFlush is blocking on a CountDownLatch.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, MultinodeCommitLogGrowth-node1.tar.gz, RUN3tpstats.jpg, 
> cassandra.yaml, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10579) IndexOutOfBoundsException during memtable flushing at startup

2015-10-23 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10579:
--
Summary: IndexOutOfBoundsException during memtable flushing at startup  
(was: IndexOutOfBoundsException)

> IndexOutOfBoundsException during memtable flushing at startup
> -
>
> Key: CASSANDRA-10579
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.1.10 on linux
>Reporter: Jeff Griffith
>
> Sometimes we have problems at startup where a memtable flush fails with an index-out-of-bounds exception, as seen below. Cassandra is then dead in the water until we track down the corresponding commit log via the segment ID and remove it:
> {code}
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
> INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
> messaging version 8)
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
> reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
> INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
> /home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
> messaging version 8)
> WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
> AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
> Thread[SharedPool-Worker-5,5,main]: {}
> java.lang.ArrayIndexOutOfBoundsException: 6
> at 
> org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
>  ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
> ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_31]
> at 
> org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
>  

[jira] [Created] (CASSANDRA-10579) IndexOutOfBoundsException

2015-10-23 Thread Jeff Griffith (JIRA)
Jeff Griffith created CASSANDRA-10579:
-

 Summary: IndexOutOfBoundsException
 Key: CASSANDRA-10579
 URL: https://issues.apache.org/jira/browse/CASSANDRA-10579
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 2.1.10 on linux
Reporter: Jeff Griffith


Sometimes we have problems at startup where a memtable flush fails with an index-out-of-bounds exception, as seen below. Cassandra is then dead in the water until we track down the corresponding commit log via the segment ID and remove it:

{code}
INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:267 - Replaying 
/home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
INFO  [main] 2015-10-23 14:43:36,440 CommitLogReplayer.java:270 - Replaying 
/home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log (CL version 4, 
messaging version 8)
INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:478 - Finished 
reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832692.log
INFO  [main] 2015-10-23 14:43:36,594 CommitLogReplayer.java:267 - Replaying 
/home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
INFO  [main] 2015-10-23 14:43:36,595 CommitLogReplayer.java:270 - Replaying 
/home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log (CL version 4, 
messaging version 8)
INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:478 - Finished 
reading /home/y/var/cassandra/commitlog/CommitLog-4-1445474832693.log
INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:267 - Replaying 
/home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log
INFO  [main] 2015-10-23 14:43:36,699 CommitLogReplayer.java:270 - Replaying 
/home/y/var/cassandra/commitlog/CommitLog-4-1445474832694.log (CL version 4, 
messaging version 8)
WARN  [SharedPool-Worker-5] 2015-10-23 14:43:36,747 
AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread 
Thread[SharedPool-Worker-5,5,main]: {}
java.lang.ArrayIndexOutOfBoundsException: 6
at 
org.apache.cassandra.db.AbstractNativeCell.nametype(AbstractNativeCell.java:204)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AbstractNativeCell.isStatic(AbstractNativeCell.java:199)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.composites.AbstractCType.compare(AbstractCType.java:166)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:61)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.composites.AbstractCellNameType$1.compare(AbstractCellNameType.java:58)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.BTree.find(BTree.java:277) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:154) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.Builder.update(Builder.java:74) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.utils.btree.BTree.update(BTree.java:186) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.AtomicBTreeColumns.addAllWithSizeDelta(AtomicBTreeColumns.java:225)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Memtable.put(Memtable.java:210) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1225) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:396) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:359) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:455)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) 
~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
~[na:1.8.0_31]
at 
org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164)
 ~[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) 
[apache-cassandra-2.1.10.jar:2.1.10-SNAPSHOT]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
{code}
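
Since the segment ID in those "Replaying" lines is embedded in the file name (CommitLog-<version>-<segmentId>.log), tracking down the offending log is just a directory scan. A minimal sketch, assuming the commitlog directory and segment shown in the log above (illustrative values only; double-check before removing anything):

{code}
import java.io.File;

public class FindCommitLogSegment {
    public static void main(String[] args) {
        // Assumed values: the directory from the log above and the last
        // segment that was being replayed when the exception fired.
        File dir = new File("/home/y/var/cassandra/commitlog");
        String segmentId = "1445474832694";

        File[] segments = dir.listFiles();
        if (segments == null) {
            System.err.println("not a directory: " + dir);
            return;
        }
        for (File f : segments) {
            // File names look like CommitLog-4-1445474832694.log,
            // i.e. CommitLog-<version>-<segmentId>.log
            if (f.getName().endsWith("-" + segmentId + ".log")) {
                System.out.println("offending segment: " + f.getAbsolutePath());
            }
        }
    }
}
{code}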



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-20 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965636#comment-14965636
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Hi again [~krummas].
Before trying the leveling, the remaining problematic clusters seemed to work through the too-many-files-in-L0 problem. They were all trending downward, but there were several days where it was very frequent. Alas, the isolated node with large SSTable counts does not seem to be the only case where commit logs break the limit. I'm tempted to open this as a separate issue, but let's see what you think first. In some cases we see all 3 nodes in those small clusters break the limit at the same time. I will do better monitoring, but I did manage to catch one in progress, and here is what I observed. There were not a lot of blocked threads like before, but it did have the MemtablePostFlusher blocked on the countdown latch. Here are the tpstats for that:
{code}
MemtableFlushWriter               8        30           7200         0                 0
MemtablePostFlush                 1     45879          16841         0                 0
MemtableReclaimMemory             0         0           7199         0                 0
{code}
That's 46K pending. The only thread I see for that pool is here:
{code}
"MemtablePostFlush:3" #3054 daemon prio=5 os_prio=0 tid=0x7f806fb71000 
nid=0x2e5c waiting on condition [0x7f804366c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005de8976f8> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
I don't know what counts that latch down, but there were a couple of blocked threads here:
{code}
"HintedHandoff:2" #1429 daemon prio=1 os_prio=4 tid=0x7f80895c4800 
nid=0x1242 waiting for monitor entry [0x7f804321b000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.HintedHandOffManager.compact(HintedHandOffManager.java:267)
- waiting to lock <0x0004e2e689a8> (a 
org.apache.cassandra.db.HintedHandOffManager)
at 
org.apache.cassandra.db.HintedHandOffManager$5.run(HintedHandOffManager.java:561)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

"HintedHandoff:1" #1428 daemon prio=1 os_prio=4 tid=0x7f80895c3800 
nid=0x1241 waiting for monitor entry [0x7f7838855000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.HintedHandOffManager.compact(HintedHandOffManager.java:267)
- waiting to lock <0x0004e2e689a8> (a 
org.apache.cassandra.db.HintedHandOffManager)
at 
org.apache.cassandra.db.HintedHandOffManager$5.run(HintedHandOffManager.java:561)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
and the lock was held here:
{code}
"HintedHandoffManager:1" #1430 daemon prio=1 os_prio=4 tid=0x7f808aaf1800 
nid=0x1243 waiting on condition [0x7f8043423000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00060bdc0b98> (a 
java.util.concurrent.FutureTask)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at 
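
To make the hang mechanics in those stacks concrete, here is a minimal, self-contained sketch of the latch pattern (hypothetical names, not Cassandra's actual code): a post-flush task parks on a CountDownLatch until every outstanding flush counts it down, so a single flush that never finishes wedges everything queued behind the waiter.

{code}
import java.util.concurrent.CountDownLatch;

public class PostFlushLatchSketch {
    public static void main(String[] args) throws InterruptedException {
        int pendingFlushes = 3;
        CountDownLatch latch = new CountDownLatch(pendingFlushes);

        for (int i = 0; i < pendingFlushes; i++) {
            new Thread(() -> {
                // ... write the memtable out as an sstable ...
                latch.countDown();   // each completed flush counts down once
            }, "MemtableFlushWriter-" + i).start();
        }

        // The post-flush task parks here, just like the
        // ColumnFamilyStore$PostFlush frame in the dump above; if any
        // writer never reaches countDown(), this waits forever.
        latch.await();
        System.out.println("all flushes done; commit log segments can be recycled");
    }
}
{code}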

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960606#comment-14960606
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:47 PM:
--

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

For your convenience I grabbed the SSTable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687



was (Author: jeffery.griffith):
Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: cfstats-clean.txt

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960606#comment-14960606
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cfstats-clean.txt, stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: cassandra.yaml

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960606#comment-14960606
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:51 PM:
--

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

For your convenience I grabbed the SSTable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687

Also, I've attached our cassandra.yaml.


was (Author: jeffery.griffith):
Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

For your convenience I grabbed the SSTable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960606#comment-14960606
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:59 PM:
--

Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

For your convenience I grabbed the SSTable counts > 500:
SSTable count: 3454
SSTable count: 55392 <--- indeed this is NOT the case on other nodes
SSTable count: 687

Also, I've attached our cassandra.yaml.


was (Author: jeffery.griffith):
Thanks [~krummas], see cfstats-clean.txt, which I obfuscated and uploaded. We didn't actually name them CF001 ;-)

For your convenience I grabbed the SSTable counts > 500:
SSTable count: 3454
SSTable count: 55392 <---
SSTable count: 687

Also, I've attached our cassandra.yaml.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960663#comment-14960663
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:14 PM:
-

Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, and that is NOT the case on the other nodes. Yes, they are balanced in terms of data (40-core machines with lots of memory). In this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only THREE nodes of the thirty are now exhibiting this behavior. For the first few days several others did; however, they seem to have self-corrected. I will go back and check for large SSTable counts to see if that explains all of them. After this first stage we'll be rolling out to the larger 24-node clusters, but we are pausing here on the small clusters until we figure this out.



was (Author: jeffery.griffith):
Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, and that is NOT the case on the other nodes. Yes, they are balanced in terms of data (40-core machines with lots of memory). In this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only THREE nodes of the thirty are now exhibiting this behavior. For the first few days several others did; however, they seem to have self-corrected. I will go back and check for large SSTable counts to see if that explains all of them. After this first stage we'll be rolling out to the larger 24-node customers, but we are pausing here on the small clusters until we figure this out.


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: C5commitLogIncrease.jpg

[~krummas] I checked a different cluster for SSTable counts. (C5commitLogIncrease.jpg) Here they all decided to break the limit at the same time. The largest SSTable count in each is about 5K.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, 
> stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960663#comment-14960663
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:16 PM:
-

Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, and that is NOT the case on the other nodes. Yes, they are balanced in terms of data (40-core machines with lots of memory). In this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only THREE nodes of the thirty are now exhibiting this behavior. For the first few days several others did; however, they seem to have self-corrected. These three still have not. I will go back and check for large SSTable counts to see if that explains all of them. After this first stage we'll be rolling out to the larger 24-node clusters, but we are pausing here on the small clusters until we figure this out.



was (Author: jeffery.griffith):
Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, and that is NOT the case on the other nodes. Yes, they are balanced in terms of data (40-core machines with lots of memory). In this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only THREE nodes of the thirty are now exhibiting this behavior. For the first few days several others did; however, they seem to have self-corrected. I will go back and check for large SSTable counts to see if that explains all of them. After this first stage we'll be rolling out to the larger 24-node clusters, but we are pausing here on the small clusters until we figure this out.


> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960755#comment-14960755
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:07 PM:
-

[~krummas] I checked a different cluster for SSTable counts (see C5commitLogIncrease.jpg). Here they all decided to break the limit at the same time. The largest SSTable count in each is about 5K.


was (Author: jeffery.griffith):
[~krummas] I checked a different cluster for SSTable counts. (C5commitLogIncrease.jpg) Here they all decided to break the limit at the same time. The largest SSTable count in each is about 5K.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: C5commitLogIncrease.jpg, CommitLogProblem.jpg, 
> CommitLogSize.jpg, RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, 
> stacktrace.txt, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960643#comment-14960643
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

No, that is one node.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960650#comment-14960650
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Indeed, the symptoms here look like the other JIRA you mentioned. I have followed the thread dumps over time, and it looks very much like it's spending a lot of time in the "overlapping" calculation, as you see above.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, 
> RUN3tpstats.jpg, cassandra.yaml, cfstats-clean.txt, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960663#comment-14960663
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Yes, we just upgraded from 2.0. Would that explain the 50k+? I have checked, and that is NOT the case on the other nodes. Yes, they are balanced in terms of data (40-core machines with lots of memory). In this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). Only THREE nodes of the thirty are now exhibiting this behavior. For the first few days several others did; however, they seem to have self-corrected. I will go back and check for large SSTable counts to see if that explains all of them. After this first stage we'll be rolling out to the larger 24-node customers, but we are pausing here on the small clusters until we figure this out.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960663#comment-14960663
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 1:17 PM:
-

Yes, we just upgraded from 2.0. Would that explain the 50k+? I will dig deeper 
on that CF. I have checked, and that is NOT the case on the other nodes. Yes, 
they are balanced in terms of data (40-core machines with lots of memory). In 
this stage of our rollout to 2.1, we have gone to 10 small clusters of 3 nodes 
each (30 nodes total). Only THREE of the thirty nodes are now exhibiting this 
behavior; for the first few days several others did, however they seem to have 
self-corrected. These three still have not. I will go back and check for large 
sstable counts to see if that explains all of them. After this first stage, 
we'll be rolling out to the larger 24-node clusters, but we are pausing here on 
the small clusters until we figure this out.



was (Author: jeffery.griffith):
yes, we just upgraded from 2.0. would that explain the 50k+? i have checked and 
that is NOT the case on the other nodes. yes, they are balanced in terms of 
data (40 core machines with lots of memory). in this stage of our rollout to 
2.1, we have gone to 10 small clusters of 3 nodes each (30 nodes total). only 
THREE nodes of the thirty are now exhibiting this behavior. for the first few 
days, several others did however they seem to have self corrected. these three 
still have not. I will go back and check for large sstable counts to see if 
that explains all of them. after this first stage, we'll be rolling out to the 
larger 24-node clusters but we are pausing here on the small clusters until we 
figure this out.




[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960678#comment-14960678
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Great, thanks [~krummas]. Please let me know if there is any more information I 
can provide to help resolve it. I'll get more info across the 30 nodes on 
large sstables and make sure this correlates to the problem.



[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960763#comment-14960763
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

No, this is only the beginning (300M users and a petabyte to go :) ). We were 
kind of pausing here since the bigger clusters carry so many more users, but we 
can definitely do that if we move forward with the rollout. We'll apply the 
releveling where we can and see how it behaves. Is the 5K sstable count enough 
to be concerned about? I'll do some more analysis on these and compare to 
clusters that have not yet been upgraded.




[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960763#comment-14960763
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:17 PM:
-

No, this is only the beginning (300M users and a petabyte to go :) ). We were 
kind of pausing here since the bigger clusters carry so many more users, but we 
can definitely do that if we move forward with the rollout. We'll apply the 
releveling where we can and see how it behaves. Is the 5K sstable count enough 
to be concerned about? I'll do some more analysis on these and compare to 
clusters that have not yet been upgraded.



was (Author: jeffery.griffith):
No, this is only the beginning (300M users and a petabyte to go :) ) . We were 
kind of pausing here since the bigger clusters carry so many more users but we 
can definitely do that if we move forward with the rollout. We'll apply the 
releveling where can and see how it behaviors. is the 5K sstable count enough 
to be concerned about? i'll do some more analysis on these and compare to 
clusters that have not yet been upgraded.




[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960769#comment-14960769
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

OK, that's good news then. We'll apply tools/bin/sstableofflinerelevel to 
everything above 1K or so?
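
For the record, presumably something along these lines per table, run with the node stopped first. This assumes the usual invocation of the 2.1 offline tool, and SyncCore/cf1 are just placeholders taken from the compactionstats output above:

{code}
# node must be down; --dry-run only prints the proposed new leveling
tools/bin/sstableofflinerelevel --dry-run SyncCore cf1
tools/bin/sstableofflinerelevel SyncCore cf1
{code}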



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-16 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960769#comment-14960769
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 2:20 PM:
-

OK, that's good news then, although the weird synchronization of that last 
example concerns me. We'll apply tools/bin/sstableofflinerelevel to everything 
above 1K or so?


was (Author: jeffery.griffith):
ok, that's good news then. we'll apply tools/bin/sstableofflinerelevel to 
everything above 1K or so?



[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:00 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the locker never 
completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks; I believe that if I let it continue to run, all or many of them 
will.

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
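
For what it's worth, the reason this frame shows up RUNNABLE so often: for each candidate, the manifest scans an entire level for overlapping token bounds, so the cost of one getNextBackgroundTask call grows with the sstable count, and the scan runs while holding the strategy monitor. A simplified sketch of that shape follows; this is my own illustration, not the actual LeveledManifest code, and Range stands in for the real Bounds/SSTableReader types:

{code}
import java.util.ArrayList;
import java.util.List;

// Illustration only, not Cassandra's code: a linear overlap scan executed
// under the strategy monitor. With tens of thousands of sstables in a level,
// each call does a long scan while every other caller waits on the lock.
class OverlapScanSketch
{
    static final class Range
    {
        final long left, right;  // token bounds of one sstable
        Range(long left, long right) { this.left = left; this.right = right; }
        boolean intersects(Range o) { return left <= o.right && o.left <= right; }
    }

    // Analogue of LeveledManifest.overlapping(): check the candidate against
    // every sstable in the level, O(levelSize) per candidate.
    static List<Range> overlapping(Range candidate, List<Range> level)
    {
        List<Range> overlaps = new ArrayList<>();
        for (Range r : level)
            if (candidate.intersects(r))
                overlaps.add(r);
        return overlaps;
    }

    // Analogue of the synchronized getNextBackgroundTask() in the stack above.
    synchronized List<Range> nextBackgroundTask(Range candidate, List<Range> level)
    {
        return overlapping(candidate, level);
    }
}
{code}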


was (Author: jeffery.griffith):
[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview is:
Monitoring threads began to block before the memtable flushing did.
Memtable flushing seemed to be progressing slowly and then post flushing 
operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked so assume (?) the locker never 
completed or was very, very slow:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 11:57 PM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the locker never 
completed or was very, very slow:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


was (Author: jeffery.griffith):
[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview is:
Monitoring threads began to block before the memtable flushing did.
Memtable flushing seemed to be progressing slowly and then post flushing 
operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked so assume (?) the locker never 
completed:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at 

[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: RUN3tpstats.jpg

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the locker never 
completed:

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:05 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the locker never 
completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks; I believe that if I let it continue to run, all or many of them 
will.

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


I see one thread for MemtablePostFlush and this is it:

{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 
nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005838ba468> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
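
That stack matches a latch hand-off: the post-flush task cannot run until the flush it is waiting on counts the latch down, so a single stalled flush wedges the whole single-threaded MemtablePostFlush stage and everything queued behind it. A minimal sketch of the pattern, illustrative only and not the actual ColumnFamilyStore code:

{code}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch of the flush/post-flush hand-off: post-flush work queues on
// a single-threaded executor and awaits a latch that the flush writer counts
// down. If the writer stalls (e.g. blocked on the compaction strategy
// monitor), the post-flush stage waits forever, as in the stack above.
class PostFlushSketch
{
    public static void main(String[] args)
    {
        CountDownLatch flushed = new CountDownLatch(1);
        ExecutorService flushWriter = Executors.newSingleThreadExecutor();
        ExecutorService postFlush = Executors.newSingleThreadExecutor();

        postFlush.submit(() -> {
            try
            {
                flushed.await(); // the parked frame seen in MemtablePostFlush:8
                System.out.println("commit log segments can now be discarded");
            }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        flushWriter.submit(() -> {
            // ... write the memtable out to an sstable ...
            flushed.countDown(); // never reached if the writer is blocked
        });

        flushWriter.shutdown();
        postFlush.shutdown();
    }
}
{code}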


was (Author: jeffery.griffith):
[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview is:
Monitoring threads began to block before the memtable flushing did.
Memtable flushing seemed to be progressing slowly and then post flushing 
operations began to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked so assume (?) the 

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:13 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the locker never 
completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks; I believe that if I let it continue to run, all or many of them 
will.

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


I see one thread for MemtablePostFlush and this is it:

{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 
nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005838ba468> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

I followed it for a while longer after this, and it really looks like the post 
flush stays blocked on that latch forever:

{code}
00:01
MemtableFlushWriter       2     2      2024     0     0
MemtablePostFlush         1 47159      4277     0     0
MemtableReclaimMemory     0     0      2024     0     0

00:03
MemtableFlushWriter       3     3      2075     0     0
MemtablePostFlush

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959885#comment-14959885
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/16/15 12:14 AM:
--

[~tjake] I monitored live for a few hours to capture the behavior. See 
RUN3tpstats.jpg in the attachments:

Overview:
Monitoring threads began to block before the memtable flushing did. Memtable 
flushing seemed to be progressing slowly, and then post-flush operations began 
to pile up. The primary things blocked were:
1. MemtableFlushWriter/handleNotif
2. CompactionExec/getNextBGTask
3. ServiceThread/getEstimatedRemTask

Those three blocked and never came unblocked, so I assume (?) the locker never 
completed or was very, very slow. Eventually a second MemtableFlushWriter 
thread blocks; I believe that if I let it continue to run, all or many of them 
will.

{code}
"CompactionExecutor:18" #1462 daemon prio=1 os_prio=4 tid=0x7fd96141 
nid=0x728b runnable [0x7fda4ae0b000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004a8bc5038> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004a8af17d0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004a894df10> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


I see one thread for MemtablePostFlush and this is it:

{code}
"MemtablePostFlush:8" #4866 daemon prio=5 os_prio=0 tid=0x7fd91c0c5800 
nid=0x2d93 waiting on condition [0x7fda4b46c000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0005838ba468> (a 
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
at 
org.apache.cassandra.db.ColumnFamilyStore$PostFlush.run(ColumnFamilyStore.java:998)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

I followed it for a while longer after this and it really looks like the post 
flush stays blocked on that latch forever:

{code}
00:01
MemtableFlushWriter       2     2      2024     0     0
MemtablePostFlush         1 47159      4277     0     0
MemtableReclaimMemory     0     0      2024     0     0

00:03
MemtableFlushWriter       3     3      2075     0     0
MemtablePostFlush

[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959525#comment-14959525
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 8:06 PM:
-

Yeah, it doesn't look like the locking thread is deadlocked at all. I know this 
is a stretch, but considering we just migrated from 2.0.x, could there be 
something data-specific that is confusing the compaction? Not sure where to 
check for slow flushes. Should I just watch tpstats?


was (Author: jeffery.griffith):
Yeah doesn't look blocked. How can i check for the slow flushes?



[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959608#comment-14959608
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

I had restarted, but I'll watch live on the next iteration. As you can see 
further up in the comments, though, they do start piling up:
MemtableFlushWriter       1     1      1574     0     0
MemtablePostFlush         1 13755    134889     0     0
MemtableReclaimMemory     0     0      1574     0     0

In the previous iteration, there were four MemtableFlushWriter threads all 
blocked behind the runnable 
LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
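
That shape is a plain monitor convoy: one CompactionExecutor thread stays RUNNABLE inside the synchronized strategy while the flush writers' notifications queue up BLOCKED on the same monitor. A toy illustration of the pattern, hypothetical and not Cassandra code:

{code}
// Toy monitor convoy, not Cassandra code: one thread holds the strategy lock
// in a long scan; every flush notification piles up BLOCKED behind it, which
// matches the tpstats picture above.
class ConvoySketch
{
    private final Object strategyLock = new Object();

    void candidateScan(long sstables)
    {
        synchronized (strategyLock)
        {
            for (long i = 0; i < sstables; i++)
            {
                // simulate the per-sstable overlap checks
            }
        }
    }

    void handleFlushNotification()
    {
        synchronized (strategyLock)   // flush writers park here during the scan
        {
            // add the freshly flushed sstable to the manifest
        }
    }
}
{code}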




[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959525#comment-14959525
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Yeah, doesn't look blocked. How can I check for the slow flushes?



[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959218#comment-14959218
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

BTW, we tried commitlog_segment_recycling: false, but we realized afterwards 
that this should already be the default. We briefly thought it made a 
difference after restarting that node, but the problem did return after 
several hours. There is some mention in another JIRA about tuning the number 
of memtable flush writers; could this be an issue? It's still difficult to 
explain, though, why we only see this on a few nodes across the ten clusters, 
all with the same config.

Will try to get the thread dump ASAP.
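
For context, both knobs being discussed live in cassandra.yaml (the one attached to this ticket). A hypothetical excerpt, values purely illustrative and not a recommendation:

{code}
# hypothetical cassandra.yaml excerpt, illustrative values only
memtable_flush_writers: 8            # more parallel flush threads
commitlog_segment_recycling: false   # noted above as already the default on 2.1.x
{code}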




[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959208#comment-14959208
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Working on it, [~mishail].

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959498#comment-14959498
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

A second iteration. Ran into a second instance of the metrics-via-RMI hang but 
caught it very early, when only a few threads were blocked behind the compaction. 
It still looks like the same general place:

{code}
"CompactionExecutor:16" #1502 daemon prio=1 os_prio=4 tid=0x7fb78c4f2000 
nid=0xf7ff runnable [0x7fb751941000]
   java.lang.Thread.State: RUNNABLE
at java.util.HashMap.putVal(HashMap.java:641)
at java.util.HashMap.put(HashMap.java:611)
at java.util.HashSet.add(HashSet.java:219)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:512)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x0004bcf24298> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004bcbec488> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x0004b98f1b00> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

{code}
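
For reference, here is a simplified sketch of what the overlapping() loop in those 
frames does on the 2.1 line (a reconstruction from the stack above, not the exact 
source). getCandidatesFor() calls this linear scan over and over, so a large L0 
backlog makes the candidate search effectively O(n^2), all while the 
LeveledManifest and WrappingCompactionStrategy monitors are held:

{code}
import java.util.HashSet;
import java.util.Set;

import org.apache.cassandra.dht.Bounds;
import org.apache.cassandra.dht.Token;
import org.apache.cassandra.io.sstable.SSTableReader;

// Simplified sketch of LeveledManifest.overlapping(): a full linear scan of
// the candidate sstables per call, accumulating results into a HashSet
// (the HashMap.putVal frame in the dump above).
private static Set<SSTableReader> overlapping(Token start, Token end, Iterable<SSTableReader> sstables)
{
    Bounds<Token> bounds = new Bounds<>(start, end);
    Set<SSTableReader> overlapped = new HashSet<>();
    for (SSTableReader candidate : sstables)
    {
        Bounds<Token> candidateBounds = new Bounds<>(candidate.first.getToken(), candidate.last.getToken());
        if (bounds.intersects(candidateBounds))   // Bounds.intersects(Bounds.java:77)
            overlapped.add(candidate);
    }
    return overlapped;
}
{code}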

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959313#comment-14959313
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 6:08 PM:
-

Oh look, the memtable flusher is blocked on the same lock:

{code}
"MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 
tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250)
- waiting to lock <0x000498151af8> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518)

{code}

I don't know how hot that particular code is, but every stack trace showed the 
lock at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77) or deeper.
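
The interaction reduces to a single monitor shared between the compaction 
scheduler and the flush-notification path. A self-contained toy that reproduces 
the BLOCKED state above (all names illustrative; nothing here is Cassandra code):

{code}
// Toy reproduction of the contention: one thread holds the shared strategy
// monitor for a long candidate scan while the flush-notification thread
// blocks on the same monitor -- the "BLOCKED (on object monitor)" state above.
public class StrategyMonitorDemo
{
    static final Object STRATEGY_MONITOR = new Object();

    public static void main(String[] args) throws InterruptedException
    {
        Thread compactionExecutor = new Thread(() ->
        {
            synchronized (STRATEGY_MONITOR)   // getNextBackgroundTask()
            {
                sleep(5000);                  // stands in for the O(n^2) L0 scan
            }
        }, "CompactionExecutor");

        Thread memtableFlushWriter = new Thread(() ->
        {
            synchronized (STRATEGY_MONITOR)   // handleNotification()
            {
                System.out.println("flush notification finally delivered");
            }
        }, "MemtableFlushWriter");

        compactionExecutor.start();
        sleep(100);                           // let the scan take the monitor first
        memtableFlushWriter.start();          // now BLOCKED, so flushes stall
        compactionExecutor.join();            // and commit log segments pile up
        memtableFlushWriter.join();
    }

    static void sleep(long millis)
    {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
{code}

While the scan sleeps, a jstack of this toy shows the MemtableFlushWriter thread 
in exactly the "waiting for monitor entry" state quoted above.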


was (Author: jeffery.griffith):
Oh look, the memtable flusher is blocked on the same lock:

{code}
"MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 
tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250)
- waiting to lock <0x000498151af8> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518)

{code}

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959279#comment-14959279
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

[~mishail] [~blambov] thread dump attached.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959301#comment-14959301
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

I followed the thread dumps over time. A lot of metrics threads (getting estimated 
pending tasks) were blocked behind this thread:

{code}
"CompactionExecutor:11" #1591 daemon prio=1 os_prio=4 tid=0x7f30f1338800 
nid=0xba6b runnable [0x7f2e75bfd000]
   java.lang.Thread.State: RUNNABLE
at org.apache.cassandra.dht.Bounds.contains(Bounds.java:49)
at org.apache.cassandra.dht.Bounds.intersects(Bounds.java:77)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:511)
at 
org.apache.cassandra.db.compaction.LeveledManifest.overlapping(LeveledManifest.java:497)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:572)
at 
org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:346)
- locked <0x000498f172b0> (a 
org.apache.cassandra.db.compaction.LeveledManifest)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getMaximalTask(LeveledCompactionStrategy.java:101)
at 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:90)
- locked <0x0004989bb5c0> (a 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.getNextBackgroundTask(WrappingCompactionStrategy.java:84)
- locked <0x000498151af8> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at 
org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:230)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
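
For what it's worth, a minimal sketch of the check these dumps support: list every 
BLOCKED thread together with the owner of the monitor it is waiting on. This does 
it in-process via ThreadMXBean; scanning jstack output for "waiting for monitor 
entry" gives the same picture:

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal sketch: report every BLOCKED thread and the owner of the monitor
// it is waiting on. Run inside the JVM, this shows the metrics/RMI threads
// queued behind the CompactionExecutor thread above.
public class BlockedThreadReport
{
    public static void main(String[] args)
    {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, false))
        {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED)
                System.out.printf("%s BLOCKED on %s held by %s%n",
                                  info.getThreadName(),
                                  info.getLockName(),
                                  info.getLockOwnerName());
        }
    }
}
{code}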

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959313#comment-14959313
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Oh look, the memtable flusher is blocked on the same lock:

{code}
"MemtableFlushWriter:1166" #18316 daemon prio=5 os_prio=0 
tid=0x7f33ac5f8800 nid=0xb649 waiting for monitor entry [0x7f31c5acc000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy.handleNotification(WrappingCompactionStrategy.java:250)
- waiting to lock <0x000498151af8> (a 
org.apache.cassandra.db.compaction.WrappingCompactionStrategy)
at org.apache.cassandra.db.DataTracker.notifyAdded(DataTracker.java:518)

{code}

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959338#comment-14959338
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

Yeah, I saw all the blocked threads behind it. Checking to see which monitoring 
tools are not checking whether the previous instance finished. But this is just an 
ugly side effect, isn't it? (A side effect of the lock?) 

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959338#comment-14959338
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 7:19 PM:
-

Yeah, I saw all the blocked threads behind it. Checking to see which monitoring 
tools are not checking whether the previous instance finished. But this is just an 
ugly side effect, isn't it? (A side effect of the lock?) I will disable all 
monitoring and restart to be sure. (UPDATE: looks like a cron job piled those 
up after things got stuck. I disabled it to be sure.)


was (Author: jeffery.griffith):
Yeah, I saw all the blocked threads behind it. Checking to see which monitoring 
tools are not checking whether the previous instance finished. But this is just an 
ugly side effect, isn't it? (A side effect of the lock?) I will disable all 
monitoring and restart to be sure.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: stacktrace.txt

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-15 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959338#comment-14959338
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/15/15 6:18 PM:
-

Yeah, I saw all the blocked threads behind it. Checking to see which monitoring 
tools are not checking whether the previous instance finished. But this is just an 
ugly side effect, isn't it? (A side effect of the lock?) I will disable all 
monitoring and restart to be sure.


was (Author: jeffery.griffith):
Yeah, I saw all the blocked threads behind it. Checking to see which monitoring 
tools are not checking whether the previous instance finished. But this is just an 
ugly side effect, isn't it? (A side effect of the lock?) 

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, stacktrace.txt, 
> system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14957289#comment-14957289
 ] 

Jeff Griffith commented on CASSANDRA-10515:
---

A test we tried: lowered the max commit log size from 12G to 6G. It respected the 
limit through 8 iterations, then began to grow again (same behavior).

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: system.log.clean

I cleaned up a log and attached it for a period that began already in trouble 
but suddenly started (flushing?) and compacting again. At 11:25 there was a 
huge compaction. Things seemed normal for a while, but the commit logs began to 
grow again. Note that I have cleaned out a bunch of warnings like this, because 
we see them everywhere:

WARN  [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - 
Batch of prepared statements for [xx] is of size 5253, exceeding specified 
threshold of 5120 by 133.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: CommitLogSize.jpg

This matches the period in the log file system.log.clean

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956966#comment-14956966
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/14/15 1:49 PM:
-

I cleaned up a log and attached it (system.log.clean) for a period that began 
already in trouble but suddenly started (flushing?) and compacting again. At 
11:25 there was a huge compaction. Things seemed normal for a while, but the 
commit logs began to grow again. Note that I have cleaned out a bunch of 
warnings like this, because we see them everywhere:

WARN  [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - 
Batch of prepared statements for [xx] is of size 5253, exceeding specified 
threshold of 5120 by 133.


was (Author: jeffery.griffith):
I cleaned up a log and attached it for a period that began already in trouble 
but suddenly started (flushing?) and compacting again. At 11:25 there was a 
huge compaction. Things seemed normal for a while, but the commit logs began to 
grow again. Note that I have cleaned out a bunch of warnings like this, because 
we see them everywhere:

WARN  [SharedPool-Worker-4] 2015-10-14 11:21:22,940 BatchStatement.java:252 - 
Batch of prepared statements for [xx] is of size 5253, exceeding specified 
threshold of 5120 by 133.

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-14 Thread Jeff Griffith (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14956970#comment-14956970
 ] 

Jeff Griffith edited comment on CASSANDRA-10515 at 10/14/15 1:49 PM:
-

CommitLogSize.jpg matches the period in the log file system.log.clean


was (Author: jeffery.griffith):
This matches the period in the log file system.log.clean

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Assignee: Branimir Lambov
>Priority: Critical
>  Labels: commitlog, triage
> Attachments: CommitLogProblem.jpg, CommitLogSize.jpg, system.log.clean
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more before it restarts. Once it reaches the state of more than 12G 
> commit log files, "nodetool compactionstats" hangs. Eventually C* restarts 
> without errors (not sure yet whether it is crashing but I'm checking into it) 
> and the cleanup occurs and the commit logs shrink back down again. Here is 
> the nodetool compactionstats immediately after restart.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Compaction hangs with move to 2.1.10

2015-10-13 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Attachment: CommitLogProblem.jpg

> Compaction hangs with move to 2.1.10
> 
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Priority: Critical
> Attachments: CommitLogProblem.jpg
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more. Once it reaches this state, "nodetool compactionstats" hangs. I 
> watched the recovery live when compactions begin happening again. the 
> "nodetool compactionstats" suddenly completed to show the outstanding jobs 
> most in 100% completion state:
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  ContactInformationUpdates   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore CommEvents   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore   EndpointPrefixIndexMinimized6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore   EmailHistogramDeltas3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCoreContactPrefixBytesIndex2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore   EndpointProfiles   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore CommEvents   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore EndpointIndexIntId3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10515) Compaction hangs with move to 2.1.10

2015-10-13 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Description: 
After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems where 
some nodes break the 12G commit log max we configured and go as high as 65G or 
more. Once it reaches this state, "nodetool compactionstats" hangs. I watched 
the recovery live when compactions begin happening again. the "nodetool 
compactionstats" suddenly completed to show the outstanding jobs most in 100% 
completion state:

{code}
jgriffith@prod1xc1.c2.bf1:~$ ndc
pending tasks: 2185
   compaction type   keyspace  table completed  
totalunit   progress
Compaction   SyncCore  *cf1*   61251208033   
170643574558   bytes 35.89%
Compaction   SyncCore  *cf2*   19262483904
19266079916   bytes 99.98%
Compaction   SyncCore  *cf3*6592197093 
6592316682   bytes100.00%
Compaction   SyncCore  *cf4*3411039555 
3411039557   bytes100.00%
Compaction   SyncCore  *cf5*2879241009 
2879487621   bytes 99.99%
Compaction   SyncCore  *cf6*   21252493623
21252635196   bytes100.00%
Compaction   SyncCore  *cf7*   81009853587
81009854438   bytes100.00%
Compaction   SyncCore  *cf8*3005734580 
3005768582   bytes100.00%
Active compaction remaining time :n/a
{code}

I was also doing periodic "nodetool tpstats" which were working but not being 
logged in system.log on the StatusLogger thread until after the compaction 
started working again.


  was:
After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems where 
some nodes break the 12G commit log max we configured and go as high as 65G or 
more. Once it reaches this state, "nodetool compactionstats" hangs. I watched 
the recovery live when compactions begin happening again. the "nodetool 
compactionstats" suddenly completed to show the outstanding jobs most in 100% 
completion state:

{code}
jgriffith@prod1xc1.c2.bf1:~$ ndc
pending tasks: 2185
   compaction type   keyspace  table completed  
totalunit   progress
Compaction   SyncCore  ContactInformationUpdates   61251208033   
170643574558   bytes 35.89%
Compaction   SyncCore CommEvents   19262483904
19266079916   bytes 99.98%
Compaction   SyncCore   EndpointPrefixIndexMinimized6592197093 
6592316682   bytes100.00%
Compaction   SyncCore   EmailHistogramDeltas3411039555 
3411039557   bytes100.00%
Compaction   SyncCoreContactPrefixBytesIndex2879241009 
2879487621   bytes 99.99%
Compaction   SyncCore   EndpointProfiles   21252493623
21252635196   bytes100.00%
Compaction   SyncCore CommEvents   81009853587
81009854438   bytes100.00%
Compaction   SyncCore EndpointIndexIntId3005734580 
3005768582   bytes100.00%
Active compaction remaining time :n/a
{code}

I was also doing periodic "nodetool tpstats" which were working but not being 
logged in system.log on the StatusLogger thread until after the compaction 
started working again.



> Compaction hangs with move to 2.1.10
> 
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Priority: Critical
> Attachments: CommitLogProblem.jpg
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more. Once it reaches this state, "nodetool compactionstats" hangs. I 
> watched the recovery live when compactions begin happening again. the 
> "nodetool compactionstats" suddenly completed to show the outstanding jobs 
> most in 100% completion state:
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  

[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-13 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Description: 
After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems where 
some nodes break the 12G commit log max we configured and go as high as 65G or 
more. Once it reaches this state, "nodetool compactionstats" hangs. Eventually 
C* restarts without errors and the cleanup occurs and the commit logs shrink 
back down again.

{code}
jgriffith@prod1xc1.c2.bf1:~$ ndc
pending tasks: 2185
   compaction type   keyspace  table completed  
totalunit   progress
Compaction   SyncCore  *cf1*   61251208033   
170643574558   bytes 35.89%
Compaction   SyncCore  *cf2*   19262483904
19266079916   bytes 99.98%
Compaction   SyncCore  *cf3*6592197093 
6592316682   bytes100.00%
Compaction   SyncCore  *cf4*3411039555 
3411039557   bytes100.00%
Compaction   SyncCore  *cf5*2879241009 
2879487621   bytes 99.99%
Compaction   SyncCore  *cf6*   21252493623
21252635196   bytes100.00%
Compaction   SyncCore  *cf7*   81009853587
81009854438   bytes100.00%
Compaction   SyncCore  *cf8*3005734580 
3005768582   bytes100.00%
Active compaction remaining time :n/a
{code}

I was also doing periodic "nodetool tpstats" which were working but not being 
logged in system.log on the StatusLogger thread until after the compaction 
started working again.


  was:
After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems where 
some nodes break the 12G commit log max we configured and go as high as 65G or 
more. Once it reaches this state, "nodetool compactionstats" hangs. I watched 
the recovery live when compactions begin happening again. the "nodetool 
compactionstats" suddenly completed to show the outstanding jobs most in 100% 
completion state:

{code}
jgriffith@prod1xc1.c2.bf1:~$ ndc
pending tasks: 2185
   compaction type   keyspace  table completed  
totalunit   progress
Compaction   SyncCore  *cf1*   61251208033   
170643574558   bytes 35.89%
Compaction   SyncCore  *cf2*   19262483904
19266079916   bytes 99.98%
Compaction   SyncCore  *cf3*6592197093 
6592316682   bytes100.00%
Compaction   SyncCore  *cf4*3411039555 
3411039557   bytes100.00%
Compaction   SyncCore  *cf5*2879241009 
2879487621   bytes 99.99%
Compaction   SyncCore  *cf6*   21252493623
21252635196   bytes100.00%
Compaction   SyncCore  *cf7*   81009853587
81009854438   bytes100.00%
Compaction   SyncCore  *cf8*3005734580 
3005768582   bytes100.00%
Active compaction remaining time :n/a
{code}

I was also doing periodic "nodetool tpstats" which were working but not being 
logged in system.log on the StatusLogger thread until after the compaction 
started working again.



> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Priority: Critical
> Attachments: CommitLogProblem.jpg
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more. Once it reaches this state, "nodetool compactionstats" hangs. 
> Eventually C* restarts without errors and the cleanup occurs and the commit 
> logs shrink back down again.
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
>   

[jira] [Updated] (CASSANDRA-10515) Commit logs back up with move to 2.1.10

2015-10-13 Thread Jeff Griffith (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Griffith updated CASSANDRA-10515:
--
Summary: Commit logs back up with move to 2.1.10  (was: Compaction hangs 
with move to 2.1.10)

> Commit logs back up with move to 2.1.10
> ---
>
> Key: CASSANDRA-10515
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: redhat 6.5, cassandra 2.1.10
>Reporter: Jeff Griffith
>Priority: Critical
> Attachments: CommitLogProblem.jpg
>
>
> After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems 
> where some nodes break the 12G commit log max we configured and go as high as 
> 65G or more. Once it reaches this state, "nodetool compactionstats" hangs. I 
> watched the recovery live when compactions begin happening again. the 
> "nodetool compactionstats" suddenly completed to show the outstanding jobs 
> most in 100% completion state:
> {code}
> jgriffith@prod1xc1.c2.bf1:~$ ndc
> pending tasks: 2185
>compaction type   keyspace  table completed
>   totalunit   progress
> Compaction   SyncCore  *cf1*   61251208033   
> 170643574558   bytes 35.89%
> Compaction   SyncCore  *cf2*   19262483904
> 19266079916   bytes 99.98%
> Compaction   SyncCore  *cf3*6592197093
>  6592316682   bytes100.00%
> Compaction   SyncCore  *cf4*3411039555
>  3411039557   bytes100.00%
> Compaction   SyncCore  *cf5*2879241009
>  2879487621   bytes 99.99%
> Compaction   SyncCore  *cf6*   21252493623
> 21252635196   bytes100.00%
> Compaction   SyncCore  *cf7*   81009853587
> 81009854438   bytes100.00%
> Compaction   SyncCore  *cf8*3005734580
>  3005768582   bytes100.00%
> Active compaction remaining time :n/a
> {code}
> I was also doing periodic "nodetool tpstats" which were working but not being 
> logged in system.log on the StatusLogger thread until after the compaction 
> started working again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-10515) Compaction hangs with move to 2.1.10

2015-10-13 Thread Jeff Griffith (JIRA)
Jeff Griffith created CASSANDRA-10515:
-

 Summary: Compaction hangs with move to 2.1.10
 Key: CASSANDRA-10515
 URL: https://issues.apache.org/jira/browse/CASSANDRA-10515
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: redhat 6.5, cassandra 2.1.10
Reporter: Jeff Griffith
Priority: Critical


After upgrading from cassandra 2.0.x to 2.1.10, we began seeing problems where 
some nodes break the 12G commit log max we configured and go as high as 65G or 
more. Once it reaches this state, "nodetool compactionstats" hangs. I watched 
the recovery live when compactions begin happening again. the "nodetool 
compactionstats" suddenly completed to show the outstanding jobs most in 100% 
completion state:

{code}
jgriffith@prod1xc1.c2.bf1:~$ ndc
pending tasks: 2185
   compaction type   keyspace  table completed  
totalunit   progress
Compaction   SyncCore  ContactInformationUpdates   61251208033   
170643574558   bytes 35.89%
Compaction   SyncCore CommEvents   19262483904
19266079916   bytes 99.98%
Compaction   SyncCore   EndpointPrefixIndexMinimized6592197093 
6592316682   bytes100.00%
Compaction   SyncCore   EmailHistogramDeltas3411039555 
3411039557   bytes100.00%
Compaction   SyncCoreContactPrefixBytesIndex2879241009 
2879487621   bytes 99.99%
Compaction   SyncCore   EndpointProfiles   21252493623
21252635196   bytes100.00%
Compaction   SyncCore CommEvents   81009853587
81009854438   bytes100.00%
Compaction   SyncCore EndpointIndexIntId3005734580 
3005768582   bytes100.00%
Active compaction remaining time :n/a
{code}

I was also doing periodic "nodetool tpstats" which were working but not being 
logged in system.log on the StatusLogger thread until after the compaction 
started working again.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

