[jira] [Commented] (CASSANDRA-6125) Race condition in Gossip propagation

2015-07-21 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14636139#comment-14636139
 ] 

Peter Haggerty commented on CASSANDRA-6125:
---

We've seen this bug, or something like it, on 2.0.11 with 45 nodes in a fairly 
noisy AWS environment, but other than CASSANDRA-8336 I don't see any fixes to 
gossip after 2.0.11.

The nodetool status command doesn't list the node that doesn't have status 
info. The node isn't reported as up or down; it's simply not there, and this 
impacts % ownership. In a recent instance 4 nodes had the same "status hole", 
but only 2 of the 4 had nodetool ring output that differed from the other 41 
"no status hole" members of the ring.

Restarting cassandra on the node with the missing STATUS entry in gossip 
"fixes" the problem in that the hole goes away. We saw this more often before 
2.0.11, so the fix for this ticket does appear to work, but are there other 
places where a race might be happening?

{code}
/10.xx.yyy.169
  generation:1436544814
  heartbeat:2986679
  SEVERITY:0.0
  HOST_ID:7d22299f-b35b-4035-82bc-e2b603a655d7
  LOAD:2.57836E11
  RACK:1e
  NET_VERSION:7
  DC:us-east
  RPC_ADDRESS:10.xx.yyy.169
  RELEASE_VERSION:2.0.11
  SCHEMA:0f72be52-2751-33a6-a172-8511e943b2ec
/10.xx.yyy.175
  generation:1419877470
  heartbeat:53496976
  SEVERITY:1.2787723541259766
  HOST_ID:c87ed8db-76b6-485a-ac2f-32c2822b1ef5
  LOAD:3.08812188602E11
  RACK:1e
  NET_VERSION:7
  STATUS:NORMAL,-1010822684895662807
  DC:us-east
  RPC_ADDRESS:10.xx.yyy.175
  RELEASE_VERSION:2.0.11
  SCHEMA:0f72be52-2751-33a6-a172-8511e943b2ec
{code}
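One way to spot such "status holes" without eyeballing the whole dump is to scan the gossip output for endpoints that lack a STATUS line. A minimal sketch (my own parsing of output shaped like the dump above; this is not part of any Cassandra tooling):

```python
# Sketch: list gossip endpoints that have no STATUS entry.
# Assumes `nodetool gossipinfo`-style output: endpoint lines start with "/",
# attribute lines are indented, as in the dump above.

def endpoints_missing_status(lines):
    missing, endpoint, has_status = [], None, False
    for line in lines:
        if line.startswith("/"):                  # start of a new endpoint
            if endpoint is not None and not has_status:
                missing.append(endpoint)
            endpoint, has_status = line.strip(), False
        elif line.strip().startswith("STATUS:"):  # e.g. "  STATUS:NORMAL,..."
            has_status = True
    if endpoint is not None and not has_status:
        missing.append(endpoint)
    return missing

sample = """/10.xx.yyy.169
  generation:1436544814
  HOST_ID:7d22299f-b35b-4035-82bc-e2b603a655d7
/10.xx.yyy.175
  STATUS:NORMAL,-1010822684895662807
""".splitlines()

print(endpoints_missing_status(sample))   # ['/10.xx.yyy.169']
```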


> Race condition in Gossip propagation
> 
>
> Key: CASSANDRA-6125
> URL: https://issues.apache.org/jira/browse/CASSANDRA-6125
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Sergio Bossa
>Assignee: Brandon Williams
> Fix For: 2.0.11, 2.1.1
>
> Attachments: 6125.txt
>
>
> Gossip propagation has a race when concurrent VersionedValues are created and 
> submitted/propagated, causing some updates to be lost, even if happening on 
> different ApplicationStatuses.
> That's what happens basically:
> 1) A new VersionedValue V1 is created with version X.
> 2) A new VersionedValue V2 is created with version Y = X + 1.
> 3) V2 is added to the endpoint state map and propagated.
> 4) Nodes register Y as max version seen.
> 5) At this point, V1 is added to the endpoint state map and propagated too.
> 6) V1 version is X < Y, so nodes do not ask for his value after digests.
> A possible solution would be to propagate/track per-ApplicationStatus 
> versions, possibly encoding them to avoid network overhead.
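The interleaving in the quoted steps can be sketched with a toy model (hypothetical classes, not Cassandra's actual Gossiper code): a receiver that keys its "newer than" check on a single per-endpoint max version silently drops the late, lower-versioned value.

```python
# Toy model (hypothetical, not Cassandra's actual Gossiper) of the race
# described above: digests carry only a single max version per endpoint,
# so a value created earlier but propagated later is never requested.

class Receiver:
    def __init__(self):
        self.state = {}              # application state name -> (version, value)
        self.max_version_seen = 0    # the only version tracked per endpoint

    def receive(self, app_state, version, value):
        # Values at or below the max version seen are assumed already known.
        if version > self.max_version_seen:
            self.state[app_state] = (version, value)
            self.max_version_seen = version

node = Receiver()
node.receive("LOAD", 2, "0.9")       # V2 (version Y = X + 1) propagates first
node.receive("STATUS", 1, "NORMAL")  # V1 (version X) arrives late, is dropped

assert "STATUS" not in node.state    # the STATUS update is lost
```

Tracking a version per ApplicationState, as the description proposes, makes the second `receive` succeed because each state's version is compared independently.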



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8014) NPE in Message.java line 324

2015-04-20 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502921#comment-14502921
 ] 

Peter Haggerty commented on CASSANDRA-8014:
---

We just saw this again on 2.0.11 in very similar circumstances (gently shutting 
down cassandra with disable commands before terminating it):

{code}
ERROR [RPC-Thread:50] 2015-04-20 14:14:23,165 CassandraDaemon.java (line 199) Exception in thread Thread[RPC-Thread:50,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
    at com.lmax.disruptor.FatalExceptionHandler.handleEventException(FatalExceptionHandler.java:45)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:126)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at com.thinkaurelius.thrift.Message.getInputTransport(Message.java:338)
    at com.thinkaurelius.thrift.Message.invoke(Message.java:308)
    at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
    at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:695)
    at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:689)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
    ... 3 more
{code}

> NPE in Message.java line 324
> 
>
> Key: CASSANDRA-8014
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Cassandra 2.0.9
>Reporter: Peter Haggerty
>Assignee: Pavel Yaskevich
> Attachments: NPE_Message.java_line-324.txt
>
>
> We received this when a server was rebooting and attempted to shut Cassandra 
> down while it was still quite busy. While it's normal for us to see a handful 
> of RejectedExecution exceptions on a sudden shutdown like this, these NPEs in 
> Message.java are new.
> The attached file includes the logs from "StorageServiceShutdownHook" through 
> "Logging initialized" after the server restarts and Cassandra comes back up.
> {code}
> ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 324) Unexpected throwable while invoking!
> java.lang.NullPointerException
>     at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
>     at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
>     at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
>     at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
>     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
>     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>     at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
>     at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
>     at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
>     at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
>     at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8014) NPE in Message.java line 324

2015-04-20 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-8014:
--
Environment: Cassandra 2.0.9, Cassandra 2.0.11  (was: Cassandra 2.0.9)

> NPE in Message.java line 324
> 
>
> Key: CASSANDRA-8014
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Cassandra 2.0.9, Cassandra 2.0.11
>Reporter: Peter Haggerty
>Assignee: Pavel Yaskevich
> Attachments: NPE_Message.java_line-324.txt
>
>
> We received this when a server was rebooting and attempted to shut Cassandra 
> down while it was still quite busy. While it's normal for us to see a handful 
> of RejectedExecution exceptions on a sudden shutdown like this, these NPEs in 
> Message.java are new.
> The attached file includes the logs from "StorageServiceShutdownHook" through 
> "Logging initialized" after the server restarts and Cassandra comes back up.
> {code}
> ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 324) Unexpected throwable while invoking!
> java.lang.NullPointerException
>     at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
>     at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
>     at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
>     at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
>     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
>     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>     at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
>     at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
>     at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
>     at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
>     at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8014) NPE in Message.java line 324

2014-12-12 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14245041#comment-14245041
 ] 

Peter Haggerty commented on CASSANDRA-8014:
---

We've seen this on a 2.0.9 instance when running "nodetool disablethrift". It 
throws a half dozen of the "Unexpected throwable" errors, then proceeds to:

{code}
ERROR [pool-6-thread-2] 2014-12-12 23:43:13,643 CassandraDaemon.java (line 199) Exception in thread Thread[pool-6-thread-2,5,main]
java.lang.RuntimeException: java.lang.NullPointerException
    at com.lmax.disruptor.FatalExceptionHandler.handleEventException(FatalExceptionHandler.java:45)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:126)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at com.thinkaurelius.thrift.Message.getInputTransport(Message.java:338)
    at com.thinkaurelius.thrift.Message.invoke(Message.java:308)
    at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
    at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
    at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
    ... 3 more
{code}

The "nodetool disablethrift" appears to hang until killed.


> NPE in Message.java line 324
> 
>
> Key: CASSANDRA-8014
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: Cassandra 2.0.9
>Reporter: Peter Haggerty
>Assignee: Pavel Yaskevich
> Attachments: NPE_Message.java_line-324.txt
>
>
> We received this when a server was rebooting and attempted to shut Cassandra 
> down while it was still quite busy. While it's normal for us to see a handful 
> of RejectedExecution exceptions on a sudden shutdown like this, these NPEs in 
> Message.java are new.
> The attached file includes the logs from "StorageServiceShutdownHook" through 
> "Logging initialized" after the server restarts and Cassandra comes back up.
> {code}
> ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 324) Unexpected throwable while invoking!
> java.lang.NullPointerException
>     at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
>     at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
>     at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
>     at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
>     at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
>     at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>     at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
>     at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
>     at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
>     at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
>     at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8116) HSHA fails with default rpc_max_threads setting

2014-10-29 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188625#comment-14188625
 ] 

Peter Haggerty commented on CASSANDRA-8116:
---

The latest 2.0.x release of Cassandra, using hsha with default settings, 
either stalls after a few minutes of operation or crashes.

This does not seem like it should have a priority of "Minor"; it is a major 
problem. The longer 2.0.11 remains the "latest" version, the bigger the 
problem becomes for new users and for existing users who have automation and 
a high level of trust in minor version upgrades.



> HSHA fails with default rpc_max_threads setting
> ---
>
> Key: CASSANDRA-8116
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8116
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Mike Adamson
>Assignee: Tyler Hobbs
>Priority: Minor
> Fix For: 2.0.12, 2.1.2
>
> Attachments: 8116-throw-exc-2.0.txt, 8116.txt
>
>
> The HSHA server fails with 'Out of heap space' error if the rpc_max_threads 
> is left at its default setting (unlimited) in cassandra.yaml.
> I'm not proposing any code change for this but have submitted a patch for a 
> comment change in cassandra.yaml to indicate that rpc_max_threads needs to be 
> changed if you use HSHA.
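For reference, a sketch of the cassandra.yaml settings involved (values are illustrative only; the patch attached to this ticket changes the comment text, not the defaults):

```yaml
# cassandra.yaml (illustrative snippet)
rpc_server_type: hsha
# The default is unlimited; with hsha an unbounded thread count can exhaust
# the heap, so a finite bound must be set:
rpc_max_threads: 2048
```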



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-8102) cassandra-cli and cqlsh report two different values for a setting, partially update it and partially report it

2014-10-10 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-8102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-8102:
--
Description: 
cassandra-cli updates and prints out a min_compaction_threshold that is not 
shown by cqlsh (which shows a different min_threshold attribute).

cqlsh updates "both" values but only shows one of them.

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

$ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
  Compaction min/max thresholds: 8/32

$ echo "describe table foo;" | cqlsh -k bar `hostname`
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
{code}

{code}
cqlsh:
ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
'min_threshold' : 16};

cassandra-cli:
  Compaction min/max thresholds: 16/32
  Compaction Strategy Options:
min_threshold: 16
cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

cassandra-cli:
  Compaction min/max thresholds: 8/32
  Compaction Strategy Options:
min_threshold: 16

cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}


  was:
cassandra-cli updates and prints out a min_compaction_threshold that is not 
shown by cqlsh (it shows a different min_threshold attribute)

cqlsh updates "both" values but only shows one of them

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

$ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
  Compaction min/max thresholds: 8/32

$ echo "describe table foo;" | cqlsh -k bar `hostname`
  compaction={'class': 'SizeTieredCompactionStrategy'} AND



cqlsh:
ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
'min_threshold' : 16};

cassandra-cli:
  Compaction min/max thresholds: 16/32
  Compaction Strategy Options:
min_threshold: 16
cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND



cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

cassandra-cli:
  Compaction min/max thresholds: 8/32
  Compaction Strategy Options:
min_threshold: 16

cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}



> cassandra-cli and cqlsh report two different values for a setting, partially 
> update it and partially report it
> --
>
> Key: CASSANDRA-8102
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8102
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.0.9
>Reporter: Peter Haggerty
>Priority: Minor
>
> cassandra-cli updates and prints out a min_compaction_threshold that is not 
> shown by cqlsh (which shows a different min_threshold attribute).
> cqlsh updates "both" values but only shows one of them.
> {code}
> cassandra-cli:
> UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;
> $ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
>   Compaction min/max thresholds: 8/32
> $ echo "describe table foo;" | cqlsh -k bar `hostname`
>   compaction={'class': 'SizeTieredCompactionStrategy'} AND
> {code}
> {code}
> cqlsh:
> ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
> 'min_threshold' : 16};
> cassandra-cli:
>   Compaction min/max thresholds: 16/32
>   Compaction Strategy Options:
> min_threshold: 16
> cqlsh:
>   compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
> AND
> {code}
> {code}
> cassandra-cli:
> UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;
> cassandra-cli:
>   Compaction min/max thresholds: 8/32
>   Compaction Strategy Options:
> min_threshold: 16
> cqlsh:
>   compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
> AND
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8102) cassandra-cli and cqlsh report two different values for a setting, partially update it and partially report it

2014-10-10 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-8102:
-

 Summary: cassandra-cli and cqlsh report two different values for a 
setting, partially update it and partially report it
 Key: CASSANDRA-8102
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8102
 Project: Cassandra
  Issue Type: Bug
 Environment: 2.0.9
Reporter: Peter Haggerty
Priority: Minor


cassandra-cli updates and prints out a min_compaction_threshold that is not 
shown by cqlsh (which shows a different min_threshold attribute).

cqlsh updates "both" values but only shows one of them.

{code}
cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

$ echo "describe foo;" | cassandra-cli -h `hostname` -k bar
  Compaction min/max thresholds: 8/32

$ echo "describe table foo;" | cqlsh -k bar `hostname`
  compaction={'class': 'SizeTieredCompactionStrategy'} AND



cqlsh:
ALTER TABLE foo WITH compaction = {'class' : 'SizeTieredCompactionStrategy', 
'min_threshold' : 16};

cassandra-cli:
  Compaction min/max thresholds: 16/32
  Compaction Strategy Options:
min_threshold: 16
cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND



cassandra-cli:
UPDATE COLUMN FAMILY foo WITH min_compaction_threshold = 8;

cassandra-cli:
  Compaction min/max thresholds: 8/32
  Compaction Strategy Options:
min_threshold: 16

cqlsh:
  compaction={'min_threshold': '16', 'class': 'SizeTieredCompactionStrategy'} 
AND
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8014) NPE in Message.java line 324

2014-09-29 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-8014:
-

 Summary: NPE in Message.java line 324
 Key: CASSANDRA-8014
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8014
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Cassandra 2.0.9
Reporter: Peter Haggerty
 Attachments: NPE_Message.java_line-324.txt

We received this when a server was rebooting and attempted to shut Cassandra 
down while it was still quite busy. While it's normal for us to see a handful 
of RejectedExecution exceptions on a sudden shutdown like this, these NPEs in 
Message.java are new.

The attached file includes the logs from "StorageServiceShutdownHook" through 
"Logging initialized" after the server restarts and Cassandra comes back up.


ERROR [pool-10-thread-2] 2014-09-29 08:33:44,055 Message.java (line 324) Unexpected throwable while invoking!
java.lang.NullPointerException
    at com.thinkaurelius.thrift.util.mem.Buffer.size(Buffer.java:83)
    at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.expand(FastMemoryOutputTransport.java:84)
    at com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.write(FastMemoryOutputTransport.java:167)
    at org.apache.thrift.transport.TFramedTransport.flush(TFramedTransport.java:156)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:55)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at com.thinkaurelius.thrift.Message.invoke(Message.java:314)
    at com.thinkaurelius.thrift.Message$Invocation.execute(Message.java:90)
    at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:638)
    at com.thinkaurelius.thrift.TDisruptorServer$InvocationHandler.onEvent(TDisruptorServer.java:632)
    at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7543) Assertion error when compacting large row with map/list field or range tombstone

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127649#comment-14127649
 ] 

Peter Haggerty commented on CASSANDRA-7543:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but "Fix Versions" 
shows only 1.2.19.


> Assertion error when compacting large row with map/list field or range 
> tombstone
> -
>
> Key: CASSANDRA-7543
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7543
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: linux
>Reporter: Matt Byrd
>Assignee: Yuki Morishita
>  Labels: compaction, map
> Fix For: 1.2.19
>
> Attachments: 0001-add-rangetombstone-test.patch, 
> 0002-fix-rangetomebstone-not-included-in-LCR-size-calc.patch
>
>
> Hi,
> So in a couple of clusters we're hitting this problem when compacting large 
> rows with a schema which contains the map data-type.
> Here is an example of the error:
> {code}
> java.lang.AssertionError: incorrect row data size 87776427 written to /cassandra/X/Y/X-Y-tmp-ic-2381-Data.db; correct is 87845952
> org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:162)
> org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:163)
> org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
> org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:58)
> org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:60)
> org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionTask.run(CompactionManager.java:208)
> {code}
> I have a python script which reproduces the problem by writing lots of data 
> to a single partition key with a schema that contains the map data-type.
> I added some debug logging and found that the difference in bytes seen in the 
> reproduction (255) was due to the following pieces of data being written:
> {code}
> DEBUG [CompactionExecutor:3] 2014-07-13 00:38:42,891 ColumnIndex.java (line 168) DATASIZE writeOpenedMarker columnIndex: org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: [java.nio.HeapByteBuffer[pos=0 lim=34 cap=34], java.nio.HeapByteBuffer[pos=0 lim=34 cap=34]](deletedAt=1405237116014999, localDeletion=1405237116) startPosition: 262476 endPosition: 262561 diff: 85
> DEBUG [CompactionExecutor:3] 2014-07-13 00:38:43,007 ColumnIndex.java (line 168) DATASIZE writeOpenedMarker columnIndex: org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: org.apache.cassandra.db.Column@3e5b5939 startPosition: 328157 endPosition: 328242 diff: 85
> DEBUG [CompactionExecutor:3] 2014-07-13 00:38:44,159 ColumnIndex.java (line 168) DATASIZE writeOpenedMarker columnIndex: org.apache.cassandra.db.ColumnIndex$Builder@6678a9d0 firstColumn: org.apache.cassandra.db.Column@fc3299b startPosition: 984105 endPosition: 984190 diff: 85
> {code}
> So looking at the code you can see that there are extra range tombstones 
> written on the column index border (in ColumnIndex where 
> tombstoneTracker.writeOpenedMarker is called) which aren't accounted for in 
> LazilyCompactedRow.columnSerializedSize.
> This is where the difference comes from in the assertion error, so the 
> solution is just to account for this data.
> I have a patch which does just this, by keeping track of the extra data 
> written out via tombstoneTracker.writeOpenedMarker in ColumnIndex and adding 
> it back to the dataSize in LazilyCompactedRow.java, where it serialises out 
> the row size.
> After applying the patch the reproduction stops producing the AssertionError.
> I know this is not a problem in 2.0+ because of single-pass compaction, but 
> there are lots of 1.2 clusters still out there which might run into this.
> Please let me know if you've any questions.
> Thanks,
> Matt
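The bookkeeping fix described above can be sketched with a toy writer (my own illustration, not the attached patch): track the extra bytes emitted for re-opened range-tombstone markers and add them back to the expected row size before the assertion.

```python
# Toy sketch (illustrative, not Cassandra's code) of the accounting fix:
# markers written at index-block borders are counted in bytes written but
# were missing from the precomputed row size, tripping the assertion.

class RowWriter:
    def __init__(self, estimated_size):
        self.estimated_size = estimated_size   # size computed before writing
        self.written = 0                       # bytes actually written
        self.extra_marker_bytes = 0            # markers missing from estimate

    def write_column(self, n):
        self.written += n

    def write_opened_marker(self, n):
        # Re-opened range tombstone at an index border: written to disk but
        # not counted in the original size estimate.
        self.written += n
        self.extra_marker_bytes += n

    def size_check_passes(self):
        # The fix: include the tracked marker bytes in the expected size.
        return self.estimated_size + self.extra_marker_bytes == self.written

w = RowWriter(estimated_size=200)
w.write_column(100)
w.write_opened_marker(85)    # the 85-byte "diff" seen in the debug logs
w.write_column(100)
assert w.size_check_passes()
```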



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7828) New node cannot be joined if a value in composite type column is dropped (description updated)

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127645#comment-14127645
 ] 

Peter Haggerty commented on CASSANDRA-7828:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but "Fix Versions" 
shows 2.0.11.


> New node cannot be joined if a value in composite type column is dropped 
> (description updated)
> --
>
> Key: CASSANDRA-7828
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7828
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Igor Zubchenok
>Assignee: Mikhail Stepura
> Fix For: 1.2.19, 2.0.11, 2.1.1
>
> Attachments: 1409959462180-myColumnFamily.zip, 
> CASSANDRA-2.0-7828.patch
>
>
> I get a *RuntimeException* in the new node's system.log when bootstrapping a new DC:
> {code:title=system.out - RuntimeException caused by IllegalArgumentException in Buffer.limit|borderStyle=solid}
> INFO [NonPeriodicTasks:1] 2014-08-26 15:43:01,030 SecondaryIndexManager.java (line 137) Submitting index build of [myColumnFamily.myColumnFamily_myColumn] for data in SSTableReader(path='/var/lib/cassandra/data/testbug/myColumnFamily/testbug-myColumnFamily-jb-1-Data.db')
> ERROR [CompactionExecutor:2] 2014-08-26 15:43:01,035 CassandraDaemon.java (line 199) Exception in thread Thread[CompactionExecutor:2,1,main]
> java.lang.IllegalArgumentException
>   at java.nio.Buffer.limit(Buffer.java:267)
>   at org.apache.cassandra.utils.ByteBufferUtil.readBytes(ByteBufferUtil.java:587)
>   at org.apache.cassandra.utils.ByteBufferUtil.readBytesWithShortLength(ByteBufferUtil.java:596)
>   at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:61)
>   at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:36)
>   at org.apache.cassandra.dht.LocalToken.compareTo(LocalToken.java:44)
>   at org.apache.cassandra.db.DecoratedKey.compareTo(DecoratedKey.java:85)
>   at org.apache.cassandra.db.DecoratedKey.compareTo(DecoratedKey.java:36)
>   at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:727)
>   at java.util.concurrent.ConcurrentSkipListMap.findNode(ConcurrentSkipListMap.java:789)
>   at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:828)
>   at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1626)
>   at org.apache.cassandra.db.Memtable.resolve(Memtable.java:215)
>   at org.apache.cassandra.db.Memtable.put(Memtable.java:173)
>   at org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:900)
>   at org.apache.cassandra.db.index.AbstractSimplePerColumnSecondaryIndex.insert(AbstractSimplePerColumnSecondaryIndex.java:107)
>   at org.apache.cassandra.db.index.SecondaryIndexManager.indexRow(SecondaryIndexManager.java:441)
>   at org.apache.cassandra.db.Keyspace.indexRow(Keyspace.java:413)
>   at org.apache.cassandra.db.index.SecondaryIndexBuilder.build(SecondaryIndexBuilder.java:62)
>   at org.apache.cassandra.db.compaction.CompactionManager$9.run(CompactionManager.java:834)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ERROR [NonPeriodicTasks:1] 2014-08-26 15:43:01,035 CassandraDaemon.java (line 199) Exception in thread Thread[NonPeriodicTasks:1,5,main]
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException
>   at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:413)
>   at org.apache.cassandra.db.index.SecondaryIndexManager.maybeBuildSecondaryIndexes(SecondaryIndexManager.java:142)
>   at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:113)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
>   at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at java.util.concurrent.ThreadPoolE

[jira] [Commented] (CASSANDRA-7810) tombstones gc'd before being locally applied

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127644#comment-14127644
 ] 

Peter Haggerty commented on CASSANDRA-7810:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but "Fix Versions" 
shows 2.0.11.


> tombstones gc'd before being locally applied
> 
>
> Key: CASSANDRA-7810
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7810
> Project: Cassandra
>  Issue Type: Bug
> Environment: 2.1.0.rc6
>Reporter: Jonathan Halliday
>Assignee: Marcus Eriksson
> Fix For: 1.2.19, 2.0.11, 2.1.0
>
> Attachments: 0001-7810-test-for-2.0.x.patch, 
> 0001-track-gcable-tombstones-v2.patch, 0001-track-gcable-tombstones.patch, 
> 0002-track-gcable-tombstones-for-2.0.patch, range_tombstone_test.py
>
>
> # single node environment
> CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1 };
> use test;
> create table foo (a int, b int, primary key(a,b));
> alter table foo with gc_grace_seconds = 0;
> insert into foo (a,b) values (1,2);
> select * from foo;
> -- one row returned. so far, so good.
> delete from foo where a=1 and b=2;
> select * from foo;
> -- 0 rows. still rainbows and kittens.
> bin/nodetool flush;
> bin/nodetool compact;
> select * from foo;
>  a | b
> ---+---
>  1 | 2
> (1 rows)
> gahhh.
> looks like the tombstones were considered obsolete and thrown away before 
> being applied to the compaction? gc_grace just means the interval after 
> which they won't be available to remote nodes' repair - they should still 
> apply locally regardless (and they do, correctly, in 2.0.9).
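The report above describes an ordering bug: the tombstone was treated as gc-able and discarded before being merged against the data it shadowed. A minimal, hypothetical Python model of the correct ordering (illustrative only, not Cassandra's actual compaction code):

```python
import time

def compact(cells, tombstones, gc_grace_seconds, now=None):
    """Merge cells with tombstones, THEN drop gc-able tombstones."""
    now = now if now is not None else time.time()
    # Step 1: apply tombstones locally -- a cell older than a matching
    # tombstone is deleted regardless of gc_grace.
    live = [c for c in cells
            if not any(t["key"] == c["key"] and t["ts"] >= c["ts"]
                       for t in tombstones)]
    # Step 2: only now purge tombstones past gc_grace; once applied
    # locally they are kept solely so remote nodes can repair.
    kept = [t for t in tombstones
            if now - t["written"] < gc_grace_seconds]
    return live, kept

# The scenario from the bug report: gc_grace_seconds = 0.
cells = [{"key": (1, 2), "ts": 100}]
tombstones = [{"key": (1, 2), "ts": 200, "written": 0}]
live, kept = compact(cells, tombstones, gc_grace_seconds=0)
assert live == []   # the row stays deleted even with gc_grace = 0
assert kept == []   # the tombstone itself is then safely purged
```

Swapping steps 1 and 2 reproduces the reported behavior: the tombstone is purged first and the shadowed row reappears after compaction.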



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7145) FileNotFoundException during compaction

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127646#comment-14127646
 ] 

Peter Haggerty commented on CASSANDRA-7145:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but "Fix Versions" 
shows 2.0.11.


> FileNotFoundException during compaction
> ---
>
> Key: CASSANDRA-7145
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7145
> Project: Cassandra
>  Issue Type: Bug
> Environment: CentOS 6.3, Datastax Enterprise 4.0.1 (Cassandra 2.0.5), 
> Java 1.7.0_55
>Reporter: PJ
>Assignee: Marcus Eriksson
> Fix For: 1.2.19, 2.0.11, 2.1.0
>
> Attachments: 
> 0001-avoid-marking-compacted-sstables-as-compacting.patch, compaction - 
> FileNotFoundException.txt, repair - RuntimeException.txt, startup - 
> AssertionError.txt
>
>
> I can't finish any compaction because my nodes always throw a 
> "FileNotFoundException". I've already tried the following but nothing helped:
> 1. nodetool flush
> 2. nodetool repair (ends with RuntimeException; see attachment)
> 3. node restart (via dse cassandra-stop)
> Whenever I restart the nodes, another type of exception is logged (see 
> attachment) somewhere near the end of startup process. This particular 
> exception doesn't seem to be critical because the nodes still manage to 
> finish the startup and become online.
> I don't have specific steps to reproduce the problem that I'm experiencing 
> with compaction and repair. I'm in the middle of migrating 4.8 billion rows 
> from MySQL via SSTableLoader. 
> Some things that may or may not be relevant:
> 1. I didn't drop and recreate the keyspace (so probably not related to 
> CASSANDRA-4857)
> 2. I do the bulk-loading in batches of 1 to 20 million rows. When a batch 
> reaches 100% total progress (i.e. starts to build secondary index), I kill 
> the sstableloader process and cancel the index build
> 3. I restart the nodes occasionally. It's possible that there is an on-going 
> compaction during one of those restarts.
> Related StackOverflow question (mine): 
> http://stackoverflow.com/questions/23435847/filenotfoundexception-during-compaction





[jira] [Commented] (CASSANDRA-7808) LazilyCompactedRow incorrectly handles row tombstones

2014-09-09 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127642#comment-14127642
 ] 

Peter Haggerty commented on CASSANDRA-7808:
---

This is listed in the 2.0 CHANGES.txt as present in 2.0.10 but "Fix Versions" 
shows 2.0.11.


> LazilyCompactedRow incorrectly handles row tombstones
> -
>
> Key: CASSANDRA-7808
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7808
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Richard Low
>Assignee: Richard Low
> Fix For: 1.2.19, 2.0.11, 2.1.0
>
> Attachments: 7808-v1.diff
>
>
> LazilyCompactedRow doesn’t handle row tombstones correctly, leading to an 
> AssertionError (CASSANDRA-4206) in some cases, and the row tombstone being 
> incorrectly dropped in others. It looks like this was introduced by 
> CASSANDRA-5677.
> To reproduce an AssertionError:
> 1. Hack a really small return value for 
> DatabaseDescriptor.getInMemoryCompactionLimit() like 10 bytes to force large 
> row compaction
> 2. Create a column family with gc_grace = 10
> 3. Insert a few columns in one row
> 4. Call nodetool flush
> 5. Delete the row
> 6. Call nodetool flush
> 7. Wait 10 seconds
> 8. Call nodetool compact and it will fail
> To reproduce the row tombstone being dropped, do the same except, after the 
> delete (in step 5), insert a column that sorts before the ones you inserted 
> in step 3. E.g. if you inserted b, c, d in step 3, insert a now. After the 
> compaction, which now succeeds, the full row will be visible, rather than 
> just a.
> The problem is two fold. Firstly, LazilyCompactedRow.Reducer.reduce() and 
> getReduce() incorrectly call container.clear(). This clears the columns (as 
> intended) but also removes the deletion times from container. This means no 
> further columns are deleted if they are annihilated by the row tombstone.
> Secondly, after the second pass, LazilyCompactedRow.isEmpty() is called which 
> calls
> {{ColumnFamilyStore.removeDeletedCF(emptyColumnFamily, 
> controller.gcBefore(key.getToken()))}}
> which unfortunately removes the last deleted time from emptyColumnFamily if 
> it is earlier than gcBefore. Since this is only called after the second pass, 
> the second pass doesn’t remove any columns that are removed by the row 
> tombstone whereas the first pass removes just the first one.
> This is pretty serious - no large rows can ever be compacted and row 
> tombstones can go missing.
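The two-fold problem described above can be illustrated with a small, hypothetical Python sketch (illustrative names, not Cassandra's real classes): clearing a per-row container between reduce passes must drop the columns only, never the row-level deletion time, or later columns escape the row tombstone.

```python
class Container:
    """Toy stand-in for a per-row column container with a row tombstone."""

    def __init__(self, deletion_ts=None):
        self.columns = {}
        self.deletion_ts = deletion_ts  # row-tombstone timestamp, if any

    def clear_columns(self):
        # Correct behavior: drop the buffered columns only; keep
        # deletion_ts so the row tombstone keeps annihilating columns
        # seen later in the stream. The bug was an unconditional clear()
        # that also discarded the deletion time.
        self.columns.clear()

    def add(self, name, ts):
        # A column survives only if written after the row tombstone.
        if self.deletion_ts is None or ts > self.deletion_ts:
            self.columns[name] = ts

c = Container(deletion_ts=100)   # row deleted at ts=100
c.add("b", 50)                   # shadowed by the row tombstone
assert c.columns == {}
c.add("d", 150)                  # written after the delete; survives
c.clear_columns()                # flush one reduced chunk
c.add("c", 60)                   # still shadowed: deletion_ts preserved
assert c.columns == {}
```

If `clear_columns` also reset `deletion_ts` to `None`, the final `add("c", 60)` would wrongly keep the column, which mirrors the resurrected rows described in the report.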





[jira] [Updated] (CASSANDRA-7605) compactionstats reports incorrect byte values

2014-07-23 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-7605:
--

Attachment: CASSANDRA-7605.txt

> compactionstats reports incorrect byte values
> -
>
> Key: CASSANDRA-7605
> URL: https://issues.apache.org/jira/browse/CASSANDRA-7605
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
> Environment: 2.0.9, Java 1.7.0_55
>Reporter: Peter Haggerty
> Attachments: CASSANDRA-7605.txt
>
>
> The output of nodetool compactionstats (while a compaction is running) is 
> incorrect.
> The output from nodetool compactionhistory and the log both match and they 
> disagree with the output from compactionstats.
> What nodetool said during the compaction was almost certainly wrong given the 
> sizes of files on disk:
>      completed          total   unit   progress
>   144713163589   146631071165  bytes     98.69%
> nodetool compactionhistory and the log both report the same values for that 
> compaction:
> 52,596,321,269 bytes to 38,575,881,134
> The compactionhistory/log values make much more sense given the size of the 
> files on disk.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (CASSANDRA-7605) compactionstats reports incorrect byte values

2014-07-23 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-7605:
-

 Summary: compactionstats reports incorrect byte values
 Key: CASSANDRA-7605
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7605
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 2.0.9, Java 1.7.0_55
Reporter: Peter Haggerty


The output of nodetool compactionstats (while a compaction is running) is 
incorrect.

The output from nodetool compactionhistory and the log both match and they 
disagree with the output from compactionstats.

What nodetool said during the compaction was almost certainly wrong given the 
sizes of files on disk:
     completed          total   unit   progress
  144713163589   146631071165  bytes     98.69%

nodetool compactionhistory and the log both report the same values for that 
compaction:
52,596,321,269 bytes to 38,575,881,134

The compactionhistory/log values make much more sense given the size of the 
files on disk.
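Note that the percentage nodetool printed is internally consistent with its own completed/total pair; it is the absolute byte counts that disagree with compactionhistory and the log. A quick check (hypothetical helper, not nodetool code):

```python
def progress(completed, total):
    """Format a completed/total byte pair the way compactionstats does."""
    return f"{completed / total * 100:.2f}%"

# Values from the compactionstats output above: the ratio matches the
# printed 98.69%, so only the byte totals themselves are wrong.
assert progress(144713163589, 146631071165) == "98.69%"

# compactionhistory / the log reported 52,596,321,269 bytes in and
# 38,575,881,134 bytes out for the same compaction -- roughly a third of
# what compactionstats claimed, and consistent with the files on disk.
```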






[jira] [Created] (CASSANDRA-7246) Gossip Null Pointer Exception when a cassandra instance in ring is restarted

2014-05-16 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-7246:
-

 Summary: Gossip Null Pointer Exception when a cassandra instance 
in ring is restarted
 Key: CASSANDRA-7246
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7246
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 12 node ring of 1.2.x.
11 of 12 are 1.2.15.
1 is 1.2.16.
Reporter: Peter Haggerty
Priority: Minor


12 Cassandra instances, one per node.
11 of the Cassandra instances are 1.2.15.
1 of the Cassandra instances is 1.2.16.

One of the eleven 1.2.15 Cassandra instances is restarted (disable thrift, 
gossip, then flush, drain, stop, start).

The 1.2.16 Cassandra instance noted this by throwing a Null Pointer Exception. 
None of the 1.2.15 instances threw an exception and this is new behavior that 
hasn't been observed before.


ERROR 02:18:06,009 Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException
at org.apache.cassandra.gms.Gossiper.convict(Gossiper.java:264)
at 
org.apache.cassandra.gms.FailureDetector.forceConviction(FailureDetector.java:246)
at 
org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:37)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
 INFO 02:18:23,402 Node /10.x.y.x is now part of the cluster
 INFO 02:18:23,403 InetAddress /10.x.y.z is now UP
 INFO 02:18:53,494 FatClient /10.x.y.z has been silent for 3ms, removing 
from gossip
 INFO 02:19:00,031 Handshaking version with /10.x.y.z







[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2013-12-14 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848357#comment-13848357
 ] 

Peter Haggerty commented on CASSANDRA-5780:
---

We just ran into this again when a node rebooted and came back up thinking 
everything was fine, but every other node in the ring disagreed. This was 
resolved by our normal "manual restart" procedure where we stop thrift, gossip, 
flush the node, drain the node then restart cassandra but it definitely caused 
some confusion for "nodetool status" and "nodetool info" to report that the 
node was up and a working part of the cluster when in fact it wasn't.

The nodes in this state definitely do *not* make it clear that they are not 
part of the cluster anymore.

> nodetool status and ring report incorrect/stale information after decommission
> --
>
> Key: CASSANDRA-5780
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Peter Haggerty
>Priority: Trivial
>  Labels: lhf, ponies
>
> Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.
> Decommission 3 of the 12 nodes, one after another resulting a 9 instance ring.
> The 9 instances of cassandra that are in the ring all correctly report 
> nodetool status information for the ring and have the same data.
> After the first node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> After the second node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> After the third node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> "nodetool status" on "decommissioned-3rd" reports 9 nodes
> The storage load information is similarly stale on the various decommissioned 
> nodes. The nodetool status and ring commands continue to return information 
> as if they were part of a cluster and they appear to return the last 
> information that they saw.
> In contrast the nodetool info command fails with an exception, which isn't 
> ideal but at least indicates that there was a failure rather than returning 
> stale information.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Updated] (CASSANDRA-5783) nodetool and cassandra-cli report different information for "Compaction min/max thresholds"

2013-07-20 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5783:
--

Priority: Minor  (was: Major)

> nodetool and cassandra-cli report different information for "Compaction 
> min/max thresholds"
> ---
>
> Key: CASSANDRA-5783
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5783
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Affects Versions: 1.2.6
>Reporter: Peter Haggerty
>Priority: Minor
>
> Ask cassandra-cli and nodetool the same question and get different answers 
> back. This was executed after using nodetool to adjust the 
> compactionthreshold on this CF to have a minimum of 2. The change was 
> observed to work as we saw increased compactions which is exactly what one 
> would expect.
> $ echo "describe ${CF};" \
>   | cassandra-cli -h localhost -k ${KEYSPACE} \
>   | grep thresholds
>   Compaction min/max thresholds: 4/32
> $ nodetool -h localhost getcompactionthreshold ${KEYSPACE} ${CF}
> Current compaction thresholds for Metrics/dimensions_active_1:
>  min = 2,  max = 32

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (CASSANDRA-5783) nodetool and cassandra-cli report different information for "Compaction min/max thresholds"

2013-07-20 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-5783:
-

 Summary: nodetool and cassandra-cli report different information 
for "Compaction min/max thresholds"
 Key: CASSANDRA-5783
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5783
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.2.6
Reporter: Peter Haggerty


Ask cassandra-cli and nodetool the same question and get different answers 
back. This was executed after using nodetool to adjust the compactionthreshold 
on this CF to have a minimum of 2. The change was observed to work as we saw 
increased compactions which is exactly what one would expect.


$ echo "describe ${CF};" \
  | cassandra-cli -h localhost -k ${KEYSPACE} \
  | grep thresholds

  Compaction min/max thresholds: 4/32


$ nodetool -h localhost getcompactionthreshold ${KEYSPACE} ${CF}
Current compaction thresholds for Metrics/dimensions_active_1:
 min = 2,  max = 32




[jira] [Commented] (CASSANDRA-4573) HSHA doesn't handle large messages gracefully

2013-07-19 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714355#comment-13714355
 ] 

Peter Haggerty commented on CASSANDRA-4573:
---

We may be seeing this behavior in 1.2.6. I haven't enabled debug but we are 
definitely seeing a correlation between groups of 'Read an invalid frame size 
of 0' messages (dozens at a time) during the same second that we're seeing 
"large" (10 seconds or more) 'GC for ConcurrentMarkSweep' events.

On a 9 node cluster we see this anywhere from 1 to 9 times a day.



> HSHA doesn't handle large messages gracefully
> -
>
> Key: CASSANDRA-4573
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4573
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Reporter: Tyler Hobbs
>Assignee: Vijay
> Attachments: repro.py
>
>
> HSHA doesn't seem to enforce any kind of max message length, and when 
> messages are too large, it doesn't fail gracefully.
> With debug logs enabled, you'll see this:
> {{DEBUG 13:13:31,805 Unexpected state 16}}
> Which seems to mean that there's a SelectionKey that's valid, but isn't ready 
> for reading, writing, or accepting.
> Client-side, you'll get this thrift error (while trying to read a frame as 
> part of {{recv_batch_mutate}}):
> {{TTransportException: TSocket read 0 bytes}}



[jira] [Created] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2013-07-19 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-5780:
-

 Summary: nodetool status and ring report incorrect/stale 
information after decommission
 Key: CASSANDRA-5780
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.2.6
Reporter: Peter Haggerty
Priority: Minor


Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.

Decommission 3 of the 12 nodes, one after another, resulting in a 9 instance ring.

The 9 instances of cassandra that are in the ring all correctly report nodetool 
status information for the ring and have the same data.


After the first node is decommissioned:
"nodetool status" on "decommissioned-1st" reports 11 nodes

After the second node is decommissioned:
"nodetool status" on "decommissioned-1st" reports 11 nodes
"nodetool status" on "decommissioned-2nd" reports 10 nodes

After the third node is decommissioned:
"nodetool status" on "decommissioned-1st" reports 11 nodes
"nodetool status" on "decommissioned-2nd" reports 10 nodes
"nodetool status" on "decommissioned-3rd" reports 9 nodes


The storage load information is similarly stale on the various decommissioned 
nodes. The nodetool status and ring commands continue to return information as 
if they were part of a cluster and they appear to return the last information 
that they saw.

In contrast the nodetool info command fails with an exception, which isn't 
ideal but at least indicates that there was a failure rather than returning 
stale information.





[jira] [Updated] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2013-02-05 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5068:
--

Affects Version/s: 1.1.8
   1.1.9

> CLONE - Once a host has been hinted to, log messages for it repeat every 10 
> mins even if no hints are delivered
> ---
>
> Key: CASSANDRA-5068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.1.6, 1.1.8, 1.1.9, 1.2.0
> Environment: cassandra 1.1.6
> java 1.6.0_30
>Reporter: Peter Haggerty
>Assignee: Brandon Williams
>Priority: Minor
>  Labels: hinted, hintedhandoff
> Attachments: 5068.txt
>
>
> We have "0 row" hinted handoffs every 10 minutes like clockwork. This impacts 
> our ability to monitor the cluster by adding persistent noise in the handoff 
> metric.
> Previous mentions of this issue are here:
> http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
> The hinted handoffs can be scrubbed away with
> nodetool -h 127.0.0.1 scrub system HintsColumnFamily
> but they return anywhere from a few minutes to multiple hours later.
> These started to appear after an upgrade to 1.1.6 and haven't gone away 
> despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
> A few things we've noticed about the handoffs:
> 1. The phantom handoff endpoint changes after a non-zero handoff comes through
> 2. Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before
> 3. The sstable2json output seems to include multiple sub-sections for each 
> handoff with the same "deletedAt" information.
> The phantom handoff endpoint changes after a non-zero handoff comes through:
>  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
> Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before:
>  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
> The first few entries from one of the json files:
> {
> "0aaa": {
> "ccf5dc203a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 
> "ccf603303a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 



[jira] [Commented] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2013-02-05 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13571772#comment-13571772
 ] 

Peter Haggerty commented on CASSANDRA-5068:
---

We see this in 1.1.9 as well.

> CLONE - Once a host has been hinted to, log messages for it repeat every 10 
> mins even if no hints are delivered
> ---
>
> Key: CASSANDRA-5068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.1.6, 1.2.0
> Environment: cassandra 1.1.6
> java 1.6.0_30
>Reporter: Peter Haggerty
>Assignee: Brandon Williams
>Priority: Minor
>  Labels: hinted, hintedhandoff
> Attachments: 5068.txt
>
>
> We have "0 row" hinted handoffs every 10 minutes like clockwork. This impacts 
> our ability to monitor the cluster by adding persistent noise in the handoff 
> metric.
> Previous mentions of this issue are here:
> http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
> The hinted handoffs can be scrubbed away with
> nodetool -h 127.0.0.1 scrub system HintsColumnFamily
> but they return anywhere from a few minutes to multiple hours later.
> These started to appear after an upgrade to 1.1.6 and haven't gone away 
> despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
> A few things we've noticed about the handoffs:
> 1. The phantom handoff endpoint changes after a non-zero handoff comes through
> 2. Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before
> 3. The sstable2json output seems to include multiple sub-sections for each 
> handoff with the same "deletedAt" information.
> The phantom handoff endpoint changes after a non-zero handoff comes through:
>  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
> Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before:
>  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
> The first few entries from one of the json files:
> {
> "0aaa": {
> "ccf5dc203a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 
> "ccf603303a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 



[jira] [Updated] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2013-01-11 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5068:
--

Affects Version/s: 1.2.0

> CLONE - Once a host has been hinted to, log messages for it repeat every 10 
> mins even if no hints are delivered
> ---
>
> Key: CASSANDRA-5068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.1.6, 1.2.0
> Environment: cassandra 1.1.6
> java 1.6.0_30
>Reporter: Peter Haggerty
>Assignee: Brandon Williams
>Priority: Minor
>  Labels: hinted, hintedhandoff, phantom
>
> We have "0 row" hinted handoffs every 10 minutes like clockwork. This impacts 
> our ability to monitor the cluster by adding persistent noise in the handoff 
> metric.
> Previous mentions of this issue are here:
> http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
> The hinted handoffs can be scrubbed away with
> nodetool -h 127.0.0.1 scrub system HintsColumnFamily
> but they return anywhere from a few minutes to multiple hours later.
> These started to appear after an upgrade to 1.1.6 and haven't gone away 
> despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
> A few things we've noticed about the handoffs:
> 1. The phantom handoff endpoint changes after a non-zero handoff comes through
> 2. Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before
> 3. The sstable2json output seems to include multiple sub-sections for each 
> handoff with the same "deletedAt" information.
> The phantom handoff endpoint changes after a non-zero handoff comes through:
>  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
> Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before:
>  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
> The first few entries from one of the json files:
> {
> "0aaa": {
> "ccf5dc203a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 
> "ccf603303a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 



[jira] [Commented] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2012-12-13 Thread Peter Haggerty (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13531678#comment-13531678
 ] 

Peter Haggerty commented on CASSANDRA-5068:
---

When there are zero-row hinted handoffs, the output of "list HintsColumnFamily"
might show that 9 of 12 nodes in a ring have a row key like this:
RowKey: 7554

1 of the 12 nodes will have a different row key than all the rest:
RowKey: 1554

Another 1-2 nodes might not have any RowKeys at all.


> CLONE - Once a host has been hinted to, log messages for it repeat every 10 
> mins even if no hints are delivered
> ---
>
> Key: CASSANDRA-5068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
> Project: Cassandra
>  Issue Type: Bug
>  Components: Core
>Affects Versions: 1.1.6
> Environment: cassandra 1.1.6
> java 1.6.0_30
>Reporter: Peter Haggerty
>Assignee: Brandon Williams
>Priority: Minor
>  Labels: hinted, hintedhandoff, phantom
>
> We have "0 row" hinted handoffs every 10 minutes like clockwork. This impacts 
> our ability to monitor the cluster by adding persistent noise in the handoff 
> metric.
> Previous mentions of this issue are here:
> http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html
> The hinted handoffs can be scrubbed away with
> nodetool -h 127.0.0.1 scrub system HintsColumnFamily
> but they return anywhere from a few minutes to several hours later.
> These started to appear after an upgrade to 1.1.6 and haven't gone away 
> despite rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.
> A few things we've noticed about the handoffs:
> 1. The phantom handoff endpoint changes after a non-zero handoff comes through
> 2. Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before
> 3. The sstable2json output seems to include multiple sub-sections for each 
> handoff with the same "deletedAt" information.
> The phantom handoff endpoint changes after a non-zero handoff comes through:
>  INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
>  INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
>  INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
> Sometimes a non-zero handoff will be immediately followed by an "off 
> schedule" phantom handoff to the endpoint the phantom had been using before:
>  INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
>  INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
>  INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java 
> (line 392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
> The first few entries from one of the json files:
> {
> "0aaa": {
> "ccf5dc203a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 
> "ccf603303a2211e2e154da71a9bb": {
> "deletedAt": -9223372036854775808, 
> "subColumns": []
> }, 



[jira] [Updated] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2012-12-13 Thread Peter Haggerty (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Haggerty updated CASSANDRA-5068:
--

Fix Version/s: (was: 0.8.10)
   (was: 1.0.7)
 Reviewer:   (was: jbellis)
   Labels: hinted hintedhandoff phantom  (was: )
  Description: 
We have "0 row" hinted handoffs every 10 minutes like clockwork. This impacts 
our ability to monitor the cluster by adding persistent noise in the handoff 
metric.

Previous mentions of this issue are here:
http://www.mail-archive.com/user@cassandra.apache.org/msg25982.html

The hinted handoffs can be scrubbed away with
nodetool -h 127.0.0.1 scrub system HintsColumnFamily
but they return anywhere from a few minutes to several hours later.

These started to appear after an upgrade to 1.1.6 and haven't gone away despite 
rolling cleanups, rolling restarts, multiple rounds of scrubbing, etc.

A few things we've noticed about the handoffs:
1. The phantom handoff endpoint changes after a non-zero handoff comes through

2. Sometimes a non-zero handoff will be immediately followed by an "off 
schedule" phantom handoff to the endpoint the phantom had been using before

3. The sstable2json output seems to include multiple sub-sections for each 
handoff with the same "deletedAt" information.



The phantom handoff endpoint changes after a non-zero handoff comes through:
 INFO [HintedHandoff:1] 2012-12-11 06:57:35,093 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
 INFO [HintedHandoff:1] 2012-12-11 07:07:35,092 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.1
 INFO [HintedHandoff:1] 2012-12-11 07:07:37,915 HintedHandOffManager.java (line 
392) Finished hinted handoff of 1058 rows to endpoint /10.10.10.2
 INFO [HintedHandoff:1] 2012-12-11 07:17:35,093 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2
 INFO [HintedHandoff:1] 2012-12-11 07:27:35,093 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.2



Sometimes a non-zero handoff will be immediately followed by an "off schedule" 
phantom handoff to the endpoint the phantom had been using before:
 INFO [HintedHandoff:1] 2012-12-12 21:47:39,335 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
 INFO [HintedHandoff:1] 2012-12-12 21:57:39,335 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
 INFO [HintedHandoff:1] 2012-12-12 22:07:43,319 HintedHandOffManager.java (line 
392) Finished hinted handoff of 1416 rows to endpoint /10.10.10.4
 INFO [HintedHandoff:1] 2012-12-12 22:07:43,320 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.3
 INFO [HintedHandoff:1] 2012-12-12 22:17:39,357 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4
 INFO [HintedHandoff:1] 2012-12-12 22:27:39,337 HintedHandOffManager.java (line 
392) Finished hinted handoff of 0 rows to endpoint /10.10.10.4



The first few entries from one of the json files:
{
"0aaa": {
"ccf5dc203a2211e2e154da71a9bb": {
"deletedAt": -9223372036854775808, 
"subColumns": []
}, 
"ccf603303a2211e2e154da71a9bb": {
"deletedAt": -9223372036854775808, 
"subColumns": []
}, 


  was:
{noformat}
 INFO 15:36:03,977 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:36:03,978 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:46:31,248 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:46:31,249 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:56:29,448 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:56:29,449 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:06:09,949 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:06:09,950 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:16:21,529 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:16:21,530 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
{noformat}

Introduced by CASSANDRA-3554.  The problem is that until a compaction runs on 
the hints column family, tombstones remain present, causing the isEmpty() check 
to return false.

  Environment: 
cassandra 1.1.6
java 1.6.0_30
Affects Version/s: (was: 0.6)
   1.1.6

Cloning CASSANDRA-3733 as it seems to be the same issue.

> CLONE -

[jira] [Created] (CASSANDRA-5068) CLONE - Once a host has been hinted to, log messages for it repeat every 10 mins even if no hints are delivered

2012-12-13 Thread Peter Haggerty (JIRA)
Peter Haggerty created CASSANDRA-5068:
-

 Summary: CLONE - Once a host has been hinted to, log messages for 
it repeat every 10 mins even if no hints are delivered
 Key: CASSANDRA-5068
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5068
 Project: Cassandra
  Issue Type: Bug
  Components: Core
Affects Versions: 0.6
Reporter: Peter Haggerty
Assignee: Brandon Williams
Priority: Minor
 Fix For: 0.8.10, 1.0.7


{noformat}
 INFO 15:36:03,977 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:36:03,978 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:46:31,248 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:46:31,249 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 15:56:29,448 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 15:56:29,449 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:06:09,949 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:06:09,950 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
 INFO 16:16:21,529 Started hinted handoff for token: 
170141183460469231731687303715884105726 with IP: /10.179.111.137
 INFO 16:16:21,530 Finished hinted handoff of 0 rows to endpoint /10.179.111.137
{noformat}

Introduced by CASSANDRA-3554.  The problem is that until a compaction runs on 
the hints column family, tombstones remain present, causing the isEmpty() check 
to return false.
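The tombstone/isEmpty() interaction can be sketched with a small, hypothetical model (HintRow, PhantomHandoffDemo, and all method names here are illustrative, not Cassandra's actual classes): delivering a hint leaves a tombstone column behind, so the row still reports non-empty and the handoff loop keeps logging "0 rows" every cycle, until a compaction purges the tombstone.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a hints row: column name -> deletedAt
// timestamp, where Long.MIN_VALUE means the column is still live.
class HintRow {
    private final Map<String, Long> columns = new HashMap<>();

    void addHint(String name) {
        columns.put(name, Long.MIN_VALUE);
    }

    // Delivering a hint does not drop the column; it leaves a tombstone.
    void deliverHint(String name, long deletedAt) {
        columns.put(name, deletedAt);
    }

    // Mirrors the flawed check: tombstone columns still count, so a row of
    // nothing but tombstones reports "not empty" and triggers a handoff pass.
    boolean isEmpty() {
        return columns.isEmpty();
    }

    // Compaction purges tombstones older than gcGrace, after which the row
    // is genuinely empty and the phantom handoffs stop.
    void compact(long now, long gcGrace) {
        columns.entrySet().removeIf(e -> e.getValue() != Long.MIN_VALUE
                                         && now - e.getValue() > gcGrace);
    }
}

public class PhantomHandoffDemo {
    public static void main(String[] args) {
        HintRow row = new HintRow();
        row.addHint("hint-1");
        row.deliverHint("hint-1", 1000L);   // delivered, tombstone remains
        System.out.println(row.isEmpty());  // false: "0 row" handoff fires
        row.compact(999_999L, 0L);          // compaction removes tombstone
        System.out.println(row.isEmpty());  // true: no more phantom handoffs
    }
}
```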
