[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-31 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723544#comment-14723544
 ] 

Joel Knighton commented on CASSANDRA-10068:
---

[CASSANDRA-10231|https://issues.apache.org/jira/browse/CASSANDRA-10231] is a 
follow-up for the status bug.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that 
> crashes and decommissions nodes throughout the test.
> At the conclusion of the test, a batchlog replay is initiated through 
> nodetool and hits the following assertion due to a missing host ID: 
> https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197
> A nodetool status on the node with failed batchlog replay shows the following 
> entry for the decommissioned node:
> DN  10.0.0.5  ?  256  ?   null  rack1
> On the unaffected nodes, there is no entry for the decommissioned node as 
> expected.
> There are occasional hits of the same assertions for logs in other nodes; it 
> looks like the issue might occasionally resolve itself, but one node seems to 
> have the errant null entry indefinitely.
> In logs for the nodes, this possibly unrelated exception also appears:
> java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
>   at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91) ~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]
> I have a running cluster with the issue on my machine; it is also repeatable.
> Nothing stands out in the logs of the decommissioned node (n4) for me. The 
> logs of each node in the cluster are attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-28 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720282#comment-14720282
 ] 

Joel Knighton commented on CASSANDRA-10068:
---

Current 3.0 seems to resolve the symptoms (error on batchlog replay) but not 
the root cause, which is status entries like
?N  10.0.0.5  ?  256  ?   null  rack1
that stay until node restart.

I imagine this causes other issues not exposed by the test.
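
A lingering entry of this kind can be spotted mechanically. The following is a minimal sketch, not part of Cassandra or nodetool: it assumes the text of a nodetool status run has already been captured as a string, and that the host ID is the second-to-last whitespace-separated column, as in the entry quoted above. The function name null_host_id_entries and the first sample row (including its host ID) are made up for illustration.

```python
import re

# Node rows start with a two-character status/state code such as "UN", "DN" or "?N".
ROW = re.compile(r"^[UD?][NLJM]\s")

def null_host_id_entries(status_output):
    """Return the addresses of node rows whose host ID column reads 'null'."""
    flagged = []
    for line in status_output.splitlines():
        if not ROW.match(line):
            continue  # skip headers, separators and blank lines
        fields = line.split()
        if len(fields) < 4:
            continue  # too short to be a node row
        # Assumed layout: the address is the second column and the host ID
        # is the second-to-last column (the rack comes last).
        address, host_id = fields[1], fields[-2]
        if host_id == "null":
            flagged.append(address)
    return flagged

# Hypothetical captured output: the first row is invented, the second is
# the entry reported in this ticket.
sample = """\
UN  10.0.0.1  100.5 KB  256  ?  6d1f0a52-0000-0000-0000-000000000001  rack1
?N  10.0.0.5  ?  256  ?   null  rack1
"""

print(null_host_id_entries(sample))
```

Running a check like this against each node would show which nodes still carry the stale entry, since the unaffected nodes are expected to list no row for the decommissioned node at all.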

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Fix For: 3.0 beta 2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-28 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720015#comment-14720015
 ] 

Joel Knighton commented on CASSANDRA-10068:
---

I'll do that today. Thanks.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Fix For: 3.0 beta 2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-28 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718196#comment-14718196
 ] 

Branimir Lambov commented on CASSANDRA-10068:
-

Could you try re-running this with current 3.0? CASSANDRA-6230 should solve the 
problem, as it rewrites most of the related code and has [explicit treatment of 
hints when removing 
nodes|https://github.com/apache/cassandra/commit/96d41f0e0e44d9b3114a5d80dedf12053d36a76b#diff-7521fc1047150d28ea85486225c66578R265].

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Fix For: 3.0 beta 2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-19 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703510#comment-14703510
 ] 

Joel Knighton commented on CASSANDRA-10068:
---

I haven't had any luck reproducing this with a dtest; the timing is too 
difficult to hit.

I've narrowed down the cause slightly (maybe?) through watching Jepsen tests 
that reproduce the issue.

The null gossip entries are present in nodes that crash at a particular time 
(seems to be quite late) in the decommission of the node. When started (after 
the decommission has finished without an error present), they have the null 
entry. A restart removes this null entry.

Hope this helps.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0 beta 2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-17 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700737#comment-14700737
 ] 

Joel Knighton commented on CASSANDRA-10068:
---

[~krummas] Instructions on setting up the environment are available at 
https://github.com/riptano/jepsen/tree/cassandra/cassandra.

Specifically, the test under consideration can be run as 
{code}
lein with-profile +trunk test :only cassandra.mv-test/mv-crash-subset-decommission
{code}

That said, I understand the environment setup is a bit laborious, and I'm still 
working on reproducing this with the provided dtest.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0 beta 2
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-17 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699214#comment-14699214
 ] 

Marcus Eriksson commented on CASSANDRA-10068:
-

[~jkni] how do I run this Jepsen test locally?

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0.0 rc1
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-14 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697276#comment-14697276
 ] 

Joel Knighton commented on CASSANDRA-10068:
---

[~krummas] I reran the Jepsen tests. I'm still seeing the "Trying to get the 
view natural endpoint on a non-data replica" exception (as well as the other 
batchlog replay failure), and the run is definitely using the new code that 
passes in afterLeaveTokenMetadata, as seen in the stack trace:

java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91) ~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]

I'm going to try to reproduce this with the dtest.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0.0 rc1
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log





[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned

2015-08-14 Thread Marcus Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696884#comment-14696884
 ] 

Marcus Eriksson commented on CASSANDRA-10068:
-

bq. java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica

This happens because, while we are decommissioning, the leaving node is still 
in TokenMetadata, so the nodes receiving the rows don't think they should own 
them. A patch that solves this is here: 
https://github.com/krummas/cassandra/commits/marcuse/10068 
DTest here: https://github.com/krummas/cassandra-dtest/commits/marcuse/10068

[~jkni] I doubt this is related to the other errors you are seeing, so I will 
keep looking into those, but could you rerun the test just to make sure it is 
not related?
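
The ownership mismatch described above can be illustrated with a toy token ring. This is a simplified sketch, not Cassandra's replication code: it assumes one token per node, a replication factor of 2, and a plain clockwise walk around the ring. The function natural_endpoints, the node names, and the token values are all hypothetical.

```python
from bisect import bisect_left

def natural_endpoints(ring, token, rf):
    """Walk the ring clockwise from the first token >= `token`, collecting
    `rf` distinct nodes. `ring` maps token -> node name."""
    tokens = sorted(ring)
    start = bisect_left(tokens, token) % len(tokens)
    owners = []
    for i in range(len(tokens)):
        node = ring[tokens[(start + i) % len(tokens)]]
        if node not in owners:
            owners.append(node)
        if len(owners) == rf:
            break
    return owners

# Toy ring with one token per node; n4 is the node being decommissioned.
ring = {0: "n1", 25: "n2", 50: "n3", 75: "n4"}
# The same ring as it will look once n4 has left.
after_leave = {t: n for t, n in ring.items() if n != "n4"}

key_token = 60
print(natural_endpoints(ring, key_token, 2))         # replicas while n4 is still in the ring
print(natural_endpoints(after_leave, key_token, 2))  # replicas once n4 is gone
```

In this toy model, n2 becomes a replica for the key only once n4 is gone. If n2 receives the row during the decommission but computes ownership from the ring that still contains n4, it concludes that it is not a data replica, which matches the explanation given above.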

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Joel Knighton
>Assignee: Marcus Eriksson
> Fix For: 3.0.0 rc1
>
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log


