[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14723544#comment-14723544 ]

Joel Knighton commented on CASSANDRA-10068:
---

[10231|https://issues.apache.org/jira/browse/CASSANDRA-10231] is a follow-up for the status bug.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
>
>
> This issue is reproducible through a Jepsen test of materialized views that
> crashes and decommissions nodes throughout the test.
> At the conclusion of the test, a batchlog replay is initiated through
> nodetool and hits the following assertion due to a missing host ID:
> https://github.com/apache/cassandra/blob/3413e557b95d9448b0311954e9b4f53eaf4758cd/src/java/org/apache/cassandra/service/StorageProxy.java#L1197
> A nodetool status on the node with failed batchlog replay shows the following
> entry for the decommissioned node:
> DN 10.0.0.5 ? 256 ? null rack1
> On the unaffected nodes, there is no entry for the decommissioned node as
> expected.
> There are occasional hits of the same assertions for logs in other nodes; it
> looks like the issue might occasionally resolve itself, but one node seems to
> have the errant null entry indefinitely.
> In logs for the nodes, this possibly unrelated exception also appears:
> java.lang.RuntimeException: Trying to get the view natural endpoint on a
> non-data replica
> at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91)
> ~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]
> I have a running cluster with the issue on my machine; it is also repeatable.
> Nothing stands out in the logs of the decommissioned node (n4) for me. The
> logs of each node in the cluster are attached.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
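The errant `nodetool status` entry quoted in the description can be spotted mechanically. The following is a minimal sketch, not part of Cassandra or nodetool: the class and method names are hypothetical, and it assumes the usual column order (status/state, address, load, tokens, owns, host ID, rack), so that the host ID is the second-to-last whitespace-separated field. The healthy row and its host ID are invented for illustration; only the `DN 10.0.0.5` row comes from the report.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: flags status rows whose Host ID column is "null". */
final class StatusCheck {

    /** Returns the addresses of data rows whose second-to-last field is "null". */
    static List<String> endpointsWithNullHostId(List<String> statusLines) {
        List<String> result = new ArrayList<>();
        for (String line : statusLines) {
            String[] fields = line.trim().split("\\s+");
            // Data rows start with a two-character status/state code such as
            // UN or DN; the thread also shows "?N". Headers are skipped.
            if (fields.length < 4 || !fields[0].matches("[UD?][NLJM]")) {
                continue;
            }
            // Assumption: Host ID is the second-to-last column, Rack the last.
            if (fields[fields.length - 2].equals("null")) {
                result.add(fields[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "--  Address   Load  Tokens  Owns  Host ID  Rack",
            // Hypothetical healthy row; the host ID is made up.
            "UN  10.0.0.1  ?     256     ?     11111111-2222-3333-4444-555555555555  rack1",
            // Row taken from the issue description.
            "DN 10.0.0.5 ? 256 ? null rack1");
        System.out.println(endpointsWithNullHostId(lines));
    }
}
```

Such a check only detects the symptom; per the comments in this thread, the entry persists until the affected node is restarted.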
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720282#comment-14720282 ]

Joel Knighton commented on CASSANDRA-10068:
---

Current 3.0 seems to resolve the symptoms (error on batchlog replay) but not the root cause, which is status entries like

?N 10.0.0.5 ? 256 ? null rack1

that stay until node restart. I imagine this causes other issues not exposed by the test.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Fix For: 3.0 beta 2
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720015#comment-14720015 ]

Joel Knighton commented on CASSANDRA-10068:
---

I'll do that today. Thanks.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Fix For: 3.0 beta 2
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718196#comment-14718196 ]

Branimir Lambov commented on CASSANDRA-10068:
---

Could you try re-running this with current 3.0? CASSANDRA-6230 should solve the problem, as it rewrites most of the related code and has [explicit treatment of hints when removing nodes|https://github.com/apache/cassandra/commit/96d41f0e0e44d9b3114a5d80dedf12053d36a76b#diff-7521fc1047150d28ea85486225c66578R265].

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Branimir Lambov
> Fix For: 3.0 beta 2
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703510#comment-14703510 ]

Joel Knighton commented on CASSANDRA-10068:
---

I haven't had any luck repro-ing this with a dtest - the timing issues are too difficult.

I've narrowed down the cause slightly (maybe?) through watching Jepsen tests that reproduce the issue. The null gossip entries are present in nodes that crash at a particular time (seems to be quite late) in the decommission of the node. When started (after the decommission has finished without an error present), they have the null entry. A restart removes this null entry.

Hope this helps.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0 beta 2
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14700737#comment-14700737 ]

Joel Knighton commented on CASSANDRA-10068:
---

[~krummas] Instructions on setting up the environment are available at https://github.com/riptano/jepsen/tree/cassandra/cassandra. Specifically, the test under consideration can be run as
{code}
lein with-profile +trunk test :only cassandra.mv-test/mv-crash-subset-decommission
{code}

That said, I understand the environment setup is a bit laborious, and I'm still working on reproducing this with the provided dtest.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0 beta 2
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699214#comment-14699214 ]

Marcus Eriksson commented on CASSANDRA-10068:
---

[~jkni] how do I run this Jepsen test locally?

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0.0 rc1
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697276#comment-14697276 ]

Joel Knighton commented on CASSANDRA-10068:
---

[~krummas] I reran the Jepsen tests - I'm still seeing the "Trying to get the view natural endpoint on a non-data replica" (as well as the other batchlog replay failure), and it is definitely using the new code passing in afterLeaveTokenMetadata, as seen in the stack trace:

java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica
at org.apache.cassandra.db.view.MaterializedViewUtils.getViewNaturalEndpoint(MaterializedViewUtils.java:91) ~[apache-cassandra-3.0.0-alpha1-SNAPSHOT.jar:3.0.0-alpha1-SNAPSHOT]

I'm going to try and repro with the dtest.

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0.0 rc1
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
[jira] [Commented] (CASSANDRA-10068) Batchlog replay fails with exception after a node is decommissioned
[ https://issues.apache.org/jira/browse/CASSANDRA-10068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696884#comment-14696884 ]

Marcus Eriksson commented on CASSANDRA-10068:
---

bq. java.lang.RuntimeException: Trying to get the view natural endpoint on a non-data replica

this is due to the fact that while we are decommissioning, the leaving node is still in TokenMetadata so the nodes receiving the rows don't think they should own them.

Patch here: https://github.com/krummas/cassandra/commits/marcuse/10068 that solves that.
DTest here: https://github.com/krummas/cassandra-dtest/commits/marcuse/10068

[~jkni] I doubt this is related to the other errors you are seeing so I will keep looking for that, but could you rerun the test just to make sure it is not related?

> Batchlog replay fails with exception after a node is decommissioned
> ---
>
> Key: CASSANDRA-10068
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10068
> Project: Cassandra
> Issue Type: Bug
> Reporter: Joel Knighton
> Assignee: Marcus Eriksson
> Fix For: 3.0.0 rc1
> Attachments: n1.log, n2.log, n3.log, n4.log, n5.log
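The ownership mismatch described in this comment can be illustrated with a toy token ring. This is a sketch under stated assumptions, not the patch linked above and not Cassandra's `TokenMetadata`: `ToyRing`, its methods, the tokens, and the simple "walk clockwise and take the first rf distinct endpoints" placement are all hypothetical. It shows only that a node receiving rows from a leaving node is not yet a replica under the current ring, but is one under the ring as it will look after the leave.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy token ring for illustration; not Cassandra's TokenMetadata. */
final class ToyRing {
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();

    ToyRing add(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
        return this;
    }

    /** Copy of this ring with every token of the leaving endpoint removed. */
    ToyRing withoutEndpoint(String leaving) {
        ToyRing copy = new ToyRing();
        for (Map.Entry<Long, String> e : tokenToEndpoint.entrySet()) {
            if (!e.getValue().equals(leaving)) {
                copy.add(e.getKey(), e.getValue());
            }
        }
        return copy;
    }

    /** First rf distinct endpoints, walking clockwise from the key's token. */
    List<String> replicasFor(long keyToken, int rf) {
        // Endpoints at or after the key's token, then wrap to the ring start.
        List<String> ordered =
            new ArrayList<>(tokenToEndpoint.tailMap(keyToken, true).values());
        ordered.addAll(tokenToEndpoint.headMap(keyToken, false).values());

        Set<String> replicas = new LinkedHashSet<>();
        for (String endpoint : ordered) {
            if (replicas.size() == rf) {
                break;
            }
            replicas.add(endpoint);
        }
        return new ArrayList<>(replicas);
    }

    public static void main(String[] args) {
        ToyRing current = new ToyRing()
            .add(0L, "n1").add(100L, "n2").add(200L, "n3")
            .add(300L, "n4").add(400L, "n5");
        ToyRing afterLeave = current.withoutEndpoint("n4"); // n4 is decommissioning

        System.out.println("current:     " + current.replicasFor(250L, 2));
        System.out.println("after leave: " + afterLeave.replicasFor(250L, 2));
    }
}
```

In this toy setup, a key with token 250 and rf = 2 maps to n4 and n5 on the current ring and to n5 and n1 once n4 is removed. A check made on n1 against the current ring would therefore conclude that n1 is not a data replica for rows it is receiving from n4, which is the situation the comment describes.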