[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-30 Thread Paulo Motta (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-16271:

Status: Changes Suggested  (was: Review In Progress)

> Writes timeout instead of failing on cluster with CL-1 replicas available 
> during replace
> 
>
> Key: CASSANDRA-16271
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16271
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination
> Reporter: Krishna Vadali
> Assignee: Sam Tunnicliffe
> Priority: Normal
> Attachments: sleep_before_replace.diff
>
>
> Writes time out instead of failing on a cluster with CL-1 replicas available 
> during a replace-node operation.
> With Consistency Level ALL, we are observing Timeout exceptions during writes 
> when (RF - 1) nodes are available in the cluster with one replace-node 
> operation running. The coordinator is expecting RF + 1 responses, while only 
> RF nodes (RF-1 nodes in UN and 1 node in UJ) are available in the 
> cluster, hence the timeout.
> The same problem happens on a keyspace with RF=1, CL=ONE and one replica 
> being replaced, and also with RF=3, CL=QUORUM, one replica down and another 
> being replaced.
> I believe the expected behavior is that the write should fail with 
> UnavailableException since there are not enough NORMAL replicas to fulfill 
> the request.
> h4. *Steps to reproduce:*
> Run a 3 node test cluster (call the nodes node1 (127.0.0.1), node2 
> (127.0.0.2), node3 (127.0.0.3)):
> {code:java}
>  ccm create test -v 3.11.3 -n 3 -s
> {code}
> Create test keyspaces with RF = 3 and RF = 1 respectively:
> {code:java}
> create keyspace rf3 with replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
> create keyspace rf1 with replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> {code}
> Create a table test in both the keyspaces:
> {code:java}
> create table rf3.test ( pk int primary KEY, value int);
> create table rf1.test ( pk int primary KEY, value int);
> {code}
> Stop node node2:
> {code:java}
> ccm node2 stop
> {code}
> Create node node4:
> {code:java}
> ccm add node4 -i 127.0.0.4
> {code}
> Enable auto_bootstrap on node4:
> {code:java}
> ccm node4 updateconf 'auto_bootstrap: true'
> {code}
> Ensure node4 does not have itself in its seeds list.
> Start node4 as a replacement for node2 (address 127.0.0.2 corresponds to 
> node2):
> {code:java}
> ccm node4 start --jvm_arg="-Dcassandra.replace_address=127.0.0.2"
> {code}
> While the replace operation is running, perform writes/reads with CONSISTENCY 
> ALL; we observed a TimeoutException:
> {code:java}
> cqlsh> CONSISTENCY ALL;
> cqlsh> insert into rf3.test (pk, value) values (16, 7);
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 3 responses." info={'received_responses': 3, 'required_responses': 4, 
> 'consistency': 'ALL'}
> {code}
> {code:java}
> cqlsh> CONSISTENCY ONE; 
> cqlsh> insert into rf1.test (pk, value) VALUES(5, 1); 
> WriteTimeout: Error from server: code=1100 [Coordinator node timed out 
> waiting for replica nodes' responses] message="Operation timed out - received 
> only 1 responses." info={'received_responses': 1, 'required_responses': 2, 
> 'consistency': 'ONE'}
> {code}
> Cluster State:
> {code:java}
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load        Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  70.45 KiB   1       100.0%            4f652b22-045b-493b-8722-fb5f7e1723ce  rack1
> UN  127.0.0.3  70.43 KiB   1       100.0%            a0dcd677-bdb3-4947-b9a7-14f3686a709f  rack1
> UJ  127.0.0.4  137.47 KiB  1       ?                 e3d794f1-081e-4aba-94f2-31950c713846  rack1
> {code}
> Note: 
> We introduced a sleep during the replace operation in order to run our 
> experiments. The attached code diff (sleep_before_replace.diff) does this.
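
The response arithmetic in the description above can be sketched in a few lines of Python. This is only an illustration, not Cassandra's coordinator code: the function names are hypothetical, and the rule that each pending (joining or replacing) replica adds one required acknowledgement is an assumption inferred from the reported numbers (required_responses of 4 at RF=3/ALL and 2 at RF=1/ONE).

```python
def base_block_for(consistency: str, rf: int) -> int:
    """Replica acks a consistency level needs, ignoring pending replicas."""
    table = {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}
    return table[consistency]


def write_outlook(consistency: str, rf: int, live_normal: int, pending: int) -> dict:
    """Compare the acks a write waits for with the acks that can arrive."""
    # Assumption: every pending (UJ) replica adds one required ack.
    required = base_block_for(consistency, rf) + pending
    # Assumption: every live replica, normal (UN) or pending (UJ), responds.
    possible = live_normal + pending
    return {
        "required": required,
        "possible": possible,
        "can_succeed": possible >= required,
    }


# (consistency, RF, live NORMAL replicas, pending replicas) from the report
scenarios = [
    ("ALL", 3, 2, 1),     # node2 down, node4 replacing it
    ("ONE", 1, 0, 1),     # the only replica is being replaced
    ("QUORUM", 3, 1, 1),  # one replica down, another being replaced
]

for cl, rf, live_normal, pending in scenarios:
    print(cl, f"RF={rf}", write_outlook(cl, rf, live_normal, pending))
```

Under these assumptions, fewer replicas can respond than are required in all three scenarios, so the write cannot succeed however long the coordinator waits. That is why the reporter expects an immediate UnavailableException rather than a timeout.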



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-30 Thread Paulo Motta (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-16271:

Reviewers: Krishna Vadali, Paulo Motta, Paulo Motta  (was: Krishna Vadali, Paulo Motta)
   Status: Review In Progress  (was: Patch Available)


[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-20 Thread Sam Tunnicliffe (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-16271:

Test and Documentation Plan: Added new unit tests. Run existing dtests.
 Status: Patch Available  (was: Open)


[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Krishna Vadali (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Vadali updated CASSANDRA-16271:
---
Attachment: sleep_before_replace.diff


[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Krishna Vadali (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krishna Vadali updated CASSANDRA-16271:
---
Reviewers: Krishna Vadali, Paulo Motta  (was: Paulo Motta)


[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Paulo Motta (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-16271:

Reviewers: Paulo Motta


[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Sam Tunnicliffe (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sam Tunnicliffe updated CASSANDRA-16271:

Status: Open  (was: Triage Needed)


[jira] [Updated] (CASSANDRA-16271) Writes timeout instead of failing on cluster with CL-1 replicas available during replace

2020-11-13 Thread Paulo Motta (Jira)



 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paulo Motta updated CASSANDRA-16271:

 Bug Category: Parent values: Correctness(12982); Level 1 values: API / Semantic Implementation(12988)
   Complexity: Normal
Discovered By: User Report
 Severity: Normal
Since Version: 2.2.8

