[
https://issues.apache.org/jira/browse/CASSANDRA-16657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347139#comment-17347139
]
Ekaterina Dimitrova edited comment on CASSANDRA-16657 at 5/18/21, 8:17 PM:
---------------------------------------------------------------------------
I have added some logging around Paxos
[here|https://github.com/ekaterinadimitrova2/cassandra/pull/113/commits/b67b9779d8d85bd85dbe5c85dcccc3daf63bca42]
and I didn't find any Paxos-related errors in the logs of the failed runs of
this test.
Following this
[warning|https://github.com/apache/cassandra-dtest/blob/trunk/paxos_test.py#L65-L6],
I verified which key corresponds to which token using our partitioner utility
functions, and I also printed from the code which node is the target replica
for each key (double-checking myself).
The last _INSERT_ (where all the failures of this test happen) should go to
node 2, so my first thought was that there was a delay in bringing up node 2.
But it turned out that for that statement to be executed, we need both node 2
and node 3 to be up: if I comment out the start of either of those two nodes,
the test consistently fails.
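For reference, below is a minimal sketch of the kind of key-to-node check I
mean, done offline with the Python driver's {{Murmur3Token}} helper. The node
token values are an assumption based on CCM's default even token allocation
for a 3-node, no-vnodes cluster, so treat this as illustrative rather than
authoritative:
{code:python}
# Illustrative sketch only: map the test's int partition keys to the node that
# should own them, assuming Murmur3Partitioner and CCM's default even token
# allocation for a 3-node, no-vnodes cluster.
import struct

from cassandra.metadata import Murmur3Token

NUM_NODES = 3
# Assumption: CCM gives node i (1-based) the token ((2**64 // NUM_NODES) * (i - 1)) - 2**63
node_tokens = [((2**64 // NUM_NODES) * i) - 2**63 for i in range(NUM_NODES)]


def owning_node(k):
    """Return the 1-based index of the node owning int partition key k (RF=1)."""
    # CQL int partition keys are hashed from their 4-byte big-endian serialization.
    token = Murmur3Token.from_key(struct.pack('>i', k)).value
    # A node owns the range (previous node's token, its own token], wrapping around.
    for i, node_token in enumerate(node_tokens):
        if token <= node_token:
            return i + 1
    return 1  # wrapped past the highest token: back to the node with the lowest one


for k in (0, 1, 3, 5, 6):
    print('key %d -> node%d' % (k, owning_node(k)))
{code}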
My understanding is that this test has to verify that the following issue was
solved: Paxos required all nodes for CAS (CASSANDRA-8640). It seems to me that
maybe this was solved only partially, or something has changed later (I will
dig into that as a next step), as this is a pretty old ticket. What I mean is
that the first four operations always succeed, but the last one always fails
if node 3 is also down, even though at this point only the replica, node 2,
should need to be available. The replication factor is 1.
{code:python}
session.execute("INSERT INTO test (k, v) VALUES (0, 0) IF NOT EXISTS")  # respective token belongs to node 2
node3.stop()
session.execute("INSERT INTO test (k, v) VALUES (1, 1) IF NOT EXISTS")  # respective token belongs to node 2
node2.stop()
session.execute("INSERT INTO test (k, v) VALUES (3, 2) IF NOT EXISTS")  # respective token belongs to node 1
node2.start()
session.execute("INSERT INTO test (k, v) VALUES (5, 5) IF NOT EXISTS")  # respective token belongs to node 1
node3.start()
session.execute("INSERT INTO test (k, v) VALUES (6, 6) IF NOT EXISTS")  # respective token belongs to node 2{code}
So the next logical question was why the first insert of key 1 to node 2 after
stopping node 3 is always successful. Then I checked the logs, and here comes
the confusing part: the CAS completes before the node is stopped:
{code:java}
DEBUG [MutationStage-1] 2021-05-18 02:49:54,316 PaxosState.java:78 - Promising ballot b958b790-b783-11eb-52d1-a2e59ce8c54f
DEBUG [MutationStage-1] 2021-05-18 02:49:54,323 PaxosState.java:116 - Accepting proposal Commit(b958b790-b783-11eb-52d1-a2e59ce8c54f, [ks.test] key=1 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [v]]
Row[info=[ts=1621306194313000] ]: EMPTY | [v=1 ts=1621306194313000])
DEBUG [MutationStage-1] 2021-05-18 02:49:54,326 PaxosState.java:153 - Committing proposal Commit(b958b790-b783-11eb-52d1-a2e59ce8c54f, [ks.test] key=1 partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647 columns=[[] | [v]]
Row[info=[ts=1621306194313000] ]: EMPTY | [v=1 ts=1621306194313000])
DEBUG [StorageServiceShutdownHook] 2021-05-18 02:49:54,330 StorageService.java:1619 - DRAINING: starting drain process
INFO [StorageServiceShutdownHook] 2021-05-18 02:49:54,331 HintsService.java:220 - Paused hints dispatch
INFO [StorageServiceShutdownHook] 2021-05-18 02:49:54,332 Server.java:171 - Stop listening for CQL clients
INFO [StorageServiceShutdownHook] 2021-05-18 02:49:54,333 Gossiper.java:1962 - Announcing shutdown{code}
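If part of what we are seeing is simply timing between the {{stop()}}/{{start()}}
calls and the CAS that follows them, a purely hypothetical tweak (a sketch of
ccm's wait options reusing the test's existing {{session}} and node handles, not
a fix for the underlying question) would be to make the test block until the
rest of the cluster notices each topology change before issuing the next LWT:
{code:python}
# Hypothetical sketch, not the fix: wait until the other nodes actually notice
# each stop/start before the next LWT, so the CAS cannot race with the shutdown
# or the restart. node3/session come from the existing dtest.
node3.stop(wait_other_notice=True)  # block until the live nodes see node 3 as down
session.execute("INSERT INTO test (k, v) VALUES (1, 1) IF NOT EXISTS")
# ...
node3.start(wait_for_binary_proto=True, wait_other_notice=True)
session.execute("INSERT INTO test (k, v) VALUES (6, 6) IF NOT EXISTS")
{code}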
I will add further logs and check for related post-CASSANDRA-8640 changes in
the area.
//CC [~blerer], as I remember you were also working on some Paxos-related
issues at some point.
> Flaky TestPaxos
> ---------------
>
> Key: CASSANDRA-16657
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16657
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Berenguer Blasi
> Assignee: Ekaterina Dimitrova
> Priority: Normal
> Fix For: 4.0-rc
>
>
> During testing for another ticket I came across
> [this|https://ci-cassandra.apache.org/job/Cassandra-devbranch/736/testReport/junit/dtest-novnode.paxos_test/TestPaxos/test_cluster_availability/]
> Paxos failure in a test run:
> {noformat}
> Error Message
> cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level SERIAL" info={'consistency': 'SERIAL', 'required_replicas': 1, 'alive_replicas': 0}
> Stacktrace
> self = <paxos_test.TestPaxos object at 0x7fafbd3f4850>
>
>     @pytest.mark.no_vnodes
>     def test_cluster_availability(self):
>         # Warning, a change in partitioner or a change in CCM token allocation
>         # may require the partition keys of these inserts to be changed.
>         # This must not use vnodes as it relies on assumed token values.
>
>         session = self.prepare(nodes=3)
>         session.execute("CREATE TABLE test (k int PRIMARY KEY, v int)")
>         session.execute("INSERT INTO test (k, v) VALUES (0, 0) IF NOT EXISTS")
>
>         self.cluster.nodelist()[2].stop()
>         session.execute("INSERT INTO test (k, v) VALUES (1, 1) IF NOT EXISTS")
>
>         self.cluster.nodelist()[1].stop()
>         session.execute("INSERT INTO test (k, v) VALUES (3, 2) IF NOT EXISTS")
>
>         self.cluster.nodelist()[1].start()
>         session.execute("INSERT INTO test (k, v) VALUES (5, 5) IF NOT EXISTS")
>
>         self.cluster.nodelist()[2].start()
> >       session.execute("INSERT INTO test (k, v) VALUES (6, 6) IF NOT EXISTS")
> paxos_test.py:83:
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ../venv/src/cassandra-driver/cassandra/cluster.py:2618: in execute
>     return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state, host, execute_as).result()
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> self = <ResponseFuture: query='<SimpleStatement query="INSERT INTO test (k, v) VALUES (6, 6) IF NOT EXISTS", consistency=Not ...el SERIAL" info={'consistency': 'SERIAL', 'required_replicas': 1, 'alive_replicas': 0} coordinator_host=127.0.0.2:9042>
>
>     def result(self):
>         """
>         Return the final result or raise an Exception if errors were
>         encountered. If the final result or error has not been set
>         yet, this method will block until it is set, or the timeout
>         set for the request expires.
>
>         Timeout is specified in the Session request execution functions.
>         If the timeout is exceeded, an :exc:`cassandra.OperationTimedOut`
>         will be raised.
>         This is a client-side timeout. For more information
>         about server-side coordinator timeouts, see
>         :class:`.policies.RetryPolicy`.
>
>         Example usage::
>
>             >>> future = session.execute_async("SELECT * FROM mycf")
>             >>> # do other stuff...
>
>             >>> try:
>             ...     rows = future.result()
>             ...     for row in rows:
>             ...         ... # process results
>             ... except Exception:
>             ...     log.exception("Operation failed:")
>
>         """
>         self._event.wait()
>         if self._final_result is not _NOT_SET:
>             return ResultSet(self, self._final_result)
>         else:
> >           raise self._final_exception
> E           cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level SERIAL" info={'consistency': 'SERIAL', 'required_replicas': 1, 'alive_replicas': 0}
> ../venv/src/cassandra-driver/cassandra/cluster.py:4894: Unavailable
> {noformat}