[ 
https://issues.apache.org/jira/browse/CASSANDRA-16752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andres de la Peña updated CASSANDRA-16752:
------------------------------------------
     Bug Category: Parent values: Correctness(12982)Level 1 values: Test 
Failure(12990)
       Complexity: Normal
    Discovered By: DTest
         Severity: Normal
           Status: Open  (was: Triage Needed)

> Python dtest queries executed after restarting a node might fail
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-16752
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16752
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Andres de la Peña
>            Assignee: Andres de la Peña
>            Priority: Normal
>
> We have multiple dtest failures in queries with CL=ALL that are executed 
> right after restarting a node:
>  * 
> [test_14330|https://ci-cassandra.apache.org/job/Cassandra-4.0.0/26/testReport/dtest.consistency_test/TestConsistency/test_14330/]
>  * 
> [test_multiple_repair|https://ci-cassandra.apache.org/job/Cassandra-4.0.0/29/testReport/dtest.repair_tests.incremental_repair_test/TestIncRepair/test_multiple_repair/]
>  * 
> [test_multiple_repair(offheap)|https://ci-cassandra.apache.org/job/Cassandra-4.0.0/26/testReport/dtest-offheap.repair_tests.incremental_repair_test/TestIncRepair/test_multiple_repair/]
> The error is a server-side {{Unavailable}} failure indicating that there 
> aren't enough live replicas to achieve the required consistency level:
> {code:java}
> tools/assertions.py:182: in assert_all
>     res = session.execute(simple_query) if timeout is None else 
> session.execute(simple_query, timeout=timeout)
> ../venv/src/cassandra-driver/cassandra/cluster.py:2618: in execute
>     return self.execute_async(query, parameters, trace, custom_payload, 
> timeout, execution_profile, paging_state, host, execute_as).result()
> ...
> E           cassandra.Unavailable: Error from server: code=1000 [Unavailable 
> exception] message="Cannot achieve consistency level ALL" 
> info={'consistency': 'ALL', 'required_replicas': 2, 'alive_replicas': 1}
> {code}
> The Jenkins failure in {{test_14330}} can be reproduced in the multiplexer, 
> although it requires many iterations:
>  * 
> [https://app.circleci.com/pipelines/github/adelapena/cassandra/600/workflows/2f1b7455-1217-4553-985a-a255fe98ba47/jobs/5784]
>  * 
> [https://app.circleci.com/pipelines/github/adelapena/cassandra/600/workflows/2f1b7455-1217-4553-985a-a255fe98ba47/jobs/5786]
> As a hint, this same test [survives the 
> multiplexer|https://app.circleci.com/pipelines/github/adelapena/cassandra/605/workflows/aba5e7a6-c988-40d9-ab56-7848e711b5e1]
>  if we add [a short 
> sleep|https://github.com/adelapena/cassandra-dtest/commit/28ca11bf31db4c836e9c34500d18cbb9316e3cca]
>  right after restarting the node.
> I think the reason for the failure is that CCM is not properly waiting for 
> the propagation of the node status. As a result, the session uses a 
> not-yet-restarted node to run the CL=ALL query before that node actually 
> sees the restarted node as alive. The {{wait_other_notice}} parameter in 
> {{node.start}} should guarantee that this doesn't happen, but I suspect it 
> isn't working as expected. When the {{wait_other_notice}} option is used on 
> node start, the 
> [{{node.watch_log_for_alive}}|https://github.com/riptano/ccm/blob/358c06781c4a8b85dcfca49449232b7405dacc0c/ccmlib/node.py#L657-L664]
>  function looks for the regexp {{%s.* now UP}} in the logs of the other 
> nodes, to ensure that they consider that node as up. However, that expression 
> is matched by two different log entries emitted on node restart:
> {code:java}
> Node /127.0.0.2:7000 has restarted, now UP
> ...
> Node /127.0.0.2:7000 is now UP
> {code}
> The first message is emitted by 
> [{{Gossiper.handleMajorStateChange}}|https://github.com/apache/cassandra/blob/cassandra-4.0.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1313],
>  while the second message is emitted by 
> [{{Gossiper.realMarkAlive}}|https://github.com/apache/cassandra/blob/cassandra-4.0.0/src/java/org/apache/cassandra/gms/Gossiper.java#L1263].
>  I think that we need to wait for the second one, while CCM is waiting only 
> for the first one.
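For illustration, the ambiguity can be checked directly with Python's {{re}} module. This is a simplified stand-in for the pattern CCM builds (the real code interpolates and escapes the node address into {{%s.* now UP}}):

```python
import re

# The two gossip log lines quoted above.
restart_line = "Node /127.0.0.2:7000 has restarted, now UP"
alive_line = "Node /127.0.0.2:7000 is now UP"

# Simplified version of the pattern CCM builds in watch_log_for_alive.
loose = re.compile(r"/127\.0\.0\.2.* now UP")

# Both entries match, so CCM may return as soon as the first
# (handleMajorStateChange) message appears in the log.
assert loose.search(restart_line)
assert loose.search(alive_line)

# A stricter pattern matches only the realMarkAlive message.
strict = re.compile(r"/127\.0\.0\.2.* is now UP")
assert not strict.search(restart_line)
assert strict.search(alive_line)
```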
> Indeed, we can easily reproduce the failure in any of the affected dtests by 
> adding a sleep at the beginning of {{Gossiper.realMarkAlive}}. We can get 
> around the failure if we manually modify the test to wait for the right log 
> message, as it's done 
> [here|https://github.com/adelapena/cassandra-dtest/commit/7355be7dd94c91c44b78bd2ccfa1c731a00d6a63].
>  With that wait the test [survives the multiplexer 
> rounds|https://app.circleci.com/pipelines/github/adelapena/cassandra/605/workflows/6ec189fc-4701-4a27-a7af-382a6c53dd0d].
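As a rough sketch, a stricter wait would only return once the {{realMarkAlive}} message appears. The helper below is illustrative only, not the real ccm API; {{read_log_lines}} and the function name are assumptions standing in for reading the other node's log:

```python
import re
import time

def wait_for_alive(read_log_lines, address, timeout=10.0, poll=0.1):
    """Poll a node's log until the realMarkAlive message for `address`
    appears, i.e. '<address> is now UP', rather than the earlier
    'has restarted, now UP' message from handleMajorStateChange.
    `read_log_lines` is any callable returning the log as a list of
    lines. (Hypothetical helper, not part of ccm.)"""
    pattern = re.compile(re.escape(address) + r".* is now UP")
    deadline = time.time() + timeout
    while time.time() < deadline:
        if any(pattern.search(line) for line in read_log_lines()):
            return True
        time.sleep(poll)
    return False

# Simulated log of a non-restarted node: the 'has restarted' entry
# alone is not enough; the wait succeeds only once 'is now UP' appears.
log = ["Node /127.0.0.2:7000 has restarted, now UP"]
assert not wait_for_alive(lambda: log, "/127.0.0.2:7000", timeout=0.3)
log.append("Node /127.0.0.2:7000 is now UP")
assert wait_for_alive(lambda: log, "/127.0.0.2:7000", timeout=0.3)
```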
> This problem seems quite unlikely to happen since the time window is very 
> short, but we have seen it in CI and it affects multiple Python dtests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
