[jira] [Commented] (CASSANDRA-16614) Flaky test_pending_range

Jira Thu, 22 Apr 2021 06:19:19 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328437#comment-17328437
 ]


Andres de la Peña commented on CASSANDRA-16614:
-----------------------------------------------

The failure can be consistently reproduced by inserting a sleep of a few 
seconds [right before killing the 
node|https://github.com/apache/cassandra-dtest/blob/trunk/pending_range_test.py#L67],
 simulating a slow CI environment. That makes the other nodes see the down 
node as not moving, reproducing the reported failure.

The proposed patch replaces the check that the other nodes see the node as 
{{MOVING}} with a check that the nodes have seen the movement, which is true 
either if it's still {{MOVING}} or if it has gone back to {{NORMAL}}. However, 
the original purpose of the test was testing the {{MOVING}} scenario, since 
the original bug didn't affect the case where the down node has gone back to 
{{NORMAL}}. In other words, we want to test while the node is moving, not 
after it has moved.
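
As a rough illustration of the difference between the two checks, here is a 
minimal sketch. The strict pattern is the one shown in the reported assertion 
error; the ring rows and the relaxed pattern are hypothetical and are not the 
code in the proposed patch.

```python
import re

# Hypothetical ring rows, modelled on the columns in the reported assertion
# error (Address, Rack, Status, State, Load, Owns, Token).
RING_WHILE_MOVING = "127.0.0.1  rack1  Down  Moving  90.86 KiB  40.00%  5534023222112865484"
RING_AFTER_MOVE = "127.0.0.1  rack1  Down  Normal  90.86 KiB  40.00%  5534023222112865484"

# Pattern taken from the reported failure message.
STRICT = r'127\.0\.0\.1.*?Down.*?Moving'
# Hypothetical relaxed pattern: accepts the down node in either state.
RELAXED = r'127\.0\.0\.1.*?Down.*?(Moving|Normal)'


def seen_as(pattern, ring_output):
    """Return True if the ring output contains a row matching the pattern."""
    return re.search(pattern, ring_output) is not None


for name, ring in [("while moving", RING_WHILE_MOVING), ("after move", RING_AFTER_MOVE)]:
    print(name, "strict:", seen_as(STRICT, ring), "relaxed:", seen_as(RELAXED, ring))
```

The relaxed check accepts both rows, so it cannot tell whether the assertions 
ran while the node was still moving.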

This problem can be seen if we add the aforementioned sleep to simulate a slow 
CI environment and we run the modified test against a Cassandra version that 
contains the bug fixed by CASSANDRA-10887, for example 3.0.2 
({{--cassandra-version=3.0.2}}). In this case the test will always pass without 
detecting the bug that it's meant to detect.

I think we should fix the test to make sure that the down node is always seen 
as {{MOVING}} by the other nodes. We could do that by killing the node while it 
is sleeping for {{ring_delay_ms}} before actually moving the data. That time 
window is 30 seconds, which seems more than enough time to do the log checks 
and kill the node. We could even increase the value of {{ring_delay_ms}} 
further to be sure that we have enough time to kill the node while it's still 
moving. Since {{nodetool move}} waits for the stream/fetch phase, the test can 
simply call {{nodetool move}} asynchronously and wait for the log entries 
reporting the {{MOVING}} status before killing the node, as is done in [this 
commit|https://github.com/adelapena/cassandra-dtest/commit/e7aa346c0fe94d26e9a1ce4e607caec3353059dd].
 This seems to pass even if we add a sleep before killing the node, and it 
still detects the original bug if we use a pre-CASSANDRA-10887 Cassandra 
version.
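
The timing argument can be sketched with a small simulation. This is not the 
dtest or ccm code: {{FakeNode}}, {{kill_while_moving}} and the delays are 
hypothetical stand-ins, and the move is modelled as a thread that announces 
{{MOVING}}, sleeps for the ring delay and then goes back to {{NORMAL}}.

```python
import threading
import time


class FakeNode:
    """Hypothetical stand-in for a cluster node; not the ccm API."""

    def __init__(self, ring_delay_s):
        self.ring_delay_s = ring_delay_s
        self.state = "NORMAL"
        self._killed = threading.Event()
        self._moving_announced = threading.Event()

    def move(self):
        # Announce MOVING, then sleep for the ring delay before moving data.
        self.state = "MOVING"
        self._moving_announced.set()
        # wait() returns True if kill() was called during the ring delay.
        if self._killed.wait(self.ring_delay_s):
            return  # killed while still MOVING
        self.state = "NORMAL"

    def wait_for_moving(self, timeout):
        # Stand-in for waiting on the log entries reporting MOVING.
        return self._moving_announced.wait(timeout)

    def kill(self):
        self._killed.set()


def kill_while_moving(node, slow_ci_sleep_s=0.0):
    """Start the move asynchronously, wait for MOVING, then kill the node."""
    mover = threading.Thread(target=node.move)
    mover.start()
    assert node.wait_for_moving(timeout=5)
    time.sleep(slow_ci_sleep_s)  # simulated slow CI environment
    node.kill()
    mover.join()
    return node.state  # state the node was in when it died


# Long ring delay: the node is killed while it is still MOVING.
print(kill_while_moving(FakeNode(ring_delay_s=2.0), slow_ci_sleep_s=0.2))
# Ring delay shorter than the CI slowdown: the node is back to NORMAL.
print(kill_while_moving(FakeNode(ring_delay_s=0.05), slow_ci_sleep_s=0.3))
```

As long as the ring delay is comfortably longer than any slowdown between the 
log check and the kill, the node dies in the {{MOVING}} state.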

wdyt?

> Flaky test_pending_range
> ------------------------
>
>                 Key: CASSANDRA-16614
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16614
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Berenguer Blasi
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Flaky 
> [test_pending_range|https://ci-cassandra.apache.org/job/Cassandra-trunk/445/testReport/junit/dtest-large-novnode.pending_range_test/TestPendingRangeMovements/test_pending_range/]
> {noformat}
> Error Message
> AssertionError: assert None is not None  +  where None = <function search at 
> 0x7f29dfa83b80>('127\\.0\\.0\\.1.*?Down.*?Moving', '\nDatacenter: 
> datacenter1\n==========\nAddress         Rack        Status State   Load      
>       Owns               ...   rack1       Up     Normal  90.86 KiB       
> 40.00%              5534023222112865484                         \n\n\n  ')  + 
>    where <function search at 0x7f29dfa83b80> = re.search
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
