[
https://issues.apache.org/jira/browse/CASSANDRA-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328437#comment-17328437
]
Andres de la Peña commented on CASSANDRA-16614:
-----------------------------------------------
The failure can be consistently reproduced by inserting a sleep of a few
seconds [right before killing the
node|https://github.com/apache/cassandra-dtest/blob/trunk/pending_range_test.py#L67],
simulating a slow CI environment. That makes the other nodes see the down
node as not moving, which reproduces the reported failure.
The proposed patch replaces the check that the other nodes see
the node as {{MOVING}} with a check that the nodes have seen the movement,
which is true if the node is still {{MOVING}} or if it has gone back to
{{NORMAL}}. However, the original purpose of the test was to exercise the
{{MOVING}} scenario, since the original bug didn't affect the case where the
down node has gone back to {{NORMAL}}. In other words, we want to test while
the node is moving, not after it has moved.
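To illustrate the difference between the two checks, here is a minimal sketch. The strict pattern is the one shown in the error message of this ticket; the relaxed pattern is only my assumption of what an "either {{MOVING}} or {{NORMAL}}" check could look like, not the actual patch, and the two ring lines are invented samples modeled on the columns in the reported output.

```python
import re

# Strict pattern, as it appears in the assertion error of this ticket.
STRICT = re.compile(r"127\.0\.0\.1.*?Down.*?Moving")

# Relaxed pattern: an assumption of what accepting both states could look like.
RELAXED = re.compile(r"127\.0\.0\.1.*?Down.*?(Moving|Normal)")

# Hypothetical ring lines for the down node, as seen by another node.
moving_line = "127.0.0.1  rack1  Down  Moving  90.86 KiB  40.00%  5534023222112865484"
normal_line = "127.0.0.1  rack1  Down  Normal  90.86 KiB  40.00%  5534023222112865484"


def sees(pattern, ring_output):
    """Return True if the pattern is found in the given ring output."""
    return pattern.search(ring_output) is not None


# The strict check only accepts the state the test is meant to cover.
print(sees(STRICT, moving_line), sees(STRICT, normal_line))
# The relaxed check also accepts a node that has already finished moving.
print(sees(RELAXED, moving_line), sees(RELAXED, normal_line))
```

With the relaxed pattern the assertion can succeed even when the node was never observed as {{MOVING}}, so the test would no longer be exercising the scenario affected by the original bug.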
This problem can be seen if we add the aforementioned sleep to simulate a slow
CI environment and we run the modified test against a Cassandra version that
contains the bug fixed by CASSANDRA-10887, for example 3.0.2
({{--cassandra-version=3.0.2}}). In this case the test will always pass without
detecting the bug that it's meant to detect.
I think we should fix the test to make sure that the down node is always seen
as {{MOVING}} by the other nodes. We could do that by killing the node while it
is sleeping for {{ring_delay_ms}} before actually moving the data. That time
window is 30 seconds, which seems more than enough time to do the log checks
and kill the node. We can even further increase the value of {{ring_delay_ms}}
to be totally sure that we have enough time to kill the node while it's still
moving. Since {{nodetool move}} waits for the stream/fetch phase, the test can
simply call {{nodetool move}} asynchronously and wait for the log entries
reporting the {{MOVING}} status before killing the node, as is done in [this
commit|https://github.com/adelapena/cassandra-dtest/commit/e7aa346c0fe94d26e9a1ce4e607caec3353059dd].
This seems to pass even if we add a sleep before killing the node, and it
still detects the original bug if we use a pre-CASSANDRA-10887 Cassandra
version.
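The ordering described above can be sketched with a small self-contained simulation. This is not the code in the linked commit and it does not use the ccm or dtest APIs: {{FakeNode}}, {{wait_for_log}} and {{kill_while_moving}} are hypothetical names, and the delays are scaled down to fractions of a second. It only shows why a long delay between announcing {{MOVING}} and finishing the move makes the kill land in the right window even with an extra sleep.

```python
import threading
import time


class FakeNode:
    """Hypothetical stand-in for a cluster node; not the ccm or dtest API."""

    def __init__(self, ring_delay_s):
        self.ring_delay_s = ring_delay_s
        self.log = []  # state changes as the other nodes would see them
        self.killed = threading.Event()
        self._lock = threading.Lock()

    def move(self):
        # Announce MOVING, wait for the ring delay, then finish the move.
        self._append("state MOVING")
        # Event.wait returns True if the node was killed during the delay.
        if self.killed.wait(self.ring_delay_s):
            return
        self._append("state NORMAL")

    def kill(self):
        self.killed.set()

    def _append(self, line):
        with self._lock:
            self.log.append(line)

    def wait_for_log(self, text, timeout_s):
        # Poll the log until a line containing `text` shows up or we time out.
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if any(text in line for line in self.log):
                    return True
            time.sleep(0.01)
        return False


def kill_while_moving(node, slow_ci_sleep_s=0.0):
    # Start the move in the background instead of blocking on it.
    mover = threading.Thread(target=node.move)
    mover.start()
    # Only proceed once the MOVING state has been reported.
    assert node.wait_for_log("MOVING", timeout_s=5)
    time.sleep(slow_ci_sleep_s)  # simulated slow CI environment
    node.kill()
    mover.join()
    return list(node.log)
```

For example, {{kill_while_moving(FakeNode(ring_delay_s=2.0), slow_ci_sleep_s=0.2)}} kills the node while it is still in the delay, whereas a ring delay shorter than the sleep lets the node reach {{NORMAL}} before the kill, which is the race behind the flakiness.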
wdyt?
> Flaky test_pending_range
> ------------------------
>
> Key: CASSANDRA-16614
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16614
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Berenguer Blasi
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 4.0-rc
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Flaky
> [test_pending_range|https://ci-cassandra.apache.org/job/Cassandra-trunk/445/testReport/junit/dtest-large-novnode.pending_range_test/TestPendingRangeMovements/test_pending_range/]
> {noformat}
> Error Message
> AssertionError: assert None is not None + where None = <function search at
> 0x7f29dfa83b80>('127\\.0\\.0\\.1.*?Down.*?Moving', '\nDatacenter:
> datacenter1\n==========\nAddress Rack Status State Load
> Owns ... rack1 Up Normal 90.86 KiB
> 40.00% 5534023222112865484 \n\n\n ') +
> where <function search at 0x7f29dfa83b80> = re.search
> {noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)