[ 
https://issues.apache.org/jira/browse/CASSANDRA-18660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742565#comment-17742565
 ] 

Brandon Williams commented on CASSANDRA-18660:
----------------------------------------------

Regardless of the reason why some nodes aren't seeing the restart, we don't 
need them to if we're waiting for CQL to come up. Unless the test requires 
that, which it doesn't since it's going to query each node individually, over 
CQL.  So let's just do 
[that|https://github.com/driftx/cassandra-dtest/commit/1ffd4d3892a743c4bc65cd89eb8a59a7b588c60d].

||Branch||CI||
|[trunk|https://github.com/driftx/cassandra/tree/CASSANDRA-18660-trunk]|[500|https://app.circleci.com/pipelines/github/driftx/cassandra/1130/workflows/fd2d9733-7264-49dd-99b5-b9451e861d70/jobs/36808]

> transient_replication_ring_test.py::TestTransientReplicationRing::test_bootstrap_and_cleanup
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-18660
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18660
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/python
>            Reporter: Andres de la Peña
>            Priority: Normal
>             Fix For: 5.x
>
>
> {{transient_replication_ring_test.py::TestTransientReplicationRing::test_bootstrap_and_cleanup}}
>  is flaky at least in CircleCI and {{trunk}}:
> https://app.circleci.com/pipelines/github/adelapena/cassandra/3018/workflows/2c81b63f-63b8-4baf-8945-d3c680edede6/jobs/58482/tests
> {code}
> ccmlib.node.TimeoutError: 11 Jul 2023 13:59:08 [node3] after 120.11/120 
> seconds Missing: ['127.0.0.2:7000.* is now UP'] not found in system.log:
>  Head: INFO  [Messaging-EventLoop-3-6] 2023-07-11 13:57:0
>  Tail: 
> ...7.0.0.3:7000(/127.0.0.1:56442)->/127.0.0.2:7000-LARGE_MESSAGES-da6108fe 
> successfully connected, version = 12, framing = CRC, encryption = unencrypted
> self = <transient_replication_ring_test.TestTransientReplicationRing object 
> at 0x7f2f340c2c88>
>     @flaky(max_runs=1)
>     @pytest.mark.no_vnodes
>     def test_bootstrap_and_cleanup(self):
>         """Test bootstrapping a new node across a mix of repaired and 
> unrepaired data"""
>         main_session = self.patient_cql_connection(self.node1)
>         nodes = [self.node1, self.node2, self.node3]
>     
>         for i in range(0, 40, 2):
>             self.insert_row(i, i, i, main_session)
>     
>         sessions = [self.exclusive_cql_connection(node) for node in 
> [self.node1, self.node2, self.node3]]
>     
>         expected = [gen_expected(range(0, 11, 2), range(22, 40, 2)),
>                     gen_expected(range(0, 22, 2), range(32, 40, 2)),
>                     gen_expected(range(12, 31, 2))]
>         self.check_expected(sessions, expected)
>     
>         # Make sure at least a little data is repaired, this shouldn't move 
> data anywhere
>         repair_nodes(nodes)
>     
>         self.check_expected(sessions, expected)
>     
>         # Ensure that there is at least some transient data around, because 
> of this if it's missing after bootstrap
>         # We know we failed to get it from the transient replica losing the 
> range entirely
>         nodes[1].stop(wait_other_notice=True)
>     
>         for i in range(1, 40, 2):
>             self.insert_row(i, i, i, main_session)
>     
> >       nodes[1].start(wait_for_binary_proto=True)
> transient_replication_ring_test.py:184: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> transient_replication_ring_test.py:52: in new_start
>     return old_start(*args, **kwargs)
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:896: in start
>     node.watch_log_for_alive(self, from_mark=mark)
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:665: in 
> watch_log_for_alive
>     self.watch_log_for(tofind, from_mark=from_mark, timeout=timeout, 
> filename=filename)
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:593: in watch_log_for
>     head=reads[:50], tail="..."+reads[len(reads)-150:]))
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ 
> start = 1689083828.8439646, timeout = 120
> msg = "Missing: ['127.0.0.2:7000.* is now UP'] not found in system.log:\n 
> Head: INFO  [Messaging-EventLoop-3-6] 2023-07-11 
> 1...127.0.0.2:7000-LARGE_MESSAGES-da6108fe successfully connected, version = 
> 12, framing = CRC, encryption = unencrypted\n"
> node = 'node3'
>     @staticmethod
>     def raise_if_passed(start, timeout, msg, node=None):
>         if start + timeout < time.time():
> >           raise TimeoutError.create(start, timeout, msg, node)
> E           ccmlib.node.TimeoutError: 11 Jul 2023 13:59:08 [node3] after 
> 120.11/120 seconds Missing: ['127.0.0.2:7000.* is now UP'] not found in 
> system.log:
> E            Head: INFO  [Messaging-EventLoop-3-6] 2023-07-11 13:57:0
> E            Tail: 
> ...7.0.0.3:7000(/127.0.0.1:56442)->/127.0.0.2:7000-LARGE_MESSAGES-da6108fe 
> successfully connected, version = 12, framing = CRC, encryption = unencrypted
> ../env3.6/lib/python3.6/site-packages/ccmlib/node.py:56: TimeoutError
> {code}
> The CircleCI config to reproduce can be generated with:
> {code}
> .circleci/generate.sh -p \
>   -e REPEATED_DTESTS_COUNT=500 \
>   -e 
> REPEATED_DTESTS=transient_replication_ring_test.py::TestTransientReplicationRing::test_bootstrap_and_cleanup
> {code}
> Flakiness seems ~1%



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to