[
https://issues.apache.org/jira/browse/CASSANDRA-16061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189074#comment-17189074
]
Berenguer Blasi commented on CASSANDRA-16061:
---------------------------------------------
This is the failure for reference as ci-cass logs are gone now:
{noformat}
===Flaky Test Report===
test_move_forwards_and_cleanup failed; it passed 0 out of the required 1 times.
<class 'ccmlib.node.TimeoutError'>
02 Sep 2020 07:18:39 [node4] Missing: ['Starting listening for CQL
clients']:
INFO [main] 2020-09-02 09:17:30,390 YamlConfigura.....
See system.log for remainder
[<TracebackEntry
/media/sf_VBoxSharedFolder/dtestsvbox/transient_replication_ring_test.py:301>,
<TracebackEntry
/media/sf_VBoxSharedFolder/dtestsvbox/transient_replication_ring_test.py:231>,
<TracebackEntry
/media/sf_VBoxSharedFolder/dtestsvbox/transient_replication_ring_test.py:47>,
<TracebackEntry
/media/sf_VBoxSharedFolder/dtestsvbox/src/ccm/ccmlib/node.py:798>,
<TracebackEntry
/media/sf_VBoxSharedFolder/dtestsvbox/src/ccm/ccmlib/node.py:591>,
<TracebackEntry
/media/sf_VBoxSharedFolder/dtestsvbox/src/ccm/ccmlib/node.py:548>]
===End Flaky Test Report===
{noformat}
It's hard to reproduce. There seems to be some exotic race where
[bootstrap.get()|https://github.com/apache/cassandra/blob/23ba48aa935d3f81e66b65285fa8e7972f94dcfe/src/java/org/apache/cassandra/service/StorageService.java#L1584]
blocks because a default connection never completes; it is stuck blocking
[here|https://github.com/apache/cassandra/blob/23ba48aa935d3f81e66b65285fa8e7972f94dcfe/src/java/org/apache/cassandra/streaming/DefaultConnectionFactory.java#L50].
The only visible result is that the test times out.
That shouldn't happen, as the default connection has a built-in timeout. Even
forcing a timeout myself when waiting on it doesn't do the trick. Somehow
connecting to node1 is not possible.
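For anyone following along, "forcing a timeout" on the wait amounts to something like the sketch below. It uses the plain JDK {{CompletableFuture}} as a stand-in rather than the actual Cassandra or netty types, and {{awaitWithTimeout}} is a made-up helper name, so treat it as an illustration of the idea only.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustration only: plain JDK futures, not the Cassandra/netty classes.
class BoundedWaitSketch {

    // Hypothetical helper: waits at most timeoutMillis for the future.
    // Returns true if it completed in time, false if the wait timed out.
    static boolean awaitWithTimeout(CompletableFuture<?> future, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stands in for a connection attempt that never completes.
        CompletableFuture<String> neverCompletes = new CompletableFuture<>();
        System.out.println("completed: " + awaitWithTimeout(neverCompletes, 200)); // completed: false

        // A future that is already done returns immediately.
        CompletableFuture<String> alreadyDone = CompletableFuture.completedFuture("ok");
        System.out.println("completed: " + awaitWithTimeout(alreadyDone, 200)); // completed: true
    }
}
```

A bounded wait like this only turns a hang into a visible timeout; it says nothing about why the connection to node1 never completes.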
I have been debugging this as much as I can. The netty code takes some time to
get into and I don't have a full grasp of it yet, although what I saw made sense.
{{bootstrap.get()}} blocks in AbstractFuture,
[parking|https://github.com/google/guava/blob/v27.0/guava/src/com/google/common/util/concurrent/AbstractFuture.java#L523]
the thread. If you google a bit you'll find many people hitting blocked
threads around this area, apparently because guava makes some assumptions.
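As a rough illustration of what a parked thread looks like from the outside, the sketch below blocks a thread on an untimed {{get()}} and reads its state. It uses the JDK's {{CompletableFuture}} as a stand-in for guava's AbstractFuture, and the class and method names are made up for the example. A thread blocked this way should report {{WAITING}}, which is also how it appears in a thread dump.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Illustration only: a JDK future standing in for the guava one.
class ParkedThreadSketch {

    // Blocks a helper thread on an untimed get() and reports the state it ends up in.
    static Thread.State stateWhileBlockedOnGet() throws InterruptedException {
        CompletableFuture<Void> never = new CompletableFuture<>();
        Thread waiter = new Thread(() -> {
            try {
                never.get(); // untimed get(): blocks until the future completes
            } catch (Exception ignored) {
                // not expected in this sketch
            }
        }, "waiter");
        waiter.setDaemon(true);
        waiter.start();

        // Poll for up to 5 seconds until the helper thread has actually blocked.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (waiter.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = waiter.getState();

        // Release the helper thread so it can exit.
        never.complete(null);
        waiter.join(1000);
        return observed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("state while blocked: " + stateWhileBlockedOnGet());
    }
}
```

If nothing ever completes the future, as in the failure above, the thread simply stays in that state until the test harness gives up.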
I am taking a break here as I am not making progress, so I need to go back to
the drawing board on how to approach this. If somebody wants to take up the
challenge, feel free to assign it to yourself.
> transient_replication_ring_test.py::TestTransientReplicationRing::test_move_forwards_and_cleanup
> ------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-16061
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16061
> Project: Cassandra
> Issue Type: Bug
> Components: Test/dtest/python
> Reporter: Ekaterina Dimitrova
> Assignee: Berenguer Blasi
> Priority: Normal
> Fix For: 4.0-beta
>
>
> Failing here, also locally:
> [https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/312/workflows/da4ce69c-e778-467e-b9f3-27ab166a8321/jobs/1945]