[ https://issues.apache.org/jira/browse/CASSANDRA-15865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17218202#comment-17218202 ]
Berenguer Blasi edited comment on CASSANDRA-15865 at 10/21/20, 9:51 AM: ------------------------------------------------------------------------ Some progress. The only failure's logs look like {noformat} grep -ri "is now" node1_debug.log INFO [GossipStage:1] 2020-10-19 15:34:07,204 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:34:07,264 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:34:19,054 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:34:31,968 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:04,653 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:35:24,762 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP grep -ri "is now" node2_debug.log INFO [GossipStage:1] 2020-10-19 15:34:07,930 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:34:07,950 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:34:31,988 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:34:45,688 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:35:04,633 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:24,791 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:25,316 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN {noformat} whereas all the other million non-failing runs diplay {noformat} grep -ri "is now" node1_debug.log INFO [GossipStage:1] 2020-10-19 15:35:40,341 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:35:40,391 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:52,169 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:36:05,123 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:36:57,732 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP grep -ri "is now" node2_debug.log INFO [GossipStage:1] 2020-10-19 15:35:41,061 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:35:41,094 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:05,150 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:18,807 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:36:37,710 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,728 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:57,756 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:58,660 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN {noformat} We can see node2 behaves exactly the same whereas node1 displays an extra 'DOWN'. Matching timestamps this is what's going on in the logs for a bad run {noformat} DEBUG [GossipStage:1] 2020-10-19 15:35:11,626 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60) INFO [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN DEBUG [GossipStage:1] 2020-10-19 15:35:11,716 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000 DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1192 - removing expire time for endpoint : /127.0.0.2:7000 INFO [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60) DEBUG [GossipStage:1] 2020-10-19 15:35:23,368 Gossiper.java:1480 - Shadow request received, adding all states DEBUG [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:471 - Convicting /127.0.0.2:7000 with status shutdown - alive true INFO [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN DEBUG [GossipStage:1] 2020-10-19 15:35:24,728 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000 DEBUG [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change INFO [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP {noformat} and this is what happens for a good run: {noformat} DEBUG [GossipStage:1] 2020-10-19 15:36:44,713 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (4f3d62af-22c5-3886-8c0b-c5f1b95fc403) INFO [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN DEBUG [GossipStage:1] 2020-10-19 15:36:44,960 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000 DEBUG [GossipStage:1] 2020-10-19 15:36:56,647 Gossiper.java:1480 - Shadow request received, adding all states DEBUG [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change INFO [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP {noformat} Looking at {{Gossiper}} code and trying infer what's going on looks like node2 is reported UP/DOWN an extra time but I can't infer what's happening from only looking at the logs. I have added tracing level to the test and hope to repro overnight. I am attaching a zip [^logsGoodNBad.zip] for the good and bad runs for the record. was (Author: bereng): Some progress. The only failure's logs look like {noformat} grep -ri "is now" node1_debug.log INFO [GossipStage:1] 2020-10-19 15:34:07,204 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:34:07,264 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:34:19,054 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:34:31,968 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:04,653 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:35:24,762 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP grep -ri "is now" node2_debug.log INFO [GossipStage:1] 2020-10-19 15:34:07,930 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:34:07,950 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:34:31,988 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:34:45,688 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:35:04,633 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:04,652 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:24,791 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:25,316 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN {noformat} whereas all the other million non-failing runs diplay {noformat} grep -ri "is now" node1_debug.log INFO [GossipStage:1] 2020-10-19 15:35:40,341 Gossiper.java:1235 - Node /127.0.0.2:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:35:40,391 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:35:52,169 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:36:05,123 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,729 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:36:57,732 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP grep -ri "is now" node2_debug.log INFO [GossipStage:1] 2020-10-19 15:35:41,061 Gossiper.java:1235 - Node /127.0.0.1:7000 is now part of the cluster INFO [GossipStage:1] 2020-10-19 15:35:41,094 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:05,150 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:18,807 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN INFO [GossipStage:1] 2020-10-19 15:36:37,710 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:37,728 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:57,756 Gossiper.java:1193 - InetAddress /127.0.0.1:7000 is now UP INFO [GossipStage:1] 2020-10-19 15:36:58,660 Gossiper.java:1211 - InetAddress /127.0.0.1:7000 is now DOWN {noformat} We can see node2 behaves exactly the same whereas node1 displays an extra 'DOWN'. Matching timestamps this is what's going on in the logs for a bad run {noformat} DEBUG [GossipStage:1] 2020-10-19 15:35:11,626 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60) INFO [GossipStage:1] 2020-10-19 15:35:11,715 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN DEBUG [GossipStage:1] 2020-10-19 15:35:11,716 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000 DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1192 - removing expire time for endpoint : /127.0.0.2:7000 INFO [GossipStage:1] 2020-10-19 15:35:11,719 Gossiper.java:1193 - InetAddress /127.0.0.2:7000 is now UP DEBUG [GossipStage:1] 2020-10-19 15:35:11,719 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (7e1669d0-936a-331b-8aca-b6a47709bc60) DEBUG [GossipStage:1] 2020-10-19 15:35:23,368 Gossiper.java:1480 - Shadow request received, adding all states DEBUG [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:471 - Convicting /127.0.0.2:7000 with status shutdown - alive true INFO [GossipStage:1] 2020-10-19 15:35:24,727 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN DEBUG [GossipStage:1] 2020-10-19 15:35:24,728 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000 DEBUG [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change INFO [GossipStage:1] 2020-10-19 15:35:24,729 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP {noformat} and this is what happens for a good run: {noformat} DEBUG [GossipStage:1] 2020-10-19 15:36:44,713 MigrationManager.java:89 - Not pulling schema from /127.0.0.2:7000, because schema versions match (4f3d62af-22c5-3886-8c0b-c5f1b95fc403) INFO [GossipStage:1] 2020-10-19 15:36:44,959 Gossiper.java:1211 - InetAddress /127.0.0.2:7000 is now DOWN DEBUG [GossipStage:1] 2020-10-19 15:36:44,960 FailureDetector.java:354 - Forcing conviction of /127.0.0.2:7000 DEBUG [GossipStage:1] 2020-10-19 15:36:56,647 Gossiper.java:1480 - Shadow request received, adding all states DEBUG [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1142 - Clearing interval times for /127.0.0.2:7000 due to generation change INFO [GossipStage:1] 2020-10-19 15:36:57,725 Gossiper.java:1233 - Node /127.0.0.2:7000 has restarted, now UP {noformat} Looking at {{Gossiper}} code and trying infer what's going on looks like node2 is reported UP/DOWN an extra time but I can't infer what's happening from only looking at the logs. I have added tracing level to the test and hope to repro overnight. I am attaching a zip for the good and bad runs for the record. > Flaky dtest > hintedhandoff_test.py::TestHintedHandoffConfig::test_hintedhandoff_setmaxwindow > ------------------------------------------------------------------------------------------- > > Key: CASSANDRA-15865 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15865 > Project: Cassandra > Issue Type: Bug > Components: Test/dtest/python > Reporter: Sam Tunnicliffe > Assignee: Berenguer Blasi > Priority: Normal > Fix For: 4.0-beta > > Attachments: logsGoodNBad.zip > > Time Spent: 20m > Remaining Estimate: 0h > > I've seen this fail a couple of times under JDK11, when it doesn't appear to > be related to the changes under test. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org