[
https://issues.apache.org/jira/browse/CASSANDRA-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon Williams updated CASSANDRA-5254:
----------------------------------------
Attachment: 5254.txt
This is a pernicious thing to debug because the timing window is so tight:
enabling DEBUG or TRACE, even on just the gossiper, keeps it from reproducing.
However, careful examination of the INFO messages tells us that
handleMajorStateChange is not being called, since there is no 'node restarted'
message. That leaves applyStateLocally as the only other path that can mark the
node up, and it is called in the ack/ack2 handlers. This tells us that we're in
the middle of a gossip round when we send the shutdown message, so the easiest
thing to do is sleep for more than one round. Trivial patch attached to do so;
it has solved this on the dtests.
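
To make the idea concrete, here is a minimal sketch of "sleep for more than one round" around the shutdown announcement. It is not the attached patch and not Cassandra's Gossiper code: the class GossipShutdownSketch, the methods drainMillis, shutdown and realSleep, and the factor of two rounds are all hypothetical. The comment above does not say where the sleep goes relative to the shutdown message, so the ordering used here (stop starting new rounds, wait, then say goodbye) is an assumption as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

/** Hypothetical illustration only; not the actual Cassandra Gossiper. */
final class GossipShutdownSketch {

    /** Wait strictly longer than one gossip round; two rounds is an assumed margin. */
    static long drainMillis(long roundIntervalMillis) {
        return 2 * roundIntervalMillis;
    }

    /**
     * Records the shutdown steps in order. The sleeper is injected so the
     * sequence can be checked without actually waiting.
     */
    static List<String> shutdown(long roundIntervalMillis, LongConsumer sleeper) {
        List<String> steps = new ArrayList<>();

        // 1. Stop starting new gossip rounds (assumed first step).
        steps.add("cancel gossip task");

        // 2. Let any round already in flight (syn -> ack -> ack2) finish,
        //    so no late ack/ack2 arrives after the goodbye.
        long wait = drainMillis(roundIntervalMillis);
        sleeper.accept(wait);
        steps.add("slept " + wait + " ms");

        // 3. Only now tell the peers we are going away.
        steps.add("send shutdown");
        return steps;
    }

    /** A real sleeper for use outside tests; restores the interrupt flag if interrupted. */
    static void realSleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With a one-second round, GossipShutdownSketch.shutdown(1000, GossipShutdownSketch::realSleep) would block for two seconds before the goodbye step. Passing a no-op sleeper instead makes it possible to check the ordering instantly.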
> Nodes can be marked up after gossip sends the goodbye command
> -------------------------------------------------------------
>
> Key: CASSANDRA-5254
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5254
> Project: Cassandra
> Issue Type: Bug
> Affects Versions: 1.1.1
> Reporter: Brandon Williams
> Assignee: Brandon Williams
> Priority: Minor
> Attachments: 5254.txt
>
>
> Finally tracked this down on dtestbot after setting the rpc_timeout to
> ridiculous levels:
> {noformat}
> ==> logs/last/node1.log <==
> INFO [FlushWriter:1] 2013-02-14 10:01:10,311 Memtable.java (line 305)
> Completed flushing
> /tmp/dtest-iaYzzR/test/node1/data/system/schema_columns/system-schema_columns-hf-2-Data.db
> (558 bytes) for commitlog position ReplayPosition(segmentId=1360857665931,
> position=4770)
> INFO [MemoryMeter:1] 2013-02-14 10:01:10,974 Memtable.java (line 213)
> CFS(Keyspace='ks', ColumnFamily='cf') liveRatio is 20.488836662749705
> (just-counted was 20.488836662749705). calculation took 96ms for 144 columns
> INFO [GossipStage:1] 2013-02-14 10:01:12,119 Gossiper.java (line 831)
> InetAddress /127.0.0.3 is now dead.
> ==> logs/last/node2.log <==
> INFO [GossipStage:1] 2013-02-14 10:01:12,119 Gossiper.java (line 831)
> InetAddress /127.0.0.3 is now dead.
> INFO [GossipStage:1] 2013-02-14 10:01:12,238 Gossiper.java (line 817)
> InetAddress /127.0.0.3 is now UP
> INFO [GossipTasks:1] 2013-02-14 10:01:26,386 Gossiper.java (line 831)
> InetAddress /127.0.0.3 is now dead.
> ==> logs/last/node3.log <==
> INFO [StorageServiceShutdownHook] 2013-02-14 10:01:11,115 Gossiper.java
> (line 1134) Announcing shutdown
> INFO [StorageServiceShutdownHook] 2013-02-14 10:01:12,118
> MessagingService.java (line 549) Waiting for messaging service to quiesce
> INFO [ACCEPT-/127.0.0.3] 2013-02-14 10:01:12,119 MessagingService.java (line
> 705) MessagingService shutting down server thread.
> {noformat}
> node2 receives the goodbye command from node3, and node1 has already marked
> node3 down, but some kind of signal is still coming from node3 to node2
> marking it up again.