[
https://issues.apache.org/jira/browse/CASSANDRA-4066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405110#comment-13405110
]
Jonathan Ellis commented on CASSANDRA-4066:
-------------------------------------------
To summarize:
The Linux kernel bug causes livelock when NTP moves the clock backwards. This
manifests as nodes showing super high load for no [other] reason and marking
each other down, as Herman points out, both on Java systems like
C*/Hadoop/ElasticSearch and on non-Java systems like MySQL. It does NOT cause the
kind of GossipTask death Brandon talks about in the first part of the ticket,
which you only need to worry about if you're changing the time by a lot more
than a leap second.
The former you can fix by upgrading to a fixed Linux kernel before the next
leap second in a few years, or by rebooting or running
{{date; date `date +"%m%d%H%M%C%y.%S"`; date}}
when you get bitten. The latter is not worth addressing since it only applies
to VMs that get hibernated.
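
To make the workaround less cryptic: the inner date call prints the current
time in the MMDDhhmmCCYY.ss form that date accepts for setting the clock, so
the command as a whole sets the clock to the time it already shows and prints
the time before and after. The Java snippet below is only an illustrative
sketch of that argument string. The class name DateResetArg and its format
method are hypothetical, and the pattern is a hand translation of the strftime
string above, not anything taken from Cassandra.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: builds the same string that
// date +"%m%d%H%M%C%y.%S" would print for a given timestamp.
class DateResetArg {
    // %m %d %H %M -> MM dd HH mm; %C%y (century + two-digit year) -> uuuu; .%S -> .ss
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("MMddHHmmuuuu.ss");

    static String format(LocalDateTime t) {
        return t.format(FMT);
    }

    public static void main(String[] args) {
        // Five seconds past midnight on 2012-07-01.
        System.out.println(format(LocalDateTime.of(2012, 7, 1, 0, 0, 5)));
        // prints 070100002012.05
    }
}
```

Setting the clock with date requires root, so this only shows what would be
passed in; it does not touch the system time.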
> Cassandra cluster stops responding on time change (scheduling not using
> monotonic time?)
> -----------------------------------------------------------------------------------------
>
> Key: CASSANDRA-4066
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4066
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: Linux; CentOS6 2.6.32-220.4.2.el6.x86_64
> Reporter: David Daeschler
> Assignee: Brandon Williams
> Priority: Minor
> Labels: gossip
> Fix For: 1.1.1
>
>
> The server installation I set up did not have ntpd installed in the base
> installation. When I noticed that the clocks were skewing I installed ntp and
> set the date on all the servers in the cluster. A short time later, I started
> getting UnavailableExceptions on the clients.
> Also, one server seemed to be unaffected by the time change. That server
> happened to have its time pushed forward, not backwards like the other 3 in
> the cluster. This leads me to believe something is running on a
> timer/schedule that is not monotonic.
> I'm posting this as a bug, but I suppose it might just be part of the
> communication protocols, etc., for the cluster and part of the design. But I
> think the devs should be aware of what I saw.
> Otherwise, thank you for a fantastic product. Even after restarting 75% of
> the cluster things seem to have recovered nicely.
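
As background to the "not monotonic" question in the description, here is a
minimal Java sketch of measuring elapsed time against a monotonic source. It is
not Cassandra code: the ElapsedTimer class and its method names are
hypothetical. It relies only on the documented behavior that System.nanoTime()
is meant for measuring elapsed time and is unrelated to wall-clock time,
whereas System.currentTimeMillis() follows the wall clock and can therefore
jump when the date is set.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: tracks elapsed time with System.nanoTime(),
// so setting the wall clock does not change the measured interval.
class ElapsedTimer {
    private final long startNanos = System.nanoTime();

    // Milliseconds since this timer was created.
    long elapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }

    // True once at least timeoutMillis have passed since creation.
    boolean hasElapsed(long timeoutMillis) {
        return elapsedMillis() >= timeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        ElapsedTimer timer = new ElapsedTimer();
        Thread.sleep(20);
        System.out.println("elapsed ms: " + timer.elapsedMillis());
        System.out.println("1 minute passed? " + timer.hasElapsed(60_000));
    }
}
```

A timeout computed as currentTimeMillis() minus a saved start value would
shrink or go negative if the clock were moved backwards in between; the
difference of two nanoTime() readings does not have that problem.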