[ https://issues.apache.org/jira/browse/KAFKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426279#comment-15426279 ]
Rajini Sivaram commented on KAFKA-4051:
---------------------------------------

I think the issue is around the handling of timer tasks in Kafka. While task expiry is set using {{System.currentTimeMillis}}, which can move backwards (as reported in this JIRA), the internal timer in the TimingWheel that Kafka uses to handle expiry is a monotonically increasing timer that starts with {{System.currentTimeMillis}}. This mismatch causes premature expiry of tasks until {{System.currentTimeMillis}} catches up with the internal timer. As [~ijuma] has pointed out on the mailing list, Kafka uses {{System.currentTimeMillis}} in a lot of places and switching to {{System.nanoTime}} everywhere could impact performance.

We have a few choices for fixing this JIRA (in increasing order of complexity):
# We could switch over to {{System.nanoTime}} for TimerTasks alone to fix the issue with delayed tasks reported here
# It may be possible to change the timer implementation to recover better when wall clock time moves backwards
# Replace {{System.currentTimeMillis}} with {{System.nanoTime}} in time comparisons throughout Kafka code

I am inclined to do 1) and run performance tests, but am interested in what others think.

> Strange behavior during rebalance when turning the OS clock back
> ----------------------------------------------------------------
>
>                 Key: KAFKA-4051
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4051
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.10.0.0
>         Environment: OS: Ubuntu 14.04 - 64bits
>            Reporter: Gabriel Ibarra
>            Assignee: Rajini Sivaram
>
> If a rebalance is performed after turning the OS clock back, the kafka server enters a loop and the rebalance cannot be completed until the system returns to the previous date/hour.
>
> Steps to Reproduce:
> - Start a consumer for TOPIC_NAME with group id GROUP_NAME. It will own all the partitions.
> - Turn the system (OS) clock back, for instance by 1 hour.
> - Start a new consumer for TOPIC_NAME using the same group id; this will force a rebalance.
>
> After these actions the kafka server logs constantly display the messages below, and after a while both consumers stop receiving packages. This condition lasts at least as long as the amount the clock was turned back (1 hour in this example), after which kafka resumes working.
>
> [2016-08-08 11:30:23,023] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 2 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,025] INFO [GroupCoordinator 0]: Stabilized group GROUP_NAME generation 3 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,027] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 3 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,029] INFO [GroupCoordinator 0]: Group GROUP_NAME generation 3 is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Stabilized group GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,033] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,034] INFO [GroupCoordinator 0]: Group GROUP generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,043] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Stabilized group GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Preparing to restabilize group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,045] INFO [GroupCoordinator 0]: Group GROUP_NAME generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
>
> Because some systems have NTP enabled, or allow an administrator to change the system clock (date/time), it is important that the clock can be changed safely. Currently the only safe way to do it is to follow these steps:
> 1- Shut down the Kafka server.
> 2- Change the date/time.
> 3- Start the Kafka server again.
>
> However, this approach is possible only when the change is made by the administrator, not by NTP. Also, in many systems shutting down the Kafka server might cause information to be lost.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
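
For illustration, here is a minimal sketch of the clock mismatch described in the comment above. It is not Kafka's TimingWheel code: {{ToyTimer}}, {{expiryFor}} and {{monotonicMs}} are hypothetical names made up for this example, and the model assumes a timer whose internal time never moves backwards while task expiry is computed from whichever clock the caller supplies.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

public class TimerClockSketch {

    /** Hypothetical toy timer: its internal time only ever moves forward. */
    public static final class ToyTimer {
        private long currentMs;

        public ToyTimer(long startMs) {
            this.currentMs = startMs;
        }

        /** Advance the internal time; attempts to move backwards are ignored. */
        public void advanceTo(long timeMs) {
            if (timeMs > currentMs) {
                currentMs = timeMs;
            }
        }

        /** A task counts as expired once the internal time has reached its expiry. */
        public boolean isExpired(long expirationMs) {
            return expirationMs <= currentMs;
        }
    }

    /** Expiry for a task created "now", as read from the supplied clock. */
    public static long expiryFor(LongSupplier clockMs, long delayMs) {
        return clockMs.getAsLong() + delayMs;
    }

    /** Milliseconds derived from System.nanoTime: monotonic, with an arbitrary origin. */
    public static long monotonicMs() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    public static void main(String[] args) {
        long delayMs = 30_000L;
        long hourMs = TimeUnit.HOURS.toMillis(1);

        // Wall-clock case: the timer starts at the current wall-clock time,
        // then the OS clock is (simulated as) turned back by one hour.
        long start = System.currentTimeMillis();
        ToyTimer wallTimer = new ToyTimer(start);
        LongSupplier wallClockTurnedBack = () -> start - hourMs;
        long wallExpiry = expiryFor(wallClockTurnedBack, delayMs);
        // The expiry lies before the timer's internal time, so the task is
        // treated as expired straight away instead of after 30 seconds.
        System.out.println("wall clock, expired immediately: " + wallTimer.isExpired(wallExpiry));

        // Monotonic case: timer and task expiry both read the same monotonic source.
        ToyTimer monoTimer = new ToyTimer(monotonicMs());
        long monoExpiry = expiryFor(TimerClockSketch::monotonicMs, delayMs);
        System.out.println("monotonic, expired immediately: " + monoTimer.isExpired(monoExpiry));
    }
}
```

In this toy model the first line prints {{true}} and the second {{false}}: once both the timer and the task expiry read the same monotonic source, turning the OS clock back no longer affects the comparison. Note that {{System.nanoTime}} values are only meaningful relative to each other within one JVM, so they cannot be compared with {{System.currentTimeMillis}} readings.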