[jira] [Commented] (KAFKA-4051) Strange behavior during rebalance when turning the OS clock back

Rajini Sivaram (JIRA) Thu, 18 Aug 2016 04:14:29 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426279#comment-15426279
 ]


Rajini Sivaram commented on KAFKA-4051:
---------------------------------------

I think the issue is around the handling of timer tasks in Kafka. Task 
expiry is set using {{System.currentTimeMillis}}, which can move backwards (as 
reported in this JIRA), while the internal timer in the TimingWheel that Kafka 
uses to handle expiry is monotonically increasing and starts from 
{{System.currentTimeMillis}}. This mismatch causes premature expiry of tasks 
until {{System.currentTimeMillis}} catches up with the internal timer.
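
As a rough illustration of the mismatch described above, the sketch below models a timer whose internal clock only moves forward and a task expiration computed from a wall-clock reading. This is not Kafka's TimingWheel code: {{ClockMismatchDemo}}, {{MonotonicTimer}} and their methods are hypothetical names, and the wall-clock values are fixed numbers rather than real {{System.currentTimeMillis}} readings so that the backwards jump is reproducible.

```java
public class ClockMismatchDemo {

    /** Hypothetical stand-in for a timer whose internal clock never moves backwards. */
    public static final class MonotonicTimer {
        private long currentTimeMs;

        public MonotonicTimer(long startMs) {
            this.currentTimeMs = startMs;
        }

        /** Advance the internal clock; readings earlier than the current value are ignored. */
        public void advanceTo(long timeMs) {
            if (timeMs > currentTimeMs) {
                currentTimeMs = timeMs;
            }
        }

        /** A task whose expiration is not after the internal clock counts as already expired. */
        public boolean isAlreadyExpired(long expirationMs) {
            return expirationMs <= currentTimeMs;
        }
    }

    /** Expiration derived from a wall-clock reading plus the requested delay. */
    public static long expirationFromWallClock(long wallClockMs, long delayMs) {
        return wallClockMs + delayMs;
    }

    public static void main(String[] args) {
        long start = 1_000_000_000L;   // assumed wall-clock reading when the timer was created
        long oneHour = 3_600_000L;
        long delay = 30_000L;          // assumed 30 second task timeout

        MonotonicTimer timer = new MonotonicTimer(start);

        // Wall clock and internal clock agree: the task is scheduled normally.
        System.out.println(timer.isAlreadyExpired(expirationFromWallClock(start, delay))); // false

        // Wall clock is turned back one hour; the internal clock does not follow it.
        long wallClockAfterChange = start - oneHour;
        timer.advanceTo(wallClockAfterChange);

        // The new task's expiration lies in the timer's past, so it expires at once.
        System.out.println(timer.isAlreadyExpired(
                expirationFromWallClock(wallClockAfterChange, delay))); // true
    }
}
```

In this model every task created after the clock change is treated as expired until the wall-clock reading plus the delay exceeds the timer's internal time again.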

As [~ijuma] has pointed out on the mailing list, Kafka uses 
{{System.currentTimeMillis}} in a lot of places and switching to 
{{System.nanoTime}} everywhere could impact performance. We have a few choices 
for fixing this JIRA (in increasing order of complexity):

# We could switch over to {{System.nanoTime}} for TimerTasks alone to fix the 
issue with delayed tasks reported here
# It may be possible to change the timer implementation to recover better when 
wall clock time moves backwards
# Replace {{System.currentTimeMillis}} with {{System.nanoTime}} in time 
comparisons throughout Kafka code
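
A minimal sketch of what option 1 could look like, assuming timer deadlines are derived from {{System.nanoTime}} instead of the wall clock. The class {{MonotonicDeadline}} and its methods are hypothetical and are not part of Kafka. {{System.nanoTime}} values are only meaningful relative to each other, so the expiry check compares by subtraction, as the Java documentation for {{System.nanoTime}} recommends.

```java
import java.util.concurrent.TimeUnit;

public class MonotonicDeadline {

    /** Milliseconds derived from System.nanoTime; unrelated to wall-clock time. */
    public static long monotonicMs() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    /** Deadline for a task created at nowMs with the given delay. */
    public static long deadlineMs(long nowMs, long delayMs) {
        return nowMs + delayMs;
    }

    /** True once nowMs has reached the deadline; compares by difference, not by absolute value. */
    public static boolean isExpired(long deadlineMs, long nowMs) {
        return nowMs - deadlineMs >= 0;
    }

    public static void main(String[] args) throws InterruptedException {
        long deadline = deadlineMs(monotonicMs(), 50L);
        System.out.println(isExpired(deadline, monotonicMs())); // very likely false right after creation
        Thread.sleep(100L);
        System.out.println(isExpired(deadline, monotonicMs())); // true once the delay has elapsed
    }
}
```

Because both the deadline and the comparison use the same monotonic source, changing the OS clock does not affect when the task expires in this sketch.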

I am inclined to do 1) and run performance tests, but am interested in what 
others think.


> Strange behavior during rebalance when turning the OS clock back
> ----------------------------------------------------------------
>
>                 Key: KAFKA-4051
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4051
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.10.0.0
>         Environment: OS: Ubuntu 14.04 - 64bits
>            Reporter: Gabriel Ibarra
>            Assignee: Rajini Sivaram
>
> If a rebalance is performed after turning the OS clock back, the Kafka 
> server enters a loop and the rebalance cannot complete until the 
> system clock returns to the previous date/time.
> Steps to Reproduce:
> - Start a consumer for TOPIC_NAME with group id GROUP_NAME. It will own 
> all the partitions.
> - Turn the system (OS) clock back, for instance by 1 hour.
> - Start a new consumer for TOPIC_NAME using the same group id. This forces 
> a rebalance.
> After these actions the Kafka server logs constantly display the messages 
> below, and after a while both consumers stop receiving messages. This 
> condition lasts at least as long as the amount the clock was turned back 
> (1 hour in this example), after which Kafka resumes working.
> [2016-08-08 11:30:23,023] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 2 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,025] INFO [GroupCoordinator 0]: Stabilized group 
> GROUP_NAME generation 3 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,027] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 3 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,029] INFO [GroupCoordinator 0]: Group GROUP_NAME 
> generation 3 is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Stabilized group 
> GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,033] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,034] INFO [GroupCoordinator 0]: Group GROUP generation 1 
> is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,043] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Stabilized group 
> GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,045] INFO [GroupCoordinator 0]: Group GROUP_NAME 
> generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
> Because some systems have NTP enabled, or allow an administrator to change 
> the system clock (date/time), it is important that clock changes can be 
> made safely. Currently the only safe way is to follow these steps:
> 1- Stop the Kafka server.
> 2- Change the date/time.
> 3- Start the Kafka server.
> However, this approach works only if the change is made by the 
> administrator, not by NTP. Also, in many systems shutting down the Kafka 
> server might cause information to be lost.



