[jira] [Commented] (KAFKA-4051) Strange behavior during rebalance when turning the OS clock back

Rajini Sivaram (JIRA) Thu, 18 Aug 2016 04:26:49 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-4051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426290#comment-15426290
 ]


Rajini Sivaram commented on KAFKA-4051:
---------------------------------------

[~ijuma] As you have pointed out, there are inevitably going to be issues since 
Kafka uses System.currentTimeMillis in so many places. But typically, you would 
expect a single clock change to cause one expiry, after which everything 
continues to work with the changed clock (e.g. a producer metadata fetch expires 
and the retry works since it uses the updated clock). The issue in this JIRA is 
that the broker doesn't recover until the wall-clock time catches up with the 
time that was set before the change. I imagine changing the clock back by an 
hour is an uncommon scenario, but the impact is quite big if it does happen. If 
we are fixing this issue, it would be useful to have a system test to check that 
Kafka continues to function after a major clock change.
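
To illustrate the difference between a one-off expiry and a stall that lasts as 
long as the clock jump, here is a minimal sketch. It is not Kafka code and does 
not claim to reproduce the actual cause in the broker. It assumes a deadline 
stored as an absolute timestamp and compares a wall-clock source with a 
monotonic one. The names ClockJumpSketch, FakeClocks and Deadline are 
hypothetical. The fake clocks stand in for System.currentTimeMillis() (which 
follows OS clock changes) and a value derived from System.nanoTime() (which is 
not related to wall-clock time).

```java
import java.util.function.LongSupplier;

public class ClockJumpSketch {

    /** Hypothetical fake clocks so the example is deterministic. */
    static final class FakeClocks {
        private long wallMs;       // stand-in for System.currentTimeMillis()
        private long monoMs = 0L;  // stand-in for System.nanoTime() / 1_000_000

        FakeClocks(long initialWallMs) { this.wallMs = initialWallMs; }

        long wallMs() { return wallMs; }
        long monoMs() { return monoMs; }

        /** Real time passes: both clocks move forward together. */
        void advance(long ms) { wallMs += ms; monoMs += ms; }

        /** An operator (or NTP step) sets the OS clock back: only the wall clock moves. */
        void setWallClockBack(long ms) { wallMs -= ms; }
    }

    /** A deadline stored as an absolute timestamp of whichever clock it was created with. */
    static final class Deadline {
        private final LongSupplier clockMs;
        private final long expiresAtMs;

        Deadline(LongSupplier clockMs, long timeoutMs) {
            this.clockMs = clockMs;
            this.expiresAtMs = clockMs.getAsLong() + timeoutMs;
        }

        boolean isExpired() { return clockMs.getAsLong() >= expiresAtMs; }
    }

    public static void main(String[] args) {
        FakeClocks clocks = new FakeClocks(10_000_000L);
        Deadline wall = new Deadline(clocks::wallMs, 30_000L);
        Deadline mono = new Deadline(clocks::monoMs, 30_000L);

        clocks.setWallClockBack(3_600_000L); // clock turned back by one hour
        clocks.advance(30_000L);             // the 30 s timeout elapses in real time

        System.out.println("wall-clock deadline expired: " + wall.isExpired());
        System.out.println("monotonic deadline expired:  " + mono.isExpired());

        clocks.advance(3_600_000L);          // wall clock catches up with the old time
        System.out.println("wall-clock deadline expired after catch-up: " + wall.isExpired());
    }
}
```

In this sketch the wall-clock deadline only fires once the wall clock has caught 
up with the time it showed before the change, which matches the symptom 
described in this JIRA. A system test for a major clock change could assert the 
same property at a higher level: after the jump, timeouts still fire within 
their configured duration.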

> Strange behavior during rebalance when turning the OS clock back
> ----------------------------------------------------------------
>
>                 Key: KAFKA-4051
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4051
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 0.10.0.0
>         Environment: OS: Ubuntu 14.04 - 64bits
>            Reporter: Gabriel Ibarra
>            Assignee: Rajini Sivaram
>
> If a rebalance is performed after turning the OS clock back, the Kafka 
> server enters a loop and the rebalance cannot complete until the system 
> clock catches up with the previous date/time.
> Steps to Reproduce:
> - Start a consumer for TOPIC_NAME with group id GROUP_NAME. It will be the 
> owner of all the partitions.
> - Turn the system (OS) clock back, for instance by 1 hour.
> - Start a new consumer for TOPIC_NAME using the same group id. This will 
> force a rebalance.
> After these actions the Kafka server logs constantly display the messages 
> below, and after a while both consumers stop receiving messages. This 
> condition lasts at least as long as the amount the clock was turned back 
> (1 hour in this example), after which Kafka resumes working.
> [2016-08-08 11:30:23,023] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 2 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,025] INFO [GroupCoordinator 0]: Stabilized group 
> GROUP_NAME generation 3 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,027] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 3 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,029] INFO [GroupCoordinator 0]: Group GROUP_NAME 
> generation 3 is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,032] INFO [GroupCoordinator 0]: Stabilized group 
> GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,033] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,034] INFO [GroupCoordinator 0]: Group GROUP generation 1 
> is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,043] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Stabilized group 
> GROUP_NAME generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,044] INFO [GroupCoordinator 0]: Preparing to restabilize 
> group GROUP_NAME with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-08-08 11:30:23,045] INFO [GroupCoordinator 0]: Group GROUP_NAME 
> generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
> Because some systems may have NTP enabled, or an administrator may change 
> the system clock (date/time), it's important to be able to do this safely. 
> Currently the only safe way to do it is the following:
> 1- Shut down the Kafka server.
> 2- Change the date/time.
> 3- Start the Kafka server again.
> However, this approach is only possible if the change is performed by the 
> administrator, not by NTP. Also, in many systems shutting down the Kafka 
> server might cause information to be lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
