[ 
https://issues.apache.org/jira/browse/KAFKA-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18039202#comment-18039202
 ] 

Eliel Gomes edited comment on KAFKA-19888 at 11/19/25 3:44 PM:
---------------------------------------------------------------

Hi [~squah-confluent], [~dajac], how are you?

During our analysis, we identified a potential time-synchronization conflict:
the *VMware vCenter periodic time sync* (VMware Tools "Synchronize time 
periodically") might have been competing with the *NTP service running* on the 
affected virtual machines.

As a workaround, we disabled the vCenter time synchronization.
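
For reference, this setting can be toggled from the VM's options in vSphere or 
from inside the guest via VMware Tools. The commands below are illustrative 
only, not a transcript from our hosts:
{code:java}
# Check whether VMware Tools periodic time sync is currently enabled
vmware-toolbox-cmd timesync status

# Disable it so that only the in-guest NTP service disciplines the clock
vmware-toolbox-cmd timesync disable
{code}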

Since applying this workaround six days ago, we have not observed any 
recurrence of the “histogram value negative” exception.

That said, we want to emphasize that we cannot confirm with certainty that this 
was the root cause. It may be coincidental, and the issue might not be directly 
related to this configuration. Our intention is to share our findings so the 
Kafka team has additional data points that might assist in evaluating the 
behavior.

If this scenario can indeed impact Kafka’s metrics subsystem, it may be 
valuable to consider:
 * improvements to make future Kafka versions more resilient to time shifts or 
sync conflicts, or
 * explicit documentation guidance highlighting how time synchronization 
conflicts can affect Kafka internals, if the team determines this is relevant.

We’re glad to provide any additional logs, metrics, or configuration details 
that could help.

Thank you for your support.

Best regards,
Eliel Gomes


> Coordinator histogram negative values causing persistent write timeouts and 
> consumer instability in Kafka 4.1.0
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19888
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19888
>             Project: Kafka
>          Issue Type: Bug
>          Components: group-coordinator
>    Affects Versions: 4.1.0
>         Environment: Kafka cluster (KRaft) with 9 brokers
> RHEL-based OS (AlmaLinux 9.6) 
> Running on VMware vSphere
> NTP service active on VMs
> VMware Tools had "Synchronize time periodically" enabled
>            Reporter: Eliel Gomes
>            Assignee: Sean Quah
>            Priority: Blocker
>             Fix For: 4.2.0
>
>
> h2. Summary
> In Kafka 4.1.0, a broker entered a corrupted state where the Coordinator 
> began producing {*}negative histogram values{*}, causing persistent write 
> timeouts to {{{}__consumer_offsets{}}}, continuous group coordinator 
> instability, and repeated consumer disconnections.
> The issue did *not* self-heal after time synchronization stabilized and only 
> cleared after restarting the affected broker.
> This suggests that the Coordinator histogram/timer subsystem becomes invalid 
> after certain time irregularities and does not recover automatically.
>  
> ----
> h2. *Description*
> We observed repeated Coordinator failures on a single broker in a production 
> Kafka 4.1.0 cluster. The internal histogram used by the Coordinator runtime 
> began to record {*}negative values{*}, which should be impossible under 
> correct operation.
> Once this occurs:
>  * Coordinator write events begin to time out repeatedly
>  * Consumers are disconnected and reconnected frequently
>  * Group coordination is unstable
>  * Offset commits to __consumer_offsets fail
>  * Consumer lag increases ("fake" lag) because commits fail
>  * The broker remains in a broken state even after the system clock stabilizes
>  * Only restarting the broker clears the condition
>  * After the restart, consumer groups immediately select the same broker 
> again as coordinator, and the problem eventually reappears
> This behavior suggests state corruption inside the Coordinator's 
> timer/histogram logic, where the {{GroupCoordinator}} component is 
> susceptible to a fatal {{ArrayIndexOutOfBoundsException}} when attempting to 
> record a negative value in an internal histogram.
> This bug appears to be triggered by a non-monotonic clock adjustment (i.e., 
> the system clock moving backward).
> When this exception occurs, the {{group-coordinator-event-processor}} thread 
> fails, specifically when processing internal events like 
> {{{*}HighWatermarkUpdate{*}}}.
> These failures then surface as offset *commit failures* from {*}consumers{*}.
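> For illustration only: the snippet below is not the Kafka code path, just the 
> underlying HdrHistogram behavior. Assuming the event-queue time delta is 
> derived from wall-clock timestamps (an assumption about the trigger), a 
> backwards clock step makes the delta negative and the record call fails with 
> exactly the exception shown in the logs below.
> {code:java}
> import org.HdrHistogram.Recorder;
> 
> public class NegativeHistogramRepro {
>     public static void main(String[] args) {
>         // Recorder with 3 significant value digits (precision chosen
>         // arbitrarily for this repro).
>         Recorder recorder = new Recorder(3);
> 
>         // Hypothetical event-queue time: dequeue timestamp minus enqueue
>         // timestamp. If the system clock is stepped backwards between the
>         // two readings, the delta becomes negative.
>         long enqueueTimeMs = 1_763_000_000_500L; // wall-clock reading at enqueue
>         long dequeueTimeMs = 1_763_000_000_200L; // clock stepped back ~300 ms
>         long eventQueueTimeMs = dequeueTimeMs - enqueueTimeMs; // -300
> 
>         // Throws java.lang.ArrayIndexOutOfBoundsException:
>         // "Histogram recorded value cannot be negative."
>         recorder.recordValue(eventQueueTimeMs);
>     }
> }
> {code}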
>  
> ----
> h2. *Actual Behavior*
>  * Coordinator internal histograms observe *negative values*
>  * The Coordinator *fails to write to {{__consumer_offsets}}*
>  * Consumers experience recurring instability
>  * Kafka does not self-recover even when system time becomes stable
>  * Only a broker restart resolves the condition
>  
> ----
> h2. *Expected behavior*
> Kafka should:
>  * Reject or guard against negative deltas in histogram/timer calculations (one possible guard is sketched after this list)
>  * Prevent a corrupted Coordinator state from persisting
>  * Avoid selecting a malfunctioning broker as group coordinator
>  
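> One possible shape of such a guard, sketched for illustration only (this is 
> not a proposal for the actual Kafka implementation, and {{SafeHistogramRecorder}} 
> is a hypothetical name):
> {code:java}
> import org.HdrHistogram.Recorder;
> 
> /** Hypothetical wrapper that refuses to forward invalid deltas to the histogram. */
> public class SafeHistogramRecorder {
> 
>     private final Recorder delegate = new Recorder(3);
> 
>     public void record(long valueMs) {
>         if (valueMs < 0) {
>             // A negative delta usually means the clock moved backwards between
>             // the two timestamps. Drop it (or clamp it to 0) instead of letting
>             // the recording thread die with ArrayIndexOutOfBoundsException.
>             return;
>         }
>         delegate.recordValue(valueMs);
>     }
> }
> {code}
>  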
> ----
> h2. *Impact*
>  * Critical production consumer instability
>  * Continuous rebalances
>  * Offset commit delays
>  ** "Fake Lags"
>  * Persistent coordinator malfunction
>  * Requires broker restarts to recover
>  * High operational risk
> ----
> h2. *LOGS*
> *Server-Side:* The {{GroupCoordinator}} broker log is flooded with the 
> following exception, indicating the event processor thread is failing:
>  
> *Kafka Log - 2025-11-12 at 14:56*
> {code:java}
> [2025-11-12 14:56:38,166] ERROR [group-coordinator-event-processor-0]: Failed 
> to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram 
> recorded value cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
> [2025-11-12 14:56:38,167] ERROR [GroupCoordinator id=5] Execution of 
> HighWatermarkUpdate failed due to Histogram recorded value cannot be 
> negative.. (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
> ...
> [2025-11-12 14:58:38,170] ERROR [group-coordinator-event-processor-2]: Failed 
> to run event InternalEvent(name=FlushBatch) due to: Histogram recorded value 
> cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
> [2025-11-12 14:58:38,170] ERROR [GroupCoordinator id=5] Execution of 
> FlushBatch failed due to Histogram recorded value cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.{code}
>  
> *Kafka Log - 2025-11-13 09:15*
> {code:java}
> [2025-11-13 09:15:39,777] ERROR [group-coordinator-event-processor-0]: Failed 
> to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram 
> recorded value cannot be negative.. 
> (org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
>         at 
> org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
> ~[kafka-clients-4.1.0.jar:?]
>         at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
> ~[kafka-clients-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
>  [kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
>  [kafka-coordinator-common-4.1.0.jar:?]
> [2025-11-13 09:15:39,777] ERROR [GroupCoordinator id=5] Execution of 
> HighWatermarkUpdate failed due to Histogram recorded value cannot be 
> negative.. (org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
> java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
> negative.
>         at 
> org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
>  ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
> ~[HdrHistogram-2.2.2.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
> ~[kafka-clients-4.1.0.jar:?]
>         at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
> ~[kafka-clients-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
>  ~[kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
>  [kafka-coordinator-common-4.1.0.jar:?]
>         at 
> org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
>  [kafka-coordinator-common-4.1.0.jar:?] {code}
>  
> *Client-Side:* Consumers associated with this (failed) {{GroupCoordinator}} 
> begin to fail their offset commits. The client logs show a continuous loop 
> where the consumer:
>  * Tries to commit and fails with {{This is not the correct coordinator.}}
>  * Receives an error response {{{}NOT_COORDINATOR{}}}.
>  * Disconnects and attempts rediscovery.
>  * Discovers the _exact same_ (broken) broker as the coordinator and repeats 
> the cycle.
>  
> *Consumer Log - 2025-11-12 at 14:58*
> {code:java}
> 2025-11-13 14:58:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Offset commit failed on partition 
> SEND_SMS_LOW_PRIORITY-31 at offset 2291423: The coordinator is not available.
> 2025-11-13 14:58:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Group coordinator X.X.X.X:9092 (id: 
> 2147483642 rack: null) is unavailable or invalid due to cause: error response 
> COORDINATOR_NOT_AVAILABLE. isDisconnected: false. Rediscovery will be 
> attempted.
> 2025-11-12 14:58:46.388 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Requesting disconnect from last known 
> coordinator 10.0.01:9092 (id: 2147483642 rack: null)
> 2025-11-12 14:58:46.395 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
> 2147483642
> 2025-11-12 14:58:46.396 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT 
> request with correlation id 3436277 due to node 2147483642 being disconnected 
> (elapsed time since creation: 12ms, elapsed time since send: 12ms, throttle 
> time: 0ms, request timeout: 30000ms)
> 2025-11-12 14:58:46.677 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Discovered group coordinator 
> X.X.X.X:9092 (id: 2147483642 rack: null) {code}
>  
> *Consumer Log - 2025-11-13 at 09:15*
> {code:java}
> 2025-11-13 09:15:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Offset commit failed on partition 
> SEND_SMS_LOW_PRIORITY-31 at offset 2291423: The coordinator is not available.
> 2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Group coordinator X.X.X.X:9092 (id: 
> 2147483642 rack: null) is unavailable or invalid due to cause: error response 
> COORDINATOR_NOT_AVAILABLE. isDisconnected: false. Rediscovery will be 
> attempted.
> 2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Requesting disconnect from last known 
> coordinator X.X.X.X:9092 (id: 2147483642 rack: null)
> 2025-11-13 09:15:46.732 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
> 2147483642
> 2025-11-13 09:15:46.949 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Discovered group coordinator 
> X.X.X.X:9092 (id: 2147483642 rack: null)
> 2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Group coordinator X.X.X.X:9092 (id: 
> 2147483642 rack: null) is unavailable or invalid due to cause: error response 
> COORDINATOR_NOT_AVAILABLE. isDisconnected: false. Rediscovery will be 
> attempted.
> 2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Requesting disconnect from last known 
> coordinator X.X.X.X:9092 (id: 2147483642 rack: null)
> 2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
> 2147483642
> 2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT 
> request with correlation id 3557487 due to node 2147483642 being disconnected 
> (elapsed time since creation: 4908ms, elapsed time since send: 4908ms, 
> throttle time: 0ms, request timeout: 30000ms)
> 2025-11-13 09:15:52.086 INFO  [SendAggregateTemplateRoute] [] - Grouped 2 
> messages.
> 2025-11-13 09:15:52.132 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - 
> [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
> groupId=CONSUMER_GROUP_LOW_PRIORITY] Discovered group coordinator 
> X.X.X.X:9092 (id: 2147483642 rack: null) {code}
>  
> ----
>  
> *Workaround (Temporary Fix):*
>  # Restart the affected broker to force a new {{GroupCoordinator}} 
> election for the affected consumer groups.
>  ## After the broker restarts, consumers will once again discover it as Group 
> Coordinator, so the problem can eventually reappear.
>  
> ----
> h2. *NTP*
> The Kafka cluster brokers and the consumer clients all have the NTP service active:
> {code:java}
> timedatectl
>                Local time: Fri 2025-11-14 11:07:49 -03
>            Universal time: Fri 2025-11-14 14:07:49 UTC
>                  RTC time: Fri 2025-11-14 14:07:49
>                 Time zone: America/Sao_Paulo (-03, -0300)
> System clock synchronized: yes
>               NTP service: active
>           RTC in local TZ: no {code}
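>  
> For completeness: AlmaLinux 9 normally ships chronyd as the NTP service. 
> Assuming that is the implementation in use here, its view of the clock 
> (current offset, stratum, and selected sources) can be checked as shown below 
> (general reference, not output from the affected hosts):
> {code:java}
> # Current offset, stratum and frequency error as seen by chronyd
> chronyc tracking
> 
> # Configured time sources and which one is currently selected
> chronyc sources -v
> {code}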
>  
> ----
> h2. *Request for Analysis / Questions*
>  # Does Kafka 4.1.0 require strictly monotonic time for Coordinator 
> histogram/timer logic?
>  # Is histogram corruption (negative values) a known issue in 4.1.X?
>  # Should Kafka validate histogram deltas to prevent negative values?
>  # Is this fixed in newer Kafka releases?
>  # Should CoordinatorRuntime reset its internal state when encountering time 
> irregularities?
>  # Is there a recommended workaround to avoid Coordinator entering this 
> permanent error state?
>  
>  
>  



