Eliel Gomes created KAFKA-19888:
-----------------------------------

             Summary: Coordinator histogram negative values causing persistent 
write timeouts and consumer instability in Kafka 4.1.0
                 Key: KAFKA-19888
                 URL: https://issues.apache.org/jira/browse/KAFKA-19888
             Project: Kafka
          Issue Type: Bug
          Components: group-coordinator
    Affects Versions: 4.1.0
         Environment: Kafka cluster (KRaft) with 9 brokers
RHEL-based OS (AlmaLinux 9.6) 
Running on VMware vSphere
NTP service active on VMs
VMware Tools had "Synchronize time periodically" enabled
            Reporter: Eliel Gomes


h2. Summary

In Kafka 4.1.0, a broker entered a corrupted state where the Coordinator began 
producing {*}negative histogram values{*}, causing persistent write timeouts to 
{{{}__consumer_offsets{}}}, continuous group coordinator instability, and 
repeated consumer disconnections.

The issue did *not* self-heal after time synchronization stabilized and only 
cleared after restarting the affected broker.

This suggests that the Coordinator histogram/timer subsystem becomes invalid 
after certain time irregularities and does not recover automatically.

 
----
h2. *Description*

We observed repeated Coordinator failures on a single broker in a production 
Kafka 4.1.0 cluster. The internal histogram used by the Coordinator runtime 
began to record {*}negative values{*}, which should be impossible under correct 
operation.

Once this occurs:
 * Coordinator write events begin to timeout repeatedly

 * Consumers are disconnected and reconnected frequently

 * Group coordination is unstable

 * Commits failing on __consumer_offsets
 * Lag increasing ("fake" lag), because of failures in commits
 * The broker remains in a broken state even after the system clock stabilizes

 * Only restarting the broker clears the condition

 * After the restart, consumer groups immediately select the same broker again 
as coordinator, and the problem eventually reappears

This behavior suggests state corruption inside the Coordinator's 
timer/histogram logic, where the {{GroupCoordinator}} component is susceptible 
to a fatal {{ArrayIndexOutOfBoundsException}} when attempting to record a 
negative value in an internal histogram.

This bug appears to be triggered by a non-monotonic clock adjustment (i.e., the 
system clock moving backward).

When this exception occurs, the {{group-coordinator-event-processor}} thread 
fails, specifically when processing internal events like 
{{{*}HighWatermarkUpdate{*}.}}

The failures cause of offset *commit failures* from {*}consumers{*}.

 
----
h2. *Actual Behavior*
 * Coordinator internal histograms observe *negative values*

 * The Coordinator *fails to write to {{__consumer_offsets}}*

 * Consumers experience recurring instability

 * Kafka does not self-recover even when system time becomes stable

 * Only a broker restart resolves the condition

 
----
h2. *Expected behavior*

Kafka should:
 * Reject or guard against negative deltas in histogram/timer calculations

 * Prevent a corrupted Coordinator state from persisting

 * Avoid selecting a malfunctioning broker as group coordinator

 
----
h2. *Impact*
 * Critical production consumer instability

 * Continuous rebalances

 * Offset commit delays

 ** "Fake Lags"
 * Persistent coordinator malfunction

 * Requires broker restarts to recover

 * High operational risk

----
h2. *LOGS*

*Server-Side:* The {{GroupCoordinator}} broker log is flooded with the 
following exception, indicating the event processor thread is failing:

 
*Kafka Log - 2025-11-12 at 14:56*
 
{code:java}
[2025-11-12 14:56:38,166] ERROR [group-coordinator-event-processor-0]: Failed 
to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram recorded 
value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
[2025-11-12 14:56:38,167] ERROR [GroupCoordinator id=5] Execution of 
HighWatermarkUpdate failed due to Histogram recorded value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
...
[2025-11-12 14:58:38,170] ERROR [group-coordinator-event-processor-2]: Failed 
to run event InternalEvent(name=FlushBatch) due to: Histogram recorded value 
cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
[2025-11-12 14:58:38,170] ERROR [GroupCoordinator id=5] Execution of FlushBatch 
failed due to Histogram recorded value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.{code}
 

 

*Kafka Log - 2025-11-13 09:15*

 
{code:java}
[2025-11-13 09:15:39,777] ERROR [group-coordinator-event-processor-0]: Failed 
to run event InternalEvent(name=HighWatermarkUpdate) due to: Histogram recorded 
value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
        at 
org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
~[HdrHistogram-2.2.2.jar:?]
        at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
~[HdrHistogram-2.2.2.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
~[kafka-clients-4.1.0.jar:?]
        at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
~[kafka-clients-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
 [kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
 [kafka-coordinator-common-4.1.0.jar:?]
[2025-11-13 09:15:39,777] ERROR [GroupCoordinator id=5] Execution of 
HighWatermarkUpdate failed due to Histogram recorded value cannot be negative.. 
(org.apache.kafka.coordinator.common.runtime.CoordinatorRuntime)
java.lang.ArrayIndexOutOfBoundsException: Histogram recorded value cannot be 
negative.
        at 
org.HdrHistogram.AbstractHistogram.countsArrayIndex(AbstractHistogram.java:2399)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordSingleValue(AbstractHistogram.java:559)
 ~[HdrHistogram-2.2.2.jar:?]
        at 
org.HdrHistogram.AbstractHistogram.recordValue(AbstractHistogram.java:467) 
~[HdrHistogram-2.2.2.jar:?]
        at org.HdrHistogram.Recorder.recordValue(Recorder.java:136) 
~[HdrHistogram-2.2.2.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.HdrHistogram.record(HdrHistogram.java:100)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.KafkaMetricHistogram.record(KafkaMetricHistogram.java:128)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.common.metrics.Sensor.recordInternal(Sensor.java:236) 
~[kafka-clients-4.1.0.jar:?]
        at org.apache.kafka.common.metrics.Sensor.record(Sensor.java:197) 
~[kafka-clients-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.CoordinatorRuntimeMetricsImpl.recordEventQueueTime(CoordinatorRuntimeMetricsImpl.java:284)
 ~[kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.handleEvents(MultiThreadedEventProcessor.java:146)
 [kafka-coordinator-common-4.1.0.jar:?]
        at 
org.apache.kafka.coordinator.common.runtime.MultiThreadedEventProcessor$EventProcessorThread.run(MultiThreadedEventProcessor.java:179)
 [kafka-coordinator-common-4.1.0.jar:?] {code}
 

*Client-Side:* Consumers associated with this (failed) {{GroupCoordinator}} 
begin to fail their offset commits. The client logs show a continuous loop 
where the consumer:
 * Tries to commit and fails with {{This is not the correct coordinator.}}

 * Receives an error response {{{}NOT_COORDINATOR{}}}.

 * Disconnects and attempts rediscovery.

 * Discovers the _exact same_ (broken) broker as the coordinator and repeats 
the cycle.

 
*Consumer Log - 2025-11-12 at 14:58*

 
{code:java}
2025-11-13 14:58:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Offset commit failed on partition SEND_SMS_LOW_PRIORITY-31 at offset 2291423: 
The coordinator is not available.
2025-11-13 14:58:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) is unavailable or 
invalid due to cause: error response COORDINATOR_NOT_AVAILABLE. isDisconnected: 
false. Rediscovery will be attempted.
2025-11-12 14:58:46.388 INFO [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Requesting disconnect from last known coordinator 10.0.01:9092 (id: 2147483642 
rack: null) 2025-11-12 14:58:46.395 INFO [o.apache.kafka.clients.NetworkClient] 
[] - [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
2147483642 2025-11-12 14:58:46.396 INFO [o.apache.kafka.clients.NetworkClient] 
[] - [Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT request 
with correlation id 3436277 due to node 2147483642 being disconnected (elapsed 
time since creation: 12ms, elapsed time since send: 12ms, throttle time: 0ms, 
request timeout: 30000ms) 2025-11-12 14:58:46.677 INFO 
[o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Discovered group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) {code}
 

 

*Consumer Log - 2025-11-13 at 09:15*

 
{code:java}
2025-11-13 09:15:46.730 WARN  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Offset commit failed on partition SEND_SMS_LOW_PRIORITY-31 at offset 2291423: 
The coordinator is not available.
2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) is unavailable or 
invalid due to cause: error response COORDINATOR_NOT_AVAILABLE. isDisconnected: 
false. Rediscovery will be attempted.
2025-11-13 09:15:46.730 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Requesting disconnect from last known coordinator X.X.X.X:9092 (id: 2147483642 
rack: null)
2025-11-13 09:15:46.732 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
[Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
2147483642
2025-11-13 09:15:46.949 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Discovered group coordinator X.X.X.X:9092 (id: 2147483642 rack: null)
2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) is unavailable or 
invalid due to cause: error response COORDINATOR_NOT_AVAILABLE. isDisconnected: 
false. Rediscovery will be attempted.
2025-11-13 09:15:51.958 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Requesting disconnect from last known coordinator X.X.X.X:9092 (id: 2147483642 
rack: null)
2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
[Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Client requested disconnect from node 
2147483642
2025-11-13 09:15:51.958 INFO  [o.apache.kafka.clients.NetworkClient] [] - 
[Consumer clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, 
groupId=CONSUMER_GROUP_LOW_PRIORITY] Cancelled in-flight OFFSET_COMMIT request 
with correlation id 3557487 due to node 2147483642 being disconnected (elapsed 
time since creation: 4908ms, elapsed time since send: 4908ms, throttle time: 
0ms, request timeout: 30000ms)
2025-11-13 09:15:52.086 INFO  [SendAggregateTemplateRoute] [] - Grouped 2 
messages.
2025-11-13 09:15:52.132 INFO  [o.a.k.c.c.i.ConsumerCoordinator] [] - [Consumer 
clientId=CLIENT_ID_CONSUMER_LOW_PRIORITY, groupId=CONSUMER_GROUP_LOW_PRIORITY] 
Discovered group coordinator X.X.X.X:9092 (id: 2147483642 rack: null) {code}
 
----
 

*Workaround (Temporary Fix):*
 # Restarting the affected broker to force a new {{GroupCoordinator}} election 
for the affected consumer groups.

 ## After the broker restarts, consumers will once again discover it as Group 
Coordinator.

 
----
h2. *NTP*

Brokers of Kafka Cluster and Clients Consumers are with NTP service active
{code:java}
timedatectl
               Local time: Fri 2025-11-14 11:07:49 -03
           Universal time: Fri 2025-11-14 14:07:49 UTC
                 RTC time: Fri 2025-11-14 14:07:49
                Time zone: America/Sao_Paulo (-03, -0300)
System clock synchronized: yes
              NTP service: active
          RTC in local TZ: no {code}
 
----
h2. *Request for Analysis / Questions*
 # Does Kafka 4.1.0 require strictly monotonic time for Coordinator 
histogram/timer logic?

 # Is histogram corruption (negative values) a known issue in 4.x?

 # Should Kafka validate histogram deltas to prevent negative values?

 # Is this fixed in newer Kafka releases?

 # Should CoordinatorRuntime reset its internal state when encountering time 
irregularities?

 # Is there a recommended workaround to avoid Coordinator entering this 
permanent error state?

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to