Fabian Bell created KAFKA-20416:
-----------------------------------
Summary: RocksDB loses entries during broker patches.
Key: KAFKA-20416
URL: https://issues.apache.org/jira/browse/KAFKA-20416
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 3.9.1
Environment: MSK with kafka.m7g.2xlarge instances
Reporter: Fabian Bell
h2. Problem:
We observed unexpected behaviour in our production environment. We use a
KTable to look up data from a topic we write to:
{code:java}
builder.table(
    topicName,
    Consumed.with(keySerde, valueSerde),
    Materialized.as(storeName));
{code}
After an MSK security patch, we observed that the store, when accessed in the
processor, returned null for keys that have non-null entries in the topic that
backs the KTable. We never write tombstones to this topic, nor do we have
delete retention enabled.
This happens on only some of our instances.
We see the following stream logs:
{code:java}
Committing task(s) 0_14 failed.
Detected the states of tasks [0_14] are corrupted. Will close the task as dirty and re-create and bootstrap from scratch.
Active task(s) got corrupted. Triggering a rebalance.
End offset for changelog our-topic-14 initialized as 16596290.
Restoration in progress for 1 partitions. {our-topic-14: position=0, end=16596290, totalRestored=0}
State transition from RUNNING to PARTITIONS_REVOKED
No followup rebalance was requested, resetting the rebalance schedule.
partition revocation took 80 ms.
State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED
State transition from PARTITIONS_ASSIGNED to RUNNING {code}
This all happens within a few seconds, and the `Restoration in progress ...`
message is the only restoration log line we see. A full restoration usually
takes about 30 minutes.
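As a rough sanity check on the timing, the numbers above can be turned into an implied restoration rate. This is only a sketch: it assumes that the roughly 30 minutes applies to restoring this single changelog partition from offset 0, and that offsets approximate the record count (with exactly_once_v2, transaction markers also occupy offsets). The class and method names are illustrative.

```java
final class RestoreEstimate {
    // Offsets restored per second, given the restored offset range and the elapsed time.
    static double offsetsPerSecond(long startOffset, long endOffset, double elapsedSeconds) {
        return (endOffset - startOffset) / elapsedSeconds;
    }

    // Number of offsets that can be restored in the given window at the given rate.
    static long offsetsRestoredIn(double seconds, double offsetsPerSecond) {
        return (long) (seconds * offsetsPerSecond);
    }

    public static void main(String[] args) {
        // end=16596290 is taken from the log above; 30 minutes is the usual full restore time.
        double rate = offsetsPerSecond(0L, 16_596_290L, 30 * 60);
        System.out.printf("implied rate: %.0f offsets/s%n", rate);
        // Assumed window of 5 seconds, standing in for "a few seconds".
        System.out.printf("restorable in 5 s: %d offsets%n", offsetsRestoredIn(5, rate));
    }
}
```

Under these assumptions a full restore runs at roughly 9,200 offsets per second, so a window of a few seconds covers only tens of thousands of the 16,596,290 offsets reported in the log.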
The commit failure reports the following error:
{code:java}
o.a.k.c.e.TimeoutException: Timeout expired after 60000ms while awaiting
AddOffsetsToTxn {code}
We can fix this situation by clearing the state directory and forcing a full
restoration.
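For reference, a minimal sketch of the manual workaround: recursively deleting a local state directory while the Streams instance is stopped, so that the next start restores from the changelog. The class name and the example path are hypothetical. Kafka Streams also provides `KafkaStreams#cleanUp()`, which wipes the local state for the application and may only be called while the instance is not running.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class StateDirWipe {
    // Recursively deletes the given directory and returns the number of
    // filesystem entries removed (0 if the directory does not exist).
    // Only call this while the Streams instance that owns the directory is stopped.
    static int wipe(Path dir) {
        if (!Files.exists(dir)) {
            return 0;
        }
        try {
            List<Path> entries;
            try (Stream<Path> walk = Files.walk(dir)) {
                // Reverse order puts children before their parent directories.
                entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
            }
            for (Path entry : entries) {
                Files.delete(entry);
            }
            return entries.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as `StateDirWipe.wipe(Paths.get("/var/lib/kafka-streams/our-app"))` (path assumed for illustration) would be run before restarting the affected instance.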
h2. Context:
Each instance has its own persistent state directory. The configured state
directory does not change.
Processing Guarantee: exactly_once_v2
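The relevant part of this setup can be sketched as Streams properties. Only `exactly_once_v2` comes from our configuration as described above. The application id, bootstrap servers, and state directory path are placeholders supplied by the caller, and the class name is illustrative. The literal keys are the standard Kafka Streams configuration names.

```java
import java.util.Properties;

final class StreamsSettings {
    // Builds the subset of Streams properties relevant to this report.
    static Properties build(String applicationId, String bootstrapServers, String stateDir) {
        Properties props = new Properties();
        props.put("application.id", applicationId);
        props.put("bootstrap.servers", bootstrapServers);
        // Exactly-once processing, as stated in this report.
        props.put("processing.guarantee", "exactly_once_v2");
        // Persistent, per-instance state directory that stays the same across restarts.
        props.put("state.dir", stateDir);
        return props;
    }
}
```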