Lerh Chuan Low created KAFKA-8547:
-------------------------------------
Summary: 2 __consumer_offsets partitions grow very big
Key: KAFKA-8547
URL: https://issues.apache.org/jira/browse/KAFKA-8547
Project: Kafka
Issue Type: Bug
Components: log cleaner
Affects Versions: 2.1.1
Environment: Ubuntu 18.04, Kafka 2.12-2.1.1, running as a systemd service
Reporter: Lerh Chuan Low
There's a related issue here: https://issues.apache.org/jira/browse/KAFKA-3917,
but it looked a little outdated/dead, so we are filing a new one.
We observed a few out-of-memory errors on our Kafka servers, and our theory
is that they were caused by 2 overly large partitions in `__consumer_offsets`.
On further digging, it looks like these 2 large partitions have segments dating
back up to 3 months. These old files also account for most of the data in
those partitions (about 10G of a partition's 12G).
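For anyone wanting to check their own brokers for the same pattern, here is a rough sketch of how old segments could be listed. It assumes GNU find and du (as on Ubuntu); the function name `old_segments` and the data directory in the usage line are hypothetical and need adjusting to the broker's `log.dirs`:

```shell
# List .log segment files in one partition directory that were last
# modified more than N days ago, with their size in KiB.
# $1: partition directory (hypothetical path in the usage line below)
# $2: age threshold in days
old_segments() {
  find "$1" -maxdepth 1 -type f -name '*.log' -mtime "+$2" -exec du -k {} +
}

# Usage (path and partition number are assumptions):
#   old_segments /var/lib/kafka/data/__consumer_offsets-0 60
```

Modification time is only a proxy for segment age, but it is usually enough to spot segments that the log cleaner has left untouched for months.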
When we dumped those old segments, we saw:
```
11:40 $ ./kafka-run-class.sh kafka.tools.DumpLogSegments --files \
    00000000161728257775.log --offsets-decoder --print-data-log --deep-iteration
Dumping 00000000161728257775.log
Starting offset: 161728257775
offset: 161728257904 position: 61 CreateTime: 1553457816168 isvalid: true
keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 367038
producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: []
endTxnMarker: COMMIT coordinatorEpoch: 746
offset: 161728258098 position: 200 CreateTime: 1553457816230 isvalid: true
keysize: 4 valuesize: 6 magic: 2 compresscodec: NONE producerId: 366036
producerEpoch: 3 sequence: -1 isTransactional: true headerKeys: []
endTxnMarker: COMMIT coordinatorEpoch: 761
...
```
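The `CreateTime` values in the dump are milliseconds since the Unix epoch, so they can be converted to confirm how old the records are. A minimal sketch, assuming GNU date (as on Ubuntu); the function name `create_time_to_utc` is made up for illustration:

```shell
# Convert a Kafka CreateTime (milliseconds since the epoch) to a UTC timestamp.
# The millisecond part is dropped; GNU date accepts "@<seconds>".
create_time_to_utc() {
  date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S UTC'
}

# The two CreateTime values from the dump above:
create_time_to_utc 1553457816168   # prints 2019-03-24 20:03:36 UTC
create_time_to_utc 1553457816230   # prints 2019-03-24 20:03:36 UTC
```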
It looks like those old segments all contain transactional information, and
of the 2 partitions, one holds the control message COMMIT and the other the
control message ABORT. (As a side note, it took us a while to figure out that
for a segment with the control bit set, the key really is `endTxnMarker` and
the value is `coordinatorEpoch`; in a non-control batch dump it would
otherwise show key and payload. We were wondering whether seeing what those 2
partitions contained in their keys might give us any clues.) Our current
workaround is based on this post:
https://issues.apache.org/jira/browse/KAFKA-3917?focusedCommentId=16816874&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16816874.
We set the cleanup policy to `compact,delete`, and very quickly the
partition was down to below 2G. Not sure if this is something the log cleaner
should be able to handle normally?
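For reference, a sketch of what applying that workaround could look like with the stock `kafka-configs.sh` tool. The ZooKeeper address `localhost:2181` and the script location are assumptions, and this is not necessarily the exact command that was run; with `delete` enabled, old segments are then removed according to the topic's retention settings:

```shell
# Assumption: ZooKeeper is reachable at localhost:2181 and the Kafka bin
# directory is the current directory. Square brackets group a value that
# itself contains a comma.
./kafka-configs.sh --zookeeper localhost:2181 \
  --alter --entity-type topics --entity-name __consumer_offsets \
  --add-config 'cleanup.policy=[compact,delete]'

# Check the resulting topic-level overrides.
./kafka-configs.sh --zookeeper localhost:2181 \
  --describe --entity-type topics --entity-name __consumer_offsets
```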