[
https://issues.apache.org/jira/browse/KAFKA-13727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson resolved KAFKA-13727.
-------------------------------------
Fix Version/s: 2.8.2, 3.1.1, 3.0.2
Resolution: Fixed
> Edge case in cleaner can result in premature removal of ABORT marker
> --------------------------------------------------------------------
>
> Key: KAFKA-13727
> URL: https://issues.apache.org/jira/browse/KAFKA-13727
> Project: Kafka
> Issue Type: Bug
> Reporter: Jason Gustafson
> Assignee: Jason Gustafson
> Priority: Major
> Fix For: 2.8.2, 3.1.1, 3.0.2
>
>
> The log cleaner works by first building a map of the active keys beginning
> from the dirty offset, and then scanning forward from the beginning of the
> log to decide which records should be retained based on whether they are
> included in the map. The map of keys has a limited size. As soon as it fills
> up, we stop building it. The offset corresponding to the last record that was
> included in the map becomes the next dirty offset. Then when we are cleaning,
> we stop scanning forward at the dirty offset. Or to be more precise, we
> continue scanning until the end of the segment which includes the dirty
> offset, but all records above that offset are copied as is without checking
> the map of active keys.
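>
> As a rough illustration only (plain Scala with made-up names such as `Record`
> and `buildOffsetMap`; this is not the actual cleaner code), the map-building
> step can be sketched like this:
>
>     case class Record(offset: Long, key: String, isControl: Boolean = false)
>
>     // Build the bounded map of active keys starting at dirtyOffset. Returns the
>     // map plus the next dirty offset: the first offset whose record no longer fit.
>     def buildOffsetMap(records: Seq[Record],
>                        dirtyOffset: Long,
>                        maxKeys: Int): (Map[String, Long], Long) = {
>       var map = Map.empty[String, Long]
>       var nextDirty = dirtyOffset
>       var full = false
>       for (r <- records if r.offset >= dirtyOffset && !full) {
>         if (r.isControl) {
>           nextDirty = r.offset + 1          // markers carry no key, just move past them
>         } else if (map.contains(r.key) || map.size < maxKeys) {
>           map += (r.key -> r.offset)        // remember the latest offset seen for this key
>           nextDirty = r.offset + 1
>         } else {
>           full = true                       // map is full: this offset becomes the new dirty offset
>           nextDirty = r.offset
>         }
>       }
>       (map, nextDirty)
>     }
>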
> Compaction is complicated by the presence of transactions. The cleaner must
> keep track of which transactions have data remaining so that it can tell when
> it is safe to remove the respective markers. It works a bit like the
> consumer. Before scanning a segment, the cleaner consults the aborted
> transaction index to figure out which transactions have been aborted. All
> other transactions are considered committed.
> The problem we have found is that the cleaner does not take into account the
> range of offsets between the dirty offset and the end offset of the segment
> containing it when querying ahead for aborted transactions. This means that
> when the cleaner is scanning forward from the dirty offset, it does not have
> the complete set of aborted transactions. The main consequence of this is
> that abort markers associated with transactions which start within this range
> of offsets become eligible for deletion even before the corresponding data
> has been removed from the log.
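>
> As a sketch of the mismatch (illustrative names again, not the real cleaner
> API): the records scanned on a round extend to the end of the segment that
> contains the dirty offset, but the aborted-transaction lookup only covered
> offsets up to the dirty offset itself, so any transaction whose first offset
> falls in between goes unnoticed:
>
>     case class AbortedTxn(producerId: Long, firstOffset: Long)
>
>     // Collect the producer ids of transactions aborted at or below the bound.
>     def collectAbortedTxns(index: Seq[AbortedTxn], upperBoundOffset: Long): Set[Long] =
>       index.filter(_.firstOffset <= upperBoundOffset).map(_.producerId).toSet
>
>     // collectAbortedTxns(index, dirtyOffset)        // misses aborts that start above the dirty offset
>     // collectAbortedTxns(index, endOfDirtySegment)  // covers the full range that is actually scanned
>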
> Here is an example. Suppose that the log contains the following entries:
> offset=0, key=a
> offset=1, key=b
> offset=2, COMMIT
> offset=3, key=c
> offset=4, key=d
> offset=5, COMMIT
> offset=6, key=b
> offset=7, ABORT
> Suppose we have an offset map which can only contain 2 keys and the dirty
> offset starts at 0. The first time we scan forward, we will build a map with
> keys a and b, which will allow us to move the dirty offset up to 3. Due to
> the issue documented here, we will not detect the aborted transaction
> starting at offset 6. But it will not be eligible for deletion on this round
> of cleaning because its removal is bounded by `delete.retention.ms`. Instead,
> our new logic will set the deletion horizon for this batch to the current
> time plus the configured `delete.retention.ms`.
> offset=0, key=a
> offset=1, key=b
> offset=2, COMMIT
> offset=3, key=c
> offset=4, key=d
> offset=5, COMMIT
> offset=6, key=b
> offset=7, ABORT (deleteHorizon: N)
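>
> Plugging the example log into the earlier sketch reproduces how the dirty
> offset moved up to 3 on this first pass (same made-up `Record` type, with the
> markers treated as control records):
>
>     val log = Seq(
>       Record(0, "a"), Record(1, "b"), Record(2, "COMMIT", isControl = true),
>       Record(3, "c"), Record(4, "d"), Record(5, "COMMIT", isControl = true),
>       Record(6, "b"), Record(7, "ABORT", isControl = true))
>
>     val (map, nextDirty) = buildOffsetMap(log, dirtyOffset = 0L, maxKeys = 2)
>     // map == Map("a" -> 0, "b" -> 1), nextDirty == 3
>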
> Suppose that the time reaches N+1 before the next cleaning. We will begin
> from the dirty offset of 3 and collect keys c and d before stopping at offset
> 6. Again, we will not detect the aborted transaction beginning at offset 6
> since it is out of the range. This time when we scan, the marker at offset 7
> will be deleted because the transaction will be seen as empty and now the
> deletion horizon has passed. So we end up with this state:
> offset=0, key=a
> offset=1, key=b
> offset=2, COMMIT
> offset=3, key=c
> offset=4, key=d
> offset=5, COMMIT
> offset=6, key=b
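>
> The retain-or-remove decision for the marker can be sketched roughly like
> this (illustrative names only; the real cleaner tracks the horizon per
> batch):
>
>     // deleteHorizonMs is None until the transaction first appears empty.
>     // Returns (retain the marker?, updated delete horizon).
>     def shouldRetainMarker(txnHasRemainingData: Boolean,
>                            deleteHorizonMs: Option[Long],
>                            nowMs: Long,
>                            deleteRetentionMs: Long): (Boolean, Option[Long]) = {
>       if (txnHasRemainingData)
>         (true, deleteHorizonMs)             // data still present: always keep the marker
>       else deleteHorizonMs match {
>         case None          => (true, Some(nowMs + deleteRetentionMs))  // start the clock, keep for now
>         case Some(horizon) => (nowMs < horizon, deleteHorizonMs)       // drop once the horizon has passed
>       }
>     }
>
>     // Because the abort starting at offset 6 was never collected, txnHasRemainingData
>     // is wrongly false on both passes: the first pass keeps the marker and stamps
>     // horizon N, and the second pass at time N+1 drops it, stranding offset 6.
>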
> Effectively it becomes a hanging transaction. The interesting thing is that
> we might not even detect it. As far as the leader is concerned, it had
> already completed that transaction, so it is not expecting any additional
> markers. The transaction index would have been rewritten without the aborted
> transaction when the log was cleaned, so any consumer fetching the data would
> see the transaction as committed. On the other hand, if we did a reassignment
> to a new replica, or if we had to rebuild the full log state during recovery,
> then we would suddenly detect it.
> I am not sure how likely this scenario is in practice. I think it's fair to
> say it is an extremely rare case. The cleaner has to fail to clean a full
> segment at least twice, and enough time still needs to pass for the
> marker's deletion horizon to be reached. Perhaps it is possible if the
> cardinality of keys is very high and the configured memory limit for the
> cleaner is low.