[
https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jun Rao resolved KAFKA-9826.
----------------------------
Fix Version/s: 2.6.0
Resolution: Fixed
Merged the PR to trunk.
> Log cleaning repeatedly picks same segment with no effect when first dirty
> offset is past start of active segment
> -----------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-9826
> URL: https://issues.apache.org/jira/browse/KAFKA-9826
> Project: Kafka
> Issue Type: Bug
> Components: log cleaner
> Affects Versions: 2.4.1
> Reporter: Steve Rodrigues
> Assignee: Steve Rodrigues
> Priority: Major
> Fix For: 2.6.0
>
>
> Seen on a system where a given partition had a single segment and, for
> whatever reason (deleteRecords?), the logStartOffset was greater than the
> base offset of that segment. There was a continuous stream of
> ```
> [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of FOO-3 to log
> start offset 55649 since the checkpointed offset 0 is invalid.
> (kafka.log.LogCleanerManager$)
> ```
> messages (partition name changed; it wasn't really FOO). This was expected
> to be resolved by KAFKA-6266 but clearly wasn't.
> Further investigation revealed that a few logs were being cleaned
> continuously, generating messages in the `log-cleaner.log` of the form:
> ```
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3...
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3
> for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete.
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior
> to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10
> 13:39:08 EST 2019)... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner
> thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
> 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
> Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
> Buffer utilization: 0.0%
> Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
> Start size: 0.0 MB (0 messages)
> End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages)
> (kafka.log.LogCleaner)
> ```
> What seems to have happened here (data gathered from a different
> partition) is:
> There exist a number of partitions here that get relatively low traffic,
> including our friend FOO-5. For whatever reason, LogStartOffset of this
> partition has moved beyond the baseOffset of the active segment. (Notes in
> other issues indicate that this is a reasonable scenario.) So there’s one
> segment, starting at 166266, and a log start of 166301.
> grabFilthiestCompactedLog runs and reads the checkpoint file. We see that
> this topic-partition needs to be cleaned and call cleanableOffsets on it,
> which returns an OffsetsToClean with firstDirtyOffset == logStartOffset ==
> 166301, firstUncleanableOffset = max(logStart, activeSegment.baseOffset) =
> 166301, and forceCheckpoint = true.
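> For concreteness, here's a minimal sketch (plain Scala with hardcoded
> values, not the actual cleanableOffsets code) of that arithmetic with this
> partition's numbers:
> ```
> object CleanableOffsetsSketch extends App {
>   val logStartOffset = 166301L          // moved forward, e.g. by deleteRecords
>   val activeSegmentBaseOffset = 166266L // the lone segment on disk
>   val checkpointedOffset = 0L           // stale checkpoint, below logStartOffset
>
>   // The checkpoint is invalid (below logStartOffset), so the first dirty
>   // offset is reset to the log start -- the WARN seen in the server logs.
>   val firstDirtyOffset =
>     if (checkpointedOffset < logStartOffset) logStartOffset else checkpointedOffset
>
>   val firstUncleanableOffset = math.max(logStartOffset, activeSegmentBaseOffset)
>
>   println(s"firstDirtyOffset=$firstDirtyOffset firstUncleanableOffset=$firstUncleanableOffset")
>   // Both print 166301: the dirty range is empty, yet cleaning still proceeds.
> }
> ```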
> The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix
> for KAFKA-6266). We then create a LogToClean object based on the
> firstDirtyOffset and firstUncleanableOffset of 166301 (past the active
> segment’s base offset).
> The LogToClean object has cleanBytes = logSegments(-1,
> firstDirtyOffset).map(_.size).sum → the size of this segment. It has
> firstUncleanableOffset and cleanableBytes determined by
> calculateCleanableBytes. calculateCleanableBytes returns:
> ```
> val firstUncleanableSegment = log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
> val firstUncleanableOffset = firstUncleanableSegment.baseOffset
> val cleanableBytes = log.logSegments(firstDirtyOffset, math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum
> (firstUncleanableOffset, cleanableBytes)
> ```
> firstUncleanableSegment is the activeSegment. firstUncleanableOffset is its
> base offset, 166266. cleanableBytes looks up logSegments(166301, max(166301,
> 166266)) = logSegments(166301, 166301) → which _is the active segment_.
> So cleanableBytes > 0.
> We then filter for logs with totalBytes (clean + cleanable) > 0. This log
> has totalBytes > 0, and it has cleanableBytes, so great! It's the filthiest.
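> Roughly, with a made-up segment size (a toy illustration of the accounting,
> not the real LogToClean class):
> ```
> object FilthiestLogSketch extends App {
>   val segmentSize = 1048576L // hypothetical: suppose the lone segment is 1 MB
>
>   // cleanBytes: logSegments(-1, firstDirtyOffset) -> the whole segment,
>   // since its base offset 166266 is below firstDirtyOffset 166301.
>   val cleanBytes = segmentSize
>
>   // cleanableBytes: logSegments(166301, max(166301, 166266)) wrongly returns
>   // the active segment, so the same bytes are counted as cleanable too.
>   val cleanableBytes = segmentSize
>
>   val totalBytes = cleanBytes + cleanableBytes
>   val cleanableRatio = cleanableBytes.toDouble / totalBytes
>   println(s"totalBytes=$totalBytes cleanableRatio=$cleanableRatio") // passes the > 0 filter
> }
> ```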
> The cleaner picks it and calls cleanLog on it, which then does
> cleaner.clean, which returns the nextDirtyOffset and cleaner stats.
> cleaner.clean calls doClean, which builds an offsetMap. The offsetMap only
> considers non-active segments while building, but there aren't any. So the
> endOffset of the offsetMap is lastOffset (default -1) + 1 → 0. We record the
> stats (including logging to log-cleaner.log). After this we call
> cleanerManager.doneCleaning, which writes the checkpoint file with the
> latest value… of 0.
> And then the process starts all over.
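> In other words (a toy sketch of the offset-map bookkeeping, not Kafka's
> actual OffsetMap), the "next dirty offset" degenerates to 0 because nothing
> was ever indexed:
> ```
> object OffsetMapEndSketch extends App {
>   // Only the active segment exists, so there is nothing to build the map from.
>   val nonActiveSegmentOffsets: Seq[Long] = Seq.empty
>
>   var lastOffset = -1L                                 // default before indexing
>   nonActiveSegmentOffsets.foreach(o => lastOffset = o) // never runs here
>
>   val endOffset = lastOffset + 1 // -1 + 1 = 0
>   println(s"checkpointing next dirty offset = $endOffset") // 0 -- and we loop forever
> }
> ```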
> It appears that there's at least one bug here: `log.logSegments(from,
> to)` will return an empty list if from == to and both are segment-aligned,
> but _not_ if they fall in the middle of a segment. Possibly a second bug:
> LogToClean uses start = firstDirtyOffset, end = max(firstDirtyOffset,
> firstUncleanableOffset), which can move the firstUncleanableOffset forward
> even when the firstDirtyOffset is already past firstUncleanable.
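> To illustrate that boundary asymmetry, here is a hedged sketch using a
> ConcurrentSkipListMap keyed by base offset as a stand-in for the segment
> map (the floor-lookup semantics are a simplification of what
> Log.logSegments actually does, not a copy of it):
> ```
> import java.util.concurrent.ConcurrentSkipListMap
> import scala.jdk.CollectionConverters._
>
> object LogSegmentsBoundarySketch extends App {
>   // One segment starting at base offset 166266.
>   val segments = new ConcurrentSkipListMap[Long, String]()
>   segments.put(166266L, "segment-166266")
>
>   // Simplified stand-in: start from the segment containing `from` (floor
>   // lookup) and include every segment whose base offset is below `to`.
>   def logSegments(from: Long, to: Long): Seq[String] = {
>     val floor = segments.floorKey(from) // assumes some segment starts at or before `from`
>     segments.subMap(floor, true, to, false).values.asScala.toSeq
>   }
>
>   println(logSegments(166266L, 166266L)) // List() -- segment-aligned bounds, empty
>   println(logSegments(166301L, 166301L)) // List(segment-166266) -- mid-segment, not empty!
> }
> ```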
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)