[jira] [Updated] (KAFKA-9826) Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment

Jun Rao (Jira) Thu, 07 May 2020 08:01:11 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jun Rao updated KAFKA-9826:
---------------------------
    Fix Version/s: 2.5.1
                   2.4.2

> Log cleaning repeatedly picks same segment with no effect when first dirty 
> offset is past start of active segment
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9826
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9826
>             Project: Kafka
>          Issue Type: Bug
>          Components: log cleaner
>    Affects Versions: 2.4.1
>            Reporter: Steve Rodrigues
>            Assignee: Steve Rodrigues
>            Priority: Major
>             Fix For: 2.6.0, 2.4.2, 2.5.1
>
>
> Seen on a system where a given partition had a single segment, and for 
> whatever reason (deleteRecords?), the logStartOffset was greater than the 
> base segment of the log, there were a continuous series of 
> ```
> [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of  FOO-3 to log 
> start offset 55649 since the checkpointed offset 0 is invalid. 
> (kafka.log.LogCleanerManager$)
> ```
> messages (partition name changed, it wasn't really FOO). This was expected to 
> be resolved by KAFKA-6266 but clearly wasn't. 
> Further investigation revealed that  a few segments were continuously 
> cleaning and generating messages in the `log-cleaner.log` of the form:
> ```
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 
> for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior 
> to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 
> 13:39:08 EST 2019)... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner 
> thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
> 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
> Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
> Buffer utilization: 0.0%
> Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
> Start size: 0.0 MB (0 messages)
> End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) 
> (kafka.log.LogCleaner)
> ```
> What seems to have happened here (data determined for a different partition) 
> is:
> There exist a number of partitions here which get relatively low traffic, 
> including our friend FOO-5. For whatever reason, LogStartOffset of this 
> partition has moved beyond the baseOffset of the active segment. (Notes in 
> other issues indicate that this is a reasonable scenario.) So there’s one 
> segment, starting at 166266, and a log start of 166301.
> grabFilthiestCompactedLog runs and reads the checkpoint file. We see that 
> this topicpartition needs to be cleaned, and call cleanableOffsets on it 
> which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 
> 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 
> 116301, and forceCheckpoint = true.
> The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix 
> for KAFKA-6266). We then create a LogToClean object based on the 
> firstDirtyOffset and firstUncleanableOffset of 166301 (past the active 
> segment’s base offset).
> The LogToClean object has cleanBytes = logSegments(-1, 
> firstDirtyOffset).map(_.size).sum → the size of this segment. It has 
> firstUncleanableOffset and cleanableBytes determined by 
> calculateCleanableBytes. calculateCleanableBytes returns:
> {{}}
> {{val firstUncleanableSegment = 
> log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)}}
> {{val firstUncleanableOffset = firstUncleanableSegment.baseOffset}}
> {{val cleanableBytes = log.logSegments(firstDirtyOffset, 
> math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum
> (firstUncleanableOffset, cleanableBytes)}}
> firstUncleanableSegment is activeSegment. firstUncleanableOffset is the base 
> offset, 166266. cleanableBytes is looking for logSegments(166301, max(166301, 
> 166266) → which _is the active segment_
> So there are “cleanableBytes” > 0.
> We then filter out segments with totalbytes (clean + cleanable) > 0. This 
> segment has totalBytes > 0, and it has cleanablebytes, so great! It’s 
> filthiest.
> The cleaner picks it, calls cleanLog on it, which then does cleaner.clean, 
> which returns nextDirtyOffset and cleaner stats. cleaner.clean callls 
> doClean, which builds an offsetMap. The offsetMap looks at non-active 
> segments, when building, but there aren’t any. So the endOffset of the 
> offsetMap is lastOffset (default -1) + 1 → 0. We record the stats (including 
> logging to log-cleaner.log). After this we call cleanerManager.doneCleaning, 
> which writes the checkpoint file with the latest value… of 0.
> And then the process starts all over.
> It appears that there's at least one bug here, where `log.logSegments(from, 
> to)` will return an empty list if from == to and both are segment-aligned, 
> but _not_ if they are in the middle of a segment, and possibly that 
> LogToClean does start=firstDirtyOffset, end = max(firstDirtyOffset, 
> firstUncleanableOffset) – it can move the firstUncleanableOffset even when 
> the firstDirtyOffset is past firstUncleanable.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (KAFKA-9826) Log cleaning repeatedly picks same segment with no effect when first dirty offset is past start of active segment

Reply via email to