[ https://issues.apache.org/jira/browse/KAFKA-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081607#comment-17081607 ]

ASF GitHub Bot commented on KAFKA-9826:
---------------------------------------

steverod commented on pull request #8469: [KAFKA-9826] Handle an unaligned 
first dirty offset during log cleaning.
URL: https://github.com/apache/kafka/pull/8469
 
 
   What
   ====
   In KAFKA-9826, a log whose first dirty offset was past the start of the 
active segment and past the last cleaned point resulted in an endless cycle of 
picking the segment to clean and discarding it. Though this didn't interfere 
with cleaning other log segments, it kept the log cleaner thread continuously 
busy (potentially wasting CPU and impacting other running threads) and filled 
the logs with lots of extraneous messages.
   
   This was determined to be because the active segment was mistakenly being 
picked for cleaning: the `logSegments` code handles the (start == end) case 
only when start and end fall on a segment boundary. The intent is to return an 
empty list, but if start and end fall in the middle of a segment, the routine 
returns that segment.
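   The boundary behavior can be modeled with a small self-contained sketch 
(Java here for illustration; the TreeMap-based lookup mirrors the shape of the 
description above, but all names are illustrative, not Kafka's actual code):

```java
import java.util.*;

public class SegmentLookup {
    // Segments keyed by base offset; string values stand in for the segments.
    private final TreeMap<Long, String> segments = new TreeMap<>();

    public SegmentLookup(long... baseOffsets) {
        for (long b : baseOffsets) segments.put(b, "segment@" + b);
    }

    // Pre-fix shape: floor `from` to the containing segment's base offset,
    // then take the half-open view [floor, to).
    public Collection<String> logSegmentsBuggy(long from, long to) {
        Long floor = segments.floorKey(from);
        NavigableMap<Long, String> view = (floor == null)
            ? segments.headMap(to, false)
            : segments.subMap(floor, true, to, false);
        return view.values();
    }

    // Post-fix shape: start == end always yields an empty result.
    public Collection<String> logSegmentsFixed(long from, long to) {
        if (from == to) return Collections.emptyList();
        return logSegmentsBuggy(from, to);
    }

    public static void main(String[] args) {
        SegmentLookup log = new SegmentLookup(166266L); // one segment, base 166266
        // start == end on the segment boundary: empty, as intended.
        System.out.println(log.logSegmentsBuggy(166266, 166266)); // []
        // start == end in the middle of the segment: the buggy lookup
        // returns the containing segment.
        System.out.println(log.logSegmentsBuggy(166301, 166301)); // [segment@166266]
        // with the fix, this case is empty too.
        System.out.println(log.logSegmentsFixed(166301, 166301)); // []
    }
}
```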
   
   This fix has two parts:
   1. It changes logSegments to always return an empty List when start == end.
   2. It changes calculateCleanableBytes to not consider anything past the 
uncleanable offset; previously, it could effectively shift the uncleanable 
offset up to match the firstDirtyOffset even when the firstDirtyOffset was 
past the firstUncleanableOffset. With fix (1) in place this has no practical 
effect, but it makes the code read more like the model it is attempting to 
follow.
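   The intent of part (2) can be sketched as follows (Java; segment sizes 
keyed by base offset stand in for the log, and all names are illustrative 
rather than Kafka's actual code):

```java
import java.util.TreeMap;

public class CleanableBytes {
    // Sum segment sizes over [floor(from), to), mirroring the pre-fix
    // logSegments lookup that floors the start offset to a segment base.
    static long sumBytes(TreeMap<Long, Long> sizesByBase, long from, long to) {
        Long floor = sizesByBase.floorKey(from);
        if (floor == null) floor = from;
        long total = 0L;
        for (long size : sizesByBase.subMap(floor, true, to, false).values())
            total += size;
        return total;
    }

    // Pre-fix shape: the upper bound is max(firstDirty, uncleanable), which
    // in effect shifts the uncleanable offset when firstDirty is past it.
    static long cleanableBytesOld(TreeMap<Long, Long> sizes, long firstDirty, long uncleanable) {
        return sumBytes(sizes, firstDirty, Math.max(firstDirty, uncleanable));
    }

    // Post-fix shape: nothing past the uncleanable offset is counted, so a
    // dirty offset at or beyond it yields zero cleanable bytes.
    static long cleanableBytesNew(TreeMap<Long, Long> sizes, long firstDirty, long uncleanable) {
        if (firstDirty >= uncleanable) return 0L;
        return sumBytes(sizes, firstDirty, uncleanable);
    }

    public static void main(String[] args) {
        TreeMap<Long, Long> sizes = new TreeMap<>();
        sizes.put(166266L, 1024L); // one (active) segment at base 166266, 1 KiB
        // dirty offset 166301 is past the uncleanable offset 166266:
        System.out.println(cleanableBytesOld(sizes, 166301L, 166266L)); // 1024 -> looks cleanable
        System.out.println(cleanableBytesNew(sizes, 166301L, 166266L)); // 0 -> nothing to clean
    }
}
```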
   
   These changes require modifications to a few test cases that exercised this 
particular situation; those tests were introduced in the context of KAFKA-8764. 
The situations they covered are now handled elsewhere in the code, but the 
tests themselves allowed a dirty offset in the active segment and expected the 
active segment to be selected for cleaning.
   
   An additional unit test for the logSegments call is added. 
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Log cleaning repeatedly picks same segment with no effect when first dirty 
> offset is past start of active segment
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-9826
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9826
>             Project: Kafka
>          Issue Type: Bug
>          Components: log cleaner
>    Affects Versions: 2.4.1
>            Reporter: Steve Rodrigues
>            Assignee: Steve Rodrigues
>            Priority: Major
>
> Seen on a system where a given partition had a single segment, and for 
> whatever reason (deleteRecords?), the logStartOffset was greater than the 
> base segment of the log, there were a continuous series of 
> ```
> [2020-03-03 16:56:31,374] WARN Resetting first dirty offset of  FOO-3 to log 
> start offset 55649 since the checkpointed offset 0 is invalid. 
> (kafka.log.LogCleanerManager$)
> ```
> messages (partition name changed, it wasn't really FOO). This was expected to 
> be resolved by KAFKA-6266 but clearly wasn't. 
> Further investigation revealed that a few segments were being continuously 
> cleaned, generating messages in the `log-cleaner.log` of the form:
> ```
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Beginning cleaning of log FOO-3 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,924] INFO Cleaner 1: Building offset map for FOO-3... 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Building offset map for log FOO-3 
> for 0 segments in offset range [55287, 54237). (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Offset map for log FOO-3 complete. 
> (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO Cleaner 1: Cleaning log FOO-3 (cleaning prior 
> to Wed Dec 31 19:00:00 EST 1969, discarding tombstones prior to Tue Dec 10 
> 13:39:08 EST 2019)... (kafka.log.LogCleaner)
> [2020-03-31 13:34:50,927] INFO [kafka-log-cleaner-thread-1]: Log cleaner 
> thread 1 cleaned log FOO-3 (dirty section = [55287, 55287])
> 0.0 MB of log processed in 0.0 seconds (0.0 MB/sec).
> Indexed 0.0 MB in 0.0 seconds (0.0 Mb/sec, 100.0% of total time)
> Buffer utilization: 0.0%
> Cleaned 0.0 MB in 0.0 seconds (NaN Mb/sec, 0.0% of total time)
> Start size: 0.0 MB (0 messages)
> End size: 0.0 MB (0 messages) NaN% size reduction (NaN% fewer messages) 
> (kafka.log.LogCleaner)
> ```
> What seems to have happened here (data determined for a different partition) 
> is:
> There exist a number of partitions here which get relatively low traffic, 
> including our friend FOO-5. For whatever reason, LogStartOffset of this 
> partition has moved beyond the baseOffset of the active segment. (Notes in 
> other issues indicate that this is a reasonable scenario.) So there’s one 
> segment, starting at 166266, and a log start of 166301.
> grabFilthiestCompactedLog runs and reads the checkpoint file. We see that 
> this topicpartition needs to be cleaned, and call cleanableOffsets on it 
> which returns an OffsetsToClean with firstDirtyOffset == logStartOffset == 
> 166301 and firstUncleanableOffset = max(logStart, activeSegment.baseOffset) = 
> 166301, and forceCheckpoint = true.
> The checkpoint file is updated in grabFilthiestCompactedLog (this is the fix 
> for KAFKA-6266). We then create a LogToClean object based on the 
> firstDirtyOffset and firstUncleanableOffset of 166301 (past the active 
> segment’s base offset).
> The LogToClean object has cleanBytes = logSegments(-1, 
> firstDirtyOffset).map(_.size).sum → the size of this segment. It has 
> firstUncleanableOffset and cleanableBytes determined by 
> calculateCleanableBytes. calculateCleanableBytes returns:
> ```
> val firstUncleanableSegment =
>   log.nonActiveLogSegmentsFrom(uncleanableOffset).headOption.getOrElse(log.activeSegment)
> val firstUncleanableOffset = firstUncleanableSegment.baseOffset
> val cleanableBytes = log.logSegments(firstDirtyOffset,
>   math.max(firstDirtyOffset, firstUncleanableOffset)).map(_.size.toLong).sum
> (firstUncleanableOffset, cleanableBytes)
> ```
> firstUncleanableSegment is the activeSegment. firstUncleanableOffset is its 
> base offset, 166266. cleanableBytes looks at logSegments(166301, max(166301, 
> 166266)) → which _is the active segment_.
> So there are “cleanableBytes” > 0.
> We then filter out logs with totalBytes (clean + cleanable) == 0. This log 
> has totalBytes > 0, and it has cleanableBytes, so great! It’s the filthiest.
> The cleaner picks it, calls cleanLog on it, which then does cleaner.clean, 
> which returns nextDirtyOffset and cleaner stats. cleaner.clean calls 
> doClean, which builds an offsetMap. The offsetMap looks at non-active 
> segments, when building, but there aren’t any. So the endOffset of the 
> offsetMap is lastOffset (default -1) + 1 → 0. We record the stats (including 
> logging to log-cleaner.log). After this we call cleanerManager.doneCleaning, 
> which writes the checkpoint file with the latest value… of 0.
> And then the process starts all over.
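> The tail of that cycle is simple arithmetic; a minimal sketch (Java, with 
> illustrative names only):
> 
> ```java
> public class CheckpointCycle {
>     // With no non-active segments, nothing is put into the offset map,
>     // so lastOffset keeps its default of -1 and the "next" offset is 0.
>     static long endOffset(long lastOffsetInMap) {
>         return lastOffsetInMap + 1;
>     }
> 
>     public static void main(String[] args) {
>         long checkpointed = endOffset(-1L); // nothing was indexed
>         System.out.println(checkpointed);   // 0, written to the checkpoint file
>         // 0 is below logStartOffset (166301), so the next pass resets the
>         // first dirty offset to the log start and the cycle repeats.
>     }
> }
> ```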
> It appears that there is at least one bug here: `log.logSegments(from, to)` 
> returns an empty list if from == to and both are segment-aligned, but _not_ 
> if they fall in the middle of a segment. Possibly a second: LogToClean uses 
> start = firstDirtyOffset, end = max(firstDirtyOffset, firstUncleanableOffset), 
> so it can move the firstUncleanableOffset even when the firstDirtyOffset is 
> past it.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
