[
https://issues.apache.org/jira/browse/KAFKA-1755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216486#comment-14216486
]
Joel Koshy commented on KAFKA-1755:
-----------------------------------
There are a few issues that I was thinking of as in scope for this jira:
* Log cleaner threads quitting on errors (which may be a non-issue as discussed
further below).
* Dealing with cleaner failures due to unkeyed messages.
* Other cleaner failures are possible as well (e.g., compressed message sets,
until KAFKA-1374 is reviewed and checked in).
This jira was filed because the same log cleaner compacts all compacted
topics, so a failure in one topic should (ideally) not affect the others. Any
practical deployment would need to set up alerts on the cleaner thread dying.
Right now, I think the most reliable way to alert (with the currently available
metrics) would be to monitor the max-dirty-ratio. If we set up this alert, then
allowing the cleaner to continue would in practice only delay an alert. So one
can argue that it is better to fail fast - i.e., let the log cleaner die
because a problematic topic is something that needs to be looked into
immediately. However, I think there are further improvements and alternatives
worth considering. It would be helpful if others could share their
thoughts/preferences on these:
* Introduce a new LogCleaningState: LogCleaningPausedDueToError
* Introduce a metric for the number of live cleaner threads
* If the log cleaner encounters any uncaught error, there are a couple of
options:
** Don't let the thread die, but move the partition to
LogCleaningPausedDueToError. Other topic-partitions can still be compacted.
Alerts can be set up on the number of partitions in state
LogCleaningPausedDueToError.
** Let the cleaner die and decrement live cleaner count. Alerts can be set up
on the number of live cleaner threads.
* If the cleaner encounters un-keyed messages:
** Delete those messages and otherwise do nothing, i.e., silently ignore them
(or just log the count in the log cleaner stats).
** Keep the messages and move the partition to LogCleaningPausedDueToError. The
motivation for this is accidental misconfiguration, i.e., it may be important
not to lose those messages. The error log cleaning state can be cleared only by
deleting and then recreating the topic.
* Additionally, I think we should reject producer requests containing un-keyed
messages to compacted topics.
* With all of the above, a backup alert can also be set up on the
max-dirty-ratio.
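To make the pause-on-error option above concrete, here is a minimal sketch of a
cleaner manager that catches per-partition errors, moves only the failing
partition into an error state, and keeps the thread alive. All names here
(CleanerManager, LogCleaningState, pausedDueToErrorCount) are hypothetical
illustrations for this proposal, not Kafka's actual internals:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed per-partition error handling.
public class CleanerSketch {
    enum LogCleaningState { NONE, IN_PROGRESS, PAUSED_DUE_TO_ERROR }

    static class CleanerManager {
        final Map<String, LogCleaningState> states = new HashMap<>();
        // Proposed metric: number of live cleaner threads.
        final AtomicInteger liveCleanerThreads = new AtomicInteger(0);

        // Proposed metric to alert on: partitions paused due to error.
        long pausedDueToErrorCount() {
            return states.values().stream()
                .filter(s -> s == LogCleaningState.PAUSED_DUE_TO_ERROR)
                .count();
        }

        // Clean one partition; on any error, pause only that partition
        // instead of letting the whole cleaner thread die.
        void cleanPartition(String topicPartition, Runnable doClean) {
            states.put(topicPartition, LogCleaningState.IN_PROGRESS);
            try {
                doClean.run();
                states.put(topicPartition, LogCleaningState.NONE);
            } catch (Exception e) {
                states.put(topicPartition, LogCleaningState.PAUSED_DUE_TO_ERROR);
            }
        }
    }

    public static void main(String[] args) {
        CleanerManager mgr = new CleanerManager();
        mgr.liveCleanerThreads.incrementAndGet();
        // A partition with unkeyed messages fails; others keep compacting.
        mgr.cleanPartition("bad-topic-0", () -> {
            throw new IllegalStateException("unkeyed message in compacted topic");
        });
        mgr.cleanPartition("good-topic-0", () -> { /* compacts fine */ });
        System.out.println(mgr.pausedDueToErrorCount());    // 1
        System.out.println(mgr.states.get("good-topic-0")); // NONE
        System.out.println(mgr.liveCleanerThreads.get());   // 1 (thread survived)
    }
}
```

An alert on pausedDueToErrorCount > 0 would then fire as soon as any single
partition hits an error, without delaying compaction of healthy partitions.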
> Log cleaner thread should not exit on errors
> --------------------------------------------
>
> Key: KAFKA-1755
> URL: https://issues.apache.org/jira/browse/KAFKA-1755
> Project: Kafka
> Issue Type: Bug
> Reporter: Joel Koshy
> Labels: newbie++
> Fix For: 0.8.3
>
>
> The log cleaner is a critical process when using compacted topics.
> However, if there is any error in any topic (notably if a key is missing)
> then the cleaner exits and all other compacted topics will also be adversely
> affected - i.e., compaction stops across the board.
> This can be improved by just aborting compaction for the affected topic on any
> error and keeping the thread from exiting.
> Another improvement would be to reject messages without keys that are sent to
> compacted topics, although this is not enough by itself.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)