Hey Stanislav, responses below:

> My initial suggestion was to *track* the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB), as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size). WDYT?
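For illustration, here is a minimal sketch of what such a percentage-based check could look like. All names here are hypothetical and not from the KIP; it only makes the proposal above concrete:

    // Hypothetical sketch: mark a log directory offline once the bytes held
    // by uncleanable partitions exceed a configurable fraction of its capacity.
    public class UncleanableThresholdSketch {
        // e.g. maxUncleanableRatio = 0.30 means "fail the log dir once 30%
        // of the whole log dir size is uncleanable"
        public static boolean shouldMarkLogDirOffline(long uncleanableBytes,
                                                      long logDirTotalBytes,
                                                      double maxUncleanableRatio) {
            if (logDirTotalBytes <= 0) {
                return false; // capacity unknown; don't take the dir offline
            }
            return (double) uncleanableBytes / logDirTotalBytes >= maxUncleanableRatio;
        }
    }

A ratio sidesteps the problem that log directories have different sizes, at the cost of behaving differently on small and large disks for the same setting.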
On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey Jason,
>
> My initial suggestion was to *track* the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB), as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size). WDYT?
>
> I feel it still makes sense to have a metric tracking how many uncleanable partitions there are and the total amount of uncleanable disk space (per log dir, via a JMX tag).
> But now, rather than fail the log directory after a certain count of uncleanable partitions, we could fail it after a certain percentage (or size) of its storage is uncleanable.
>
> I'd like to hear other people's thoughts on this. Sound good?
>
> Best,
> Stanislav
>
> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Stanislav,
> >
> > Sorry, I was probably looking at an older version (I had the tab open for so long!).
> >
> > I have been thinking about `max.uncleanable.partitions` and wondering if it's what we really want. The main risk if the cleaner cannot clean a partition is eventually running out of disk space. This is the most common problem we have seen with cleaner failures, and it can happen even if there is just one uncleanable partition. We've actually seen cases in which a single __consumer_offsets partition grew large enough to fill a significant portion of the disk. The difficulty with allowing a system to run out of disk space before failing is that it makes recovery difficult and time-consuming. Clean shutdown, for example, requires writing some state to disk. Without clean shutdown, it can take the broker significantly longer to start up because it has to do more segment recovery.
> >
> > For this problem, `max.uncleanable.partitions` does not really help. You can set it to 1 and fail fast, but that is not much better than the existing state. You had a suggestion previously in the KIP to use the size of uncleanable disk space instead. What was the reason for rejecting that? Intuitively, it seems like a better fit for a cleaner failure. It would provide users some time to react to failures while still protecting them from exhausting the disk.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >
> > > Hey Jason,
> > >
> > > 1. *10* is the default value; it says so in the KIP.
> > > 2. This is a good catch. As the current implementation stands, it's not a useful metric, since the thread continues to run even if all log directories are offline (although I'm not sure what the broker's behavior is in that scenario). I'll make sure the thread stops if all log directories are offline.
> > >
> > > I don't know which "Needs Discussion" item you're referencing; there hasn't been any in the KIP since August 1, and that was for the metric only. KIP History <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
> > >
> > > I've updated the KIP to mention the "time-since-last-run" metric.
> > >
> > > Thanks,
> > > Stanislav
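Since both messages above lean on the "time-since-last-run" gauge, here is a minimal, hypothetical watcher for it over JMX. The metric name is the one James quotes later in this thread; the host, port, attribute name, and 15-minute threshold are all assumptions:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Hypothetical liveness check for the log cleaner, assuming the broker
    // exposes JMX on broker-host:9999 and the gauge's JMX attribute is "Value".
    public class CleanerLivenessCheck {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                Number ms = (Number) conn.getAttribute(new ObjectName(
                        "kafka.log:type=LogCleanerManager,name=time-since-last-run-ms"),
                        "Value");
                if (ms.longValue() > 15 * 60 * 1000L) { // arbitrary alert threshold
                    System.out.println("Cleaner may be dead: " + ms + " ms since last run");
                }
            }
        }
    }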
> > > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io> wrote:
> > >
> > > > Hi Stanislav,
> > > >
> > > > Just a couple quick questions:
> > > >
> > > > 1. I may have missed it, but what will be the default value for `max.uncleanable.partitions`?
> > > > 2. It seems there will be some impact for users that monitor "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's a major concern, but probably worth mentioning in the compatibility section. Also, is this still a useful metric after this KIP?
> > > >
> > > > Also, maybe the "Needs Discussion" item can be moved to rejected alternatives since you've moved to a vote? I think leaving this for potential future work is reasonable.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rchi...@apache.org> wrote:
> > > >
> > > > > I'm okay with that.
> > > > >
> > > > > -Ray
> > > > >
> > > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > > > >
> > > > > > Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary? What do you think?
> > > > > >
> > > > > > regards,
> > > > > > Colin
> > > > > >
> > > > > > On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > > > > >
> > > > > > > Hey Ray,
> > > > > > >
> > > > > > > Thanks for the explanation. In regards to the configuration property, I'm not sure. As long as it has sufficient documentation, I find "max.uncleanable.partitions" to be okay. If we were to add the distinction explicitly, maybe it should be `max.uncleanable.partitions.per.logdir`?
> > > > > > >
> > > > > > > On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:
> > > > > > >
> > > > > > > > One more thing occurred to me. Should the configuration property be named "max.uncleanable.partitions.per.disk" instead?
> > > > > > > >
> > > > > > > > -Ray
> > > > > > > >
> > > > > > > > On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > > > > > >
> > > > > > > > > Yes, good catch. Thank you, James!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Stanislav
> > > > > > > > >
> > > > > > > > > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Can you update the KIP to say what the default is for max.uncleanable.partitions?
> > > > > > > > > >
> > > > > > > > > > -James
> > > > > > > > > >
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey group,
> > > > > > > > > > >
> > > > > > > > > > > I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
> > > > > > > > > > > Best,
> > > > > > > > > > > Stanislav
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey, Ray
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for pointing that out, it's fixed now.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Stanislav
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>? I tried, but the Wiki won't let me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Ray
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey guys,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect logs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Ray,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think we can do this for every exception that isn't `IOException` (see the sketch below). This will future-proof us against bugs in the system and potential other errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
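For illustration, a minimal sketch of that skip-on-unexpected-exception behavior. None of this is Kafka's actual LogCleaner code; `cleanPartition` and `markUncleanable` are assumed stand-ins:

    import java.io.IOException;
    import java.util.List;

    // Hypothetical sketch: skip a partition on unexpected (non-I/O) exceptions,
    // keeping the existing mark-the-log-dir-offline behavior for IOException.
    abstract class CleanerLoopSketch {
        abstract void cleanPartition(String topicPartition) throws IOException;
        abstract void markUncleanable(String topicPartition);

        void cleanAll(List<String> partitions) throws IOException {
            for (String tp : partitions) {
                try {
                    cleanPartition(tp);
                } catch (IOException e) {
                    throw e; // disk-level trouble: keep the current fail-fast behavior
                } catch (RuntimeException e) {
                    // A logic bug or corrupt data in one partition should not
                    // kill the thread: record it, skip it, keep cleaning the rest.
                    markUncleanable(tp);
                }
            }
        }
    }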
> > > > > > > > > > > > > > > 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For the cleaner thread specifically, I do not think respawning will help at all, because we are more than likely to run into the same issue again, which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > <<< respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it; then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max wait time).
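A minimal sketch of the capped exponential backoff Ron describes, under the assumption that a supervisor respawns the cleaner; `respawnCleanerThread` and the 30-minute cap are assumptions, not real Kafka code:

    // Hypothetical supervisor: respawn the cleaner with 1, 2, 4, 8, ... minute
    // waits between deaths, capped at a maximum wait time.
    abstract class RespawnBackoffSketch {
        abstract Thread respawnCleanerThread(); // assumed: starts a new cleaner thread

        void supervise() throws InterruptedException {
            long backoffMs = 60_000L;               // 1 minute before the 2nd respawn
            final long maxBackoffMs = 30 * 60_000L; // cap: 30 minutes
            while (true) {
                Thread cleaner = respawnCleanerThread();
                cleaner.join();                     // returns when the thread dies
                Thread.sleep(backoffMs);            // back off before respawning
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
    }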
> > > > > > > > > > > > > > > > I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Stanislav! Thanks for this KIP!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart. https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration I've only read about this, I haven't personally tried it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
> > > > > > > > > > > > > > > > > > Some comments:
> > > > > > > > > > > > > > > > > > * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
> > > > > > > > > > > > > > > > > > * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
> > > > > > > > > > > > > > > > > The way the disk failed was normally having an I/O request take a really long time and time out. So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario: if the disk appeared to work for a while but then failed, any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Disks aren't biological systems -- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * For your metrics, can you spell out the full metric in JMX-style format, such as: kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> > > > > > > > > > > > > > > > > > * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could be a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having kafka.log:type=LogCleanerManager,topic=topic1,partition=2? I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
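To make the tag idea concrete, a small, hypothetical sketch using plain JMX. The ObjectName follows James's example; the gauge class and its registration are assumptions, not Kafka code:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class UncleanableGaugeDemo {
        // Standard MBean convention: interface name = <ClassName>MBean.
        public interface UncleanableGaugeMBean { int getValue(); }
        public static class UncleanableGauge implements UncleanableGaugeMBean {
            public int getValue() { return 1; } // presence marker, per "Value=1"
        }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // topic and partition become ObjectName properties (tags), rather
            // than being packed into one long metric name.
            server.registerMBean(new UncleanableGauge(), new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=topic1,partition=2"));
            // Monitoring systems can then slice and dice with a pattern query:
            System.out.println(server.queryNames(new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=*,partition=*"), null));
        }
    }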
> > > > > > > > > > > > > > > > > I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException.
> > > > > > > > > > > > > > > > > In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
> > > > > > > > > > > > > > > > > > - Broker will be offline if all log directories are bad.
> > > > > > > > > > > > > > > > > > - Broker will stop serving replicas in any bad log directory.
> > >
> > > --
> > > Best,
> > > Stanislav
>
> --
> Best,
> Stanislav