Hey Stanislav, responses below:

> My initial suggestion was to *track* the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB), as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size). WDYT?
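For illustration, here is a minimal sketch of what such a percentage-based check could look like. All names here are hypothetical and not from the KIP; it only makes the proposal above concrete:

    // Hypothetical sketch: mark a log directory offline once the bytes held
    // by uncleanable partitions exceed a configurable fraction of its capacity.
    public class UncleanableThresholdSketch {
        // e.g. maxUncleanableRatio = 0.30 means "fail the log dir once 30%
        // of the whole log dir size is uncleanable"
        public static boolean shouldMarkLogDirOffline(long uncleanableBytes,
                                                      long logDirTotalBytes,
                                                      double maxUncleanableRatio) {
            if (logDirTotalBytes <= 0) {
                return false; // capacity unknown; don't take the dir offline
            }
            return (double) uncleanableBytes / logDirTotalBytes >= maxUncleanableRatio;
        }
    }

A ratio sidesteps the problem that log directories have different sizes, at the cost of behaving differently on small and large disks for the same setting.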
On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:

> Hey Jason,
>
> My initial suggestion was to *track* the uncleanable disk space.
> I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful.
> I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB), as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size). WDYT?
>
> I feel it still makes sense to have a metric tracking how many uncleanable partitions there are and the total amount of uncleanable disk space (per log dir, via a JMX tag).
> But now, rather than fail the log directory after a certain count of uncleanable partitions, we could fail it after a certain percentage (or size) of its storage is uncleanable.
>
> I'd like to hear other people's thoughts on this. Sound good?
>
> Best,
> Stanislav
>
> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Stanislav,
> >
> > Sorry, I was probably looking at an older version (I had the tab open for so long!).
> >
> > I have been thinking about `max.uncleanable.partitions` and wondering if it's what we really want. The main risk if the cleaner cannot clean a partition is eventually running out of disk space. This is the most common problem we have seen with cleaner failures, and it can happen even if there is just one uncleanable partition. We've actually seen cases in which a single __consumer_offsets partition grew large enough to fill a significant portion of the disk. The difficulty with allowing a system to run out of disk space before failing is that it makes recovery difficult and time-consuming. Clean shutdown, for example, requires writing some state to disk. Without clean shutdown, it can take the broker significantly longer to start up because it has to do more segment recovery.
> >
> > For this problem, `max.uncleanable.partitions` does not really help. You can set it to 1 and fail fast, but that is not much better than the existing state. You had a suggestion previously in the KIP to use the size of uncleanable disk space instead. What was the reason for rejecting that? Intuitively, it seems like a better fit for a cleaner failure. It would provide users some time to react to failures while still protecting them from exhausting the disk.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> >
> > > Hey Jason,
> > >
> > > 1. *10* is the default value; it says so in the KIP.
> > > 2. This is a good catch. As the current implementation stands, it's not a useful metric, since the thread continues to run even if all log directories are offline (although I'm not sure what the broker's behavior is in that scenario). I'll make sure the thread stops if all log directories are offline.
> > >
> > > I don't know which "Needs Discussion" item you're referencing; there hasn't been any in the KIP since August 1, and that was for the metric only. KIP History <https://cwiki.apache.org/confluence/pages/viewpreviousversions.action?pageId=89064875>
> > >
> > > I've updated the KIP to mention the "time-since-last-run" metric.
> > >
> > > Thanks,
> > > Stanislav
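Since both messages above lean on the "time-since-last-run" gauge, here is a minimal, hypothetical watcher for it over JMX. The metric name is the one James quotes later in this thread; the host, port, attribute name, and 15-minute threshold are all assumptions:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Hypothetical liveness check for the log cleaner, assuming the broker
    // exposes JMX on broker-host:9999 and the gauge's JMX attribute is "Value".
    public class CleanerLivenessCheck {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                Number ms = (Number) conn.getAttribute(new ObjectName(
                        "kafka.log:type=LogCleanerManager,name=time-since-last-run-ms"),
                        "Value");
                if (ms.longValue() > 15 * 60 * 1000L) { // arbitrary alert threshold
                    System.out.println("Cleaner may be dead: " + ms + " ms since last run");
                }
            }
        }
    }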
> > > On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson <ja...@confluent.io> wrote:
> > >
> > > > Hi Stanislav,
> > > >
> > > > Just a couple quick questions:
> > > >
> > > > 1. I may have missed it, but what will be the default value for `max.uncleanable.partitions`?
> > > > 2. It seems there will be some impact for users that monitor "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's a major concern, but probably worth mentioning in the compatibility section. Also, is this still a useful metric after this KIP?
> > > >
> > > > Also, maybe the "Needs Discussion" item can be moved to rejected alternatives since you've moved to a vote? I think leaving this for potential future work is reasonable.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang <rchi...@apache.org> wrote:
> > > >
> > > > > I'm okay with that.
> > > > >
> > > > > -Ray
> > > > >
> > > > > On 8/6/18 10:59 AM, Colin McCabe wrote:
> > > > >
> > > > > > Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary? What do you think?
> > > > > >
> > > > > > regards,
> > > > > > Colin
> > > > > >
> > > > > > On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:
> > > > > >
> > > > > > > Hey Ray,
> > > > > > >
> > > > > > > Thanks for the explanation. In regards to the configuration property, I'm not sure. As long as it has sufficient documentation, I find "max.uncleanable.partitions" to be okay. If we were to add the distinction explicitly, maybe it should be `max.uncleanable.partitions.per.logdir`?
> > > > > > >
> > > > > > > On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang <rchi...@apache.org> wrote:
> > > > > > >
> > > > > > > > One more thing occurred to me. Should the configuration property be named "max.uncleanable.partitions.per.disk" instead?
> > > > > > > >
> > > > > > > > -Ray
> > > > > > > >
> > > > > > > > On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> > > > > > > >
> > > > > > > > > Yes, good catch. Thank you, James!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Stanislav
> > > > > > > > >
> > > > > > > > > On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushuja...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Can you update the KIP to say what the default is for max.uncleanable.partitions?
> > > > > > > > > >
> > > > > > > > > > -James
> > > > > > > > > >
> > > > > > > > > > Sent from my iPhone
> > > > > > > > > >
> > > > > > > > > > On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey group,
> > > > > > > > > > >
> > > > > > > > > > > I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
> > > > > > > > > > > Best,
> > > > > > > > > > > Stanislav
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey, Ray
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for pointing that out, it's fixed now.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Stanislav
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchi...@apache.org> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>? I tried, but the Wiki won't let me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Ray
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey guys,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect logs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Ray,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think we can do this for every exception that isn't `IOException` (see the sketch below). This will future-proof us against bugs in the system and potential other errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
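For illustration, a minimal sketch of that skip-on-unexpected-exception behavior. None of this is Kafka's actual LogCleaner code; `cleanPartition` and `markUncleanable` are assumed stand-ins:

    import java.io.IOException;
    import java.util.List;

    // Hypothetical sketch: skip a partition on unexpected (non-I/O) exceptions,
    // keeping the existing mark-the-log-dir-offline behavior for IOException.
    abstract class CleanerLoopSketch {
        abstract void cleanPartition(String topicPartition) throws IOException;
        abstract void markUncleanable(String topicPartition);

        void cleanAll(List<String> partitions) throws IOException {
            for (String tp : partitions) {
                try {
                    cleanPartition(tp);
                } catch (IOException e) {
                    throw e; // disk-level trouble: keep the current fail-fast behavior
                } catch (RuntimeException e) {
                    // A logic bug or corrupt data in one partition should not
                    // kill the thread: record it, skip it, keep cleaning the rest.
                    markUncleanable(tp);
                }
            }
        }
    }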
> > > > > > > > > > > > > > > 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhru...@confluent.io> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For the cleaner thread specifically, I do not think respawning will help at all, because we are more than likely to run into the same issue again, which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndg...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > <<< respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it; then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max wait time).
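A minimal sketch of the capped exponential backoff Ron describes, under the assumption that a supervisor respawns the cleaner; `respawnCleanerThread` and the 30-minute cap are assumptions, not real Kafka code:

    // Hypothetical supervisor: respawn the cleaner with 1, 2, 4, 8, ... minute
    // waits between deaths, capped at a maximum wait time.
    abstract class RespawnBackoffSketch {
        abstract Thread respawnCleanerThread(); // assumed: starts a new cleaner thread

        void supervise() throws InterruptedException {
            long backoffMs = 60_000L;               // 1 minute before the 2nd respawn
            final long maxBackoffMs = 30 * 60_000L; // cap: 30 minutes
            while (true) {
                Thread cleaner = respawnCleanerThread();
                cleaner.join();                     // returns when the thread dies
                Thread.sleep(backoffMs);            // back off before respawning
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
    }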
> > > > > > > > > > > > > > > > I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi Stanislav! Thanks for this KIP!
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart. https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration I've only read about this, I haven't personally tried it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.
> > > > > > > > > > > > > > > > > > Some comments:
> > > > > > > > > > > > > > > > > > * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
> > > > > > > > > > > > > > > > > > * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
> > > > > > > > > > > > > > > > > The way the disk failed was normally having an I/O request take a really long time and time out. So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario: if the disk appeared to work for a while but then failed, any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Disks aren't biological systems -- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * For your metrics, can you spell out the full metric in JMX-style format, such as: kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4
> > > > > > > > > > > > > > > > > > * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could be a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having kafka.log:type=LogCleanerManager,topic=topic1,partition=2? I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
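To make the tag idea concrete, a small, hypothetical sketch using plain JMX. The ObjectName follows James's example; the gauge class and its registration are assumptions, not Kafka code:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    public class UncleanableGaugeDemo {
        // Standard MBean convention: interface name = <ClassName>MBean.
        public interface UncleanableGaugeMBean { int getValue(); }
        public static class UncleanableGauge implements UncleanableGaugeMBean {
            public int getValue() { return 1; } // presence marker, per "Value=1"
        }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            // topic and partition become ObjectName properties (tags), rather
            // than being packed into one long metric name.
            server.registerMBean(new UncleanableGauge(), new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=topic1,partition=2"));
            // Monitoring systems can then slice and dice with a pattern query:
            System.out.println(server.queryNames(new ObjectName(
                    "kafka.log:type=LogCleanerManager,name=uncleanable-partition,topic=*,partition=*"), null));
        }
    }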
> > > > > > > > > > > > > > > > > I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug." This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException.
> > > > > > > > > > > > > > > > > In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
> > > > > > > > > > > > > > > > > > - Broker will be offline if all log directories are bad.
> > > > > > > > > > > > > > > > > > - Broker will stop serving replicas in any bad log directory.
> > >
> > > --
> > > Best,
> > > Stanislav
>
> --
> Best,
> Stanislav