Thanks, Niket. +1 (binding)

Colin
On Wed, Aug 3, 2022, at 15:24, Niket Goel wrote:
> Thanks for the explanation, Colin.
>
>> ForceRenounceCount =>
>> kafka.controller:type=KafkaController,name=MetadataErrorCount
>> listener-batch-load-error-count => metadata-load-error-count
>> publisher-error-count => metadata-apply-error-count
>
> Yeah, this makes sense. I have made the naming changes you suggested
> and updated the KIP.
>
> - Niket
>
>
>> On Aug 3, 2022, at 2:00 PM, Colin McCabe <cmcc...@apache.org> wrote:
>>
>> I think there are a few different cases here:
>>
>> 1a. We hit an ApiException PREPARING metadata records on the active
>> controller. This is normal and expected. For example, someone tried
>> to create a topic that already exists. We translate the ApiException
>> to an ApiError and return the appropriate error code to the user.
>> (NotControllerException is also included here in 1a.)
>>
>> 1b. We hit a non-ApiException PREPARING metadata records on the
>> active controller. This is not expected. It would be something like
>> getting a NullPointerException in a createTopics RPC. In this case,
>> we resign leadership and increment your ForceRenounceCount metric.
>>
>> 2. We hit any exception APPLYING metadata records on the active
>> controller. In this case, we just want to exit the process. This
>> prevents the (potentially) bad records from ever being committed to
>> the log, since the active controller applies the records locally
>> before sending them out to its peers.
>>
>> 3. We hit any exception APPLYING metadata records on a standby
>> controller. In this case, we want to increment a metric so that we
>> know this happened, but we don't want to exit the process, in case
>> all the controllers are hitting the same problem. Remember that the
>> standby controller only ever applies committed records.
>>
>> 4a. We hit any exception constructing the MetadataDelta / new
>> MetadataImage on the broker. Increment
>> listener-batch-load-error-count.
>>
>> 4b. We hit any exception applying the MetadataDelta / new
>> MetadataImage on the broker. Increment publisher-error-count.
>>
>> The KIP is solid on 4a and 4b, but we have no metric here that we can
>> use for case #3. The standby controller can't resign because it's not
>> active in the first place. So I would suggest that rather than having
>> a resignation metric, we just have a general controller metadata
>> error metric.
>>
>> How do you feel about:
>>
>> ForceRenounceCount =>
>> kafka.controller:type=KafkaController,name=MetadataErrorCount
>> listener-batch-load-error-count => metadata-load-error-count
>> publisher-error-count => metadata-apply-error-count
>>
>> best,
>> Colin
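A minimal Java sketch of how cases 2 and 3 above might be wired up with
the proposed MetadataErrorCount metric; the class and method names below
are hypothetical stand-ins, not Kafka's actual fault-handler API:

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical handler for metadata apply errors on a controller.
    public class ControllerMetadataFaultHandler {
        // Would back kafka.controller:type=KafkaController,name=MetadataErrorCount.
        private final AtomicLong metadataErrorCount = new AtomicLong(0);

        // Invoked when applying metadata records throws.
        public void handleApplyFault(boolean isActiveController, Throwable t) {
            if (isActiveController) {
                // Case 2: the active controller applies records before
                // sending them to its peers, so exiting here keeps the
                // (potentially) bad records out of the committed log.
                System.exit(1);
            } else {
                // Case 3: a standby only ever applies committed records,
                // so we count the fault and keep running, in case every
                // controller is hitting the same problem.
                metadataErrorCount.incrementAndGet();
            }
        }

        public long metadataErrorCount() {
            return metadataErrorCount.get();
        }
    }

The design point is that a counter, unlike a resignation metric, is
meaningful on a standby, which has no leadership to resign.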
>>
>>
>> On Tue, Aug 2, 2022, at 18:05, Niket Goel wrote:
>>> Thanks for taking the time to go over the KIP, Colin.
>>>
>>> While I agree with both your points about error handling, I think
>>> this KIP focuses on just exposing these errors via the proposed
>>> metrics and does not alter the error handling behavior on either the
>>> brokers or the controllers. The metrics (as proposed) are somewhat
>>> independent of that and should just hook into whatever we are doing.
>>> If we change the error handling, then the metrics should be hooked
>>> into the new code as well.
>>>
>>> Maybe the point you are trying to make is that having the properties
>>> you mention below will generally make it so that errors are not
>>> committed to the log, and hence having the metrics for them is
>>> sufficient (as the errors will likely be transient and
>>> recoverable?).
>>>
>>> Apologies if I did not understand the intent of your comment.
>>>
>>> - Niket
>>>
>>>> On Aug 2, 2022, at 3:34 PM, Colin McCabe <cmcc...@apache.org> wrote:
>>>>
>>>> Hi Niket,
>>>>
>>>> Thanks for the KIP -- much appreciated! The new metrics look very
>>>> useful.
>>>>
>>>> I agree with the proposed error handling for errors on standby
>>>> controllers and brokers. For active controllers, I think we should
>>>> establish two points:
>>>>
>>>> 1. The active controller replays metadata before submitting it to
>>>> the Raft quorum.
>>>> 2. Metadata replay errors on the active controller cause the
>>>> process to exit, prior to attempting to commit the record.
>>>>
>>>> This will allow most of these metadata replay errors to be noticed
>>>> and NOT committed to the metadata log, which I think will make
>>>> things much more robust. Since the controller process can be
>>>> restarted very quickly, it shouldn't be an undue operational
>>>> burden. (It's true that restarts will take longer in combined mode,
>>>> but this kind of tradeoff is integral to combined mode -- you get
>>>> reduced fault isolation in exchange for the lower overhead of one
>>>> fewer JVM process.)
>>>>
>>>> best,
>>>> Colin
>>>>
>>>>
>>>> On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
>>>>> Thanks, Niket.
>>>>>
>>>>> +1 binding from me
>>>>>
>>>>> -David
>>>>>
>>>>> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I would like to start a vote on KIP-859, which adds some new
>>>>>> metrics to KRaft to allow for better visibility into log
>>>>>> processing errors.
>>>>>>
>>>>>> KIP —
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>>>>>>
>>>>>> Discussion Thread —
>>>>>> https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt
>>>>>>
>>>>>> Thanks
>>>>>> Niket
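A minimal Java sketch of the ordering Colin proposes in his two points
above: the active controller replays generated records locally first and
exits on failure, so a bad batch is never handed to the Raft quorum. All
names here are hypothetical stand-ins for the real QuorumController code:

    import java.util.List;
    import java.util.function.Consumer;

    // Hypothetical write path for the active controller.
    public class ActiveControllerWritePath {
        private final Consumer<List<String>> replayLocally; // stand-in for record replay
        private final Consumer<List<String>> submitToRaft;  // stand-in for the Raft append

        public ActiveControllerWritePath(Consumer<List<String>> replayLocally,
                                         Consumer<List<String>> submitToRaft) {
            this.replayLocally = replayLocally;
            this.submitToRaft = submitToRaft;
        }

        public void write(List<String> records) {
            try {
                // Point 1: replay the records locally before Raft sees them.
                replayLocally.accept(records);
            } catch (Throwable t) {
                // Point 2: exit before the records can be committed; the
                // process restarts quickly and the metadata log stays clean.
                System.exit(1);
            }
            // Only records that replayed cleanly reach the quorum.
            submitToRaft.accept(records);
        }
    }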