Thanks, Niket. +1 (binding)

Colin

On Wed, Aug 3, 2022, at 15:24, Niket Goel wrote:
> Thanks for the explanation Colin.
>
>> ForceRenounceCount => 
>> kafka.controller:type=KafkaController,name=MetadataErrorCount
>> publisher-error-count => metadata-load-error-count
>> listener-batch-load-error-count => metadata-apply-error-count
>
> Yeah, this makes sense. I have made the naming changes you suggested 
> and updated the KIP.
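>
> In case it helps reviewers, here is roughly what the controller-side 
> gauge would look like (just a sketch against the Yammer registry that 
> the existing KafkaController metrics use; the class and counter wiring 
> here are illustrative, not the actual patch). The two broker-side 
> counters (metadata-load-error-count and metadata-apply-error-count) 
> would be registered in the same spirit off the broker's metadata 
> listener.
>
>     import java.util.concurrent.atomic.AtomicLong;
>
>     import com.yammer.metrics.core.Gauge;
>     import com.yammer.metrics.core.MetricName;
>     import com.yammer.metrics.core.MetricsRegistry;
>
>     public class MetadataErrorMetrics {
>         private final AtomicLong metadataErrors = new AtomicLong(0);
>
>         public MetadataErrorMetrics(MetricsRegistry registry) {
>             // Registers kafka.controller:type=KafkaController,name=MetadataErrorCount
>             registry.newGauge(
>                 new MetricName("kafka.controller", "KafkaController",
>                     "MetadataErrorCount"),
>                 new Gauge<Long>() {
>                     @Override
>                     public Long value() {
>                         return metadataErrors.get();
>                     }
>                 });
>         }
>
>         // Bumped by the fault handlers whenever a controller hits an
>         // unexpected metadata error.
>         public void incrementMetadataErrorCount() {
>             metadataErrors.incrementAndGet();
>         }
>     }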
>
> - Niket
>
>
>> On Aug 3, 2022, at 2:00 PM, Colin McCabe <cmcc...@apache.org> wrote:
>> 
>> I think there are a few different cases here:
>> 
>> 1a. We hit an ApiException PREPARING metadata records on the active 
>> controller. This is normal and expected. For example, someone tried to 
>> create a topic that already exists. We translate the ApiException to an 
>> ApiError and return the appropriate error code to the user. 
>> (NotControllerException is also included here in 1a.)
>> 
>> 1b. We hit a non-ApiException PREPARING metadata records on the active 
>> controller. This is not expected. It would be something like we get a 
>> NullPointerException in a createTopics RPC. In this case, we resign 
>> leadership and increment your ForceRenounceCount metric.
>> 
>> 2. We hit any exception APPLYING metadata records on the active controller. 
>> In this case, we just want to exit the process. This prevents the 
>> (potentially) bad records from ever being committed to the log, since the 
>> active controller applies the records locally before sending them out to its 
>> peers.
>> 
>> 3. We hit any exception APPLYING metadata records on a standby controller. 
>> In this case, we want to increment a metric so that we know that this 
>> happened. But we don't want to exit the process, in case all the controllers 
>> are hitting the same problem. Remember that the standby controller only ever 
>> applies committed records.
>> 
>> 4a. We hit any exception constructing the MetadataDelta / new MetadataImage 
>> on the broker. Increment listener-batch-load-error-count.
>> 
>> 4b. We hit any exception applying the MetadataDelta / new MetadataImage on 
>> the broker. Increment publisher-error-count.
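>> 
>> To make the case analysis concrete, the controller-side dispatch would 
>> look roughly like this. This is just a sketch -- resign(), isActive(), 
>> and incrementMetadataErrorCount() are stand-ins, not the real 
>> QuorumController code:
>> 
>>     import java.util.concurrent.CompletableFuture;
>> 
>>     import org.apache.kafka.common.errors.ApiException;
>>     import org.apache.kafka.common.requests.ApiError;
>>     import org.apache.kafka.common.utils.Exit;
>> 
>>     abstract class ControllerErrorDispatch {
>>         abstract boolean isActive();
>>         abstract void resign();
>>         abstract void incrementMetadataErrorCount();
>> 
>>         // Cases 1a/1b: an error while PREPARING records on the active.
>>         void handlePrepareError(Throwable t, CompletableFuture<ApiError> out) {
>>             if (t instanceof ApiException) {
>>                 // 1a: expected; translate to an error code for the caller.
>>                 out.complete(ApiError.fromThrowable(t));
>>             } else {
>>                 // 1b: unexpected; count the error and resign leadership.
>>                 incrementMetadataErrorCount();
>>                 resign();
>>             }
>>         }
>> 
>>         // Cases 2/3: an error while APPLYING records on a controller.
>>         void handleApplyError(Throwable t) {
>>             if (isActive()) {
>>                 // 2: the active applies records before they are committed,
>>                 // so exiting here keeps the bad records out of the log.
>>                 Exit.exit(1);
>>             } else {
>>                 // 3: standbys only apply committed records; count the error
>>                 // but stay up, since every standby may hit the same bug.
>>                 incrementMetadataErrorCount();
>>             }
>>         }
>>     }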
>> 
>> The KIP is solid on 4a and 4b, but we have no metric here that we can use 
>> for case #3. The standby controller can't resign because it's not active in 
>> the first place. So I would suggest rather than having a resignation metric, 
>> we just have a general controller metadata error metric.
>> 
>> How do you feel about:
>> 
>> ForceRenounceCount => 
>> kafka.controller:type=KafkaController,name=MetadataErrorCount
>> publisher-error-count => metadata-load-error-count
>> listener-batch-load-error-count => metadata-apply-error-count
>> 
>> best,
>> Colin
>> 
>> 
>> On Tue, Aug 2, 2022, at 18:05, Niket Goel wrote:
>>> Thanks for taking the time to go over the KIP Colin.
>>> 
>>> While I agree with both your points about error handling, I think this 
>>> KIP focuses on just exposing these errors via the proposed metrics and 
>>> does not alter the error handling behavior on either the brokers or the 
>>> controllers. The metrics (as proposed) are somewhat independent of that 
>>> and should just hook into whatever we are doing. If we change the error 
>>> handling then the metrics should be hooked into the new code as well.
>>> 
>>> Maybe the point you are trying to make is that having the properties 
>>> you mention below will generally ensure that errors are not 
>>> committed to the log, and hence having the metrics for these is 
>>> sufficient (as the errors will likely be transient and recoverable?).
>>> 
>>> Apologies if I did not understand the intent of your comment.
>>> 
>>> - Niket
>>> 
>>>> On Aug 2, 2022, at 3:34 PM, Colin McCabe <cmcc...@apache.org> wrote:
>>>> 
>>>> Hi Niket,
>>>> 
>>>> Thanks for the KIP -- much appreciated! The new metrics look very useful.
>>>> 
>>>> I agree with the proposed error handling for errors on standby controllers 
>>>> and brokers. For active controllers, I think we should establish two 
>>>> points:
>>>> 
>>>> 1. the active controller replays metadata before submitting it to the Raft 
>>>> quorum
>>>> 2. metadata replay errors on the active cause the process to exit, prior 
>>>> to attempting to commit the record
>>>> 
>>>> This will allow most of these metadata replay errors to be noticed and NOT 
>>>> committed to the metadata log, which I think will make things much more 
>>>> robust. Since the controller process can be restarted very quickly, it 
>>>> shouldn't be an undue operational burden. (It's true that when in combined 
>>>> mode, restarts will take longer, but this kind of tradeoff is integral to 
>>>> combined mode -- you get reduced fault isolation in exchange for the lower 
>>>> overhead of one fewer JVM process).
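>>>> 
>>>> A minimal sketch of the ordering in points 1 and 2, assuming the 
>>>> RaftClient#scheduleAppend API (the replay() hook and the class around 
>>>> it are illustrative, not the real code path):
>>>> 
>>>>     import java.util.List;
>>>> 
>>>>     import org.apache.kafka.common.utils.Exit;
>>>>     import org.apache.kafka.raft.RaftClient;
>>>>     import org.apache.kafka.server.common.ApiMessageAndVersion;
>>>> 
>>>>     abstract class ActiveControllerWrite {
>>>>         private final RaftClient<ApiMessageAndVersion> raftClient;
>>>>         private final int epoch;
>>>> 
>>>>         ActiveControllerWrite(RaftClient<ApiMessageAndVersion> raftClient,
>>>>                               int epoch) {
>>>>             this.raftClient = raftClient;
>>>>             this.epoch = epoch;
>>>>         }
>>>> 
>>>>         // Applies one record to the in-memory metadata state machine.
>>>>         abstract void replay(ApiMessageAndVersion record);
>>>> 
>>>>         void commit(List<ApiMessageAndVersion> records) {
>>>>             try {
>>>>                 // Point 1: replay on the active before handing the
>>>>                 // records to the Raft quorum.
>>>>                 for (ApiMessageAndVersion record : records) {
>>>>                     replay(record);
>>>>                 }
>>>>             } catch (Throwable t) {
>>>>                 // Point 2: a replay error exits the process before the
>>>>                 // records can ever be committed to the metadata log.
>>>>                 Exit.exit(1);
>>>>                 return;
>>>>             }
>>>>             raftClient.scheduleAppend(epoch, records);
>>>>         }
>>>>     }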
>>>> 
>>>> best,
>>>> Colin
>>>> 
>>>> 
>>>> On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
>>>>> Thanks, Niket.
>>>>> 
>>>>> +1 binding from me
>>>>> 
>>>>> -David
>>>>> 
>>>>> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid> 
>>>>> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I would like to start a vote on KIP-859 which adds some new metrics to 
>>>>>> KRaft to allow for better visibility into log processing errors.
>>>>>> 
>>>>>> KIP — 
>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>>>>>> Discussion Thread — 
>>>>>> https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt
>>>>>> 
>>>>>> Thanks
>>>>>> Niket
>>>>>>
