Hi Niket,

Thanks for the KIP -- much appreciated! The new metrics look very useful.

I agree with the proposed error handling for errors on standby controllers and 
brokers. For active controllers, I think we should establish two points:

1. the active controller replays metadata before submitting it to the Raft 
quorum
2. metadata replay errors on the active controller cause the process to exit 
before it attempts to commit the record

This will allow most of these metadata replay errors to be noticed and NOT 
committed to the metadata log, which I think will make things much more robust. 
Since the controller process can be restarted very quickly, it shouldn't be an 
undue operational burden. (It's true that when in combined mode, restarts will 
take longer, but this kind of tradeoff is integral to combined mode -- you get 
reduced fault isolation in exchange for the lower overhead of one fewer JVM 
process).
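
To make the ordering in the two points above concrete, here is a rough sketch in Java. It is not the actual controller code: every class and method name in it (ControllerState, FakeRaftLog, FaultHandler, replayThenSubmit) is made up for illustration, and the "invalid record" rule is an assumption chosen only so that a replay can fail. The fault handler is injected so the sketch can be exercised without killing the JVM; in a real process it would presumably terminate the controller.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReplayBeforeCommitSketch {
    /** Hypothetical metadata record: sets the partition count of a topic. */
    static final class TopicRecord {
        final String topic;
        final int partitions;
        TopicRecord(String topic, int partitions) {
            this.topic = topic;
            this.partitions = partitions;
        }
    }

    /** Hypothetical stand-in for the Raft client; it only remembers what was submitted. */
    static final class FakeRaftLog {
        final List<TopicRecord> submitted = new ArrayList<>();
        void submit(TopicRecord record) {
            submitted.add(record);
        }
    }

    /** Hypothetical in-memory controller state that records are replayed into. */
    static final class ControllerState {
        final Map<String, Integer> topics = new HashMap<>();
        void replay(TopicRecord record) {
            // Assumed validity rule, present only so that a replay can fail.
            if (record.partitions <= 0) {
                throw new IllegalStateException("invalid partition count for " + record.topic);
            }
            topics.put(record.topic, record.partitions);
        }
    }

    /** Hypothetical hook; a real controller would likely exit the process here. */
    interface FaultHandler {
        void handleFault(String message, Throwable cause);
    }

    /**
     * Point 1: replay the record locally before handing it to the Raft quorum.
     * Point 2: if the replay throws, report a fatal fault and never submit the record.
     * Returns true only if the record was submitted.
     */
    static boolean replayThenSubmit(ControllerState state, FakeRaftLog log,
                                    TopicRecord record, FaultHandler faults) {
        try {
            state.replay(record);
        } catch (RuntimeException e) {
            faults.handleFault("metadata replay failed on active controller", e);
            return false;
        }
        log.submit(record);
        return true;
    }
}
```

With this ordering, a record that cannot be replayed never reaches the log, so the standby controllers and brokers never see it. Only the active controller has to restart.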

best,
Colin


On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
> Thanks, Niket.
>
> +1 binding from me
>
> -David
>
> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid> wrote:
>>
>> Hi all,
>>
>> I would like to start a vote on KIP-859 which adds some new metrics to KRaft 
>> to allow for better visibility into log processing errors.
>>
>> KIP:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>> Discussion Thread — 
>> https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt
>>
>> Thanks
>> Niket
>>
