Hi Niket,

Thanks for the KIP -- much appreciated! The new metrics look very useful.

I agree with the proposed error handling for errors on standby controllers and brokers. For active controllers, I think we should establish two points:

1. the active controller replays metadata before submitting it to the Raft quorum
2. metadata replay errors on the active cause the process to exit, prior to attempting to commit the record

This will allow most of these metadata replay errors to be noticed and NOT committed to the metadata log, which I think will make things much more robust. Since the controller process can be restarted very quickly, it shouldn't be an undue operational burden. (It's true that when in combined mode, restarts will take longer, but this kind of tradeoff is integral to combined mode -- you get reduced fault isolation in exchange for the lower overhead of one fewer JVM process).

best,
Colin

On Mon, Aug 1, 2022, at 18:05, David Arthur wrote:
> Thanks, Niket.
>
> +1 binding from me
>
> -David
>
> On Mon, Aug 1, 2022 at 8:15 PM Niket Goel <ng...@confluent.io.invalid> wrote:
>>
>> Hi all,
>>
>> I would like to start a vote on KIP-859 which adds some new metrics to KRaft
>> to allow for better visibility into log processing errors.
>>
>> KIP -
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-859%3A+Add+Metadata+Log+Processing+Error+Related+Metrics
>>
>> Discussion Thread -
>> https://lists.apache.org/thread/yl87h1s484yc09yjo1no46hwpbv0qkwt
>>
>> Thanks
>> Niket
>>
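
For illustration, a minimal sketch of the two points proposed above for the active controller: replay the record locally first, and treat a replay error as fatal before anything is submitted to the quorum. All class and method names here are hypothetical stand-ins, not Kafka's actual controller or Raft client API, and the fault handler only records the fault rather than terminating the JVM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types are Kafka's real classes.
final class ReplayBeforeCommitSketch {

    /** Stand-in for a metadata record: sets the partition count of a topic. */
    static final class TopicRecord {
        final String topic;
        final int partitions;

        TopicRecord(String topic, int partitions) {
            this.topic = topic;
            this.partitions = partitions;
        }
    }

    /** Stand-in for the Raft quorum client; it only collects submitted records. */
    static final class FakeQuorum {
        final List<TopicRecord> submitted = new ArrayList<>();

        void submit(TopicRecord record) {
            submitted.add(record);
        }
    }

    /** Stand-in for the fatal-error action (in a real process, exiting). */
    interface FaultHandler {
        void handleFault(String reason, Throwable cause);
    }

    static final class ActiveController {
        private final Map<String, Integer> state = new HashMap<>();
        private final FakeQuorum quorum;
        private final FaultHandler faultHandler;

        ActiveController(FakeQuorum quorum, FaultHandler faultHandler) {
            this.quorum = quorum;
            this.faultHandler = faultHandler;
        }

        /** Returns true if the record was replayed locally and then submitted. */
        boolean write(TopicRecord record) {
            try {
                replay(record); // point 1: replay before submitting to the quorum
            } catch (RuntimeException e) {
                // point 2: a replay error is fatal and happens before any commit attempt
                faultHandler.handleFault("metadata replay failed", e);
                return false; // the bad record never reaches the quorum
            }
            quorum.submit(record);
            return true;
        }

        Integer partitionsOf(String topic) {
            return state.get(topic);
        }

        /** Applies the record to in-memory state; rejects an invalid partition count. */
        private void replay(TopicRecord record) {
            if (record.partitions <= 0) {
                throw new IllegalStateException("invalid partition count for " + record.topic);
            }
            state.put(record.topic, record.partitions);
        }
    }
}
```

In a real controller the fault handler would terminate the process, per the second point; keeping it pluggable in the sketch just makes the ordering easy to exercise without exiting the JVM.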