Hi Luke,
Thanks for your questions.

1. That's a good question. If we treated metadata.recovery.strategy=NONE and 
metadata.cluster.check.enable=true as conflicting configurations, an 
application that configured the former prior to KIP-1242 would fail when it 
upgraded to a KIP-1242 client. That might be acceptable at a major version 
bump, but I don't think it is at a minor version bump. I think there is value 
in checking by default without waiting for AK 5.0, so I don't want to make the 
configurations conflict.

I suggest that disabling the check when metadata.recovery.strategy=NONE 
(non-default since AK 4.0) is the easiest path forward. What do you think?
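To make that precedence concrete, here's a minimal sketch in Java. The config names come from the KIP as discussed in this thread, but the class and method names are mine for illustration; this is not the actual client code, just the shape of the behaviour I'm proposing:

```java
// Hypothetical illustration of the precedence proposed above: the cluster
// check is silently ineffective when rebootstrapping is disabled, instead of
// the two configurations being rejected as conflicting.
public class ClusterCheckResolution {

    // metadata.recovery.strategy=NONE disables rebootstrapping, and with it
    // the cluster check, regardless of metadata.cluster.check.enable.
    public static boolean effectiveClusterCheck(String recoveryStrategy,
                                                boolean checkEnabled) {
        if ("none".equalsIgnoreCase(recoveryStrategy)) {
            return false;
        }
        return checkEnabled;
    }
}
```

With this shape, an application that already sets metadata.recovery.strategy=NONE today keeps working unchanged when it picks up a KIP-1242 client on a minor version upgrade.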

2. If the broker doesn't support ApiVersions v5 or later, the check will have 
no effect. Our experience with logging "helpful" information as part of 
KIP-714 is that it soon comes to be seen as annoying, so I propose to document 
that the cluster check only has an effect when connecting to brokers that 
support ApiVersions v5 or later (hopefully AK 4.4), and not to log anything.
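As a sketch of that gating, again in Java with names of my own invention (the only assumption carried over from the KIP discussion is that the new fields arrive with ApiVersions v5):

```java
// Hypothetical illustration: the cluster check is simply skipped, without any
// log message, when the connected broker's ApiVersions support tops out
// below v5.
public class ClusterCheckGate {

    public static boolean clusterCheckApplies(short brokerMaxApiVersions,
                                              boolean checkEnabled) {
        // Older brokers (< v5) cannot carry the new fields, so the check has
        // no effect; the behaviour is documented rather than logged.
        return checkEnabled && brokerMaxApiVersions >= 5;
    }
}
```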


I have made updates to the KIP. Please take a look.

Thanks,
Andrew

On 2026/03/12 06:03:02 Luke Chen wrote:
> Hi Andrew,
> 
> Thanks for the KIP!
> 
> Questions:
> 1. If the client doesn't enable rebootstrapping (i.e.
> metadata.recovery.strategy=NONE), what will happen if
> `metadata.cluster.check.enable=true` and a REBOOTSTRAP_REQUIRED error is
> received? Should we fail fast when users use this config combination?
> 
> 2. We set `metadata.cluster.check.enable` to true by default now. What
> happens if the broker doesn't support the ApiVersions v5 or later? It
> should have no effect, right? Should we document or log something about it?
> Otherwise, the config will confuse users.
> 
> Thank you,
> Luke
> 
> On Mon, Mar 9, 2026 at 8:55 AM Gaurav Narula <[email protected]> wrote:
> 
> > Hi Andrew,
> >
> > Thank you for the KIP. I welcome the suggestion as I've run into a version
> > of this problem in the past which I’d like to share for posterity.
> >
> > I've run into situations where requests to the controller sent via
> > NodeToControllerChannelManagerImpl failed with authentication exceptions.
> > On debugging it was found that NodeToControllerChannelManagerImpl cached
> > the controller node whose advertised address had changed and the cached
> > entry referred to a node in another cluster. My fix then was to propose
> > https://github.com/apache/kafka/pull/14760 but there quite likely exists
> > a gap for situations where users don’t face an auth exception (perhaps
> > clusters share auth?) in the same code path. I believe this KIP should
> > close that gap and allow for better error handling of such scenarios.
> >
> > Thanks once again!
> >
> > Regards,
> > Gaurav
> >
> > > On 3 Mar 2026, at 10:52, Andrew Schofield <[email protected]> wrote:
> > >
> > > Thinking about this some more, I have changed the error code on receipt
> > > of an incorrect cluster ID to REBOOTSTRAP_REQUIRED, matching incorrect
> > > node ID. This is because I have heard of situations in which people use
> > > rebootstrapping to switch clusters for recovery purposes, so it's
> > > important that a retriable error is used. Logging on client and server
> > > will indicate when the checks fail, so the KIP's aim of making
> > > misconfiguration diagnosis easier will be satisfied while making the
> > > clients tolerant of intentional changes which should drive
> > > rebootstrapping.
> > >
> > > Unless there are further comments, I will start voting on this KIP next
> > > week.
> > >
> > > Thanks,
> > > Andrew
> > >
> > > On 2026/03/02 18:36:07 Rajini Sivaram wrote:
> > >> Hi Andrew,
> > >>
> > >> Thanks for the update, looks good.
> > >>
> > >> Regards,
> > >>
> > >> Rajini
> > >>
> > >> On Mon, Mar 2, 2026 at 1:57 PM Andrew Schofield <[email protected]>
> > >> wrote:
> > >>
> > >>> Hi Rajini,
> > >>> Thanks for your comments.
> > >>>
> > >>> I have changed the KIP such that the client discards cluster ID and
> > >>> node information when rebootstrapping begins.
> > >>>
> > >>> I have also added a common client configuration to disable sending of
> > >>> the cluster ID and node ID information, just in case there is a
> > >>> situation in which the assumptions behind this KIP do not apply to an
> > >>> existing deployment.
> > >>>
> > >>> Thanks,
> > >>> Andrew
> > >>>
> > >>> On 2026/03/02 12:09:46 Rajini Sivaram wrote:
> > >>>> Hi Andrew,
> > >>>>
> > >>>> Thanks for the KIP.
> > >>>>
> > >>>> The KIP says:
> > >>>> If the client is bootstrapping, it does not supply ClusterId or
> > >>>> NodeId. After bootstrapping, during which it learns the information
> > >>>> from its initial Metadata response, it supplies both.
> > >>>>
> > >>>> It will be good to clarify the behaviour during re-bootstrapping. We
> > >>>> clear the current metadata during re-bootstrap and revert to bootstrap
> > >>>> metadata. At this point, we don't retain node ids or cluster id from
> > >>>> previous metadata responses. I think this makes sense because we want
> > >>>> re-bootstrapping to behave in the same way as the first bootstrap. If
> > >>>> we retain this behaviour, validation of cluster id and node-id will be
> > >>>> based on the Metadata response of the last bootstrap, which is not
> > >>>> necessarily the initial Metadata response. I think this is the desired
> > >>>> behaviour, can we clarify in the KIP?
> > >>>>
> > >>>> Kafka clients have always supported cluster id change without
> > >>>> requiring restart. Do we need an opt-out in case some deployments
> > >>>> rely on this feature? If re-bootstrapping is enabled, clients would
> > >>>> re-bootstrap if connections consistently fail. So as long as we
> > >>>> continue to clear old metadata on re-bootstrap, we should be fine.
> > >>>> Not sure if we need an explicit opt-out for the case where
> > >>>> re-bootstrapping is disabled.
> > >>>>
> > >>>> Thanks,
> > >>>>
> > >>>> Rajini
> > >>>>
> > >>>>
> > >>>> On Thu, Feb 12, 2026 at 1:43 PM Andrew Schofield
> > >>>> <[email protected]> wrote:
> > >>>>
> > >>>>> Hi David,
> > >>>>> Thanks for your question.
> > >>>>>
> > >>>>> Here's one elderly JIRA I've unearthed which is related
> > >>>>> https://issues.apache.org/jira/browse/KAFKA-15828.
> > >>>>>
> > >>>>> I am also aware of suspected problems in the networking for cloud
> > >>>>> providers which occasionally seem to route connections to the wrong
> > >>>>> place.
> > >>>>>
> > >>>>> The KIP is aiming to get some basic diagnosis and recovery into the
> > >>>>> Kafka protocol where today there is none. As you can imagine, there
> > >>>>> is total mayhem when a client confidently thinks it's talking to one
> > >>>>> broker when actually it's talking to quite another. Diagnosis of this
> > >>>>> kind of problem would really help in getting to the bottom of rare
> > >>>>> issues such as these.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Andrew
> > >>>>>
> > >>>>> On 2026/02/11 16:12:50 David Arthur wrote:
> > >>>>>> Thanks for the KIP, Andrew. I'm all for making the client more
> > robust
> > >>>>>> against networking and deployment weirdness
> > >>>>>>
> > >>>>>> I'm not sure I fully grok the scenario you are covering here. It
> > >>>>>> sounds like you're guarding against a hostname being reused by a
> > >>>>>> different broker. Does the client not learn about the new broker
> > >>>>>> hostnames when it refreshes metadata periodically?
> > >>>>>>
> > >>>>>> -David
> > >>>>>>
> > >>>>>> On Thu, Nov 20, 2025 at 5:59 AM Andrew Schofield
> > >>>>>> <[email protected]> wrote:
> > >>>>>>
> > >>>>>>> Hi,
> > >>>>>>> I would like to start discussion of a new KIP for detecting and
> > >>>>>>> handling misrouted connections from Kafka clients. The Kafka
> > >>>>>>> protocol does not contain any information for working out when the
> > >>>>>>> broker metadata information in a client is inconsistent or stale.
> > >>>>>>> This KIP proposes a way to address this.
> > >>>>>>>
> > >>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1242%3A+Detection+and+handling+of+misrouted+connections
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Andrew
> > >>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> --
> > >>>>>> David Arthur
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
> 
