Hi all, The KIP has been around for some time, and I've updated the document according to the previous comments. Here is the outline of the proposed changes: 1. Adding a timeout configuration. 2. Adding a new exception type 3. Adding a WARN level log message
The obvious changes on the client side are: 1. They won't attempt to resolve for DNS at the constructor level 2. They will try to bootstrap once network client poll is invoked. In terms of client-side behavior: *Consumer*: Either the poll timer runs out and returns an empty record, or BootstrapConnectionException thrown *Producer*: API calls will be blocked on waitOnMetadata until the timer runs out, either the max block ms, or the bootstrap timer. *Admin*: The API calls timeout if API timeout expires first; otherwise, it throws a BootstrapConnectionException. Let me know your thoughts: I would like to start voting in a week or so. Link: https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+DNS+Resolution+Failure+Should+Not+Fail+the+Clients Thanks, P On Thu, Mar 23, 2023 at 12:59 PM Philip Nee <philip...@gmail.com> wrote: > Hey Kirk, > > Sorry about omitting your response; it slipped through the cracks... > > *I’m not sure if producers and consumers likewise do DNS resolution in > their constructors?* > Yes, both producer and consumer bootstrap in the constructor. > > *I agree that moving the DNS resolution to poll(), it would be hard to > distinguish hard failures (host name resolution) from transient network > issues. After all, the DNS resolution issue we saw was technically a > transient issue.* > I'm amending a timeout configuration in my KIP because that should resolve > any transient issues if it doesn't persist for too long. If it is a hard > failure (like, misconfiguration of the host name), it should be logged and > errored out after the expiration. > > P > > On Wed, Mar 8, 2023 at 8:12 AM Kirk True <k...@kirktrue.pro> wrote: > >> Hi Philip, >> >> I’m understanding the options proposed as consisting of these questions: >> >> Should we throw an Exception or not? >> Where should the DNS resolution/bootstrapping occur—in the constructor, >> poll, or somewhere else? >> Should there be a timeout, and if so, what configuration drives it? >> >> We were seeing instances of this issue when constructing KafkaAdminClient >> instances. The constructor performs the DNS lookup in that client. I’m not >> sure if producers and consumers likewise do DNS resolution in their >> constructors? >> >> I agree that moving the DNS resolution to poll(), it would be hard to >> distinguish hard failures (host name resolution) from transient network >> issues. After all, the DNS resolution issue we saw was technically a >> transient issue. >> >> Is this an accurate summary of the current thinking: >> >> Per Jason's suggestion, introduce a new BootstrapConnectionException >> Per Chris’ suggestion, introduce a new >> API—performInitialDnsResolution()—that an application developer can call to >> perform a fast-fail check, throwing BootstrapConnectionException >> Otherwise, move the DNS resolution to poll() >> Handle BootstrapConnectionException failures in poll() similar to how we >> handle NetworkException today, with retries, timeouts, etc., i.e. we don’t >> introduce any new configuration >> Improve logging to distinguish the DNS resolution case >> >> Thanks, >> Kirk >> >> > On Mar 6, 2023, at 9:15 AM, Philip Nee <philip...@gmail.com> wrote: >> > >> > Cheers Kafka Community, >> > >> > I just wanna give this thread bump, as it has been a bit quiet for the >> past >> > week. I have not updated the KIP based on Chris and Jason's feedback, >> as I >> > would also like to know more about what do people think. >> > >> > Jason - Thanks for the suggestion, I think your suggestion makes a lot >> of >> > sense. >> > >> > Thanks! >> > P >> > >> > On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson >> <ja...@confluent.io.invalid> >> > wrote: >> > >> >> Hi Philip, >> >> >> >>> Having an overall timeout also seems reasonable, but I wonder what >> should >> >> the client do after running out of the time? Should we throw a >> >> non-retriable exception (instead of TimeoutExceptoin to stop the client >> >> from retrying) and alert the user to examine the config and the DNS >> server? >> >> >> >> Yeah, not sure exactly. I'd probably suggest a >> >> `BootstrapConnectionException` or something like that with a clear >> message >> >> indicating the problem. What the user does with it is up to them, but >> at >> >> least it gives them the option to fail their application if that is >> what >> >> they prefer to do in this case. If they catch it and ignore it, I would >> >> expect the client to just continue retrying. Logging for bootstrap >> >> dns/connection failures will be helpful in any case. >> >> >> >> -Jason >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <philip...@gmail.com> >> wrote: >> >> >> >>> Jason: >> >>> Thanks for the feedback. Now giving it a second thought, I think your >> >>> suggestion of logging the error might make sense, as I could imagine >> most >> >>> users would just continue to retry, so it might not be necessary to >> throw >> >>> an exception anyway. >> >>> Having an overall timeout also seems reasonable, but I wonder what >> should >> >>> the client do after running out of the time? Should we throw a >> >>> non-retriable exception (instead of TimeoutExceptoin to stop the >> client >> >>> from retrying) and alert the user to examine the config and the DNS >> >> server? >> >>> >> >>> Chris: >> >>> I feel I still haven't answered your question about the pre-flight >> check, >> >>> as it seems exposing an API might be harder to push through. >> >>> >> >>> Thanks! >> >>> P >> >>> >> >>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson >> >>> <ja...@confluent.io.invalid> >> >>> wrote: >> >>> >> >>>> One more random thought I had just as I pushed send. We're currently >> >>>> treating this problem somewhat narrowly by focusing only on the DNS >> >>>> resolution of the bootstrap servers. Even if the servers resolve, >> >> there's >> >>>> no guarantee that they are reachable by the client. Would it make >> sense >> >>> to >> >>>> have a timeout which bounds the total time that the client should >> wait >> >> to >> >>>> connect to the bootstrap servers? Something like ` >> >>>> bootstrap.servers.connection.timeout.ms`. >> >>>> >> >>>> -Jason >> >>>> >> >>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io >> > >> >>>> wrote: >> >>>> >> >>>>> Hi Philip, >> >>>>> >> >>>>> An alternative is not to fail at all. Every other network error is >> >>> caught >> >>>>> and handled internally in the client. We see this case as different >> >>>> because >> >>>>> a DNS resolution error may imply misconfiguration. Could it also >> >> imply >> >>>> that >> >>>>> the DNS server is unavailable? I'm not sure why that case should be >> >>>> handled >> >>>>> differently than if the bootstrap servers themselves are >> unavailable. >> >>>> Would >> >>>>> it be enough to log a clear warning in the logs if the bootstrap >> >>> servers >> >>>>> could not resolve? >> >>>>> >> >>>>> On the whole, the current fail-fast approach probably does more good >> >>> than >> >>>>> bad, but it does seem somewhat inconsistent overall and my guess is >> >>> that >> >>>>> dynamic environments will become increasingly common. It would be >> >> nice >> >>> to >> >>>>> have a reasonable default behavior which could handle these cases >> >>>>> gracefully without any additional logic. In any case, it would be >> >> nice >> >>> to >> >>>>> see this option in the rejected alternatives at least if we do not >> >> take >> >>>> it. >> >>>>> >> >>>>> If we want to take the route of throwing an exception, then I think >> >>> we're >> >>>>> probably going to need a new configuration since I can't see what a >> >>>>> reasonable timeout we would use as a default. The benefit of a >> >>>>> configuration is that it would let us retain the current default >> >>> behavior >> >>>>> with timeout effectively set to 0 and it would also let users >> >>> effectively >> >>>>> disable the timeout by using a very large value. Otherwise, it seems >> >>>> like a >> >>>>> potential compatibility break to have a new exception type thrown at >> >>> some >> >>>>> arbitrary time without giving the user any control over it. >> >>>>> >> >>>>> Thanks, >> >>>>> Jason >> >>>>> >> >>>>> >> >>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton >> >> <chr...@aiven.io.invalid >> >>>> >> >>>>> wrote: >> >>>>> >> >>>>>> Hi Philip, >> >>>>>> >> >>>>>> Yeah, it's basically DNS resolution we're talking about, though >> >>> there's >> >>>>>> some additional subtlety there with the logic introduced by KIP-235 >> >>> [1]. >> >>>>>> Essentially it should cover any scenario that causes a client >> >>>> constructor >> >>>>>> to fail with the current logic but would not after this KIP is >> >>> released. >> >>>>>> >> >>>>>> We can generalize the Connect use case like this: a client >> >> application >> >>>>>> that >> >>>>>> may connect to different Kafka clusters with a public-facing, >> >>>> easy-to-use >> >>>>>> API for restarting failed tasks and automatic handling of retriable >> >>>>>> exceptions. The ease with which failed tasks can be restarted is >> >>>>>> significant because it reduces the cost of failing on non-retriable >> >>>>>> exceptions and makes fail-fast behavior easier to work with. And, >> in >> >>>> cases >> >>>>>> like this where we can't really know whether the error we're >> dealing >> >>>> with >> >>>>>> is retriable or not, it's better IMO to continue to allow >> >> applications >> >>>>>> like >> >>>>>> these to fail fast. I do agree that it'd be nice to get input from >> >> the >> >>>>>> community, though. >> >>>>>> >> >>>>>> I was toying with the idea of a NetworkException subclass too. It's >> >> a >> >>>>>> simpler API, but it doesn't allow for preflight validation, which >> >> can >> >>> be >> >>>>>> useful in scenarios where submitting new configurations for client >> >>>>>> applications is expensive in terms of time or resources. Then >> >> again, I >> >>>>>> don't see why the two are mutually exclusive, and we might opt to >> >> use >> >>>> the >> >>>>>> NetworkException subclass in this KIP and pursue an opt-in >> >> validation >> >>>> API >> >>>>>> later on. Thoughts? >> >>>>>> >> >>>>>> [1] - >> >>>>>> >> >>>>>> >> >>>> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection >> >>>>>> >> >>>>>> Cheers, >> >>>>>> >> >>>>>> Chris >> >>>>>> >> >>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <philip...@gmail.com> >> >>> wrote: >> >>>>>> >> >>>>>>> Hey Chris, >> >>>>>>> >> >>>>>>> Thanks again for the feedback! >> >>>>>>> >> >>>>>>> >> >>>>>>> For the preflight DNS check (are we basically trying to resolve >> >> the >> >>>> DNS >> >>>>>>> there?): Maybe it makes more sense to add it to the Config >> >> modules? >> >>> I >> >>>>>> would >> >>>>>>> like to hear what the community says as I'm not familiar with the >> >>>>>> Connect >> >>>>>>> use case. >> >>>>>>> >> >>>>>>> A "slower failing" alternative - I wonder if it makes sense for us >> >>> to >> >>>>>>> extend the NetworkException so that clients can be smarter at >> >>> handling >> >>>>>>> these exceptions. Of course, it is still retriable and requires >> >>>> polling >> >>>>>> the >> >>>>>>> consumer, but then we can distinguish the DNS resolution error >> >> from >> >>>>>> other >> >>>>>>> network errors. >> >>>>>>> >> >>>>>>> Thanks! >> >>>>>>> P >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton >> >>> <chr...@aiven.io.invalid >> >>>>> >> >>>>>>> wrote: >> >>>>>>> >> >>>>>>>> Hi Philip, >> >>>>>>>> >> >>>>>>>> Yeah, "DNS resolution should occur..." seems like a better fit. >> >>> 👍 >> >>>>>>>> >> >>>>>>>> One other question I have is whether we should expose some kind >> >> of >> >>>>>> public >> >>>>>>>> API for performing preflight validation of the bootstrap URLs. >> >> If >> >>> we >> >>>>>>> change >> >>>>>>>> the behavior of a client configured with a silly typo (e.g., >> >>>>>>>> "loclahost instead of localhost") from failing in the >> >> constructor >> >>> to >> >>>>>>>> failing with a retriable exception, this might lead some client >> >>>>>>>> applications to handle that failure by, well, retrying. For >> >>>> reference, >> >>>>>>> this >> >>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and >> >> [2]. >> >>>> IMO >> >>>>>>> it'd >> >>>>>>>> be nice to be able to opt into keeping the current behavior so >> >>> that >> >>>>>>>> projects like Connect could still do preflight checks of the >> >>>>>>>> bootstrap.servers property for connectors before starting them, >> >>> and >> >>>>>>> report >> >>>>>>>> any issues by failing fast instead of continuously writing >> >>>>>> warning/error >> >>>>>>>> messages to their logs. >> >>>>>>>> >> >>>>>>>> I'm not sure about where this new API could go, but a few >> >> options >> >>>>>> might >> >>>>>>> be: >> >>>>>>>> >> >>>>>>>> - Expose a public variant of the existing ClientUtils class >> >>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and >> >>>>>>>> AdminClientConfig classes >> >>>>>>>> - Add those same static methods to the KafkaConsumer, >> >>> KafkaProducer, >> >>>>>> and >> >>>>>>>> KafkaAdminClient classes >> >>>>>>>> >> >>>>>>>> If this seems reasonable, we should probably also specify in the >> >>> KIP >> >>>>>> that >> >>>>>>>> Kafka Connect will leverage this preflight validation logic >> >> before >> >>>>>>>> instantiating any Kafka clients for use by connectors or tasks, >> >>> and >> >>>>>>>> continue to fail fast if there are typos in the >> >> bootstrap.servers >> >>>>>>> property, >> >>>>>>>> or if temporary DNS resolution issues come up. >> >>>>>>>> >> >>>>>>>> [1] - >> >>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>> >> >>> >> >> >> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606 >> >>>>>>>> [2] - >> >>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>> >> >>> >> >> >> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439 >> >>>>>>>> >> >>>>>>>> Cheers, >> >>>>>>>> >> >>>>>>>> Chris >> >>>>>>>> >> >>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com >> >>> >> >>>>>> wrote: >> >>>>>>>> >> >>>>>>>>> Hey Chris, >> >>>>>>>>> >> >>>>>>>>> Thanks for the quick response, and I apologize for the unclear >> >>>>>> wording >> >>>>>>>>> there, I guess "DNS lookup" would be a more appropriate >> >> wording >> >>>>>> here. >> >>>>>>> So >> >>>>>>>>> what I meant there was, to delegate the DNS lookup in the >> >>>>>> constructor >> >>>>>>> to >> >>>>>>>>> the network client poll, and it will happen on the very first >> >>>>>> poll. I >> >>>>>>>>> guess the logic could look like this: >> >>>>>>>>> >> >>>>>>>>> - if the client has been bootstrapped, do nothing. >> >>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap >> >>> server >> >>>>>>>> address. >> >>>>>>>>> >> >>>>>>>>> Thanks for the comment there, I'll change up the wording. >> >> Maybe >> >>>>>> revise >> >>>>>>>> it >> >>>>>>>>> as "DNS resolution should occur in the poll...." ? >> >>>>>>>>> >> >>>>>>>>> P >> >>>>>>>>> >> >>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton >> >>>>>> <chr...@aiven.io.invalid >> >>>>>>>> >> >>>>>>>>> wrote: >> >>>>>>>>> >> >>>>>>>>>> Hi Philip, >> >>>>>>>>>> >> >>>>>>>>>> Thanks for the KIP! >> >>>>>>>>>> >> >>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that >> >>>>>>>> "Bootstrapping >> >>>>>>>>>> should now occur in the poll method before attempting to >> >>> update >> >>>>>> the >> >>>>>>>>>> metadata. This includes resolving the addresses and >> >>>> bootstrapping >> >>>>>> the >> >>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean >> >>> actually >> >>>>>>>>> contacting >> >>>>>>>>>> the bootstrap servers, or just setting some internal state >> >>>>>> related to >> >>>>>>>> the >> >>>>>>>>>> current set of servers that can be contacted for metadata? I >> >>> ask >> >>>>>>>> because >> >>>>>>>>> it >> >>>>>>>>>> seems like the language here implies the former, but if >> >> that's >> >>>> the >> >>>>>>>> case, >> >>>>>>>>>> this is already happening in poll (or at least, the first >> >>>>>> invocation >> >>>>>>> of >> >>>>>>>>>> it), and if it's the latter, it's probably not necessary to >> >>>>>> mention >> >>>>>>> in >> >>>>>>>>> the >> >>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It >> >>> also >> >>>>>>> seems >> >>>>>>>>> like >> >>>>>>>>>> that detail might impact how intertwined this and KIP-899 >> >> are, >> >>>>>> though >> >>>>>>>> the >> >>>>>>>>>> similarity could still be superficial either way. >> >>>>>>>>>> >> >>>>>>>>>> Cheers, >> >>>>>>>>>> >> >>>>>>>>>> Chris >> >>>>>>>>>> >> >>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee < >> >>> philip...@gmail.com >> >>>>> >> >>>>>>>> wrote: >> >>>>>>>>>> >> >>>>>>>>>>> Hey Ismael, >> >>>>>>>>>>> >> >>>>>>>>>>> Thanks for the feedback! The proposal is not to retry >> >>>>>> automatically >> >>>>>>>> but >> >>>>>>>>>>> relies on the user polling the NetworkClient (basically, >> >>>>>>>> consumer.poll) >> >>>>>>>>>> to >> >>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a >> >>>>>> NetworkException >> >>>>>>>>>>> (retriable) will be thrown. >> >>>>>>>>>>> >> >>>>>>>>>>> Thanks! >> >>>>>>>>>>> P >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma < >> >>>> ism...@juma.me.uk> >> >>>>>>>> wrote: >> >>>>>>>>>>> >> >>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how >> >> long >> >>>>>> will we >> >>>>>>>>> retry >> >>>>>>>>>>> for >> >>>>>>>>>>>> and when do we give up and propagate the failure to the >> >>>> user? >> >>>>>>>>>>>> >> >>>>>>>>>>>> Ismael >> >>>>>>>>>>>> >> >>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee < >> >>>>>> philip...@gmail.com> >> >>>>>>>>>> wrote: >> >>>>>>>>>>>> >> >>>>>>>>>>>>> Hi all! >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> I want to start a discussion thread about how we can >> >>>> handle >> >>>>>>>> client >> >>>>>>>>>>>>> bootstrap failure due DNS lookup. This requires a bit >> >>> of >> >>>>>>>>> behavioral >> >>>>>>>>>>>>> change, so a KIP is proposed and attached to this >> >> email. >> >>>>>> Let me >> >>>>>>>>> know >> >>>>>>>>>>> what >> >>>>>>>>>>>>> you think! >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might >> >>>> sound >> >>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.* >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the >> >>>> existing >> >>>>>>>>>> bootstrap >> >>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable >> >>> to >> >>>>>> allow >> >>>>>>>>>> clients >> >>>>>>>>>>>> to >> >>>>>>>>>>>>> retry >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> *KIP: * >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>> >> >>> >> >> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution >> >>>>>>>>>>>>> >> >>>>>>>>>>>>> Thanks! >> >>>>>>>>>>>>> Philip >> >>>>>>>>>>>>> >> >>>>>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>> >> >>>>>>>>> >> >>>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>> >> >> >> >>