Hi all,

The KIP has been around for some time, and I've updated the document
according to the previous comments. Here is the outline of the proposed
changes:
1. Adding a timeout configuration.
2. Adding a new exception type
3. Adding a WARN level log message

The obvious changes on the client side are:
1. They won't attempt to resolve for DNS at the constructor level
2. They will try to bootstrap once network client poll is invoked.

In terms of client-side behavior:
*Consumer*: Either the poll timer runs out and returns an empty record, or
BootstrapConnectionException thrown
*Producer*: API calls will be blocked on waitOnMetadata until the timer
runs out, either the max block ms, or the bootstrap timer.
*Admin*: The API calls timeout if API timeout expires first; otherwise, it
throws a BootstrapConnectionException.

Let me know your thoughts: I would like to start voting in a week or so.

Link:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+DNS+Resolution+Failure+Should+Not+Fail+the+Clients

Thanks,
P

On Thu, Mar 23, 2023 at 12:59 PM Philip Nee <philip...@gmail.com> wrote:

> Hey Kirk,
>
> Sorry about omitting your response; it slipped through the cracks...
>
> *I’m not sure if producers and consumers likewise do DNS resolution in
> their constructors?*
> Yes, both producer and consumer bootstrap in the constructor.
>
> *I agree that moving the DNS resolution to poll(), it would be hard to
> distinguish hard failures (host name resolution) from transient network
> issues. After all, the DNS resolution issue we saw was technically a
> transient issue.*
> I'm amending a timeout configuration in my KIP because that should resolve
> any transient issues if it doesn't persist for too long.  If it is a hard
> failure (like, misconfiguration of the host name), it should be logged and
> errored out after the expiration.
>
> P
>
> On Wed, Mar 8, 2023 at 8:12 AM Kirk True <k...@kirktrue.pro> wrote:
>
>> Hi Philip,
>>
>> I’m understanding the options proposed as consisting of these questions:
>>
>> Should we throw an Exception or not?
>> Where should the DNS resolution/bootstrapping occur—in the constructor,
>> poll, or somewhere else?
>> Should there be a timeout, and if so, what configuration drives it?
>>
>> We were seeing instances of this issue when constructing KafkaAdminClient
>> instances. The constructor performs the DNS lookup in that client. I’m not
>> sure if producers and consumers likewise do DNS resolution in their
>> constructors?
>>
>> I agree that moving the DNS resolution to poll(), it would be hard to
>> distinguish hard failures (host name resolution) from transient network
>> issues. After all, the DNS resolution issue we saw was technically a
>> transient issue.
>>
>> Is this an accurate summary of the current thinking:
>>
>> Per Jason's suggestion, introduce a new BootstrapConnectionException
>> Per Chris’ suggestion, introduce a new
>> API—performInitialDnsResolution()—that an application developer can call to
>> perform a fast-fail check, throwing BootstrapConnectionException
>> Otherwise, move the DNS resolution to poll()
>> Handle BootstrapConnectionException failures in poll() similar to how we
>> handle NetworkException today, with retries, timeouts, etc., i.e. we don’t
>> introduce any new configuration
>> Improve logging to distinguish the DNS resolution case
>>
>> Thanks,
>> Kirk
>>
>> > On Mar 6, 2023, at 9:15 AM, Philip Nee <philip...@gmail.com> wrote:
>> >
>> > Cheers Kafka Community,
>> >
>> > I just wanna give this thread bump, as it has been a bit quiet for the
>> past
>> > week. I have not updated the KIP based on Chris and Jason's feedback,
>> as I
>> > would also like to know more about what do people think.
>> >
>> > Jason - Thanks for the suggestion, I think your suggestion makes a lot
>> of
>> > sense.
>> >
>> > Thanks!
>> > P
>> >
>> > On Tue, Feb 28, 2023 at 2:45 PM Jason Gustafson
>> <ja...@confluent.io.invalid>
>> > wrote:
>> >
>> >> Hi Philip,
>> >>
>> >>> Having an overall timeout also seems reasonable, but I wonder what
>> should
>> >> the client do after running out of the time? Should we throw a
>> >> non-retriable exception (instead of TimeoutExceptoin to stop the client
>> >> from retrying) and alert the user to examine the config and the DNS
>> server?
>> >>
>> >> Yeah, not sure exactly. I'd probably suggest a
>> >> `BootstrapConnectionException` or something like that with a clear
>> message
>> >> indicating the problem. What the user does with it is up to them, but
>> at
>> >> least it gives them the option to fail their application if that is
>> what
>> >> they prefer to do in this case. If they catch it and ignore it, I would
>> >> expect the client to just continue retrying. Logging for bootstrap
>> >> dns/connection failures will be helpful in any case.
>> >>
>> >> -Jason
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Tue, Feb 28, 2023 at 11:47 AM Philip Nee <philip...@gmail.com>
>> wrote:
>> >>
>> >>> Jason:
>> >>> Thanks for the feedback.  Now giving it a second thought, I think your
>> >>> suggestion of logging the error might make sense, as I could imagine
>> most
>> >>> users would just continue to retry, so it might not be necessary to
>> throw
>> >>> an exception anyway.
>> >>> Having an overall timeout also seems reasonable, but I wonder what
>> should
>> >>> the client do after running out of the time? Should we throw a
>> >>> non-retriable exception (instead of TimeoutExceptoin to stop the
>> client
>> >>> from retrying) and alert the user to examine the config and the DNS
>> >> server?
>> >>>
>> >>> Chris:
>> >>> I feel I still haven't answered your question about the pre-flight
>> check,
>> >>> as it seems exposing an API might be harder to push through.
>> >>>
>> >>> Thanks!
>> >>> P
>> >>>
>> >>> On Tue, Feb 28, 2023 at 10:53 AM Jason Gustafson
>> >>> <ja...@confluent.io.invalid>
>> >>> wrote:
>> >>>
>> >>>> One more random thought I had just as I pushed send. We're currently
>> >>>> treating this problem somewhat narrowly by focusing only on the DNS
>> >>>> resolution of the bootstrap servers. Even if the servers resolve,
>> >> there's
>> >>>> no guarantee that they are reachable by the client. Would it make
>> sense
>> >>> to
>> >>>> have a timeout which bounds the total time that the client should
>> wait
>> >> to
>> >>>> connect to the bootstrap servers? Something like `
>> >>>> bootstrap.servers.connection.timeout.ms`.
>> >>>>
>> >>>> -Jason
>> >>>>
>> >>>> On Tue, Feb 28, 2023 at 10:44 AM Jason Gustafson <ja...@confluent.io
>> >
>> >>>> wrote:
>> >>>>
>> >>>>> Hi Philip,
>> >>>>>
>> >>>>> An alternative is not to fail at all. Every other network error is
>> >>> caught
>> >>>>> and handled internally in the client. We see this case as different
>> >>>> because
>> >>>>> a DNS resolution error may imply misconfiguration. Could it also
>> >> imply
>> >>>> that
>> >>>>> the DNS server is unavailable? I'm not sure why that case should be
>> >>>> handled
>> >>>>> differently than if the bootstrap servers themselves are
>> unavailable.
>> >>>> Would
>> >>>>> it be enough to log a clear warning in the logs if the bootstrap
>> >>> servers
>> >>>>> could not resolve?
>> >>>>>
>> >>>>> On the whole, the current fail-fast approach probably does more good
>> >>> than
>> >>>>> bad, but it does seem somewhat inconsistent overall and my guess is
>> >>> that
>> >>>>> dynamic environments will become increasingly common. It would be
>> >> nice
>> >>> to
>> >>>>> have a reasonable default behavior which could handle these cases
>> >>>>> gracefully without any additional logic. In any case, it would be
>> >> nice
>> >>> to
>> >>>>> see this option in the rejected alternatives at least if we do not
>> >> take
>> >>>> it.
>> >>>>>
>> >>>>> If we want to take the route of throwing an exception, then I think
>> >>> we're
>> >>>>> probably going to need a new configuration since I can't see what a
>> >>>>> reasonable timeout we would use as a default. The benefit of a
>> >>>>> configuration is that it would let us retain the current default
>> >>> behavior
>> >>>>> with timeout effectively set to 0 and it would also let users
>> >>> effectively
>> >>>>> disable the timeout by using a very large value. Otherwise, it seems
>> >>>> like a
>> >>>>> potential compatibility break to have a new exception type thrown at
>> >>> some
>> >>>>> arbitrary time without giving the user any control over it.
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Jason
>> >>>>>
>> >>>>>
>> >>>>> On Tue, Feb 28, 2023 at 8:08 AM Chris Egerton
>> >> <chr...@aiven.io.invalid
>> >>>>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi Philip,
>> >>>>>>
>> >>>>>> Yeah, it's basically DNS resolution we're talking about, though
>> >>> there's
>> >>>>>> some additional subtlety there with the logic introduced by KIP-235
>> >>> [1].
>> >>>>>> Essentially it should cover any scenario that causes a client
>> >>>> constructor
>> >>>>>> to fail with the current logic but would not after this KIP is
>> >>> released.
>> >>>>>>
>> >>>>>> We can generalize the Connect use case like this: a client
>> >> application
>> >>>>>> that
>> >>>>>> may connect to different Kafka clusters with a public-facing,
>> >>>> easy-to-use
>> >>>>>> API for restarting failed tasks and automatic handling of retriable
>> >>>>>> exceptions. The ease with which failed tasks can be restarted is
>> >>>>>> significant because it reduces the cost of failing on non-retriable
>> >>>>>> exceptions and makes fail-fast behavior easier to work with. And,
>> in
>> >>>> cases
>> >>>>>> like this where we can't really know whether the error we're
>> dealing
>> >>>> with
>> >>>>>> is retriable or not, it's better IMO to continue to allow
>> >> applications
>> >>>>>> like
>> >>>>>> these to fail fast. I do agree that it'd be nice to get input from
>> >> the
>> >>>>>> community, though.
>> >>>>>>
>> >>>>>> I was toying with the idea of a NetworkException subclass too. It's
>> >> a
>> >>>>>> simpler API, but it doesn't allow for preflight validation, which
>> >> can
>> >>> be
>> >>>>>> useful in scenarios where submitting new configurations for client
>> >>>>>> applications is expensive in terms of time or resources. Then
>> >> again, I
>> >>>>>> don't see why the two are mutually exclusive, and we might opt to
>> >> use
>> >>>> the
>> >>>>>> NetworkException subclass in this KIP and pursue an opt-in
>> >> validation
>> >>>> API
>> >>>>>> later on. Thoughts?
>> >>>>>>
>> >>>>>> [1] -
>> >>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-235%3A+Add+DNS+alias+support+for+secured+connection
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>>
>> >>>>>> Chris
>> >>>>>>
>> >>>>>> On Mon, Feb 27, 2023 at 7:06 PM Philip Nee <philip...@gmail.com>
>> >>> wrote:
>> >>>>>>
>> >>>>>>> Hey Chris,
>> >>>>>>>
>> >>>>>>> Thanks again for the feedback!
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> For the preflight DNS check (are we basically trying to resolve
>> >> the
>> >>>> DNS
>> >>>>>>> there?): Maybe it makes more sense to add it to the Config
>> >> modules?
>> >>> I
>> >>>>>> would
>> >>>>>>> like to hear what the community says as I'm not familiar with the
>> >>>>>> Connect
>> >>>>>>> use case.
>> >>>>>>>
>> >>>>>>> A "slower failing" alternative - I wonder if it makes sense for us
>> >>> to
>> >>>>>>> extend the NetworkException so that clients can be smarter at
>> >>> handling
>> >>>>>>> these exceptions. Of course, it is still retriable and requires
>> >>>> polling
>> >>>>>> the
>> >>>>>>> consumer, but then we can distinguish the DNS resolution error
>> >> from
>> >>>>>> other
>> >>>>>>> network errors.
>> >>>>>>>
>> >>>>>>> Thanks!
>> >>>>>>> P
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Mon, Feb 27, 2023 at 9:36 AM Chris Egerton
>> >>> <chr...@aiven.io.invalid
>> >>>>>
>> >>>>>>> wrote:
>> >>>>>>>
>> >>>>>>>> Hi Philip,
>> >>>>>>>>
>> >>>>>>>> Yeah,  "DNS resolution should occur..." seems like a better fit.
>> >>> 👍
>> >>>>>>>>
>> >>>>>>>> One other question I have is whether we should expose some kind
>> >> of
>> >>>>>> public
>> >>>>>>>> API for performing preflight validation of the bootstrap URLs.
>> >> If
>> >>> we
>> >>>>>>> change
>> >>>>>>>> the behavior of a client configured with a silly typo (e.g.,
>> >>>>>>>> "loclahost instead of localhost") from failing in the
>> >> constructor
>> >>> to
>> >>>>>>>> failing with a retriable exception, this might lead some client
>> >>>>>>>> applications to handle that failure by, well, retrying. For
>> >>>> reference,
>> >>>>>>> this
>> >>>>>>>> is exactly what we do in Kafka Connect right now; see [1] and
>> >> [2].
>> >>>> IMO
>> >>>>>>> it'd
>> >>>>>>>> be nice to be able to opt into keeping the current behavior so
>> >>> that
>> >>>>>>>> projects like Connect could still do preflight checks of the
>> >>>>>>>> bootstrap.servers property for connectors before starting them,
>> >>> and
>> >>>>>>> report
>> >>>>>>>> any issues by failing fast instead of continuously writing
>> >>>>>> warning/error
>> >>>>>>>> messages to their logs.
>> >>>>>>>>
>> >>>>>>>> I'm not sure about where this new API could go, but a few
>> >> options
>> >>>>>> might
>> >>>>>>> be:
>> >>>>>>>>
>> >>>>>>>> - Expose a public variant of the existing ClientUtils class
>> >>>>>>>> - Add static methods to the ConsumerConfig, ProducerConfig, and
>> >>>>>>>> AdminClientConfig classes
>> >>>>>>>> - Add those same static methods to the KafkaConsumer,
>> >>> KafkaProducer,
>> >>>>>> and
>> >>>>>>>> KafkaAdminClient classes
>> >>>>>>>>
>> >>>>>>>> If this seems reasonable, we should probably also specify in the
>> >>> KIP
>> >>>>>> that
>> >>>>>>>> Kafka Connect will leverage this preflight validation logic
>> >> before
>> >>>>>>>> instantiating any Kafka clients for use by connectors or tasks,
>> >>> and
>> >>>>>>>> continue to fail fast if there are typos in the
>> >> bootstrap.servers
>> >>>>>>> property,
>> >>>>>>>> or if temporary DNS resolution issues come up.
>> >>>>>>>>
>> >>>>>>>> [1] -
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://github.com/apache/kafka/blob/5f9d01668cae64b2cacd7872d82964fa78862aaf/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSinkTask.java#L606
>> >>>>>>>> [2] -
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L439
>> >>>>>>>>
>> >>>>>>>> Cheers,
>> >>>>>>>>
>> >>>>>>>> Chris
>> >>>>>>>>
>> >>>>>>>> On Fri, Feb 24, 2023 at 4:59 PM Philip Nee <philip...@gmail.com
>> >>>
>> >>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hey Chris,
>> >>>>>>>>>
>> >>>>>>>>> Thanks for the quick response, and I apologize for the unclear
>> >>>>>> wording
>> >>>>>>>>> there, I guess "DNS lookup" would be a more appropriate
>> >> wording
>> >>>>>> here.
>> >>>>>>> So
>> >>>>>>>>> what I meant there was, to delegate the DNS lookup in the
>> >>>>>> constructor
>> >>>>>>> to
>> >>>>>>>>> the network client poll, and it will happen on the very first
>> >>>>>> poll.  I
>> >>>>>>>>> guess the logic could look like this:
>> >>>>>>>>>
>> >>>>>>>>> - if the client has been bootstrapped, do nothing.
>> >>>>>>>>> - Otherwise, perform DNS lookup, and acquire the bootstrap
>> >>> server
>> >>>>>>>> address.
>> >>>>>>>>>
>> >>>>>>>>> Thanks for the comment there, I'll change up the wording.
>> >> Maybe
>> >>>>>> revise
>> >>>>>>>> it
>> >>>>>>>>> as "DNS resolution should occur in the poll...." ?
>> >>>>>>>>>
>> >>>>>>>>> P
>> >>>>>>>>>
>> >>>>>>>>> On Fri, Feb 24, 2023 at 1:47 PM Chris Egerton
>> >>>>>> <chr...@aiven.io.invalid
>> >>>>>>>>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hi Philip,
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks for the KIP!
>> >>>>>>>>>>
>> >>>>>>>>>> QQ: In the "Proposed Changes" section, the KIP states that
>> >>>>>>>> "Bootstrapping
>> >>>>>>>>>> should now occur in the poll method before attempting to
>> >>> update
>> >>>>>> the
>> >>>>>>>>>> metadata. This includes resolving the addresses and
>> >>>> bootstrapping
>> >>>>>> the
>> >>>>>>>>>> metadata.". By "bootstrapping the metadata" do we mean
>> >>> actually
>> >>>>>>>>> contacting
>> >>>>>>>>>> the bootstrap servers, or just setting some internal state
>> >>>>>> related to
>> >>>>>>>> the
>> >>>>>>>>>> current set of servers that can be contacted for metadata? I
>> >>> ask
>> >>>>>>>> because
>> >>>>>>>>> it
>> >>>>>>>>>> seems like the language here implies the former, but if
>> >> that's
>> >>>> the
>> >>>>>>>> case,
>> >>>>>>>>>> this is already happening in poll (or at least, the first
>> >>>>>> invocation
>> >>>>>>> of
>> >>>>>>>>>> it), and if it's the latter, it's probably not necessary to
>> >>>>>> mention
>> >>>>>>> in
>> >>>>>>>>> the
>> >>>>>>>>>> KIP since it doesn't really impact user-facing behavior. It
>> >>> also
>> >>>>>>> seems
>> >>>>>>>>> like
>> >>>>>>>>>> that detail might impact how intertwined this and KIP-899
>> >> are,
>> >>>>>> though
>> >>>>>>>> the
>> >>>>>>>>>> similarity could still be superficial either way.
>> >>>>>>>>>>
>> >>>>>>>>>> Cheers,
>> >>>>>>>>>>
>> >>>>>>>>>> Chris
>> >>>>>>>>>>
>> >>>>>>>>>> On Thu, Feb 23, 2023 at 9:21 PM Philip Nee <
>> >>> philip...@gmail.com
>> >>>>>
>> >>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>> Hey Ismael,
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks for the feedback! The proposal is not to retry
>> >>>>>> automatically
>> >>>>>>>> but
>> >>>>>>>>>>> relies on the user polling the NetworkClient (basically,
>> >>>>>>>> consumer.poll)
>> >>>>>>>>>> to
>> >>>>>>>>>>> reattempt the bootstrap. If bootstrapping fails, a
>> >>>>>> NetworkException
>> >>>>>>>>>>> (retriable) will be thrown.
>> >>>>>>>>>>>
>> >>>>>>>>>>> Thanks!
>> >>>>>>>>>>> P
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> On Thu, Feb 23, 2023 at 1:34 PM Ismael Juma <
>> >>>> ism...@juma.me.uk>
>> >>>>>>>> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>> Thanks for the KIP. Not sure if I missed it, but how
>> >> long
>> >>>>>> will we
>> >>>>>>>>> retry
>> >>>>>>>>>>> for
>> >>>>>>>>>>>> and when do we give up and propagate the failure to the
>> >>>> user?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Ismael
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On Thu, Feb 23, 2023 at 9:30 AM Philip Nee <
>> >>>>>> philip...@gmail.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> Hi all!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I want to start a discussion thread about how we can
>> >>>> handle
>> >>>>>>>> client
>> >>>>>>>>>>>>> bootstrap failure due DNS lookup.  This requires a bit
>> >>> of
>> >>>>>>>>> behavioral
>> >>>>>>>>>>>>> change, so a KIP is proposed and attached to this
>> >> email.
>> >>>>>> Let me
>> >>>>>>>>> know
>> >>>>>>>>>>> what
>> >>>>>>>>>>>>> you think!
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *A small remark here*: *As the title of this KIP might
>> >>>> sound
>> >>>>>>>>>>>>> familiar/similar to KIP-899, it is not the same.*
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *In Summary:* I want to propose a KIP to change the
>> >>>> existing
>> >>>>>>>>>> bootstrap
>> >>>>>>>>>>>>> (upon instantiation) strategy because it is reasonable
>> >>> to
>> >>>>>> allow
>> >>>>>>>>>> clients
>> >>>>>>>>>>>> to
>> >>>>>>>>>>>>> retry
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> *KIP: *
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-909%3A+Allow+Clients+to+Rebootstrap+Upon+Failed+DNS+Resolution
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> Thanks!
>> >>>>>>>>>>>>> Philip
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>
>>
>>

Reply via email to