Hey Yash,

Thanks for the KIP, and sorry for the late review.

1. Have you considered an HTTP header to provide the client-configurable timeout? A header might extend more naturally to all of the other endpoints in the future, rather than duplicating the query parameter across endpoints.

2. I understand that this change is targeted at just long-duration Connector::validate calls; is that due to voluntary scope constraints? Implementing configurable timeouts for all endpoints in a uniform way could be desirable, even if the default timeout will work for nearly all of the other calls.

3. Did you consider adding asynchronous validation as a user-facing feature? I think that relying on the synchronous validation results in a single HTTP request is a bit of an anti-pattern, and one that we've inherited from the original REST design. It seems useful when using the REST API by hand, but seems to be a liability when used in environments with an external management layer.

4. Perhaps rather than allowing synchronous calls to Connector::validate to increase in duration, we should provide a way for connectors to surface validation problems later in their lifecycle. Currently there is ConnectorContext::raiseError, which transitions the task to FAILED; perhaps a similar API could asynchronously emit re-validation results (rough sketch below).
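To sketch the idea (purely hypothetical; none of these names exist in Kafka Connect today):

    import java.util.List;
    import java.util.Map;

    // Hypothetical counterpart to ConnectorContext::raiseError: instead of
    // failing the connector, it asynchronously surfaces updated validation
    // results that the herder could expose through the REST API.
    interface RevalidationContext {
        // Report re-validation errors discovered after startup, keyed by the
        // config property they apply to.
        void reportValidationErrors(Map<String, List<String>> errorsByConfigKey);
    }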
We've also had a problem with long start() durations for the same reasons as validate(). I understand if you want to keep this KIP as tightly focused as possible, but I'm worried that it is addressing the symptom and not the problem. I want to make sure that this change is impactful and isn't obsoleted by a later improvement.

Thanks,
Greg

On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <[email protected]> wrote:

> Hi all,
>
> Thanks for all the feedback and discussion. I've renamed the KIP title to "Kafka Connect REST API timeout improvements" since we're introducing a couple of improvements (cancelling create / update connector requests when config validations time out, and avoiding double config validations in distributed mode) along with making the request timeouts configurable. The new KIP link is
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> and I've called for a vote for the KIP here:
> https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp
>
> Thanks,
> Yash
>
> On Sat, Nov 5, 2022 at 11:42 PM Sagar <[email protected]> wrote:
>
> > Hey Yash,
> >
> > Thanks for the explanation. I think it should be fine to delegate the validation directly to the leader.
> >
> > Thanks!
> > Sagar.
> >
> > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <[email protected]> wrote:
> >
> > > Hi Sagar,
> > >
> > > Thanks for chiming in!
> > >
> > > > Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader
> > >
> > > Only the leader is allowed to perform writes to the config topic, so any request that requires a config topic write ends up being forwarded to the leader. The `POST /connectors` and `PUT /connectors/{connector}/config` endpoints call `Herder::putConnectorConfig`, which internally does a config validation first before initiating another herder request to write to the config topic (in distributed mode).
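> > > To make that concrete, here's a heavily simplified sketch of the flow (the helper names are invented for illustration; the real logic lives in the herder implementations):
> > >
> > >     import java.util.Map;
> > >
> > >     // Illustrative only - not the actual herder code.
> > >     class PutConnectorConfigSketch {
> > >         void putConnectorConfig(String name, Map<String, String> config) {
> > >             // Step 1: validate the config; custom validation logic can be slow.
> > >             if (validateConnectorConfig(config) > 0)
> > >                 throw new IllegalArgumentException("Invalid connector config");
> > >             // Step 2: only the leader may write to the config topic, so a
> > >             // non-leader worker forwards the request (and the leader
> > >             // currently re-validates it before writing).
> > >             if (isLeader())
> > >                 writeToConfigTopic(name, config);
> > >             else
> > >                 forwardToLeader("PUT /connectors/" + name + "/config", config);
> > >         }
> > >
> > >         int validateConnectorConfig(Map<String, String> config) { return 0; } // stub: returns error count
> > >         boolean isLeader() { return false; } // stub
> > >         void writeToConfigTopic(String name, Map<String, String> config) { } // stub
> > >         void forwardToLeader(String path, Map<String, String> config) { } // stub
> > >     }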
> > > If we want to allow the first worker to do the config validation and then forward the request to the leader just for the write to the config topic, we'd either need something like a skip-validations query parameter on the endpoint like Chris talks about above, or else a new internal-only endpoint that just does a write to the config topic without any prior config validation. However, we agreed that this optimization doesn't really seem necessary for now and can be done later if deemed useful. I'd be happy to hear differing thoughts if any, however.
> > >
> > > > I think a bound is certainly needed but IMO it shouldn't go beyond 10 mins considering this is just validation
> > >
> > > Yeah, I agree that this seems like a fair value - I've elected to go with a default value of 10 minutes for the proposed worker configuration that sets an upper bound for the timeout query parameter.
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <[email protected]> wrote:
> > >
> > > > Hi Chris,
> > > >
> > > > Thanks again for your feedback. I think a worker configuration for the upper bound makes sense - I initially thought we could hardcode it (just like the current request timeout is), but there's no reason to set another artificial bound that isn't user configurable, which is exactly what we're trying to change here in the first place. I've updated the KIP based on all our discussion above.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <[email protected]> wrote:
> > > >
> > > > > Hey Yash,
> > > > >
> > > > > Thanks for the KIP! This looks like a useful feature.
> > > > >
> > > > > I think the discussion thread already has some great points by Chris. Just a couple of points/clarifications:
> > > > >
> > > > > Regarding pt#2, I guess it might be better to forward to the leader as suggested by Yash. Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader (even though it might be cheap to forward all requests to the leader).
> > > > >
> > > > > Pt#3: I think a bound is certainly needed but IMO it shouldn't go beyond 10 mins considering this is just validation. We shouldn't end up in a situation where a few faulty connectors end up blocking a lot of request processing threads, so while increasing the config is certainly helpful, we shouldn't set too high a value IMO. Of course I am also open to suggestions here.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <[email protected]> wrote:
> > > > >
> > > > > > Hi Yash,
> > > > > >
> > > > > > RE 2: That's a great point about validations already being performed by the leader. For completeness's sake, I'd like to note that this only holds for valid configurations; invalid ones are caught right now before being forwarded to the leader. Still, I think it's fine to forward to the leader for now and optimize further in the future if necessary.
> > > > > > If frequent validations are taking place, they should be conducted via the `PUT /connector-plugins/{pluginName}/config/validate` endpoint, which won't do any forwarding at all.
> > > > > >
> > > > > > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem reasonable... maybe a low-importance worker property could work for that? Not sure what would make sense for a default; probably somewhere in the 10-60 minute range, but I would be interested in others' thoughts.
> > > > > >
> > > > > > Thanks for the clarification on the zombie fencing logic. I think we might want to have some more subtle logic around the interaction between calls to Admin::fenceProducers and a worker-level timeout property if we go that route, but we can cross that particular bridge if we get back to it.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > Thanks a lot for the super quick response and the great feedback!
> > > > > > >
> > > > > > > 1. I think that makes a lot of sense, and I'd be happy to update the KIP to include this change in the scope. The current behavior, where the API response indicates a timeout but the connector is created/updated eventually anyway, can be pretty confusing and is generally not a good user experience IMO.
> > > > > > >
> > > > > > > 2. Wow, thanks for pointing this out - it's a really good catch and something I hadn't noticed was happening with the current implementation. While I do like the idea of having a query parameter that determines whether validations can be skipped, I'm wondering if it might not be easier and cleaner to just do the leader check earlier and avoid doing the unnecessary config validation on the first worker? Since each config validation happens on its own thread, I'm not so sure about the concern of overloading the leader even on larger clusters, especially since validations aren't typically long-running operations. Furthermore, even with the current implementation, the leader will always be doing a config validation for connector create / update REST API requests made on any worker.
> > > > > > >
> > > > > > > 3. That's a good point, and this way we can also restrict the APIs whose timeouts are configurable - I'm thinking `PUT /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and `PUT /connectors/{connector}/config` are the ones where such a timeout parameter could be useful. Also, do you think we should enforce some reasonable bounds for the timeout config?
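> > > > > > > For illustration, a request using the proposed parameter might look something like the following (the parameter name and millisecond units here are just an assumption at this point, not a settled design):
> > > > > > >
> > > > > > >     import java.net.URI;
> > > > > > >     import java.net.http.HttpClient;
> > > > > > >     import java.net.http.HttpRequest;
> > > > > > >     import java.net.http.HttpResponse;
> > > > > > >
> > > > > > >     // Hypothetical usage of a per-request timeout query parameter
> > > > > > >     // against a local Connect worker.
> > > > > > >     public class ValidateWithTimeout {
> > > > > > >         public static void main(String[] args) throws Exception {
> > > > > > >             String body = "{\"connector.class\": \"com.example.MySinkConnector\", \"topics\": \"test\"}";
> > > > > > >             HttpRequest request = HttpRequest.newBuilder()
> > > > > > >                     .uri(URI.create("http://localhost:8083/connector-plugins/"
> > > > > > >                             + "MySinkConnector/config/validate?timeout=120000"))
> > > > > > >                     .header("Content-Type", "application/json")
> > > > > > >                     .PUT(HttpRequest.BodyPublishers.ofString(body))
> > > > > > >                     .build();
> > > > > > >             HttpResponse<String> response = HttpClient.newHttpClient()
> > > > > > >                     .send(request, HttpResponse.BodyHandlers.ofString());
> > > > > > >             System.out.println(response.statusCode() + " " + response.body());
> > > > > > >         }
> > > > > > >     }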
> > > > > > > On the zombie fencing point, the implication was that the new worker property would not control the timeout used for the call to Admin::fenceProducers. However, if we go with a timeout query parameter approach, even the timeout for the `PUT /connectors/{connector}/fence` endpoint will remain unaffected.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yash
> > > > > > >
> > > > > > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Yash,
> > > > > > > >
> > > > > > > > Thanks for the KIP. It's a nice, focused change. Initially I was hesitant to support cases where connector validation takes this long, but considering the alternative is that we give users a 500 error response but leave the request to create/modify the connector queued up in the herder, I think I can get behind the motivation here. There's also an argument to be made about keeping Kafka Connect available even when the systems that it connects to are in a degraded state.
> > > > > > > >
> > > > > > > > I have a few alternatives I'd be interested in your thoughts on:
> > > > > > > >
> > > > > > > > 1. Since the primary concern here seems to be that custom connector validation logic can take too long, do we have any thoughts on adding logic to check for request timeout after validation has completed and, if it has, aborting the attempt to create/modify the connector? (There's a rough sketch of this at the end of this message.)
> > > > > > > >
> > > > > > > > 2. Right now it's possible that we'll perform two connector config validations per create/modify request: once on the worker that initially receives the request, and then again if that worker is not the leader of the cluster and has to forward the request to the leader. Any thoughts on optimizing this to only require a single validation per request? We probably wouldn't want to force all validations to take place on the leader (that could lead to overloading it pretty quickly in large clusters), but we could add an internal-only query parameter to skip validation and then use that parameter when forwarding requests from followers to the leader.
> > > > > > > >
> > > > > > > > 3. A worker property is pretty coarse-grained and difficult to change. We might allow per-request toggling of the timeout by adding a URL query parameter like '?timeout=90s' to the REST API, to allow tweaking of the timeout on a more granular basis and without having to perform a worker restart.
> > > > > > > >
> > > > > > > > I'd also like to clarify a point about the rejected alternative "Allow configuring producer zombie fencing admin request timeout" - is the implication here that the `rest.api.request.timeout.ms` property will not control the REST timeout for requests to the `PUT /connectors/{connector}/fence` endpoint, or just that it won't control the timeout that we use for the call to Admin::fenceProducers?
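> > > > > > > > To illustrate alternative 1 (names invented; this is just a sketch of the idea, not a proposed implementation):
> > > > > > > >
> > > > > > > >     import java.util.Map;
> > > > > > > >
> > > > > > > >     // Sketch: abort the create/modify attempt if the HTTP request
> > > > > > > >     // deadline passed while validation was still running.
> > > > > > > >     class AbortAfterSlowValidation {
> > > > > > > >         void createConnector(String name, Map<String, String> config, long requestDeadlineMs) {
> > > > > > > >             validateConnectorConfig(config); // may run past the HTTP request timeout
> > > > > > > >             if (System.currentTimeMillis() > requestDeadlineMs) {
> > > > > > > >                 // The caller has already received a 500 by now; don't
> > > > > > > >                 // quietly create the connector anyway.
> > > > > > > >                 throw new IllegalStateException("Request timed out during validation; aborting");
> > > > > > > >             }
> > > > > > > >             writeToConfigTopic(name, config);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         void validateConnectorConfig(Map<String, String> config) { } // stub
> > > > > > > >         void writeToConfigTopic(String name, Map<String, String> config) { } // stub
> > > > > > > >     }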
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I'd like to start a discussion thread on this small KIP -
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > > > > > > >
> > > > > > > > > It proposes the addition of a new Kafka Connect worker configuration to allow configuring REST API request timeouts.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Yash
