Hey Yash,

Thanks for the KIP, and sorry for the late review.

1. Have you considered an HTTP header to provide the client-configurable timeout? A header might extend more naturally to all of the other endpoints in the future, rather than duplicating the query parameter across endpoints.

2. I understand that this change is targeted at just long-duration Connector::validate calls; is that due to voluntary scope constraints? Implementing configurable timeouts for all endpoints in a uniform way could be desirable, even if the default timeout will work for nearly all of the other calls.

3. Did you consider adding asynchronous validation as a user-facing feature? I think that relying on the synchronous validation results in a single HTTP request is a bit of an anti-pattern, and one that we've inherited from the original REST design. It seems useful when using the REST API by hand, but seems to be a liability when used in environments with an external management layer.

4. Perhaps rather than allowing synchronous calls to Connector::validate to increase in duration, we should provide a way for connectors to surface validation problems later in their lifecycle. Currently there is ConnectorContext::raiseError, which transitions the task to FAILED; perhaps a similar API could asynchronously emit re-validation results (rough sketch below).
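To sketch the idea (purely hypothetical; none of these names exist in Kafka Connect today):

    import java.util.List;
    import java.util.Map;

    // Hypothetical counterpart to ConnectorContext::raiseError: instead of
    // failing the connector, it asynchronously surfaces updated validation
    // results that the herder could expose through the REST API.
    interface RevalidationContext {
        // Report re-validation errors discovered after startup, keyed by the
        // config property they apply to.
        void reportValidationErrors(Map<String, List<String>> errorsByConfigKey);
    }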
We've also had a problem with long start() durations for the same reasons as validate(). I understand if you want to keep this KIP as tightly focused as possible, but I'm worried that it is addressing the symptom and not the problem. I want to make sure that this change is impactful and isn't obsoleted by a later improvement.

Thanks,
Greg

On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <[email protected]> wrote:

> Hi all,
>
> Thanks for all the feedback and discussion. I've renamed the KIP title to "Kafka Connect REST API timeout improvements" since we're introducing a couple of improvements (cancelling create / update connector requests when config validations time out, and avoiding double config validations in distributed mode) along with making the request timeouts configurable. The new KIP link is
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> and I've called for a vote for the KIP here:
> https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp
>
> Thanks,
> Yash
>
> On Sat, Nov 5, 2022 at 11:42 PM Sagar <[email protected]> wrote:
>
> > Hey Yash,
> >
> > Thanks for the explanation. I think it should be fine to delegate the validation directly to the leader.
> >
> > Thanks!
> > Sagar.
> >
> > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <[email protected]> wrote:
> >
> > > Hi Sagar,
> > >
> > > Thanks for chiming in!
> > >
> > > > Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader
> > >
> > > Only the leader is allowed to perform writes to the config topic, so any request that requires a config topic write ends up being forwarded to the leader. The `POST /connectors` and `PUT /connectors/{connector}/config` endpoints call `Herder::putConnectorConfig`, which internally does a config validation first before initiating another herder request to write to the config topic (in distributed mode).
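> > > To make that concrete, here's a heavily simplified sketch of the flow (the helper names are invented for illustration; the real logic lives in the herder implementations):
> > >
> > >     import java.util.Map;
> > >
> > >     // Illustrative only - not the actual herder code.
> > >     class PutConnectorConfigSketch {
> > >         void putConnectorConfig(String name, Map<String, String> config) {
> > >             // Step 1: validate the config; custom validation logic can be slow.
> > >             if (validateConnectorConfig(config) > 0)
> > >                 throw new IllegalArgumentException("Invalid connector config");
> > >             // Step 2: only the leader may write to the config topic, so a
> > >             // non-leader worker forwards the request (and the leader
> > >             // currently re-validates it before writing).
> > >             if (isLeader())
> > >                 writeToConfigTopic(name, config);
> > >             else
> > >                 forwardToLeader("PUT /connectors/" + name + "/config", config);
> > >         }
> > >
> > >         int validateConnectorConfig(Map<String, String> config) { return 0; } // stub: returns error count
> > >         boolean isLeader() { return false; } // stub
> > >         void writeToConfigTopic(String name, Map<String, String> config) { } // stub
> > >         void forwardToLeader(String path, Map<String, String> config) { } // stub
> > >     }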
> > > If we want to allow the first worker to do the config validation and then forward the request to the leader just for the write to the config topic, we'd either need something like a skip-validations query parameter on the endpoint like Chris talks about above, or else a new internal-only endpoint that just does a write to the config topic without any prior config validation. However, we agreed that this optimization doesn't really seem necessary for now and can be done later if deemed useful. I'd be happy to hear differing thoughts if any, however.
> > >
> > > > I think a bound is certainly needed but IMO it shouldn't go beyond 10 mins considering this is just validation
> > >
> > > Yeah, I agree that this seems like a fair value - I've elected to go with a default value of 10 minutes for the proposed worker configuration that sets an upper bound for the timeout query parameter.
> > >
> > > Thanks,
> > > Yash
> > >
> > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <[email protected]> wrote:
> > >
> > > > Hi Chris,
> > > >
> > > > Thanks again for your feedback. I think a worker configuration for the upper bound makes sense - I initially thought we could hardcode it (just like the current request timeout is), but there's no reason to set another artificial bound that isn't user configurable, which is exactly what we're trying to change here in the first place. I've updated the KIP based on all our discussion above.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <[email protected]> wrote:
> > > >
> > > > > Hey Yash,
> > > > >
> > > > > Thanks for the KIP! This looks like a useful feature.
> > > > >
> > > > > I think the discussion thread already has some great points by Chris. Just a couple of points/clarifications:
> > > > >
> > > > > Regarding pt#2, I guess it might be better to forward to the leader as suggested by Yash. Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader (even though it might be cheap to forward all requests to the leader).
> > > > >
> > > > > Pt#3: I think a bound is certainly needed but IMO it shouldn't go beyond 10 mins considering this is just validation. We shouldn't end up in a situation where a few faulty connectors end up blocking a lot of request processing threads, so while increasing the config is certainly helpful, we shouldn't set too high a value IMO. Of course I am also open to suggestions here.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <[email protected]> wrote:
> > > > >
> > > > > > Hi Yash,
> > > > > >
> > > > > > RE 2: That's a great point about validations already being performed by the leader. For completeness's sake, I'd like to note that this only holds for valid configurations; invalid ones are caught right now before being forwarded to the leader. Still, I think it's fine to forward to the leader for now and optimize further in the future if necessary.
> > > > > > If frequent validations are taking place, they should be conducted via the `PUT /connector-plugins/{pluginName}/config/validate` endpoint, which won't do any forwarding at all.
> > > > > >
> > > > > > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem reasonable... maybe a low-importance worker property could work for that? Not sure what would make sense for a default; probably somewhere in the 10-60 minute range, but I would be interested in others' thoughts.
> > > > > >
> > > > > > Thanks for the clarification on the zombie fencing logic. I think we might want to have some more subtle logic around the interaction between calls to Admin::fenceProducers and a worker-level timeout property if we go that route, but we can cross that particular bridge if we get back to it.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > Thanks a lot for the super quick response and the great feedback!
> > > > > > >
> > > > > > > 1. I think that makes a lot of sense, and I'd be happy to update the KIP to include this change in the scope. The current behavior, where the API response indicates a timeout but the connector is created/updated eventually anyway, can be pretty confusing and is generally not a good user experience IMO.
> > > > > > >
> > > > > > > 2. Wow, thanks for pointing this out - it's a really good catch and something I hadn't noticed was happening with the current implementation. While I do like the idea of having a query parameter that determines whether validations can be skipped, I'm wondering if it might not be easier and cleaner to just do the leader check earlier and avoid doing the unnecessary config validation on the first worker? Since each config validation happens on its own thread, I'm not so sure about the concern of overloading the leader even on larger clusters, especially since validations aren't typically long-running operations. Furthermore, even with the current implementation, the leader will always be doing a config validation for connector create / update REST API requests made on any worker.
> > > > > > >
> > > > > > > 3. That's a good point, and this way we can also restrict the APIs whose timeouts are configurable - I'm thinking `PUT /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and `PUT /connectors/{connector}/config` are the ones where such a timeout parameter could be useful. Also, do you think we should enforce some reasonable bounds for the timeout config?
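> > > > > > > For illustration, a request using the proposed parameter might look something like the following (the parameter name and millisecond units here are just an assumption at this point, not a settled design):
> > > > > > >
> > > > > > >     import java.net.URI;
> > > > > > >     import java.net.http.HttpClient;
> > > > > > >     import java.net.http.HttpRequest;
> > > > > > >     import java.net.http.HttpResponse;
> > > > > > >
> > > > > > >     // Hypothetical usage of a per-request timeout query parameter
> > > > > > >     // against a local Connect worker.
> > > > > > >     public class ValidateWithTimeout {
> > > > > > >         public static void main(String[] args) throws Exception {
> > > > > > >             String body = "{\"connector.class\": \"com.example.MySinkConnector\", \"topics\": \"test\"}";
> > > > > > >             HttpRequest request = HttpRequest.newBuilder()
> > > > > > >                     .uri(URI.create("http://localhost:8083/connector-plugins/"
> > > > > > >                             + "MySinkConnector/config/validate?timeout=120000"))
> > > > > > >                     .header("Content-Type", "application/json")
> > > > > > >                     .PUT(HttpRequest.BodyPublishers.ofString(body))
> > > > > > >                     .build();
> > > > > > >             HttpResponse<String> response = HttpClient.newHttpClient()
> > > > > > >                     .send(request, HttpResponse.BodyHandlers.ofString());
> > > > > > >             System.out.println(response.statusCode() + " " + response.body());
> > > > > > >         }
> > > > > > >     }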
> > > > > > > On the zombie fencing point, the implication was that the new worker property would not control the timeout used for the call to Admin::fenceProducers. However, if we go with a timeout query parameter approach, even the timeout for the `PUT /connectors/{connector}/fence` endpoint will remain unaffected.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yash
> > > > > > >
> > > > > > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Yash,
> > > > > > > >
> > > > > > > > Thanks for the KIP. It's a nice, focused change. Initially I was hesitant to support cases where connector validation takes this long, but considering the alternative is that we give users a 500 error response but leave the request to create/modify the connector queued up in the herder, I think I can get behind the motivation here. There's also an argument to be made about keeping Kafka Connect available even when the systems that it connects to are in a degraded state.
> > > > > > > >
> > > > > > > > I have a few alternatives I'd be interested in your thoughts on:
> > > > > > > >
> > > > > > > > 1. Since the primary concern here seems to be that custom connector validation logic can take too long, do we have any thoughts on adding logic to check for request timeout after validation has completed and, if it has, aborting the attempt to create/modify the connector? (There's a rough sketch of this at the end of this message.)
> > > > > > > >
> > > > > > > > 2. Right now it's possible that we'll perform two connector config validations per create/modify request: once on the worker that initially receives the request, and then again if that worker is not the leader of the cluster and has to forward the request to the leader. Any thoughts on optimizing this to only require a single validation per request? We probably wouldn't want to force all validations to take place on the leader (that could lead to overloading it pretty quickly in large clusters), but we could add an internal-only query parameter to skip validation and then use that parameter when forwarding requests from followers to the leader.
> > > > > > > >
> > > > > > > > 3. A worker property is pretty coarse-grained and difficult to change. We might allow per-request toggling of the timeout by adding a URL query parameter like '?timeout=90s' to the REST API, to allow tweaking of the timeout on a more granular basis and without having to perform a worker restart.
> > > > > > > >
> > > > > > > > I'd also like to clarify a point about the rejected alternative "Allow configuring producer zombie fencing admin request timeout" - is the implication here that the `rest.api.request.timeout.ms` property will not control the REST timeout for requests to the `PUT /connectors/{connector}/fence` endpoint, or just that it won't control the timeout that we use for the call to Admin::fenceProducers?
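> > > > > > > > To illustrate alternative 1 (names invented; this is just a sketch of the idea, not a proposed implementation):
> > > > > > > >
> > > > > > > >     import java.util.Map;
> > > > > > > >
> > > > > > > >     // Sketch: abort the create/modify attempt if the HTTP request
> > > > > > > >     // deadline passed while validation was still running.
> > > > > > > >     class AbortAfterSlowValidation {
> > > > > > > >         void createConnector(String name, Map<String, String> config, long requestDeadlineMs) {
> > > > > > > >             validateConnectorConfig(config); // may run past the HTTP request timeout
> > > > > > > >             if (System.currentTimeMillis() > requestDeadlineMs) {
> > > > > > > >                 // The caller has already received a 500 by now; don't
> > > > > > > >                 // quietly create the connector anyway.
> > > > > > > >                 throw new IllegalStateException("Request timed out during validation; aborting");
> > > > > > > >             }
> > > > > > > >             writeToConfigTopic(name, config);
> > > > > > > >         }
> > > > > > > >
> > > > > > > >         void validateConnectorConfig(Map<String, String> config) { } // stub
> > > > > > > >         void writeToConfigTopic(String name, Map<String, String> config) { } // stub
> > > > > > > >     }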
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Chris
> > > > > > > >
> > > > > > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > I'd like to start a discussion thread on this small KIP -
> > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > > > > > > >
> > > > > > > > > It proposes the addition of a new Kafka Connect worker configuration to allow configuring REST API request timeouts.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Yash
