guozhangwang commented on pull request #11340: URL: https://github.com/apache/kafka/pull/11340#issuecomment-948136951
@RivenSun2 I talked to @hachikuji offline about the best options to fix this in the near term, and we feel the async-commit approach may be more appropriate here. But you'd need to be careful not to just try once, give up immediately, and continue the rebalance.

Here's the status quo:

* We try to commit for up to the configured rebalance timeout, and if we exhaust that timeout without succeeding (as in this case, where we keep getting the unknown topic partition error), we just log it and continue the rebalance.
* Note that we have a flag `needsJoinPrepare` in `AbstractCoordinator` which is set before the `onJoinPrepare` call. This means that if the call itself throws an error, we would not try to trigger `onJoinPrepare` again upon the next `poll`.

So to make async-commit work, here's a rough sketch of what we'd need to do:

* We keep a reference to the response future of the last commit sent as part of `onJoinPrepare`.
* In `maybeAutoCommitOffsetsSync`, which we would rename to `maybeAutoCommitOffsetsAsync`, we check whether the response future is `null`; if it is `null`, we send out the request and hold on to the `future`. Then we call `networkClient.poll` once and check whether the `future` has completed. If it has and there's no error, we return `true` from `maybeAutoCommitOffsetsAsync` to indicate that it has succeeded; otherwise we return `false`.
* When `maybeAutoCommitOffsetsAsync` returns `false`, `onJoinPrepare` would return `false` immediately as well, and the caller would then reset the `needsJoinPrepare` flag so that the next call still triggers `onJoinPrepare`, and then return to the `poll` call.

By doing that, the `poll` call would not block on the commit but would return immediately after a single attempt at the commit request. The user may need to call `poll` multiple times to complete the commit as part of `onJoinPrepare` before the rebalance can continue, but this would help resolve the issue of blocking for longer than the `poll` timeout.
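To make the proposed flow concrete, here is a minimal, self-contained Java sketch of it. It is a simplified model and not the actual `ConsumerCoordinator` code: the class `JoinPrepareSketch`, the `CommitSender` interface, and the `Runnable` standing in for a single `networkClient.poll` pass are all hypothetical names introduced for illustration. It also assumes that a failed commit is simply re-sent on the next `poll`, and it leaves out the rebalance timeout and any backoff.

```java
import java.util.concurrent.CompletableFuture;

/** Simplified model of the proposal; not the real Kafka coordinator classes. */
class JoinPrepareSketch {
    /** Hypothetical stand-in for sending an offset commit request. */
    interface CommitSender {
        CompletableFuture<Void> sendCommit();
    }

    private final CommitSender sender;
    private final Runnable networkPoll;            // stand-in for one networkClient.poll pass
    private CompletableFuture<Void> pendingCommit; // last commit future sent from onJoinPrepare
    private boolean needsJoinPrepare = true;

    JoinPrepareSketch(CommitSender sender, Runnable networkPoll) {
        this.sender = sender;
        this.networkPoll = networkPoll;
    }

    /** Sends the commit if none is in flight, polls once, and reports success. */
    boolean maybeAutoCommitOffsetsAsync() {
        if (pendingCommit == null) {
            pendingCommit = sender.sendCommit();
        }
        networkPoll.run(); // a single, non-blocking attempt to make progress
        if (!pendingCommit.isDone()) {
            return false; // still in flight; keep the future for the next poll
        }
        boolean succeeded = !pendingCommit.isCompletedExceptionally();
        // Assumption: on failure we drop the future so the next call re-sends.
        pendingCommit = null;
        return succeeded;
    }

    boolean onJoinPrepare() {
        return maybeAutoCommitOffsetsAsync();
    }

    /** One pass of poll; returns true once join preparation has completed. */
    boolean poll() {
        if (needsJoinPrepare) {
            needsJoinPrepare = false; // set before the call, as in the status quo
            if (!onJoinPrepare()) {
                needsJoinPrepare = true; // reset so the next poll triggers it again
                return false;            // return to the caller without blocking
            }
        }
        return true; // the rebalance can continue
    }
}
```

In this model, each `poll` makes at most one attempt at the commit and returns right away, so the caller has to keep calling `poll` until it returns `true`.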
As for the backing off, let's delegate that to KIP-580. WDYT?