Just to clarify, I am not suggesting that we need to set default.api.timeout.ms to MAX_LONG. I am currently thinking that we may want to use 60 seconds as the timeout value, but additionally keep retrying the offset commit in e.g. onPartitionsRevoked(); if MM cannot finish the offset commit after a certain number of retries, MM should fail fast.
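The retry-then-fail-fast idea above could be sketched roughly as below. This is only an illustrative sketch, not code from the PR: the commit attempt is simulated with a supplier, the constant MAX_COMMIT_RETRIES is a hypothetical name, and in MirrorMaker the attempt would actually wrap consumer.commitSync() inside onPartitionsRevoked().

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of bounded retries around an offset commit,
// failing fast once the retry budget is exhausted.
public class CommitRetrySketch {
    // Illustrative retry budget; not a real MM config.
    static final int MAX_COMMIT_RETRIES = 3;

    // Returns true if the commit succeeded within the budget,
    // false if the caller should fail fast (e.g. exit the MM instance).
    static boolean commitWithRetries(BooleanSupplier commitAttempt) {
        for (int attempt = 1; attempt <= MAX_COMMIT_RETRIES; attempt++) {
            if (commitAttempt.getAsBoolean()) {
                return true; // commit succeeded within the retry budget
            }
            System.out.println("commit attempt " + attempt + " timed out, retrying");
        }
        return false; // budget exhausted: fail fast
    }

    public static void main(String[] args) {
        // Simulated commit that times out twice, then succeeds on the third try.
        AtomicInteger calls = new AtomicInteger();
        boolean ok = commitWithRetries(() -> calls.incrementAndGet() >= 3);
        System.out.println(ok ? "committed" : "fail fast");

        // Simulated commit that always times out: fail fast after 3 attempts.
        boolean ok2 = commitWithRetries(() -> false);
        System.out.println(ok2 ? "committed" : "fail fast");
    }
}
```

The key design point is that the retry loop bounds how long a rebalance can block on a failing commit, while still tolerating transient timeouts.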
We can discuss this based on the reason for the offset commit timeout. If the offset commit timeout is due to long request queue time in the broker, then even if the rebalance completes, the next broker will also not be able to commit offsets, and eventually all consumed data will be duplicated. In this case it seems reasonable to just let MM fail fast and let SRE investigate the performance issue in the broker, and possibly increase the default.api.timeout.ms value. If the offset commit failed due to a persistent network error on the given MM host, then regardless of whether this MM host times out or keeps retrying, its behavior should not affect the behavior of other MM hosts. And since this MM will not be able to communicate with other hosts, we would like this MM to fail fast, which the alternative approach will do after a certain number of retries.

[ Full content available at: https://github.com/apache/kafka/pull/5492 ]
This message was relayed via gitbox.apache.org for [email protected]
