Just to clarify, I am not suggesting that we need to set default.api.timeout.ms 
to MAX_LONG. I am currently thinking that we may want to use 60 seconds as the 
timeout value but additionally keep retrying the offset commit in e.g. 
onPartitionsRevoked(); if MM cannot finish the offset commit after a certain 
number of retries, MM should fail fast.
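
To make the alternative concrete, below is a rough sketch (not the actual 
MirrorMaker code) of what "keep retrying in onPartitionsRevoked() and then fail 
fast" could look like. The retry count of 5 and the way the failure is surfaced 
are illustrative assumptions; each commitSync() attempt is itself bounded by 
default.api.timeout.ms (e.g. 60 seconds).

import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.TimeoutException;

// Illustrative sketch: retry the offset commit a bounded number of times inside
// onPartitionsRevoked(), then fail fast if it still cannot complete.
public class RetryingCommitListener implements ConsumerRebalanceListener {

    private static final int MAX_COMMIT_RETRIES = 5; // illustrative value

    private final Consumer<byte[], byte[]> consumer;
    private final Map<TopicPartition, OffsetAndMetadata> pendingOffsets;

    public RetryingCommitListener(Consumer<byte[], byte[]> consumer,
                                  Map<TopicPartition, OffsetAndMetadata> pendingOffsets) {
        this.consumer = consumer;
        this.pendingOffsets = pendingOffsets;
    }

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        for (int attempt = 1; attempt <= MAX_COMMIT_RETRIES; attempt++) {
            try {
                // Each attempt is bounded by default.api.timeout.ms.
                consumer.commitSync(pendingOffsets);
                return;
            } catch (TimeoutException e) {
                // Commit did not complete within the timeout; retry.
            }
        }
        // Could not commit after the allowed retries: fail fast so the operator
        // can investigate (e.g. broker request queue time or network issues).
        throw new IllegalStateException(
            "Offset commit failed after " + MAX_COMMIT_RETRIES + " attempts; failing fast");
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // No special handling needed on assignment in this sketch.
    }
}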

We can discuss this based on the reason for the offset commit timeout. If the 
offset commit times out due to long request queue time in the broker, then even 
after a rebalance completes, the consumer that takes over the partitions will 
also be unable to commit offsets, and ultimately all consumed data will be 
duplicated. In this case it seems reasonable to just let MM fail fast and let 
SREs investigate the performance issue in the broker and possibly increase the 
default.api.timeout.ms value.
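
For reference, the timeout under discussion is the consumer-side 
default.api.timeout.ms. A minimal illustration of setting it on the MM consumer 
follows; the bootstrap servers and group id are placeholders, and the class is 
only an example, not MM's actual configuration path.

import java.util.Properties;

// Illustrative only: consumer properties with default.api.timeout.ms at 60
// seconds, which bounds each blocking consumer call such as commitSync().
public class MirrorMakerConsumerConfigExample {
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");        // placeholder
        props.put("group.id", "mirror-maker-consumer-group");  // placeholder
        // An SRE could raise this value if commits time out because of long
        // broker request queue times.
        props.put("default.api.timeout.ms", "60000");
        return props;
    }
}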

If the offset commit fails due to a persistent network error on the given MM 
host, then regardless of whether that host times out or keeps retrying, its 
behavior should not affect the other MM instances. And since this MM will not 
be able to communicate with other hosts anyway, we would like it to fail fast, 
which the alternative approach achieves after a certain number of retries.

[ Full content available at: https://github.com/apache/kafka/pull/5492 ]