Chris, thanks for the quick reply. One thing to clarify from the last time we talked: we resolved the incident last month by following your recommendation (increasing batch.size and linger.ms for the config topic's producer), but we worry that the issue might come back when we ramp up more KC connectors/tasks in the future. Since it will take quite some time and effort to get a fix into Kafka upstream, we want to address the issue proactively by giving ourselves another knob to turn during an incident.
In our environment it is more than 500 tasks. Those 500 tasks are for one topic in that cluster, but other topics also run through KC in the same cluster, and they all share the same config topic. The total is closer to 5000 tasks, and we plan to increase the parallelism further in the near future.
During the incident, the consumer groups kept rebalancing, a new coordinator was elected very frequently, and every task's config had to be rewritten to the config topic because of those rebalances. Several round-trip operations need to finish within that hard-coded 30 seconds (config topic consumer readToEnd, putting the task config into the config topic synchronously for each task, readToEnd again, writing the commit message, readToEnd again). The consumer's readToEnd can also take a long time for a newly elected coordinator when many messages have accumulated in the config topic because of rapid consumer group rebalances (and the compaction thread has not yet had time to compact the topic). We are worried about that hard-coded 30-second ceiling in a large cluster during an incident with many consumer rebalances, and we would like more headroom. Between the current hard-coded ceiling (30 seconds) and the default max.poll.interval.ms (300 seconds) used for unhealthy worker detection, there seems to be some room to tune.
As for your suggestion to allow tuning linger.ms only for the config topic and not for the status topic, this is already possible. You can specify connect.config.storage.linger.ms to do exactly that. As for setting a higher default linger.ms for the config topic, I am not sure that works for all users. Some users with small clusters might prefer a linger.ms of 0 for lower latency.
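To make the concern about the 30-second ceiling a bit more concrete, here is a rough back-of-the-envelope sketch of the round trips listed above. It is not Kafka Connect code. The function name and every latency figure are hypothetical placeholders chosen for illustration, not measurements from our cluster. Only the shape of the sequence (readToEnd, one synchronous put per task, readToEnd, commit message, readToEnd) comes from the description above.

```python
# Rough time-budget model for one task-config update sequence.
# All names and latency values are hypothetical, for illustration only.

def config_update_estimate_ms(num_tasks, put_ms, read_to_end_ms, commit_ms,
                              ceiling_ms=30_000):
    """Return (estimated total time, remaining headroom), both in ms."""
    total_ms = (
        read_to_end_ms          # initial readToEnd on the config topic
        + num_tasks * put_ms    # one synchronous put per task config
        + read_to_end_ms        # readToEnd after the task configs
        + commit_ms             # write the commit message
        + read_to_end_ms        # final readToEnd
    )
    return total_ms, ceiling_ms - total_ms


# Assumed figures: 5 ms per synchronous put, 2 s per readToEnd, 5 ms commit.
small = config_update_estimate_ms(500, put_ms=5, read_to_end_ms=2_000, commit_ms=5)
large = config_update_estimate_ms(5_000, put_ms=5, read_to_end_ms=2_000, commit_ms=5)

print("500 tasks:  total=%d ms, headroom=%d ms" % small)
print("5000 tasks: total=%d ms, headroom=%d ms" % large)
```

Under these assumed numbers, the 500-task case fits comfortably, while the 5000-task case lands just past the 30-second ceiling. The per-task synchronous put dominates, which is why batching helped, and a slow readToEnd on a newly elected coordinator eats into whatever headroom is left.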
