I wonder if the optimal value would be related to the cluster settings? For example, we could make it the expected poll interval (don't recall the name of the exact property, sorry!) minus 5 seconds, and then if a write fails due to timeout, instruct users to increase the poll interval but note that excessively large values for it may delay the detection of unhealthy workers in the cluster.
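To make the idea concrete, here is a minimal sketch of deriving the write timeout from a poll-interval-style worker setting minus a 5-second margin. The method and constant names are hypothetical (the thread above does not pin down which worker property should drive this), and the floor value is an assumption to keep a very small poll interval from producing a useless timeout:

```java
// Sketch only: derive the config-topic write timeout from a worker-level
// poll interval instead of hardcoding 30 s. Names here are illustrative,
// not actual Kafka Connect identifiers.
public class TimeoutSketch {
    // Safety margin so the write gives up before the worker misses a poll.
    static final long SAFETY_MARGIN_MS = 5_000L;

    // Clamp so a very small poll interval cannot yield a non-positive timeout.
    static long readWriteTotalTimeoutMs(long pollIntervalMs) {
        return Math.max(pollIntervalMs - SAFETY_MARGIN_MS, 1_000L);
    }

    public static void main(String[] args) {
        System.out.println(readWriteTotalTimeoutMs(60_000L)); // 55000
        System.out.println(readWriteTotalTimeoutMs(3_000L));  // floor applies: 1000
    }
}
```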
Does that sound reasonable?

On Fri, Mar 13, 2026, 02:59 Henry Haiying Cai <[email protected]> wrote:

> Thanks Chris for the quick reply.
>
> For the patch to increase the hard-coded value of
> READ_WRITE_TOTAL_TIMEOUT_MS, which value would be acceptable: 60s, 120s,
> 180s, 240s, or 300s?
>
> In the meantime we will also try to increase the batch.size and linger.ms
> for the producer.
>
> On Thursday, March 12, 2026 at 07:12:21 PM PDT, Chris Egerton
> <[email protected]> wrote:
>
> Hi,
>
> I think it'd be fine to increase the hardcoded value in a patch PR. Making
> it configurable (possibly even as a function of the number of tasks?)
> would be nice, but we'd need a KIP to make that change.
>
> We can't make it too high, though, or the worker may fall out of the
> cluster due to taking too long between polls.
>
> For now, just as a temporary workaround, you (and anyone else running
> into this) can try tuning the producer config to use a non-zero linger
> time, a higher batch size, etc. in order to achieve higher throughput.
>
> Cheers,
>
> On Thu, Mar 12, 2026, 22:03 Henry Haiying Cai via dev
> <[email protected]> wrote:
>
> Hi kafka-connect folks,
>
> READ_WRITE_TOTAL_TIMEOUT_MS is currently hardcoded as 30 seconds in
> kafka-connect's KafkaConfigBackingStore.java (
> https://github.com/a0x8o/kafka/blob/master/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaConfigBackingStore.java#L244).
> This parameter sets an upper bound on the time allowed for writing
> configuration changes to kafka-connect's config.storage topic (which is
> usually configured as a one-partition topic).
>
> We have topics with more than 500 partitions and more than 500 connector
> tasks (one task for each kafka partition). When a consumer rebalance
> happens, kafka-connect needs to finish writing all of the task configs to
> the config.storage topic within READ_WRITE_TOTAL_TIMEOUT_MS, and
> sometimes that is not enough time.
>
> Can we increase READ_WRITE_TOTAL_TIMEOUT_MS to a larger value, e.g. 300
> seconds? Or, better, make it a configurable parameter?
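[Editor's note] The temporary workaround mentioned above (non-zero linger time, higher batch size) might look something like the following worker-config excerpt. The values are illustrative only, and whether prefixed overrides reach the internal config-topic producer depends on the Connect version in use, so treat this as a sketch rather than a confirmed fix:

```properties
# connect-distributed.properties (excerpt) -- illustrative values, not tuned
# recommendations. The "producer." prefix is how Connect worker configs pass
# settings through to producers it creates.
producer.linger.ms=100
producer.batch.size=262144
```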
