The parameter for the poll interval is max.poll.interval.ms, with a default value of 300s. I can see 2 different ways of coding this without introducing a KIP or some complex logic:
1. Increase READ_WRITE_TOTAL_TIMEOUT_MS to a somewhat larger value, e.g. 120s (which should still be well under the max.poll.interval.ms default of 300s); but if 120s is more than half of max.poll.interval.ms, decrease it to max.poll.interval.ms / 2.
2. Set READ_WRITE_TOTAL_TIMEOUT_MS = max(30s, max.poll.interval.ms / 2). (This honors the existing READ_WRITE_TOTAL_TIMEOUT_MS floor of at least 30s.)

I think option 2 is easier to understand. A more flexible way is to introduce a new config parameter for READ_WRITE_TOTAL_TIMEOUT_MS, but that needs a KIP.

On Friday, March 13, 2026 at 06:32:28 AM PDT, Chris Egerton <[email protected]> wrote:

I wonder if the optimal value would be related to the cluster settings? For example, we could make it the expected poll interval (don't recall the name of the exact property, sorry!) minus 5 seconds, and then if a write fails due to timeout, instruct users to increase the poll interval but note that excessively large values for it may delay the detection of unhealthy workers in the cluster. Does that sound reasonable?

On Fri, Mar 13, 2026, 02:59 Henry Haiying Cai <[email protected]> wrote:

> Thanks Chris for the quick reply.
>
> For the patch to increase the hard-coded value of READ_WRITE_TOTAL_TIMEOUT_MS, which value would be acceptable: 60s, 120s, 180s, 240s, or 300s?
>
> In the meantime we will also try to increase batch.size and linger.ms for the producer.
>
> On Thursday, March 12, 2026 at 07:12:21 PM PDT, Chris Egerton <[email protected]> wrote:
>
> Hi,
>
> I think it'd be fine to increase the hardcoded value in a patch PR. Making it configurable (possibly even as a function of the number of tasks?) would be nice, but we'd need a KIP to make that change.
>
> We can't make it too high, though, or the worker may fall out of the cluster due to taking too long in between polls.
>
> For now, just as a temporary workaround, you (and anyone else running into this) can possibly try tuning the producer config to use a non-zero linger time, higher batch size, etc. in order to achieve higher throughput.
>
> Cheers,
>
> On Thu, Mar 12, 2026, 22:03 Henry Haiying Cai via dev <[email protected]> wrote:
>
>> Hi kafka-connect folks,
>>
>> READ_WRITE_TOTAL_TIMEOUT_MS is currently hardcoded as 30 seconds in kafka-connect's KafkaConfigBackingStore.java (https://github.com/a0x8o/kafka/blob/master/connect/runtime/src/main/java/org/apache/kafka/connect/storage/KafkaConfigBackingStore.java#L244). This parameter controls the upper bound on the time allowed to write configuration changes to kafka-connect's config.storage topic (this topic is usually configured as a one-partition topic).
>>
>> We have topics with more than 500 partitions, and connectors with more than 500 tasks (one task for each kafka partition). When a consumer rebalance happens, the kafka-connect code needs to finish writing all task configs into the config.storage topic within READ_WRITE_TOTAL_TIMEOUT_MS, and sometimes that is not enough time.
>>
>> Can we increase READ_WRITE_TOTAL_TIMEOUT_MS to a larger value, e.g. 300 seconds? Or better, make it a configurable parameter?
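For reference, option 2 from the top of the thread can be sketched in a few lines of Java. The class and method names here are hypothetical (they do not exist in Kafka Connect); the 30s floor mirrors the current hard-coded READ_WRITE_TOTAL_TIMEOUT_MS value:

```java
public class ReadWriteTimeoutSketch {

    // Hypothetical helper for option 2:
    // READ_WRITE_TOTAL_TIMEOUT_MS = max(30s, max.poll.interval.ms / 2)
    public static long readWriteTotalTimeoutMs(long maxPollIntervalMs) {
        long floorMs = 30_000L; // current hard-coded 30s lower bound
        return Math.max(floorMs, maxPollIntervalMs / 2);
    }

    public static void main(String[] args) {
        // With the max.poll.interval.ms default of 300s, the timeout
        // becomes 150s; with a very small poll interval, the 30s floor holds.
        System.out.println(readWriteTotalTimeoutMs(300_000L)); // 150000
        System.out.println(readWriteTotalTimeoutMs(40_000L));  // 30000
    }
}
```

The appeal of this formula is that it scales with the cluster's own liveness setting while never exceeding half the poll interval, so a slow config write cannot by itself push the worker out of the group.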

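The temporary workaround Chris suggests (non-zero linger, larger batches for the producer writing to config.storage) might look like the following sketch. linger.ms and batch.size are real Kafka producer configs, but the specific values here are illustrative assumptions, not recommendations from the thread:

```java
import java.util.Properties;

public class ConfigTopicProducerTuning {

    // Builds producer overrides that let the config-topic producer
    // coalesce many task-config records into fewer, larger requests.
    public static Properties tunedOverrides() {
        Properties props = new Properties();
        // Wait briefly before sending so records batch together
        // (default linger.ms is 0, i.e. send immediately).
        props.put("linger.ms", "50");
        // Allow larger batches than the 16 KB default.
        props.put("batch.size", String.valueOf(256 * 1024));
        return props;
    }
}
```

With 500+ task configs written during a rebalance, batching trades a small per-record delay for far fewer produce requests, which can make the difference between finishing inside READ_WRITE_TOTAL_TIMEOUT_MS and timing out.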