We have a Kafka 0.9 cluster consisting of 7 kafka-brokers colocated with 7 zookeepers.
Producers/consumers are going full tilt at this cluster, and we measure delays in response time while we run disruptive tests.

Delays in response time are 'acceptable' (a few seconds max) for the following disruptions on 1 or 2 of these hosts:

- Soft kill kafka-broker/zookeeper
- Hard kill kafka-broker/zookeeper
- Soft/hard reboot host

However, when we disrupt the network for 1 or 2 hosts using a sequence such as the one below, we see delays of ~30 seconds (1 host = topic leader) to ~60 seconds (2 hosts including topic leader):

- Shut down the network interface: '/etc/init.d/network stop'
- Sleep for a few minutes
- Start the network interface: '/etc/init.d/network start'

Those delays suggest that there is a 30-second timeout somewhere behind the 30-second delay, and that shortening this timeout might reduce the resulting delay. We have tried changing various timeouts, such as those listed below, but have had no success so far:

- replica.socket.timeout.ms
- request.timeout.ms
- zookeeper.session.timeout
- connections.max.idle.ms
- controller.socket.timeout.ms
- group.max.session.timeout.ms
- group.min.session.timeout.ms

Searches have not yielded any solutions. Any help/guidance is greatly appreciated.

Thanks and regards,

*--Guru Balse*
*Principal Software Engineer | Service Cloud | Salesforce.com | Landmark 10th @ HQ | 510-859-6975*
<http://smart.salesforce.com/sig/gbalse//us_mb/default/link.html>