We have a Kafka 0.9 cluster consisting of 7 Kafka brokers colocated with 7
ZooKeeper nodes.

Producers and consumers are running at full load against this cluster, and
we measure delays in response time while we run disruptive tests.
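
To make "delay in response time" concrete, here is a minimal sketch of one way
such a delay could be computed. It assumes the test client records a timestamp
(in seconds) for every successfully acknowledged request; the function name
`longest_gap` and the input format are hypothetical and not part of our actual
test harness.

```python
def longest_gap(ack_times):
    """Return (gap_seconds, gap_start) for the longest pause between acks.

    ack_times: iterable of timestamps (seconds) of successful responses.
    Returns (0.0, None) when fewer than two timestamps are given.
    """
    ordered = sorted(ack_times)
    best_gap, best_start = 0.0, None
    # Compare each ack with the one that follows it.
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap > best_gap:
            best_gap, best_start = gap, prev
    return best_gap, best_start
```

With a steady stream of acks, the longest gap during a disruption approximates
the stall that clients observe.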

Delays in response time are 'acceptable' (a few seconds at most) for the
following disruptions on 1 or 2 of these hosts:

   - Soft kill of Kafka broker/ZooKeeper
   - Hard kill of Kafka broker/ZooKeeper
   - Soft/hard reboot of the host


However, when we disrupt the network for 1 or 2 hosts using a sequence such
as the one below, we see delays of ~30 seconds (1 host, the topic leader) to
~60 seconds (2 hosts, including the topic leader):

   - Shut down a network interface, '/etc/init.d/network stop'.
   - Sleep for a few minutes
   - Start network interface, '/etc/init.d/network start'
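
The sequence above can be sketched as a small script. This is only an
illustration, assuming it runs as root on the host under test; the function
name `disrupt_network` and its `run`/`sleep` parameters are hypothetical and
exist mainly so the sequence can be dry-run without touching the network.

```python
import subprocess
import time


def disrupt_network(outage_seconds, run=subprocess.check_call, sleep=time.sleep):
    """Stop networking, wait outage_seconds, then start networking again."""
    run(["/etc/init.d/network", "stop"])
    try:
        sleep(outage_seconds)
    finally:
        # Always try to bring the network back, even if the wait is interrupted.
        run(["/etc/init.d/network", "start"])
```

For example, `disrupt_network(180)` would correspond to a three-minute outage.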


Those delays suggest that a 30-second timeout somewhere accounts for the
30-second delay, and that shortening this timeout might reduce the delay.

We have tried changing various timeouts, such as those listed below, but
have had no success so far:

   - replica.socket.timeout.ms
   - request.timeout.ms
   - zookeeper.session.timeout.ms
   - connections.max.idle.ms
   - controller.socket.timeout.ms
   - group.max.session.timeout.ms
   - group.min.session.timeout.ms

Searches have not yielded any solutions.  Any help/guidance is greatly
appreciated.

Thanks and regards,


*--Guru Balse*
*Principal Software Engineer |Service Cloud | Salesforce.com | Landmark
10th @ HQ  | 510-859-6975*


<http://smart.salesforce.com/sig/gbalse//us_mb/default/link.html>
