symat commented on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-520381529
 
 
   I created a simple docker config with multiple virtual networks and managed 
to test the situation when some of the containers lose access to one of 
the virtual networks. I uploaded the docker related scripts / configs here: 
https://github.com/symat/zookeeper-docker-test
   
   During these manual tests I found some situations in which the previous 
patch didn't work:
   - The InitialMessage sent during leader election contained only a single 
election address. If this address was not reachable by the recipient of the 
InitialMessage, then the connection was never successfully initiated. I changed 
the format of the InitialMessage to send all the election addresses, and the 
other side will use only one that is reachable.
   - When an existing TCP connection to an electionAddress is broken, the 
server will try to send notification messages re-using the existing SendWorker 
threads. I would assume that the SendWorker.send() method should die when it 
tries to flush the output stream on a socket whose destination is already 
unreachable. However, for some reason it doesn't die (this could be 
investigated further). Anyway, I added a small check to the connection 
initiation logic to verify whether the existing destination in SendWorker is 
still reachable. If the destination is unreachable, then we can gracefully 
finish the SendWorker thread, and during the next connection attempt we will 
choose a destination that is reachable.
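
   As a rough illustration of the two points above (this is a sketch, not the 
code from the patch): the class name `ElectionAddressSketch`, the `|` 
delimiter, and all method names below are hypothetical and chosen only for the 
example. It shows one way a list of election addresses could be packed into a 
single message field, and how the receiving side could pick the first address 
that passes a reachability check. The check is passed in as a predicate, so a 
real TCP probe or a test stub can be used.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

public class ElectionAddressSketch {

    // Hypothetical wire format: "host1:port1|host2:port2".
    // (Assumption for this sketch only; IPv6 literals are not handled.)
    static String encode(List<InetSocketAddress> addresses) {
        StringBuilder sb = new StringBuilder();
        for (InetSocketAddress a : addresses) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(a.getHostString()).append(':').append(a.getPort());
        }
        return sb.toString();
    }

    // Parses the hypothetical format back into unresolved socket addresses.
    static List<InetSocketAddress> decode(String payload) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String part : payload.split("\\|")) {
            int idx = part.lastIndexOf(':');
            if (idx <= 0) {
                continue; // skip empty or malformed entries
            }
            String host = part.substring(0, idx);
            int port = Integer.parseInt(part.substring(idx + 1));
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    // Returns the first candidate accepted by the reachability check, if any.
    static Optional<InetSocketAddress> firstReachable(
            List<InetSocketAddress> candidates,
            Predicate<InetSocketAddress> isReachable) {
        return candidates.stream().filter(isReachable).findFirst();
    }

    // One possible reachability check: try a TCP connect with a timeout.
    static boolean canConnect(InetSocketAddress addr, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // Re-create the address so the host name gets resolved here.
            socket.connect(
                new InetSocketAddress(addr.getHostString(), addr.getPort()),
                timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

   With this shape, the same `firstReachable` call could serve both cases: 
choosing an address from a received InitialMessage, and re-checking whether 
the destination held by an existing sender thread is still usable, e.g. 
`firstReachable(decode(field), a -> canConnect(a, 1000))`.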
   
   With these modifications I was able to test the following situation 
successfully:
   
   1. starting a zookeeper with nodes, each server listening on two addresses 
(on two separate virtual networks)
   2. waiting for the initial leader election to happen
   3. removing the current leader from the virtual network that is used by the 
others as destination
   4. it took a few seconds until all the servers recognised the loss of 
connections, and in 5-15 seconds the connections were re-established and the 
new leader election finished
   
   I will think about how to unit test these features (or should we create 
some docker-based automated integration test?).
   
   In the meantime I would appreciate a deep review of these changes, as I am 
quite new to the ZooKeeper code...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
