[grpc-io] Fail over takes too long because of TCP retransmission

John Shahid Tue, 20 Nov 2018 07:35:11 -0800

Hi all,

We just ran into an interesting issue.  We are using grpc-go for both
the client and server implementations.  There are two instances of the
server deployed for HA.  Clients use DNS name lookup and are usually
split evenly between the two servers.


One of the servers had a network issue and wasn't reachable (we were
able to simulate this situation by adding an iptables rule to drop
packets destined for one of the two servers).  The DNS server
immediately detects that one of the servers isn't reachable and removes
it from the pool.  What we observed is that clients connected to that
instance keep getting "context deadline exceeded" errors for about 15
minutes.  tcpdump shows multiple retransmission attempts.  The client
eventually (after ~15 minutes) reconnects to the healthy instance.

Is there a way to speed up the failover without changing the number of
TCP retransmissions in `/proc/sys/net/ipv4/tcp_retries2'?

Thanks,

JS

