We ended up adding the following to `Dial':

                grpc.WithKeepaliveParams(keepalive.ClientParameters{
                        Time: 10 * time.Second,
                })

This required bumping grpc to a commit that includes the fix in
https://github.com/grpc/grpc-go/pull/2307, which sets the
TCP_USER_TIMEOUT socket option on Linux.  On a side note, this issue
doesn't affect Windows clients.  It looks like the default number of
TCP retransmissions on Windows is much lower than on GNU/Linux.


John Shahid <[email protected]> writes:

> Hi all,
>
> We just ran into an interesting issue.  We are using grpc-go for both
> the client and server implementations.  There are two instances of the
> server deployed for HA.  Clients use DNS name lookup and are usually
> split evenly between the two servers.
>
> One of the servers had a network issue and wasn't reachable (we were
> able to simulate this situation by adding an iptables rule to drop
> packets destined to one of the two servers).  The DNS server immediately
> detects that one of the servers isn't reachable and removes it from the
> pool.  What we observed is that clients connected to that instance
> keep getting "context deadline exceeded" errors for about 15 minutes.
> The tcpdump output shows multiple retransmission attempts.  The client
> eventually (after ~15 minutes) reconnects to the healthy instance.
>
> Is there a way to speed up the failover without changing the number of
> TCP retransmissions in `/proc/sys/net/ipv4/tcp_retries2'?
>
> Thanks,
>
> JS
