We ended up adding the following to `Dial':

    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time: 10 * time.Second,
    })

This required bumping grpc to a commit that includes the fix in
https://github.com/grpc/grpc-go/pull/2307, which sets the
TCP_USER_TIMEOUT socket option on Linux.

On a side note, this issue doesn't affect Windows clients. It looks
like the default number of TCP retransmissions on Windows is much lower
than on GNU/Linux.
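
For anyone curious what setting TCP_USER_TIMEOUT looks like at the
socket level, here is a minimal, Linux-only sketch using just the
standard library. It is not the grpc-go code: the helper names
(`dialerWithUserTimeout', `probeUserTimeoutMillis') and the 20000 ms
value are made up for illustration, and the option number is declared
locally. As far as I understand, grpc-go derives the value from the
keepalive `Timeout' field, which defaults to 20 seconds when left unset
as in the snippet above.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// tcpUserTimeout is the Linux option number for TCP_USER_TIMEOUT. It is
// declared here only to keep the sketch stdlib-only;
// golang.org/x/sys/unix exports the same value as unix.TCP_USER_TIMEOUT.
const tcpUserTimeout = 0x12

// dialerWithUserTimeout (hypothetical helper) returns a dialer that sets
// TCP_USER_TIMEOUT, in milliseconds, on every socket before it connects.
func dialerWithUserTimeout(ms int) *net.Dialer {
	return &net.Dialer{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, tcpUserTimeout, ms)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
}

// probeUserTimeoutMillis (hypothetical helper) dials a throwaway loopback
// listener and reads the option back from the connected socket.
func probeUserTimeoutMillis(ms int) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	conn, err := dialerWithUserTimeout(ms).Dial("tcp", ln.Addr().String())
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	raw, err := conn.(*net.TCPConn).SyscallConn()
	if err != nil {
		return 0, err
	}
	var got int
	var sockErr error
	if err := raw.Control(func(fd uintptr) {
		got, sockErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_TCP, tcpUserTimeout)
	}); err != nil {
		return 0, err
	}
	return got, sockErr
}

func main() {
	// Ask the kernel to give up on unacknowledged data after 20 seconds.
	got, err := probeUserTimeoutMillis(20000)
	fmt.Println(got, err)
}
```

With the option set, the kernel drops the connection once transmitted
data has gone unacknowledged for that long, instead of waiting for the
full retransmission schedule to run out.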
John Shahid <[email protected]> writes:
> Hi all,
>
> We just ran into an interesting issue. We are using grpc-go for both
> the client and server implementations. There are two instances of the
> server deployed for HA. Clients use DNS name lookup and are usually
> split evenly between the two servers.
>
> One of the servers had a network issue and wasn't reachable (we were
> able to simulate this situation by adding an iptables rule to drop
> packets destined for one of the two servers). The DNS server immediately
> detects that one of the servers isn't reachable and removes it from the
> pool. What we observed is that clients connected to that instance
> keep getting "context deadline exceeded" errors for about 15 minutes.
> The tcpdump output shows multiple retransmission attempts. The client
> eventually (after ~15 minutes) reconnects to the healthy instance.
>
> Is there a way to speed up the fail over without changing the number of
> TCP retransmissions in `/proc/sys/net/ipv4/tcp_retries2' ?
>
> Thanks,
>
> JS
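
The roughly 15 minutes quoted above is consistent with the Linux
default of 15 for tcp_retries2. The sketch below adds up the
retransmission waits under a simplified model: the timeout starts at
the kernel minimum of 200 ms, doubles after every attempt, and is
capped at 120 s. A real connection derives its timeout from the
measured round-trip time, so treat the result as a lower bound. The
function name is made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Linux kernel bounds on the retransmission timeout (TCP_RTO_MIN and
// TCP_RTO_MAX). A real RTO is derived from the measured round-trip time.
const (
	rtoMin = 200 * time.Millisecond
	rtoMax = 120 * time.Second
)

// estimateRetransmitTimeout (hypothetical helper) sums the wait after the
// original transmission and after each of `retries` retransmissions,
// doubling the timeout each time and capping it at ceiling.
func estimateRetransmitTimeout(retries int, initial, ceiling time.Duration) time.Duration {
	total := time.Duration(0)
	rto := initial
	for i := 0; i <= retries; i++ {
		total += rto
		rto *= 2
		if rto > ceiling {
			rto = ceiling
		}
	}
	return total
}

func main() {
	// 15 is the documented default of net.ipv4.tcp_retries2.
	fmt.Println(estimateRetransmitTimeout(15, rtoMin, rtoMax)) // prints 15m24.6s
}
```

The first ten waits (0.2 s up to 102.4 s) add up to 204.6 s and the
remaining six are capped at 120 s each, giving 924.6 s in total.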