Todd Lipcon commented on KUDU-2343:

The issue here appears to be that the ConnectionCache is using the server UUID 
as the key. In the case of the masters, the client does not actually use a UUID 
to identify a server, so even though it learns that the leader master has 
changed, when it attempts to send an RPC to it, it mistakenly pulls the 
_wrong connection_ out of the cache. Thus it thinks it's sending an RPC to the 
new leader but still sends it to the old one, which faithfully responds that it 
is not the leader. This goes on until the client is restarted (or the old 
leader master happens to regain leadership).
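
To illustrate the failure mode, here is a minimal sketch. It is not the actual 
Kudu client code: every class, method, and host name below is hypothetical, and 
it assumes the masters end up sharing a single placeholder key because the 
client has no real UUID for them. Keying by address is shown only as one 
possible alternative, not necessarily the fix that will be committed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names come from the Kudu Java client. */
public class ConnectionCacheSketch {

  /** Stand-in for an open connection; it only records which host it talks to. */
  public static final class Connection {
    public final String hostPort;
    Connection(String hostPort) { this.hostPort = hostPort; }
  }

  /** Cache keyed by server UUID, as described in the comment above. */
  public static final class UuidKeyedCache {
    private final Map<String, Connection> byUuid = new HashMap<>();

    public Connection get(String uuid, String hostPort) {
      // hostPort is only consulted when no entry exists for this key yet,
      // so an existing entry wins even if it points at a different host.
      return byUuid.computeIfAbsent(uuid, k -> new Connection(hostPort));
    }
  }

  /** One possible alternative: key the cache by the address being dialed. */
  public static final class AddressKeyedCache {
    private final Map<String, Connection> byAddress = new HashMap<>();

    public Connection get(String hostPort) {
      return byAddress.computeIfAbsent(hostPort, Connection::new);
    }
  }

  public static void main(String[] args) {
    // Assumption: all masters share this placeholder key on the client side.
    String placeholderUuid = "master";

    UuidKeyedCache buggy = new UuidKeyedCache();
    buggy.get(placeholderUuid, "master-1:7051");               // old leader
    Connection c1 = buggy.get(placeholderUuid, "master-2:7051"); // new leader
    System.out.println("uuid-keyed:    " + c1.hostPort);       // master-1:7051

    AddressKeyedCache byAddr = new AddressKeyedCache();
    byAddr.get("master-1:7051");
    Connection c2 = byAddr.get("master-2:7051");
    System.out.println("address-keyed: " + c2.hostPort);       // master-2:7051
  }
}
```

In the UUID-keyed variant the second lookup returns the connection to the old 
leader even though the caller asked for the new one, which matches the retry 
loop described in the report.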

I checked this back a bunch of versions and it appears it was introduced 
between 1.2 and 1.3 when we did some pretty serious refactoring on the Java 

> Java client doesn't properly reconnect to leader master when old leader is 
> online
> ---------------------------------------------------------------------------------
>                 Key: KUDU-2343
>                 URL: https://issues.apache.org/jira/browse/KUDU-2343
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, java
>    Affects Versions: 1.3.1, 1.4.0, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> In the following sequence of events, the Java client doesn't properly fail 
> over to locate a new master, and in fact gets "stuck" until the client is 
> restarted:
> - client connects to the cluster and caches the master locations
> - client opens a table and caches tablet locations
> - the master fails over to a new leader
> - the tablet either goes down or fails over, causing the client to need to 
> update its tablet locations
> In this case, it gets stuck in a retry loop where it will never be able to 
> connect to the new leader master.

This message was sent by Atlassian JIRA
