[ 
https://issues.apache.org/jira/browse/GEODE-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Donal Evans updated GEODE-9808:
-------------------------------
    Labels: pull-request-available  (was: needsTriage pull-request-available)

> Client ops fail with NoLocatorsAvailableException when all servers leave the 
> DS 
> --------------------------------------------------------------------------------
>
>                 Key: GEODE-9808
>                 URL: https://issues.apache.org/jira/browse/GEODE-9808
>             Project: Geode
>          Issue Type: Bug
>          Components: client/server
>    Affects Versions: 1.15.0
>            Reporter: Bill Burcham
>            Assignee: Donal Evans
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> When there are no cache servers (only locators) in a cluster, client 
> operations will fail with a misleading exception:
> {noformat}
> org.apache.geode.cache.client.NoAvailableLocatorsException: Unable to connect to any locators in the list [gemfire-cluster-locator-0.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-1.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334, gemfire-cluster-locator-2.gemfire-cluster-locator.namespace-1850250019.svc.cluster.local:10334]
>     at org.apache.geode.cache.client.internal.AutoConnectionSourceImpl.findServer(AutoConnectionSourceImpl.java:174)
>     at org.apache.geode.cache.client.internal.ConnectionFactoryImpl.createClientToServerConnection(ConnectionFactoryImpl.java:211)
>     at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.createPooledConnection(ConnectionManagerImpl.java:196)
>     at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.forceCreateConnection(ConnectionManagerImpl.java:227)
>     at org.apache.geode.cache.client.internal.pooling.ConnectionManagerImpl.exchangeConnection(ConnectionManagerImpl.java:365)
>     at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:161)
>     at org.apache.geode.cache.client.internal.OpExecutorImpl.execute(OpExecutorImpl.java:120)
>     at org.apache.geode.cache.client.internal.PoolImpl.execute(PoolImpl.java:805)
>     at org.apache.geode.cache.client.internal.PutOp.execute(PutOp.java:91)
> {noformat}
> Even though the client is able to connect to a locator, we encounter a 
> NoAvailableLocatorsException with the message "Unable to connect to any 
> locators in the list".
> Investigating the product code, we see:
>  # If there are no cache servers in the cluster, ServerLocator.pickServer() 
> will definitely construct a ClientConnectionResponse(null) which causes that 
> object’s hasResult() to respond with false in the loop termination in 
> AutoConnectionSourceImpl.queryLocators()
>  # The exception wording is misleading not only in 
> AutoConnectionSourceImpl.findServer() but also in at least two other 
> calling locations in AutoConnectionSourceImpl: findReplacementServer() and 
> findServersForQueue().
>  # In each of those cases the calling method translates a null response from 
> queryLocators() into a throw of a NoAvailableLocatorsException
>  # An appropriate exception, NoAvailableServersException, already exists for 
> the case where we were able to contact a locator but the locator was not able 
> to find any cache servers
>  # According to my Git spelunking queryLocators() has been obfuscating the 
> true cause of the failure since at least 2015
> Without analyzing ServerLocator.pickServer() 
> (LocatorLoadSnapshot.getServerForConnection()) to discern why two locators 
> might disagree on how many cache servers are in the cluster, it seems to me 
> that we should modify AutoConnectionSourceImpl.queryLocators() so that:
>  * if it gets a ServerLocationResponse with hasResult() true, it immediately 
> returns that as it does now
>  * otherwise it keeps trying and it keeps track of the last (non-null) 
> ServerLocationResponse it has received
>  * it returns the last non-null ServerLocationResponse it received (otherwise 
> it returns null)
> With that in hand, we can change the three call locations in 
> AutoConnectionSourceImpl: findServer(), findReplacementServer(), and 
> findServersForQueue() to each throw NoAvailableLocatorsException if no 
> locator responded, or NoAvailableServersException if a locator responded with 
> a ClientConnectionResponse for which hasResult() returns false.
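
The proposal above could be sketched roughly as follows. This is a simplified, self-contained illustration, not Geode's actual code: the Response class, the two nested exception classes, and the modelling of each locator as a Supplier that returns null when unreachable are all hypothetical stand-ins for ServerLocationResponse, the real Geode exceptions, and the real locator requests. The method signatures are likewise assumptions made for the sketch.

```java
import java.util.List;
import java.util.function.Supplier;

public class QueryLocatorsSketch {

  /** Hypothetical stand-in for ServerLocationResponse / ClientConnectionResponse. */
  public static final class Response {
    private final String server; // null: the locator knew of no cache servers

    public Response(String server) {
      this.server = server;
    }

    public boolean hasResult() {
      return server != null;
    }

    public String getServer() {
      return server;
    }
  }

  /** Stand-in for Geode's exception of the same name. */
  public static class NoAvailableLocatorsException extends RuntimeException {
    public NoAvailableLocatorsException(String message) {
      super(message);
    }
  }

  /** Stand-in for Geode's exception of the same name. */
  public static class NoAvailableServersException extends RuntimeException {
    public NoAvailableServersException(String message) {
      super(message);
    }
  }

  /**
   * Asks each locator in turn. Each supplier models one locator and returns
   * null when that locator could not be reached (an assumption of this sketch).
   */
  public static Response queryLocators(List<Supplier<Response>> locators) {
    Response lastResponse = null;
    for (Supplier<Response> locator : locators) {
      Response response = locator.get();
      if (response == null) {
        continue; // locator unreachable, keep trying
      }
      if (response.hasResult()) {
        return response; // same as today: first usable answer wins
      }
      lastResponse = response; // remember that some locator did answer
    }
    return lastResponse; // null only if no locator responded at all
  }

  /** Caller-side translation of the outcome into the appropriate exception. */
  public static String findServer(List<Supplier<Response>> locators) {
    Response response = queryLocators(locators);
    if (response == null) {
      throw new NoAvailableLocatorsException("Unable to connect to any locators in the list");
    }
    if (!response.hasResult()) {
      throw new NoAvailableServersException("Locator responded but found no cache servers");
    }
    return response.getServer();
  }

  public static void main(String[] args) {
    // One unreachable locator, one reachable locator that knows of no servers.
    List<Supplier<Response>> locatorsOnly = List.of(() -> null, () -> new Response(null));
    try {
      findServer(locatorsOnly);
    } catch (NoAvailableServersException e) {
      System.out.println("NoAvailableServersException: " + e.getMessage());
    }
  }
}
```

In this sketch, findReplacementServer() and findServersForQueue() would apply the same two-step check as findServer(), so a cluster with reachable locators but no cache servers surfaces as NoAvailableServersException rather than NoAvailableLocatorsException.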



--
This message was sent by Atlassian Jira
(v8.20.1#820001)