[ 
https://issues.apache.org/jira/browse/HBASE-25343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241377#comment-17241377
 ] 

Duo Zhang commented on HBASE-25343:
-----------------------------------

I think the problem here is that we should avoid accessing the failed 
meta replica again while it is unavailable.

Of course this is just a nice-to-have; with random selection we can usually 
find a healthy replica soon.

If you are still interested, you could look at how we clear the stale cache in 
the retrying caller. The logic is in AsyncRpcRetryingCaller.onError, where we 
pass in an updateCachedLocation action that takes a Throwable. The 
implementation of the action usually clears the location cache or the stale 
master stub, so that we do not use the wrong cache next time. Maybe we could do 
the same for CatalogReplicaLoadBalanceSelector. For example, in the 
AsyncNonMetaRegionLocator.locateInMeta method, after issuing a scan, the 
onError method of the AdvancedScanResultConsumer could check whether we are 
in load balance mode and, if so, get the replica id from the scan and 
tell the CatalogReplicaLoadBalanceSelector to disable this replica 
for a while.
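
To make the idea concrete, here is a rough, standalone sketch of the 
"disable for a while" bookkeeping. It is not the actual 
CatalogReplicaLoadBalanceSelector interface; the class name 
CoolDownReplicaSelector, its methods, the cool-down parameter and the 
injectable clock are all made up for illustration. It also assumes that 
falling back to replica id 0 (the primary) is acceptable when every replica is 
disabled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch only, not the HBase API: picks a random meta replica
 * and skips replicas that recently reported an error.
 */
class CoolDownReplicaSelector {
  private final int numReplicas;
  private final long coolDownMillis;
  // Injectable clock (milliseconds) so the behavior can be tested.
  private final LongSupplier clock;
  // Replica id -> timestamp (ms) until which the replica is skipped.
  private final Map<Integer, Long> disabledUntil = new ConcurrentHashMap<>();

  CoolDownReplicaSelector(int numReplicas, long coolDownMillis, LongSupplier clock) {
    this.numReplicas = numReplicas;
    this.coolDownMillis = coolDownMillis;
    this.clock = clock;
  }

  /** Would be called from the scan's onError callback with the failed replica id. */
  void onError(int replicaId) {
    disabledUntil.put(replicaId, clock.getAsLong() + coolDownMillis);
  }

  /** Returns true while the replica is still inside its cool-down window. */
  boolean isDisabled(int replicaId) {
    Long until = disabledUntil.get(replicaId);
    if (until == null) {
      return false;
    }
    if (clock.getAsLong() >= until) {
      // Cool-down expired: forget the entry so the replica is eligible again.
      disabledUntil.remove(replicaId, until);
      return false;
    }
    return true;
  }

  /** Randomly selects among the replicas that are not currently disabled. */
  int select() {
    List<Integer> candidates = new ArrayList<>();
    for (int id = 0; id < numReplicas; id++) {
      if (!isDisabled(id)) {
        candidates.add(id);
      }
    }
    if (candidates.isEmpty()) {
      // Assumption: fall back to replica id 0 when everything is disabled.
      return 0;
    }
    return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
  }
}
```

In the real client the onError call would come from the scan consumer in 
locateInMeta, and the cool-down period would probably need to be configurable.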

Hope this could help.

Thanks.

> Add HA support on top of Load Balance mode
> ------------------------------------------
>
>                 Key: HBASE-25343
>                 URL: https://issues.apache.org/jira/browse/HBASE-25343
>             Project: HBase
>          Issue Type: Sub-task
>          Components: meta replicas
>    Affects Versions: 2.4.0
>            Reporter: Huaxiang Sun
>            Assignee: Huaxiang Sun
>            Priority: Major
>             Fix For: 2.4.1
>
>
> This is a follow-up enhancement discussed with Stack and Duo. With the newly 
> introduced meta replica LoadBalance mode, if something is wrong with one of 
> the meta replica regions, the current logic keeps retrying until that meta 
> replica region is online again or it reports an error, i.e., there is no HA 
> in LoadBalance mode. HA can be implemented by reporting a timeout for that 
> meta replica region and trying another meta replica region.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
