[
https://issues.apache.org/jira/browse/HBASE-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988400#comment-16988400
]
[~zhangduo] I proposeed a pr, which filters higher version instances in groups.
Is that appropriate?
> Hbase crashed due to two versions of regionservers when rolling upgrading
> -------------------------------------------------------------------------
>
> Key: HBASE-23269
> URL: https://issues.apache.org/jira/browse/HBASE-23269
> Project: HBase
> Issue Type: Improvement
> Components: master
> Affects Versions: 1.4.0, 1.4.2, 1.4.9, 1.4.10, 1.4.11
> Reporter: Jianzhen Xu
> Assignee: Jianzhen Xu
> Priority: Critical
> Attachments: 9.png, image-2019-11-07-14-49-41-253.png,
> image-2019-11-07-14-50-11-877.png, image-2019-11-07-14-51-38-858.png
>
>
> Currently, when HBase has the rs_group feature enabled and is upgraded to a
> higher version, assignment of the meta table may fail, which eventually makes
> the whole cluster unavailable and drops availability to 0. This applies to
> all hbase-1.4.* versions that include the rs_group feature, and also to
> pre-1.4 versions carrying the rs_group patch when they are upgraded to
> version 1.4.
> When this happens during an upgrade:
> * During a rolling upgrade of the regionservers, the problem always occurs if
> the first regionserver upgraded is not in the same rs_group as the meta table.
> The phenomenon is as follows:
> !image-2019-11-07-14-50-11-877.png!
> !image-2019-11-07-14-51-38-858.png!
> The reason is as follows: during the rolling upgrade of the first
> regionserver node (denoted RS1), RS1 starts up and re-registers with ZK; the
> master notices this through the watcher in RegionServerTracker and finally
> reaches the method HMaster.checkIfShouldMoveSystemRegionAsync().
> The logic of this method is as follows:
>
> {code:java}
> public void checkIfShouldMoveSystemRegionAsync() {
>   new Thread(new Runnable() {
>     @Override
>     public void run() {
>       try {
>         synchronized (checkIfShouldMoveSystemRegionLock) {
>           // RS register on ZK after reports startup on master
>           List<HRegionInfo> regionsShouldMove = new ArrayList<>();
>           for (ServerName server : getExcludedServersForSystemTable()) {
>             regionsShouldMove.addAll(getCarryingSystemTables(server));
>           }
>           if (!regionsShouldMove.isEmpty()) {
>             List<RegionPlan> plans = new ArrayList<>();
>             for (HRegionInfo regionInfo : regionsShouldMove) {
>               RegionPlan plan = getRegionPlan(regionInfo, true);
>               if (regionInfo.isMetaRegion()) {
>                 // Must move meta region first.
>                 balance(plan);
>               } else {
>                 plans.add(plan);
>               }
>             }
>             for (RegionPlan plan : plans) {
>               balance(plan);
>             }
>           }
>         }
>       } catch (Throwable t) {
>         LOG.error(t);
>       }
>     }
>   }).start();
> }{code}
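> The version-filter step named above, getExcludedServersForSystemTable(), can
> be illustrated with a minimal self-contained sketch (plain Java, not HBase
> code; the method name, server names, and the version map are illustrative
> assumptions): find the highest version among all regionservers and exclude
> every server strictly below it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of the "exclude servers below the highest version" rule
// described in the issue; not the actual HBase implementation.
public class ExcludedServersSketch {
    // Return the servers whose version is strictly below the highest
    // version seen across the cluster (the "LowVersionRSList").
    static List<String> excludedForSystemTables(Map<String, Integer> serverVersions) {
        int max = serverVersions.values().stream().max(Integer::compare).orElse(0);
        List<String> excluded = new ArrayList<>();
        for (Map.Entry<String, Integer> e : serverVersions.entrySet()) {
            if (e.getValue() < max) {
                excluded.add(e.getKey());
            }
        }
        return excluded;
    }

    public static void main(String[] args) {
        Map<String, Integer> versions = new LinkedHashMap<>();
        versions.put("rs1", 2); // the single node upgraded so far
        versions.put("rs2", 1);
        versions.put("rs3", 1);
        // As soon as one node is upgraded, every other node is "excluded",
        // so their system-table regions become candidates to move.
        System.out.println(excludedForSystemTables(versions)); // prints [rs2, rs3]
    }
}
```

> Note that the moment a single regionserver is upgraded, every other server
> is excluded, which is what drives the meta region move described below.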
>
> # First, getExcludedServersForSystemTable() is executed: it finds the highest
> version among all regionservers and returns every RS below that version,
> labeled LowVersionRSList.
> # If step 1 returns a non-empty list, iterate over it. If an RS carries a
> system-table region, add that region to the list of regions to move. If the
> first RS upgraded is not in the rs_group where the system tables are located,
> the meta region is added to regionsShouldMove.
> # Get a RegionPlan for each region in regionsShouldMove, with the parameter
> forceNewPlan set to true:
> ## Get all regionservers whose version is below the highest version;
> ## Exclude those regionservers from all online regionservers; only the
> already-upgraded RSs remain, marked as destServers;
> ## Since forceNewPlan is true, the destination server is obtained through
> balancer.randomAssignment(region, destServers). Since the rs_group feature is
> enabled, the balancer here is RSGroupBasedLoadBalancer, whose logic is:
> ### Intersect the destServers from step 3.2 with the online regionservers in
> the current region's rs_group. When the region belongs to a system table and
> the upgraded RS is not in the same rs_group, the result is empty; in that
> case the destination regionserver is hard-coded to BOGUS_SERVER_NAME
> (localhost,1).
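> Steps 3.2 and 3.3.1 above can be sketched together (a hedged plain-Java
> sketch, not the real RSGroupBasedLoadBalancer; only the BOGUS_SERVER_NAME
> value localhost,1 comes from the issue text, all other names are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Sketch of the group-aware random assignment fallback described in step 3.
public class GroupAssignSketch {
    // Per the issue text, the balancer falls back to this unresolvable name.
    static final String BOGUS_SERVER_NAME = "localhost,1";

    // destServers: online servers already at the highest version (step 3.2).
    // groupServers: online servers in the region's rs_group.
    static String randomAssignment(List<String> destServers,
                                   Set<String> groupServers,
                                   Random rnd) {
        List<String> candidates = new ArrayList<>();
        for (String s : destServers) {
            if (groupServers.contains(s)) {
                candidates.add(s);
            }
        }
        if (candidates.isEmpty()) {
            // No upgraded server inside the region's group: the destination
            // is a bogus server, so the assignment can never succeed.
            return BOGUS_SERVER_NAME;
        }
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        // Only rs1 has been upgraded, and it is outside the meta table's group.
        List<String> upgraded = List.of("rs1");
        Set<String> metaGroup = Set.of("rs2", "rs3");
        System.out.println(randomAssignment(upgraded, metaGroup, new Random()));
        // prints localhost,1 -- the doomed assignment that crashes the cluster
    }
}
```

> The empty intersection is the crux: until one server inside the system
> table's rs_group is upgraded, the fallback destination can never be resolved.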
> Therefore, when the master assigns a system-table region to localhost,1, the
> assignment naturally fails. If this master logic goes unnoticed and the
> problem occurs, upgrading any node in the rs_group that holds the system
> tables will make the cluster recover automatically.
> During an actual upgrade you will rarely know about this problem without
> reading the master code. However, the official documentation does not state
> that, when the rs_group feature is in use, the rs_group holding the system
> tables must be upgraded first, so it is easy to fall into this process and
> eventually crash. (According to the code comments, system tables are assigned
> to the highest-version RSs for compatibility reasons.)
> Therefore, even without changing the code logic, the official documentation
> could note that, when upgrading a cluster with the rs_group feature enabled,
> the rs_group holding the system tables should be upgraded first.
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)