[ https://issues.apache.org/jira/browse/HBASE-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976335#comment-16976335 ]
HBase QA commented on HBASE-23269:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 10s{color} | {color:red} https://github.com/apache/hbase/pull/840 does not apply to branch-1. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/in-progress/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| GITHUB PR | https://github.com/apache/hbase/pull/840 |
| JIRA Issue | HBASE-23269 |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-840/1/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |


This message was automatically generated.

> Hbase crashed due to two versions of regionservers when rolling upgrading
> -------------------------------------------------------------------------
>
>                 Key: HBASE-23269
>                 URL: https://issues.apache.org/jira/browse/HBASE-23269
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.4.0, 1.4.2, 1.4.9, 1.4.10, 1.4.11
>            Reporter: Jianzhen Xu
>            Assignee: Jianzhen Xu
>            Priority: Critical
>         Attachments: 9.png, image-2019-11-07-14-49-41-253.png, image-2019-11-07-14-50-11-877.png, image-2019-11-07-14-51-38-858.png
>
>
> Currently, when HBase has the rs_group function enabled and is rolled to a higher version, assignment of the meta table may fail, which eventually makes the whole cluster unavailable and drops availability to 0. This applies to all hbase-1.4.* versions that include the rs_group functionality; it also appears when the rs_group patch was back-ported to a pre-1.4 version and the cluster is then upgraded to 1.4.
> When this happens during an upgrade:
>  * When rolling upgrading regionservers, it is guaranteed to occur if the first RS to be upgraded is not in the same rs_group as the meta table.
> The phenomenon is as follows:
> !image-2019-11-07-14-50-11-877.png!
> !image-2019-11-07-14-51-38-858.png!
> The reason is as follows: during a rolling upgrade, the first regionserver node (denoted as RS1) starts up and re-registers on ZooKeeper; the master is notified through the watcher in RegionServerTracker and eventually reaches HMaster.checkIfShouldMoveSystemRegionAsync().
> The logic of this method is as follows:
>
> {code:java}
> public void checkIfShouldMoveSystemRegionAsync() {
>   new Thread(new Runnable() {
>     @Override
>     public void run() {
>       try {
>         synchronized (checkIfShouldMoveSystemRegionLock) {
>           // RS register on ZK after reports startup on master
>           List<HRegionInfo> regionsShouldMove = new ArrayList<>();
>           for (ServerName server : getExcludedServersForSystemTable()) {
>             regionsShouldMove.addAll(getCarryingSystemTables(server));
>           }
>           if (!regionsShouldMove.isEmpty()) {
>             List<RegionPlan> plans = new ArrayList<>();
>             for (HRegionInfo regionInfo : regionsShouldMove) {
>               RegionPlan plan = getRegionPlan(regionInfo, true);
>               if (regionInfo.isMetaRegion()) {
>                 // Must move meta region first.
>                 balance(plan);
>               } else {
>                 plans.add(plan);
>               }
>             }
>             for (RegionPlan plan : plans) {
>               balance(plan);
>             }
>           }
>         }
>       } catch (Throwable t) {
>         LOG.error(t);
>       }
>     }
>   }).start();
> }
> {code}
>
> # First, getExcludedServersForSystemTable() is executed: it gets the highest version among all regionservers and returns every RS whose version is below it, labeled LowVersionRSList.
> # If step 1 does not return an empty list, iterate over it. For each RS in it that carries a system-table region, add that region to the list of regions to move. Since the first RS upgraded at this point is not in the rs_group where the system tables are located, the meta region ends up in regionsShouldMove.
> # Get a RegionPlan for each region in regionsShouldMove, with the parameter forceNewPlan set to true:
> ## Get all regionservers whose version is below the highest version;
> ## Exclude the regionservers from 3.1 from all online regionservers. The result is that only already-upgraded RSs remain in the collection, marked as destServers;
> ## Since forceNewPlan is set to true, the destination server is obtained through balancer.randomAssignment(region, destServers). Since the rs_group function is enabled, the balancer here is RSGroupBasedLoadBalancer. The logic in this method is:
> ### The destServers obtained in 3.2 are intersected with all online regionservers in the rs_group of the current region. When the region belongs to a system table and the upgraded RS is not in the same rs_group, the intersection is empty. In that case the destination regionserver is hard-coded to BOGUS_SERVER_NAME (localhost,1); a simplified sketch of this step follows the list.
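> As a rough illustration (not the actual HBase source; the class, method names and version handling below are simplified stand-ins), the core of steps 1 and 3.3.1 can be sketched like this:
>
> {code:java}
> import java.util.ArrayList;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Map;
> import java.util.Random;
> import java.util.Set;
>
> public class RsGroupAssignmentSketch {
>
>   // Stand-in for the balancer's BOGUS_SERVER_NAME ("localhost,1"): the marker
>   // returned when no legal destination server can be found.
>   static final String BOGUS_SERVER_NAME = "localhost,1";
>
>   // Step 1: regionservers not running the highest version are excluded, and the
>   // system regions they host must be moved. (The real code compares version
>   // numbers properly; plain string equality is enough for this sketch.)
>   static List<String> excludedServersForSystemTable(Map<String, String> serverVersions,
>       String highestVersion) {
>     List<String> excluded = new ArrayList<>();
>     for (Map.Entry<String, String> e : serverVersions.entrySet()) {
>       if (!e.getValue().equals(highestVersion)) {
>         excluded.add(e.getKey());
>       }
>     }
>     return excluded;
>   }
>
>   // Step 3.3.1: intersect the candidate destinations (already-upgraded RSs) with
>   // the online servers of the region's rs_group. If the intersection is empty
>   // (the upgraded RS sits in a different group than the meta table), fall back
>   // to the bogus server name, so the assignment can only fail.
>   static String randomAssignment(List<String> destServers, Set<String> groupServers) {
>     List<String> candidates = new ArrayList<>(destServers);
>     candidates.retainAll(groupServers);
>     if (candidates.isEmpty()) {
>       return BOGUS_SERVER_NAME;
>     }
>     return candidates.get(new Random().nextInt(candidates.size()));
>   }
>
>   public static void main(String[] args) {
>     // rs1 has just been upgraded to 1.4.9; rs2 and rs3 (the rs_group hosting
>     // the meta table) are still on 1.4.2, so they are excluded as destinations.
>     Map<String, String> versions = Map.of("rs1", "1.4.9", "rs2", "1.4.2", "rs3", "1.4.2");
>     List<String> lowVersionRsList = excludedServersForSystemTable(versions, "1.4.9");
>
>     List<String> destServers = new ArrayList<>(versions.keySet());
>     destServers.removeAll(lowVersionRsList); // only the upgraded rs1 remains
>
>     Set<String> metaGroup = new HashSet<>(List.of("rs2", "rs3")); // meta's rs_group
>     // Prints "localhost,1": the meta region gets a plan that can never be executed.
>     System.out.println(randomAssignment(destServers, metaGroup));
>   }
> }
> {code}
>
> Running this sketch prints localhost,1: the only upgraded RS is outside the rs_group of the meta table, so the meta region is given a destination that can never be assigned.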
> Therefore, when the master tries to assign the system-table region to localhost,1, the assignment naturally fails. If you are not aware of this master logic and the problem occurs, you can upgrade any node in the rs_group where the system tables are located, and the cluster will recover automatically.
> During an actual upgrade you will rarely know about this problem without reading the master code. The official documentation does not state that, when the rs_group function is in use, the rs_group where the system tables are located needs to be upgraded first, so it is easy to run into this situation and eventually crash. (According to the comment in the code, the system tables are assigned to the highest-version RS for compatibility purposes.)
> Therefore, without changing the code logic, the official documentation could note that when a cluster with the rs_group function enabled is upgraded, the rs_group holding the system tables should be upgraded first.
>
>


--
This message was sent by Atlassian Jira
(v8.3.4#803005)