[
https://issues.apache.org/jira/browse/HBASE-23269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976342#comment-16976342
]
HBase QA commented on HBASE-23269:
----------------------------------
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s{color} | {color:blue} Docker mode activated. {color} |
| {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 10s{color} | {color:red} https://github.com/apache/hbase/pull/840 does not apply to branch-1. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/in-progress/precommit-patchnames for help. {color} |
\\
\\
|| Subsystem || Report/Notes ||
| GITHUB PR | https://github.com/apache/hbase/pull/840 |
| JIRA Issue | HBASE-23269 |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-840/2/console |
| versions | git=2.17.1 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |
This message was automatically generated.
> Hbase crashed due to two versions of regionservers when rolling upgrading
> -------------------------------------------------------------------------
>
> Key: HBASE-23269
> URL: https://issues.apache.org/jira/browse/HBASE-23269
> Project: HBase
> Issue Type: Improvement
> Components: master
> Affects Versions: 1.4.0, 1.4.2, 1.4.9, 1.4.10, 1.4.11
> Reporter: Jianzhen Xu
> Assignee: Jianzhen Xu
> Priority: Critical
> Attachments: 9.png, image-2019-11-07-14-49-41-253.png,
> image-2019-11-07-14-50-11-877.png, image-2019-11-07-14-51-38-858.png
>
>
> Currently, when HBase has the rs_group function enabled and needs to be
> upgraded to a higher version, assignment of the meta table may fail, which
> eventually makes the whole cluster unavailable and drops availability to 0.
> This applies to all hbase-1.4.* versions that include the rs_group
> functionality; it also occurs when upgrading to 1.4 from a pre-1.4 version
> that carries the rs_group patch.
> When this happens during an upgrade:
> * When rolling-upgrading regionservers, the failure always occurs if the
> first rs to be upgraded is not in the same rs_group as the meta table.
> The phenomenon is as follows:
> !image-2019-11-07-14-50-11-877.png!
> !image-2019-11-07-14-51-38-858.png!
> The reason is as follows: during the rolling upgrade of the first
> regionserver node (denoted as RS1), RS1 starts up and re-registers to
> ZooKeeper; the master notices this through the watcher in RegionServerTracker
> and eventually reaches HMaster.checkIfShouldMoveSystemRegionAsync().
> The logic of this method is as follows:
>
> {code:java}
> public void checkIfShouldMoveSystemRegionAsync() {
>   new Thread(new Runnable() {
>     @Override
>     public void run() {
>       try {
>         synchronized (checkIfShouldMoveSystemRegionLock) {
>           // RS register on ZK after reports startup on master
>           List<HRegionInfo> regionsShouldMove = new ArrayList<>();
>           for (ServerName server : getExcludedServersForSystemTable()) {
>             regionsShouldMove.addAll(getCarryingSystemTables(server));
>           }
>           if (!regionsShouldMove.isEmpty()) {
>             List<RegionPlan> plans = new ArrayList<>();
>             for (HRegionInfo regionInfo : regionsShouldMove) {
>               RegionPlan plan = getRegionPlan(regionInfo, true);
>               if (regionInfo.isMetaRegion()) {
>                 // Must move meta region first.
>                 balance(plan);
>               } else {
>                 plans.add(plan);
>               }
>             }
>             for (RegionPlan plan : plans) {
>               balance(plan);
>             }
>           }
>         }
>       } catch (Throwable t) {
>         LOG.error(t);
>       }
>     }
>   }).start();
> }
> {code}
>
> # First, getExcludedServersForSystemTable() is executed: it finds the highest
> version among all regionservers and returns every RS below that version,
> labeled LowVersionRSList.
> # If the list from step 1 is not empty, iterate over it. If an rs carries
> regions of system tables, those regions are added to the list of regions that
> need to move. If the first rs upgraded at this point is not in the rs_group
> where the system table is located, the meta table's region is added to
> regionsShouldMove.
> # A RegionPlan is obtained for each region in regionsShouldMove, with the
> parameter forceNewPlan set to true:
> ## Get all regionservers whose version is below the highest version;
> ## Exclude the regionservers from step 1 from the set of online rs. The
> result is that only the already-upgraded rs remain in the collection, marked
> as destServers;
> ## Since forceNewPlan is true, the destination server is obtained through
> balancer.randomAssignment(region, destServers). Because the rs_group function
> is enabled, the balancer here is RSGroupBasedLoadBalancer. The logic in this
> method is as follows (a sketch follows this list):
> ### destServers from 3.2 is intersected with all online regionservers in the
> rs_group of the current region. When the region belongs to a system table
> that is not in the same rs_group, the result here is empty; in that case the
> destination regionserver is hard-coded to BOGUS_SERVER_NAME(localhost,1);
> Therefore, when the master tries to assign the system table's region to
> localhost,1, the assignment naturally fails. If this master logic has gone
> unnoticed and the problem occurs, upgrading any node in the rs_group where
> the system table is located will make the cluster recover automatically.
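> The following minimal sketch illustrates the two pieces of logic described in
> steps 1 and 3.3.1 above. The helper shapes (the version map, the pre-resolved
> set of the region's group servers) and the integer version comparison are
> simplifying assumptions for illustration, not the exact branch-1 code.
> {code:java}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ThreadLocalRandom;
> import org.apache.hadoop.hbase.HRegionInfo;
> import org.apache.hadoop.hbase.ServerName;
>
> public class AssignmentSketch {
>   // Step 1: every RS below the cluster's highest version is excluded as a
>   // destination for system-table regions.
>   static List<ServerName> getExcludedServersForSystemTable(Map<ServerName, Integer> versions) {
>     int highest = Collections.max(versions.values());
>     List<ServerName> lowVersionRSList = new ArrayList<>();
>     for (Map.Entry<ServerName, Integer> e : versions.entrySet()) {
>       if (e.getValue() < highest) {
>         lowVersionRSList.add(e.getKey());
>       }
>     }
>     return lowVersionRSList;
>   }
>
>   // Step 3.3.1: RSGroupBasedLoadBalancer-style randomAssignment. destServers
>   // (only already-upgraded servers) is intersected with the online servers
>   // of the region's rs_group; an empty intersection falls back to a bogus
>   // server that can never host the region, so the assignment fails. The
>   // group's online servers are passed in pre-resolved for brevity.
>   static ServerName randomAssignment(HRegionInfo region, List<ServerName> destServers,
>       Set<ServerName> groupOnlineServers) {
>     List<ServerName> candidates = new ArrayList<>(destServers);
>     candidates.retainAll(groupOnlineServers);
>     if (candidates.isEmpty()) {
>       // The hard-coded fallback described in this issue:
>       // BOGUS_SERVER_NAME(localhost,1).
>       return ServerName.valueOf("localhost", 1, -1);
>     }
>     return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
>   }
> }
> {code}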
> During an actual upgrade, you will rarely know about this problem without
> reading the master code. However, the official documentation does not state
> that, when the rs_group function is in use, the rs_group where the system
> tables are located needs to be upgraded first, so it is easy to walk into
> this process and eventually crash the cluster. According to the code comment,
> the system tables are assigned to the highest-version rs for compatibility
> purposes.
> Therefore, without changing the code logic, the official documentation could
> note that when a cluster using the rs_group function is upgraded, the
> rs_group hosting the system tables should be upgraded first.
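> Until such a note lands in the documentation, an operator can check up front
> which rs_group hosts the system tables and upgrade that group first. A
> minimal sketch, assuming the rsgroup admin client API from the hbase-rsgroup
> module (RSGroupAdminClient, getRSGroupInfoOfTable); exact signatures may vary
> across releases:
> {code:java}
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.rsgroup.RSGroupAdminClient;
> import org.apache.hadoop.hbase.rsgroup.RSGroupInfo;
>
> public class FindMetaGroup {
>   public static void main(String[] args) throws Exception {
>     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
>       RSGroupAdminClient rsGroupAdmin = new RSGroupAdminClient(conn);
>       // The rs_group that hbase:meta belongs to; upgrade these servers first.
>       RSGroupInfo group = rsGroupAdmin.getRSGroupInfoOfTable(TableName.META_TABLE_NAME);
>       if (group != null) {
>         System.out.println("Upgrade this group first: " + group.getName());
>         System.out.println("Its servers: " + group.getServers());
>       }
>     }
>   }
> }
> {code}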
>
>
>
>