Sean Busbey created HBASE-22263:
-----------------------------------
Summary: Master creates duplicate ServerCrashProcedure on
initialization, leading to assignment hanging in region-dense clusters
Key: HBASE-22263
URL: https://issues.apache.org/jira/browse/HBASE-22263
Project: HBase
Issue Type: Bug
Components: proc-v2, Region Assignment
Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.5.0
Reporter: Sean Busbey
Assignee: Sean Busbey
h3. Problem:
During Master initialization we:
# restore existing procedures that still need to run from prior active Master instances
# look for signs that Region Servers have died and need to be recovered while we were out, and schedule a ServerCrashProcedure (SCP) for each of them
# turn on the assignment manager
The normal course of events for a ServerCrashProcedure is to attempt a bulk assignment of the regions that were on the dead RS, if possible. However, if the assignment manager isn’t ready yet, the procedure waits around and retries a bit later.
Note that currently #2 has no notion of whether or not a previous active Master instance has already done this check. This means we might schedule an SCP for a ServerName (host, port, start code) that already has an SCP scheduled. Ideally, such a duplicate should be a no-op.
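To make that idea concrete, here is a minimal sketch (plain Java, not the actual HBase scheduling code; the class and method names are hypothetical) of a submission keyed on ServerName, so that a second schedule request for the same server becomes a no-op:
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of "schedule at most one SCP per ServerName"; not HBase source.
class CrashProcedureScheduler {
  // ServerName rendered as "host,port,startcode" for this sketch.
  private final Set<String> scheduled = ConcurrentHashMap.newKeySet();

  /** Returns true only the first time a given server is scheduled. */
  boolean scheduleCrashProcedure(String serverName) {
    if (!scheduled.add(serverName)) {
      return false; // an SCP already exists for this server; the duplicate is a no-op
    }
    // ... submit the real ServerCrashProcedure here ...
    return true;
  }
}
{code}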
However, before step #2 schedules the SCP, it first marks the region server as dead and not yet processed, with the expectation that the SCP it just created will check whether there is log splitting work and then mark the server as ready for region assignment. At the same time, any restored SCPs that are already past the log splitting step will still be waiting for the AssignmentManager. As part of restoring themselves, they do not update the current Master instance to show that they are past the point of WAL processing.
Once the AssignmentManager starts in #3, the restored SCP continues; it will eventually get to the assignment phase and find that its server is marked as dead and in need of WAL processing. Such assignments are skipped with a log message, so as we iterate over the regions to assign we’ll skip all of them. This counter-intuitively shifts the “no-op” status from the newer SCP we scheduled at #2 to the older SCP that was restored in #1.
Bulk assignment works by sending the assign calls via a pool to allow more
parallelism. Once we’ve set up the pool we just wait to see if the region state
updates to online. Unfortunately, since all of the assigns got skipped, we’ll
never change the state for any of these regions. That means the bulk assign,
and the older SCP that started it, will wait until it hits a timeout.
By default the timeout for a bulk assignment is the smaller of {{(# Regions in the plan * 10s)}} or {{(# Regions in the most loaded RS in the plan * 1s + 60s + # of RegionServers in the cluster * 30s)}}. For even modest clusters with several hundred regions per region server, this means the “no-op” SCP will end up waiting tens of minutes (e.g. ~50 minutes for an average region density of 300 regions per region server on a 100 node cluster; ~11 minutes for 300 regions per region server on a 10 node cluster). During this time, the SCP will hold one of the available procedure execution slots for both the overall pool and for the specific server queue.
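For a rough sense of scale, the sketch below (plain Java, mirroring the formula quoted above rather than the actual HBase code) evaluates the default timeout for the two example clusters:
{code:java}
// Smaller of: (# regions in plan * 10s) or
// (# regions on the most loaded RS in the plan * 1s + 60s + # RegionServers * 30s).
public class BulkAssignTimeoutEstimate {
  static long timeoutSeconds(int regionsInPlan, int regionsOnBusiestRs, int regionServers) {
    long perPlan = regionsInPlan * 10L;
    long perServer = regionsOnBusiestRs * 1L + 60L + regionServers * 30L;
    return Math.min(perPlan, perServer);
  }

  public static void main(String[] args) {
    // Plan covers one dead server's ~300 regions on a 100-node cluster:
    // min(300*10, 300 + 60 + 100*30) = min(3000, 3360) = 3000s, i.e. ~50 minutes.
    System.out.println(timeoutSeconds(300, 300, 100) / 60 + " minutes");
    // Same region density on a 10-node cluster:
    // min(300*10, 300 + 60 + 10*30) = min(3000, 660) = 660s, i.e. ~11 minutes.
    System.out.println(timeoutSeconds(300, 300, 10) / 60 + " minutes");
  }
}
{code}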
As previously mentioned, restored SCPs retry their submission if the assignment manager has not yet been activated (done in #3); this can cause them to be scheduled after the newer SCPs (created in #2). Thus the order in which no-op and usable SCPs execute can vary from run to run of master initialization.
This means that unless you get lucky with SCP ordering, impacted regions will remain in transition (RIT) for an extended period of time. If you get particularly unlucky and a critical system table is included in the regions being recovered, then master initialization itself will end up blocked on this sequence of SCP timeouts. If there are enough of them to exceed the master initialization timeouts, the situation can become self-sustaining, as each additional master failover causes even more duplicative SCPs to be scheduled.
h3. Indicators:
* Master appears to hang; failing to assign regions to available region
servers.
* Master appears to hang during initialization; shows waiting for the meta or
namespace regions.
* Repeated master restarts allow some progress to be made on assignments for a
limited period of time.
* Master UI shows a large number of Server Crash Procedures in RUNNABLE state, and the count increases by roughly the number of Region Servers on each master restart.
* Master log shows a large number of messages saying that assignment of a region was skipped because the region was last seen on a dead but not yet processed region server. These messages come from the AssignmentManager logger. Normally this message should only occur when a Region Server dies just before some assignment is about to happen; when this combination of issues occurs, the message appears repeatedly, once per region, every time another defunct SCP is processed.
An example of the aforementioned message:
{code:java}
2019-04-15 11:19:04,610 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Skip assigning test6,5f89022e,1555251200249.946ff80e7602e66853d33899819983a1.,
it is on a dead but not processed yet server:
regionserver-8.hbase.example.com,22101,1555349031179
{code}
h3. Reproduction:
The procedure we currently have for reproducing this issue requires specific timings that can be hard to get right, so it might take multiple tries.
Before starting, the test cluster should have the following properties:
* Only 1 master. Not a strict requirement but it helps in the following steps.
* Hundreds of regions per region server. If you need more, fire up your preferred data generation tool and tell it to create a large enough table; the regions can be empty, there is no need to fill them with actual data (one way to create such a table is sketched after this list).
* At least ten times more region servers than available CPUs on the master. If the number of CPUs is too high, set {{hbase.master.procedure.threads}} in the safety valves to the number of region servers divided by 10. For example, if you have 24 cores and a 10-node cluster, set the configuration to 2 or 3.
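If you don’t have a data generation tool handy, one way to get a region-dense but empty table is to pre-split it via the client API. A minimal sketch against the 1.x Java client follows; the table name, column family, and region count are placeholders:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.RegionSplitter;

public class CreateRegionDenseTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("scp_repro"));
      desc.addFamily(new HColumnDescriptor("f"));
      // Pre-split into ~3000 empty regions, e.g. 10 region servers * 300 regions each.
      byte[][] splits = new RegionSplitter.HexStringSplit().split(3000);
      admin.createTable(desc, splits);
    }
  }
}
{code}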
Set up your environment the following way:
* Access to cluster-wide shutdown/startup via whatever control plane you use
* Access to master-role specific restart via whatever control plane you use
* A shell on the master node that tails the master log (use {{tail -F}} with a
capital F to ride over log rolls).
* A shell on the master node that’s ready to grab the HMaster’s pid and {{kill
-9}} it.
The procedure is:
* Start with a cluster as described above.
* Restart the entire HBase cluster.
* When the master log shows “Clean cluster startup. Assigning user regions”
and it starts assigning regions, {{kill -9}} the master.
* Stop the entire HBase cluster.
* Restart the entire HBase cluster.
* Once the master shows “Found regions out on cluster or in RIT”, {{kill -9}}
the master again.
* Restart only the Master.
* (Optionally repeat the above two steps)
You know you have hit the bug when the indicators listed above show up.
If it seems like the master is able to assign a lot of regions again, {{kill -9}} it and restart the Master role once more.
h3. Workaround:
The identified and tested workaround for mitigating this problem involves configuration tunings that speed up region assignment on the Region Servers and reduce the time spent by bulk assignment threads on the Master side; an example configuration sketch follows the list of properties below. We do not recommend these settings for normal operations.
* {{hbase.master.procedure.threads}} - This property defines the general procedure pool size and, in the context of this issue, the pool for executing SCPs. Increasing this pool allows SCPs for different Region Servers to run in parallel, so more region assignments can be processed. However, at the core of this problem is the fact that the Master may have multiple SCPs for the same Region Server; these are not run in parallel, so tuning this parameter alone is not sufficient. We recommend setting it to the number of Region Servers in the cluster, so that under normal scenarios, where there is one SCP per Region Server, all of them can run in parallel.
* {{hbase.bulk.assignment.perregion.open.time}} - This property determines how long a bulk assignment thread in the Master’s BulkAssigner waits for all of its regions to get assigned. Setting it to a value as low as 100 (milliseconds) allows the no-op SCPs to complete faster, which opens up execution slots for SCPs that can do actual assignment work.
* {{hbase.master.namespace.init.timeout}} - The master has a time limit for
how long it takes to assign the namespace table. Given that we want to limit
master restarts, this is better adjusted upwards.
Since this issue is especially pronounced on clusters with a large number of regions per region server, the following additional config can also help:
* {{hbase.regionserver.executor.openregion.threads}} - This controls the number of threads on each Region Server responsible for handling region assignment requests. Provided individual Region Servers are not already overloaded, tuning this value higher than the default (3) should help expedite region assignment.
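For reference, here is a minimal sketch of these settings expressed through the Java {{Configuration}} API; in a real deployment they would instead go into hbase-site.xml (the safety valves mentioned above) on the Master, and on the Region Servers for the open-region threads. The concrete values (100 Region Servers, a one-hour namespace init timeout, 10 open-region threads) are illustrative assumptions, not part of this issue beyond the guidance above:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScpWorkaroundSettings {
  public static void apply(Configuration conf, int regionServerCount) {
    // One procedure thread per Region Server, so one SCP per RS can run in parallel.
    conf.setInt("hbase.master.procedure.threads", regionServerCount);
    // Let no-op bulk assignments give up quickly (milliseconds per region).
    conf.setInt("hbase.bulk.assignment.perregion.open.time", 100);
    // Give namespace table assignment more headroom; example value of 1 hour.
    conf.setLong("hbase.master.namespace.init.timeout", 3600000L);
    // More open-region handlers on each Region Server than the default of 3.
    conf.setInt("hbase.regionserver.executor.openregion.threads", 10);
  }

  public static void main(String[] args) {
    apply(HBaseConfiguration.create(), 100); // assume a ~100 Region Server cluster
  }
}
{code}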
h3. Acknowledgements:
Thanks to the ton of folks who helped diagnose, chase down, and document this
issue, its reproduction, and the workaround. Especially [~jdcryans],
[~wchevreuil], [[email protected]], [~avirmani], Shamik Dave, [~esteban], and
[~elserj].