[
https://issues.apache.org/jira/browse/HBASE-22263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on HBASE-22263 started by Sean Busbey.
-------------------------------------------
> Master creates duplicate ServerCrashProcedure on initialization, leading to
> assignment hanging in region-dense clusters
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-22263
> URL: https://issues.apache.org/jira/browse/HBASE-22263
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 1.2.0, 1.3.0, 1.4.0, 1.5.0
> Reporter: Sean Busbey
> Assignee: Sean Busbey
> Priority: Blocker
>
> h3. Problem:
> During Master initialization we
> # restore existing procedures that still need to run from prior active
> Master instances
> # look for signs that Region Servers have died and need to be recovered
> while we were out, and schedule a ServerCrashProcedure (SCP) for each of them
> # turn on the assignment manager
> In the normal turn of events a ServerCrashProcedure will attempt to use a
> bulk assignment for the set of regions that were on the crashed RS, if
> possible. However, if the assignment manager isn’t ready yet, the SCP waits
> around and retries a bit later.
> Note that currently #2 has no notion of whether or not a previous active
> Master instance has already done this check. This means we might schedule an
> SCP for a ServerName (host, port, start code) that already has an SCP
> scheduled. Ideally, such a duplicate should be a no-op.
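> To illustrate, here is a minimal self-contained sketch of the dedup check
> that step #2 is missing (the {{ServerName}} record below is a stand-in for
> HBase's (host, port, start code) server identity, not the actual class):
> {code:java}
> import java.util.HashSet;
> import java.util.Set;
>
> // Minimal model of the duplicate-SCP scheduling described above; the
> // ServerName record stands in for HBase's (host, port, start code) identity.
> public class DuplicateScpSketch {
>   record ServerName(String host, int port, long startCode) {}
>
>   public static void main(String[] args) {
>     ServerName deadServer =
>         new ServerName("regionserver-8.hbase.example.com", 22101, 1555349031179L);
>
>     // Step #1: an SCP for this server was already restored from the
>     // previous active master's procedure store.
>     Set<ServerName> serversWithScp = new HashSet<>();
>     serversWithScp.add(deadServer);
>
>     // Step #2: the dead-server scan does not consult restored procedures;
>     // Set.add returning false is exactly the dedup signal it never checks.
>     boolean alreadyScheduled = !serversWithScp.add(deadServer);
>     if (alreadyScheduled) {
>       System.out.println("would schedule a duplicate SCP for " + deadServer);
>     }
>   }
> }
> {code}
> If step #2 consulted the already-scheduled procedures this way before
> submitting, the duplicate really would be a no-op.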
> However, before step #2 schedules the SCP it first marks the region server as
> dead and not yet processed, with the expectation that the SCP it just created
> will check whether there is log splitting work and then mark the server as
> processed, so that its regions are eligible for assignment. At the same time,
> any restored SCPs that are past the step of log splitting will still be
> waiting for the AssignmentManager. As part of being restored, they do not
> update the current master instance to show that they are already past the
> point of WAL processing.
> Once the AssignmentManager starts in #3, the restored SCP continues; it will
> eventually get to the assignment phase and find that its server is marked as
> dead and in need of WAL processing. Such assignments are skipped with a log
> message, so as we iterate over the regions to assign we skip all of them.
> This counterintuitively shifts the “no-op” status from the newer SCP we
> scheduled at #2 to the older SCP that was restored in #1.
> Bulk assignment works by sending the assign calls via a pool to allow more
> parallelism. Once we’ve set up the pool we just wait to see if the region
> state updates to online. Unfortunately, since all of the assigns got skipped,
> we’ll never change the state for any of these regions. That means the bulk
> assign, and the older SCP that started it, will wait until it hits a timeout.
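> As a toy model of that wait (hypothetical; the real BulkAssigner polls
> region states rather than using a latch):
> {code:java}
> import java.util.concurrent.CountDownLatch;
> import java.util.concurrent.TimeUnit;
>
> // Tiny model of the bulk-assign wait: each region coming online would
> // count the latch down, but since every assign in the plan was skipped
> // ("dead but not processed yet server"), countDown() is never invoked
> // and the bulk assigner waits out the full timeout.
> public class BulkAssignWaitSketch {
>   public static void main(String[] args) throws InterruptedException {
>     int regionsInPlan = 300;
>     CountDownLatch regionsOnline = new CountDownLatch(regionsInPlan);
>
>     boolean allOnline = regionsOnline.await(5, TimeUnit.SECONDS); // demo timeout
>     System.out.println("all regions online: " + allOnline); // prints false
>   }
> }
> {code}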
> By default the timeout for a bulk assignment is the smaller of {{(# Regions
> in the plan * 10s)}} or {{(# Regions in the most loaded RS in the plan * 1s +
> 60s + # of RegionServers in the cluster * 30s)}}. For even modest clusters
> with several hundred regions per region server, this means the “no-op” SCP
> will end up waiting tens of minutes (e.g. ~50 minutes for an average region
> density of 300 regions per region server on a 100-node cluster; ~11 minutes
> for 300 regions per region server on a 10-node cluster). During this time,
> the SCP will hold one of the available procedure execution slots for both the
> overall pool and for the specific server queue.
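> Those figures can be sanity-checked with a standalone sketch of the formula
> above, assuming the most loaded RS in each plan carries the average density
> of 300 regions (this is not the actual HBase implementation):
> {code:java}
> // Standalone sketch of the default bulk assignment timeout quoted above:
> // min(regionsInPlan * 10s, regionsOnMostLoadedRs * 1s + 60s + numRs * 30s).
> public class BulkAssignTimeoutSketch {
>   static long timeoutSeconds(int regionsInPlan, int regionsOnMostLoadedRs, int numRs) {
>     long perRegionBound = regionsInPlan * 10L;
>     long perServerBound = regionsOnMostLoadedRs * 1L + 60L + numRs * 30L;
>     return Math.min(perRegionBound, perServerBound);
>   }
>
>   public static void main(String[] args) {
>     // 100-node cluster: min(300*10, 300 + 60 + 100*30) = min(3000, 3360)
>     // = 3000 seconds, i.e. the ~50 minutes cited above.
>     System.out.println(timeoutSeconds(300, 300, 100) / 60 + " minutes");
>     // 10-node cluster: min(300*10, 300 + 60 + 10*30) = min(3000, 660)
>     // = 660 seconds, i.e. the ~11 minutes cited above.
>     System.out.println(timeoutSeconds(300, 300, 10) / 60 + " minutes");
>   }
> }
> {code}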
> As previously mentioned, restored SCPs will retry their submission if the
> assignment manager has not yet been activated (done in #3); this can cause
> them to be scheduled after the newer SCPs (created in #2). Thus the order of
> execution of no-op and usable SCPs can vary from run to run of master
> initialization.
> This means that unless you get lucky with SCP ordering, impacted regions will
> remain in RIT for an extended period of time. If you get particularly unlucky
> and a critical system table is included in the regions that are being
> recovered, then master initialization itself will end up blocked on this
> sequence of SCP timeouts. If there are enough of them to exceed the master
> initialization timeouts, then the situation can be self-sustaining, as
> additional master failovers cause even more duplicative SCPs to be scheduled.
> h3. Indicators:
> * Master appears to hang, failing to assign regions to available region
> servers.
> * Master appears to hang during initialization, showing that it is waiting
> for the meta or namespace regions.
> * Repeated master restarts allow some progress to be made on assignments for
> a limited period of time.
> * Master UI shows a large number of Server Crash Procedures in RUNNABLE
> state, and the number increases by roughly the number of Region Servers on
> each master restart.
> * Master log shows a large number of messages saying that assigning a region
> was skipped because it is on a dead but not yet processed region server.
> These messages come from the AssignmentManager logger. This message should
> normally only occur when a Region Server dies just before some assignment is
> about to happen. When this combination of issues occurs, the message repeats:
> every time a defunct SCP is processed, it is logged once for each region.
> Example of the aforementioned message:
> {code:java}
> 2019-04-15 11:19:04,610 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Skip assigning
> test6,5f89022e,1555251200249.946ff80e7602e66853d33899819983a1., it is on a
> dead but not processed yet server:
> regionserver-8.hbase.example.com,22101,1555349031179
> {code}
> h3. Reproduction:
> The procedure we currently have to reproduce this issue requires specific
> timings that can be hard to get right, so it might take multiple tries.
> Before starting, the test cluster should have the following properties:
> * Only one master. This is not a strict requirement, but it helps in the
> following steps.
> * Hundreds of regions per region server. If you need more regions, fire up
> your preferred data generation tool and tell it to create a large enough
> table. The regions can be empty; there is no need to fill them with actual
> data.
> * At least ten times more region servers than available CPUs on the master.
> If the number of CPUs is too high, set {{hbase.master.procedure.threads}} in
> the safety valves to the number of region servers divided by 10. For example,
> if you have 24 cores and a 10-node cluster, set the configuration to 2 or 3
> (see the sketch after this list).
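> For example, the cap from the last bullet could be derived like this (a
> hypothetical sketch; in practice the value goes into the safety valves /
> {{hbase-site.xml}} as described above):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> // Hypothetical sizing sketch for the reproduction setup: cap the procedure
> // pool at roughly number-of-region-servers / 10. The exact value is not
> // critical; the text above suggests 2 or 3 for a 10-node cluster.
> public class ReproProcedurePoolSizing {
>   public static void main(String[] args) {
>     int numRegionServers = 10; // assumed cluster size for this example
>     int threads = Math.max(2, numRegionServers / 10);
>     Configuration conf = HBaseConfiguration.create();
>     conf.setInt("hbase.master.procedure.threads", threads);
>     System.out.println("hbase.master.procedure.threads = " + threads);
>   }
> }
> {code}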
> Set up your environment the following way:
> * Access to cluster-wide shutdown/startup via whatever control plane you use
> * Access to master-role specific restart via whatever control plane you use
> * A shell on the master node that tails the master log (use {{tail -F}} with
> a capital F to ride over log rolls).
> * A shell on the master node that’s ready to grab the HMaster’s pid and
> {{kill -9}} it.
> The procedure is:
> * Start with a cluster as described above.
> * Restart the entire HBase cluster.
> * When the master log shows “Clean cluster startup. Assigning user regions”
> and it starts assigning regions, {{kill -9}} the master.
> * Stop the entire HBase cluster.
> * Restart the entire HBase cluster.
> * Once the master shows “Found regions out on cluster or in RIT”, {{kill
> -9}} the master again.
> * Restart only the Master.
> * (Optionally repeat the above two steps)
> You know that you have hit the bug when the indicators described above show
> up. If it seems like the master is able to assign a lot of regions again,
> {{kill -9}} it once more and restart the Master role.
> h3. Workaround:
> The identified and tested workaround for mitigating this problem involves
> configuration tunings that speed up region assignment on the Region Servers
> and reduce the time spent by bulk assignment threads on the Master side. We
> do not recommend these settings for normal operations; a consolidated example
> follows the list of properties below.
> * {{hbase.master.procedure.threads}} - This property defines the general
> procedure pool size and, in the context of this issue, is the pool for
> executing SCPs. Increasing this pool allows SCPs for different Region Servers
> to run in parallel, allowing more region assignments to be processed.
> However, at the core of this problem is the fact that the Master may have
> multiple SCPs for the same Region Server. These do not run in parallel, so
> tuning this parameter alone is not sufficient. We recommend setting it to the
> number of Region Servers in the cluster, so that under normal scenarios where
> there is one SCP per Region Server, all of them can run in parallel.
> * {{hbase.bulk.assignment.perregion.open.time}} - This property determines
> how long a bulk assignment thread in the Master's BulkAssigner should wait
> for all of its regions to get assigned. Setting it to a value as low as 100
> (milliseconds) allows the no-op SCPs to complete faster, which opens up
> execution slots for SCPs that can do actual assignment work.
> * {{hbase.master.namespace.init.timeout}} - The master has a time limit on
> how long assignment of the namespace table may take. Given that we want to
> limit further master restarts, this is better adjusted upwards.
> Since this issue is especially pronounced on clusters with a large number of
> regions per region server, the following additional config can also help:
> * {{hbase.regionserver.executor.openregion.threads}} - This controls the
> number of threads on each Region Server responsible for handling region
> assignment requests. Provided individual Region Servers are not already
> overloaded, tuning this value higher than the default (3) should help
> expedite region assignment.
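> Putting these together, a consolidated sketch using the Java
> {{Configuration}} API (the values are illustrative for a hypothetical
> 100-node cluster; in a real deployment these settings belong in
> {{hbase-site.xml}} / the safety valves):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
>
> // Illustrative workaround settings for a hypothetical 100-node cluster;
> // in a real deployment these belong in hbase-site.xml.
> public class ScpWorkaroundSettings {
>   public static void main(String[] args) {
>     Configuration conf = HBaseConfiguration.create();
>     // One procedure slot per Region Server so that, in the normal
>     // one-SCP-per-server case, all SCPs can run in parallel.
>     conf.setInt("hbase.master.procedure.threads", 100);
>     // Let no-op bulk assignments time out quickly (milliseconds per region).
>     conf.setLong("hbase.bulk.assignment.perregion.open.time", 100L);
>     // Extra headroom (milliseconds) for namespace assignment so master
>     // initialization is less likely to time out and trigger more failovers.
>     conf.setLong("hbase.master.namespace.init.timeout", 3600000L);
>     // More open-region handler threads than the default of 3.
>     conf.setInt("hbase.regionserver.executor.openregion.threads", 10);
>   }
> }
> {code}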
> h3. Acknowledgements:
> Thanks to the ton of folks who helped diagnose, chase down, and document this
> issue, its reproduction, and the workaround. Especially [~jdcryans],
> [~wchevreuil], [[email protected]], [~avirmani], Shamik Dave, [~esteban], and
> [~elserj].
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)