Andrew Purtell created HBASE-22301:
--------------------------------------
Summary: Consider rolling the WAL if the HDFS write pipeline is slow
Key: HBASE-22301
URL: https://issues.apache.org/jira/browse/HBASE-22301
Project: HBase
Issue Type: Improvement
Components: wal
Reporter: Andrew Purtell
Assignee: Andrew Purtell
Fix For: 3.0.0, 1.5.0, 2.3.0
Consider the case where a subset of the HDFS fleet is unhealthy but suffering a
gray failure, not an outright outage. HDFS operations, notably syncs, are
abnormally slow on pipelines which include this subset of hosts. If the
regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be
consumed waiting for acks from the datanodes in the pipeline (recall that some
of them are sick). Imagine a write heavy application distributing load
uniformly over the cluster at a fairly high rate. With the WAL subsystem slowed
by HDFS level issues, all handlers can be blocked waiting to append to the WAL.
Once all handlers are blocked, the application will experience backpressure.
This is with branch-1 code. I think branch-2's async WAL can mitigate this but
can still be susceptible. The branch-2 sync WAL is susceptible.
We already roll the WAL writer if the pipeline suffers the failure of a
datanode and the replication factor on the pipeline is too low. We should also
consider how much time it took for the write pipeline to complete a sync the
last time we measured it, or the max sync time observed since the last time
we checked. If the sync time exceeds a configured threshold, roll the log
writer then too. Fortunately we don't need to know which datanode is making the
WAL write pipeline slow, only that syncs on the pipeline are too slow and
exceeding a threshold. This is enough information to know when to roll it. Once
we roll it, we will get three new randomly selected datanodes. On most clusters
the probability the new pipeline includes the slow datanode will be low. (And
if for some reason it does end up with a problematic datanode again, we roll
again.)
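A minimal sketch of the "max since last check" variant, for discussion only. SlowSyncDetector, recordSync, and checkAndReset are hypothetical names, not existing HBase classes or methods, and the threshold is assumed to come from a configuration value that has yet to be defined.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not an HBase class: tracks the slowest sync since the last check. */
final class SlowSyncDetector {
    private final long thresholdNanos;
    // Max sync duration observed since the last call to checkAndReset().
    private final AtomicLong maxSyncNanos = new AtomicLong();

    SlowSyncDetector(long thresholdMillis) {
        this.thresholdNanos = thresholdMillis * 1_000_000L;
    }

    /** Called by the sync path after each sync completes. */
    void recordSync(long durationNanos) {
        maxSyncNanos.accumulateAndGet(durationNanos, Math::max);
    }

    /**
     * Called periodically, e.g. wherever the low-replication check already runs.
     * Returns true if a log roll should be requested, and resets the max so the
     * next interval starts fresh.
     */
    boolean checkAndReset() {
        return maxSyncNanos.getAndSet(0L) > thresholdNanos;
    }
}
```

Note the detector never needs to know which datanode is slow. It only looks at end-to-end sync time on the pipeline.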
This is not a silver bullet, but it can be a reasonably effective mitigation.
Provide a metric for tracking when log roll is requested (and for what reason).
Emit a log line at log roll time that includes datanode pipeline details for
further debugging and analysis, similar to the existing slow FSHLog sync log
line.
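One possible shape for the per-reason metric, as a rough sketch. RollRequestMetrics and its Reason values are hypothetical. Only the low replication and slow sync reasons come from this description, and OTHER is a placeholder for whatever other triggers exist.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sketch, not an HBase class: counts log roll requests by reason. */
final class RollRequestMetrics {
    /** Reasons are assumptions for illustration. */
    enum Reason { LOW_REPLICATION, SLOW_SYNC, OTHER }

    // Map is fully populated in the constructor and never modified afterwards,
    // so concurrent readers only ever touch the thread-safe LongAdder values.
    private final Map<Reason, LongAdder> counts = new EnumMap<>(Reason.class);

    RollRequestMetrics() {
        for (Reason r : Reason.values()) {
            counts.put(r, new LongAdder());
        }
    }

    /** Record that a roll was requested for the given reason. */
    void rollRequested(Reason reason) {
        counts.get(reason).increment();
    }

    /** Number of roll requests recorded for one reason. */
    long count(Reason reason) {
        return counts.get(reason).sum();
    }

    /** Number of roll requests recorded across all reasons. */
    long total() {
        long sum = 0;
        for (LongAdder a : counts.values()) {
            sum += a.sum();
        }
        return sum;
    }
}
```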
If we roll too many times within a short interval, this probably means there
is a widespread problem with the fleet, so our mitigation is not helping and
may be exacerbating those problems or operator difficulties. Ensure
log roll requests triggered by this new feature happen infrequently enough to
not cause difficulties under either normal or abnormal conditions. A very
simple strategy that could work well under both normal and abnormal conditions
is to define a fairly lengthy interval, default 5 minutes, and then ensure we
do not roll more than once during this interval for this reason.
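The interval strategy could look roughly like the following. SlowSyncRollLimiter and tryAcquire are hypothetical names. The caller passes in the current time so the logic is easy to test, and the limiter applies only to rolls requested for the slow sync reason.

```java
/** Hypothetical sketch, not an HBase class: allows at most one slow-sync roll per interval. */
final class SlowSyncRollLimiter {
    private final long minIntervalMillis;
    private long lastRollMillis;
    private boolean rolledBefore;

    SlowSyncRollLimiter(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    /**
     * Returns true if a slow-sync roll may be requested now, and records the
     * time if so. Returns false if one was already permitted within the interval.
     */
    synchronized boolean tryAcquire(long nowMillis) {
        if (rolledBefore && nowMillis - lastRollMillis < minIntervalMillis) {
            return false;
        }
        rolledBefore = true;
        lastRollMillis = nowMillis;
        return true;
    }
}
```

With the suggested default, the limiter would be constructed with 5 * 60 * 1000 milliseconds, and a slow sync detected while it refuses would simply be ignored until the interval elapses.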