Andrew Purtell created HBASE-22301:
--------------------------------------

             Summary: Consider rolling the WAL if the HDFS write pipeline is slow
                 Key: HBASE-22301
                 URL: https://issues.apache.org/jira/browse/HBASE-22301
             Project: HBase
          Issue Type: Improvement
          Components: wal
            Reporter: Andrew Purtell
            Assignee: Andrew Purtell
             Fix For: 3.0.0, 1.5.0, 2.3.0


Consider the case where a subset of the HDFS fleet is unhealthy but suffering a 
gray failure rather than an outright outage. HDFS operations, notably syncs, are 
abnormally slow on pipelines that include this subset of hosts. If the 
regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be 
consumed waiting for acks from the datanodes in the pipeline (recall that some 
of them are sick). Imagine a write-heavy application distributing load 
uniformly over the cluster at a fairly high rate. With the WAL subsystem slowed 
by HDFS-level issues, all handlers can be blocked waiting to append to the WAL. 
Once all handlers are blocked, the application will experience backpressure.

This is with branch-1 code. I think branch-2's async WAL can mitigate this but 
can still be susceptible. The branch-2 sync WAL is susceptible. 

We already roll the WAL writer if the pipeline suffers the failure of a 
datanode and the replication factor on the pipeline is too low. We should also 
consider how much time it took for the write pipeline to complete a sync the 
last time we measured it, or the max sync time observed since the last 
time we checked. If the sync time exceeds a configured threshold, roll the log 
writer then too. Fortunately we don't need to know which datanode is making the 
WAL write pipeline slow, only that syncs on the pipeline are too slow and 
exceeding a threshold. This is enough information to know when to roll it. Once 
we roll it, we will get three new randomly selected datanodes. On most clusters 
the probability the new pipeline includes the slow datanode will be low. (And 
if for some reason it does end up with a problematic datanode again, we roll 
again.)
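
The "max over the interval" check described above could be sketched roughly as 
follows. This is only an illustration, not the actual patch: the class name 
SlowSyncTracker, its methods, and the threshold parameter are all hypothetical 
and are not existing HBase APIs. The sync path records each sync duration, and 
a periodic checker asks whether the slowest sync since the previous check 
exceeded the configured threshold.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: remembers the slowest sync seen since the last check. */
final class SlowSyncTracker {
  private final long thresholdNanos;                       // assumed to come from configuration
  private final AtomicLong maxSyncNanos = new AtomicLong();

  SlowSyncTracker(long thresholdNanos) {
    this.thresholdNanos = thresholdNanos;
  }

  /** Called by the sync path after each completed sync. */
  void recordSync(long durationNanos) {
    // Keep the maximum duration observed in the current interval.
    maxSyncNanos.accumulateAndGet(durationNanos, Math::max);
  }

  /**
   * Called periodically by whatever decides on log rolls. Returns true if the
   * slowest sync since the previous call exceeded the threshold, then starts a
   * new interval by resetting the maximum to zero.
   */
  boolean shouldRollAndReset() {
    return maxSyncNanos.getAndSet(0L) > thresholdNanos;
  }
}
```

Note that the sketch never needs to know which datanode was slow, only the 
observed sync latency on the pipeline as a whole.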

This is not a silver bullet, but it can be a reasonably effective mitigation.

Provide a metric for tracking when log roll is requested (and for what reason).
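
One possible shape for such a metric is a counter per roll reason. The sketch 
below is an assumption for illustration only: the RollReason enum, its values, 
and the RollRequestCounters class are hypothetical names, and a real 
implementation would presumably be wired into the existing WAL metrics rather 
than stand alone.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical set of reasons a log roll might be requested. */
enum RollReason { LOW_REPLICATION, SLOW_SYNC, OTHER }

/** Hypothetical per-reason counters for log roll requests. */
final class RollRequestCounters {
  // The map is fully populated in the constructor and never modified
  // structurally afterwards, so concurrent reads of it are safe.
  private final Map<RollReason, LongAdder> counters = new EnumMap<>(RollReason.class);

  RollRequestCounters() {
    for (RollReason reason : RollReason.values()) {
      counters.put(reason, new LongAdder());
    }
  }

  /** Record one roll request for the given reason. */
  void increment(RollReason reason) {
    counters.get(reason).increment();
  }

  /** Number of roll requests recorded for the given reason. */
  long get(RollReason reason) {
    return counters.get(reason).sum();
  }

  /** Total roll requests across all reasons. */
  long total() {
    long sum = 0;
    for (LongAdder adder : counters.values()) {
      sum += adder.sum();
    }
    return sum;
  }
}
```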

Emit a log line at log roll time that includes datanode pipeline details for 
further debugging and analysis, similar to the existing slow FSHLog sync log 
line.

If we roll too many times within a short interval of time this probably means 
there is a widespread problem with the fleet and so our mitigation is not 
helping and may be exacerbating those problems or operator difficulties. Ensure 
log roll requests triggered by this new feature happen infrequently enough to 
not cause difficulties under either normal or abnormal conditions. A very 
simple strategy that could work well under both normal and abnormal conditions 
is to define a fairly lengthy interval, default 5 minutes, and then ensure we 
do not roll more than once during this interval for this reason.
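
A minimal sketch of that strategy, assuming a hypothetical SlowSyncRollLimiter 
class (not an existing HBase class) and a caller that supplies the current 
time from a monotonic clock such as System.nanoTime():

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical limiter: permits at most one slow-sync roll per interval. */
final class SlowSyncRollLimiter {
  private final long minIntervalNanos;
  private boolean rolledBefore = false;
  private long lastRollNanos;

  SlowSyncRollLimiter(long minInterval, TimeUnit unit) {
    this.minIntervalNanos = unit.toNanos(minInterval);
  }

  /**
   * Returns true if a slow-sync roll may be requested now, and if so records
   * nowNanos as the time of the latest roll. Returns false if a roll for this
   * reason was already permitted within the last interval.
   */
  synchronized boolean tryAcquire(long nowNanos) {
    if (rolledBefore && nowNanos - lastRollNanos < minIntervalNanos) {
      return false;
    }
    rolledBefore = true;
    lastRollNanos = nowNanos;
    return true;
  }
}
```

With the default suggested above this would be constructed as 
new SlowSyncRollLimiter(5, TimeUnit.MINUTES) and consulted with 
tryAcquire(System.nanoTime()) before requesting a roll. Rolls requested for 
other reasons, such as low replication, would not go through this limiter.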



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
