[ 
https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826674#comment-16826674
 ] 

David Manning commented on HBASE-22301:
---------------------------------------

I've spent hours looking at logs and assessing the background state of slow 
syncs throughout various clusters. With the current default of 10 slow syncs 
per 1 minute interval, we could expect the number of WAL rolls to increase by 
15% on a heavily utilized cluster, or 5-7% on other clusters. In these cases, I 
would not expect improved performance from most of these added rolls, as these 
are not problematic WAL pipelines, but "normal" pipelines. Perhaps these 
clusters could be better tuned to avoid so many background slow syncs, but it 
seems reasonable to think others in the community will also have similar 
clusters.

Based on these data, I would recommend a much higher default of 250 for 
{{slowSyncRollThreshold}}. For the default of 5 syncer threads, this is almost 
one slow sync per second if all syncer threads are utilized and waiting on the 
WAL writer. This would have remedied the problem in our incident, as we were 
seeing 500-800+ slow syncs reported per minute on affected pipelines (each sync 
taking 100-500ms+, each sync reported by all 5 threads). However, the threshold 
is also high enough to prevent sizable increases in WAL rolls in normally 
operating clusters. From my log investigations, in normal operation of a heavily 
utilized cluster, this would still add <1% new WAL rolls. It would add ~5% new 
WAL rolls during spiky traffic (e.g. clusters with large multi requests, or 
hotspotted servers that may be doing some GC thrashing). For normal operation 
of a normal cluster, it would add a negligible number of WAL rolls (<0.01%).

Unfortunately, such a high value would not detect a case of the WAL being so 
slow that it couldn't even perform 50 syncs per minute. If we want to do this, 
we'll need to be fancier with the logic. Perhaps we would have to sum up all 
the {{timeInNanos}} in {{postSync}} over the {{slowSyncCheckInterval}}, and 
then check if we spent greater than X% of the interval in slow syncs. This 
could catch issues where we saw 5 slow syncs of 10 seconds each, or 100 slow 
syncs of 500ms each, and request a WAL roll in either case. FWIW, I like this 
approach, but realize that it adds complexity while we're striving for 
simplicity.
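
To make that concrete, here is a rough standalone sketch of the time-based idea; 
it is purely illustrative, and the class and names below (e.g. 
SlowSyncTimeTracker) are made up for this comment, not actual FSHLog code. It 
accumulates the nanoseconds spent in slow syncs over the check interval and asks 
for a roll once they exceed a configured fraction of that interval.
{code:java}
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch only (not FSHLog code): accumulate the time spent in
 * slow syncs over the check interval and request a WAL roll once slow syncs
 * have consumed more than a configured fraction of that interval.
 */
public class SlowSyncTimeTracker {
  private final long checkIntervalNanos;     // e.g. the 1 minute slowSyncCheckInterval
  private final long slowSyncThresholdNanos; // a single sync slower than this counts as slow
  private final double maxSlowFraction;      // e.g. 0.5 means 50% of the interval

  private long windowStartNanos = System.nanoTime();
  private long slowNanosInWindow = 0L;

  public SlowSyncTimeTracker(long checkIntervalMs, long slowSyncThresholdMs,
      double maxSlowFraction) {
    this.checkIntervalNanos = TimeUnit.MILLISECONDS.toNanos(checkIntervalMs);
    this.slowSyncThresholdNanos = TimeUnit.MILLISECONDS.toNanos(slowSyncThresholdMs);
    this.maxSlowFraction = maxSlowFraction;
  }

  /** Called with each observed sync duration; returns true if a roll should be requested. */
  public synchronized boolean postSync(long timeInNanos) {
    long now = System.nanoTime();
    if (now - windowStartNanos > checkIntervalNanos) {
      // New interval: reset the accumulated slow-sync time.
      windowStartNanos = now;
      slowNanosInWindow = 0L;
    }
    if (timeInNanos > slowSyncThresholdNanos) {
      slowNanosInWindow += timeInNanos;
    }
    // Trips on 5 slow syncs of 10 seconds each just as readily as on 100 slow
    // syncs of 500ms each, because both are judged against the same time
    // budget rather than a raw count.
    return slowNanosInWindow > (long) (checkIntervalNanos * maxSlowFraction);
  }
}
{code}
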
{quote}We could divide the count by number of syncer threads. Or, multiply the 
threshold by number of threads. Or, simply set a higher threshold.
{quote}
If we stick with the count-based approach, I recommend 50 multiplied by the 
number of threads. If we don't want to include the number of threads, then I 
recommend a threshold of 250 (50 times the default of 5 threads).
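
If it helps to see the arithmetic, a tiny illustrative sketch of the count-based 
alternative with the per-thread scaling applied (the names here are made up, not 
existing configuration keys or FSHLog code):
{code:java}
// Illustrative sketch only: the count-based check with the threshold scaled by
// the number of syncer threads, so the effective default becomes 50 * 5 = 250
// reported slow syncs per check interval.
public class SlowSyncCountTracker {
  private final int thresholdPerSyncer;  // proposed default: 50
  private final int syncerThreadCount;   // FSHLog default: 5
  private int slowSyncCount = 0;         // reset at each slowSyncCheckInterval

  public SlowSyncCountTracker(int thresholdPerSyncer, int syncerThreadCount) {
    this.thresholdPerSyncer = thresholdPerSyncer;
    this.syncerThreadCount = syncerThreadCount;
  }

  /** Called once per reported slow sync; returns true once a roll should be requested. */
  public synchronized boolean onSlowSync() {
    slowSyncCount++;
    return slowSyncCount >= thresholdPerSyncer * syncerThreadCount;
  }

  /** Called at each slowSyncCheckInterval boundary. */
  public synchronized void resetInterval() {
    slowSyncCount = 0;
  }
}
{code}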

> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, 
> HBASE-22301-branch-1.patch
>
>
> Consider the case when a subset of the HDFS fleet is unhealthy but suffering 
> a gray failure, not an outright outage. HDFS operations, notably syncs, are 
> abnormally slow on pipelines which include this subset of hosts. If the 
> regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be 
> consumed waiting for acks from the datanodes in the pipeline (recall that 
> some of them are sick). Imagine a write heavy application distributing load 
> uniformly over the cluster at a fairly high rate. With the WAL subsystem 
> slowed by HDFS level issues, all handlers can be blocked waiting to append to 
> the WAL. Once all handlers are blocked, the application will experience 
> backpressure. All (HBase) clients eventually have too many outstanding writes 
> and block.
> Because the application is distributing writes near uniformly in the 
> keyspace, the probability any given service endpoint will dispatch a request 
> to an impacted regionserver, even a single regionserver, approaches 1.0. So 
> the probability that all service endpoints will be affected approaches 1.0.
> In order to break the logjam, we need to remove the slow datanodes. Although 
> there are HDFS-level monitoring, mechanisms, and procedures for this, we 
> should also attempt to take mitigating action at the HBase layer as soon as 
> we find ourselves in trouble. It would be enough to remove the affected 
> datanodes from the writer pipelines. A super simple strategy that can be 
> effective is described below:
> This is with branch-1 code. I think branch-2's async WAL can mitigate this but 
> may still be susceptible. The branch-2 sync WAL is susceptible. 
> We already roll the WAL writer if the pipeline suffers the failure of a 
> datanode and the replication factor on the pipeline is too low. We should 
> also consider how much time it took for the write pipeline to complete a sync 
> the last time we measured it, or the max over the interval from now to the 
> last time we checked. If the sync time exceeds a configured threshold, roll 
> the log writer then too. Fortunately we don't need to know which datanode is 
> making the WAL write pipeline slow, only that syncs on the pipeline are too 
> slow and exceed a threshold. This is enough information to know when to 
> roll it. Once we roll it, we will get three new randomly selected datanodes. 
> On most clusters the probability the new pipeline includes the slow datanode 
> will be low. (And if for some reason it does end up with a problematic 
> datanode again, we roll again.)
> This is not a silver bullet, but it can be a reasonably effective mitigation.
> Provide a metric for tracking when log roll is requested (and for what 
> reason).
> Emit a log line at log roll time that includes datanode pipeline details for 
> further debugging and analysis, similar to the existing slow FSHLog sync log 
> line.
> If we roll too many times within a short interval of time, this probably means 
> there is a widespread problem with the fleet, and so our mitigation is not 
> helping and may be exacerbating those problems or operator difficulties. 
> Ensure log roll requests triggered by this new feature happen infrequently 
> enough to not cause difficulties under either normal or abnormal conditions. 
> A very simple strategy that could work well under both normal and abnormal 
> conditions is to define a fairly lengthy interval, default 5 minutes, and 
> then ensure we do not roll more than once during this interval for this 
> reason.
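
For reference, a rough sketch of the roll-on-slow-sync mitigation with the 
once-per-interval rate limit described in the issue above; this is purely 
illustrative, and the class, method, and parameter names below are assumptions, 
not the attached patch.
{code:java}
// Illustrative sketch only (not the attached patch): request a WAL roll when a
// measured sync exceeds a threshold, but never more than once per interval
// (e.g. 5 minutes), so a fleet-wide problem does not trigger roll storms.
public class SlowPipelineRollPolicy {
  private final long syncTimeThresholdMs; // a sync slower than this suggests a sick pipeline
  private final long minRollIntervalMs;   // e.g. 5 * 60 * 1000
  private long lastRollRequestMs;
  private boolean rolledBefore = false;

  public SlowPipelineRollPolicy(long syncTimeThresholdMs, long minRollIntervalMs) {
    this.syncTimeThresholdMs = syncTimeThresholdMs;
    this.minRollIntervalMs = minRollIntervalMs;
  }

  /** Returns true if a roll should be requested for this observed sync time. */
  public synchronized boolean shouldRoll(long syncTimeMs, long nowMs) {
    if (syncTimeMs <= syncTimeThresholdMs) {
      return false;
    }
    if (rolledBefore && nowMs - lastRollRequestMs < minRollIntervalMs) {
      return false; // rolled recently for this reason; avoid making things worse
    }
    rolledBefore = true;
    lastRollRequestMs = nowMs;
    return true;
  }
}
{code}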



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
