[
https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831120#comment-16831120
]
Andrew Purtell commented on HBASE-22301:
----------------------------------------
Precommit results look good. This checkstyle nit I won't be able to fix:
{code}TestLogRolling.java:106: @Test:3: Method length is 164 lines (max
allowed is 150). [MethodLength]{code}
I have a +1 already for the branch-1 work. Unless there is an objection I am
going to commit this to branch-1, branch-2, and master today, after running more
local checks. If you have any concerns please post them now.
> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
> Key: HBASE-22301
> URL: https://issues.apache.org/jira/browse/HBASE-22301
> Project: HBase
> Issue Type: Improvement
> Components: wal
> Reporter: Andrew Purtell
> Assignee: Andrew Purtell
> Priority: Minor
> Fix For: 3.0.0, 1.5.0, 2.3.0
>
> Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch,
> HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch,
> HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch,
> HBASE-22301-branch-1.patch, HBASE-22301-branch-2.patch, HBASE-22301.patch
>
>
> Consider the case where a subset of the HDFS fleet is unhealthy but suffering
> a gray failure rather than an outright outage. HDFS operations, notably syncs, are
> abnormally slow on pipelines which include this subset of hosts. If the
> regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be
> consumed waiting for acks from the datanodes in the pipeline (recall that
> some of them are sick). Imagine a write heavy application distributing load
> uniformly over the cluster at a fairly high rate. With the WAL subsystem
> slowed by HDFS level issues, all handlers can be blocked waiting to append to
> the WAL. Once all handlers are blocked, the application will experience
> backpressure. All (HBase) clients eventually have too many outstanding writes
> and block.
> Because the application is distributing writes near-uniformly across the
> keyspace, the probability that any given service endpoint will dispatch a
> request to an impacted regionserver, even if only a single regionserver is
> impacted, approaches 1.0. So the probability that all service endpoints will be
> affected approaches 1.0.
> In order to break the logjam, we need to remove the slow datanodes. Although
> HDFS-level monitoring, mechanisms, and procedures exist for this, we should
> also attempt to take mitigating action at the HBase layer as soon as
> we find ourselves in trouble. It would be enough to remove the affected
> datanodes from the writer pipelines. A super simple strategy that can be
> effective is described below:
> This analysis is based on branch-1 code. I think branch-2's async WAL can
> mitigate the problem but may still be susceptible. The branch-2 sync WAL is
> susceptible.
> We already roll the WAL writer if the pipeline suffers the failure of a
> datanode and the replication factor on the pipeline is too low. We should
> also consider how much time it took for the write pipeline to complete a sync
> the last time we measured it, or the max over the interval from now to the
> last time we checked. If the sync time exceeds a configured threshold, roll
> the log writer then too. Fortunately we don't need to know which datanode is
> making the WAL write pipeline slow, only that syncs on the pipeline are too
> slow and exceeding a threshold. This is enough information to know when to
> roll it. Once we roll it, we will get three new randomly selected datanodes.
> On most clusters the probability the new pipeline includes the slow datanode
> will be low. (And if for some reason it does end up with a problematic
> datanode again, we roll again.)
> This is not a silver bullet, but it can be a reasonably effective mitigation.
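> A minimal sketch of that check in Java follows; the class and method names
> (SlowSyncRollPolicy, shouldRollWal) are illustrative assumptions, not the
> identifiers in the attached patches:
> {code}
> // Sketch only: remember the worst sync time since the last check and ask for
> // a roll when it exceeds the configured threshold.
> public class SlowSyncRollPolicy {
>   private final long rollOnSyncNs;   // configured threshold, in nanoseconds
>   private long maxSyncNs;            // worst sync time observed since last check
>
>   public SlowSyncRollPolicy(long rollOnSyncNs) {
>     this.rollOnSyncNs = rollOnSyncNs;
>   }
>
>   /** Called after each WAL sync completes, with the measured duration. */
>   public synchronized void onSyncCompleted(long syncNs) {
>     if (syncNs > maxSyncNs) {
>       maxSyncNs = syncNs;
>     }
>   }
>
>   /** Called periodically; true means roll the writer, which selects a fresh
>    *  datanode pipeline. */
>   public synchronized boolean shouldRollWal() {
>     long observed = maxSyncNs;
>     maxSyncNs = 0;                   // reset the measurement window
>     return observed > rollOnSyncNs;
>   }
> }
> {code}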
> Provide a metric for tracking when log roll is requested (and for what
> reason).
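> As a sketch of what such a metric could look like, here is a simple per-reason
> counter; the Reason values and class name are assumptions for illustration, not
> the metric names added by the patch:
> {code}
> import java.util.EnumMap;
> import java.util.concurrent.atomic.AtomicLong;
>
> // Sketch only: count roll requests by reason so operators can see how often
> // the slow-sync path (versus size, error, or low replication) triggers a roll.
> public class RollRequestMetrics {
>   public enum Reason { ERROR, LOW_REPLICATION, SLOW_SYNC, SIZE }
>
>   private final EnumMap<Reason, AtomicLong> counts = new EnumMap<>(Reason.class);
>
>   public RollRequestMetrics() {
>     for (Reason r : Reason.values()) {
>       counts.put(r, new AtomicLong());
>     }
>   }
>
>   public void requested(Reason reason) {
>     counts.get(reason).incrementAndGet();
>   }
>
>   public long get(Reason reason) {
>     return counts.get(reason).get();
>   }
> }
> {code}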
> Emit a log line at log roll time that includes datanode pipeline details for
> further debugging and analysis, similar to the existing slow FSHLog sync log
> line.
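> As an example of the shape such a log line could take (the wording, parameter
> names, and logging setup here are assumptions for this sketch):
> {code}
> // Sketch only: report the threshold that was exceeded and the datanodes in the
> // current write pipeline, similar in spirit to the existing slow sync warning.
> public class SlowSyncRollLogging {
>   private static final org.apache.commons.logging.Log LOG =
>       org.apache.commons.logging.LogFactory.getLog(SlowSyncRollLogging.class);
>
>   static void logSlowSyncRoll(long maxSyncMs, long thresholdMs,
>       org.apache.hadoop.hdfs.protocol.DatanodeInfo[] pipeline) {
>     LOG.warn("Requesting log roll: max sync time " + maxSyncMs + " ms exceeded threshold "
>         + thresholdMs + " ms, current pipeline: " + java.util.Arrays.toString(pipeline));
>   }
> }
> {code}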
> If we roll too many times within a short interval of time, this probably means
> there is a widespread problem with the fleet, so our mitigation is not helping
> and may be exacerbating those problems or operator difficulties.
> Ensure log roll requests triggered by this new feature happen infrequently
> enough to not cause difficulties under either normal or abnormal conditions.
> A very simple strategy that could work well under both normal and abnormal
> conditions is to define a fairly lengthy interval, default 5 minutes, and
> then ensure we do not roll more than once during this interval for this
> reason.
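> A minimal sketch of that guard, assuming a configurable minimum interval
> (the names here are illustrative, not the actual configuration keys):
> {code}
> // Sketch only: allow at most one slow-sync-triggered roll per interval.
> public class SlowSyncRollLimiter {
>   private final long minIntervalMs;   // e.g. 5 * 60 * 1000 by default
>   private long lastRollRequestMs;
>
>   public SlowSyncRollLimiter(long minIntervalMs) {
>     this.minIntervalMs = minIntervalMs;
>   }
>
>   /** Returns true if a slow-sync roll may be requested now. */
>   public synchronized boolean tryAcquire(long nowMs) {
>     if (nowMs - lastRollRequestMs < minIntervalMs) {
>       return false;                   // rolled recently for this reason; skip
>     }
>     lastRollRequestMs = nowMs;
>     return true;
>   }
> }
> {code}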
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)