[ https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827428#comment-16827428 ]
HBase QA commented on HBASE-22301:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 14m 13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 46s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} branch-1 passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} branch-1 passed with JDK v1.7.0_222 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 46s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 2m 48s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} branch-1 passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} branch-1 passed with JDK v1.7.0_222 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} the patch passed with JDK v1.7.0_222 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} The patch passed checkstyle in hbase-hadoop-compat {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} The patch passed checkstyle in hbase-hadoop2-compat {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 23s{color} | {color:green} hbase-server: The patch generated 0 new + 94 unchanged - 6 fixed = 94 total (was 100) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 2m 54s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 1m 43s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} the patch passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed with JDK v1.7.0_222 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s{color} | {color:green} hbase-hadoop-compat in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 28s{color} | {color:green} hbase-hadoop2-compat in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 48s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 54s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 9s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.security.access.TestAdminOnlyOperations |
| | hadoop.hbase.TestZooKeeper |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/PreCommit-HBASE-Build/204/artifact/patchprocess/Dockerfile |
| JIRA Issue | HBASE-22301 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967197/HBASE-22301-branch-1.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux cdedc1cd4657 4.4.0-143-generic #169~14.04.2-Ubuntu SMP Wed Feb 13 15:00:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-1 / 5ea7851 |
| maven | version: Apache Maven 3.0.5 |
| Default Java | 1.7.0_222 |
| Multi-JDK versions | /usr/lib/jvm/java-8-openjdk-amd64:1.8.0_212 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_222 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/204/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/204/testReport/ |
| Max. process+thread count | 3727 (vs. ulimit of 10000) |
| modules | C: hbase-hadoop-compat hbase-hadoop2-compat hbase-server U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/204/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |

This message was automatically generated.

> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch,
> HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch,
> HBASE-22301-branch-1.patch
>
>
> Consider the case when a subset of the HDFS fleet is unhealthy but suffering
> a gray failure not an outright outage. HDFS operations, notably syncs, are
> abnormally slow on pipelines which include this subset of hosts. If the
> regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be
> consumed waiting for acks from the datanodes in the pipeline (recall that
> some of them are sick). Imagine a write heavy application distributing load
> uniformly over the cluster at a fairly high rate. With the WAL subsystem
> slowed by HDFS level issues, all handlers can be blocked waiting to append to
> the WAL. Once all handlers are blocked, the application will experience
> backpressure. All (HBase) clients eventually have too many outstanding writes
> and block.
> Because the application is distributing writes near uniformly in the
> keyspace, the probability any given service endpoint will dispatch a request
> to an impacted regionserver, even a single regionserver, approaches 1.0. So
> the probability that all service endpoints will be affected approaches 1.0.
> In order to break the logjam, we need to remove the slow datanodes. Although
> there is HDFS level monitoring, mechanisms, and procedures for this, we
> should also attempt to take mitigating action at the HBase layer as soon as
> we find ourselves in trouble. It would be enough to remove the affected
> datanodes from the writer pipelines.
> A super simple strategy that can be effective is described below:
> This is with branch-1 code. I think branch-2's async WAL can mitigate but
> still can be susceptible. branch-2 sync WAL is susceptible.
> We already roll the WAL writer if the pipeline suffers the failure of a
> datanode and the replication factor on the pipeline is too low. We should
> also consider how much time it took for the write pipeline to complete a sync
> the last time we measured it, or the max over the interval from now to the
> last time we checked. If the sync time exceeds a configured threshold, roll
> the log writer then too. Fortunately we don't need to know which datanode is
> making the WAL write pipeline slow, only that syncs on the pipeline are too
> slow and exceeding a threshold. This is enough information to know when to
> roll it. Once we roll it, we will get three new randomly selected datanodes.
> On most clusters the probability the new pipeline includes the slow datanode
> will be low. (And if for some reason it does end up with a problematic
> datanode again, we roll again.)
> This is not a silver bullet but this can be a reasonably effective mitigation.
> Provide a metric for tracking when log roll is requested (and for what
> reason).
> Emit a log line at log roll time that includes datanode pipeline details for
> further debugging and analysis, similar to the existing slow FSHLog sync log
> line.
> If we roll too many times within a short interval of time this probably means
> there is a widespread problem with the fleet and so our mitigation is not
> helping and may be exacerbating those problems or operator difficulties.
> Ensure log roll requests triggered by this new feature happen infrequently
> enough to not cause difficulties under either normal or abnormal conditions.
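
The quoted claim that a freshly rolled pipeline is unlikely to include the slow datanode can be checked with a rough calculation. The sketch below assumes the three pipeline members are drawn uniformly at random, without replacement, from the whole fleet, as the description puts it; real HDFS block placement is not uniform, so treat the result as an estimate only. The class and method names are hypothetical and are not part of HBase or HDFS.

```java
/** Illustrative estimate only; not part of HBase or HDFS. */
public class PipelineOdds {

    /**
     * Probability that a pipeline of {@code replicas} datanodes, drawn uniformly
     * at random without replacement from {@code total} datanodes, contains at
     * least one of the {@code slow} unhealthy ones.
     */
    static double probHitsSlowNode(int total, int slow, int replicas) {
        if (replicas > total - slow) {
            // Not enough healthy datanodes to fill the pipeline.
            return 1.0;
        }
        // P(every member is healthy) = product of (healthy left / nodes left).
        double allHealthy = 1.0;
        for (int i = 0; i < replicas; i++) {
            allHealthy *= (double) (total - slow - i) / (total - i);
        }
        return 1.0 - allHealthy;
    }
}
```

Under these assumptions, one slow datanode in a fleet of 100 gives `probHitsSlowNode(100, 1, 3)` of about 0.03, so a single roll would be expected to escape the slow node most of the time. The estimate degrades as the unhealthy fraction grows, which is consistent with the concern about rolling repeatedly during a widespread problem.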
> A very simple strategy that could work well under both normal and abnormal
> conditions is to define a fairly lengthy interval, default 5 minutes, and
> then ensure we do not roll more than once during this interval for this
> reason.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
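
The strategy in the quoted description (track the worst sync time since the last check, request a roll when it exceeds a threshold, and never roll for this reason more than once per interval) can be sketched roughly as follows. This is a minimal illustration and not the code in the attached patch: the class name, method names, and the idea of passing the current time in as a parameter are all assumptions made for the example, and the threshold value is left to the caller because the description does not specify one.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a slow-sync roll policy; not the HBASE-22301 patch. */
public class SlowSyncRollPolicy {
    private final long slowSyncThresholdNanos;
    private final long minRollIntervalNanos;

    private long maxSyncNanosSinceCheck = 0L; // worst sync seen since the last check
    private long lastRollNanos = 0L;          // time of the last slow-sync roll request
    private boolean hasRolled = false;
    private long rollRequests = 0L;           // simple counter standing in for a metric

    public SlowSyncRollPolicy(long slowSyncThresholdMs, long minRollIntervalMs) {
        this.slowSyncThresholdNanos = TimeUnit.MILLISECONDS.toNanos(slowSyncThresholdMs);
        this.minRollIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minRollIntervalMs);
    }

    /** Record how long one completed WAL sync took. */
    public synchronized void recordSync(long durationNanos) {
        maxSyncNanosSinceCheck = Math.max(maxSyncNanosSinceCheck, durationNanos);
    }

    /**
     * Decide whether to request a log roll because syncs were too slow.
     * The tracked maximum is reset on every check, including checks that are
     * suppressed by the rate limit.
     */
    public synchronized boolean shouldRequestRoll(long nowNanos) {
        long worst = maxSyncNanosSinceCheck;
        maxSyncNanosSinceCheck = 0L;
        if (worst <= slowSyncThresholdNanos) {
            return false; // pipeline looks healthy enough
        }
        if (hasRolled && nowNanos - lastRollNanos < minRollIntervalNanos) {
            return false; // already rolled for this reason within the interval
        }
        hasRolled = true;
        lastRollNanos = nowNanos;
        rollRequests++;
        return true;
    }

    /** Number of roll requests issued because of slow syncs. */
    public synchronized long getRollRequests() {
        return rollRequests;
    }
}
```

A caller would invoke `recordSync` after each sync and `shouldRequestRoll(System.nanoTime())` from a periodic check, constructing the policy with the 5 minute default interval mentioned above, for example `new SlowSyncRollPolicy(thresholdMs, TimeUnit.MINUTES.toMillis(5))`. Taking the time as a parameter is a choice made here so the rate limit can be exercised without waiting in real time.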