[jira] [Commented] (HBASE-22301) Consider rolling the WAL if the HDFS write pipeline is slow

HBase QA (JIRA) Fri, 26 Apr 2019 20:24:14 -0700


    [ https://issues.apache.org/jira/browse/HBASE-22301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827428#comment-16827428 ]


HBase QA commented on HBASE-22301:
----------------------------------

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 14m 13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} |
|| || || || {color:brown} branch-1 Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 46s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 58s{color} | {color:green} branch-1 passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} branch-1 passed with JDK v1.7.0_222 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 46s{color} | {color:green} branch-1 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 2m 48s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} branch-1 passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} branch-1 passed with JDK v1.7.0_222 {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 13s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} the patch passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} the patch passed with JDK v1.7.0_222 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 11s{color} | {color:green} The patch passed checkstyle in hbase-hadoop-compat {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} The patch passed checkstyle in hbase-hadoop2-compat {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 23s{color} | {color:green} hbase-server: The patch generated 0 new + 94 unchanged - 6 fixed = 94 total (was 100) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 2m 54s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 1m 43s{color} | {color:green} Patch does not cause any errors with Hadoop 2.7.4. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} the patch passed with JDK v1.8.0_212 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 1s{color} | {color:green} the patch passed with JDK v1.7.0_222 {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 22s{color} | {color:green} hbase-hadoop-compat in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 28s{color} | {color:green} hbase-hadoop2-compat in the patch passed. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red}112m 48s{color} | {color:red} hbase-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 54s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}152m 9s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hbase.security.access.TestAdminOnlyOperations |
|   | hadoop.hbase.TestZooKeeper |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce base: https://builds.apache.org/job/PreCommit-HBASE-Build/204/artifact/patchprocess/Dockerfile |
| JIRA Issue | HBASE-22301 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12967197/HBASE-22301-branch-1.patch |
| Optional Tests | dupname asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux cdedc1cd4657 4.4.0-143-generic #169~14.04.2-Ubuntu SMP Wed Feb 13 15:00:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | branch-1 / 5ea7851 |
| maven | version: Apache Maven 3.0.5 |
| Default Java | 1.7.0_222 |
| Multi-JDK versions | /usr/lib/jvm/java-8-openjdk-amd64:1.8.0_212 /usr/lib/jvm/java-7-openjdk-amd64:1.7.0_222 |
| unit | https://builds.apache.org/job/PreCommit-HBASE-Build/204/artifact/patchprocess/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/204/testReport/ |
| Max. process+thread count | 3727 (vs. ulimit of 10000) |
| modules | C: hbase-hadoop-compat hbase-hadoop2-compat hbase-server U: . |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/204/console |
| Powered by | Apache Yetus 0.9.0 http://yetus.apache.org |


This message was automatically generated.



> Consider rolling the WAL if the HDFS write pipeline is slow
> -----------------------------------------------------------
>
>                 Key: HBASE-22301
>                 URL: https://issues.apache.org/jira/browse/HBASE-22301
>             Project: HBase
>          Issue Type: Improvement
>          Components: wal
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Minor
>             Fix For: 3.0.0, 1.5.0, 2.3.0
>
>         Attachments: HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, 
> HBASE-22301-branch-1.patch, HBASE-22301-branch-1.patch, 
> HBASE-22301-branch-1.patch
>
>
> Consider the case when a subset of the HDFS fleet is unhealthy but suffering
> a gray failure, not an outright outage. HDFS operations, notably syncs, are
> abnormally slow on pipelines that include this subset of hosts. If the
> regionserver's WAL is backed by an impacted pipeline, all WAL handlers can be
> consumed waiting for acks from the datanodes in the pipeline (recall that
> some of them are sick). Imagine a write-heavy application distributing load
> uniformly over the cluster at a fairly high rate. With the WAL subsystem
> slowed by HDFS-level issues, all handlers can be blocked waiting to append to
> the WAL. Once all handlers are blocked, the application will experience
> backpressure. All (HBase) clients eventually have too many outstanding writes
> and block.
>
> Because the application is distributing writes near uniformly in the
> keyspace, the probability that any given service endpoint will dispatch a
> request to an impacted regionserver, even a single regionserver, approaches
> 1.0. So the probability that all service endpoints will be affected
> approaches 1.0.
>
> In order to break the logjam, we need to remove the slow datanodes. Although
> there are HDFS-level monitoring, mechanisms, and procedures for this, we
> should also attempt to take mitigating action at the HBase layer as soon as
> we find ourselves in trouble. It would be enough to remove the affected
> datanodes from the writer pipelines. A simple strategy that can be
> effective is described below.
>
> This is with branch-1 code. I think branch-2's async WAL can mitigate this
> but can still be susceptible. The branch-2 sync WAL is susceptible.
>
> We already roll the WAL writer if the pipeline suffers the failure of a
> datanode and the replication factor on the pipeline is too low. We should
> also consider how much time it took for the write pipeline to complete a sync
> the last time we measured it, or the max over the interval from the last
> time we checked until now. If the sync time exceeds a configured threshold,
> roll the log writer then too. Fortunately we don't need to know which
> datanode is making the WAL write pipeline slow, only that syncs on the
> pipeline are too slow and exceeding a threshold. This is enough information
> to know when to roll it. Once we roll it, we will get three new randomly
> selected datanodes. On most clusters the probability that the new pipeline
> includes the slow datanode will be low. (And if for some reason it does end
> up with a problematic datanode again, we roll again.)
>
> This is not a silver bullet, but it can be a reasonably effective
> mitigation.
>
> Provide a metric for tracking when a log roll is requested (and for what
> reason).
>
> Emit a log line at log roll time that includes datanode pipeline details for
> further debugging and analysis, similar to the existing slow FSHLog sync log
> line.
>
> If we roll too many times within a short interval, this probably means
> there is a widespread problem with the fleet, so our mitigation is not
> helping and may be exacerbating those problems or operator difficulties.
> Ensure log roll requests triggered by this new feature happen infrequently
> enough to not cause difficulties under either normal or abnormal conditions.
> A very simple strategy that could work well under both normal and abnormal
> conditions is to define a fairly lengthy interval, default 5 minutes, and
> then ensure we do not roll more than once during this interval for this
> reason.
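
The strategy in the description above (track the worst sync time since the last check, request a roll when it exceeds a threshold, and allow at most one such roll per interval) could be sketched roughly as follows. This is only an illustration and not the attached patch: the class SlowSyncRollPolicy and its method names are hypothetical, and no threshold value is given in the description, so any value passed in is an assumption. Only the 5 minute interval comes from the description.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; not part of HBase or the patch.
final class SlowSyncRollPolicy {
  private final long slowSyncThresholdNanos;  // sync time above this is "too slow"
  private final long minRollIntervalNanos;    // at most one slow-sync roll per interval
  private long maxSyncNanosSinceCheck = 0L;   // worst sync seen since the last check
  private long lastRollNanos = 0L;            // when we last requested a roll
  private boolean hasRolled = false;          // false until the first slow-sync roll

  SlowSyncRollPolicy(long slowSyncThresholdMs, long minRollIntervalMs) {
    this.slowSyncThresholdNanos = TimeUnit.MILLISECONDS.toNanos(slowSyncThresholdMs);
    this.minRollIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minRollIntervalMs);
  }

  // Called after each WAL sync with the time the sync took.
  synchronized void recordSync(long syncDurationNanos) {
    maxSyncNanosSinceCheck = Math.max(maxSyncNanosSinceCheck, syncDurationNanos);
  }

  // Called periodically. Returns true if a roll should be requested now.
  // Resets the tracked maximum so each check covers only the time since the
  // previous check.
  synchronized boolean shouldRequestRoll(long nowNanos) {
    long worst = maxSyncNanosSinceCheck;
    maxSyncNanosSinceCheck = 0L;
    if (worst <= slowSyncThresholdNanos) {
      return false; // pipeline looked healthy over this interval
    }
    if (hasRolled && nowNanos - lastRollNanos < minRollIntervalNanos) {
      return false; // already rolled for this reason recently; do not thrash
    }
    hasRolled = true;
    lastRollNanos = nowNanos;
    return true;
  }
}
```

A caller would presumably feed recordSync from the sync path and poll shouldRequestRoll(System.nanoTime()) from a periodic check, for example with new SlowSyncRollPolicy(10_000L, 300_000L), where the 10 second threshold is an assumed value and 300,000 ms is the 5 minute default suggested above.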

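To put a rough number on the description's remark that a freshly rolled pipeline is unlikely to include a slow datanode, the sketch below computes that probability under a simplifying assumption: the pipeline's datanodes are drawn uniformly at random, without replacement, from the whole fleet. Real HDFS block placement is not uniform (it considers racks and the writer's location), so treat this as a back-of-the-envelope estimate only. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration only; assumes uniform random placement.
final class PipelineOdds {
  // Probability that a pipeline of `replicas` datanodes, drawn without
  // replacement from `totalDatanodes`, includes at least one of the
  // `slowDatanodes` unhealthy hosts.
  static double probabilityPipelineHitsSlowNode(
      int totalDatanodes, int slowDatanodes, int replicas) {
    if (replicas < 1 || totalDatanodes < replicas
        || slowDatanodes < 0 || slowDatanodes > totalDatanodes) {
      throw new IllegalArgumentException("inconsistent cluster parameters");
    }
    int healthy = totalDatanodes - slowDatanodes;
    double allHealthy = 1.0; // probability that every pick is a healthy node
    for (int i = 0; i < replicas; i++) {
      allHealthy *= (double) Math.max(0, healthy - i) / (totalDatanodes - i);
    }
    return 1.0 - allHealthy;
  }
}
```

For example, with 100 datanodes, one of them slow, and a pipeline of three, the estimate works out to 3/100, which is consistent with "low on most clusters"; it grows as the unhealthy subset grows, which is why the rate limit on rolls matters.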


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
