[jira] [Updated] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.
[ https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Xiaoqiao updated HDFS-10453:
---
Affects Version/s: 2.7.1

> ReplicationMonitor thread could stuck for long time due to the race between
> replication and delete of same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.7.1
> Reporter: He Xiaoqiao
> Attachments: HDFS-10453-branch-2.001.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread can get stuck for a long time and lose data with
> low probability. Consider the typical scenario:
> (1) create and close a file with the default number of replicas (3);
> (2) increase the replication (to 10) of the file.
> (3) delete the file while the ReplicationMonitor is scheduling blocks belonging
> to that file for replication.
> When the ReplicationMonitor hang occurs, the NameNode prints logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to
> place enough replicas, still in need of 7 to reach 10
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=false) For more information, please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to
> place enough replicas, still in need of 7 to reach 10
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=false) For more information, please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough
> replicas: expected size is 7 but only 0 storage types can be selected
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK,
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to
> place enough replicas, still in need of 7 to reach 10
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=false) All required storage types are unavailable:
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This happens because 2 threads (#NameNodeRpcServer and #ReplicationMonitor)
> process the same block at the same moment.
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to
> replicate and leaves the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks and then clears the
> references in the blocksMap, neededReplications, etc. The block's NumBytes is
> set to NO_ACK (Long.MAX_VALUE), which is used to indicate that the block
> deletion does not need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to
> chooseTargets for the same blocks, and no node will be selected even after
> traversing the whole cluster, because no node can satisfy the goodness
> criteria (remaining space at least the required size, Long.MAX_VALUE).
> During stage #3 the ReplicationMonitor is stuck for a long time, especially in
> a large cluster. invalidateBlocks & neededReplications keep growing and are
> never consumed, and at worst data is lost.
> This can mostly be avoided by skipping chooseTarget for BlockCommand.NO_ACK
> blocks and removing them from neededReplications.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
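A minimal sketch of the proposed skip, in plain Java outside the real BlockManager. The only HDFS-specific fact it relies on is that a concurrently deleted block has its size set to the NO_ACK marker (Long.MAX_VALUE); the class, field, and method names below are hypothetical stand-ins, not the actual HDFS-10453 patch.

{code}
import java.util.ArrayList;
import java.util.List;

// Editorial sketch only, not the HDFS-10453 patch.
public class SkipDeletedBlocksSketch {

  // Same sentinel value as BlockCommand.NO_ACK in HDFS.
  static final long NO_ACK = Long.MAX_VALUE;

  // Hypothetical stand-in for a block queued for replication work.
  static class PendingReplication {
    final long blockId;
    final long numBytes; // becomes NO_ACK once the block is deleted

    PendingReplication(long blockId, long numBytes) {
      this.blockId = blockId;
      this.numBytes = numBytes;
    }
  }

  /**
   * Keep only blocks that still need targets. A block marked NO_ACK was
   * deleted after being scheduled; calling chooseTarget for it would scan the
   * whole cluster in vain, since no datanode has Long.MAX_VALUE of remaining
   * space. In the real fix such a block would also be removed from
   * neededReplications.
   */
  static List<PendingReplication> dropDeleted(List<PendingReplication> scheduled) {
    List<PendingReplication> stillNeeded = new ArrayList<>();
    for (PendingReplication rw : scheduled) {
      if (rw.numBytes == NO_ACK) {
        continue; // deleted concurrently: skip chooseTarget entirely
      }
      stillNeeded.add(rw);
    }
    return stillNeeded;
  }
}
{code}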
[jira] [Commented] (HDFS-10433) Make retry also works well for Async DFS
[ https://issues.apache.org/jira/browse/HDFS-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305736#comment-15305736 ] Hadoop QA commented on HDFS-10433: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 14s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 17s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 25s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 22s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 23s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 55s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 39s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 32s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 19s {color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 11s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 53s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 36s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 36s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 22s {color} | {color:red} root: The patch generated 44 new + 224 unchanged - 6 fixed = 268 total (was 230) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 51s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 42s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 37s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 44s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 8m 41s {color} | {color:green} hadoop-common in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 57s {color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 75m 47s {color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 136m 50s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.shortcircuit.TestShortCircuitCache | | | hadoop.hdfs.TestAsyncHDFSWithHA | | | hadoop.hdfs.server.datanode.TestDataNodeErasureCodingMetrics | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:2c91fd8 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12806837/h10433_20160528b.patch | | JIRA Issue | HDFS-10433 | | Optional Tests | asflicense xml compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux b56bdee95ca5 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / f5ff05c | | Default Java | 1.8.0_91 | | findbugs | v3.0.0 | | checkstyle |
[jira] [Commented] (HDFS-10467) Router-based HDFS federation
[ https://issues.apache.org/jira/browse/HDFS-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305716#comment-15305716 ]

He Tianyi commented on HDFS-10467:
--

Thanks for sharing. I've implemented a similar approach as a separate project; see https://github.com/bytedance/nnproxy. I am currently using it in production to back 2 namenodes behind a mount table with 20+ entries, and it has worked well (about 12K TPS). It looks like HDFS Router Federation includes more features. Shall we work together?

> Router-based HDFS federation
>
>
> Key: HDFS-10467
> URL: https://issues.apache.org/jira/browse/HDFS-10467
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: fs
> Affects Versions: 2.7.2
> Reporter: Inigo Goiri
> Attachments: HDFS Router Federation.pdf
>
>
> Add a Router to provide a federated view of multiple HDFS clusters.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10433) Make retry also works well for Async DFS
[ https://issues.apache.org/jira/browse/HDFS-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze updated HDFS-10433:
---
Attachment: h10433_20160528c.patch

h10433_20160528c.patch: fixes test failure.

> Make retry also works well for Async DFS
>
>
> Key: HDFS-10433
> URL: https://issues.apache.org/jira/browse/HDFS-10433
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs
> Reporter: Xiaobing Zhou
> Assignee: Tsz Wo Nicholas Sze
> Attachments: h10433_20160524.patch, h10433_20160525.patch,
> h10433_20160525b.patch, h10433_20160527.patch, h10433_20160528.patch,
> h10433_20160528b.patch, h10433_20160528c.patch
>
>
> In the current Async DFS implementation, file system calls are invoked and
> return a Future to clients immediately. Clients call Future#get to retrieve the
> final results. Future#get internally invokes a chain of callbacks residing in
> ClientNamenodeProtocolTranslatorPB, ProtobufRpcEngine and ipc.Client. This
> callback path bypasses the original retry layer/logic designed for
> synchronous DFS. This proposes refactoring so that retry also works for Async
> DFS.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
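A minimal sketch, outside the real HDFS code, of why a call that only hands back a Future slips past a reflection-based retry wrapper of the kind used for synchronous DFS; the interface, method names, and retry policy below are all hypothetical.

{code}
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.CompletableFuture;

// Editorial sketch only; not HDFS code.
public class RetryProxySketch {

  // Hypothetical client-side protocol with a sync and an async variant.
  interface NamenodeOps {
    void renameSync(String src, String dst) throws Exception;
    CompletableFuture<Void> renameAsync(String src, String dst);
  }

  /** Wrap a target with a simple retry loop, RetryInvocationHandler-style. */
  static NamenodeOps withRetries(NamenodeOps target, int maxAttempts) {
    InvocationHandler handler = (proxy, method, args) -> {
      Exception last = null;
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          // For renameSync, a remote failure is thrown here and retried.
          // For renameAsync, this returns as soon as the Future is created;
          // a later failure only surfaces from Future#get, long after this
          // retry loop has already exited, so the callback path bypasses it.
          return method.invoke(target, args);
        } catch (Exception e) {
          last = e;
        }
      }
      throw last;
    };
    return (NamenodeOps) Proxy.newProxyInstance(
        NamenodeOps.class.getClassLoader(),
        new Class<?>[] { NamenodeOps.class }, handler);
  }
}
{code}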
[jira] [Commented] (HDFS-10433) Make retry also works well for Async DFS
[ https://issues.apache.org/jira/browse/HDFS-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305707#comment-15305707 ] Hadoop QA commented on HDFS-10433: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 15s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 48s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 14s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 24s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 30s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 5m 16s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 29s {color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 15s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 29s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 8m 15s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 8m 15s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 22s {color} | {color:red} root: The patch generated 44 new + 224 unchanged - 6 fixed = 268 total (was 230) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 12s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 35s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 50s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 21s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 9m 28s {color} | {color:green} hadoop-common in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 1m 8s {color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 76m 30s {color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 22s {color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 136m 37s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestPersistBlocks | | | hadoop.hdfs.TestAsyncHDFSWithHA | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:2c91fd8 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12806837/h10433_20160528b.patch | | JIRA Issue | HDFS-10433 | | Optional Tests | asflicense xml compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux 990848a97abf 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / f5ff05c | | Default Java | 1.8.0_91 | | findbugs | v3.0.0 | | checkstyle |
[jira] [Updated] (HDFS-10433) Make retry also works well for Async DFS
[ https://issues.apache.org/jira/browse/HDFS-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze updated HDFS-10433:
---
Attachment: h10433_20160528b.patch

h10433_20160528b.patch: makes testAsyncWithoutRetry run in both HA and non-HA cases.

> Make retry also works well for Async DFS
>
>
> Key: HDFS-10433
> URL: https://issues.apache.org/jira/browse/HDFS-10433
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs
> Reporter: Xiaobing Zhou
> Assignee: Tsz Wo Nicholas Sze
> Attachments: h10433_20160524.patch, h10433_20160525.patch,
> h10433_20160525b.patch, h10433_20160527.patch, h10433_20160528.patch,
> h10433_20160528b.patch
>
>
> In the current Async DFS implementation, file system calls are invoked and
> return a Future to clients immediately. Clients call Future#get to retrieve the
> final results. Future#get internally invokes a chain of callbacks residing in
> ClientNamenodeProtocolTranslatorPB, ProtobufRpcEngine and ipc.Client. This
> callback path bypasses the original retry layer/logic designed for
> synchronous DFS. This proposes refactoring so that retry also works for Async
> DFS.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10433) Make retry also works well for Async DFS
[ https://issues.apache.org/jira/browse/HDFS-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305650#comment-15305650 ] Hadoop QA commented on HDFS-10433: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 17s {color} | {color:blue} Docker mode activated. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s {color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s {color} | {color:green} The patch appears to include 4 new or modified test files. {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 39s {color} | {color:blue} Maven dependency ordering for branch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 10s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 13s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 22s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 15s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 35s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 17s {color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 16s {color} | {color:green} trunk passed {color} | | {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 12s {color} | {color:blue} Maven dependency ordering for patch {color} | | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 52s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 6m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 6m 11s {color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 1m 20s {color} | {color:red} root: The patch generated 46 new + 224 unchanged - 6 fixed = 270 total (was 230) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 2m 11s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 34s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s {color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s {color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 40s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 2m 17s {color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 12m 49s {color} | {color:green} hadoop-common in the patch passed. 
{color} | | {color:green}+1{color} | {color:green} unit {color} | {color:green} 0m 54s {color} | {color:green} hadoop-hdfs-client in the patch passed. {color} | | {color:red}-1{color} | {color:red} unit {color} | {color:red} 74m 14s {color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 27s {color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 132m 44s {color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestErasureCodeBenchmarkThroughput | \\ \\ || Subsystem || Report/Notes || | Docker | Image:yetus/hadoop:2c91fd8 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12806827/h10433_20160528.patch | | JIRA Issue | HDFS-10433 | | Optional Tests | asflicense xml compile javac javadoc mvninstall mvnsite unit findbugs checkstyle | | uname | Linux ebb0db7a1a3a 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh | | git revision | trunk / f5ff05c | | Default Java | 1.8.0_91 | | findbugs | v3.0.0 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/15598/artifact/patchprocess/diff-checkstyle-root.txt | |
[jira] [Updated] (HDFS-10433) Make retry also works well for Async DFS
[ https://issues.apache.org/jira/browse/HDFS-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo Nicholas Sze updated HDFS-10433:
---
Attachment: h10433_20160528.patch

h10433_20160528.patch: sync'ed with trunk.

> Make retry also works well for Async DFS
>
>
> Key: HDFS-10433
> URL: https://issues.apache.org/jira/browse/HDFS-10433
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: hdfs
> Reporter: Xiaobing Zhou
> Assignee: Tsz Wo Nicholas Sze
> Attachments: h10433_20160524.patch, h10433_20160525.patch,
> h10433_20160525b.patch, h10433_20160527.patch, h10433_20160528.patch
>
>
> In the current Async DFS implementation, file system calls are invoked and
> return a Future to clients immediately. Clients call Future#get to retrieve the
> final results. Future#get internally invokes a chain of callbacks residing in
> ClientNamenodeProtocolTranslatorPB, ProtobufRpcEngine and ipc.Client. This
> callback path bypasses the original retry layer/logic designed for
> synchronous DFS. This proposes refactoring so that retry also works for Async
> DFS.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-10426) TestPendingInvalidateBlock failed in trunk
[ https://issues.apache.org/jira/browse/HDFS-10426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305451#comment-15305451 ]

Masatake Iwasaki commented on HDFS-10426:
-

I found other intermittent failures in {{TestPendingInvalidateBlock}}. It would be nice if those were fixed here too.

{noformat}
Tests run: 2, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 31.54 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock
testPendingDeletion(org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock) Time elapsed: 12.213 sec <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.junit.Assert.assertEquals(Assert.java:555)
  at org.junit.Assert.assertEquals(Assert.java:542)
  at org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock.testPendingDeletion(TestPendingInvalidateBlock.java:100)
{noformat}
{noformat}
testPendingDeleteUnknownBlocks(org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock) Time elapsed: 18.993 sec <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<2>
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.failNotEquals(Assert.java:743)
  at org.junit.Assert.assertEquals(Assert.java:118)
  at org.junit.Assert.assertEquals(Assert.java:555)
  at org.junit.Assert.assertEquals(Assert.java:542)
  at org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock.testPendingDeleteUnknownBlocks(TestPendingInvalidateBlock.java:177)
{noformat}

> TestPendingInvalidateBlock failed in trunk
> --
>
> Key: HDFS-10426
> URL: https://issues.apache.org/jira/browse/HDFS-10426
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Yiqun Lin
> Assignee: Yiqun Lin
> Attachments: HDFS-10426.001.patch, HDFS-10426.002.patch
>
>
> The test {{TestPendingInvalidateBlock}} fails sometimes. The stack info:
> {code}
> org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock
> testPendingDeletion(org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock)
> Time elapsed: 7.703 sec <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<1>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock.testPendingDeletion(TestPendingInvalidateBlock.java:92)
> {code}
> It looks like the {{invalidateBlock}} has been removed before we do the check
> {code}
> // restart NN
> cluster.restartNameNode(true);
> dfs.delete(foo, true);
> Assert.assertEquals(0, cluster.getNamesystem().getBlocksTotal());
> Assert.assertEquals(REPLICATION, cluster.getNamesystem()
> .getPendingDeletionBlocks());
> Assert.assertEquals(REPLICATION,
> dfs.getPendingDeletionBlocksCount());
> {code}
> I looked into the related configurations and found that the property
> {{dfs.namenode.replication.interval}} was set to just 1 second in this test.
> If the delete operation is slow, the check can run after both the
> {{dfs.namenode.startup.delay.block.deletion.sec}} delay and the replication
> interval have elapsed, which causes this failure. We can see from the stack
> info above that the failed test took 7.7s, more than the 5+1 seconds.
> One way to improve this:
> * Increase the value of {{dfs.namenode.replication.interval}}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
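A minimal sketch of that first option, assuming the usual MiniDFSCluster-based setup of this test: the two property names are the real 2.x keys, while the chosen values and the surrounding scaffolding are illustrative only, not the committed fix.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

// Editorial sketch only; not the committed HDFS-10426 fix.
public class ReplicationIntervalSetupSketch {
  public static MiniDFSCluster startCluster() throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Keep the pending-deletion delay short, as the test already does.
    conf.setLong("dfs.namenode.startup.delay.block.deletion.sec", 5L);
    // Raise the ReplicationMonitor interval well above the test's 1 second
    // so it cannot run between delete() and the pending-deletion assertions.
    conf.setInt("dfs.namenode.replication.interval", 30);
    return new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
  }
}
{code}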
[jira] [Commented] (HDFS-10426) TestPendingInvalidateBlock failed in trunk
[ https://issues.apache.org/jira/browse/HDFS-10426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305449#comment-15305449 ]

Masatake Iwasaki commented on HDFS-10426:
-

bq. And I have tested the way you commented. But I found it still does not affect cluster.getNamesystem().getPendingDeletionBlocks(), and the invalidateBlocks will still be removed.

Sorry, my assumption was wrong.

{code}
dfs.delete(foo, true);
Assert.assertEquals(0, cluster.getNamesystem().getBlocksTotal());
Assert.assertEquals(REPLICATION, cluster.getNamesystem()
    .getPendingDeletionBlocks());
{code}

I could see 2 causes for getting {{AssertionError: expected:<2> but was:<1>}} on the above.
# ReplicationMonitor kicks in between {{delete}} and checking the pending deletion blocks count.
# {{delete}} is called before the NN processes the block report from one of the DNs.

Your 002 patch addresses #1, but I think it would be better to avoid adding test-only code to BlockManager. In order to fix #2, we should wait for the replication count of the block to reach 2 before {{delete}}.

> TestPendingInvalidateBlock failed in trunk
> --
>
> Key: HDFS-10426
> URL: https://issues.apache.org/jira/browse/HDFS-10426
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: test
> Reporter: Yiqun Lin
> Assignee: Yiqun Lin
> Attachments: HDFS-10426.001.patch, HDFS-10426.002.patch
>
>
> The test {{TestPendingInvalidateBlock}} fails sometimes. The stack info:
> {code}
> org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock
> testPendingDeletion(org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock)
> Time elapsed: 7.703 sec <<< FAILURE!
> java.lang.AssertionError: expected:<2> but was:<1>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:743)
> at org.junit.Assert.assertEquals(Assert.java:118)
> at org.junit.Assert.assertEquals(Assert.java:555)
> at org.junit.Assert.assertEquals(Assert.java:542)
> at
> org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock.testPendingDeletion(TestPendingInvalidateBlock.java:92)
> {code}
> It looks like the {{invalidateBlock}} has been removed before we do the check
> {code}
> // restart NN
> cluster.restartNameNode(true);
> dfs.delete(foo, true);
> Assert.assertEquals(0, cluster.getNamesystem().getBlocksTotal());
> Assert.assertEquals(REPLICATION, cluster.getNamesystem()
> .getPendingDeletionBlocks());
> Assert.assertEquals(REPLICATION,
> dfs.getPendingDeletionBlocksCount());
> {code}
> I looked into the related configurations and found that the property
> {{dfs.namenode.replication.interval}} was set to just 1 second in this test.
> If the delete operation is slow, the check can run after both the
> {{dfs.namenode.startup.delay.block.deletion.sec}} delay and the replication
> interval have elapsed, which causes this failure. We can see from the stack
> info above that the failed test took 7.7s, more than the 5+1 seconds.
> One way to improve this:
> * Increase the value of {{dfs.namenode.replication.interval}}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
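A minimal sketch of that second fix, reusing the names from the quoted test (cluster, dfs, foo, REPLICATION); DFSTestUtil.waitReplication is an existing HDFS test helper, but its exact placement here is an assumption, not the committed HDFS-10426 change.

{code}
// Editorial sketch only: make the NN's replica count deterministic before
// deleting, so getPendingDeletionBlocks() cannot race with block reports.
cluster.restartNameNode(true);
// Wait until the restarted NN has processed block reports from both DNs,
// i.e. it sees REPLICATION live replicas of the file (addresses cause #2).
DFSTestUtil.waitReplication(dfs, foo, (short) REPLICATION);
dfs.delete(foo, true);
Assert.assertEquals(0, cluster.getNamesystem().getBlocksTotal());
Assert.assertEquals(REPLICATION, cluster.getNamesystem()
    .getPendingDeletionBlocks());
{code}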
[jira] [Commented] (HDFS-10453) ReplicationMonitor thread could stuck for long time due to the race between replication and delete of same file in a large cluster.
[ https://issues.apache.org/jira/browse/HDFS-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15305222#comment-15305222 ]

He Xiaoqiao commented on HDFS-10453:

hi [~shahrs87], thanks a lot for watching this issue. Your reviews and further suggestions would be very helpful.

> ReplicationMonitor thread could stuck for long time due to the race between
> replication and delete of same file in a large cluster.
> ---
>
> Key: HDFS-10453
> URL: https://issues.apache.org/jira/browse/HDFS-10453
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Reporter: He Xiaoqiao
> Attachments: HDFS-10453-branch-2.001.patch, HDFS-10453.001.patch
>
>
> The ReplicationMonitor thread can get stuck for a long time and lose data with
> low probability. Consider the typical scenario:
> (1) create and close a file with the default number of replicas (3);
> (2) increase the replication (to 10) of the file.
> (3) delete the file while the ReplicationMonitor is scheduling blocks belonging
> to that file for replication.
> When the ReplicationMonitor hang occurs, the NameNode prints logs like:
> {code:xml}
> 2016-04-19 10:20:48,083 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to
> place enough replicas, still in need of 7 to reach 10
> (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=false) For more information, please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> ..
> 2016-04-19 10:21:17,184 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to
> place enough replicas, still in need of 7 to reach 10
> (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=false) For more information, please enable DEBUG log level on
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
> 2016-04-19 10:21:17,184 WARN
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough
> replicas: expected size is 7 but only 0 storage types can be selected
> (replication=10, selected=[], unavailable=[DISK, ARCHIVE], removed=[DISK,
> DISK, DISK, DISK, DISK, DISK, DISK], policy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2016-04-19 10:21:17,184 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to
> place enough replicas, still in need of 7 to reach 10
> (unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]},
> newBlock=false) All required storage types are unavailable:
> unavailableStorages=[DISK, ARCHIVE], storagePolicy=BlockStoragePolicy{HOT:7,
> storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
> {code}
> This happens because 2 threads (#NameNodeRpcServer and #ReplicationMonitor)
> process the same block at the same moment.
> (1) ReplicationMonitor#computeReplicationWorkForBlocks gets blocks to
> replicate and leaves the global lock.
> (2) FSNamesystem#delete is invoked to delete the blocks and then clears the
> references in the blocksMap, neededReplications, etc. The block's NumBytes is
> set to NO_ACK (Long.MAX_VALUE), which is used to indicate that the block
> deletion does not need an explicit ACK from the node.
> (3) ReplicationMonitor#computeReplicationWorkForBlocks continues to
> chooseTargets for the same blocks, and no node will be selected even after
> traversing the whole cluster, because no node can satisfy the goodness
> criteria (remaining space at least the required size, Long.MAX_VALUE).
> During stage #3 the ReplicationMonitor is stuck for a long time, especially in
> a large cluster. invalidateBlocks & neededReplications keep growing and are
> never consumed, and at worst data is lost.
> This can mostly be avoided by skipping chooseTarget for BlockCommand.NO_ACK
> blocks and removing them from neededReplications.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org