[jira] [Commented] (HDFS-14974) RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
[ https://issues.apache.org/jira/browse/HDFS-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970741#comment-16970741 ] Ayush Saxena commented on HDFS-14974: - Thanx [~elgoiri] for putting this up. Makes sense to correct. Is this just a problem with this test? > RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port > --- > > Key: HDFS-14974 > URL: https://issues.apache.org/jira/browse/HDFS-14974 > Project: Hadoop HDFS > Issue Type: Improvement > Reporter: Íñigo Goiri > Priority: Major > > Currently, {{TestRouterSecurityManager#testCreateCredentials}} creates a > Router with the default ports. However, these ports might be used. We should > set it to :0 for it to be assigned dynamically. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14974) RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
[ https://issues.apache.org/jira/browse/HDFS-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970740#comment-16970740 ] Íñigo Goiri edited comment on HDFS-14974 at 11/9/19 6:41 AM: - I happened to have Jupyter running on my machine on port . When running the tests, I got:
{code}
Caused by: java.net.BindException: Problem binding to [localhost:] java.net.BindException: Address already in use: bind; For more details see: http://wiki.apache.org/hadoop/BindException
{code}
A quick solution would be to do something like:
{code}
// Start routers with only an RPC service
Configuration routerConf = new RouterConfigBuilder()
    .metrics()
    .rpc()
    .build();
routerConf.set("dfs.federation.router.rpc-address", "0.0.0.0:0");
conf.addResource(routerConf);
Router router = new Router();
router.init(conf);
router.start();
{code}
But maybe we want to make this a little more general. The MiniRouterDFSCluster already does:
{code}
public Configuration generateRouterConfiguration(String nsId, String nnId) {
  Configuration conf = new HdfsConfiguration(false);
  conf.addResource(generateNamenodeConfiguration(nsId));
  conf.setInt(DFS_ROUTER_HANDLER_COUNT_KEY, 10);
  conf.set(DFS_ROUTER_RPC_ADDRESS_KEY, "127.0.0.1:0");
  conf.set(DFS_ROUTER_RPC_BIND_HOST_KEY, "0.0.0.0");
  ...
{code}
was (Author: elgoiri): I happened to have Jupyter running on my machine on port .
When running the tests, I got:
{code}
Caused by: java.net.BindException: Problem binding to [localhost:] java.net.BindException: Address already in use: bind; For more details see: http://wiki.apache.org/hadoop/BindException
{code}
A quick solution would be to do something like:
{code}
// Start routers with only an RPC service
Configuration routerConf = new RouterConfigBuilder()
    .metrics()
    .rpc()
    .build();
routerConf.set("dfs.federation.router.rpc-address", "0.0.0.0:0");
conf.addResource(routerConf);
Router router = new Router();
router.init(conf);
router.start();
{code}
But maybe we want to make this a little more general. The MiniRouterDFSCluster already does:
{code}
public Configuration generateRouterConfiguration(String nsId, String nnId) {
  Configuration conf = new HdfsConfiguration(false);
  conf.addResource(generateNamenodeConfiguration(nsId));
  conf.setInt(DFS_ROUTER_HANDLER_COUNT_KEY, 10);
  conf.set(DFS_ROUTER_RPC_ADDRESS_KEY, "127.0.0.1:0");
  conf.set(DFS_ROUTER_RPC_BIND_HOST_KEY, "0.0.0.0");
  ...
{code}
> RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
> ---
>
> Key: HDFS-14974
> URL: https://issues.apache.org/jira/browse/HDFS-14974
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Íñigo Goiri
> Priority: Major
>
> Currently, {{TestRouterSecurityManager#testCreateCredentials}} creates a
> Router with the default ports. However, these ports might be used. We should
> set it to :0 for it to be assigned dynamically.
[jira] [Comment Edited] (HDFS-14974) RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
[ https://issues.apache.org/jira/browse/HDFS-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970740#comment-16970740 ] Íñigo Goiri edited comment on HDFS-14974 at 11/9/19 6:41 AM: - I happened to have Jupyter running on my machine on port . When running the tests, I got:
{code}
Caused by: java.net.BindException: Problem binding to [localhost:] java.net.BindException: Address already in use: bind; For more details see: http://wiki.apache.org/hadoop/BindException
{code}
A quick solution would be to do something like:
{code}
// Start routers with only an RPC service
Configuration routerConf = new RouterConfigBuilder()
    .metrics()
    .rpc()
    .build();
routerConf.set("dfs.federation.router.rpc-address", "0.0.0.0:0");
conf.addResource(routerConf);
Router router = new Router();
router.init(conf);
router.start();
{code}
But maybe we want to make this a little more general. The MiniRouterDFSCluster already does:
{code}
public Configuration generateRouterConfiguration(String nsId, String nnId) {
  Configuration conf = new HdfsConfiguration(false);
  conf.addResource(generateNamenodeConfiguration(nsId));
  conf.setInt(DFS_ROUTER_HANDLER_COUNT_KEY, 10);
  conf.set(DFS_ROUTER_RPC_ADDRESS_KEY, "127.0.0.1:0");
  conf.set(DFS_ROUTER_RPC_BIND_HOST_KEY, "0.0.0.0");
  ...
{code}
was (Author: elgoiri): I happened to have Jupyter running on my machine on port . When running the tests, I got:
{code}
Caused by: java.net.BindException: Problem binding to [localhost:] java.net.BindException: Address already in use: bind; For more details see: http://wiki.apache.org/hadoop/BindException
{code}
> RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
> ---
>
> Key: HDFS-14974
> URL: https://issues.apache.org/jira/browse/HDFS-14974
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Íñigo Goiri
> Priority: Major
>
> Currently, {{TestRouterSecurityManager#testCreateCredentials}} creates a
> Router with the default ports.
> However, these ports might be used. We should
> set it to :0 for it to be assigned dynamically.
[jira] [Commented] (HDFS-14974) RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
[ https://issues.apache.org/jira/browse/HDFS-14974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970740#comment-16970740 ] Íñigo Goiri commented on HDFS-14974: I happened to have Jupyter running on my machine on port . When running the tests, I got:
{code}
Caused by: java.net.BindException: Problem binding to [localhost:] java.net.BindException: Address already in use: bind; For more details see: http://wiki.apache.org/hadoop/BindException
{code}
> RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
> ---
>
> Key: HDFS-14974
> URL: https://issues.apache.org/jira/browse/HDFS-14974
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Íñigo Goiri
> Priority: Major
>
> Currently, {{TestRouterSecurityManager#testCreateCredentials}} creates a
> Router with the default ports. However, these ports might be used. We should
> set it to :0 for it to be assigned dynamically.
[jira] [Created] (HDFS-14974) RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port
Íñigo Goiri created HDFS-14974: -- Summary: RBF: TestRouterSecurityManager#testCreateCredentials should use :0 for port Key: HDFS-14974 URL: https://issues.apache.org/jira/browse/HDFS-14974 Project: Hadoop HDFS Issue Type: Improvement Reporter: Íñigo Goiri Currently, {{TestRouterSecurityManager#testCreateCredentials}} creates a Router with the default ports. However, these ports might be used. We should set it to :0 for it to be assigned dynamically.
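The fix discussed in this thread relies on the OS handing out an ephemeral port when port 0 is requested, which is what "127.0.0.1:0" triggers for the Router RPC server. As a minimal, self-contained sketch (plain java.net, not the Router code, and the class name is illustrative), binding a ServerSocket to port 0 shows why two concurrently running tests can never collide on a hardcoded default port:

```java
import java.io.IOException;
import java.net.ServerSocket;

// Sketch only: binding to port 0 asks the kernel for any free ephemeral
// port, so each bind gets its own port and BindException cannot occur
// because of a clash with a fixed default port.
public class EphemeralPortSketch {
    public static void main(String[] args) throws IOException {
        try (ServerSocket a = new ServerSocket(0);
             ServerSocket b = new ServerSocket(0)) {
            // Each socket reports the distinct port the OS assigned to it.
            System.out.println("a=" + a.getLocalPort() + " b=" + b.getLocalPort());
        }
    }
}
```

This is the same mechanism MiniRouterDFSCluster uses when it sets DFS_ROUTER_RPC_ADDRESS_KEY to "127.0.0.1:0".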
[jira] [Comment Edited] (HDDS-426) Add field modificationTime for Volume and Bucket
[ https://issues.apache.org/jira/browse/HDDS-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970728#comment-16970728 ] YiSheng Lien edited comment on HDDS-426 at 11/9/19 5:50 AM: Hello [~dineshchitlangia] [~arp], I'm going to append the modificationTime to Volume and Bucket. A question: if the *modificationTime of Key* is updated, should we propagate the *modificationTime of Key* to Volume and Bucket? (I think we should do this.) Thanks was (Author: cxorm): Hello [~dineshchitlangia] [~arp], I'm going to append the modificationTime to Volume and Bucket. A question, If the modificationTime of Key is updated, should we propagate the modificationTime of Key to Volume and Bucket ? (I think we should do this.) Thanks > Add field modificationTime for Volume and Bucket > > > Key: HDDS-426 > URL: https://issues.apache.org/jira/browse/HDDS-426 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Manager > Reporter: Dinesh Chitlangia > Assignee: YiSheng Lien > Priority: Major > Labels: newbie > > There are update operations that can be performed for Volume, Bucket and Key. > While Key records the modification time, Volume and Bucket do not capture > this. > > This Jira proposes to add the required field to Volume and Bucket in order to > capture the modificationTime. > > Current Status: > {noformat} > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoVolume /dummyvol > 2018-09-10 17:16:12 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > { > "owner" : { > "name" : "bilbo" > }, > "quota" : { > "unit" : "TB", > "size" : 1048576 > }, > "volumeName" : "dummyvol", > "createdOn" : "Mon, 10 Sep 2018 17:11:32 GMT", > "createdBy" : "bilbo" > } > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoBucket /dummyvol/mybuck > 2018-09-10 17:15:25 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > { > "volumeName" : "dummyvol", > "bucketName" : "mybuck", > "createdOn" : "Mon, 10 Sep 2018 17:12:09 GMT", > "acls" : [ { > "type" : "USER", > "name" : "hadoop", > "rights" : "READ_WRITE" > }, { > "type" : "GROUP", > "name" : "users", > "rights" : "READ_WRITE" > }, { > "type" : "USER", > "name" : "spark", > "rights" : "READ_WRITE" > } ], > "versioning" : "DISABLED", > "storageType" : "DISK" > } > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoKey /dummyvol/mybuck/myk1 > 2018-09-10 17:19:43 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > { > "version" : 0, > "md5hash" : null, > "createdOn" : "Mon, 10 Sep 2018 17:19:04 GMT", > "modifiedOn" : "Mon, 10 Sep 2018 17:19:04 GMT", > "size" : 0, > "keyName" : "myk1", > "keyLocations" : [ ] > }{noformat}
[jira] [Comment Edited] (HDDS-426) Add field modificationTime for Volume and Bucket
[ https://issues.apache.org/jira/browse/HDDS-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970728#comment-16970728 ] YiSheng Lien edited comment on HDDS-426 at 11/9/19 5:47 AM: Hello [~dineshchitlangia] [~arp], I'm going to append the modificationTime to Volume and Bucket. A question: if the modificationTime of Key is updated, should we propagate the modificationTime of Key to Volume and Bucket? (I think we should do this.) Thanks was (Author: cxorm): Hello [~dineshchitlangia] [~arp], I'm going to append the modificationTime to Volume and Bucket. A question, If the modificationTime of Key is updated, should we propagate the modificationTime of Key to Volume and Bucket ? (I think we should do like this.) Thanks > Add field modificationTime for Volume and Bucket > > > Key: HDDS-426 > URL: https://issues.apache.org/jira/browse/HDDS-426 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Manager > Reporter: Dinesh Chitlangia > Assignee: YiSheng Lien > Priority: Major > Labels: newbie > > There are update operations that can be performed for Volume, Bucket and Key. > While Key records the modification time, Volume and Bucket do not capture > this. > > This Jira proposes to add the required field to Volume and Bucket in order to > capture the modificationTime. > > Current Status: > {noformat} > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoVolume /dummyvol > 2018-09-10 17:16:12 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > { > "owner" : { > "name" : "bilbo" > }, > "quota" : { > "unit" : "TB", > "size" : 1048576 > }, > "volumeName" : "dummyvol", > "createdOn" : "Mon, 10 Sep 2018 17:11:32 GMT", > "createdBy" : "bilbo" > } > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoBucket /dummyvol/mybuck > 2018-09-10 17:15:25 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > { > "volumeName" : "dummyvol", > "bucketName" : "mybuck", > "createdOn" : "Mon, 10 Sep 2018 17:12:09 GMT", > "acls" : [ { > "type" : "USER", > "name" : "hadoop", > "rights" : "READ_WRITE" > }, { > "type" : "GROUP", > "name" : "users", > "rights" : "READ_WRITE" > }, { > "type" : "USER", > "name" : "spark", > "rights" : "READ_WRITE" > } ], > "versioning" : "DISABLED", > "storageType" : "DISK" > } > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoKey /dummyvol/mybuck/myk1 > 2018-09-10 17:19:43 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > { > "version" : 0, > "md5hash" : null, > "createdOn" : "Mon, 10 Sep 2018 17:19:04 GMT", > "modifiedOn" : "Mon, 10 Sep 2018 17:19:04 GMT", > "size" : 0, > "keyName" : "myk1", > "keyLocations" : [ ] > }{noformat}
[jira] [Commented] (HDDS-426) Add field modificationTime for Volume and Bucket
[ https://issues.apache.org/jira/browse/HDDS-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970728#comment-16970728 ] YiSheng Lien commented on HDDS-426: --- Hello [~dineshchitlangia] [~arp], I'm going to append the modificationTime to Volume and Bucket. A question: if the modificationTime of Key is updated, should we propagate the modificationTime of Key to Volume and Bucket? (I think we should do this.) Thanks > Add field modificationTime for Volume and Bucket > > > Key: HDDS-426 > URL: https://issues.apache.org/jira/browse/HDDS-426 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Manager > Reporter: Dinesh Chitlangia > Assignee: YiSheng Lien > Priority: Major > Labels: newbie > > There are update operations that can be performed for Volume, Bucket and Key. > While Key records the modification time, Volume and Bucket do not capture > this. > > This Jira proposes to add the required field to Volume and Bucket in order to > capture the modificationTime. > > Current Status: > {noformat} > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoVolume /dummyvol > 2018-09-10 17:16:12 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > { > "owner" : { > "name" : "bilbo" > }, > "quota" : { > "unit" : "TB", > "size" : 1048576 > }, > "volumeName" : "dummyvol", > "createdOn" : "Mon, 10 Sep 2018 17:11:32 GMT", > "createdBy" : "bilbo" > } > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoBucket /dummyvol/mybuck > 2018-09-10 17:15:25 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... 
using builtin-java classes where applicable > { > "volumeName" : "dummyvol", > "bucketName" : "mybuck", > "createdOn" : "Mon, 10 Sep 2018 17:12:09 GMT", > "acls" : [ { > "type" : "USER", > "name" : "hadoop", > "rights" : "READ_WRITE" > }, { > "type" : "GROUP", > "name" : "users", > "rights" : "READ_WRITE" > }, { > "type" : "USER", > "name" : "spark", > "rights" : "READ_WRITE" > } ], > "versioning" : "DISABLED", > "storageType" : "DISK" > } > hadoop@1987b5de4203:~$ ./bin/ozone oz -infoKey /dummyvol/mybuck/myk1 > 2018-09-10 17:19:43 WARN NativeCodeLoader:60 - Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > { > "version" : 0, > "md5hash" : null, > "createdOn" : "Mon, 10 Sep 2018 17:19:04 GMT", > "modifiedOn" : "Mon, 10 Sep 2018 17:19:04 GMT", > "size" : 0, > "keyName" : "myk1", > "keyLocations" : [ ] > }{noformat}
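The proposal in this thread (mirror Key's {{modifiedOn}} on Volume and Bucket) can be sketched in a few lines. This is a hypothetical illustration, not Ozone Manager code; the class and method names are assumptions made for demonstration:

```java
import java.time.Instant;

// Hypothetical sketch of the proposed field: a bucket that stamps a
// modification time on every update operation, mirroring what Key
// already records. Not actual Ozone Manager code; names are illustrative.
public class BucketInfoSketch {
    private final Instant createdOn;
    private Instant modifiedOn;
    private String storageType = "DISK";

    public BucketInfoSketch() {
        this.createdOn = Instant.now();
        this.modifiedOn = this.createdOn; // starts equal to creation time
    }

    public void setStorageType(String type) {
        this.storageType = type;
        this.modifiedOn = Instant.now(); // every update refreshes the stamp
    }

    public String getStorageType() { return storageType; }
    public Instant getCreatedOn() { return createdOn; }
    public Instant getModifiedOn() { return modifiedOn; }
}
```

The open question above (whether a Key update should also bump the enclosing Bucket's and Volume's timestamps) would simply mean calling the same refresh from the key-update path.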
[jira] [Commented] (HDFS-14967) TestWebHDFS - Many test cases are failing in Windows
[ https://issues.apache.org/jira/browse/HDFS-14967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970703#comment-16970703 ] Hadoop QA commented on HDFS-14967: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 52s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 18s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 39s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 55s{color} | {color:red} hadoop-hdfs-project_hadoop-hdfs generated 2 new + 578 unchanged - 2 fixed = 580 total (was 580) {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 39s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 13 new + 31 unchanged - 17 fixed = 44 total (was 48) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 31s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 8s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red} 99m 28s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 32s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}161m 37s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.tools.TestDFSZKFailoverController | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14967 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985401/HDFS-14967.001.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 0b8c6ec05f3c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 42fc888 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | javac | https://builds.apache.org/job/PreCommit-HDFS-Build/28283/artifact/out/diff-compile-javac-hadoop-hdfs-project_hadoop-hdfs.txt | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28283/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit |
[jira] [Commented] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970673#comment-16970673 ] Konstantin Shvachko commented on HDFS-14973: Hey [~xkrogen], good analysis. IIRC, the idea was to delay the first "wave" of getBlocks: the first 100 in your example, which is the number of dispatcher threads. Indeed the first 20 will go right away without delay; this is how many calls we want to tolerate on the NameNode at once. One second later another 20 getBlocks() will hit the NameNode, and so on up to 100. The next wave of dispatcher threads after 100 should not hit the NameNode right away. It is supposed to first call {{executePendingMove()}}, then {{getBlocks()}}. And {{executePendingMove()}} naturally throttles the dispatcher, so it was not necessary to delay the subsequent waves. I remember it worked. It is possible that {{executePendingMove()}} became faster due to HDFS-11742, which I did not check, or something else has changed. > Balancer getBlocks RPC dispersal does not function properly > --- > > Key: HDFS-14973 > URL: https://issues.apache.org/jira/browse/HDFS-14973 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer mover > Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0 > Reporter: Erik Krogen > Assignee: Erik Krogen > Priority: Major > Attachments: HDFS-14973.000.patch, HDFS-14973.test.patch > > > In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls > issued by the Balancer/Mover more dispersed, to alleviate load on the > NameNode, since {{getBlocks}} can be very expensive and the Balancer should > not impact normal cluster operation. > Unfortunately, this functionality does not function as expected, especially > when the dispatcher thread count is low. 
The primary issue is that the delay > is applied only to the first N threads that are submitted to the dispatcher's > executor, where N is the size of the dispatcher's threadpool, but *not* to > the first R threads, where R is the number of allowed {{getBlocks}} QPS > (currently hardcoded to 20). For example, if the threadpool size is 100 (the > default), threads 0-19 have no delay, 20-99 have increased levels of delay, > and 100+ have no delay. As I understand it, the intent of the logic was that > the delay applied to the first 100 threads would force the dispatcher > executor's threads to all be consumed, thus blocking subsequent (non-delayed) > threads until the delay period has expired. However, threads 0-19 can finish > very quickly (their work can often be fulfilled in the time it takes to > execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), > thus opening up 20 new slots in the executor, which are then consumed by > non-delayed threads 100-119, and so on. So, although 80 threads have had a > delay applied, the non-delayed threads rush through in the 20 non-delay slots. > This problem gets even worse when the dispatcher threadpool size is less than > the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no > threads ever have a delay applied_, and the feature is not enabled at all. > This problem wasn't surfaced in the original JIRA because the test > incorrectly measured the period across which {{getBlocks}} RPCs were > distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} > were used to track the time over which the {{getBlocks}} calls were made. > However, {{startGetBlocksTime}} was initialized at the time of creation of > the {{FSNamesystem}} spy, which is before the mock DataNodes are started. 
Even > worse, the Balancer in this test takes 2 iterations to complete balancing the > cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} > actually represents: > {code} > (time to submit getBlocks RPCs) + (DataNode startup time) + (time for the > Dispatcher to complete an iteration of moving blocks) > {code} > Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen > during the period of initial block fetching.
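The mis-applied delay window described in this report can be made concrete with a small model. The sketch below is an illustration of the behavior as analyzed above, not the actual Balancer/Dispatcher code; the method name and the one-second-per-wave formula are assumptions, and only the windowing gap is the point:

```java
// Illustrative model of the getBlocks delay assignment analyzed above.
// NOT the actual Balancer/Dispatcher code: names and the 1s-per-wave
// formula are assumptions made for demonstration.
public class GetBlocksDelayModel {
    static final int MAX_GETBLOCKS_QPS = 20; // hardcoded limit cited in the report

    /** Delay in seconds applied to the i-th task submitted to the executor. */
    static int delaySeconds(int taskIndex, int threadPoolSize) {
        // Only tasks in [MAX_GETBLOCKS_QPS, threadPoolSize) are delayed;
        // tasks 0-19 and tasks >= threadPoolSize proceed immediately,
        // which is exactly the gap the report describes.
        if (taskIndex < MAX_GETBLOCKS_QPS || taskIndex >= threadPoolSize) {
            return 0;
        }
        return taskIndex / MAX_GETBLOCKS_QPS; // one extra second per "wave"
    }

    public static void main(String[] args) {
        // Default pool of 100: tasks 0-19 undelayed, 20-99 delayed, 100+ undelayed.
        System.out.println(delaySeconds(0, 100));   // 0
        System.out.println(delaySeconds(50, 100));  // 2
        System.out.println(delaySeconds(100, 100)); // 0
        // Pool of 10, smaller than the QPS limit: no task is ever delayed.
        System.out.println(delaySeconds(5, 10));    // 0
    }
}
```

With a pool of 10 every task index is either below 20 or at/above the pool size, so the delay branch is unreachable, matching the "feature is not enabled at all" observation.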
[jira] [Resolved] (HDDS-2104) Refactor OMFailoverProxyProvider#loadOMClientConfigs
[ https://issues.apache.org/jira/browse/HDDS-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siyao Meng resolved HDDS-2104. -- Resolution: Fixed > Refactor OMFailoverProxyProvider#loadOMClientConfigs > > > Key: HDDS-2104 > URL: https://issues.apache.org/jira/browse/HDDS-2104 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task > Reporter: Siyao Meng > Assignee: Siyao Meng > Priority: Major > > Ref: https://github.com/apache/hadoop/pull/1360#discussion_r321586979 > Now that we decide to use client-side configuration for OM HA, some logic in > OMFailoverProxyProvider#loadOMClientConfigs becomes redundant. > The work will begin after HDDS-2007 is committed.
[jira] [Updated] (HDFS-14967) TestWebHDFS - Many test cases are failing in Windows
[ https://issues.apache.org/jira/browse/HDFS-14967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-14967: Status: Patch Available (was: Open) > TestWebHDFS - Many test cases are failing in Windows > - > > Key: HDFS-14967 > URL: https://issues.apache.org/jira/browse/HDFS-14967 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: Renukaprasad C > Assignee: Renukaprasad C > Priority: Major > Attachments: HDFS-14967.001.patch > > > In the TestWebHDFS test class, a few test cases do not close the MiniDFSCluster, > which causes the remaining tests to fail on Windows. While the cluster is left open, > all subsequent test cases fail to acquire the lock on the data dir, which makes them fail.
[jira] [Work logged] (HDDS-2454) Improve OM HA robot tests
[ https://issues.apache.org/jira/browse/HDDS-2454?focusedWorklogId=340829=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340829 ] ASF GitHub Bot logged work on HDDS-2454: Author: ASF GitHub Bot Created on: 09/Nov/19 00:05 Start Date: 09/Nov/19 00:05 Worklog Time Spent: 10m Work Description: hanishakoneru commented on pull request #136: HDDS-2454. Improve OM HA robot tests. URL: https://github.com/apache/hadoop-ozone/pull/136 ## What changes were proposed in this pull request? In one CI run, testOMHA.robot failed because robot framework SSH commands failed. This Jira aims to verify that the command execution succeeds. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2454 ## How was this patch tested? acceptance test - smoketest/omha/testOMHA.robot This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340829) Remaining Estimate: 0h Time Spent: 10m > Improve OM HA robot tests > - > > Key: HDDS-2454 > URL: https://issues.apache.org/jira/browse/HDDS-2454 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Reporter: Hanisha Koneru > Assignee: Hanisha Koneru > Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In one CI run, testOMHA.robot failed because robot framework SSH commands > failed. This Jira aims to verify that the command execution succeeds.
[jira] [Updated] (HDDS-2454) Improve OM HA robot tests
[ https://issues.apache.org/jira/browse/HDDS-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2454: - Labels: pull-request-available (was: ) > Improve OM HA robot tests > - > > Key: HDDS-2454 > URL: https://issues.apache.org/jira/browse/HDDS-2454 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Reporter: Hanisha Koneru > Assignee: Hanisha Koneru > Priority: Major > Labels: pull-request-available > > In one CI run, testOMHA.robot failed because robot framework SSH commands > failed. This Jira aims to verify that the command execution succeeds.
[jira] [Updated] (HDDS-2454) Improve OM HA robot tests
[ https://issues.apache.org/jira/browse/HDDS-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanisha Koneru updated HDDS-2454: - Status: Patch Available (was: Open) > Improve OM HA robot tests > - > > Key: HDDS-2454 > URL: https://issues.apache.org/jira/browse/HDDS-2454 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Reporter: Hanisha Koneru > Assignee: Hanisha Koneru > Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In one CI run, testOMHA.robot failed because robot framework SSH commands > failed. This Jira aims to verify that the command execution succeeds.
[jira] [Updated] (HDDS-2454) Improve OM HA robot tests
[ https://issues.apache.org/jira/browse/HDDS-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hanisha Koneru updated HDDS-2454: - Issue Type: Improvement (was: Bug) > Improve OM HA robot tests > - > > Key: HDDS-2454 > URL: https://issues.apache.org/jira/browse/HDDS-2454 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru >Priority: Major > > In one CI run, testOMHA.robot failed because robot framework SSH commands > failed. This Jira aims to verify that the command execution succeeds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2454) Improve OM HA robot tests
Hanisha Koneru created HDDS-2454: Summary: Improve OM HA robot tests Key: HDDS-2454 URL: https://issues.apache.org/jira/browse/HDDS-2454 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Hanisha Koneru Assignee: Hanisha Koneru In one CI run, testOMHA.robot failed because robot framework SSH commands failed. This Jira aims to verify that the command execution succeeds. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970633#comment-16970633 ] Hadoop QA commented on HDFS-14973: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 50s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 13s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 53s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 36s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 17s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 50s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 825 unchanged - 1 fixed = 826 total (was 826) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 1s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 40s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}108m 45s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}172m 1s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.tools.TestHdfsConfigFields | | | hadoop.hdfs.TestReconstructStripedFile | | | hadoop.hdfs.server.namenode.TestFSImage | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14973 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985394/HDFS-14973.000.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux a91b493a2d4d 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 42fc888 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28282/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28282/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results |
[jira] [Commented] (HDFS-14928) UI: unifying the WebUI across different components.
[ https://issues.apache.org/jira/browse/HDFS-14928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970610#comment-16970610 ] Íñigo Goiri commented on HDFS-14928: [^HDFS-14928.004.patch] LGTM. +1 > UI: unifying the WebUI across different components. > --- > > Key: HDFS-14928 > URL: https://issues.apache.org/jira/browse/HDFS-14928 > Project: Hadoop HDFS > Issue Type: Improvement > Components: ui >Reporter: Xieming Li >Assignee: Xieming Li >Priority: Trivial > Attachments: DN_orig.png, DN_with_legend.png.png, DN_wo_legend.png, > HDFS-14892-2.jpg, HDFS-14928.001.patch, HDFS-14928.002.patch, > HDFS-14928.003.patch, HDFS-14928.004.patch, HDFS-14928.jpg, NN_orig.png, > NN_with_legend.png, NN_wo_legend.png, RBF_orig.png, RBF_with_legend.png, > RBF_wo_legend.png > > > The WebUI of different components could be unified. > *Router:* > |Current| !RBF_orig.png|width=500! | > |Proposed 1 (With Icon) | !RBF_wo_legend.png|width=500! | > |Proposed 2 (With Icon and Legend)|!RBF_with_legend.png|width=500! | > *NameNode:* > |Current| !NN_orig.png|width=500! | > |Proposed 1 (With Icon) | !NN_wo_legend.png|width=500! | > |Proposed 2 (With Icon and Legend)| !NN_with_legend.png|width=500! | > *DataNode:* > |Current| !DN_orig.png|width=500! | > |Proposed 1 (With Icon) | !DN_wo_legend.png|width=500! | > |Proposed 2 (With Icon and Legend)| !DN_with_legend.png.png|width=500! | -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14967) TestWebHDFS - Many test cases are failing in Windows
[ https://issues.apache.org/jira/browse/HDFS-14967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970603#comment-16970603 ] Ayush Saxena commented on HDFS-14967: - Thanx [~prasad-acit] for the patch, On a quick look LGTM, Will have a check once more tomorrow if the JENKINS stays clean, since there is too much change due to indentation. > TestWebHDFS - Many test cases are failing in Windows > - > > Key: HDFS-14967 > URL: https://issues.apache.org/jira/browse/HDFS-14967 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Major > Attachments: HDFS-14967.001.patch > > > In TestWebHDFS test class, few test cases are not closing the MiniDFSCluster, > which results in remaining test failures in Windows. Once cluster status is > open, all consecutive test cases fail to get the lock on Data dir which > results in test case failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
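[Editor's note] The failure mode described above — one test leaks its MiniDFSCluster, and every later test "fails to get the lock on Data dir" — comes from the lock file MiniDFSCluster holds on its storage directory until {{cluster.shutdown()}} runs. The sketch below reproduces that mechanism with plain NIO file locks so it is self-contained; the class and method names are illustrative stand-ins, not Hadoop's API (the actual fix in the patch is simply to shut the cluster down in a finally/teardown block).

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Stand-in for MiniDFSCluster's in_use.lock behavior: "starting a cluster"
// acquires an exclusive lock on the data dir; a test that never "shuts down"
// leaves the lock held, so the next test's cluster cannot start.
final class DataDirLockDemo {

    /** "Start a cluster": try to take the data dir lock; null means held. */
    static FileLock startCluster(Path lockFile) throws IOException {
        FileChannel ch = FileChannel.open(lockFile, StandardOpenOption.WRITE);
        try {
            FileLock lock = ch.tryLock();
            if (lock == null) {
                ch.close();            // lock held by another process
            }
            return lock;
        } catch (OverlappingFileLockException e) {
            ch.close();                // lock held within this same JVM
            return null;
        }
    }

    /** The fix, in spirit: always release the lock in teardown. */
    static void shutdownCluster(FileLock lock) throws IOException {
        if (lock != null) {
            FileChannel ch = lock.channel();
            lock.release();
            ch.close();
        }
    }
}
```

A leaked lock blocks the next acquisition until it is released — which is exactly why the patch routes every test through a shutdown path.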
[jira] [Commented] (HDFS-14967) TestWebHDFS - Many test cases are failing in Windows
[ https://issues.apache.org/jira/browse/HDFS-14967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970595#comment-16970595 ] Renukaprasad C commented on HDFS-14967: --- Thanks [~ayushtkn], I agree with later solution. Please review the patch as per solution 2. > TestWebHDFS - Many test cases are failing in Windows > - > > Key: HDFS-14967 > URL: https://issues.apache.org/jira/browse/HDFS-14967 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Major > Attachments: HDFS-14967.001.patch > > > In TestWebHDFS test class, few test cases are not closing the MiniDFSCluster, > which results in remaining test failures in Windows. Once cluster status is > open, all consecutive test cases fail to get the lock on Data dir which > results in test case failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14967) TestWebHDFS - Many test cases are failing in Windows
[ https://issues.apache.org/jira/browse/HDFS-14967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renukaprasad C updated HDFS-14967: -- Attachment: HDFS-14967.001.patch > TestWebHDFS - Many test cases are failing in Windows > - > > Key: HDFS-14967 > URL: https://issues.apache.org/jira/browse/HDFS-14967 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Renukaprasad C >Assignee: Renukaprasad C >Priority: Major > Attachments: HDFS-14967.001.patch > > > In TestWebHDFS test class, few test cases are not closing the MiniDFSCluster, > which results in remaining test failures in Windows. Once cluster status is > open, all consecutive test cases fail to get the lock on Data dir which > results in test case failure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2451) Use lazy string evaluation in preconditions
[ https://issues.apache.org/jira/browse/HDDS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2451: --- Status: Patch Available (was: Open) > Use lazy string evaluation in preconditions > --- > > Key: HDDS-2451 > URL: https://issues.apache.org/jira/browse/HDDS-2451 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Major > Labels: performance, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Avoid eagerly evaluating error messages of preconditions (similarly to > HDDS-2318, but there may be other occurrences of the same issue). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated HDFS-14973: --- Description: In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since {{getBlocks}} can be very expensive and the Balancer should not impact normal cluster operation. Unfortunately, this functionality does not function as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads that are submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, but *not* to the first R threads, where R is the number of allowed {{getBlocks}} QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, 20-99 have increased levels of delay, and 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force the dispatcher executor's threads to all be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired. However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delay threads rush through in the 20 non-delay slots. This problem gets even worse when the dispatcher threadpool size is less than the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no threads ever have a delay applied_, and the feature is not enabled at all. 
This problem wasn't surfaced in the original JIRA because the test incorrectly measured the period across which {{getBlocks}} RPCs were distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} were used to track the time over which the {{getBlocks}} calls were made. However, {{startGetBlocksTime}} was initialized at the time of creation of the {{FSNamesystem}} spy, which is before the mock DataNodes are started. Even worse, the Balancer in this test takes 2 iterations to complete balancing the cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} actually represents: {code} 2 * (time to submit getBlocks RPCs) + (DataNode startup time) + 2 * (time for the Dispatcher to complete an iteration of moving blocks) {code} Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen during the period of initial block fetching. was: In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since {{getBlocks}} can be very expensive and the Balancer should not impact normal cluster operation. Unfortunately, this functionality does not function as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads that are submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, but *not* to the first R threads, where R is the number of allowed {{getBlocks}} QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, 20-99 have increased levels of delay, and 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force the dispatcher executor's threads to all be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired.
However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delay threads rush through in the 20 non-delay slots. This problem gets even worse when the dispatcher threadpool size is less than the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no threads ever have a delay applied_, and the feature is not enabled at all. This problem wasn't surfaced in the original JIRA because the test incorrectly measured the period across which {{getBlocks}} RPCs were distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} were used to track the time over which the {{getBlocks}} calls were made. However, {{startGetBlocksTime}} was initialized at the time of creation of the {{FSNameystem}} spy, which is before the mock DataNodes are started. Even worse, the Balancer in this test
[jira] [Work logged] (HDDS-2451) Use lazy string evaluation in preconditions
[ https://issues.apache.org/jira/browse/HDDS-2451?focusedWorklogId=340737=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340737 ] ASF GitHub Bot logged work on HDDS-2451: Author: ASF GitHub Bot Created on: 08/Nov/19 20:38 Start Date: 08/Nov/19 20:38 Worklog Time Spent: 10m Work Description: adoroszlai commented on pull request #135: HDDS-2451. Use lazy string evaluation in preconditions URL: https://github.com/apache/hadoop-ozone/pull/135 ## What changes were proposed in this pull request? Use the version of `Preconditions.check...` that accepts `errorMessageTemplate` and `errorMessageArgs`. There are occurrences of the `errorMessage` version left, but they do not seem to be important: 1. use constant message, or 2. are infrequently used (eg. one-time init in `MetadataKeyFilters`), or 3. only append a plain `long` (container ID) to the message. https://issues.apache.org/jira/browse/HDDS-2451 ## How was this patch tested? Ran related unit tests and checkstyle. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340737) Remaining Estimate: 0h Time Spent: 10m > Use lazy string evaluation in preconditions > --- > > Key: HDDS-2451 > URL: https://issues.apache.org/jira/browse/HDDS-2451 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Major > Labels: performance, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Avoid eagerly evaluating error messages of preconditions (similarly to > HDDS-2318, but there may be other occurrences of the same issue). 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
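[Editor's note] The PR above contrasts Guava's two {{Preconditions}} overloads: the single-{{String}} form forces the caller to build the message on every call, while the {{errorMessageTemplate}}/{{errorMessageArgs}} form only formats when the check fails. A self-contained sketch of that difference (this is a stand-in mimicking Guava's behavior, not Guava itself):

```java
// Mimics Preconditions.checkState(boolean, String template, Object... args):
// the message template is only expanded when the check actually fails.
final class LazyPreconditionDemo {
    static int formatCount = 0;   // how often we pay for message formatting

    static void checkState(boolean expr, String template, Object... args) {
        if (!expr) {
            throw new IllegalStateException(format(template, args));
        }
    }

    static String format(String template, Object... args) {
        formatCount++;
        String s = template;
        for (Object a : args) {
            s = s.replaceFirst("%s", String.valueOf(a));
        }
        return s;
    }

    static void eager(long containerId) {
        // Single-String style: the message is built even though expr is true.
        checkState(true, format("missing container %s", containerId));
    }

    static void lazy(long containerId) {
        // Template style: nothing is formatted on the success path.
        checkState(true, "missing container %s", containerId);
    }
}
```

On a hot path the lazy form turns per-call string construction into zero work when the precondition holds, which is the whole point of the patch.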
[jira] [Updated] (HDDS-2451) Use lazy string evaluation in preconditions
[ https://issues.apache.org/jira/browse/HDDS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2451: - Labels: performance pull-request-available (was: performance) > Use lazy string evaluation in preconditions > --- > > Key: HDDS-2451 > URL: https://issues.apache.org/jira/browse/HDDS-2451 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Major > Labels: performance, pull-request-available > > Avoid eagerly evaluating error messages of preconditions (similarly to > HDDS-2318, but there may be other occurrences of the same issue). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2453) Add Freon tests for S3Bucket/MPU Keys
Bharat Viswanadham created HDDS-2453: Summary: Add Freon tests for S3Bucket/MPU Keys Key: HDDS-2453 URL: https://issues.apache.org/jira/browse/HDDS-2453 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Bharat Viswanadham This Jira is to create freon tests for # S3Bucket creation. # S3 MPU Key uploads. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDDS-2453) Add Freon tests for S3Bucket/MPU Keys
[ https://issues.apache.org/jira/browse/HDDS-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharat Viswanadham reassigned HDDS-2453: Assignee: Bharat Viswanadham > Add Freon tests for S3Bucket/MPU Keys > - > > Key: HDDS-2453 > URL: https://issues.apache.org/jira/browse/HDDS-2453 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Bharat Viswanadham >Assignee: Bharat Viswanadham >Priority: Major > > This Jira is to create freon tests for > # S3Bucket creation. > # S3 MPU Key uploads. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970560#comment-16970560 ] Erik Krogen commented on HDFS-14973: Attached v000 patch with a fix. Conveniently, Guava has a [{{RateLimiter}}|https://guava.dev/releases/19.0/api/docs/index.html?com/google/common/util/concurrent/RateLimiter.html] class which does exactly what we need. The changes were minimal. > Balancer getBlocks RPC dispersal does not function properly > --- > > Key: HDFS-14973 > URL: https://issues.apache.org/jira/browse/HDFS-14973 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer mover >Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Attachments: HDFS-14973.000.patch, HDFS-14973.test.patch > > > In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls > issued by the Balancer/Mover more dispersed, to alleviate load on the > NameNode, since {{getBlocks}} can be very expensive and the Balancer should > not impact normal cluster operation. > Unfortunately, this functionality does not function as expected, especially > when the dispatcher thread count is low. The primary issue is that the delay > is applied only to the first N threads that are submitted to the dispatcher's > executor, where N is the size of the dispatcher's threadpool, but *not* to > the first R threads, where R is the number of allowed {{getBlocks}} QPS > (currently hardcoded to 20). For example, if the threadpool size is 100 (the > default), threads 0-19 have no delay, 20-99 have increased levels of delay, > and 100+ have no delay. As I understand it, the intent of the logic was that > the delay applied to the first 100 threads would force the dispatcher > executor's threads to all be consumed, thus blocking subsequent (non-delayed) > threads until the delay period has expired. 
However, threads 0-19 can finish > very quickly (their work can often be fulfilled in the time it takes to > execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), > thus opening up 20 new slots in the executor, which are then consumed by > non-delayed threads 100-119, and so on. So, although 80 threads have had a > delay applied, the non-delay threads rush through in the 20 non-delay slots. > This problem gets even worse when the dispatcher threadpool size is less than > the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no > threads ever have a delay applied_, and the feature is not enabled at all. > This problem wasn't surfaced in the original JIRA because the test > incorrectly measured the period across which {{getBlocks}} RPCs were > distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} > were used to track the time over which the {{getBlocks}} calls were made. > However, {{startGetBlocksTime}} was initialized at the time of creation of > the {{FSNameystem}} spy, which is before the mock DataNodes are started. Even > worse, the Balancer in this test takes 2 iterations to complete balancing the > cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} > actually represents: > {code} > 2 * (time to submit getBlocks RPCs) + (DataNode startup time) + 2 * (time for > the Dispatcher to complete an iteration of moving blocks) > {code} > Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen > during the period of initial block fetching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
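[Editor's note] The comment above points at Guava's {{RateLimiter}} ({{RateLimiter.create(20)}}, then {{acquire()}} before each {{getBlocks}} RPC). Since Guava is an external dependency, here is a self-contained stand-in showing the core idea the fix relies on: every {{acquire()}} reserves the next free 1/qps time slot and sleeps until it, so the cap holds regardless of how many dispatcher threads exist.

```java
// Minimal stand-in for Guava's RateLimiter (sketch only; Guava additionally
// releases its internal lock while sleeping and supports burst permits).
final class SimpleRateLimiter {
    private final long intervalNanos;   // 1 / permitsPerSecond
    private long nextFreeNanos;         // when the next permit becomes free

    SimpleRateLimiter(double permitsPerSecond) {
        this.intervalNanos = (long) (1e9 / permitsPerSecond);
        this.nextFreeNanos = System.nanoTime();
    }

    synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = nextFreeNanos - now;
        // Reserve the next slot before sleeping, so concurrent callers
        // each get a distinct slot spaced intervalNanos apart.
        nextFreeNanos = Math.max(nextFreeNanos, now) + intervalNanos;
        if (waitNanos > 0) {
            Thread.sleep(waitNanos / 1_000_000, (int) (waitNanos % 1_000_000));
        }
    }
}
```

Unlike the original stagger-the-first-N-threads scheme, this throttles at the call site, so the threadpool size (100, 10, or anything else) no longer matters.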
[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated HDFS-14973: --- Attachment: HDFS-14973.000.patch > Balancer getBlocks RPC dispersal does not function properly > --- > > Key: HDFS-14973 > URL: https://issues.apache.org/jira/browse/HDFS-14973 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer mover >Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Attachments: HDFS-14973.000.patch, HDFS-14973.test.patch > > > In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls > issued by the Balancer/Mover more dispersed, to alleviate load on the > NameNode, since {{getBlocks}} can be very expensive and the Balancer should > not impact normal cluster operation. > Unfortunately, this functionality does not function as expected, especially > when the dispatcher thread count is low. The primary issue is that the delay > is applied only to the first N threads that are submitted to the dispatcher's > executor, where N is the size of the dispatcher's threadpool, but *not* to > the first R threads, where R is the number of allowed {{getBlocks}} QPS > (currently hardcoded to 20). For example, if the threadpool size is 100 (the > default), threads 0-19 have no delay, 20-99 have increased levels of delay, > and 100+ have no delay. As I understand it, the intent of the logic was that > the delay applied to the first 100 threads would force the dispatcher > executor's threads to all be consumed, thus blocking subsequent (non-delayed) > threads until the delay period has expired. However, threads 0-19 can finish > very quickly (their work can often be fulfilled in the time it takes to > execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), > thus opening up 20 new slots in the executor, which are then consumed by > non-delayed threads 100-119, and so on. 
So, although 80 threads have had a > delay applied, the non-delay threads rush through in the 20 non-delay slots. > This problem gets even worse when the dispatcher threadpool size is less than > the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no > threads ever have a delay applied_, and the feature is not enabled at all. > This problem wasn't surfaced in the original JIRA because the test > incorrectly measured the period across which {{getBlocks}} RPCs were > distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} > were used to track the time over which the {{getBlocks}} calls were made. > However, {{startGetBlocksTime}} was initialized at the time of creation of > the {{FSNameystem}} spy, which is before the mock DataNodes are started. Even > worse, the Balancer in this test takes 2 iterations to complete balancing the > cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} > actually represents: > {code} > 2 * (time to submit getBlocks RPCs) + (DataNode startup time) + 2 * (time for > the Dispatcher to complete an iteration of moving blocks) > {code} > Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen > during the period of initial block fetching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated HDFS-14973: --- Status: Patch Available (was: Open) > Balancer getBlocks RPC dispersal does not function properly > --- > > Key: HDFS-14973 > URL: https://issues.apache.org/jira/browse/HDFS-14973 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer mover >Affects Versions: 3.0.0, 2.8.2, 2.7.4, 2.9.0 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Attachments: HDFS-14973.000.patch, HDFS-14973.test.patch > > > In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls > issued by the Balancer/Mover more dispersed, to alleviate load on the > NameNode, since {{getBlocks}} can be very expensive and the Balancer should > not impact normal cluster operation. > Unfortunately, this functionality does not function as expected, especially > when the dispatcher thread count is low. The primary issue is that the delay > is applied only to the first N threads that are submitted to the dispatcher's > executor, where N is the size of the dispatcher's threadpool, but *not* to > the first R threads, where R is the number of allowed {{getBlocks}} QPS > (currently hardcoded to 20). For example, if the threadpool size is 100 (the > default), threads 0-19 have no delay, 20-99 have increased levels of delay, > and 100+ have no delay. As I understand it, the intent of the logic was that > the delay applied to the first 100 threads would force the dispatcher > executor's threads to all be consumed, thus blocking subsequent (non-delayed) > threads until the delay period has expired. However, threads 0-19 can finish > very quickly (their work can often be fulfilled in the time it takes to > execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), > thus opening up 20 new slots in the executor, which are then consumed by > non-delayed threads 100-119, and so on. 
So, although 80 threads have had a > delay applied, the non-delay threads rush through in the 20 non-delay slots. > This problem gets even worse when the dispatcher threadpool size is less than > the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no > threads ever have a delay applied_, and the feature is not enabled at all. > This problem wasn't surfaced in the original JIRA because the test > incorrectly measured the period across which {{getBlocks}} RPCs were > distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} > were used to track the time over which the {{getBlocks}} calls were made. > However, {{startGetBlocksTime}} was initialized at the time of creation of > the {{FSNameystem}} spy, which is before the mock DataNodes are started. Even > worse, the Balancer in this test takes 2 iterations to complete balancing the > cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} > actually represents: > {code} > 2 * (time to submit getBlocks RPCs) + (DataNode startup time) + 2 * (time for > the Dispatcher to complete an iteration of moving blocks) > {code} > Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen > during the period of initial block fetching. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDDS-2410) Ozoneperf docker cluster should use privileged containers
[ https://issues.apache.org/jira/browse/HDDS-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharat Viswanadham resolved HDDS-2410. -- Fix Version/s: 0.5.0 Resolution: Fixed > Ozoneperf docker cluster should use privileged containers > - > > Key: HDDS-2410 > URL: https://issues.apache.org/jira/browse/HDDS-2410 > Project: Hadoop Distributed Data Store > Issue Type: Task >Reporter: Marton Elek >Assignee: Marton Elek >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 20m > Remaining Estimate: 0h > > The profiler > [servlet|https://github.com/elek/hadoop-ozone/blob/master/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/ProfileServlet.java] > (which helps to run java profiler in the background and publishes the result > on the web interface) requires privileged docker containers. > > This flag is missing from the ozoneperf docker-compose cluster (which is > designed to run performance tests). > >
[jira] [Work logged] (HDDS-2410) Ozoneperf docker cluster should use privileged containers
[ https://issues.apache.org/jira/browse/HDDS-2410?focusedWorklogId=340716=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340716 ] ASF GitHub Bot logged work on HDDS-2410: Author: ASF GitHub Bot Created on: 08/Nov/19 19:49 Start Date: 08/Nov/19 19:49 Worklog Time Spent: 10m Work Description: bharatviswa504 commented on pull request #124: HDDS-2410. Ozoneperf docker cluster should use privileged containers URL: https://github.com/apache/hadoop-ozone/pull/124 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340716) Time Spent: 20m (was: 10m)
[jira] [Updated] (HDDS-2452) Wrong condition for re-scheduling in ReportPublisher
[ https://issues.apache.org/jira/browse/HDDS-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2452: --- Description: It seems the condition for scheduling next run of {{ReportPublisher}} is wrong: {code:title=https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/report/ReportPublisher.java#L74-L76} if (!executor.isShutdown() || !(context.getState() == DatanodeStates.SHUTDOWN)) { executor.schedule(this, {code} Given the condition above, the task may be scheduled again if the executor is shutdown, but the state machine is not set to shutdown (or vice versa). I think the condition should have an {{&&}}, not {{||}}. (Currently it is unlikely to happen, since [context state is set to shutdown before the report executor|https://github.com/apache/hadoop-ozone/blob/f928a0bdb4ea2e5195da39256c6dda9f1c855649/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L392-L393].) [~nanda], can you please confirm if this is a typo or intentional? was: It seems the condition for scheduling next run of {{ReportPublisher}} is wrong: {code:title=https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/report/ReportPublisher.java#L74-L76} if (!executor.isShutdown() || !(context.getState() == DatanodeStates.SHUTDOWN)) { executor.schedule(this, {code} Given the condition above, the task may be scheduled again if the executor is shutdown, but the state machine is not set to shutdown (or vice versa). 
(Currently it is unlikely to happen, since [context state is set to shutdown before the report executor|https://github.com/apache/hadoop-ozone/blob/f928a0bdb4ea2e5195da39256c6dda9f1c855649/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L392-L393].) [~nanda], can you please confirm if this is a typo or intentional?
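The difference between the two operators can be sketched as a truth table with simplified booleans (hypothetical method names, not the actual ReportPublisher fields):

```java
// Simplified model of the re-scheduling guard in HDDS-2452. With ||, the task
// is re-scheduled whenever at least one of the two shutdown signals is still
// false; with &&, only while both are false, which is the intended
// "keep running" condition.
final class RescheduleGuardSketch {
    static boolean reschedulesWithOr(boolean executorShutdown, boolean stateShutdown) {
        return !executorShutdown || !stateShutdown;
    }

    static boolean reschedulesWithAnd(boolean executorShutdown, boolean stateShutdown) {
        return !executorShutdown && !stateShutdown;
    }

    public static void main(String[] args) {
        // Executor already shut down, state machine not yet:
        System.out.println(reschedulesWithOr(true, false));  // true  (buggy: still reschedules)
        System.out.println(reschedulesWithAnd(true, false)); // false (intended: stops)
    }
}
```

By De Morgan's law, `!a && !b` is `!(a || b)`: stop re-scheduling as soon as either shutdown signal is set.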
[jira] [Created] (HDDS-2452) Wrong condition for re-scheduling in ReportPublisher
Attila Doroszlai created HDDS-2452: -- Summary: Wrong condition for re-scheduling in ReportPublisher Key: HDDS-2452 URL: https://issues.apache.org/jira/browse/HDDS-2452 Project: Hadoop Distributed Data Store Issue Type: Bug Components: Ozone Datanode Reporter: Attila Doroszlai It seems the condition for scheduling next run of {{ReportPublisher}} is wrong: {code:title=https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/report/ReportPublisher.java#L74-L76} if (!executor.isShutdown() || !(context.getState() == DatanodeStates.SHUTDOWN)) { executor.schedule(this, {code} Given the condition above, the task may be scheduled again if the executor is shutdown, but the state machine is not set to shutdown (or vice versa). (Currently it is unlikely to happen, since [context state is set to shutdown before the report executor|https://github.com/apache/hadoop-ozone/blob/f928a0bdb4ea2e5195da39256c6dda9f1c855649/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/statemachine/DatanodeStateMachine.java#L392-L393].) [~nanda], can you please confirm if this is a typo or intentional?
[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970530#comment-16970530 ] Hadoop QA commented on HDFS-12288: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 25s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 3 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 24m 12s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 20s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 27s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 51s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 39s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 25s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 51s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 47s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 25s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}105m 48s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}181m 1s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.TestStripedFileAppend | | | hadoop.hdfs.server.sps.TestExternalStoragePolicySatisfier | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-12288 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985375/HDFS-12288.007.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 90334629317b 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 42fc888 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28281/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28281/testReport/ | | Max. process+thread count | 2704 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output |
[jira] [Commented] (HDFS-14959) [SBNN read] access time should be turned off
[ https://issues.apache.org/jira/browse/HDFS-14959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970525#comment-16970525 ] Hadoop QA commented on HDFS-14959: -- | (/) *{color:green}+1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 24s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} dupname {color} | {color:green} 0m 0s{color} | {color:green} No case conflicting files found. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 22m 38s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 24s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 38m 47s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 43s{color} | {color:green} patch has no errors when building and testing our client artifacts. 
{color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 30s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 59m 52s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/hadoop-multibranch/job/PR-1706/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/1706 | | JIRA Issue | HDFS-14959 | | Optional Tests | dupname asflicense mvnsite | | uname | Linux 66ca89553e90 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | personality/hadoop.sh | | git revision | trunk / 42fc888 | | Max. process+thread count | 307 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://builds.apache.org/job/hadoop-multibranch/job/PR-1706/1/console | | versions | git=2.7.4 maven=3.3.9 | | Powered by | Apache Yetus 0.10.0 http://yetus.apache.org | This message was automatically generated. > [SBNN read] access time should be turned off > > > Key: HDFS-14959 > URL: https://issues.apache.org/jira/browse/HDFS-14959 > Project: Hadoop HDFS > Issue Type: Task > Components: documentation >Reporter: Wei-Chiu Chuang >Assignee: Chao Sun >Priority: Major > > Both Uber and Didi shared that access time has to be switched off to avoid > spiky NameNode RPC process time. If access time is not off entirely, > getBlockLocations RPCs have to update access time and must access the active > NameNode. (that's my understanding. haven't checked the code) > We should record this as a best practice in our doc. 
> (If you are on the ASF slack, check out this thread > https://the-asf.slack.com/archives/CAD7C52Q3/p1572033324008600)
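For reference, access-time updates are governed by the {{dfs.namenode.accesstime.precision}} key, and a value of 0 disables them entirely. A sketch of the hdfs-site.xml entry (check your release's hdfs-default.xml for the exact semantics of the key):

```xml
<!-- hdfs-site.xml: a precision of 0 disables access-time updates, so
     getBlockLocations never needs to mutate the namespace (and, with
     SBN reads, never needs to be redirected to the active NameNode). -->
<property>
  <name>dfs.namenode.accesstime.precision</name>
  <value>0</value>
</property>
```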
[jira] [Created] (HDDS-2451) Use lazy string evaluation in preconditions
Attila Doroszlai created HDDS-2451: -- Summary: Use lazy string evaluation in preconditions Key: HDDS-2451 URL: https://issues.apache.org/jira/browse/HDDS-2451 Project: Hadoop Distributed Data Store Issue Type: Improvement Reporter: Attila Doroszlai Assignee: Attila Doroszlai Avoid eagerly evaluating error messages of preconditions (similarly to HDDS-2318, but there may be other occurrences of the same issue).
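The kind of change HDDS-2451 describes can be sketched in plain Java; Guava's `Preconditions` is mimicked here with hypothetical helpers so the example is self-contained:

```java
import java.util.function.Supplier;

// Eager: the message string is built on every call, even when the check
// passes. Lazy: the Supplier is only invoked on failure, so the cost of
// building the message is never paid on the hot path.
final class LazyPreconditionSketch {
    static int messageBuilds = 0;

    static String expensiveMessage() {
        messageBuilds++; // stands in for costly string concatenation/formatting
        return "detailed diagnostic state: ...";
    }

    static void checkStateEager(boolean ok, String message) {
        if (!ok) throw new IllegalStateException(message);
    }

    static void checkStateLazy(boolean ok, Supplier<String> message) {
        if (!ok) throw new IllegalStateException(message.get());
    }

    public static void main(String[] args) {
        checkStateEager(true, expensiveMessage());       // message built anyway
        checkStateLazy(true, () -> expensiveMessage());  // message never built
        System.out.println(messageBuilds);               // only the eager call paid
    }
}
```

Guava's `checkState(ok, "template %s", arg)` overloads get part of the way there by deferring the formatting, but an argument like `"x=" + x` is still concatenated eagerly at the call site, which is the pattern this issue targets.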
[jira] [Commented] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970500#comment-16970500 ] Erik Krogen commented on HDFS-14973: I have attached [^HDFS-14973.test.patch] which fixes the test to demonstrate that the throttling isn't working as expected: * Adjust the balancing in the test to be performed over only a single iteration, so that the Dispatcher's block move time isn't counted * Adjust the time at which the {{startGetBlocksTime}} is initialized, so that the DataNode startup time isn't counted * Make the number of max {{getBlocks}} RPCs configurable to have better control over the test I will work on putting a fix together. I think we _might_ be able to fix this by simply starting the delay at 1 second instead of 0 seconds, but I don't think it would be very hard to have a more strict throttling mechanism to avoid this entire class of problem, so I'm going to take a stab at that. If it turns out to be too complex, I'll try something simple.
[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated HDFS-14973: --- Attachment: HDFS-14973.test.patch
[jira] [Updated] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated HDFS-14973: --- Description: In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since {{getBlocks}} can be very expensive and the Balancer should not impact normal cluster operation. Unfortunately, this functionality does not function as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads that are submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, but *not* to the first R threads, where R is the number of allowed {{getBlocks}} QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, 20-99 have increased levels of delay, and 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force the dispatcher executor's threads to all be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired. However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delay threads rush through in the 20 non-delay slots. This problem gets even worse when the dispatcher threadpool size is less than the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no threads ever have a delay applied_, and the feature is not enabled at all. 
This problem wasn't surfaced in the original JIRA because the test incorrectly measured the period across which {{getBlocks}} RPCs were distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} were used to track the time over which the {{getBlocks}} calls were made. However, {{startGetBlocksTime}} was initialized at the time of creation of the {{FSNameystem}} spy, which is before the mock DataNodes are started. Even worse, the Balancer in this test takes 2 iterations to complete balancing the cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} actually represents: {code} 2 * (time to submit getBlocks RPCs) + (DataNode startup time) + 2 * (time for the Dispatcher to complete an iteration of moving blocks) {code} Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen during the period of initial block fetching. was: In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since {{getBlocks}} can be very expensive and the Balancer should not impact normal cluster operation. Unfortunately, this functionality does not function as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads that are submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, but *not* to the first R threads, where R is the number of allowed {{getBlocks}} QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, 20-99 have increased levels of delay, and 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force the dispatcher executor's threads to all be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired. 
However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delay threads rush through in the 20 non-delay slots. This problem gets even worse when the dispatcher threadpool size is less than the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no threads ever have a delay applied_, and the feature is not enabled at all. > Balancer getBlocks RPC dispersal does not function properly > --- > > Key: HDFS-14973 > URL: https://issues.apache.org/jira/browse/HDFS-14973 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer mover >Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0 >Reporter: Erik Krogen >Assignee: Erik Krogen >
[jira] [Commented] (HDDS-2274) Avoid buffer copying in Codec
[ https://issues.apache.org/jira/browse/HDDS-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970495#comment-16970495 ] Tsz-wo Sze commented on HDDS-2274: -- You are right. The improvement may not be possible since RocksDB API requires byte[]. Let me think about it more. > Avoid buffer copying in Codec > - > > Key: HDDS-2274 > URL: https://issues.apache.org/jira/browse/HDDS-2274 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Tsz-wo Sze >Assignee: Attila Doroszlai >Priority: Major > > Codec declares byte[] as a parameter in fromPersistedFormat(..) and a return > type in toPersistedFormat(..). It leads to buffer copying when using it with > ByteString.
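The cost under discussion can be illustrated with stdlib types only (protobuf's `ByteString.copyFrom(byte[])` behaves like the copying case): a `byte[]` boundary forces a defensive copy, whereas a `ByteBuffer` view can share the backing array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Copy-vs-view distinction behind HDDS-2274. This is an illustration of the
// general technique, not the actual Codec interface.
final class BufferCopySketch {
    public static void main(String[] args) {
        byte[] persisted = "key-bytes".getBytes(StandardCharsets.UTF_8);

        // Copying: a second array is allocated and filled (what a byte[]-based
        // API forces on each conversion).
        byte[] copy = Arrays.copyOf(persisted, persisted.length);

        // View: ByteBuffer.wrap shares the original backing array, no copy.
        ByteBuffer view = ByteBuffer.wrap(persisted);

        persisted[0] = 'X';
        System.out.println((char) copy[0]);     // copy is unaffected by the mutation
        System.out.println((char) view.get(0)); // view observes the mutation
    }
}
```

The flip side, as the comment notes, is that a zero-copy view only helps if both ends of the boundary accept it; an API that ultimately requires `byte[]` (as RocksDB's does here) forces the copy anyway.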
[jira] [Created] (HDFS-14973) Balancer getBlocks RPC dispersal does not function properly
Erik Krogen created HDFS-14973: -- Summary: Balancer getBlocks RPC dispersal does not function properly Key: HDFS-14973 URL: https://issues.apache.org/jira/browse/HDFS-14973 Project: Hadoop HDFS Issue Type: Bug Components: balancer mover Affects Versions: 3.0.0, 2.8.2, 2.7.4, 2.9.0 Reporter: Erik Krogen Assignee: Erik Krogen In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since {{getBlocks}} can be very expensive and the Balancer should not impact normal cluster operation. Unfortunately, this functionality does not function as expected, especially when the dispatcher thread count is low. The primary issue is that the delay is applied only to the first N threads that are submitted to the dispatcher's executor, where N is the size of the dispatcher's threadpool, but *not* to the first R threads, where R is the number of allowed {{getBlocks}} QPS (currently hardcoded to 20). For example, if the threadpool size is 100 (the default), threads 0-19 have no delay, 20-99 have increased levels of delay, and 100+ have no delay. As I understand it, the intent of the logic was that the delay applied to the first 100 threads would force the dispatcher executor's threads to all be consumed, thus blocking subsequent (non-delayed) threads until the delay period has expired. However, threads 0-19 can finish very quickly (their work can often be fulfilled in the time it takes to execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 new slots in the executor, which are then consumed by non-delayed threads 100-119, and so on. So, although 80 threads have had a delay applied, the non-delay threads rush through in the 20 non-delay slots. This problem gets even worse when the dispatcher threadpool size is less than the max {{getBlocks}} QPS. 
For example, if the threadpool size is 10, _no threads ever have a delay applied_, and the feature is not enabled at all. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
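The scheduling flaw described above can be modeled with a small sketch (a hypothetical simplification, not the actual Balancer Dispatcher code; the method and class names are invented). It reproduces the three behaviors in the report: no delay for the first 20 threads, stepped delay for threads 20-99, no delay for threads at index 100 and beyond, and no delay at all when the pool is smaller than the QPS limit.

```java
/**
 * Simplified model of the HDFS-11384 delay assignment as described in
 * this issue (illustrative only). getDelaySteps returns how many delay
 * periods a submitted thread receives.
 */
class BalancerDelayModel {
  static long getDelaySteps(int threadIndex, int poolSize, int maxGetBlocksPerPeriod) {
    if (threadIndex >= poolSize) {
      return 0; // threads beyond the pool size are never delayed
    }
    // Delay grows by one period per batch of maxGetBlocksPerPeriod threads;
    // the first batch (integer division yields 0) gets no delay at all.
    return threadIndex / maxGetBlocksPerPeriod;
  }
}
```

With poolSize=100 and the hardcoded limit of 20, threads 0-19 get delay 0 and can finish quickly, freeing executor slots for the never-delayed threads 100+; with poolSize=10, every thread falls in the zero-delay first batch, so the feature is effectively disabled.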
[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.
[ https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970490#comment-16970490 ] Hadoop QA commented on HDFS-14720: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 40s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 3s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 47s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 42s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 0s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 27s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 20s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}111m 8s{color} | {color:red} hadoop-hdfs in the patch failed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 45s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} | | {color:black}{color} | {color:black} {color} | {color:black}180m 15s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.server.namenode.ha.TestBootstrapAliasmap | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14720 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12985366/HDFS-14720.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 66a847fbba92 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 42fc888 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28280/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28280/testReport/ | | Max. process+thread count | 2733 (vs. ulimit
[jira] [Issue Comment Deleted] (HDDS-2392) Fix TestScmSafeMode#testSCMSafeModeRestrictedOp
[ https://issues.apache.org/jira/browse/HDDS-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-2392: - Comment: was deleted (was: [~avijayan], thanks for working on this. - In RaftServerMetrics.addPeerCommitIndexGauge, it only needs an id instead of peer. Change the parameter to id, i.e. {code:java} void addPeerCommitIndexGauge(RaftPeerId peerId) { final String followerCommitIndexKey = String.format(LEADER_METRIC_PEER_COMMIT_INDEX, peerId); registry.gauge(followerCommitIndexKey, () -> () -> Optional.ofNullable(commitInfoCache.get(peerId)) .map(CommitInfoProto::getCommitIndex).orElse(0L)); } {code} - Then use server.getId() in RaftServerMetrics constructor and don't change LeaderState.) > Fix TestScmSafeMode#testSCMSafeModeRestrictedOp > --- > > Key: HDDS-2392 > URL: https://issues.apache.org/jira/browse/HDDS-2392 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru >Priority: Blocker > > After ratis upgrade (HDDS-2340), TestScmSafeMode#testSCMSafeModeRestrictedOp > fails as the DNs fail to restart XceiverServerRatis. 
> RaftServer#start() fails with following exception: > {code:java} > java.io.IOException: java.lang.IllegalStateException: Not started > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54) > at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61) > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70) > at > org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284) > at > org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:421) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215) > at > org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110) > at > org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalStateException: Not started > at > org.apache.ratis.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:504) > at > org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.getPort(ServerImpl.java:176) > at > org.apache.ratis.grpc.server.GrpcService.lambda$new$2(GrpcService.java:143) > at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62) > at > org.apache.ratis.grpc.server.GrpcService.getInetSocketAddress(GrpcService.java:182) > at > org.apache.ratis.server.impl.RaftServerImpl.lambda$new$0(RaftServerImpl.java:84) > at 
org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62) > at > org.apache.ratis.server.impl.RaftServerImpl.getPeer(RaftServerImpl.java:136) > at > org.apache.ratis.server.impl.RaftServerMetrics.<init>(RaftServerMetrics.java:70) > at > org.apache.ratis.server.impl.RaftServerMetrics.getRaftServerMetrics(RaftServerMetrics.java:62) > at > org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:119) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2392) Fix TestScmSafeMode#testSCMSafeModeRestrictedOp
[ https://issues.apache.org/jira/browse/HDDS-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970445#comment-16970445 ] Tsz-wo Sze commented on HDDS-2392: -- [~avijayan], thanks for working on this. - In RaftServerMetrics.addPeerCommitIndexGauge, it only needs an id instead of peer. Change the parameter to id, i.e. {code:java} void addPeerCommitIndexGauge(RaftPeerId peerId) { final String followerCommitIndexKey = String.format(LEADER_METRIC_PEER_COMMIT_INDEX, peerId); registry.gauge(followerCommitIndexKey, () -> () -> Optional.ofNullable(commitInfoCache.get(peerId)) .map(CommitInfoProto::getCommitIndex).orElse(0L)); } {code} - Then use server.getId() in RaftServerMetrics constructor and don't change LeaderState. > Fix TestScmSafeMode#testSCMSafeModeRestrictedOp > --- > > Key: HDDS-2392 > URL: https://issues.apache.org/jira/browse/HDDS-2392 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Hanisha Koneru >Assignee: Hanisha Koneru >Priority: Blocker > > After ratis upgrade (HDDS-2340), TestScmSafeMode#testSCMSafeModeRestrictedOp > fails as the DNs fail to restart XceiverServerRatis. 
> RaftServer#start() fails with following exception: > {code:java} > java.io.IOException: java.lang.IllegalStateException: Not started > at org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54) > at org.apache.ratis.util.IOUtils.toIOException(IOUtils.java:61) > at org.apache.ratis.util.IOUtils.getFromFuture(IOUtils.java:70) > at > org.apache.ratis.server.impl.RaftServerProxy.getImpls(RaftServerProxy.java:284) > at > org.apache.ratis.server.impl.RaftServerProxy.start(RaftServerProxy.java:296) > at > org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.start(XceiverServerRatis.java:421) > at > org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.start(OzoneContainer.java:215) > at > org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:110) > at > org.apache.hadoop.ozone.container.common.states.endpoint.VersionEndpointTask.call(VersionEndpointTask.java:42) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.IllegalStateException: Not started > at > org.apache.ratis.thirdparty.com.google.common.base.Preconditions.checkState(Preconditions.java:504) > at > org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl.getPort(ServerImpl.java:176) > at > org.apache.ratis.grpc.server.GrpcService.lambda$new$2(GrpcService.java:143) > at org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62) > at > org.apache.ratis.grpc.server.GrpcService.getInetSocketAddress(GrpcService.java:182) > at > org.apache.ratis.server.impl.RaftServerImpl.lambda$new$0(RaftServerImpl.java:84) > at 
org.apache.ratis.util.MemoizedSupplier.get(MemoizedSupplier.java:62) > at > org.apache.ratis.server.impl.RaftServerImpl.getPeer(RaftServerImpl.java:136) > at > org.apache.ratis.server.impl.RaftServerMetrics.<init>(RaftServerMetrics.java:70) > at > org.apache.ratis.server.impl.RaftServerMetrics.getRaftServerMetrics(RaftServerMetrics.java:62) > at > org.apache.ratis.server.impl.RaftServerImpl.<init>(RaftServerImpl.java:119) > at > org.apache.ratis.server.impl.RaftServerProxy.lambda$newRaftServerImpl$2(RaftServerProxy.java:208) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970430#comment-16970430 ] Bharat Viswanadham commented on HDDS-2356: -- Hi [~timmylicheng] As every run, we are seeing the new error and the stack trace and from log not got much information about the root cause. I think to debug this we need to know why for the Multipartupload key is not finding multipart upload or why some times we see InvalidMultipartupload error. We can see audit logs and see what request is passing for Multipartupload requests, and for the same key we can use listParts to know what are the parts OM is having in its MultipartInfoTable(This will help in InvalidPart error). And also I think we should enable trace/debug log to see the incoming requests, and why for Multipart upload we see these errors. (Not sure some bug in Cache logic, or some handling we missed for MPU requests) To debug this we need a complete OM log, audit log, S3gateway log. And also enable trace to see what requests are incoming, I think we log them in OzoneManagerProtocolServerSideTranslatorPB. > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Assignee: Bharat Viswanadham >Priority: Blocker > Fix For: 0.5.0 > > Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, > image-2019-10-31-18-56-56-177.png > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path > on VM0, while reading data from VM0 local disk and write to mount path. 
The > dataset has various sizes of files from 0 byte to GB-level and it has a > number of ~50,000 files. > The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I > look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related with Multipart upload. This error eventually causes the writing to > terminate and OM to be closed. > > Updated on 11/06/2019: > See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs > are in the attachment. > 2019-11-05 18:12:37,766 ERROR > org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest: > MultipartUpload Commit is failed for Key:./2 > 0191012/plc_1570863541668_9278 in Volume/Bucket > s325d55ad283aa400af464c76d713c07ad/ozone-test > NO_SUCH_MULTIPART_UPLOAD_ERROR > org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload > is with specified uploadId fcda8608-b431-48b7-8386- > 0a332f1a709a-103084683261641950 > at > org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1 > 56) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB. 
> java:217) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132) > at > org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100) > at > org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > > Updated on 10/28/2019: > See MISMATCH_MULTIPART_LIST error. > > 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete > Multipart Upload Request for bucket: ozone-test, key: > 20191012/plc_1570863541668_927 > 8 > MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: > Complete Multipart Upload Failed: volume:
[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970430#comment-16970430 ] Bharat Viswanadham edited comment on HDDS-2356 at 11/8/19 5:29 PM: --- Hi [~timmylicheng] As every run, we are seeing the new error and the stack trace and from log not got much information about the root cause. I think to debug this we need to know why for the Multipartupload key is not finding multipart upload or why some times we see InvalidMultipartupload error. We can see audit logs and see what request is passing for Multipartupload requests, and for the same key we can use listParts to know what are the parts OM is having in its MultipartInfoTable(This will help in InvalidPart error). And also I think we should enable trace/debug log to see the incoming requests, and why for Multipart upload we see these errors. (Not sure some bug in Cache logic, or some handling we missed for MPU requests) To debug this we need a complete OM log, audit log, S3gateway log. And also enable trace to see what requests are incoming, I think we log them in OzoneManagerProtocolServerSideTranslatorPB. Let us know if you have any suggestions. was (Author: bharatviswa): Hi [~timmylicheng] As every run, we are seeing the new error and the stack trace and from log not got much information about the root cause. I think to debug this we need to know why for the Multipartupload key is not finding multipart upload or why some times we see InvalidMultipartupload error. We can see audit logs and see what request is passing for Multipartupload requests, and for the same key we can use listParts to know what are the parts OM is having in its MultipartInfoTable(This will help in InvalidPart error). And also I think we should enable trace/debug log to see the incoming requests, and why for Multipart upload we see these errors. 
(Not sure some bug in Cache logic, or some handling we missed for MPU requests) To debug this we need a complete OM log, audit log, S3gateway log. And also enable trace to see what requests are incoming, I think we log them in OzoneManagerProtocolServerSideTranslatorPB. > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Assignee: Bharat Viswanadham >Priority: Blocker > Fix For: 0.5.0 > > Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, > image-2019-10-31-18-56-56-177.png > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path > on VM0, while reading data from VM0 local disk and write to mount path. The > dataset has various sizes of files from 0 byte to GB-level and it has a > number of ~50,000 files. > The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I > look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related with Multipart upload. This error eventually causes the writing to > terminate and OM to be closed. > > Updated on 11/06/2019: > See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs > are in the attachment. 
> 2019-11-05 18:12:37,766 ERROR > org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest: > MultipartUpload Commit is failed for Key:./2 > 0191012/plc_1570863541668_9278 in Volume/Bucket > s325d55ad283aa400af464c76d713c07ad/ozone-test > NO_SUCH_MULTIPART_UPLOAD_ERROR > org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload > is with specified uploadId fcda8608-b431-48b7-8386- > 0a332f1a709a-103084683261641950 > at > org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1 > 56) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB. > java:217) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132) > at > org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100) > at
[jira] [Updated] (HDDS-2427) Exclude webapps from hadoop-ozone-filesystem-lib-current uber jar
[ https://issues.apache.org/jira/browse/HDDS-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bharat Viswanadham updated HDDS-2427: - Fix Version/s: 0.5.0 > Exclude webapps from hadoop-ozone-filesystem-lib-current uber jar > - > > Key: HDDS-2427 > URL: https://issues.apache.org/jira/browse/HDDS-2427 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Bharat Viswanadham >Assignee: Bharat Viswanadham >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This has caused issue for DN UI loading. > hadoop-ozone-filesystem-lib-current-xx.jar is in the classpath which > accidentally loaded Ozone datanode web application instead of Hadoop datanode > application. This leads to the reported error. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
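For context on the fix shape: one common way to keep a resource directory like webapps/ out of an uber jar is a maven-shade-plugin filter. The fragment below is a hedged sketch of that general technique, not the actual HDDS-2427 patch, whose exact build changes may differ.

```xml
<!-- Illustrative maven-shade-plugin filter (assumed setup, not the
     actual HDDS-2427 change): exclude webapps/ resources from the
     shaded artifact so they cannot shadow Hadoop's own webapps. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>webapps/**</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>
```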
[jira] [Commented] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970373#comment-16970373 ] Erik Krogen commented on HDFS-14969: [~vagarychen] and [~shv] as FYI > Fix HDFS client unnecessary failover log printing > - > > Key: HDFS-14969 > URL: https://issues.apache.org/jira/browse/HDFS-14969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > > In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failover > from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970371#comment-16970371 ] Erik Krogen edited comment on HDFS-14969 at 11/8/19 4:27 PM: - +1 on this. It has been an issue ever since the multiple SbNN feature was introduced in HDFS-6440. As we've started deploying this feature, we've been getting complaints from users -- any time their job fails, they think it is an infrastructure failure because they find these logs There is hard-coded logic right now to skip printing the exception if it's the first StandbyException encountered, due to the assumption that there are only two NNs, so under a normal scenario you should only see at most one StandbyException. We should either remove this log entirely (downgrade to DEBUG), or update the logic to be aware of how many NNs are configured. was (Author: xkrogen): +1 on this. It has been an issue ever since the multiple SbNN feature was introduced in HDFS-6440. As we've started moving towards this, we've been getting complaints from users -- any time their job fails, they think it is an infrastructure failure because they find these logs There is hard-coded logic right now to skip printing the exception if it's the first StandbyException encountered, due to the assumption that there are only two NNs, so under a normal scenario you should only see at most one StandbyException. We should either remove this log entirely (downgrade to DEBUG), or update the logic to be aware of how many NNs are configured. 
> Fix HDFS client unnecessary failover log printing > - > > Key: HDFS-14969 > URL: https://issues.apache.org/jira/browse/HDFS-14969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > > In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failover > from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970371#comment-16970371 ] Erik Krogen edited comment on HDFS-14969 at 11/8/19 4:27 PM: - +1 on this. It has been an issue ever since the multiple SbNN feature was introduced in HDFS-6440. As we've started deploying this feature, we've been getting complaints from users -- any time their job fails, they think it is an infrastructure failure because they find these logs There is hard-coded logic right now to skip printing the exception if it's the first StandbyException encountered, due to the assumption that there are only two NNs, so under a normal scenario you would only see at most one StandbyException. We should either remove this log entirely (downgrade to DEBUG), or update the logic to be aware of how many NNs are configured. was (Author: xkrogen): +1 on this. It has been an issue ever since the multiple SbNN feature was introduced in HDFS-6440. As we've started deploying this feature, we've been getting complaints from users -- any time their job fails, they think it is an infrastructure failure because they find these logs There is hard-coded logic right now to skip printing the exception if it's the first StandbyException encountered, due to the assumption that there are only two NNs, so under a normal scenario you should only see at most one StandbyException. We should either remove this log entirely (downgrade to DEBUG), or update the logic to be aware of how many NNs are configured. 
> Fix HDFS client unnecessary failover log printing > - > > Key: HDFS-14969 > URL: https://issues.apache.org/jira/browse/HDFS-14969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > > In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failover > from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970371#comment-16970371 ] Erik Krogen commented on HDFS-14969: +1 on this. It has been an issue ever since the multiple SbNN feature was introduced in HDFS-6440. As we've started moving towards this, we've been getting complaints from users -- any time their job fails, they think it is an infrastructure failure because they find these logs There is hard-coded logic right now to skip printing the exception if it's the first StandbyException encountered, due to the assumption that there are only two NNs, so under a normal scenario you should only see at most one StandbyException. We should either remove this log entirely (downgrade to DEBUG), or update the logic to be aware of how many NNs are configured. > Fix HDFS client unnecessary failover log printing > - > > Key: HDFS-14969 > URL: https://issues.apache.org/jira/browse/HDFS-14969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > > In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failover > from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. 
Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
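Erik's second suggestion, making the suppression logic aware of how many NameNodes are configured, can be sketched roughly as follows. This is a hypothetical illustration, not the actual RetryInvocationHandler code: the class and method names are invented. With N configured NNs, up to N-1 consecutive StandbyExceptions are expected during a normal failover sweep, so only the Nth in a row is worth logging.

```java
// Hypothetical sketch of NN-count-aware log suppression; names are
// illustrative and do not exist in Hadoop. The current hard-coded logic
// is effectively this class with configuredNameNodes fixed at 2.
class StandbyLogPolicy {
    private final int configuredNameNodes;
    private int consecutiveStandbyExceptions = 0;

    StandbyLogPolicy(int configuredNameNodes) {
        this.configuredNameNodes = configuredNameNodes;
    }

    /** Call on each StandbyException; returns true when it should be logged. */
    boolean onStandbyException() {
        consecutiveStandbyExceptions++;
        // Only after every configured NN has answered "standby" is the
        // situation abnormal enough to surface at INFO.
        return consecutiveStandbyExceptions >= configuredNameNodes;
    }

    /** Call when an RPC finally succeeds, resetting the streak. */
    void onSuccess() {
        consecutiveStandbyExceptions = 0;
    }

    public static void main(String[] args) {
        StandbyLogPolicy policy = new StandbyLogPolicy(3);
        System.out.println(policy.onStandbyException()); // 1st standby NN: suppressed
        System.out.println(policy.onStandbyException()); // 2nd standby NN: suppressed
        System.out.println(policy.onStandbyException()); // all 3 standby: log it
    }
}
```

In a three-NN cluster the client can legitimately hit two standbys before reaching the active, which is exactly the noisy case described above; under this policy that sweep stays silent.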
[jira] [Updated] (HDFS-14969) Fix HDFS client unnecessary failover log printing
[ https://issues.apache.org/jira/browse/HDFS-14969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated HDFS-14969: --- Description: In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and then a client starts rpc with the 1st NN, it will be silent when failover from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd NN, it prints some unnecessary logs, in some scenarios, these logs will be very numerous: {code:java} 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) ...{code} was: In multi-NameNodes scenery, suppose there are 3 NNs and the 3rd is ANN, and then a client starts rpc with the 1st NN, it will be silent when failover from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd NN, it prints some unnecessary logs, in some scenarios, these logs will be very numerous: {code:java} 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. 
Visit https://s.apache.org/sbnn-error at org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) at org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) ...{code} > Fix HDFS client unnecessary failover log printing > - > > Key: HDFS-14969 > URL: https://issues.apache.org/jira/browse/HDFS-14969 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > > In multi-NameNodes scenario, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failover > from the 1st NN to the 2nd NN, but when failover from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. Visit > https://s.apache.org/sbnn-error > at > org.apache.hadoop.hdfs.server.namenode.ha.StandbyState.checkOperation(StandbyState.java:98) > at > org.apache.hadoop.hdfs.server.namenode.NameNode$NameNodeHAContext.checkOperation(NameNode.java:2052) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkOperation(FSNamesystem.java:1459) > ...{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2450) Datanode ReplicateContainer thread pool should be configurable
[ https://issues.apache.org/jira/browse/HDDS-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDDS-2450: Status: Patch Available (was: Open) > Datanode ReplicateContainer thread pool should be configurable > -- > > Key: HDDS-2450 > URL: https://issues.apache.org/jira/browse/HDDS-2450 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The replicateContainer command uses a ReplicationSupervisor object to > implement a threadpool used to process replication commands. > In DatanodeStateMachine this thread pool is initialized with a hard coded > number of threads (10). This should be made configurable with a default value > of 10. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2450) Datanode ReplicateContainer thread pool should be configurable
[ https://issues.apache.org/jira/browse/HDDS-2450?focusedWorklogId=340577=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340577 ] ASF GitHub Bot logged work on HDDS-2450: Author: ASF GitHub Bot Created on: 08/Nov/19 16:09 Start Date: 08/Nov/19 16:09 Worklog Time Spent: 10m Work Description: sodonnel commented on pull request #134: HDDS-2450 Datanode ReplicateContainer thread pool should be configurable URL: https://github.com/apache/hadoop-ozone/pull/134 ## What changes were proposed in this pull request? The replicateContainer command uses a ReplicationSupervisor object to implement a threadpool used to process replication commands. In DatanodeStateMachine this thread pool is initialized with a hard coded number of threads (10). This should be made configurable with a default value of 10. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2450 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340577) Remaining Estimate: 0h Time Spent: 10m > Datanode ReplicateContainer thread pool should be configurable > -- > > Key: HDDS-2450 > URL: https://issues.apache.org/jira/browse/HDDS-2450 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The replicateContainer command uses a ReplicationSupervisor object to > implement a threadpool used to process replication commands. > In DatanodeStateMachine this thread pool is initialized with a hard coded > number of threads (10). This should be made configurable with a default value > of 10. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2450) Datanode ReplicateContainer thread pool should be configurable
[ https://issues.apache.org/jira/browse/HDDS-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2450: - Labels: pull-request-available (was: ) > Datanode ReplicateContainer thread pool should be configurable > -- > > Key: HDDS-2450 > URL: https://issues.apache.org/jira/browse/HDDS-2450 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > > The replicateContainer command uses a ReplicationSupervisor object to > implement a threadpool used to process replication commands. > In DatanodeStateMachine this thread pool is initialized with a hard coded > number of threads (10). This should be made configurable with a default value > of 10. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2450) Datanode ReplicateContainer thread pool should be configurable
[ https://issues.apache.org/jira/browse/HDDS-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970353#comment-16970353 ] Stephen O'Donnell commented on HDDS-2450: - I suggest a new configuration called "hdds.datanode.replication.streams.limit" with a default of 10 to make this configurable. > Datanode ReplicateContainer thread pool should be configurable > -- > > Key: HDDS-2450 > URL: https://issues.apache.org/jira/browse/HDDS-2450 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: Ozone Datanode >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > > The replicateContainer command uses a ReplicationSupervisor object to > implement a threadpool used to process replication commands. > In DatanodeStateMachine this thread pool is initialized with a hard coded > number of threads (10). This should be made configurable with a default value > of 10. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
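The proposed change is small: read the new key with a default of 10 and size the supervisor's pool from it. A rough sketch follows, using a plain Map in place of Ozone's real configuration object; the key name matches Stephen's suggestion, but everything else here is illustrative.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

// Sketch only: the real code would read from Ozone's configuration,
// not a bare Map. The default preserves today's hard-coded behavior.
class ReplicationStreams {
    static final String STREAMS_LIMIT_KEY =
        "hdds.datanode.replication.streams.limit"; // key as suggested above
    static final int STREAMS_LIMIT_DEFAULT = 10;

    static int getStreamsLimit(Map<String, String> conf) {
        String v = conf.get(STREAMS_LIMIT_KEY);
        return v == null ? STREAMS_LIMIT_DEFAULT : Integer.parseInt(v);
    }

    // DatanodeStateMachine would build the ReplicationSupervisor's pool
    // from the configured limit instead of a literal 10.
    static ExecutorService newReplicationPool(Map<String, String> conf) {
        return Executors.newFixedThreadPool(getStreamsLimit(conf));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(getStreamsLimit(conf)); // default: 10
        conf.put(STREAMS_LIMIT_KEY, "4");
        ThreadPoolExecutor pool = (ThreadPoolExecutor) newReplicationPool(conf);
        System.out.println(pool.getCorePoolSize()); // configured: 4
        pool.shutdown();
    }
}
```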
[jira] [Commented] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970352#comment-16970352 ] Chen Zhang commented on HDFS-12288: --- update patch v7 to fix failed test > Fix DataNode's xceiver count calculation > > > Key: HDFS-12288 > URL: https://issues.apache.org/jira/browse/HDFS-12288 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, hdfs >Reporter: Lukas Majercak >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, > HDFS-12288.003.patch, HDFS-12288.004.patch, HDFS-12288.005.patch, > HDFS-12288.006.patch, HDFS-12288.007.patch > > > The problem with the ThreadGroup.activeCount() method is that the method is > only a very rough estimate, and in reality returns the total number of > threads in the thread group as opposed to the threads actually running. > In some DNs, we saw this to return 50~ for a long time, even though the > actual number of DataXceiver threads was next to none. > This is a big issue as we use the xceiverCount to make decisions on the NN > for choosing replication source DN or returning DNs to clients for R/W. > The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value > which only accounts for actual number of DataXcevier threads currently > running and thus represents the load on the DN much better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-12288) Fix DataNode's xceiver count calculation
[ https://issues.apache.org/jira/browse/HDFS-12288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen Zhang updated HDFS-12288: -- Attachment: HDFS-12288.007.patch > Fix DataNode's xceiver count calculation > > > Key: HDFS-12288 > URL: https://issues.apache.org/jira/browse/HDFS-12288 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, hdfs >Reporter: Lukas Majercak >Assignee: Chen Zhang >Priority: Major > Attachments: HDFS-12288.001.patch, HDFS-12288.002.patch, > HDFS-12288.003.patch, HDFS-12288.004.patch, HDFS-12288.005.patch, > HDFS-12288.006.patch, HDFS-12288.007.patch > > > The problem with the ThreadGroup.activeCount() method is that the method is > only a very rough estimate, and in reality returns the total number of > threads in the thread group as opposed to the threads actually running. > In some DNs, we saw this to return 50~ for a long time, even though the > actual number of DataXceiver threads was next to none. > This is a big issue as we use the xceiverCount to make decisions on the NN > for choosing replication source DN or returning DNs to clients for R/W. > The plan is to reuse the DataNodeMetrics.dataNodeActiveXceiversCount value > which only accounts for actual number of DataXcevier threads currently > running and thus represents the load on the DN much better. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
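The fix described above comes down to maintaining an explicit counter of threads actually serving transfers, as DataNodeMetrics.dataNodeActiveXceiversCount does, rather than asking the ThreadGroup. A minimal sketch, with illustrative names rather than the DataNode's actual fields:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: an explicit active-xceiver counter. Unlike
// ThreadGroup.activeCount(), which returns every live thread in the
// group (idle or not), this counts only threads doing real work.
class XceiverCounter {
    private final AtomicInteger active = new AtomicInteger();

    // Called as a DataXceiver thread starts serving a request.
    void incrActiveXceivers() { active.incrementAndGet(); }

    // Called in a finally block as the request finishes.
    void decrActiveXceivers() { active.decrementAndGet(); }

    // The value reported to the NameNode for load-based decisions.
    int getActiveXceiverCount() { return active.get(); }

    public static void main(String[] args) {
        XceiverCounter c = new XceiverCounter();
        c.incrActiveXceivers();
        c.incrActiveXceivers();
        c.decrActiveXceivers();
        System.out.println(c.getActiveXceiverCount()); // 1
    }
}
```

Pairing the decrement with a finally block matters: a transfer that dies with an exception must still release its slot, or the reported load drifts upward exactly as described in the issue.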
[jira] [Assigned] (HDDS-2399) Update mailing list information in CONTRIBUTION and README files
[ https://issues.apache.org/jira/browse/HDDS-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neo Yang reassigned HDDS-2399: -- Assignee: Neo Yang > Update mailing list information in CONTRIBUTION and README files > > > Key: HDDS-2399 > URL: https://issues.apache.org/jira/browse/HDDS-2399 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: Neo Yang >Priority: Major > Labels: newbie, pull-request-available > Fix For: 0.5.0 > > Time Spent: 20m > Remaining Estimate: 0h > > We have new mailing lists: > [ozone-...@hadoop.apache.org|mailto:ozone-...@hadoop.apache.org] > [ozone-iss...@hadoop.apache.org|mailto:ozone-iss...@hadoop.apache.org] > [ozone-comm...@hadoop.apache.org|mailto:ozone-comm...@hadoop.apache.org] > > We need to update CONTRIBUTION.md and README.md to use ozone-dev instead of > hdfs-dev (optionally we can mention the issues/commits lists, but only in > CONTRIBUTION.md) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2449) Delete block command should use a thread pool
Stephen O'Donnell created HDDS-2449: --- Summary: Delete block command should use a thread pool Key: HDDS-2449 URL: https://issues.apache.org/jira/browse/HDDS-2449 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone Datanode Affects Versions: 0.5.0 Reporter: Stephen O'Donnell Assignee: Stephen O'Donnell The datanode receives commands over the heartbeat and queues all commands on a single queue in StateContext.commandQueue. Inside DatanodeStateMachine a single thread is used to process this queue (started by initCommandHander thread) and it passes each command to a ‘handler’. Each command type has its own handler. The delete block command immediately executes the command on the thread used to process the command queue. Therefore if the delete is slow for some reason (it must access disk, so this is possible) it could cause other commands to back up. This should be changed to use a threadpool to queue the deleteBlock command, in a similar way to ReplicateContainerCommand. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
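The proposed shape, handing the command to a pool instead of running it on the dispatch thread, looks roughly like this. It is a standalone sketch with invented names; the real handler wiring in DatanodeStateMachine differs.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a per-handler pool for delete-block work,
// mirroring the approach ReplicateContainerCommand already takes.
class DeleteBlockDispatch {
    private final ExecutorService deletePool = Executors.newFixedThreadPool(2);

    // The command-queue thread only enqueues and returns immediately, so a
    // slow (disk-bound) delete cannot stall other command types behind it.
    Future<?> handle(Runnable deleteBlockCommand) {
        return deletePool.submit(deleteBlockCommand);
    }

    // Drain the pool; returns true if all queued deletes finished in time.
    boolean shutdownAndWait() {
        deletePool.shutdown();
        try {
            return deletePool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DeleteBlockDispatch d = new DeleteBlockDispatch();
        java.util.concurrent.atomic.AtomicInteger deleted =
            new java.util.concurrent.atomic.AtomicInteger();
        for (int i = 0; i < 5; i++) {
            d.handle(deleted::incrementAndGet); // enqueue, don't block
        }
        d.shutdownAndWait();
        System.out.println(deleted.get()); // 5
    }
}
```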
[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.
[ https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970332#comment-16970332 ] Xiaoqiao He commented on HDFS-14720: [^HDFS-14720.003.patch] LGTM, +1. Thanks [~hemanthboyina]. > DataNode shouldn't report block as bad block if the block length is > Long.MAX_VALUE. > --- > > Key: HDFS-14720 > URL: https://issues.apache.org/jira/browse/HDFS-14720 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14720.001.patch, HDFS-14720.002.patch, > HDFS-14720.003.patch > > > {noformat} > 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Can't replicate block > BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because > on-disk length 175085 is shorter than NameNode recorded length > 9223372036854775807.{noformat} > If the block length is Long.MAX_VALUE, means file belongs to this block is > deleted from the namenode and DN got the command after deletion of file. In > this case command should be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
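The guard described above is a single sentinel check before the bad-block report: the NameNode-recorded length 9223372036854775807 in the log is Long.MAX_VALUE. A sketch of the intended condition as pure logic, not the actual DataNode code:

```java
// Illustrative sketch of the check proposed in this issue; the method
// names are invented, not DataNode APIs.
class BadBlockCheck {
    // A NameNode-recorded length of Long.MAX_VALUE is a sentinel meaning
    // the file behind the block was already deleted: the DN received the
    // replication command after the deletion and should simply drop it.
    static boolean shouldIgnore(long nnRecordedLength) {
        return nnRecordedLength == Long.MAX_VALUE;
    }

    // Report a bad block only for a genuine on-disk truncation.
    static boolean isBadBlock(long onDiskLength, long nnRecordedLength) {
        return !shouldIgnore(nnRecordedLength) && onDiskLength < nnRecordedLength;
    }

    public static void main(String[] args) {
        // On-disk length from the log line in the description:
        System.out.println(isBadBlock(175085L, Long.MAX_VALUE)); // false: ignored
        System.out.println(isBadBlock(175085L, 200000L));        // true: truncated
    }
}
```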
[jira] [Assigned] (HDFS-14970) HDFS : fsck "-list-corruptfileblocks" command not giving expected output
[ https://issues.apache.org/jira/browse/HDFS-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina reassigned HDFS-14970: Assignee: hemanthboyina > HDFS : fsck "-list-corruptfileblocks" command not giving expected output > > > Key: HDFS-14970 > URL: https://issues.apache.org/jira/browse/HDFS-14970 > Project: Hadoop HDFS > Issue Type: Bug > Components: tools >Affects Versions: 3.1.2 > Environment: HA Cluster >Reporter: Souryakanta Dwivedy >Assignee: hemanthboyina >Priority: Major > Attachments: image-2019-11-08-18-44-03-349.png, > image-2019-11-08-18-45-53-858.png > > > HDFS fsck "-list-corruptfileblocks" option not giving expected output > Step :- > Check the currupt files with fsck it will give the correct output > !image-2019-11-08-18-44-03-349.png! > > Check the currupt files with fsck -list-corruptfileblocks option it > will > not provide the expected output which is wrong behavior > > !image-2019-11-08-18-45-53-858.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2273) Avoid buffer copying in GrpcReplicationService
[ https://issues.apache.org/jira/browse/HDDS-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsz-wo Sze updated HDDS-2273: - Fix Version/s: 0.5.0 Resolution: Fixed Status: Resolved (was: Patch Available) I just have merged the pull request. Thanks, [~adoroszlai]! > Avoid buffer copying in GrpcReplicationService > -- > > Key: HDDS-2273 > URL: https://issues.apache.org/jira/browse/HDDS-2273 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Tsz-wo Sze >Assignee: Attila Doroszlai >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 10m > Remaining Estimate: 0h > > In GrpcOutputStream, it writes data to a ByteArrayOutputStream and copies > them to a ByteString. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
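The copy being avoided is the one every ByteArrayOutputStream drain implies: toByteArray() allocates a fresh array, and a ByteString.copyFrom on top of that would copy again. A small stdlib-only illustration of the difference, with ByteBuffer.wrap standing in for a protobuf zero-copy path (this is not the GrpcOutputStream code, just the underlying cost):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Demonstrates the extra copy this issue removes, using only the JDK.
class CopyDemo {
    // Draining through ByteArrayOutputStream copies the bytes...
    static byte[] viaCopy(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(data, 0, data.length);
        return bos.toByteArray(); // fresh array: one full copy per drain
    }

    // ...while wrapping shares the existing backing array.
    static ByteBuffer viaWrap(byte[] data) {
        return ByteBuffer.wrap(data); // no copy: same array underneath
    }

    public static void main(String[] args) {
        byte[] chunk = {1, 2, 3};
        System.out.println(viaCopy(chunk) == chunk);               // false: new array
        System.out.println(viaWrap(chunk).array() == chunk);       // true: shared
        System.out.println(Arrays.equals(viaCopy(chunk), chunk));  // true: same bytes
    }
}
```

For container replication the chunks are large and frequent, which is why eliminating a per-drain copy is worth the churn.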
[jira] [Created] (HDDS-2450) Datanode ReplicateContainer thread pool should be configurable
Stephen O'Donnell created HDDS-2450: --- Summary: Datanode ReplicateContainer thread pool should be configurable Key: HDDS-2450 URL: https://issues.apache.org/jira/browse/HDDS-2450 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone Datanode Affects Versions: 0.5.0 Reporter: Stephen O'Donnell Assignee: Stephen O'Donnell The replicateContainer command uses a ReplicationSupervisor object to implement a threadpool used to process replication commands. In DatanodeStateMachine this thread pool is initialized with a hard coded number of threads (10). This should be made configurable with a default value of 10. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2448) Delete container command should use a thread pool
Stephen O'Donnell created HDDS-2448: --- Summary: Delete container command should use a thread pool Key: HDDS-2448 URL: https://issues.apache.org/jira/browse/HDDS-2448 Project: Hadoop Distributed Data Store Issue Type: Improvement Components: Ozone Datanode Affects Versions: 0.5.0 Reporter: Stephen O'Donnell Assignee: Stephen O'Donnell The datanode receives commands over the heartbeat and queues all commands on a single queue in StateContext.commandQueue. Inside DatanodeStateMachine a single thread is used to process this queue (started by initCommandHander thread) and it passes each command to a ‘handler’. Each command type has its own handler. The delete container command immediately executes the command on the thread used to process the command queue. Therefore if the delete is slow for some reason (it must access disk, so this is possible) it could cause other commands to back up. This should be changed to use a threadpool to queue the deleteContainer command, in a similar way to ReplicateContainerCommand. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14971) HDFS : help info of fsck "-list-corruptfileblocks" command needs to be rectified
[ https://issues.apache.org/jira/browse/HDFS-14971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970263#comment-16970263 ] bright.zhou commented on HDFS-14971: I want to work on this, pls assign to me > HDFS : help info of fsck "-list-corruptfileblocks" command needs to be > rectified > > > Key: HDFS-14971 > URL: https://issues.apache.org/jira/browse/HDFS-14971 > Project: Hadoop HDFS > Issue Type: Bug > Components: tools >Affects Versions: 3.1.2 > Environment: HA Cluster >Reporter: Souryakanta Dwivedy >Priority: Minor > Attachments: image-2019-11-08-18-58-41-220.png > > > HDFS : help info of fsck "-list-corruptfileblocks" command needs to be > rectified > Check the help info of fsck -list-corruptfileblocks it is specified as > "-list-corruptfileblocks print out list of missing blocks and files they > belong to" > It should be rectified as corrupted blocks and files as it is going provide > information about corrupted blocks and files not missing blocks and files > Expected output :- > "-list-corruptfileblocks print out list of corrupted blocks and files they > belong to" > > !image-2019-11-08-18-58-41-220.png! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14505) "touchz" command should check quota limit before deleting an already existing file
[ https://issues.apache.org/jira/browse/HDFS-14505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970256#comment-16970256 ] hemanthboyina commented on HDFS-14505: -- {code:java} ./hdfs dfs -ls /dir2 -rw-r--r-- 1 sbanerjee hadoop 0 2019-05-21 15:10 /dir2/file4 HW15685:bin sbanerjee$ ./hdfs dfs -touchz /dir2/file4 touchz: The NameSpace quota (directories and files) of directory /dir2 is exceeded: quota=3 file count=5 {code} are there any operations you have done [~shashikant] ? I am not able to reproduce this . > "touchz" command should check quota limit before deleting an already existing > file > -- > > Key: HDFS-14505 > URL: https://issues.apache.org/jira/browse/HDFS-14505 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Shashikant Banerjee >Assignee: hemanthboyina >Priority: Major > > {code:java} > HW15685:bin sbanerjee$ ./hdfs dfs -ls /dir2 > 2019-05-21 15:14:01,080 WARN util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > Found 1 items > -rw-r--r-- 1 sbanerjee hadoop 0 2019-05-21 15:10 /dir2/file4 > HW15685:bin sbanerjee$ ./hdfs dfs -touchz /dir2/file4 > 2019-05-21 15:14:12,247 WARN util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > touchz: The NameSpace quota (directories and files) of directory /dir2 is > exceeded: quota=3 file count=5 > HW15685:bin sbanerjee$ ./hdfs dfs -ls /dir2 > 2019-05-21 15:14:20,607 WARN util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > {code} > Here, the "touchz" command failed to create the file as the quota limit was > hit, but ended up deleting the original file which existed. It should do the > quota check before deleting the file so that after successful deletion, > creation should succeed. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
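The ordering bug can be shown with a toy directory model: the current code deletes first and only then runs the quota check on the re-create, so a failed check loses the original file, while checking before any delete preserves it. Everything below is illustrative; the real fix belongs in the NameNode's quota verification path.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the touchz ordering fix; not NameNode code.
class TouchzModel {
    final int namespaceQuota;          // max files allowed in the directory
    final Set<String> files = new HashSet<>();

    TouchzModel(int namespaceQuota) {
        this.namespaceQuota = namespaceQuota;
    }

    // Fixed ordering: verify the quota BEFORE deleting the existing file,
    // so a quota failure leaves the original file untouched. The check
    // itself mirrors the one that fired in the report (the re-created
    // file is counted against an already-exceeded quota).
    boolean touchz(String name) {
        boolean exists = files.contains(name);
        int countAfter = exists ? files.size() : files.size() + 1;
        if (countAfter > namespaceQuota) {
            return false;              // fail early; nothing was deleted
        }
        files.remove(name);            // safe now: re-creation cannot fail
        files.add(name);
        return true;
    }
}
```

With quota=3 and a directory already holding 5 files, as in the report, touchz on an existing file still reports a quota error, but the original file survives, which is the behavior the issue asks for.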
[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.
[ https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970250#comment-16970250 ] hemanthboyina commented on HDFS-14720: -- thanks for the review [~hexiaoqiao] , updated the patch . please check > DataNode shouldn't report block as bad block if the block length is > Long.MAX_VALUE. > --- > > Key: HDFS-14720 > URL: https://issues.apache.org/jira/browse/HDFS-14720 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14720.001.patch, HDFS-14720.002.patch, > HDFS-14720.003.patch > > > {noformat} > 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Can't replicate block > BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because > on-disk length 175085 is shorter than NameNode recorded length > 9223372036854775807.{noformat} > If the block length is Long.MAX_VALUE, means file belongs to this block is > deleted from the namenode and DN got the command after deletion of file. In > this case command should be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.
[ https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hemanthboyina updated HDFS-14720: - Attachment: HDFS-14720.003.patch > DataNode shouldn't report block as bad block if the block length is > Long.MAX_VALUE. > --- > > Key: HDFS-14720 > URL: https://issues.apache.org/jira/browse/HDFS-14720 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14720.001.patch, HDFS-14720.002.patch, > HDFS-14720.003.patch > > > {noformat} > 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Can't replicate block > BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because > on-disk length 175085 is shorter than NameNode recorded length > 9223372036854775807.{noformat} > If the block length is Long.MAX_VALUE, means file belongs to this block is > deleted from the namenode and DN got the command after deletion of file. In > this case command should be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDDS-1701) Move dockerbin script to libexec
[ https://issues.apache.org/jira/browse/HDDS-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek resolved HDDS-1701. --- Fix Version/s: 0.5.0 Resolution: Fixed > Move dockerbin script to libexec > > > Key: HDDS-1701 > URL: https://issues.apache.org/jira/browse/HDDS-1701 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: YiSheng Lien >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Ozone tarball structure contains a new bin script directory called dockerbin. > These utility script can be relocated to OZONE_HOME/libexec because they are > internal binaries that are not intended to be executed directly by users or > shell scripts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-1701) Move dockerbin script to libexec
[ https://issues.apache.org/jira/browse/HDDS-1701?focusedWorklogId=340530=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340530 ] ASF GitHub Bot logged work on HDDS-1701: Author: ASF GitHub Bot Created on: 08/Nov/19 14:48 Start Date: 08/Nov/19 14:48 Worklog Time Spent: 10m Work Description: elek commented on pull request #80: HDDS-1701. Move dockerbin script to libexec. URL: https://github.com/apache/hadoop-ozone/pull/80 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340530) Time Spent: 40m (was: 0.5h) > Move dockerbin script to libexec > > > Key: HDDS-1701 > URL: https://issues.apache.org/jira/browse/HDDS-1701 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: YiSheng Lien >Priority: Major > Labels: pull-request-available > Fix For: 0.5.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Ozone tarball structure contains a new bin script directory called dockerbin. > These utility script can be relocated to OZONE_HOME/libexec because they are > internal binaries that are not intended to be executed directly by users or > shell scripts. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-14972) HDFS: fsck "-blockId" option not giving expected output
Souryakanta Dwivedy created HDFS-14972: -- Summary: HDFS: fsck "-blockId" option not giving expected output Key: HDFS-14972 URL: https://issues.apache.org/jira/browse/HDFS-14972 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 3.1.2 Environment: HA Cluster Reporter: Souryakanta Dwivedy Attachments: image-2019-11-08-19-10-18-057.png, image-2019-11-08-19-12-21-307.png HDFS: fsck "-blockId" option not giving expected output HDFS fsck displaying correct output for corrupted files and blocks !image-2019-11-08-19-10-18-057.png! HDFS fsck -blockId command not giving expected output for corrupted replica !image-2019-11-08-19-12-21-307.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14970) HDFS : fsck "-list-corruptfileblocks" command not giving expected output
[ https://issues.apache.org/jira/browse/HDFS-14970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970170#comment-16970170 ] hemanthboyina commented on HDFS-14970: -- thanks [~SouryakantaDwivedy] for putting this up. This issue should be fixed. Will check this. > HDFS : fsck "-list-corruptfileblocks" command not giving expected output > > > Key: HDFS-14970 > URL: https://issues.apache.org/jira/browse/HDFS-14970 > Project: Hadoop HDFS > Issue Type: Bug > Components: tools >Affects Versions: 3.1.2 > Environment: HA Cluster >Reporter: Souryakanta Dwivedy >Priority: Major > Attachments: image-2019-11-08-18-44-03-349.png, > image-2019-11-08-18-45-53-858.png > > > HDFS fsck "-list-corruptfileblocks" option not giving expected output > Step :- > Check the corrupt files with fsck it will give the correct output > !image-2019-11-08-18-44-03-349.png! > > Check the corrupt files with fsck -list-corruptfileblocks option it > will > not provide the expected output which is wrong behavior > > !image-2019-11-08-18-45-53-858.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14971) HDFS : help info of fsck "-list-corruptfileblocks" command needs to be rectified
[ https://issues.apache.org/jira/browse/HDFS-14971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Souryakanta Dwivedy updated HDFS-14971: --- Issue Type: Bug (was: Improvement) > HDFS : help info of fsck "-list-corruptfileblocks" command needs to be > rectified > > > Key: HDFS-14971 > URL: https://issues.apache.org/jira/browse/HDFS-14971 > Project: Hadoop HDFS > Issue Type: Bug > Components: tools >Affects Versions: 3.1.2 > Environment: HA Cluster >Reporter: Souryakanta Dwivedy >Priority: Minor > Attachments: image-2019-11-08-18-58-41-220.png > > > HDFS : help info of fsck "-list-corruptfileblocks" command needs to be > rectified > Check the help info of fsck -list-corruptfileblocks; it is specified as > "-list-corruptfileblocks print out list of missing blocks and files they > belong to" > It should be rectified to corrupted blocks and files, as it is going to provide > information about corrupted blocks and files, not missing blocks and files > Expected output :- > "-list-corruptfileblocks print out list of corrupted blocks and files they > belong to" > > !image-2019-11-08-18-58-41-220.png! >
[jira] [Created] (HDFS-14971) HDFS : help info of fsck "-list-corruptfileblocks" command needs to be rectified
Souryakanta Dwivedy created HDFS-14971: -- Summary: HDFS : help info of fsck "-list-corruptfileblocks" command needs to be rectified Key: HDFS-14971 URL: https://issues.apache.org/jira/browse/HDFS-14971 Project: Hadoop HDFS Issue Type: Improvement Components: tools Affects Versions: 3.1.2 Environment: HA Cluster Reporter: Souryakanta Dwivedy Attachments: image-2019-11-08-18-58-41-220.png HDFS : help info of fsck "-list-corruptfileblocks" command needs to be rectified Check the help info of fsck -list-corruptfileblocks; it is specified as "-list-corruptfileblocks print out list of missing blocks and files they belong to" It should be rectified to corrupted blocks and files, as it is going to provide information about corrupted blocks and files, not missing blocks and files Expected output :- "-list-corruptfileblocks print out list of corrupted blocks and files they belong to" !image-2019-11-08-18-58-41-220.png!
[jira] [Created] (HDFS-14970) HDFS : fsck "-list-corruptfileblocks" command not giving expected output
Souryakanta Dwivedy created HDFS-14970: -- Summary: HDFS : fsck "-list-corruptfileblocks" command not giving expected output Key: HDFS-14970 URL: https://issues.apache.org/jira/browse/HDFS-14970 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 3.1.2 Environment: HA Cluster Reporter: Souryakanta Dwivedy Attachments: image-2019-11-08-18-44-03-349.png, image-2019-11-08-18-45-53-858.png HDFS fsck "-list-corruptfileblocks" option not giving expected output Step :- Check the corrupt files with fsck; it will give the correct output !image-2019-11-08-18-44-03-349.png! Check the corrupt files with the fsck -list-corruptfileblocks option; it will not provide the expected output, which is wrong behavior !image-2019-11-08-18-45-53-858.png!
[jira] [Created] (HDDS-2447) Allow datanodes to operate with simulated containers
Stephen O'Donnell created HDDS-2447: --- Summary: Allow datanodes to operate with simulated containers Key: HDDS-2447 URL: https://issues.apache.org/jira/browse/HDDS-2447 Project: Hadoop Distributed Data Store Issue Type: Bug Components: Ozone Datanode Affects Versions: 0.5.0 Reporter: Stephen O'Donnell The Storage Container Manager (SCM) generally deals with datanodes and containers. Datanodes report their containers via container reports and the SCM keeps track of them, schedules new replicas to be created when needed, etc. SCM does not care about individual blocks within the containers (aside from deleting them) or keys. Therefore it should be possible to scale test much of SCM without OM or worrying about writing keys. In order to scale test SCM and some of its internal features like decommission, maintenance mode and the replication manager, it would be helpful to quickly create clusters with many containers, without needing to go through a data loading exercise. What I imagine happening is: * We generate a list of container IDs and container sizes - this could be a fixed size or configured size for all containers. We could also fix the number of blocks / chunks inside a 'generated simulated container' so they are all the same. * When the Datanode starts, if it has simulated containers enabled, it would optionally look for this list of containers and load the meta data into memory. Then it would report the containers to SCM as normal, and the SCM would believe the containers actually exist. * If SCM creates a new container, then the datanode should create the meta-data in memory, but not write anything to disk. * If SCM instructs a DN to replicate a container, then we should stream simulated data over the wire equivalent to the container size, but again throw away the data at the receiving side and store only the metadata in datanode memory. * It would be acceptable for a DN restart to forget all containers and re-load them from the generated list. 
A nice-to-have feature would be to persist any changes to disk so that a DN restart would return to its pre-restart state. At this stage, I am not too concerned about OM, or clients trying to read chunks out of these simulated containers (my focus is on SCM at the moment), but it would be great if that were possible too. I believe this feature would let us do scale testing of SCM and benchmark some dead node / replication / decommission scenarios on clusters with much reduced hardware requirements. It would also allow clusters with a large number of containers to be created quickly, rather than going through a data-loading exercise. This would open the door to a tool similar to https://github.com/linkedin/dynamometer which uses simulated storage on HDFS to perform scale tests against the namenode with reduced hardware requirements. HDDS-1094 added the ability to have a level of simulated storage on a datanode. In that Jira, when a client writes data to a chunk, the data is thrown away and nothing is written to disk. If a client later tries to read the data back, it just gets zeroed byte buffers. Hopefully this Jira could build on that feature to fully simulate the containers from the SCM point of view, and later we can extend to allowing clients to create keys etc. too.
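The generation step from the first bullet above could be sketched roughly as follows. This is purely an illustration of the proposal, not Ozone code: the SimulatedContainer class, its fields, and the generate() method are all invented names, and the real ContainerData carries far more state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of generating a simulated-container list for a DN. */
public class SimulatedContainers {

  /** Minimal in-memory container metadata held instead of on-disk data. */
  public static class SimulatedContainer {
    public final long containerId;
    public final long usedBytes;   // reported to SCM as if the data existed

    SimulatedContainer(long containerId, long usedBytes) {
      this.containerId = containerId;
      this.usedBytes = usedBytes;
    }
  }

  /**
   * Generate metadata for {@code count} containers of one fixed size.
   * A datanode with simulation enabled would load this list at startup
   * and include it in container reports, with nothing written to disk.
   */
  public static List<SimulatedContainer> generate(long firstId, int count,
      long sizeBytes) {
    List<SimulatedContainer> containers = new ArrayList<>(count);
    for (int i = 0; i < count; i++) {
      containers.add(new SimulatedContainer(firstId + i, sizeBytes));
    }
    return containers;
  }
}
```

A fixed size per container (and a fixed block/chunk count, as the bullet suggests) keeps the generated list trivially reproducible after a DN restart.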
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970119#comment-16970119 ] Li Cheng commented on HDDS-2356: [~bharat] New error shows up using today's master branch. 2019-11-08 20:08:24,832 ERROR org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest: MultipartUpload Complete request failed for Key: plc_1570863541668_9278 in Volume/Bucket s325d55ad283aa400af464c76d713c07ad/ozone-test INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: ozone-testkey: plc_1570863541668_9278 at org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest.validateAndUpdateCache(S3MultipartUploadCompleteRequest.java:187) at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:217) at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132) at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72) at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100) at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Assignee: Bharat Viswanadham >Priority: Blocker > Fix For: 0.5.0 > > Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, > image-2019-10-31-18-56-56-177.png > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path > on VM0, while reading data from VM0 local disk and write to mount path. The > dataset has various sizes of files from 0 byte to GB-level and it has a > number of ~50,000 files. > The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I > look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related with Multipart upload. This error eventually causes the writing to > terminate and OM to be closed. > > Updated on 11/06/2019: > See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs > are in the attachment. 
> 2019-11-05 18:12:37,766 ERROR > org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest: > MultipartUpload Commit is failed for Key:./2 > 0191012/plc_1570863541668_9278 in Volume/Bucket > s325d55ad283aa400af464c76d713c07ad/ozone-test > NO_SUCH_MULTIPART_UPLOAD_ERROR > org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload > is with specified uploadId fcda8608-b431-48b7-8386- > 0a332f1a709a-103084683261641950 > at > org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1 > 56) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB. > java:217) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132) > at > org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72) > at > org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100) > at >
[jira] [Commented] (HDFS-14529) NPE while Loading the Editlogs
[ https://issues.apache.org/jira/browse/HDFS-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970116#comment-16970116 ] Xiaoqiao He commented on HDFS-14529: Thanks [~sodonnell] for your quick response. I have noticed HDFS-12369. However, in my case, we do not use the snapshot feature, and I also checked the file mentioned in the log above and all of its parent paths; none of them are snapshot paths. I should note that the case I mentioned appears on an old version. I will share more information if there is any progress. > NPE while Loading the Editlogs > -- > > Key: HDFS-14529 > URL: https://issues.apache.org/jira/browse/HDFS-14529 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Major > > {noformat} > 2019-05-31 15:15:42,397 ERROR namenode.FSEditLogLoader: Encountered exception > on operation TimesOp [length=0, > path=/testLoadSpace/dir0/dir0/dir0/dir2/_file_9096763, mtime=-1, > atime=1559294343288, opCode=OP_TIMES, txid=18927893] > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.unprotectedSetTimes(FSDirAttrOp.java:490) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:711) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:286) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:181) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:924) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:771) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1105) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726) > at > 
org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1558) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1640) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1725){noformat}
[jira] [Commented] (HDFS-14529) NPE while Loading the Editlogs
[ https://issues.apache.org/jira/browse/HDFS-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970102#comment-16970102 ] Stephen O'Donnell commented on HDFS-14529: -- [~hexiaoqiao] The stack trace you posted looks like HDFS-12369. See this comment especially, as we believe HDFS-12369 can show some different stack traces when it occurs: https://issues.apache.org/jira/browse/HDFS-12369?focusedCommentId=16304855=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16304855 [~szetszwo] I encountered the stack you mentioned once in a cluster that had snapshots, and the snapshots were somewhat corrupt. The cluster had frequently hit HDFS-13101. In that example, we found the file it was attempting to apply the TimesOp against did not exist except in the snapshot, and if I recall correctly, within the snapshot it was not really readable due to something similar to HDFS-13101. The interesting thing was that even though the file was deleted, more edits kept appearing with the invalid TimesOp in it. That cluster had other issues we fixed, and this problem got cleared as a side-effect. In short, it is likely this is somehow related to snapshots. 
> NPE while Loading the Editlogs > -- > > Key: HDFS-14529 > URL: https://issues.apache.org/jira/browse/HDFS-14529 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Major > > {noformat} > 2019-05-31 15:15:42,397 ERROR namenode.FSEditLogLoader: Encountered exception > on operation TimesOp [length=0, > path=/testLoadSpace/dir0/dir0/dir0/dir2/_file_9096763, mtime=-1, > atime=1559294343288, opCode=OP_TIMES, txid=18927893] > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.unprotectedSetTimes(FSDirAttrOp.java:490) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:711) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:286) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:181) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:924) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:771) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1105) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1558) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1640) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1725){noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-14963) Add HDFS Client machine caching active namenode index mechanism.
[ https://issues.apache.org/jira/browse/HDFS-14963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969795#comment-16969795 ] Xudong Cao edited comment on HDFS-14963 at 11/8/19 11:57 AM: - cc [~shv] [~elgoiri] [~vagarychen] [~weichiu] Thank you all for your attention. For the convenience of reading, I have uploaded an additional patch besides the github PR (they are exactly the same patch). Based on this patch: # The cache directory is configurable via a newly introduced item "dfs.client.failover.cache-active.dir"; its default value is ${java.io.tmpdir}, which is /tmp on the Linux platform. # Writing/reading a cache file is under file lock protection, and we use trylock() instead of lock(), so in a high-concurrency scenario, reading/writing the cache file will not become a bottleneck. If trylock() fails while reading, it just falls back to what we have today: simply return index 0. And if trylock() fails while writing, it simply returns and continues. In fact, I think both of these situations should be very rare. # All cache files' modes are manually set to "666", meaning every process can read/write them. # This cache mechanism is robust: regardless of whether the cache file was accidentally deleted or the content was maliciously modified, readActiveCache() always returns a legal index, and writeActiveCache() will automatically rebuild the cache file on the next failover in ConfiguredFailoverProxyProvider. # We surely have dfs.client.failover.random.order; actually I have used it in the unit test. Zkfc does know which NN is active right now, but it does not have an rpc interface allowing us to get it, and I think an rpc call is much more expensive than reading/writing local files. # cc [~xkrogen], I will tackle the logging issue discussed in (2) in a separate JIRA. was (Author: xudongcao): cc [~shv] [~elgoiri] [~vagarychen] [~weichiu] Thank you all for your attention. 
For the convenience of reading, I have uploaded an additional patch besides the github PR (they are exactly the same patch). Based on this patch: # The cache directory is configurable via a newly introduced item "dfs.client.failover.cache-active.dir"; its default value is ${java.io.tmpdir}, which is /tmp on the Linux platform. # Writing/reading a cache file is under file lock protection, and we use trylock() instead of lock(), so in a high-concurrency scenario, reading/writing the cache file will not become a bottleneck. If trylock() fails while reading, it just falls back to what we have today: simply return index 0. And if trylock() fails while writing, it simply returns and continues. In fact, I think both of these situations should be very rare. # All cache files' modes are manually set to "666", meaning every process can read/write them. # This cache mechanism is robust: regardless of whether the cache file was accidentally deleted or the content was maliciously modified, readActiveCache() always returns a legal index, and writeActiveCache() will automatically rebuild the cache file on the next failover. Of course, in all abnormal situations there will be a WARN log. # We surely have dfs.client.failover.random.order; actually I have used it in the unit test. Zkfc does know which NN is active right now, but it does not have an rpc interface allowing us to get it, and I think an rpc call is much more expensive than reading/writing local files. # cc [~xkrogen], I will tackle the logging issue discussed in (2) in a separate JIRA. > Add HDFS Client machine caching active namenode index mechanism. 
> > > Key: HDFS-14963 > URL: https://issues.apache.org/jira/browse/HDFS-14963 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs-client >Affects Versions: 3.1.3 >Reporter: Xudong Cao >Assignee: Xudong Cao >Priority: Minor > Attachments: HDFS-14963.000.patch, HDFS-14963.001.patch > > > In a multi-NameNode scenario, a new hdfs client always begins an rpc call from > the 1st namenode, simply polls, and finally determines the current Active > namenode. > This brings at least two problems: > # Extra failover consumption, especially in the case of frequent creation of > clients. > # Unnecessary log printing, suppose there are 3 NNs and the 3rd is ANN, and > then a client starts rpc with the 1st NN, it will be silent when failing over > from the 1st NN to the 2nd NN, but when failing over from the 2nd NN to the 3rd > NN, it prints some unnecessary logs, in some scenarios, these logs will be > very numerous: > {code:java} > 2019-11-07 11:35:41,577 INFO retry.RetryInvocationHandler: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): > Operation category READ is not supported in state standby. Visit >
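The trylock()-based read/write path described in the comment above might look roughly like the sketch below. This is not the attached patch: the class name, file layout (a single integer written as text), and method names are invented for illustration; the real readActiveCache()/writeActiveCache() live in the client failover code. A null return from tryLock(), a missing file, or corrupt content all fall back to index 0, matching the "always return a legal index" behavior the comment describes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative sketch of caching the active-NN index under file locks. */
public class ActiveIndexCache {

  /** Best-effort write; silently gives up if the lock is contended. */
  public static void writeActiveIndex(Path cacheFile, int index) {
    try (FileChannel ch = FileChannel.open(cacheFile,
        StandardOpenOption.CREATE, StandardOpenOption.WRITE,
        StandardOpenOption.TRUNCATE_EXISTING)) {
      FileLock lock = ch.tryLock();            // non-blocking, exclusive
      if (lock == null) {
        return;                                // another writer holds it; skip
      }
      try {
        ch.write(ByteBuffer.wrap(
            Integer.toString(index).getBytes(StandardCharsets.UTF_8)));
      } finally {
        lock.release();
      }
    } catch (IOException e) {
      // the cache is only an optimization; ignore failures
    }
  }

  /** Returns the cached index, or 0 when the file is missing/locked/corrupt. */
  public static int readActiveIndex(Path cacheFile) {
    if (!Files.exists(cacheFile)) {
      return 0;
    }
    try (FileChannel ch = FileChannel.open(cacheFile, StandardOpenOption.READ)) {
      FileLock lock = ch.tryLock(0, Long.MAX_VALUE, true);  // shared lock
      if (lock == null) {
        return 0;                              // contended: fall back to index 0
      }
      try {
        ByteBuffer buf = ByteBuffer.allocate(16);
        int n = ch.read(buf);
        int idx = Integer.parseInt(new String(
            buf.array(), 0, Math.max(n, 0), StandardCharsets.UTF_8).trim());
        return idx >= 0 ? idx : 0;             // always return a legal index
      } finally {
        lock.release();
      }
    } catch (IOException | NumberFormatException e) {
      return 0;
    }
  }
}
```

Since tryLock() never blocks, a failed read costs one syscall and degrades to today's behavior (start from NN 0) rather than stalling the client.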
[jira] [Commented] (HDFS-14529) NPE while Loading the Editlogs
[ https://issues.apache.org/jira/browse/HDFS-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970092#comment-16970092 ] Xiaoqiao He commented on HDFS-14529: I also meet another NPE at StandbyNN (build with hadoop-2.7.1) to replay editlog, it seems some corner case to trigger FSEditLogLoader throw null pointer. {code:java} 2019-11-06 18:30:25,948 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=$path, replication=3, mtime=1573034707723, atime=1571949218729, blockSize=67108864, blocks=[blk_2870262427_1841120265], permissions=*:*:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=21246238494] java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguousUnderConstruction.setGenerationStampAndVerifyReplicas(BlockInfoContiguousUnderConstruction.java:259) at org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguousUnderConstruction.commitBlock(BlockInfoContiguousUnderConstruction.java:279) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.forceCompleteBlock(BlockManager.java:1199) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.updateBlocks(FSEditLogLoader.java:1022) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:438) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:234) at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:844) at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:825) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:331) at 
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$200(EditLogTailer.java:284) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:301) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:360) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1666) at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:426) at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:297) {code} > NPE while Loading the Editlogs > -- > > Key: HDFS-14529 > URL: https://issues.apache.org/jira/browse/HDFS-14529 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Harshakiran Reddy >Assignee: Ayush Saxena >Priority: Major > > {noformat} > 2019-05-31 15:15:42,397 ERROR namenode.FSEditLogLoader: Encountered exception > on operation TimesOp [length=0, > path=/testLoadSpace/dir0/dir0/dir0/dir2/_file_9096763, mtime=-1, > atime=1559294343288, opCode=OP_TIMES, txid=18927893] > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.unprotectedSetTimes(FSDirAttrOp.java:490) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:711) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:286) > at > org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:181) > at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:924) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:771) > at > org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:331) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1105) > at > 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:726) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1558) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1640) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1725){noformat}
[jira] [Updated] (HDDS-2443) Python client/interface for Ozone
[ https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Cheng updated HDDS-2443: --- Attachment: (was: OzoneS3.py) > Python client/interface for Ozone > - > > Key: HDDS-2443 > URL: https://issues.apache.org/jira/browse/HDDS-2443 > Project: Hadoop Distributed Data Store > Issue Type: New Feature > Components: Ozone Client >Reporter: Li Cheng >Priority: Major > Attachments: OzoneS3.py > > > Original ideas: item#25 in > [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors] > Ozone Client(Python) for Data Science Notebook such as Jupyter. > # Size: Large > # PyArrow: [https://pypi.org/project/pyarrow/] > # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API > Impala uses libhdfs > > Path to try: > # s3 interface: Ozone s3 gateway(already supported) + AWS python client > (boto3) > # python native RPC > # pyarrow + libhdfs, which use the Java client under the hood. > # python + C interface of go / rust ozone library. I created POC go / rust > clients earlier which can be improved if the libhdfs interface is not good > enough. [By [~elek]] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2443) Python client/interface for Ozone
[ https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Cheng updated HDDS-2443: --- Attachment: OzoneS3.py > Python client/interface for Ozone > - > > Key: HDDS-2443 > URL: https://issues.apache.org/jira/browse/HDDS-2443 > Project: Hadoop Distributed Data Store > Issue Type: New Feature > Components: Ozone Client >Reporter: Li Cheng >Priority: Major > Attachments: OzoneS3.py > > > Original ideas: item#25 in > [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors] > Ozone Client(Python) for Data Science Notebook such as Jupyter. > # Size: Large > # PyArrow: [https://pypi.org/project/pyarrow/] > # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API > Impala uses libhdfs > > Path to try: > # s3 interface: Ozone s3 gateway(already supported) + AWS python client > (boto3) > # python native RPC > # pyarrow + libhdfs, which use the Java client under the hood. > # python + C interface of go / rust ozone library. I created POC go / rust > clients earlier which can be improved if the libhdfs interface is not good > enough. [By [~elek]] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-1701) Move dockerbin script to libexec
[ https://issues.apache.org/jira/browse/HDDS-1701?focusedWorklogId=340452=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340452 ] ASF GitHub Bot logged work on HDDS-1701: Author: ASF GitHub Bot Created on: 08/Nov/19 11:22 Start Date: 08/Nov/19 11:22 Worklog Time Spent: 10m Work Description: elek commented on pull request #8: HDDS-1701. Move dockerbin script to libexec. URL: https://github.com/apache/hadoop-docker-ozone/pull/8 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340452) Time Spent: 0.5h (was: 20m) > Move dockerbin script to libexec > > > Key: HDDS-1701 > URL: https://issues.apache.org/jira/browse/HDDS-1701 > Project: Hadoop Distributed Data Store > Issue Type: Bug >Reporter: Eric Yang >Assignee: YiSheng Lien >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Ozone tarball structure contains a new bin script directory called dockerbin. > These utility scripts can be relocated to OZONE_HOME/libexec because they are > internal binaries that are not intended to be executed directly by users or > shell scripts.
[jira] [Assigned] (HDDS-2369) Fix typo in param description.
[ https://issues.apache.org/jira/browse/HDDS-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marton Elek reassigned HDDS-2369: - Assignee: Neo Yang > Fix typo in param description. > -- > > Key: HDDS-2369 > URL: https://issues.apache.org/jira/browse/HDDS-2369 > Project: Hadoop Distributed Data Store > Issue Type: Task >Reporter: YiSheng Lien >Assignee: Neo Yang >Priority: Trivial > Labels: newbie, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > In many addAcl(), the annotation param acl should be > {code} > ozone acl to be added. > {code} > but now is > {code} > ozone acl top be added. > {code} > The files as follows: > {code} > hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/protocol/ClientProtocol.java > 614: * @param acl ozone acl top be added. > hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/rpc/RpcClient.java > 1029: * @param acl ozone acl top be added. > hadoop-ozone/client/src/main/java/org/apache/hadoop/ozone/client/ObjectStore.java > 453: * @param acl ozone acl top be added. > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/IOzoneAcl.java > 36: * @param acl ozone acl top be added. > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/PrefixManagerImpl.java > 96: * @param acl ozone acl top be added. > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/VolumeManagerImpl.java > 481: * @param acl ozone acl top be added. > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/BucketManagerImpl.java > 379: * @param acl ozone acl top be added. > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java > 1475: * @param acl ozone acl top be added. > hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java > 2868: * @param acl ozone acl top be added. 
> hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocol/OzoneManagerProtocol.java > 486: * @param acl ozone acl top be added. > hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/protocolPB/OzoneManagerProtocolClientSideTranslatorPB.java > 1405: * @param acl ozone acl top be added. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2445) Replace ToStringBuilder in BlockData
[ https://issues.apache.org/jira/browse/HDDS-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2445: --- Status: Patch Available (was: In Progress) > Replace ToStringBuilder in BlockData > > > Key: HDDS-2445 > URL: https://issues.apache.org/jira/browse/HDDS-2445 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance, pull-request-available > Attachments: blockdata.png, setchunks.png > > Time Spent: 10m > Remaining Estimate: 0h > > {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. > This has a few problems: > # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized > # the default buffer is 512 bytes, more than needed here > # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or > {{StringBuffer}} for their {{toString}} implementation, leading to several > allocations and copies > The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% > of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2445) Replace ToStringBuilder in BlockData
[ https://issues.apache.org/jira/browse/HDDS-2445?focusedWorklogId=340447=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340447 ] ASF GitHub Bot logged work on HDDS-2445: Author: ASF GitHub Bot Created on: 08/Nov/19 11:06 Start Date: 08/Nov/19 11:06 Worklog Time Spent: 10m Work Description: adoroszlai commented on pull request #132: HDDS-2445. Replace ToStringBuilder in BlockData URL: https://github.com/apache/hadoop-ozone/pull/132

## What changes were proposed in this pull request?

Eliminate `ToStringBuilder` from `BlockData`. Use a single `StringBuilder` to collect parts of the final result. Also avoid stream-processing in `setChunks` for the special cases of 0 or 1 elements.

https://issues.apache.org/jira/browse/HDDS-2445

## How was this patch tested?

Added benchmark with various implementations.

```
bin/ozone genesis -benchmark BenchmarkBlockDataToString
```

Normalized GC allocation rates are below (absolute values are not important, only relative to one another). Using a single string builder saves ~78% of allocations compared to the current implementation (`ToStringBuilderDefaultCapacity`).

```
Benchmark                       (capacity)  (count)   Mode  Cnt        Score    Error  Units
PushDownStringBuilder                  112     1000  thrpt   20   503403.364 ±  6.593   B/op
InlineStringBuilder                    112     1000  thrpt   20   503625.761 ±  2.665   B/op
SimpleStringBuilder                    112     1000  thrpt   20  1133643.831 ±  4.051   B/op
ToStringBuilder                        112     1000  thrpt   20  1429626.864 ±  7.415   B/op
Concatenation                          112     1000  thrpt   20  1523808.749 ± 13.819   B/op
ToStringBuilderDefaultCapacity         112     1000  thrpt   20  2229699.096 ±  6.739   B/op
```

Added a simple unit test to verify the output is unchanged. Stream-processing change is verified by existing `TestBlockData#testSetChunks`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340447) Remaining Estimate: 0h Time Spent: 10m > Replace ToStringBuilder in BlockData > > > Key: HDDS-2445 > URL: https://issues.apache.org/jira/browse/HDDS-2445 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance, pull-request-available > Attachments: blockdata.png, setchunks.png > > Time Spent: 10m > Remaining Estimate: 0h > > {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. > This has a few problems: > # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized > # the default buffer is 512 bytes, more than needed here > # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or > {{StringBuffer}} for their {{toString}} implementation, leading to several > allocations and copies > The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% > of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
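The single-`StringBuilder` technique described in the PR above can be sketched roughly as follows. This is a simplified stand-in, not the actual Ozone class: the field names and output layout are illustrative assumptions, and only the technique (one pre-sized `StringBuilder` instead of `ToStringBuilder`'s synchronized, 512-byte-default `StringBuffer`) mirrors the patch.

```java
// Hypothetical sketch of a toString() built with one StringBuilder.
// Field names and format are assumptions, not the real BlockData layout.
class BlockDataSketch {
  private final long containerId;
  private final long localId;
  private final int chunkCount;

  BlockDataSketch(long containerId, long localId, int chunkCount) {
    this.containerId = containerId;
    this.localId = localId;
    this.chunkCount = chunkCount;
  }

  @Override
  public String toString() {
    // Pre-size the builder for the expected output to avoid reallocation;
    // 112 mirrors the capacity used in the PR's benchmark.
    return new StringBuilder(112)
        .append("blockId: [containerID: ").append(containerId)
        .append(", localID: ").append(localId)
        .append("], size: ").append(chunkCount)
        .toString();
  }

  public static void main(String[] args) {
    System.out.println(new BlockDataSketch(1, 2, 3));
  }
}
```

A single chained builder also avoids the nested `toString` allocations the issue attributes to `BlockID` and `ContainerBlockID`: the parts are appended directly instead of being materialized as intermediate strings.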
[jira] [Updated] (HDDS-2445) Replace ToStringBuilder in BlockData
[ https://issues.apache.org/jira/browse/HDDS-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2445: - Labels: perfomance pull-request-available (was: perfomance) > Replace ToStringBuilder in BlockData > > > Key: HDDS-2445 > URL: https://issues.apache.org/jira/browse/HDDS-2445 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance, pull-request-available > Attachments: blockdata.png, setchunks.png > > > {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. > This has a few problems: > # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized > # the default buffer is 512 bytes, more than needed here > # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or > {{StringBuffer}} for their {{toString}} implementation, leading to several > allocations and copies > The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% > of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
[ https://issues.apache.org/jira/browse/HDDS-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2446: - Labels: pull-request-available (was: ) > ContainerReplica should contain DatanodeInfo rather than DatanodeDetails > > > Key: HDDS-2446 > URL: https://issues.apache.org/jira/browse/HDDS-2446 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task > Components: SCM >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > > The ContainerReplica object is used by the SCM to track containers reported > by the datanodes. The current fields stored in ContainerReplica are: > {code} > final private ContainerID containerID; > final private ContainerReplicaProto.State state; > final private DatanodeDetails datanodeDetails; > final private UUID placeOfBirth; > {code} > Now we have introduced decommission and maintenance mode, the replication > manager (and potentially other parts of the code) need to know the status of > the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to > make replication decisions. > The DatanodeDetails object does not carry this information, however the > DatanodeInfo object extends DatanodeDetails and does carry the required > information. > As DatanodeInfo extends DatanodeDetails, any place which needs a > DatanodeDetails can accept a DatanodeInfo instead. > In this Jira I propose we change the DatanodeDetails stored in > ContainerReplica to DatanodeInfo. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
[ https://issues.apache.org/jira/browse/HDDS-2446?focusedWorklogId=340445=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-340445 ] ASF GitHub Bot logged work on HDDS-2446: Author: ASF GitHub Bot Created on: 08/Nov/19 11:02 Start Date: 08/Nov/19 11:02 Worklog Time Spent: 10m Work Description: sodonnel commented on pull request #131: HDDS-2446 - ContainerReplica should contain DatanodeInfo rather than DatanodeDetails URL: https://github.com/apache/hadoop-ozone/pull/131

The ContainerReplica object is used by the SCM to track containers reported by the datanodes. The current fields stored in ContainerReplica are:

```
final private ContainerID containerID;
final private ContainerReplicaProto.State state;
final private DatanodeDetails datanodeDetails;
final private UUID placeOfBirth;
```

Now we have introduced decommission and maintenance mode, the replication manager (and potentially other parts of the code) need to know the status of the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to make replication decisions. The DatanodeDetails object does not carry this information, however the DatanodeInfo object extends DatanodeDetails and does carry the required information. As DatanodeInfo extends DatanodeDetails, any place which needs a DatanodeDetails can accept a DatanodeInfo instead. In this PR I propose we change the DatanodeDetails stored in ContainerReplica to DatanodeInfo. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 340445) Remaining Estimate: 0h Time Spent: 10m > ContainerReplica should contain DatanodeInfo rather than DatanodeDetails > > > Key: HDDS-2446 > URL: https://issues.apache.org/jira/browse/HDDS-2446 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task > Components: SCM >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The ContainerReplica object is used by the SCM to track containers reported > by the datanodes. The current fields stored in ContainerReplica are: > {code} > final private ContainerID containerID; > final private ContainerReplicaProto.State state; > final private DatanodeDetails datanodeDetails; > final private UUID placeOfBirth; > {code} > Now we have introduced decommission and maintenance mode, the replication > manager (and potentially other parts of the code) need to know the status of > the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to > make replication decisions. > The DatanodeDetails object does not carry this information, however the > DatanodeInfo object extends DatanodeDetails and does carry the required > information. > As DatanodeInfo extends DatanodeDetails, any place which needs a > DatanodeDetails can accept a DatanodeInfo instead. > In this Jira I propose we change the DatanodeDetails stored in > ContainerReplica to DatanodeInfo. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
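The subtype relationship that makes this change safe can be sketched with simplified stand-ins. The class bodies below are illustrative assumptions (the real `DatanodeDetails`/`DatanodeInfo` carry far more state); the point is only that narrowing the stored field to the subclass leaves every caller that expects the superclass unchanged.

```java
// Minimal sketch: ContainerReplica stores a DatanodeInfo, but existing
// callers that want a DatanodeDetails are still satisfied because
// DatanodeInfo IS-A DatanodeDetails. All names here are stand-ins.
class ReplicaSketch {
  static class DatanodeDetails {
    final String uuid;
    DatanodeDetails(String uuid) { this.uuid = uuid; }
  }

  // DatanodeInfo adds the operational state the replication manager needs.
  static class DatanodeInfo extends DatanodeDetails {
    final String operationalState; // e.g. IN_SERVICE, DECOMMISSIONING
    DatanodeInfo(String uuid, String state) {
      super(uuid);
      this.operationalState = state;
    }
  }

  static class ContainerReplica {
    // Field type narrowed from DatanodeDetails to DatanodeInfo.
    private final DatanodeInfo datanodeInfo;
    ContainerReplica(DatanodeInfo dn) { this.datanodeInfo = dn; }

    // Existing accessor can keep returning the supertype unchanged.
    DatanodeDetails getDatanodeDetails() { return datanodeInfo; }
    String getOperationalState() { return datanodeInfo.operationalState; }
  }

  public static void main(String[] args) {
    ContainerReplica r =
        new ContainerReplica(new DatanodeInfo("dn-1", "DECOMMISSIONING"));
    System.out.println(r.getOperationalState());
  }
}
```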
[jira] [Updated] (HDDS-2445) Replace ToStringBuilder in BlockData
[ https://issues.apache.org/jira/browse/HDDS-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2445: --- Attachment: setchunks.png > Replace ToStringBuilder in BlockData > > > Key: HDDS-2445 > URL: https://issues.apache.org/jira/browse/HDDS-2445 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance > Attachments: blockdata.png, setchunks.png > > > {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. > This has a few problems: > # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized > # the default buffer is 512 bytes, more than needed here > # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or > {{StringBuffer}} for their {{toString}} implementation, leading to several > allocations and copies > The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% > of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2445) Replace ToStringBuilder in BlockData
[ https://issues.apache.org/jira/browse/HDDS-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Doroszlai updated HDDS-2445: --- Attachment: blockdata.png > Replace ToStringBuilder in BlockData > > > Key: HDDS-2445 > URL: https://issues.apache.org/jira/browse/HDDS-2445 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance > Attachments: blockdata.png, setchunks.png > > > {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. > This has a few problems: > # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized > # the default buffer is 512 bytes, more than needed here > # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or > {{StringBuffer}} for their {{toString}} implementation, leading to several > allocations and copies > The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% > of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
[ https://issues.apache.org/jira/browse/HDDS-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDDS-2446: Description: The ContainerReplica object is used by the SCM to track containers reported by the datanodes. The current fields stored in ContainerReplica are: {code} final private ContainerID containerID; final private ContainerReplicaProto.State state; final private DatanodeDetails datanodeDetails; final private UUID placeOfBirth; {code} Now we have introduced decommission and maintenance mode, the replication manager (and potentially other parts of the code) need to know the status of the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to make replication decisions. The DatanodeDetails object does not carry this information, however the DatanodeInfo object extends DatanodeDetails and does carry the required information. As DatanodeInfo extends DatanodeDetails, any place which needs a DatanodeDetails can accept a DatanodeInfo instead. In this Jira I propose we change the DatanodeDetails stored in ContainerReplica to DatanodeInfo. was: The ContainerReplica object is used by the SCM to track containers reported by the datanodes. The current fields stored in ContainerReplica are: {code} final private ContainerID containerID; final private ContainerReplicaProto.State state; final private DatanodeDetails datanodeDetails; final private UUID placeOfBirth; {code} Now we have introduced decommission and maintenance mode, the replication manager (and potentially other parts of the code) need to know the status of the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to make replication decisions. The DatanodeDetails object does not carry this information, however the DatanodeInfo object extends DatanodeDetails and does carry the required information. As DatanodeInfo extends DatanodeDetails, any place which needs a DatanodeDetails can accept a DatanodeInfo instead. 
> ContainerReplica should contain DatanodeInfo rather than DatanodeDetails > > > Key: HDDS-2446 > URL: https://issues.apache.org/jira/browse/HDDS-2446 > Project: Hadoop Distributed Data Store > Issue Type: Sub-task > Components: SCM >Affects Versions: 0.5.0 >Reporter: Stephen O'Donnell >Assignee: Stephen O'Donnell >Priority: Major > > The ContainerReplica object is used by the SCM to track containers reported > by the datanodes. The current fields stored in ContainerReplica are: > {code} > final private ContainerID containerID; > final private ContainerReplicaProto.State state; > final private DatanodeDetails datanodeDetails; > final private UUID placeOfBirth; > {code} > Now we have introduced decommission and maintenance mode, the replication > manager (and potentially other parts of the code) need to know the status of > the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to > make replication decisions. > The DatanodeDetails object does not carry this information, however the > DatanodeInfo object extends DatanodeDetails and does carry the required > information. > As DatanodeInfo extends DatanodeDetails, any place which needs a > DatanodeDetails can accept a DatanodeInfo instead. > In this Jira I propose we change the DatanodeDetails stored in > ContainerReplica to DatanodeInfo. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2446) ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
Stephen O'Donnell created HDDS-2446: --- Summary: ContainerReplica should contain DatanodeInfo rather than DatanodeDetails Key: HDDS-2446 URL: https://issues.apache.org/jira/browse/HDDS-2446 Project: Hadoop Distributed Data Store Issue Type: Sub-task Components: SCM Affects Versions: 0.5.0 Reporter: Stephen O'Donnell Assignee: Stephen O'Donnell The ContainerReplica object is used by the SCM to track containers reported by the datanodes. The current fields stored in ContainerReplica are: {code} final private ContainerID containerID; final private ContainerReplicaProto.State state; final private DatanodeDetails datanodeDetails; final private UUID placeOfBirth; {code} Now we have introduced decommission and maintenance mode, the replication manager (and potentially other parts of the code) need to know the status of the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to make replication decisions. The DatanodeDetails object does not carry this information, however the DatanodeInfo object extends DatanodeDetails and does carry the required information. As DatanodeInfo extends DatanodeDetails, any place which needs a DatanodeDetails can accept a DatanodeInfo instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDDS-2445) Replace ToStringBuilder in BlockData
Attila Doroszlai created HDDS-2445: -- Summary: Replace ToStringBuilder in BlockData Key: HDDS-2445 URL: https://issues.apache.org/jira/browse/HDDS-2445 Project: Hadoop Distributed Data Store Issue Type: Improvement Reporter: Attila Doroszlai Assignee: Attila Doroszlai {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. This has a few problems: # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized # the default buffer is 512 bytes, more than needed here # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or {{StringBuffer}} for their {{toString}} implementation, leading to several allocations and copies The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDDS-2445) Replace ToStringBuilder in BlockData
[ https://issues.apache.org/jira/browse/HDDS-2445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDDS-2445 started by Attila Doroszlai. -- > Replace ToStringBuilder in BlockData > > > Key: HDDS-2445 > URL: https://issues.apache.org/jira/browse/HDDS-2445 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Attila Doroszlai >Assignee: Attila Doroszlai >Priority: Minor > Labels: perfomance > > {{BlockData#toString}} uses {{ToStringBuilder}} for ease of implementation. > This has a few problems: > # {{ToStringBuilder}} uses {{StringBuffer}}, which is synchronized > # the default buffer is 512 bytes, more than needed here > # {{BlockID}} and {{ContainerBlockID}} both use another {{StringBuilder}} or > {{StringBuffer}} for their {{toString}} implementation, leading to several > allocations and copies > The flame graph shows that {{BlockData#toString}} may be responsible for 1.5% > of total allocations while putting keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14720) DataNode shouldn't report block as bad block if the block length is Long.MAX_VALUE.
[ https://issues.apache.org/jira/browse/HDFS-14720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970059#comment-16970059 ] Xiaoqiao He commented on HDFS-14720: Thanks [~surendrasingh] and [~hemanthboyina], {quote}It just not log the warn message , even it report badblock to namenode and increase the work load for namenode.{quote} It is true. One minor suggestion, {code:java} + if (getBlock().getNumBytes() != BlockCommand.NO_ACK) {code} `BlockCommand.NO_ACK` is not very clear express here, it is better to add some comments, perhaps nice with JIRA id. FYI. Thanks. > DataNode shouldn't report block as bad block if the block length is > Long.MAX_VALUE. > --- > > Key: HDFS-14720 > URL: https://issues.apache.org/jira/browse/HDFS-14720 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Affects Versions: 3.1.1 >Reporter: Surendra Singh Lilhore >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14720.001.patch, HDFS-14720.002.patch > > > {noformat} > 2019-08-11 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Can't replicate block > BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because > on-disk length 175085 is shorter than NameNode recorded length > 9223372036854775807.{noformat} > If the block length is Long.MAX_VALUE, means file belongs to this block is > deleted from the namenode and DN got the command after deletion of file. In > this case command should be ignored. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
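The guard suggested in the comment above can be sketched as below. This is a simplified stand-in for the DataNode code path, not the actual patch: the method name is hypothetical, and `NO_ACK` is assumed to equal `Long.MAX_VALUE` (matching the 9223372036854775807 recorded length in the log), with the JIRA id in a comment as suggested.

```java
// Hypothetical sketch of the HDFS-14720 check. If the NameNode-recorded
// length is the NO_ACK sentinel, the file behind the block was already
// deleted, so the DataNode should ignore the command rather than report
// a bad block (and add load on the NameNode).
class BadBlockGuard {
  // Mirrors BlockCommand.NO_ACK; assumed here to be Long.MAX_VALUE.
  static final long NO_ACK = Long.MAX_VALUE;

  /** Returns true when the replica should be reported as a bad block. */
  static boolean shouldReportBadBlock(long onDiskLength, long recordedLength) {
    // HDFS-14720: NO_ACK means the block's file is gone on the NameNode.
    if (recordedLength == NO_ACK) {
      return false;
    }
    // Only a genuinely shorter on-disk replica is a bad block.
    return onDiskLength < recordedLength;
  }

  public static void main(String[] args) {
    // The case from the log: on-disk 175085 vs recorded Long.MAX_VALUE.
    System.out.println(shouldReportBadBlock(175085L, Long.MAX_VALUE));
  }
}
```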