[jira] [Created] (HDFS-6616) bestNode shouldn't always return the first DataNode
zhaoyunjiong created HDFS-6616: -- Summary: bestNode shouldn't always return the first DataNode Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.patch One possible solution is to choose the DataNode randomly, at the cost of ignoring the network distance. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
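As a rough sketch of the idea in the comment above (this is illustrative only, not the attached patch; the class and method names are made up), randomizing the choice gives a retry a chance of reaching a different DataNode:

{code:java}
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Sketch only: pick a random DataNode from a block's locations instead of
// always locations[0], trading network distance for retry diversity so a
// node that is unreachable from outside the cluster is not chosen every time.
class RandomNodeChooser {
  private static final Random RANDOM = new Random();

  static DatanodeInfo chooseNode(DatanodeInfo[] locations) throws IOException {
    if (locations == null || locations.length == 0) {
      throw new IOException("No DataNodes available for the block");
    }
    return locations[RANDOM.nextInt(locations.length)];
  }
}
{code}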
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.patch bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: (was: HDFS-6616.patch) bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049669#comment-14049669 ] zhaoyunjiong commented on HDFS-6616: What happened on our cluster is a very rare case. The server uses HDP2.1 and the client uses HDP1.3, so I came up with this patch. Correct me if I'm wrong: when using WebHDFS, I think it will be very rare that the client and the data are on the same host. But I agree with you that supporting exclude nodes in WebHDFS is a better idea. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-2.patch Thanks, Daryn Sharp, for your time. Updated the patch to use boolean instead of Boolean. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051076#comment-14051076 ] zhaoyunjiong commented on HDFS-6616: Yes, you are right. I never considered that a user might use WebHDFS as both the source and the target filesystem and run the distcp job on the source cluster. For our use case, we always run jobs on the target cluster and use WebHDFS as the source filesystem. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.1.patch Update patch to support exclude nodes in WebHDFS. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.1.patch, HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.2.patch Thanks, Tsz Wo Nicholas Sze and Jing Zhao. Updated the patch according to the comments: changed ExcludeDatanodesParam.NAME to excludedatanodes and changed WebHdfsFileSystem to use the exclude-datanode feature. The test failures are not related. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.1.patch, HDFS-6616.2.patch, HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
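For illustration, a client could pass the unreachable DataNodes back to the NameNode as a query parameter on the WebHDFS URL. The parameter name excludedatanodes comes from the comment above; the NameNode HTTP address and the helper class below are assumptions made for the sketch, not part of the patch:

{code:java}
import java.net.URL;

// Sketch only: build a WebHDFS OPEN URL that asks the NameNode to skip
// DataNodes that earlier attempts could not reach.
class WebHdfsOpenUrlExample {
  static URL openUrl(String nnHttpAddress, String path, String excludeDatanodes)
      throws Exception {
    // excludeDatanodes is a comma-separated list of DataNode addresses to avoid.
    return new URL("http://" + nnHttpAddress + "/webhdfs/v1" + path
        + "?op=OPEN&excludedatanodes=" + excludeDatanodes);
  }
}
{code}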
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.3.patch Update patch according to comments and fix test failures. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.1.patch, HDFS-6616.2.patch, HDFS-6616.3.patch, HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
zhaoyunjiong created HDFS-6829: -- Summary: DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster Key: HDFS-6829 URL: https://issues.apache.org/jira/browse/HDFS-6829 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 2.4.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor When we ran the command hadoop dfsadmin -refreshSuperUserGroupsConfiguration, it failed and reported the message below: 14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The serverPrincipal = doesn't confirm to the standards refreshSuperUserGroupsConfiguration: null After checking the code, I found the bug is triggered for the following reasons: 1. We didn't set CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, which is needed by RefreshUserMappingsProtocol. In DFSAdmin, if CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it will try to use DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, "")); 2. We do set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, but in hdfs-site.xml. 3. DFSAdmin didn't load hdfs-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
[ https://issues.apache.org/jira/browse/HDFS-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6829: --- Attachment: HDFS-6829.patch This patch is very simple: it uses HdfsConfiguration to load hdfs-site.xml when constructing DFSAdmin. DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster --- Key: HDFS-6829 URL: https://issues.apache.org/jira/browse/HDFS-6829 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 2.4.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6829.patch When we ran the command hadoop dfsadmin -refreshSuperUserGroupsConfiguration, it failed and reported the message below: 14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The serverPrincipal = doesn't confirm to the standards refreshSuperUserGroupsConfiguration: null After checking the code, I found the bug is triggered for the following reasons: 1. We didn't set CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, which is needed by RefreshUserMappingsProtocol. In DFSAdmin, if CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it will try to use DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, "")); 2. We do set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, but in hdfs-site.xml. 3. DFSAdmin didn't load hdfs-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
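A minimal sketch of the fix as described (assuming the standard Tool-style entry point; this is not the exact patch):

{code:java}
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;

// Sketch only: constructing DFSAdmin with an HdfsConfiguration pulls in
// hdfs-default.xml and hdfs-site.xml as default resources, so the fallback
// from hadoop.security.service.user.name.key to
// dfs.namenode.kerberos.principal can actually find the principal that is
// configured in hdfs-site.xml.
public class DfsAdminWithHdfsSite {
  public static void main(String[] args) throws Exception {
    int exitCode = new DFSAdmin(new HdfsConfiguration()).run(args);
    System.exit(exitCode);
  }
}
{code}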
[jira] [Created] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
zhaoyunjiong created HDFS-7044: -- Summary: Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory that has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
[ https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7044: --- Attachment: Retention policy design.pdf Attached a simple design document. The major differences between HDFS-7044 and HDFS-6382 are (please correct me if I'm wrong; I only just learned that HDFS-6382 is trying to solve the same problem): # HDFS-6382 is a standalone daemon outside the NameNode, while HDFS-7044 will be inside the NameNode; I believe HDFS-7044 will be simpler and more efficient. # HDFS-7044 allows the user to set a policy based on access time or modify time, while HDFS-6382 only supports one TTL. Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: Retention policy design.pdf The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory that has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
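To make the three rules concrete, here is a sketch of the expiry check. The field and method names are illustrative and not taken from the design document; times are milliseconds since the epoch, and a retention time of zero or less means the rule is not set:

{code:java}
// Sketch only: a retention policy as it might be stored on a directory.
class RetentionPolicySketch {
  long retentionTimeForAccess;
  long retentionTimeForModify;

  boolean shouldDelete(long accessTime, long modifyTime, long now) {
    boolean accessExpired = retentionTimeForAccess > 0
        && accessTime + retentionTimeForAccess < now;
    boolean modifyExpired = retentionTimeForModify > 0
        && modifyTime + retentionTimeForModify < now;
    if (retentionTimeForAccess > 0 && retentionTimeForModify > 0) {
      // Rule 3: both the access-time and the modify-time windows must have expired.
      return accessExpired && modifyExpired;
    }
    // Rule 1 or 2: only one window is configured.
    return accessExpired || modifyExpired;
  }
}
{code}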
[jira] [Created] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
zhaoyunjiong created HDFS-6133: -- Summary: Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch This patch makes the Balancer skip blocks that belong to HBase. Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943050#comment-13943050 ] zhaoyunjiong commented on HDFS-6133: Thanks for your review, stack. I wasn't aware of HDFS-4420 when I created this issue. The problem we are trying to solve is the same, but with a very different approach. The performance of HDFS-4420 does not seem very good when the excluded path has a huge number of blocks. I just used hbase as an example, but hbase is also the main use case for this feature. For now it only accepts one exclude path; supporting multiple paths is a good idea, and I can upload a new patch next week. It only runs manually. Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Summary: Make Balancer support exclude specified path (was: Make Balancer support don't move blocks belongs to Hbase) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch This patch supports excluding multiple paths. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
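As a sketch of the idea (illustrative names only; the actual patch changes getBlocks on the NameNode side), a block is skipped when the path of the file it belongs to starts with any excluded prefix:

{code:java}
import java.util.Arrays;
import java.util.List;

// Sketch only: decide whether a block should be excluded from balancing based
// on the full path of the file it belongs to. The helper name and the way the
// path reaches this check are assumptions for illustration.
class ExcludePathFilter {
  private final List<String> excludedPrefixes;

  ExcludePathFilter(List<String> excludedPrefixes) {
    this.excludedPrefixes = excludedPrefixes;
  }

  boolean isExcluded(String filePath) {
    for (String prefix : excludedPrefixes) {
      if (filePath.startsWith(prefix)) {
        return true;  // e.g. "/hbase/table1/..." matches the "/hbase" prefix
      }
    }
    return false;
  }

  public static void main(String[] args) {
    ExcludePathFilter filter = new ExcludePathFilter(Arrays.asList("/hbase"));
    System.out.println(filter.isExcluded("/hbase/table1/region1/blockfile"));  // true
    System.out.println(filter.isExcluded("/user/data/part-00000"));            // false
  }
}
{code}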
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Status: Patch Available (was: Open) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6228) comments typo fix for FsDatasetImpl.java
zhaoyunjiong created HDFS-6228: -- Summary: comments typo fix for FsDatasetImpl.java Key: HDFS-6228 URL: https://issues.apache.org/jira/browse/HDFS-6228 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6228) comments typo fix for FsDatasetImpl.java
[ https://issues.apache.org/jira/browse/HDFS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6228: --- Attachment: HDFS-6228.patch A patch fixing a typo in a comment: - * @param estimateBlockLen estimate generation stamp + * @param estimateBlockLen estimate block length comments typo fix for FsDatasetImpl.java Key: HDFS-6228 URL: https://issues.apache.org/jira/browse/HDFS-6228 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Trivial Attachments: HDFS-6228.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6228) comments typo fix for FsDatasetImpl.java
[ https://issues.apache.org/jira/browse/HDFS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6228: --- Status: Patch Available (was: Open) comments typo fix for FsDatasetImpl.java Key: HDFS-6228 URL: https://issues.apache.org/jira/browse/HDFS-6228 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Trivial Attachments: HDFS-6228.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-4420) Provide a way to exclude subtree from balancing process
[ https://issues.apache.org/jira/browse/HDFS-4420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971498#comment-13971498 ] zhaoyunjiong commented on HDFS-4420: Hi Yongjun, could you check https://issues.apache.org/jira/browse/HDFS-6133, which has the same idea with a different approach? Provide a way to exclude subtree from balancing process --- Key: HDFS-4420 URL: https://issues.apache.org/jira/browse/HDFS-4420 Project: Hadoop HDFS Issue Type: Improvement Components: balancer Reporter: Max Lapan Priority: Minor Attachments: Balancer-exclude-subtree-0.90.2.patch, Balancer-exclude-trunk-v2.patch, Balancer-exclude-trunk-v3.patch, Balancer-exclude-trunk.patch, HDFS-4420-v4.patch During balancer operation, it balances all blocks, regardless of their filesystem hierarchy. Sometimes it would be useful to exclude some subtree from the balancing process. For example, RegionServer data locality is crucial for HBase performance. A region's data is tied to RegionServers, which reside on specific machines in the cluster. During operation, RegionServers read and write the region's data, and after some time all of this data resides on the local machine, so all reads become local, which is great for performance. The Balancer breaks this locality during operation by moving blocks around. This patch adds an [-exclude path] switch, and, if a path is provided, the Balancer will not move blocks under this path during operation. The attached patch has been tested on 0.90.2. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-2139: --- Attachment: HDFS-2139.patch It seems Pritam doesn't have time to create a patch for Apache, and I do think using hard links to copy data between pools is a good idea, so based on Facebook's version of FastCopy I created this patch to copy files between pools. Compared to the original FastCopy, this patch only uses hard links to do the copy. It's an early version; it works on my test cluster, which only has 6 datanodes. Please let me know if I need to change the name or create a new issue. Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows: 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top-of-the-rack data transfers. Note: An extra improvement would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode. -- This message was sent by Atlassian JIRA (v6.2#6252)
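The DataNode-local step can be illustrated with a plain hard link. This is a sketch with made-up paths; real block and meta files live under the DataNode's configured data directories and include generation stamps in their names:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: hard-link an existing finalized block file (and its meta file)
// into the destination block's location instead of copying the bytes, so no
// block data actually moves on disk.
class HardLinkBlockSketch {
  static void linkBlock(String srcBlockFile, String dstBlockFile) throws IOException {
    Path src = Paths.get(srcBlockFile);
    Path dst = Paths.get(dstBlockFile);
    // dst becomes a hard link to the same on-disk data as src.
    Files.createLink(dst, src);
    Files.createLink(Paths.get(dstBlockFile + ".meta"), Paths.get(srcBlockFile + ".meta"));
  }
}
{code}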
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch Thanks, Yongjun Zhang and Benoy Antony, for the review. Updated the patches according to the comments, except for the corner case where /a/b covers /a/b/c; I do believe users won't do that. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch, HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch Uploaded a patch according to the comments. By the way, do we have the new BM service design? Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984113#comment-13984113 ] zhaoyunjiong commented on HDFS-6133: Yes, block pinning works. By the way, where do you think is the best place to store the pinning information? If it is saved in Block, it seems it will cost a lot of memory. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch.1 This patch sets the sticky bit on the block file if the DFSClient has the favored-nodes hint set, and the Balancer refuses to move such blocks. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
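As a sketch of how the moving side could recognize a pinned replica (reading the POSIX mode this way is an assumption about the mechanism; the code that sets the bit on the DataNode is not shown, and the names are illustrative):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: a block replica is treated as "pinned" when its block file has
// the sticky bit (octal 01000) set, and the Balancer refuses to move it.
// Reading "unix:mode" requires a POSIX filesystem.
class PinnedBlockCheck {
  private static final int STICKY_BIT = 01000;

  static boolean isPinned(Path blockFile) throws IOException {
    int mode = (Integer) Files.getAttribute(blockFile, "unix:mode");
    return (mode & STICKY_BIT) != 0;
  }
}
{code}

One design consequence worth noting: keeping the pin on the block file itself means no extra per-block state has to live in NameNode memory, which is the concern raised in the comment above.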
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch.1) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-1.patch Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13991492#comment-13991492 ] zhaoyunjiong commented on HDFS-6133: I'll use boolean instead of Boolean. Yes, the NN may not grant all the requested/favored nodes. The best way would be to pin only the blocks that landed on the favored nodes, but since the probability that the NN didn't grant all the favored nodes is small, I just pinned them all. I was also wondering whether I should provide an API that lets users pin and unpin blocks after the file is created. That might be more useful than combining pinning with favored nodes. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-2139: --- Attachment: HDFS-2139.patch Thanks, Guo Ruijing and Daryn Sharp, for your time. Updated the patch according to the comments: 1. added clone in DistributedFileSystem; 2. added block token checks; 3. support cloning part of a file: the last block still uses a hard link, and then truncateBlock adjusts the block size and the meta file. Yes, the DN enforces no linking of UC blocks. Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch, HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows: 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top-of-the-rack data transfers. Note: An extra improvement would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong moved MAPREDUCE-5415 to HDFS-5028: --- Assignee: (was: zhaoyunjiong) Affects Version/s: (was: 1.2.0) 1.2.0 Key: HDFS-5028 (was: MAPREDUCE-5415) Project: Hadoop HDFS (was: Hadoop Map/Reduce) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.0 Reporter: zhaoyunjiong Attachments: MAPREDUCE-5415.patch In LeaseRenewer, when renew() throws a SocketTimeoutException, c.abort() will remove one DFSClient from dfsclients. This throws a ConcurrentModificationException, because dfsclients is modified after the iterator was created by for (DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
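A sketch of the usual fix for this pattern (not necessarily the attached patch): take a snapshot of the client list before iterating, so abort() can remove entries from the live list without breaking the loop. DFSClientStub below stands in for org.apache.hadoop.hdfs.DFSClient purely for illustration:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch only: iterate over a copy of the client list so that c.abort(),
// which removes the client from the live list, cannot trigger a
// ConcurrentModificationException in this loop.
class LeaseRenewLoopSketch {
  interface DFSClientStub {
    void renewLease() throws java.net.SocketTimeoutException;
    void abort();
  }

  private final List<DFSClientStub> dfsclients = new ArrayList<>();

  void renewAll() {
    for (DFSClientStub c : new ArrayList<>(dfsclients)) {
      try {
        c.renewLease();
      } catch (java.net.SocketTimeoutException e) {
        c.abort();  // may remove c from dfsclients; safe because we iterate a copy
      }
    }
  }
}
{code}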
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: MAPREDUCE-5415.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.0 Reporter: zhaoyunjiong Attachments: HDFS-5028 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028 LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.0 Reporter: zhaoyunjiong Attachments: HDFS-5028 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Affects Version/s: (was: 1.2.0) 1.1.2 Fix Version/s: 1.1.3 LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.2 Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.2 Reporter: zhaoyunjiong Fix For: 1.1.3 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028.patch HDFS-5028-1.1.2.patch Update patch for both trunk and 1.1.2. LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.2 Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-1.1.2.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Affects Version/s: (was: 1.1.2) 1.1.0 LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028-1.1.2.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028-branch-1.1.patch LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-branch-1.1.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723253#comment-13723253 ] zhaoyunjiong commented on HDFS-5028: dfsclients is synchronized. The problem here is the Iterator. You can get more information here: http://stackoverflow.com/questions/8189466/java-util-concurrentmodificationexception In short: The iterators returned by this class's iterator and listIterator methods are fail-fast: if the list is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. c.abort() will remove c (a DFSClient) from dfsclients, so the iterator generated by for(DFSClient c : dfsclients) will throw ConcurrentModificationException. LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-branch-1.1.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
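The fail-fast behaviour quoted above is easy to reproduce with plain java.util collections. A minimal, self-contained sketch (toy strings standing in for DFSClient instances, not the actual LeaseRenewer code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class FailFastDemo {
    public static void main(String[] args) {
        List<String> dfsclients = new ArrayList<>(Arrays.asList("c1", "c2", "c3"));
        try {
            for (String c : dfsclients) {         // the enhanced for-loop uses the list's own iterator
                if (c.equals("c1")) {
                    dfsclients.remove(c);         // structural modification outside the iterator, like c.abort()
                }
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("caught: " + e);   // thrown on the next call to iterator.next()
        }
    }
}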
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.1.3 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028-branch-1.1.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.1.3 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028.patch HDFS-5028-branch-1.1.patch Thanks Nicholas. Change dfsclients.get(dfsclients.size() - 1).abort() to dfsclients.get(0).abort(). LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-branch-1.1.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
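The index-based abort described above sidesteps the fail-fast iterator entirely: each abort() removes that client from the shared list, so draining via get(0) never touches a live iterator. A rough, self-contained sketch of the idea with toy types (not the actual DFSClient or LeaseRenewer classes):

import java.util.ArrayList;
import java.util.List;

public class AbortByIndexSketch {
    static class Client {
        final String name;
        final List<Client> registry;
        Client(String name, List<Client> registry) { this.name = name; this.registry = registry; }
        void abort() {
            System.out.println("aborting " + name);
            registry.remove(this);                // the side effect that broke the for-each version
        }
    }

    public static void main(String[] args) {
        List<Client> dfsclients = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            dfsclients.add(new Client("client-" + i, dfsclients));
        }
        while (!dfsclients.isEmpty()) {
            dfsclients.get(0).abort();            // index access, no iterator, no ConcurrentModificationException
        }
    }
}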
[jira] [Created] (HDFS-5247) Namenode should close editlog and unlock storage when removing failed storage dir
zhaoyunjiong created HDFS-5247: -- Summary: Namenode should close editlog and unlock storage when removing failed storage dir Key: HDFS-5247 URL: https://issues.apache.org/jira/browse/HDFS-5247 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.1 When one of the dfs.name.dir directories failed, the namenode didn't close the editlog and unlock the storage: java 24764 hadoop 78uW REG 252,32 0 393219 /volume1/nn/dfs/in_use.lock (deleted) java 24764 hadoop 107u REG 252,32 1155072 393229 /volume1/nn/dfs/current/edits.new (deleted) java 24764 hadoop 119u REG 252,32 0 393238 /volume1/nn/dfs/current/fstime.tmp java 24764 hadoop 140u REG 252,32 1761805 393239 /volume1/nn/dfs/current/edits If this dir failed because it ran out of space, then restoring this storage may fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-5247) Namenode should close editlog and unlock storage when removing failed storage dir
[ https://issues.apache.org/jira/browse/HDFS-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777073#comment-13777073 ] zhaoyunjiong commented on HDFS-5247: I'm talking about the failed directory. Our case was caused by running out of space on that disk. In this case, it needs to, and should, close those two files. And I believe trying to close them won't make things worse. Namenode should close editlog and unlock storage when removing failed storage dir - Key: HDFS-5247 URL: https://issues.apache.org/jira/browse/HDFS-5247 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.1 Attachments: HDFS-5247-branch-1.2.patch When one of the dfs.name.dir directories failed, the namenode didn't close the editlog and unlock the storage: java 24764 hadoop 78uW REG 252,32 0 393219 /volume1/nn/dfs/in_use.lock (deleted) java 24764 hadoop 107u REG 252,32 1155072 393229 /volume1/nn/dfs/current/edits.new (deleted) java 24764 hadoop 119u REG 252,32 0 393238 /volume1/nn/dfs/current/fstime.tmp java 24764 hadoop 140u REG 252,32 1761805 393239 /volume1/nn/dfs/current/edits If this dir failed because it ran out of space, then restoring this storage may fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
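A hypothetical sketch of the cleanup argued for above: close the edit stream and release the in_use.lock of the failed directory on a best-effort basis, so a later restore can succeed. The names below (EditStream, StorageLock, removeFailedDir, FailedDirCleanupSketch) are illustrative only, not the actual NameNode classes:

import java.io.Closeable;
import java.io.IOException;

class FailedDirCleanupSketch {
    interface EditStream extends Closeable {}    // stands in for the per-directory edits output stream
    interface StorageLock extends Closeable {}   // stands in for the in_use.lock file lock

    static void removeFailedDir(EditStream edits, StorageLock lock) {
        closeQuietly(edits);                     // best effort: a close failure should not make things worse
        closeQuietly(lock);                      // releases in_use.lock so the directory can be restored later
    }

    private static void closeQuietly(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException e) {
            // log and continue; the directory is already considered failed
        }
    }
}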
[jira] [Created] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
zhaoyunjiong created HDFS-5367: -- Summary: Restore fsimage locked NameNode too long when the size of fsimage are big Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS. After the NFS mount temporarily failed, the NameNode tried to recover it while doing a checkpoint, which means saving the 40G fsimage back to NFS. That takes some time (40G / 128MB/s = 320 seconds), it holds the FSNamesystem lock, and this brought down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5367: --- Attachment: (was: HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big - Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5367: --- Attachment: HDFS-5367 The fsimage restored when the SecondaryNameNode calls rollEditLog will soon be replaced when the SecondaryNameNode calls rollFsImage. So I think restoring the fsimage is not necessary. Restore fsimage locked NameNode too long when the size of fsimage are big - Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5367: --- Attachment: HDFS-5367-branch-1.2.patch This patch avoid restore fsimage to make rollEditLog finished as soon as possible. Restore fsimage locked NameNode too long when the size of fsimage are big - Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5367-branch-1.2.patch Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5367) Restoring namenode storage locks namenode due to unnecessary fsimage write
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798610#comment-13798610 ] zhaoyunjiong commented on HDFS-5367: Thank you for your review. Restoring namenode storage locks namenode due to unnecessary fsimage write -- Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5367-branch-1.2.patch Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
zhaoyunjiong created HDFS-5396: -- Summary: FSImage.getFsImageName should check whether fsimage exists Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 In https://issues.apache.org/jira/browse/HDFS-5367, the fsimage may not be written to every IMAGE dir, so we need to check whether the fsimage exists before FSImage.getFsImageName returns. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5396: --- Attachment: HDFS-5396-branch-1.2.patch Check whether fsimage exists before return. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1#6144)
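A small illustrative sketch of the check attached here, using plain java.io.File with a hypothetical helper (not the actual FSImage code): return the fsimage from the first IMAGE directory that actually contains one, instead of blindly returning a directory that may still be empty after a restore:

import java.io.File;
import java.util.List;

class FsImageLookupSketch {
    static File getFsImageName(List<File> imageDirs) {
        for (File dir : imageDirs) {
            File img = new File(new File(dir, "current"), "fsimage");
            if (img.exists()) {                  // a freshly restored dir may not hold an fsimage yet (HDFS-5367)
                return img;
            }
        }
        return null;                             // no usable fsimage found in any configured IMAGE dir
    }
}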
[jira] [Resolved] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong resolved HDFS-5396. Resolution: Not A Problem FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802545#comment-13802545 ] zhaoyunjiong commented on HDFS-5396: The first image storage dir always has an fsimage file in it. Restored image storage dirs are always appended to the end. So the first one must have an fsimage in it. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5579) Under construction files make DataNode decommission take very long hours
zhaoyunjiong created HDFS-5579: -- Summary: Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0, 1.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong We noticed that sometimes decommissioning DataNodes takes a very long time, even exceeding 100 hours. After checking the code, I found that BlockManager:computeReplicationWorkForBlocks(List<List<Block>> blocksToReplicate) won't replicate blocks which belong to under construction files; however, in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is a block that needs replication, no matter whether it belongs to an under construction file or not, the decommission process will keep running. That's the reason the decommission sometimes takes a very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579.patch HDFS-5579-branch-1.2.patch This patch lets the NameNode replicate blocks that belong to under construction files, except the last block. And if a decommissioning DataNode only has blocks which are the last blocks of under construction files and which have more than 1 live replica left behind, then the NameNode can set it to DECOMMISSIONED. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
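A toy model of the decommission check described in the update above (hypothetical names and simplified types, not the real BlockManager code): a block only holds up decommission if it is under-replicated and is not the last block of an under construction file that already has more than the minimum number of live replicas:

public class DecommissionRuleSketch {
    static final int MIN_REPLICATION = 1;        // assumed default of 1 for the minimum replication setting

    static boolean blocksDecommission(boolean underConstruction, boolean isLastBlock,
                                      int liveReplicas, int expectedReplicas) {
        if (liveReplicas >= expectedReplicas) {
            return false;                        // fully replicated, nothing to wait for
        }
        if (underConstruction && isLastBlock && liveReplicas > MIN_REPLICATION) {
            return false;                        // last block may still grow; enough live copies already exist
        }
        return true;                             // still needs replication before the node can be DECOMMISSIONED
    }

    public static void main(String[] args) {
        System.out.println(blocksDecommission(true, true, 2, 3));   // false: decommission may finish
        System.out.println(blocksDecommission(true, false, 2, 3));  // true: an earlier block of an open file gets replicated first
    }
}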
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579-branch-1.2.patch HDFS-5579.patch Thanks Vinay. Update patch as your comments. Except: getLastBlock do throws IOException, I deleted it in this patch. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, HDFS-5579.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579-branch-1.2.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579.patch HDFS-5579-branch-1.2.patch Update patch, added test case for trunk. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, HDFS-5579.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579-branch-1.2.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865202#comment-13865202 ] zhaoyunjiong commented on HDFS-5579: It's already in the patch: +if (bc.isUnderConstruction()) { + if (block.equals(bc.getLastBlock()) && curReplicas > minReplication) { + continue; + } + underReplicatedInOpenFiles++; +} Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579-branch-1.2.patch HDFS-5579.patch Good point. Thanks Jing. Update patches to fix this problem. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, HDFS-5579.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579-branch-1.2.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579.patch Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870385#comment-13870385 ] zhaoyunjiong commented on HDFS-5579: Thanks for your time to review the patch, Jing. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 2.4.0 Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Reopened] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong reopened HDFS-5396: I made a mistake when I resolved this as Not A Problem. Because for (Iterator<StorageDirectory> it = dirIterator(NameNodeDirType.IMAGE); it.hasNext();) sd = it.next(); will return the last StorageDirectory of type IMAGE, but due to HDFS-5367, it may not have an fsimage in it. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
zhaoyunjiong created HDFS-5944: -- Summary: LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0, 1.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong In our cluster, we encountered an error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write, and Client A kept refreshing its lease. Client B deleted /XXX/20140206/04_30/ Client C opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then the SecondaryNameNode tried to do a checkpoint and failed, because the lease held by Client A was not deleted when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here, when prefix is /XXX/20140206/04_30/ and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srclen) is '_'. The fix is simple; I'll upload a patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
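The separator check quoted above can be reproduced in isolation. A minimal, self-contained sketch showing why a prefix that already ends with '/' never matches the path of the open file beneath it:

public class PrefixMatchBugDemo {
    public static void main(String[] args) {
        String prefix = "/XXX/20140206/04_30/";
        String p = "/XXX/20140206/04_30/_SUCCESS.slc.log";
        int srclen = prefix.length();
        System.out.println(p.charAt(srclen));                                // prints '_' rather than '/'
        System.out.println(p.length() == srclen || p.charAt(srclen) == '/'); // false, so the stale lease is never found
    }
}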
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Description: In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. was: In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is this a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint - Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: HDFS-5944.patch HDFS-5944-branch-1.2.patch This patch is very simple, if prefix ended with '/', just minus 1 from srclen, so p.charAt(srclen) could handle path correctly. LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint - Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
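A self-contained sketch of the adjustment described above, reusing the variable names quoted in the issue description (a standalone demo, not the patch itself): trimming srclen by one when the prefix ends with the separator makes the same check match:

public class PrefixMatchFixDemo {
    public static void main(String[] args) {
        String prefix = "/XXX/20140206/04_30/";
        String p = "/XXX/20140206/04_30/_SUCCESS.slc.log";
        int srclen = prefix.length();
        if (prefix.charAt(srclen - 1) == '/') {
            srclen -= 1;                          // treat "/a/b/" the same as "/a/b"
        }
        boolean match = p.length() == srclen || p.charAt(srclen) == '/';
        System.out.println(match);                // true: the lease held under the deleted directory is now found
    }
}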
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901171#comment-13901171 ] zhaoyunjiong commented on HDFS-5944: Brandon, thanks for your time reviewing this patch. I don't think users use DFSClient directly. But even using DistributedFileSystem, we can still send a path ending with / by passing a path like /a/b/../, because in getPathName, String result = makeAbsolute(file).toUri().getPath() will return /a/. About the unit test, I'd be happy to add one. I have two questions and need your help: 1. Is it enough to just write a unit test for findLeaseWithPrefixPath? 2. In trunk there is no TestLeaseManager.java; should I add one? LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint - Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, HDFS-5944.test.txt In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: HDFS-5944-branch-1.2.patch HDFS-5944.patch Updated patches with a unit test.
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: (was: HDFS-5944-branch-1.2.patch)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: (was: HDFS-5944.patch)
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905361#comment-13905361 ] zhaoyunjiong commented on HDFS-5944: Multiple trailing / is impossible.
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906435#comment-13906435 ] zhaoyunjiong commented on HDFS-5944: Thank you Brandon and Benoy.
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5396: --- Attachment: HDFS-5396-branch-1.2.patch Updated patch. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch, HDFS-5396-branch-1.2.patch As reported in https://issues.apache.org/jira/browse/HDFS-5367, the fsimage may not be written to every IMAGE directory, so we need to check whether the fsimage exists before FSImage.getFsImageName returns it. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
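A minimal sketch of the check being described, using a hypothetical helper rather than the actual FSImage code: instead of returning the fsimage path from the first image directory, return the first one whose fsimage file actually exists on disk. The directory paths in main are example values only.
{code:java}
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class FsImageLookupSketch {
  // Return the first fsimage file that actually exists, or null if none does.
  static File getExistingFsImage(List<File> imageDirs) {
    for (File dir : imageDirs) {
      File fsimage = new File(dir, "fsimage");
      if (fsimage.exists()) {
        return fsimage;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<File> imageDirs = Arrays.asList(
        new File("/data/1/dfs/name/current"),   // example paths, not real config
        new File("/data/2/dfs/name/current"));
    System.out.println(getExistingFsImage(imageDirs));
  }
}
{code}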
[jira] [Resolved] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
[ https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong resolved HDFS-7044. Resolution: Duplicate Thanks Allen Wittenauer and Zesheng Wu. After reading the comments in HDFS-6382, I now understand the concerns. Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: Retention policy design.pdf The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory that has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
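A small sketch of the three rules quoted above (hypothetical helper methods, not part of the attached design or any patch); times are milliseconds since the epoch:
{code:java}
public class RetentionRuleSketch {
  static boolean expiredByAccess(long accessTime, long retentionForAccess, long now) {
    return accessTime + retentionForAccess < now;
  }

  static boolean expiredByModify(long modifyTime, long retentionForModify, long now) {
    return modifyTime + retentionForModify < now;
  }

  static boolean expiredByBoth(long accessTime, long retentionForAccess,
                               long modifyTime, long retentionForModify, long now) {
    return expiredByAccess(accessTime, retentionForAccess, now)
        && expiredByModify(modifyTime, retentionForModify, now);
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long dayMs = 24L * 60 * 60 * 1000;
    // A file last read 10 days ago with a 7-day access retention is expired.
    System.out.println(expiredByAccess(now - 10 * dayMs, 7 * dayMs, now)); // true
  }
}
{code}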
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-3.patch Updated the patch to merge with trunk. {quote} Why do we always pass false below? new Sender(out).writeBlock(b, accessToken, clientname, targets, srcNode, stage, 0, 0, 0, 0, blockSender.getChecksum(), cachingStrategy, false); {quote} This code path happens when the NameNode asks a DataNode to send a block to another DataNode (DatanodeProtocol.DNA_TRANSFER); it is not triggered by the client, so there is no need to pin the block in this case. {quote} We will never copy a block? if (datanode.data.getPinning(block)) { String msg = "Not able to copy block " + block.getBlockId() + " " + "to " + peer.getRemoteAddressString() + " because it's pinned "; LOG.info(msg); sendResponse(ERROR, msg); } Anything to help ensure the replica count does not rot when this pinning is enabled? {quote} When the block is under-replicated, the NameNode will send a DatanodeProtocol.DNA_TRANSFER command to the DataNode, and it is handled by DataTransfer; pinning won't affect that. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer & mover, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133.patch Currently, running Balancer will destroy the Regionserver's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run Balancer without destroying the Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133-3.patch)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-3.patch Updated patch, merged with trunk.
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-4.patch Thanks Yongjun Zhang. Updated the patch according to the comments. {quote} The concept of favoredNodes pre-existed before your patch, now your patch defines that as long as favoredNodes is passed, then pinning will be done. So we are changing the prior definition of how favoredNodes are used. Why not add some additional interface to tell that pinning will happen so we have the option not to pin even if favoredNodes is passed? Not necessarily you need to do what I suggested here, but I'd like to understand your thoughts here. {quote} I think most of the time, if you use favoredNodes, you'd like to keep the block on that machine, so to keep things simple I didn't add a new interface. {quote} Do we ever need interface to do unpinning? {quote} We can add unpinning in another issue if there is a use case that needs it.
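A hypothetical sketch of the pinning behaviour discussed in this thread (not the HDFS-6133 patch itself): replicas written with favored nodes are marked pinned, a Balancer-style copy of a pinned replica is refused, and a NameNode-directed transfer for re-replication is unaffected.
{code:java}
import java.util.HashSet;
import java.util.Set;

public class BlockPinningSketch {
  private final Set<Long> pinnedBlocks = new HashSet<>();

  void writeBlock(long blockId, boolean hasFavoredNodes) {
    if (hasFavoredNodes) {
      pinnedBlocks.add(blockId); // favored-node writes imply pinning
    }
  }

  boolean copyBlockForBalancer(long blockId) {
    // Refuse to move a pinned replica off this node.
    return !pinnedBlocks.contains(blockId);
  }

  boolean transferForReplication(long blockId) {
    // A DNA_TRANSFER-style re-replication is not affected by pinning.
    return true;
  }

  public static void main(String[] args) {
    BlockPinningSketch dn = new BlockPinningSketch();
    dn.writeBlock(1001L, true);                            // written with favored nodes
    System.out.println(dn.copyBlockForBalancer(1001L));    // false: refused
    System.out.println(dn.transferForReplication(1001L));  // true: still allowed
  }
}
{code}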
[jira] [Created] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck
zhaoyunjiong created HDFS-7429: -- Summary: DomainSocketWatcher.doPoll0 stuck Key: HDFS-7429 URL: https://issues.apache.org/jira/browse/HDFS-7429 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong I found some of our DataNodes will run exceeds the limit of concurrent xciever, the limit is 4K. After check the stack, I suspect that DomainSocketWatcher.doPoll0 stuck: {quote} DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition [0x7f558d5d4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x000740df9c90 (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) -- DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable [0x7f558d3d2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition [0x7f558d7d6000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x000740df9cb0 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306) at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:745) Thread-163852 daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable [0x7f55aef6e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method) at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52) at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457) at
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Attachment: 11241025 11241023 11241021 Upload more stack trace files.
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Summary: DomainSocketWatcher.kick stuck (was: DomainSocketWatcher.doPoll0 stuck)
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Description: I found that some of our DataNodes exceed the limit of concurrent xceivers (the limit is 4K). After checking the stacks, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by DomainSocketWatcher.kick, is stuck.
[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224249#comment-14224249 ] zhaoyunjiong commented on HDFS-7429: The previous description is not right. The stuck thread happened at org.apache.hadoop.net.unix.DomainSocket.writeArray0 as below shows. {quote} $ grep -B2 -A10 DomainSocket.writeArray 1124102* 11241021-DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241021- java.lang.Thread.State: RUNNABLE 11241021: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241021- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241021- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241021- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241021- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241021- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241021- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241021- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241021- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241021- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241021- at java.lang.Thread.run(Thread.java:745) -- -- 11241023-DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241023- java.lang.Thread.State: RUNNABLE 11241023: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241023- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241023- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241023- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241023- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241023- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241023- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241023- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241023- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241023- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241023- at java.lang.Thread.run(Thread.java:745) -- -- 11241025-DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241025- java.lang.Thread.State: RUNNABLE 11241025: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241025- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241025- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241025- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241025- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241025- 
at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241025- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241025- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241025- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241025- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241025- at java.lang.Thread.run(Thread.java:745) {quote}
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Description: I found that some of our DataNodes exceed the limit of concurrent xceivers (the limit is 4K). After checking the stacks, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by DomainSocketWatcher.kick, is stuck.
[jira] [Assigned] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong reassigned HDFS-7429: -- Assignee: zhaoyunjiong
[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224325#comment-14224325 ] zhaoyunjiong commented on HDFS-7429: The problem here is that on our machines we can only send 299 bytes to the domain socket. When it tries to send the 300th byte, it blocks, and DomainSocketWatcher.add(DomainSocket sock, Handler handler) is holding the lock, so watcherThread.run can't get the lock and clear the buffer; it's a live lock. I'm not sure which configuration controls the buffer size of 299 for now; I suspect net.core.netdev_budget, which is 300 on our machines. I'll upload a patch later that limits the bytes sent, to prevent the live lock. By the way, should I move this to the HADOOP COMMON project?
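A self-contained illustration of the interaction described in this comment (not the Hadoop DomainSocketWatcher code): one thread holds a lock while doing a blocking write into a bounded channel, and the only thread that drains that channel needs the same lock first, so once the channel fills up neither side makes progress. An ArrayBlockingQueue stands in for the limited domain-socket buffer (299 bytes in the report):
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockedKickSketch {
  public static void main(String[] args) throws InterruptedException {
    ReentrantLock lock = new ReentrantLock();
    // Only 4 "kick" tokens fit before put() blocks, mimicking the full socket buffer.
    ArrayBlockingQueue<Integer> channel = new ArrayBlockingQueue<>(4);

    Thread adder = new Thread(() -> {
      lock.lock();            // the add()-style caller takes the lock ...
      try {
        for (int i = 0; i < 10; i++) {
          channel.put(i);     // ... then the kick-style write blocks once the buffer is full
        }
      } catch (InterruptedException ignored) {
        // daemon thread; nothing to clean up in this sketch
      } finally {
        lock.unlock();
      }
    }, "adder");
    adder.setDaemon(true);

    Thread watcher = new Thread(() -> {
      try {
        // The draining thread wants the same lock before it may read the channel.
        if (lock.tryLock(2, TimeUnit.SECONDS)) {
          try {
            System.out.println("watcher got the lock (no stall this run)");
          } finally {
            lock.unlock();
          }
        } else {
          System.out.println("watcher never got the lock; the channel stays full");
        }
      } catch (InterruptedException ignored) {
      }
    }, "watcher");
    watcher.setDaemon(true);

    adder.start();
    Thread.sleep(200);        // let the adder grab the lock and fill the channel
    watcher.start();
    watcher.join();
    System.out.println("adder still blocked in put(): " + adder.isAlive());
  }
}
{code}
Run as-is, the watcher thread times out waiting for the lock and the adder stays blocked in put(), which mirrors the stuck kick() described above.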
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-5.patch Thanks Yongjun Zhang. Updated the patch to fix the formatting.