[jira] [Created] (HDFS-6616) bestNode shouldn't always return the first DataNode
zhaoyunjiong created HDFS-6616: -- Summary: bestNode shouldn't always return the first DataNode Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.patch One possible solution is to choose the DataNode randomly, at the cost of ignoring the network distance. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
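As a rough sketch of the idea in the comment above (this is illustrative only, not the attached patch; the class and method names are made up), randomizing the choice gives a retry a chance of reaching a different DataNode:

{code:java}
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Sketch only: pick a random DataNode from a block's locations instead of
// always locations[0], trading network distance for retry diversity so a
// node that is unreachable from outside the cluster is not chosen every time.
class RandomNodeChooser {
  private static final Random RANDOM = new Random();

  static DatanodeInfo chooseNode(DatanodeInfo[] locations) throws IOException {
    if (locations == null || locations.length == 0) {
      throw new IOException("No DataNodes available for the block");
    }
    return locations[RANDOM.nextInt(locations.length)];
  }
}
{code}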
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.patch bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: (was: HDFS-6616.patch) bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049669#comment-14049669 ] zhaoyunjiong commented on HDFS-6616: What happened on our cluster is a very rare case. The server uses HDP2.1 and the client uses HDP1.3, so I came up with this patch. Correct me if I'm wrong: when using WebHDFS, I think it will be very rare that the client and the data are on the same host. But I agree with you that supporting exclude nodes in WebHDFS is a better idea. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-2.patch Thanks, Daryn Sharp, for your time. Updated the patch to use boolean instead of Boolean. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14051076#comment-14051076 ] zhaoyunjiong commented on HDFS-6616: Yes, you are right. I never considered that a user might use WebHDFS as both the source and the target filesystem and run the distcp job on the source cluster. For our use case, we always run jobs on the target cluster and use WebHDFS as the source filesystem. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.1.patch Update patch to support exclude nodes in WebHDFS. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.1.patch, HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.2.patch Thanks, Tsz Wo Nicholas Sze and Jing Zhao. Updated the patch according to the comments: changed ExcludeDatanodesParam.NAME to excludedatanodes and changed WebHdfsFileSystem to use the exclude-datanode feature. The test failures are not related. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.1.patch, HDFS-6616.2.patch, HDFS-6616.patch While we were doing distcp between clusters, the job failed: 2014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root cause is that one of the DataNodes cannot be reached from outside the cluster, although it is healthy inside the cluster. In NamenodeWebHdfsMethods.java, bestNode always returns the first DataNode, so even after distcp retries, the job still fails. -- This message was sent by Atlassian JIRA (v6.2#6252)
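For illustration, a client could pass the unreachable DataNodes back to the NameNode as a query parameter on the WebHDFS URL. The parameter name excludedatanodes comes from the comment above; the NameNode HTTP address and the helper class below are assumptions made for the sketch, not part of the patch:

{code:java}
import java.net.URL;

// Sketch only: build a WebHDFS OPEN URL that asks the NameNode to skip
// DataNodes that earlier attempts could not reach.
class WebHdfsOpenUrlExample {
  static URL openUrl(String nnHttpAddress, String path, String excludeDatanodes)
      throws Exception {
    // excludeDatanodes is a comma-separated list of DataNode addresses to avoid.
    return new URL("http://" + nnHttpAddress + "/webhdfs/v1" + path
        + "?op=OPEN&excludedatanodes=" + excludeDatanodes);
  }
}
{code}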
[jira] [Updated] (HDFS-6616) bestNode shouldn't always return the first DataNode
[ https://issues.apache.org/jira/browse/HDFS-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6616: --- Attachment: HDFS-6616.3.patch Update patch according to comments and fix test failures. bestNode shouldn't always return the first DataNode --- Key: HDFS-6616 URL: https://issues.apache.org/jira/browse/HDFS-6616 Project: Hadoop HDFS Issue Type: Bug Components: webhdfs Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6616.1.patch, HDFS-6616.2.patch, HDFS-6616.3.patch, HDFS-6616.patch When we are doing distcp between clusters, job failed: 014-06-30 20:56:28,430 INFO org.apache.hadoop.tools.DistCp: FAIL part-r-00101.avro : java.net.NoRouteToHostException: No route to host at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1491) at java.security.AccessController.doPrivileged(Native Method) at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1485) at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1139) at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:379) at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:322) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:419) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:547) at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:314) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232) at org.apache.hadoop.mapred.Child.main(Child.java:249) The root reason is one of the DataNode can't access from outside, but inside cluster, it's health. In NamenodeWebHdfsMethods.java:bestNode, it always return the first DataNode, so even after the distcp retries, it still failed. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
zhaoyunjiong created HDFS-6829: -- Summary: DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster Key: HDFS-6829 URL: https://issues.apache.org/jira/browse/HDFS-6829 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 2.4.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor When we ran the command hadoop dfsadmin -refreshSuperUserGroupsConfiguration, it failed and reported the message below: 14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The serverPrincipal = doesn't confirm to the standards refreshSuperUserGroupsConfiguration: null After checking the code, I found the bug is triggered for the following reasons: 1. We didn't set CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, which is needed by RefreshUserMappingsProtocol. In DFSAdmin, if CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it will try to use DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, "")); 2. We do set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, but in hdfs-site.xml. 3. DFSAdmin didn't load hdfs-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6829) DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster
[ https://issues.apache.org/jira/browse/HDFS-6829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6829: --- Attachment: HDFS-6829.patch This patch is very simple: it uses HdfsConfiguration to load hdfs-site.xml when constructing DFSAdmin. DFSAdmin refreshSuperUserGroupsConfiguration failed in security cluster --- Key: HDFS-6829 URL: https://issues.apache.org/jira/browse/HDFS-6829 Project: Hadoop HDFS Issue Type: Bug Components: tools Affects Versions: 2.4.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Minor Attachments: HDFS-6829.patch When we ran the command hadoop dfsadmin -refreshSuperUserGroupsConfiguration, it failed and reported the message below: 14/08/05 21:32:06 WARN security.MultiRealmUserAuthentication: The serverPrincipal = doesn't confirm to the standards refreshSuperUserGroupsConfiguration: null After checking the code, I found the bug is triggered for the following reasons: 1. We didn't set CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, which is needed by RefreshUserMappingsProtocol. In DFSAdmin, if CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY is not set, it will try to use DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY: conf.set(CommonConfigurationKeys.HADOOP_SECURITY_SERVICE_USER_NAME_KEY, conf.get(DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, "")); 2. We do set DFSConfigKeys.DFS_NAMENODE_KERBEROS_PRINCIPAL_KEY, but in hdfs-site.xml. 3. DFSAdmin didn't load hdfs-site.xml. -- This message was sent by Atlassian JIRA (v6.2#6252)
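A minimal sketch of the fix as described (assuming the standard Tool-style entry point; this is not the exact patch):

{code:java}
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;

// Sketch only: constructing DFSAdmin with an HdfsConfiguration pulls in
// hdfs-default.xml and hdfs-site.xml as default resources, so the fallback
// from hadoop.security.service.user.name.key to
// dfs.namenode.kerberos.principal can actually find the principal that is
// configured in hdfs-site.xml.
public class DfsAdminWithHdfsSite {
  public static void main(String[] args) throws Exception {
    int exitCode = new DFSAdmin(new HdfsConfiguration()).run(args);
    System.exit(exitCode);
  }
}
{code}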
[jira] [Created] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
zhaoyunjiong created HDFS-7044: -- Summary: Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory that has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
[ https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7044: --- Attachment: Retention policy design.pdf Attached a simple design document. The major differences between HDFS-7044 and HDFS-6382 are (please correct me if I'm wrong; I only just learned that HDFS-6382 is trying to solve the same problem): # HDFS-6382 is a standalone daemon outside the NameNode, while HDFS-7044 will be inside the NameNode; I believe HDFS-7044 will be simpler and more efficient. # HDFS-7044 allows the user to set a policy based on access time or modify time, while HDFS-6382 only supports one TTL. Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: Retention policy design.pdf The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory that has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
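To make the three rules concrete, here is a sketch of the expiry check. The field and method names are illustrative and not taken from the design document; times are milliseconds since the epoch, and a retention time of zero or less means the rule is not set:

{code:java}
// Sketch only: a retention policy as it might be stored on a directory.
class RetentionPolicySketch {
  long retentionTimeForAccess;
  long retentionTimeForModify;

  boolean shouldDelete(long accessTime, long modifyTime, long now) {
    boolean accessExpired = retentionTimeForAccess > 0
        && accessTime + retentionTimeForAccess < now;
    boolean modifyExpired = retentionTimeForModify > 0
        && modifyTime + retentionTimeForModify < now;
    if (retentionTimeForAccess > 0 && retentionTimeForModify > 0) {
      // Rule 3: both the access-time and the modify-time windows must have expired.
      return accessExpired && modifyExpired;
    }
    // Rule 1 or 2: only one window is configured.
    return accessExpired || modifyExpired;
  }
}
{code}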
[jira] [Created] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
zhaoyunjiong created HDFS-6133: -- Summary: Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch This patch makes the Balancer skip blocks that belong to HBase. Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6133) Make Balancer support don't move blocks belongs to Hbase
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13943050#comment-13943050 ] zhaoyunjiong commented on HDFS-6133: Thanks for your review, stack. I wasn't aware of HDFS-4420 when I created this issue. The problem we are trying to solve is the same, but with a very different approach. The performance of HDFS-4420 does not seem very good when the excluded path has a huge number of blocks. I just used hbase as an example, but hbase is also the main use case for this feature. For now it only accepts one exclude path; supporting multiple paths is a good idea, and I can upload a new patch next week. It only runs manually. Make Balancer support don't move blocks belongs to Hbase Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Summary: Make Balancer support exclude specified path (was: Make Balancer support don't move blocks belongs to Hbase) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch This patch supports excluding multiple paths. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
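As a sketch of the idea (illustrative names only; the actual patch changes getBlocks on the NameNode side), a block is skipped when the path of the file it belongs to starts with any excluded prefix:

{code:java}
import java.util.Arrays;
import java.util.List;

// Sketch only: decide whether a block should be excluded from balancing based
// on the full path of the file it belongs to. The helper name and the way the
// path reaches this check are assumptions for illustration.
class ExcludePathFilter {
  private final List<String> excludedPrefixes;

  ExcludePathFilter(List<String> excludedPrefixes) {
    this.excludedPrefixes = excludedPrefixes;
  }

  boolean isExcluded(String filePath) {
    for (String prefix : excludedPrefixes) {
      if (filePath.startsWith(prefix)) {
        return true;  // e.g. "/hbase/table1/..." matches the "/hbase" prefix
      }
    }
    return false;
  }

  public static void main(String[] args) {
    ExcludePathFilter filter = new ExcludePathFilter(Arrays.asList("/hbase"));
    System.out.println(filter.isExcluded("/hbase/table1/region1/blockfile"));  // true
    System.out.println(filter.isExcluded("/user/data/part-00000"));            // false
  }
}
{code}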
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Status: Patch Available (was: Open) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HDFS-6228) comments typo fix for FsDatasetImpl.java
zhaoyunjiong created HDFS-6228: -- Summary: comments typo fix for FsDatasetImpl.java Key: HDFS-6228 URL: https://issues.apache.org/jira/browse/HDFS-6228 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Trivial -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6228) comments typo fix for FsDatasetImpl.java
[ https://issues.apache.org/jira/browse/HDFS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6228: --- Attachment: HDFS-6228.patch A patch fixing a typo in a comment: - * @param estimateBlockLen estimate generation stamp + * @param estimateBlockLen estimate block length comments typo fix for FsDatasetImpl.java Key: HDFS-6228 URL: https://issues.apache.org/jira/browse/HDFS-6228 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Trivial Attachments: HDFS-6228.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6228) comments typo fix for FsDatasetImpl.java
[ https://issues.apache.org/jira/browse/HDFS-6228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6228: --- Status: Patch Available (was: Open) comments typo fix for FsDatasetImpl.java Key: HDFS-6228 URL: https://issues.apache.org/jira/browse/HDFS-6228 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Priority: Trivial Attachments: HDFS-6228.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-4420) Provide a way to exclude subtree from balancing process
[ https://issues.apache.org/jira/browse/HDFS-4420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13971498#comment-13971498 ] zhaoyunjiong commented on HDFS-4420: Hi Yongjun, could you check https://issues.apache.org/jira/browse/HDFS-6133, which has the same idea with a different approach? Provide a way to exclude subtree from balancing process --- Key: HDFS-4420 URL: https://issues.apache.org/jira/browse/HDFS-4420 Project: Hadoop HDFS Issue Type: Improvement Components: balancer Reporter: Max Lapan Priority: Minor Attachments: Balancer-exclude-subtree-0.90.2.patch, Balancer-exclude-trunk-v2.patch, Balancer-exclude-trunk-v3.patch, Balancer-exclude-trunk.patch, HDFS-4420-v4.patch During balancer operation, it balances all blocks, regardless of their filesystem hierarchy. Sometimes it would be useful to exclude some subtree from the balancing process. For example, RegionServer data locality is crucial for HBase performance. A region's data is tied to RegionServers, which reside on specific machines in the cluster. During operation, RegionServers read and write the region's data, and after some time all of this data resides on the local machine, so all reads become local, which is great for performance. The Balancer breaks this locality during operation by moving blocks around. This patch adds an [-exclude path] switch, and, if a path is provided, the Balancer will not move blocks under this path during operation. The attached patch has been tested on 0.90.2. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-2139: --- Attachment: HDFS-2139.patch It seems Pritam doesn't have time to create a patch for Apache, and I do think using hard links to copy data between pools is a good idea, so based on Facebook's version of FastCopy I created this patch to copy files between pools. Compared to the original FastCopy, this patch only uses hard links to do the copy. It's an early version; it works on my test cluster, which only has 6 datanodes. Please let me know if I need to change the name or create a new issue. Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows: 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top-of-the-rack data transfers. Note: An extra improvement would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode. -- This message was sent by Atlassian JIRA (v6.2#6252)
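The DataNode-local step can be illustrated with a plain hard link. This is a sketch with made-up paths; real block and meta files live under the DataNode's configured data directories and include generation stamps in their names:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: hard-link an existing finalized block file (and its meta file)
// into the destination block's location instead of copying the bytes, so no
// block data actually moves on disk.
class HardLinkBlockSketch {
  static void linkBlock(String srcBlockFile, String dstBlockFile) throws IOException {
    Path src = Paths.get(srcBlockFile);
    Path dst = Paths.get(dstBlockFile);
    // dst becomes a hard link to the same on-disk data as src.
    Files.createLink(dst, src);
    Files.createLink(Paths.get(dstBlockFile + ".meta"), Paths.get(srcBlockFile + ".meta"));
  }
}
{code}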
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch Thanks, Yongjun Zhang and Benoy Antony, for the review. Updated the patches according to the comments, except for the corner case where /a/b covers /a/b/c; I do believe users won't do that. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch, HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch Uploaded a patch according to the comments. By the way, do we have the new BM service design? Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13984113#comment-13984113 ] zhaoyunjiong commented on HDFS-6133: Yes, block pinning works. By the way, where do you think is the best place to store the pinning information? If it is saved in Block, it seems it will cost a lot of memory. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133.patch.1 This patch sets the sticky bit on the block file if the DFSClient has the favored-nodes hint set, and the Balancer refuses to move such blocks. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
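As a sketch of how the moving side could recognize a pinned replica (reading the POSIX mode this way is an assumption about the mechanism; the code that sets the bit on the DataNode is not shown, and the names are illustrative):

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: a block replica is treated as "pinned" when its block file has
// the sticky bit (octal 01000) set, and the Balancer refuses to move it.
// Reading "unix:mode" requires a POSIX filesystem.
class PinnedBlockCheck {
  private static final int STICKY_BIT = 01000;

  static boolean isPinned(Path blockFile) throws IOException {
    int mode = (Integer) Files.getAttribute(blockFile, "unix:mode");
    return (mode & STICKY_BIT) != 0;
  }
}
{code}

One design consequence worth noting: keeping the pin on the block file itself means no extra per-block state has to live in NameNode memory, which is the concern raised in the comment above.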
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133.patch.1) Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-1.patch Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133.patch Currently, run Balancer will destroying Regionserver's data locality. If getBlocks could exclude blocks belongs to files which have specific path prefix, like /hbase, then we can run Balancer without destroying Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13991492#comment-13991492 ] zhaoyunjiong commented on HDFS-6133: I'll use boolean instead of Boolean. Yes, the NN may not grant all the requested/favored nodes. The best way would be to pin only the blocks that landed on the favored nodes, but since the probability that the NN didn't grant all the favored nodes is small, I just pinned them all. I was also wondering whether I should provide an API that lets users pin and unpin blocks after the file is created. That might be more useful than combining pinning with favored nodes. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133.patch Currently, running the Balancer will destroy the RegionServer's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run the Balancer without destroying the RegionServer's data locality. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (HDFS-2139) Fast copy for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-2139: --- Attachment: HDFS-2139.patch Thanks, Guo Ruijing and Daryn Sharp, for your time. Updated the patch according to the comments: 1. added clone in DistributedFileSystem; 2. added block token checks; 3. support cloning part of a file: the last block still uses a hard link, and then truncateBlock adjusts the block size and the meta file. Yes, the DN enforces no linking of UC blocks. Fast copy for HDFS. --- Key: HDFS-2139 URL: https://issues.apache.org/jira/browse/HDFS-2139 Project: Hadoop HDFS Issue Type: New Feature Reporter: Pritam Damania Attachments: HDFS-2139.patch, HDFS-2139.patch Original Estimate: 168h Remaining Estimate: 168h There is a need to perform fast file copy on HDFS. The fast copy mechanism for a file works as follows: 1) Query metadata for all blocks of the source file. 2) For each block 'b' of the file, find out its datanode locations. 3) For each block of the file, add an empty block to the namesystem for the destination file. 4) For each location of the block, instruct the datanode to make a local copy of that block. 5) Once each datanode has copied over its respective blocks, they report to the namenode about it. 6) Wait for all blocks to be copied and exit. This would speed up the copying process considerably by removing top-of-the-rack data transfers. Note: An extra improvement would be to instruct the datanode to create a hardlink of the block file if we are copying a block on the same datanode. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Moved] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong moved MAPREDUCE-5415 to HDFS-5028: --- Assignee: (was: zhaoyunjiong) Affects Version/s: (was: 1.2.0) 1.2.0 Key: HDFS-5028 (was: MAPREDUCE-5415) Project: Hadoop HDFS (was: Hadoop Map/Reduce) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.0 Reporter: zhaoyunjiong Attachments: MAPREDUCE-5415.patch In LeaseRenewer, when renew() throws a SocketTimeoutException, c.abort() will remove one DFSClient from dfsclients. This throws a ConcurrentModificationException, because dfsclients is modified after the iterator was created by for (DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
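A sketch of the usual fix for this pattern (not necessarily the attached patch): take a snapshot of the client list before iterating, so abort() can remove entries from the live list without breaking the loop. DFSClientStub below stands in for org.apache.hadoop.hdfs.DFSClient purely for illustration:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Sketch only: iterate over a copy of the client list so that c.abort(),
// which removes the client from the live list, cannot trigger a
// ConcurrentModificationException in this loop.
class LeaseRenewLoopSketch {
  interface DFSClientStub {
    void renewLease() throws java.net.SocketTimeoutException;
    void abort();
  }

  private final List<DFSClientStub> dfsclients = new ArrayList<>();

  void renewAll() {
    for (DFSClientStub c : new ArrayList<>(dfsclients)) {
      try {
        c.renewLease();
      } catch (java.net.SocketTimeoutException e) {
        c.abort();  // may remove c from dfsclients; safe because we iterate a copy
      }
    }
  }
}
{code}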
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: MAPREDUCE-5415.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.0 Reporter: zhaoyunjiong Attachments: HDFS-5028 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028 LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.2.0 Reporter: zhaoyunjiong Attachments: HDFS-5028 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Affects Version/s: (was: 1.2.0) 1.1.2 Fix Version/s: 1.1.3 LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.2 Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.2 Reporter: zhaoyunjiong Fix For: 1.1.3 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028.patch HDFS-5028-1.1.2.patch Update patch for both trunk and 1.1.2. LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.2 Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-1.1.2.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Affects Version/s: (was: 1.1.2) 1.1.0 LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028-1.1.2.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028-branch-1.1.patch LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-branch-1.1.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13723253#comment-13723253 ] zhaoyunjiong commented on HDFS-5028: dfsclients is synchronized. The problem here is the Iterator. You can get more information here: http://stackoverflow.com/questions/8189466/java-util-concurrentmodificationexception In short: The iterators returned by this class's iterator and listIterator methods are fail-fast: if the list is structurally modified at any time after the iterator is created, in any way except through the iterator's own remove or add methods, the iterator will throw a ConcurrentModificationException. c.abort() will remove c (a DFSClient) from dfsclients, so the iterator generated by for(DFSClient c : dfsclients) will throw ConcurrentModificationException. LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-branch-1.1.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
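The fail-fast behaviour quoted above is easy to reproduce with plain java.util collections. A minimal, self-contained sketch (toy strings standing in for DFSClient instances, not the actual LeaseRenewer code):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.ConcurrentModificationException;
import java.util.List;

public class FailFastDemo {
    public static void main(String[] args) {
        List<String> dfsclients = new ArrayList<>(Arrays.asList("c1", "c2", "c3"));
        try {
            for (String c : dfsclients) {         // the enhanced for-loop uses the list's own iterator
                if (c.equals("c1")) {
                    dfsclients.remove(c);         // structural modification outside the iterator, like c.abort()
                }
            }
        } catch (ConcurrentModificationException e) {
            System.out.println("caught: " + e);   // thrown on the next call to iterator.next()
        }
    }
}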
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.1.3 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: (was: HDFS-5028-branch-1.1.patch) LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.1.3 In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HDFS-5028) LeaseRenewer throw java.util.ConcurrentModificationException when timeout
[ https://issues.apache.org/jira/browse/HDFS-5028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5028: --- Attachment: HDFS-5028.patch HDFS-5028-branch-1.1.patch Thanks Nicholas. Change dfsclients.get(dfsclients.size() - 1).abort() to dfsclients.get(0).abort(). LeaseRenewer throw java.util.ConcurrentModificationException when timeout - Key: HDFS-5028 URL: https://issues.apache.org/jira/browse/HDFS-5028 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 1.1.0, 2.0.0-alpha Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.1.3 Attachments: HDFS-5028-branch-1.1.patch, HDFS-5028.patch In LeaseRenewer, when renew() throw SocketTimeoutException, c.abort() will remove one dfsclient from dfsclients. Here will throw a ConcurrentModificationException because dfsclients changed after the iterator created by for(DFSClient c : dfsclients): Exception in thread org.apache.hadoop.hdfs.LeaseRenewer$1@75fa1077 java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.hadoop.hdfs.LeaseRenewer.run(LeaseRenewer.java:406) at org.apache.hadoop.hdfs.LeaseRenewer.access$600(LeaseRenewer.java:69) at org.apache.hadoop.hdfs.LeaseRenewer$1.run(LeaseRenewer.java:273) at java.lang.Thread.run(Thread.java:662) -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
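The index-based abort described above sidesteps the fail-fast iterator entirely: each abort() removes that client from the shared list, so draining via get(0) never touches a live iterator. A rough, self-contained sketch of the idea with toy types (not the actual DFSClient or LeaseRenewer classes):

import java.util.ArrayList;
import java.util.List;

public class AbortByIndexSketch {
    static class Client {
        final String name;
        final List<Client> registry;
        Client(String name, List<Client> registry) { this.name = name; this.registry = registry; }
        void abort() {
            System.out.println("aborting " + name);
            registry.remove(this);                // the side effect that broke the for-each version
        }
    }

    public static void main(String[] args) {
        List<Client> dfsclients = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            dfsclients.add(new Client("client-" + i, dfsclients));
        }
        while (!dfsclients.isEmpty()) {
            dfsclients.get(0).abort();            // index access, no iterator, no ConcurrentModificationException
        }
    }
}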
[jira] [Created] (HDFS-5247) Namenode should close editlog and unlock storage when removing failed storage dir
zhaoyunjiong created HDFS-5247: -- Summary: Namenode should close editlog and unlock storage when removing failed storage dir Key: HDFS-5247 URL: https://issues.apache.org/jira/browse/HDFS-5247 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.1 When one of the dfs.name.dir directories failed, the namenode didn't close the editlog and unlock the storage: java 24764 hadoop 78uW REG 252,32 0 393219 /volume1/nn/dfs/in_use.lock (deleted) java 24764 hadoop 107u REG 252,32 1155072 393229 /volume1/nn/dfs/current/edits.new (deleted) java 24764 hadoop 119u REG 252,32 0 393238 /volume1/nn/dfs/current/fstime.tmp java 24764 hadoop 140u REG 252,32 1761805 393239 /volume1/nn/dfs/current/edits If this dir failed because it ran out of space, then restoring this storage may fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-5247) Namenode should close editlog and unlock storage when removing failed storage dir
[ https://issues.apache.org/jira/browse/HDFS-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13777073#comment-13777073 ] zhaoyunjiong commented on HDFS-5247: I'm talking about the failed directory. Our case was caused by running out of space on that disk. In this case, it needs to, and should, close those two files. And I believe trying to close them won't make things worse. Namenode should close editlog and unlock storage when removing failed storage dir - Key: HDFS-5247 URL: https://issues.apache.org/jira/browse/HDFS-5247 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.2.1 Attachments: HDFS-5247-branch-1.2.patch When one of the dfs.name.dir directories failed, the namenode didn't close the editlog and unlock the storage: java 24764 hadoop 78uW REG 252,32 0 393219 /volume1/nn/dfs/in_use.lock (deleted) java 24764 hadoop 107u REG 252,32 1155072 393229 /volume1/nn/dfs/current/edits.new (deleted) java 24764 hadoop 119u REG 252,32 0 393238 /volume1/nn/dfs/current/fstime.tmp java 24764 hadoop 140u REG 252,32 1761805 393239 /volume1/nn/dfs/current/edits If this dir failed because it ran out of space, then restoring this storage may fail. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
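A hypothetical sketch of the cleanup argued for above: close the edit stream and release the in_use.lock of the failed directory on a best-effort basis, so a later restore can succeed. The names below (EditStream, StorageLock, removeFailedDir, FailedDirCleanupSketch) are illustrative only, not the actual NameNode classes:

import java.io.Closeable;
import java.io.IOException;

class FailedDirCleanupSketch {
    interface EditStream extends Closeable {}    // stands in for the per-directory edits output stream
    interface StorageLock extends Closeable {}   // stands in for the in_use.lock file lock

    static void removeFailedDir(EditStream edits, StorageLock lock) {
        closeQuietly(edits);                     // best effort: a close failure should not make things worse
        closeQuietly(lock);                      // releases in_use.lock so the directory can be restored later
    }

    private static void closeQuietly(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException e) {
            // log and continue; the directory is already considered failed
        }
    }
}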
[jira] [Created] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
zhaoyunjiong created HDFS-5367: -- Summary: Restore fsimage locked NameNode too long when the size of fsimage are big Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Our cluster has a 40G fsimage, and we write one copy of the edit log to NFS. After the NFS mount temporarily failed, the NameNode tried to recover it while doing a checkpoint, which means saving the 40G fsimage back to NFS. That takes some time (40G / 128MB/s = 320 seconds), it holds the FSNamesystem lock, and this brought down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5367: --- Attachment: (was: HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big - Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5367: --- Attachment: HDFS-5367 The fsimage restored when the SecondaryNameNode calls rollEditLog will soon be replaced when the SecondaryNameNode calls rollFsImage. So I think restoring the fsimage is not necessary. Restore fsimage locked NameNode too long when the size of fsimage are big - Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5367) Restore fsimage locked NameNode too long when the size of fsimage are big
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5367: --- Attachment: HDFS-5367-branch-1.2.patch This patch avoid restore fsimage to make rollEditLog finished as soon as possible. Restore fsimage locked NameNode too long when the size of fsimage are big - Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5367-branch-1.2.patch Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5367) Restoring namenode storage locks namenode due to unnecessary fsimage write
[ https://issues.apache.org/jira/browse/HDFS-5367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13798610#comment-13798610 ] zhaoyunjiong commented on HDFS-5367: Thank you for your review. Restoring namenode storage locks namenode due to unnecessary fsimage write -- Key: HDFS-5367 URL: https://issues.apache.org/jira/browse/HDFS-5367 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5367-branch-1.2.patch Our cluster have 40G fsimage, we write one copy of edit log to NFS. After NFS temporary failed, when doing checkpoint, NameNode try to recover it, and it will save 40G fsimage to NFS, it takes some time ( 40G/128MB/s = 320 seconds) , and it locked FSNamesystem, and this bring down our cluster. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
zhaoyunjiong created HDFS-5396: -- Summary: FSImage.getFsImageName should check whether fsimage exists Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 In https://issues.apache.org/jira/browse/HDFS-5367, the fsimage may not be written to every IMAGE dir, so we need to check whether the fsimage exists before FSImage.getFsImageName returns. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5396: --- Attachment: HDFS-5396-branch-1.2.patch Check whether fsimage exists before return. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1#6144)
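A small illustrative sketch of the check attached here, using plain java.io.File with a hypothetical helper (not the actual FSImage code): return the fsimage from the first IMAGE directory that actually contains one, instead of blindly returning a directory that may still be empty after a restore:

import java.io.File;
import java.util.List;

class FsImageLookupSketch {
    static File getFsImageName(List<File> imageDirs) {
        for (File dir : imageDirs) {
            File img = new File(new File(dir, "current"), "fsimage");
            if (img.exists()) {                  // a freshly restored dir may not hold an fsimage yet (HDFS-5367)
                return img;
            }
        }
        return null;                             // no usable fsimage found in any configured IMAGE dir
    }
}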
[jira] [Resolved] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong resolved HDFS-5396. Resolution: Not A Problem FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13802545#comment-13802545 ] zhaoyunjiong commented on HDFS-5396: The first image storage dir always has an fsimage file in it. Restored image storage dirs are always appended to the end. So the first one must have an fsimage in it. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HDFS-5579) Under construction files make DataNode decommission take very long hours
zhaoyunjiong created HDFS-5579: -- Summary: Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0, 1.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong We noticed that sometimes decommissioning DataNodes takes a very long time, even exceeding 100 hours. After checking the code, I found that BlockManager:computeReplicationWorkForBlocks(List<List<Block>> blocksToReplicate) won't replicate blocks which belong to under construction files; however, in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is a block that needs replication, no matter whether it belongs to an under construction file or not, the decommission process will keep running. That's the reason the decommission sometimes takes a very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579.patch HDFS-5579-branch-1.2.patch This patch lets the NameNode replicate blocks that belong to under construction files, except the last block. And if a decommissioning DataNode only has blocks which are the last blocks of under construction files and which have more than 1 live replica left behind, then the NameNode can set it to DECOMMISSIONED. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
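A toy model of the decommission check described in the update above (hypothetical names and simplified types, not the real BlockManager code): a block only holds up decommission if it is under-replicated and is not the last block of an under construction file that already has more than the minimum number of live replicas:

public class DecommissionRuleSketch {
    static final int MIN_REPLICATION = 1;        // assumed default of 1 for the minimum replication setting

    static boolean blocksDecommission(boolean underConstruction, boolean isLastBlock,
                                      int liveReplicas, int expectedReplicas) {
        if (liveReplicas >= expectedReplicas) {
            return false;                        // fully replicated, nothing to wait for
        }
        if (underConstruction && isLastBlock && liveReplicas > MIN_REPLICATION) {
            return false;                        // last block may still grow; enough live copies already exist
        }
        return true;                             // still needs replication before the node can be DECOMMISSIONED
    }

    public static void main(String[] args) {
        System.out.println(blocksDecommission(true, true, 2, 3));   // false: decommission may finish
        System.out.println(blocksDecommission(true, false, 2, 3));  // true: an earlier block of an open file gets replicated first
    }
}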
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579-branch-1.2.patch HDFS-5579.patch Thanks Vinay. Update patch as your comments. Except: getLastBlock do throws IOException, I deleted it in this patch. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, HDFS-5579.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579-branch-1.2.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579.patch HDFS-5579-branch-1.2.patch Update patch, added test case for trunk. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, HDFS-5579.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579-branch-1.2.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13865202#comment-13865202 ] zhaoyunjiong commented on HDFS-5579: It's already in the patch: +if (bc.isUnderConstruction()) { + if (block.equals(bc.getLastBlock()) && curReplicas > minReplication) { + continue; + } + underReplicatedInOpenFiles++; +} Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579-branch-1.2.patch HDFS-5579.patch Good point. Thanks Jing. Update patches to fix this problem. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579-branch-1.2.patch, HDFS-5579.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579-branch-1.2.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: HDFS-5579.patch Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5579: --- Attachment: (was: HDFS-5579.patch) Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (HDFS-5579) Under construction files make DataNode decommission take very long hours
[ https://issues.apache.org/jira/browse/HDFS-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13870385#comment-13870385 ] zhaoyunjiong commented on HDFS-5579: Thanks for your time to review the patch, Jing. Under construction files make DataNode decommission take very long hours Key: HDFS-5579 URL: https://issues.apache.org/jira/browse/HDFS-5579 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 2.4.0 Attachments: HDFS-5579-branch-1.2.patch, HDFS-5579.patch We noticed that some times decommission DataNodes takes very long time, even exceeds 100 hours. After check the code, I found that in BlockManager:computeReplicationWorkForBlocks(ListListBlock blocksToReplicate) it won't replicate blocks which belongs to under construction files, however in BlockManager:isReplicationInProgress(DatanodeDescriptor srcNode), if there is block need replicate no matter whether it belongs to under construction or not, the decommission progress will continue running. That's the reason some time the decommission takes very long time. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Reopened] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong reopened HDFS-5396: I made a mistake when I resolved this as Not A Problem. Because for (Iterator<StorageDirectory> it = dirIterator(NameNodeDirType.IMAGE); it.hasNext();) sd = it.next(); will return the last StorageDirectory of type IMAGE, but due to HDFS-5367, it may not have an fsimage in it. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch In https://issues.apache.org/jira/browse/HDFS-5367, fsimage may not write to all IMAGE dir, so we need to check whether fsimage exists before FSImage.getFsImageName returned. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Created] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
zhaoyunjiong created HDFS-5944: -- Summary: LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.2.0, 1.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong In our cluster, we encountered an error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write, and Client A kept refreshing its lease. Client B deleted /XXX/20140206/04_30/ Client C opened file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then the SecondaryNameNode tried to do a checkpoint and failed, because the lease held by Client A was not deleted when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here, when prefix is /XXX/20140206/04_30/ and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srclen) is '_'. The fix is simple; I'll upload a patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
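The separator check quoted above can be reproduced in isolation. A minimal, self-contained sketch showing why a prefix that already ends with '/' never matches the path of the open file beneath it:

public class PrefixMatchBugDemo {
    public static void main(String[] args) {
        String prefix = "/XXX/20140206/04_30/";
        String p = "/XXX/20140206/04_30/_SUCCESS.slc.log";
        int srclen = prefix.length();
        System.out.println(p.charAt(srclen));                                // prints '_' rather than '/'
        System.out.println(p.length() == srclen || p.charAt(srclen) == '/'); // false, so the stale lease is never found
    }
}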
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Description: In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. was: In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is this a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint - Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: HDFS-5944.patch HDFS-5944-branch-1.2.patch This patch is very simple, if prefix ended with '/', just minus 1 from srclen, so p.charAt(srclen) could handle path correctly. LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint - Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
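A self-contained sketch of the adjustment described above, reusing the variable names quoted in the issue description (a standalone demo, not the patch itself): trimming srclen by one when the prefix ends with the separator makes the same check match:

public class PrefixMatchFixDemo {
    public static void main(String[] args) {
        String prefix = "/XXX/20140206/04_30/";
        String p = "/XXX/20140206/04_30/_SUCCESS.slc.log";
        int srclen = prefix.length();
        if (prefix.charAt(srclen - 1) == '/') {
            srclen -= 1;                          // treat "/a/b/" the same as "/a/b"
        }
        boolean match = p.length() == srclen || p.charAt(srclen) == '/';
        System.out.println(match);                // true: the lease held under the deleted directory is now found
    }
}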
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13901171#comment-13901171 ] zhaoyunjiong commented on HDFS-5944: Brandon, thanks for your time reviewing this patch. I don't think users use DFSClient directly. But even using DistributedFileSystem, we can still send a path ending with / by passing a path like /a/b/../, because in getPathName, String result = makeAbsolute(file).toUri().getPath() will return /a/. About the unit test, I'd be happy to add one. I have two questions and need your help: 1. Is it enough to just write a unit test for findLeaseWithPrefixPath? 2. In trunk there is no TestLeaseManager.java; should I add one? LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint - Key: HDFS-5944 URL: https://issues.apache.org/jira/browse/HDFS-5944 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.0, 2.2.0 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-5944-branch-1.2.patch, HDFS-5944.patch, HDFS-5944.test.txt In our cluster, we encountered error like this: java.io.IOException: saveLeases found path /XXX/20140206/04_30/_SUCCESS.slc.log but is not under construction. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.saveFilesUnderConstruction(FSNamesystem.java:6217) at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$Saver.save(FSImageFormat.java:607) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveCurrent(FSImage.java:1004) at org.apache.hadoop.hdfs.server.namenode.FSImage.saveNamespace(FSImage.java:949) What happened: Client A open file /XXX/20140206/04_30/_SUCCESS.slc.log for write. And Client A continue refresh it's lease. Client B deleted /XXX/20140206/04_30/ Client C open file /XXX/20140206/04_30/_SUCCESS.slc.log for write Client C closed the file /XXX/20140206/04_30/_SUCCESS.slc.log Then secondaryNameNode try to do checkpoint and failed due to failed to delete lease hold by Client A when Client B deleted /XXX/20140206/04_30/. The reason is a bug in findLeaseWithPrefixPath: int srclen = prefix.length(); if (p.length() == srclen || p.charAt(srclen) == Path.SEPARATOR_CHAR) { entries.put(entry.getKey(), entry.getValue()); } Here when prefix is /XXX/20140206/04_30/, and p is /XXX/20140206/04_30/_SUCCESS.slc.log, p.charAt(srcllen) is '_'. The fix is simple, I'll upload patch later. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: HDFS-5944-branch-1.2.patch HDFS-5944.patch Updated patches with a unit test.
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: (was: HDFS-5944-branch-1.2.patch)
[jira] [Updated] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5944: --- Attachment: (was: HDFS-5944.patch)
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13905361#comment-13905361 ] zhaoyunjiong commented on HDFS-5944: Multiple trailing / is impossible.
[jira] [Commented] (HDFS-5944) LeaseManager:findLeaseWithPrefixPath didn't handle path like /a/b/ right cause SecondaryNameNode failed do checkpoint
[ https://issues.apache.org/jira/browse/HDFS-5944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13906435#comment-13906435 ] zhaoyunjiong commented on HDFS-5944: Thank you Brandon and Benoy.
[jira] [Updated] (HDFS-5396) FSImage.getFsImageName should check whether fsimage exists
[ https://issues.apache.org/jira/browse/HDFS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-5396: --- Attachment: HDFS-5396-branch-1.2.patch Updated patch. FSImage.getFsImageName should check whether fsimage exists -- Key: HDFS-5396 URL: https://issues.apache.org/jira/browse/HDFS-5396 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 1.2.1 Reporter: zhaoyunjiong Assignee: zhaoyunjiong Fix For: 1.3.0 Attachments: HDFS-5396-branch-1.2.patch, HDFS-5396-branch-1.2.patch As reported in https://issues.apache.org/jira/browse/HDFS-5367, the fsimage may not be written to every IMAGE directory, so we need to check whether the fsimage exists before FSImage.getFsImageName returns it. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
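A minimal sketch of the check being described, using a hypothetical helper rather than the actual FSImage code: instead of returning the fsimage path from the first image directory, return the first one whose fsimage file actually exists on disk. The directory paths in main are example values only.
{code:java}
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class FsImageLookupSketch {
  // Return the first fsimage file that actually exists, or null if none does.
  static File getExistingFsImage(List<File> imageDirs) {
    for (File dir : imageDirs) {
      File fsimage = new File(dir, "fsimage");
      if (fsimage.exists()) {
        return fsimage;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    List<File> imageDirs = Arrays.asList(
        new File("/data/1/dfs/name/current"),   // example paths, not real config
        new File("/data/2/dfs/name/current"));
    System.out.println(getExistingFsImage(imageDirs));
  }
}
{code}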
[jira] [Resolved] (HDFS-7044) Support retention policy based on access time and modify time, use XAttr to store policy
[ https://issues.apache.org/jira/browse/HDFS-7044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong resolved HDFS-7044. Resolution: Duplicate Thanks Allen Wittenauer and Zesheng Wu. After reading the comments in HDFS-6382, I now understand the concerns. Support retention policy based on access time and modify time, use XAttr to store policy Key: HDFS-7044 URL: https://issues.apache.org/jira/browse/HDFS-7044 Project: Hadoop HDFS Issue Type: New Feature Components: namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: Retention policy design.pdf The basic idea is to set a retention policy on a directory based on access time and modify time, and to use an XAttr to store the policy. Files under a directory that has a retention policy will be deleted if they meet the retention rule. There are three rules: # access time #* If (accessTime + retentionTimeForAccess < now), the file will be deleted # modify time #* If (modifyTime + retentionTimeForModify < now), the file will be deleted # access time and modify time #* If (accessTime + retentionTimeForAccess < now && modifyTime + retentionTimeForModify < now), the file will be deleted -- This message was sent by Atlassian JIRA (v6.3.4#6332)
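A small sketch of the three rules quoted above (hypothetical helper methods, not part of the attached design or any patch); times are milliseconds since the epoch:
{code:java}
public class RetentionRuleSketch {
  static boolean expiredByAccess(long accessTime, long retentionForAccess, long now) {
    return accessTime + retentionForAccess < now;
  }

  static boolean expiredByModify(long modifyTime, long retentionForModify, long now) {
    return modifyTime + retentionForModify < now;
  }

  static boolean expiredByBoth(long accessTime, long retentionForAccess,
                               long modifyTime, long retentionForModify, long now) {
    return expiredByAccess(accessTime, retentionForAccess, now)
        && expiredByModify(modifyTime, retentionForModify, now);
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    long dayMs = 24L * 60 * 60 * 1000;
    // A file last read 10 days ago with a 7-day access retention is expired.
    System.out.println(expiredByAccess(now - 10 * dayMs, 7 * dayMs, now)); // true
  }
}
{code}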
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-3.patch Updated the patch to merge with trunk. {quote} Why do we always pass false below? new Sender(out).writeBlock(b, accessToken, clientname, targets, srcNode, stage, 0, 0, 0, 0, blockSender.getChecksum(), cachingStrategy, false); {quote} This code path happens when the NameNode asks a DataNode to send a block to another DataNode (DatanodeProtocol.DNA_TRANSFER); it is not triggered by the client, so there is no need to pin the block in this case. {quote} We will never copy a block? if (datanode.data.getPinning(block)) { String msg = "Not able to copy block " + block.getBlockId() + " " + "to " + peer.getRemoteAddressString() + " because it's pinned "; LOG.info(msg); sendResponse(ERROR, msg); } Anything to help ensure the replica count does not rot when this pinning is enabled? {quote} When the block is under-replicated, the NameNode will send a DatanodeProtocol.DNA_TRANSFER command to the DataNode, and it is handled by DataTransfer; pinning won't affect that. Make Balancer support exclude specified path Key: HDFS-6133 URL: https://issues.apache.org/jira/browse/HDFS-6133 Project: Hadoop HDFS Issue Type: Improvement Components: balancer & mover, namenode Reporter: zhaoyunjiong Assignee: zhaoyunjiong Attachments: HDFS-6133-1.patch, HDFS-6133-2.patch, HDFS-6133-3.patch, HDFS-6133.patch Currently, running Balancer will destroy the Regionserver's data locality. If getBlocks could exclude blocks belonging to files that have a specific path prefix, like /hbase, then we could run Balancer without destroying the Regionserver's data locality. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: (was: HDFS-6133-3.patch)
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-3.patch Updated patch, merged with trunk.
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-4.patch Thanks Yongjun Zhang. Updated the patch according to the comments. {quote} The concept of favoredNodes pre-existed before your patch, now your patch defines that as long as favoredNodes is passed, then pinning will be done. So we are changing the prior definition of how favoredNodes are used. Why not add some additional interface to tell that pinning will happen so we have the option not to pin even if favoredNodes is passed? Not necessarily you need to do what I suggested here, but I'd like to understand your thoughts here. {quote} I think most of the time, if you use favoredNodes, you'd like to keep the block on that machine, so to keep things simple I didn't add a new interface. {quote} Do we ever need interface to do unpinning? {quote} We can add unpinning in another issue if there is a use case that needs it.
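A hypothetical sketch of the pinning behaviour discussed in this thread (not the HDFS-6133 patch itself): replicas written with favored nodes are marked pinned, a Balancer-style copy of a pinned replica is refused, and a NameNode-directed transfer for re-replication is unaffected.
{code:java}
import java.util.HashSet;
import java.util.Set;

public class BlockPinningSketch {
  private final Set<Long> pinnedBlocks = new HashSet<>();

  void writeBlock(long blockId, boolean hasFavoredNodes) {
    if (hasFavoredNodes) {
      pinnedBlocks.add(blockId); // favored-node writes imply pinning
    }
  }

  boolean copyBlockForBalancer(long blockId) {
    // Refuse to move a pinned replica off this node.
    return !pinnedBlocks.contains(blockId);
  }

  boolean transferForReplication(long blockId) {
    // A DNA_TRANSFER-style re-replication is not affected by pinning.
    return true;
  }

  public static void main(String[] args) {
    BlockPinningSketch dn = new BlockPinningSketch();
    dn.writeBlock(1001L, true);                            // written with favored nodes
    System.out.println(dn.copyBlockForBalancer(1001L));    // false: refused
    System.out.println(dn.transferForReplication(1001L));  // true: still allowed
  }
}
{code}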
[jira] [Created] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck
zhaoyunjiong created HDFS-7429: -- Summary: DomainSocketWatcher.doPoll0 stuck Key: HDFS-7429 URL: https://issues.apache.org/jira/browse/HDFS-7429 Project: Hadoop HDFS Issue Type: Bug Reporter: zhaoyunjiong I found some of our DataNodes will run exceeds the limit of concurrent xciever, the limit is 4K. After check the stack, I suspect that DomainSocketWatcher.doPoll0 stuck: {quote} DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f55c5576000 nid=0x385d waiting on condition [0x7f558d5d4000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x000740df9c90 (a java.util.concurrent.locks.ReentrantLock$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197) at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214) at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:286) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) -- DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f55c5575000 nid=0x37b3 runnable [0x7f558d3d2000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f55c5574000 nid=0x377a waiting on condition [0x7f558d7d6000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for 0x000740df9cb0 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043) at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:306) at 
org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) at java.lang.Thread.run(Thread.java:745) Thread-163852 daemon prio=10 tid=0x7f55c811c800 nid=0x6757 runnable [0x7f55aef6e000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method) at org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(DomainSocketWatcher.java:52) at org.apache.hadoop.net.unix.DomainSocketWatcher$1.run(DomainSocketWatcher.java:457) at
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.doPoll0 stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Attachment: 11241025 11241023 11241021 Upload more stack trace files.
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Summary: DomainSocketWatcher.kick stuck (was: DomainSocketWatcher.doPoll0 stuck)
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Description: I found that some of our DataNodes exceed the limit of concurrent xceivers (the limit is 4K). After checking the stacks, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by DomainSocketWatcher.kick, is stuck.
[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224249#comment-14224249 ] zhaoyunjiong commented on HDFS-7429: The previous description is not right. The stuck thread happened at org.apache.hadoop.net.unix.DomainSocket.writeArray0 as below shows. {quote} $ grep -B2 -A10 DomainSocket.writeArray 1124102* 11241021-DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241021- java.lang.Thread.State: RUNNABLE 11241021: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241021- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241021- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241021- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241021- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241021- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241021- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241021- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241021- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241021- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241021- at java.lang.Thread.run(Thread.java:745) -- -- 11241023-DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241023- java.lang.Thread.State: RUNNABLE 11241023: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241023- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241023- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241023- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241023- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241023- at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241023- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241023- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241023- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241023- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241023- at java.lang.Thread.run(Thread.java:745) -- -- 11241025-DataXceiver for client unix:/var/run/hadoop-hdfs/dn [Waiting for operation #1] daemon prio=10 tid=0x7f7de034c800 nid=0x7b7 runnable [0x7f7db06c5000] 11241025- java.lang.Thread.State: RUNNABLE 11241025: at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method) 11241025- at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45) 11241025- at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:589) 11241025- at org.apache.hadoop.net.unix.DomainSocketWatcher.kick(DomainSocketWatcher.java:350) 11241025- at org.apache.hadoop.net.unix.DomainSocketWatcher.add(DomainSocketWatcher.java:303) 11241025- 
at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.createNewMemorySegment(ShortCircuitRegistry.java:283) 11241025- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:413) 11241025- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:172) 11241025- at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92) 11241025- at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232) 11241025- at java.lang.Thread.run(Thread.java:745) {quote}
[jira] [Updated] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-7429: --- Description: I found that some of our DataNodes exceed the limit of concurrent xceivers (the limit is 4K). After checking the stacks, I suspect that org.apache.hadoop.net.unix.DomainSocket.writeArray0, which is called by DomainSocketWatcher.kick, is stuck.
[jira] [Assigned] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong reassigned HDFS-7429: -- Assignee: zhaoyunjiong
[jira] [Commented] (HDFS-7429) DomainSocketWatcher.kick stuck
[ https://issues.apache.org/jira/browse/HDFS-7429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224325#comment-14224325 ] zhaoyunjiong commented on HDFS-7429: The problem here is that on our machines we can only send 299 bytes to the domain socket. When it tries to send the 300th byte, it blocks, and DomainSocketWatcher.add(DomainSocket sock, Handler handler) is holding the lock, so watcherThread.run can't get the lock and clear the buffer; it's a live lock. I'm not sure which configuration controls the buffer size of 299 for now; I suspect net.core.netdev_budget, which is 300 on our machines. I'll upload a patch later that limits the bytes sent, to prevent the live lock. By the way, should I move this to the HADOOP COMMON project?
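A self-contained illustration of the interaction described in this comment (not the Hadoop DomainSocketWatcher code): one thread holds a lock while doing a blocking write into a bounded channel, and the only thread that drains that channel needs the same lock first, so once the channel fills up neither side makes progress. An ArrayBlockingQueue stands in for the limited domain-socket buffer (299 bytes in the report):
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockedKickSketch {
  public static void main(String[] args) throws InterruptedException {
    ReentrantLock lock = new ReentrantLock();
    // Only 4 "kick" tokens fit before put() blocks, mimicking the full socket buffer.
    ArrayBlockingQueue<Integer> channel = new ArrayBlockingQueue<>(4);

    Thread adder = new Thread(() -> {
      lock.lock();            // the add()-style caller takes the lock ...
      try {
        for (int i = 0; i < 10; i++) {
          channel.put(i);     // ... then the kick-style write blocks once the buffer is full
        }
      } catch (InterruptedException ignored) {
        // daemon thread; nothing to clean up in this sketch
      } finally {
        lock.unlock();
      }
    }, "adder");
    adder.setDaemon(true);

    Thread watcher = new Thread(() -> {
      try {
        // The draining thread wants the same lock before it may read the channel.
        if (lock.tryLock(2, TimeUnit.SECONDS)) {
          try {
            System.out.println("watcher got the lock (no stall this run)");
          } finally {
            lock.unlock();
          }
        } else {
          System.out.println("watcher never got the lock; the channel stays full");
        }
      } catch (InterruptedException ignored) {
      }
    }, "watcher");
    watcher.setDaemon(true);

    adder.start();
    Thread.sleep(200);        // let the adder grab the lock and fill the channel
    watcher.start();
    watcher.join();
    System.out.println("adder still blocked in put(): " + adder.isAlive());
  }
}
{code}
Run as-is, the watcher thread times out waiting for the lock and the adder stays blocked in put(), which mirrors the stuck kick() described above.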
[jira] [Updated] (HDFS-6133) Make Balancer support exclude specified path
[ https://issues.apache.org/jira/browse/HDFS-6133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhaoyunjiong updated HDFS-6133: --- Attachment: HDFS-6133-5.patch Thanks Yongjun Zhang. Updated the patch to fix the formatting.