[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-27 Thread Zhihai Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihai Xu updated HDFS-16111:
-
Description: 
When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
failed volumes on a lot of datanodes, which caused some missing blocks at that 
time. Although we later recovered all the missing blocks by symlinking the 
path (dfs/dn/current) on each failed volume to a new directory and copying all 
the data to the new directory, we missed our SLA and it delayed our upgrade 
process on our production cluster for several hours.

When this issue happened, we saw a lot of these exceptions before the 
volume failed on the datanode:

 [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block 
BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode 
(BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor: 
Possible disk error: Failed to create 
/XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause is
 java.io.IOException: No space left on device
         at java.io.UnixFileSystem.createFileExclusively(Native Method)
         at java.io.File.createNewFile(File.java:1012)
         at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
         at 
org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
         at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
         at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
         at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
         at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
         at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
         at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
         at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
         at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
         at java.lang.Thread.run(Thread.java:748)

 

We found this issue happened due to the following two reasons:

First, the upgrade process added some extra disk usage on each disk volume of 
the datanode:

BlockPoolSliceStorage.doUpgrade 
([https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445]) 
is the main upgrade function in the datanode, and it adds some extra storage. 
The extra storage is all the new directories created in /current//current, 
although all block data files and block metadata files are hard-linked with 
/current//previous after the upgrade. Since a lot of new directories are 
created, this uses some disk space on each disk volume.
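
To make the hard-link point concrete, here is a small, self-contained illustration in plain Java NIO (not Hadoop code, and the file names are made up for the example): a hard-linked copy of a block file shares its data with the original, so the real extra cost of the upgrade is the new directory tree itself.

{code:java}
import java.nio.file.Files;
import java.nio.file.Path;

// Standalone illustration, not Hadoop code: hard links share file data,
// but every new directory created during an upgrade still consumes space.
public class HardLinkDemo {
  public static void main(String[] args) throws Exception {
    Path dir = Files.createTempDirectory("upgrade-demo");
    Path previous = Files.createDirectories(dir.resolve("previous"));
    Path current = Files.createDirectories(dir.resolve("current"));

    // "previous" holds the original block file; "current" gets a hard link to it.
    Path oldBlock = Files.write(previous.resolve("blk_1"), new byte[4096]);
    Path newBlock = Files.createLink(current.resolve("blk_1"), oldBlock);

    // Same inode: the 4 KB of block data is stored only once.
    System.out.println("same file: " + Files.isSameFile(oldBlock, newBlock));
    // The directory entries themselves are what consume the additional space.
    System.out.println("usable space: " + Files.getFileStore(dir).getUsableSpace());
  }
}
{code}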

 

Second, there is a potential bug when picking a disk volume to write a new 
block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
The code that selects a disk checks whether the available space on the selected 
disk is more than the size in bytes of the block file to store 
([https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86]). 
But when creating a new block, two files are created: one is the block file 
blk_, the other is the block metadata file blk__.meta. This is the code that 
runs when finalizing a block, where both the block file size and the metadata 
file size are updated: 
[https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391]. 
The current code only considers the size of the block file and doesn't consider 
the size of the block metadata file when choosing a disk in 
RoundRobinVolumeChoosingPolicy. There can also be a lot of ongoing blocks being 
received at the same time; the default maximum number of DataXceiver threads is 
4096. This underestimates the total space needed to write a block, which can 
cause the disk full error above (No space left on device).
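
The following is a minimal sketch of what the safeguarded check could look like. It is not the actual RoundRobinVolumeChoosingPolicy code: the Volume interface, its getAvailable() accessor, and the additionalAvailableSpace margin (which would come from the configuration proposed below) are simplified assumptions for illustration.

{code:java}
import java.io.IOException;
import java.util.List;

// Minimal sketch, not the actual Hadoop implementation: choose the next volume
// round-robin, but require it to hold the block size PLUS a configurable margin
// that leaves room for the block metadata file and other concurrent writers.
class RoundRobinChoiceSketch {
  interface Volume {
    long getAvailable(); // free bytes on this volume (assumed accessor)
  }

  private int curVolume = 0; // round-robin cursor
  private final long additionalAvailableSpace; // margin from the proposed configuration

  RoundRobinChoiceSketch(long additionalAvailableSpace) {
    this.additionalAvailableSpace = additionalAvailableSpace;
  }

  Volume chooseVolume(List<Volume> volumes, long blockSize) throws IOException {
    int startVolume = curVolume;
    while (true) {
      Volume volume = volumes.get(curVolume);
      curVolume = (curVolume + 1) % volumes.size();
      // Today's check only compares against blockSize; adding the margin also
      // accounts for the .meta file and other in-flight blocks on the same disk.
      if (volume.getAvailable() >= blockSize + additionalAvailableSpace) {
        return volume;
      }
      if (curVolume == startVolume) {
        throw new IOException("Out of space: no volume has enough available space");
      }
    }
  }
}
{code}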

 

Since the size of the block metadata file is not fixed, I suggest adding a 
configuration 
(dfs.datanode.round-robin-volume-choosing-policy.additional-available-space) 
to safeguard the disk space when choosing a volume to write new block data 
in RoundRobinVolumeChoosingPolicy.
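
For completeness, here is a sketch of how the policy could read such a key. The default of 0 (i.e. keep today's behavior unless an operator sets a margin) and the exact wiring are my assumptions, not the committed patch.

{code:java}
import org.apache.hadoop.conf.Configuration;

// Sketch only: reading the proposed key with a conservative default of 0 bytes,
// so nothing changes unless the operator explicitly configures extra headroom.
public final class AdditionalAvailableSpaceConf {
  static final String ADDITIONAL_AVAILABLE_SPACE_KEY =
      "dfs.datanode.round-robin-volume-choosing-policy.additional-available-space";

  static long getAdditionalAvailableSpace(Configuration conf) {
    return conf.getLong(ADDITIONAL_AVAILABLE_SPACE_KEY, 0L);
  }
}
{code}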

 

 

[jira] [Commented] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-27 Thread Zhihai Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388404#comment-17388404
 ] 

Zhihai Xu commented on HDFS-16111:
--

Thanks [~weichiu] for the review and committing the patch! Thanks [~ywskycn] 
for the review!

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although we later recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on each failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block 
> BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode 
> (BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor: 
> Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First, the upgrade process added some extra disk usage on each disk volume of 
> the datanode:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) 
> is the main upgrade function in the datanode, and it adds some extra storage. 
> The extra storage is all the new directories created in /current//current, 
> although all block data files and block metadata files are hard-linked with 
> /current//previous after the upgrade. Since a lot of new directories are 
> created, this uses some disk space on each disk volume.
>  
> Second, there is a potential bug when picking a disk volume to write a new 
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
> The code that selects a disk checks whether the available space on the 
> selected disk is more than the size in bytes of the block file to store 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). 
> But when creating a new block, two files are created: one is the block file 
> blk_, the other is the block metadata file blk__.meta. This is the code that 
> runs when finalizing a block, where both the block file size and the metadata 
> file size are updated: 
> https://github.com/apache/hadoop/blob/trun

[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanodes.

2021-07-04 Thread Zhihai Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihai Xu updated HDFS-16111:
-
Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid 
failed volumes at datanodes.  (was: Add a configuration to 
RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanode.)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanodes.
> ---
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although we later recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on each failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block 
> BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode 
> (BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor: 
> Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First, the upgrade process added some extra disk usage on each disk volume of 
> the datanode:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) 
> is the main upgrade function in the datanode, and it adds some extra storage. 
> The extra storage is all the new directories created in /current//current, 
> although all block data files and block metadata files are hard-linked with 
> /current//previous after the upgrade. Since a lot of new directories are 
> created, this uses some disk space on each disk volume.
>  
> Second, there is a potential bug when picking a disk volume to write a new 
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
> The code that selects a disk checks whether the available space on the 
> selected disk is more than the size in bytes of the block file to store 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). 
> But when creating a new block, two files are created: one is the block file 
> blk_, the other is the block metadata file blk__.meta. This is the code that 
> runs when finalizing a block, where both the block file size and the metadata 
> file size are updated: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/d

[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes at datanode.

2021-07-04 Thread Zhihai Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihai Xu updated HDFS-16111:
-
Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid 
failed volumes at datanode.  (was: Add a configuration to 
RoundRobinVolumeChoosingPolicy to avoid failed volumes.)

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes 
> at datanode.
> --
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although we later recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on each failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block 
> BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode 
> (BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor: 
> Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First, the upgrade process added some extra disk usage on each disk volume of 
> the datanode:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) 
> is the main upgrade function in the datanode, and it adds some extra storage. 
> The extra storage is all the new directories created in /current//current, 
> although all block data files and block metadata files are hard-linked with 
> /current//previous after the upgrade. Since a lot of new directories are 
> created, this uses some disk space on each disk volume.
>  
> Second, there is a potential bug when picking a disk volume to write a new 
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
> The code that selects a disk checks whether the available space on the 
> selected disk is more than the size in bytes of the block file to store 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). 
> But when creating a new block, two files are created: one is the block file 
> blk_, the other is the block metadata file blk__.meta. This is the code that 
> runs when finalizing a block, where both the block file size and the metadata 
> file size are updated: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdatas

[jira] [Updated] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes.

2021-07-04 Thread Zhihai Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhihai Xu updated HDFS-16111:
-
Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to avoid 
failed volumes.  (was: Add a configuration to RoundRobinVolumeChoosingPolicy to 
avoid picking an almost full volume to place a replica. )

> Add a configuration to RoundRobinVolumeChoosingPolicy to avoid failed volumes.
> --
>
> Key: HDFS-16111
> URL: https://issues.apache.org/jira/browse/HDFS-16111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Zhihai Xu
>Assignee: Zhihai Xu
>Priority: Major
>
> When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
> failed volumes on a lot of datanodes, which caused some missing blocks at that 
> time. Although we later recovered all the missing blocks by symlinking the 
> path (dfs/dn/current) on each failed volume to a new directory and copying all 
> the data to the new directory, we missed our SLA and it delayed our upgrade 
> process on our production cluster for several hours.
> When this issue happened, we saw a lot of these exceptions before the 
> volume failed on the datanode:
>  [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block 
> BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode 
> (BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor: 
> Possible disk error: Failed to create 
> /XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause is
> java.io.IOException: No space left on device
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1012)
>         at 
> org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
>         at java.lang.Thread.run(Thread.java:748)
>  
> We found this issue happened due to the following two reasons:
> First, the upgrade process added some extra disk usage on each disk volume of 
> the datanode:
> BlockPoolSliceStorage.doUpgrade 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) 
> is the main upgrade function in the datanode, and it adds some extra storage. 
> The extra storage is all the new directories created in /current//current, 
> although all block data files and block metadata files are hard-linked with 
> /current//previous after the upgrade. Since a lot of new directories are 
> created, this uses some disk space on each disk volume.
>  
> Second, there is a potential bug when picking a disk volume to write a new 
> block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
> The code that selects a disk checks whether the available space on the 
> selected disk is more than the size in bytes of the block file to store 
> (https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). 
> But when creating a new block, two files are created: one is the block file 
> blk_, the other is the block metadata file blk__.meta. This is the code that 
> runs when finalizing a block, where both the block file size and the metadata 
> file size are updated: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/i

[jira] [Created] (HDFS-16111) Add a configuration to RoundRobinVolumeChoosingPolicy to avoid picking an almost full volume to place a replica.

2021-07-04 Thread Zhihai Xu (Jira)
Zhihai Xu created HDFS-16111:


 Summary: Add a configuration to RoundRobinVolumeChoosingPolicy to 
avoid picking an almost full volume to place a replica. 
 Key: HDFS-16111
 URL: https://issues.apache.org/jira/browse/HDFS-16111
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Zhihai Xu
Assignee: Zhihai Xu


When we upgraded our hadoop cluster from hadoop 2.6.0 to hadoop 3.2.2, we got 
failed volumes on a lot of datanodes, which caused some missing blocks at that 
time. Although we later recovered all the missing blocks by symlinking the 
path (dfs/dn/current) on each failed volume to a new directory and copying all 
the data to the new directory, we missed our SLA and it delayed our upgrade 
process on our production cluster for several hours.

When this issue happened, we saw a lot of these exceptions before the 
volume failed on the datanode:

 [DataXceiver for client at /XX.XX.XX.XX:XXX [Receiving block 
BP-XX-XX.XX.XX.XX-XX:blk_X_XXX]] datanode.DataNode 
(BlockReceiver.java:<init>(289)) - IOException in BlockReceiver constructor: 
Possible disk error: Failed to create 
/XXX/dfs/dn/current/BP-XX-XX.XX.XX.XX-X/tmp/blk_XX. Cause is
java.io.IOException: No space left on device
        at java.io.UnixFileSystem.createFileExclusively(Native Method)
        at java.io.File.createNewFile(File.java:1012)
        at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
        at 
org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createTmpFile(BlockPoolSlice.java:292)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTmpFile(FsVolumeImpl.java:532)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createTemporary(FsVolumeImpl.java:1254)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1598)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:212)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1314)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:768)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:291)
        at java.lang.Thread.run(Thread.java:748)

 

We found this issue happened due to the following two reasons:

First, the upgrade process added some extra disk usage on each disk volume of 
the datanode:

BlockPoolSliceStorage.doUpgrade 
(https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockPoolSliceStorage.java#L445) 
is the main upgrade function in the datanode, and it adds some extra storage. 
The extra storage is all the new directories created in /current//current, 
although all block data files and block metadata files are hard-linked with 
/current//previous after the upgrade. Since a lot of new directories are 
created, this uses some disk space on each disk volume.

 

Second, there is a potential bug when picking a disk volume to write a new 
block file (replica). By default, Hadoop uses RoundRobinVolumeChoosingPolicy. 
The code that selects a disk checks whether the available space on the selected 
disk is more than the size in bytes of the block file to store 
(https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/RoundRobinVolumeChoosingPolicy.java#L86). 
But when creating a new block, two files are created: one is the block file 
blk_, the other is the block metadata file blk__.meta. This is the code that 
runs when finalizing a block, where both the block file size and the metadata 
file size are updated: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/fsdataset/impl/BlockPoolSlice.java#L391 
The current code only considers the size of the block file and doesn't consider 
the size of the block metadata file when choosing a disk in 
RoundRobinVolumeChoosingPolicy. There can also be a lot of ongoing blocks being 
received at the same time; the default maximum number of DataXceiver threads is 
4096. This underestimates the total space needed to write a block, which can 
cause the disk full error above (No space left on device).

 

Since the size of the block metadata file is not fixed,

[jira] [Commented] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-09 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951239#comment-14951239
 ] 

zhihai xu commented on HDFS-9085:
-

Thanks [~cnauroth] for reviewing and committing the patch!

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Fix For: 3.0.0
>
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Attachment: (was: HDFS-9085.002.patch)

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Attachment: HDFS-9085.002.patch

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Attachment: (was: HDFS-9085.002.patch)

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Attachment: HDFS-9085.002.patch

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-08 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949472#comment-14949472
 ] 

zhihai xu commented on HDFS-9085:
-

Thanks for the good suggestion [~cnauroth]! Yes, I uploaded a new patch 
HDFS-9085.002.patch, which added a unit test to verify {{toString}}.

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-10-08 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Attachment: HDFS-9085.002.patch

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: security
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch, HDFS-9085.002.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-09-17 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803307#comment-14803307
 ] 

zhihai xu commented on HDFS-9085:
-

Thanks for the review [~cnauroth]! That is great information. Yes, it makes 
sense to commit the patch to trunk only.

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-09-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Description: 
Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
{{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}} 
didn't show the renewer information. It will be very useful to have renewer 
information to debug security related issue. Because the renewer will be 
filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
real renewer info after applying the rules.

  was:
Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
{{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}} 
didn't show the renewer information. It will be very useful to have renewer 
information to debug security related issue. Because the renewer will be 
filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
real renewer after applying the rules.


> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-09-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Status: Patch Available  (was: Open)

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer info after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-09-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Description: 
Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
{{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}} 
didn't show the renewer information. It will be very useful to have renewer 
information to debug security related issue. Because the renewer will be 
filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
real renewer after applying the rules.

  was:
Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
{{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}} 
didn't show the renewer information. It will be very useful to have renewer 
information to debug security related issue.


> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue. Because the renewer will be 
> filtered by "hadoop.security.auth_to_local", it will be helpful to show the 
> real renewer after applying the rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-09-15 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-9085:

Attachment: HDFS-9085.001.patch

> Show renewer information in DelegationTokenIdentifier#toString
> --
>
> Key: HDFS-9085
> URL: https://issues.apache.org/jira/browse/HDFS-9085
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: zhihai xu
>Assignee: zhihai xu
>Priority: Trivial
> Attachments: HDFS-9085.001.patch
>
>
> Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
> {{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}}
>  didn't show the renewer information. It will be very useful to have renewer 
> information to debug security related issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-9085) Show renewer information in DelegationTokenIdentifier#toString

2015-09-15 Thread zhihai xu (JIRA)
zhihai xu created HDFS-9085:
---

 Summary: Show renewer information in 
DelegationTokenIdentifier#toString
 Key: HDFS-9085
 URL: https://issues.apache.org/jira/browse/HDFS-9085
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Trivial


Show renewer information in {{DelegationTokenIdentifier#toString}}. Currently 
{{org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier}} 
didn't show the renewer information. It will be very useful to have renewer 
information to debug security related issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (HDFS-8847) change TestHDFSContractAppend to not override testRenameFileBeingAppended method.

2015-07-31 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu reopened HDFS-8847:
-

> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.
> -
>
> Key: HDFS-8847
> URL: https://issues.apache.org/jira/browse/HDFS-8847
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
>
> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8847) change TestHDFSContractAppend to not override testRenameFileBeingAppended method.

2015-07-31 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved HDFS-8847.
-
Resolution: Fixed

> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.
> -
>
> Key: HDFS-8847
> URL: https://issues.apache.org/jira/browse/HDFS-8847
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
>
> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (HDFS-8847) change TestHDFSContractAppend to not override testRenameFileBeingAppended method.

2015-07-31 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu resolved HDFS-8847.
-
  Resolution: Fixed
Hadoop Flags: Reviewed

> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.
> -
>
> Key: HDFS-8847
> URL: https://issues.apache.org/jira/browse/HDFS-8847
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
>
> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-8847) change TestHDFSContractAppend to not override testRenameFileBeingAppended method.

2015-07-31 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650152#comment-14650152
 ] 

zhihai xu commented on HDFS-8847:
-

The patch from HADOOP-12268 
(https://issues.apache.org/jira/secure/attachment/12748104/HADOOP-12268.001.patch) 
has a change in the hdfs project, in TestHDFSContractAppend.java.
I committed the change in TestHDFSContractAppend.java to trunk and branch-2.

> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.
> -
>
> Key: HDFS-8847
> URL: https://issues.apache.org/jira/browse/HDFS-8847
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
>
> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8847) change TestHDFSContractAppend to not override testRenameFileBeingAppended method.

2015-07-31 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-8847:

Fix Version/s: 2.8.0

> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.
> -
>
> Key: HDFS-8847
> URL: https://issues.apache.org/jira/browse/HDFS-8847
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: test
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
>
> change TestHDFSContractAppend to not override testRenameFileBeingAppended 
> method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-8847) change TestHDFSContractAppend to not override testRenameFileBeingAppended method.

2015-07-31 Thread zhihai xu (JIRA)
zhihai xu created HDFS-8847:
---

 Summary: change TestHDFSContractAppend to not override 
testRenameFileBeingAppended method.
 Key: HDFS-8847
 URL: https://issues.apache.org/jira/browse/HDFS-8847
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: test
Reporter: zhihai xu
Assignee: zhihai xu


change TestHDFSContractAppend to not override testRenameFileBeingAppended 
method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8814) BlockSender.sendChunks() exception

2015-07-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-8814:

Component/s: HDFS

> BlockSender.sendChunks() exception
> --
>
> Key: HDFS-8814
> URL: https://issues.apache.org/jira/browse/HDFS-8814
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: HDFS
>Affects Versions: 2.6.0, 2.7.1
> Environment: OS: CentOS Linux release 7.1.1503 (Core) 
> Kernel: 3.10.0-229.1.2.el7.x86_64
>Reporter: Marius
>
> Hi
> I was running some streaming jobs with avro files from my hadoop cluster. 
> They performed poorly so i checked the logs of my datanodes and found this:
> http://pastebin.com/DXKJJ55z
> The cluster is running on CentOS machines:
> CentOS Linux release 7.1.1503 (Core) 
> This is the Kernel:
> 3.10.0-229.1.2.el7.x86_64
> No one on the user list replied and I could not find anything helpful on the 
> internet apart from disk failure, which is unlikely to cause this because there 
> are several machines and it is not very likely that all of their disks fail at 
> the same time.
> This error is not reported on the console when running a job; it occurs from 
> time to time, then disappears and comes back again.
> The block size of the cluster is the default value.
> This is my command:
> hadoop jar hadoop-streaming-2.7.1.jar -files mapper.py,reducer.py,avro-1.
> 7.7.jar,avro-mapred-1.7.7-hadoop2.jar -D mapreduce.job.reduces=15 -libjars 
> avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar -input /Y/Y1.avro -output 
> /htest/output -mapper mapper.py -reducer reducer.py -inputformat 
> org.apache.avro.mapred.AvroAsTextInputFormat
> Marius



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Moved] (HDFS-8814) BlockSender.sendChunks() exception

2015-07-23 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu moved HADOOP-12260 to HDFS-8814:
--

Affects Version/s: (was: 2.7.1)
   (was: 2.6.0)
   2.6.0
   2.7.1
  Key: HDFS-8814  (was: HADOOP-12260)
  Project: Hadoop HDFS  (was: Hadoop Common)

> BlockSender.sendChunks() exception
> --
>
> Key: HDFS-8814
> URL: https://issues.apache.org/jira/browse/HDFS-8814
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1, 2.6.0
> Environment: OS: CentOS Linux release 7.1.1503 (Core) 
> Kernel: 3.10.0-229.1.2.el7.x86_64
>Reporter: Marius
>
> Hi
> I was running some streaming jobs with avro files from my hadoop cluster. 
> They performed poorly so i checked the logs of my datanodes and found this:
> http://pastebin.com/DXKJJ55z
> The cluster is running on CentOS machines:
> CentOS Linux release 7.1.1503 (Core) 
> This is the Kernel:
> 3.10.0-229.1.2.el7.x86_64
> No one on the user list replied and I could not find anything helpful on the 
> internet apart from disk failure, which is unlikely to cause this because there 
> are several machines and it is not very likely that all of their disks fail at 
> the same time.
> This error is not reported on the console when running a job; it occurs from 
> time to time, then disappears and comes back again.
> The block size of the cluster is the default value.
> This is my command:
> hadoop jar hadoop-streaming-2.7.1.jar -files mapper.py,reducer.py,avro-1.
> 7.7.jar,avro-mapred-1.7.7-hadoop2.jar -D mapreduce.job.reduces=15 -libjars 
> avro-1.7.7.jar,avro-mapred-1.7.7-hadoop2.jar -input /Y/Y1.avro -output 
> /htest/output -mapper mapper.py -reducer reducer.py -inputformat 
> org.apache.avro.mapred.AvroAsTextInputFormat
> Marius



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-03-20 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371622#comment-14371622
 ] 

zhihai xu commented on HDFS-7835:
-

Thanks [~yzhangal] for your review and commit.

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Affects Versions: 2.7.0
>Reporter: zhihai xu
>Assignee: zhihai xu
> Fix For: 2.8.0
>
> Attachments: HDFS-7835.000.patch, HDFS-7835.001.patch, 
> HDFS-7835.002.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but the retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both the retry count and the delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-03-19 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370821#comment-14370821
 ] 

zhihai xu commented on HDFS-7835:
-

All these test failures are not related to my change.
TestTracing is reported at HDFS-7963.
TestRetryCacheWithHA and TestEncryptionZonesWithKMS passed in my latest 
local build:
{code}
---
 T E S T S
---
Running org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
Tests run: 22, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 60.994 sec - 
in org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA
Results :
Tests run: 22, Failures: 0, Errors: 0, Skipped: 0
---
 T E S T S
---
Running org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
Tests run: 19, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 41.67 sec - in 
org.apache.hadoop.hdfs.TestEncryptionZonesWithKMS
Results :
Tests run: 19, Failures: 0, Errors: 0, Skipped: 0
{code}

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch, HDFS-7835.001.patch, 
> HDFS-7835.002.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-03-19 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370455#comment-14370455
 ] 

zhihai xu commented on HDFS-7835:
-

Hi [~yzhangal], thanks for the review. I uploaded a new patch 
HDFS-7835.002.patch which addressed all your comments. Please review it.

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch, HDFS-7835.001.patch, 
> HDFS-7835.002.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-03-19 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-7835:

Attachment: HDFS-7835.002.patch

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch, HDFS-7835.001.patch, 
> HDFS-7835.002.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-03-12 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359749#comment-14359749
 ] 

zhihai xu commented on HDFS-7835:
-

Hi [~yzhangal], thanks for your thorough review. I uploaded a new patch 
HDFS-7835.001.patch which addressed all your comments.

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch, HDFS-7835.001.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-03-12 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-7835:

Attachment: HDFS-7835.001.patch

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch, HDFS-7835.001.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-02-24 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-7835:

Attachment: HDFS-7835.000.patch

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-02-24 Thread zhihai xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhihai xu updated HDFS-7835:

Status: Patch Available  (was: Open)

> make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> --
>
> Key: HDFS-7835
> URL: https://issues.apache.org/jira/browse/HDFS-7835
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: dfsclient
>Reporter: zhihai xu
>Assignee: zhihai xu
> Attachments: HDFS-7835.000.patch
>
>
> Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
> Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
> DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
> "dfs.client.block.write.locateFollowingBlock.retries". We should also make 
> the initial sleeptime configurable to give users more flexibility to control 
> both retry and delay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7835) make initial sleeptime in locateFollowingBlock configurable for DFSClient.

2015-02-24 Thread zhihai xu (JIRA)
zhihai xu created HDFS-7835:
---

 Summary: make initial sleeptime in locateFollowingBlock 
configurable for DFSClient.
 Key: HDFS-7835
 URL: https://issues.apache.org/jira/browse/HDFS-7835
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: dfsclient
Reporter: zhihai xu
Assignee: zhihai xu


Make initial sleeptime in locateFollowingBlock configurable for DFSClient.
Currently the sleeptime/localTimeout in locateFollowingBlock/completeFile from 
DFSOutputStream is hard-coded as 400 ms, but retries can be configured by 
"dfs.client.block.write.locateFollowingBlock.retries". We should also make the 
initial sleeptime configurable to give users more flexibility to control both 
retry and delay.
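
A simplified sketch of the retry shape described above: an initial sleep that doubles on each retry, with the retry count already configurable and, after this change, the initial delay read from configuration as well. The names below are illustrative, not the actual DFSOutputStream code; retryOnce() stands in for the real NameNode call, and the new key name is assumed from the existing key's naming convention.

{code}
// Illustrative backoff loop only, not the actual DFSOutputStream implementation.
static void callWithBackoff(org.apache.hadoop.conf.Configuration conf,
                            java.util.concurrent.Callable<Boolean> retryOnce)
    throws Exception {
  int retries = conf.getInt(
      "dfs.client.block.write.locateFollowingBlock.retries", 5);
  long sleepTime = conf.getLong(
      "dfs.client.block.write.locateFollowingBlock.initial.delay.ms", 400L); // previously hard-coded 400 ms
  while (!retryOnce.call()) {
    if (retries-- == 0) {
      throw new java.io.IOException("could not complete block allocation after retries");
    }
    Thread.sleep(sleepTime);
    sleepTime *= 2;   // double the delay between attempts
  }
}
{code}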




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (HDFS-7801) "IOException:NameNode still not started" cause DFSClient operation failure without retry.

2015-02-15 Thread zhihai xu (JIRA)
zhihai xu created HDFS-7801:
---

 Summary: "IOException:NameNode still not started" cause DFSClient 
operation failure without retry.
 Key: HDFS-7801
 URL: https://issues.apache.org/jira/browse/HDFS-7801
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client, namenode
Reporter: zhihai xu


"IOException:NameNode still not started" cause DFSClient operation failure 
without retry.
In YARN-1778, TestFSRMStateStore failed randomly, it is due to the 
"java.io.IOException: NameNode still not started".
The stack trace for this Exception is the following:
{code}
2015-02-03 00:09:19,092 INFO  [Thread-110] recovery.TestFSRMStateStore 
(TestFSRMStateStore.java:run(285)) - testFSRMStateStoreClientRetry: Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): NameNode still not 
started
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:973)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2134)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2130)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2128)

at org.apache.hadoop.ipc.Client.call(Client.java:1474)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy23.mkdirs(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:557)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101)
at com.sun.proxy.$Proxy24.mkdirs(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2991)
at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2961)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:973)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$19.doCall(DistributedFileSystem.java:969)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:969)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:962)
at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1869)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore.storeApplicationStateInternal(FileSystemRMStateStore.java:364)
at 
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore$2.run(TestFSRMStateStore.java:273)
2015-02-03 00:09:19,089 INFO  [IPC Server handler 0 on 57792] ipc.Server 
(Server.java:run(2155)) - IPC Server handler 0 on 57792, call 
org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 
Call#14 Retry#1
java.io.IOException: NameNode still not started
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:622)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:636)
at org.apache.had