[ https://issues.apache.org/jira/browse/HDFS-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555717#comment-13555717 ]
Colin Patrick McCabe commented on HDFS-2882:
--------------------------------------------
I tried to reproduce this in the following ways:
*Changing the BPID*
I started the cluster, and then killed the datanode and ran this:
{code}
mv $r/data1/current/$old_bpid $r/data1/current/$new_bpid
sed --in-place "s/blockpoolID=$old_bpid/blockpoolID=$new_bpid/" \
    $r/data1/current/$new_bpid/VERSION
mv $r/data2/current/$old_bpid $r/data2/current/$new_bpid
sed --in-place "s/blockpoolID=$old_bpid/blockpoolID=$new_bpid/" \
    $r/data2/current/$new_bpid/VERSION
{code}
Then I restarted the datanode.
Result: DataNode created a new directory for {{$old_bpid}} and started up
normally.
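As a rough sketch, the rename steps above could be wrapped in a small helper so the same edit is applied to each data directory. The function name {{rename_bpid}} is hypothetical, and the VERSION path simply mirrors the commands above; it assumes GNU sed for {{--in-place}}.

```shell
# Hypothetical helper (not part of Hadoop): rename a block pool directory
# under one data dir and rewrite the blockpoolID recorded in its VERSION file.
# The VERSION location mirrors the commands in this comment.
# Usage: rename_bpid <data_dir> <old_bpid> <new_bpid>
rename_bpid() {
    local data_dir=$1 old_bpid=$2 new_bpid=$3

    # Move the block pool directory to its new name; stop if that fails.
    mv "$data_dir/current/$old_bpid" "$data_dir/current/$new_bpid" || return 1

    # Point the blockpoolID entry at the new name (GNU sed assumed).
    sed --in-place "s/blockpoolID=$old_bpid/blockpoolID=$new_bpid/" \
        "$data_dir/current/$new_bpid/VERSION"
}
```

With the datanode stopped, it would be called once per volume, for example {{rename_bpid $r/data1 $old_bpid $new_bpid}}.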
*Simulating your "disk unwritable because it's full" problem*
{code}
[stop datanode]
rm -rf $r/data[12]/*
sudo chown root $r/data[12]
[restart datanode]
{code}
Result: DataNode startup failed with this:
{code}
15:59:55,978 FATAL DataNode:668 - Initialization failed for block pool Block pool BP-164367671-127.0.0.1-1358201991231 (storage id DS-226544464-127.0.0.1-6100-1358380730839) service to localhost/127.0.0.1:6000
java.io.IOException: All specified directories are not accessible or do not exist.
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:137)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:234)
...
{code}
It seems like this is the result of {{dfs.datanode.failed.volumes.tolerated}}
defaulting to 0. Since we don't tolerate any volumes failing, and not being
able to create the directories causes the volume to fail, the DN startup
aborts.
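To make that rule concrete, here is a minimal sketch of the check as I understand it: count the data directories that cannot be used and compare the count against the tolerated number. This is only an illustration, not the DataNode's actual code, and the real check may differ; {{volumes_ok}} is a made-up name.

```shell
# Illustrative only: models the tolerance rule described in this comment,
# not the DataNode's real implementation.
# Usage: volumes_ok <tolerated_failures> <dir>...
# Returns 0 when the number of unusable directories is within the tolerance.
volumes_ok() {
    local tolerated=$1
    shift
    local failed=0 dir

    for dir in "$@"; do
        # Count a directory as failed if it is missing or not writable.
        if [ ! -d "$dir" ] || [ ! -w "$dir" ]; then
            failed=$((failed + 1))
        fi
    done

    [ "$failed" -le "$tolerated" ]
}
```

With a tolerance of 0, a single unusable directory is enough for {{volumes_ok 0 $r/data1 $r/data2}} to return non-zero, which corresponds to the aborted startup above.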
*Changing the NamespaceID of the Datanode*
{code}
[start the cluster]
[stop the DN]
sed --in-place "s/namespaceID=$old_nsid/namespaceID=$new_nsid/" \
    $r/data*/current/*/current/VERSION
[restart the DN]
{code}
Result: I got this exception:
{code}
17:17:01,898 FATAL DataNode:668 - Initialization failed for block pool Block pool BP-164367671-127.0.0.1-1358201991231 (storage id DS-358917359-127.0.0.1-6100-1358385257855) service to localhost/127.0.0.1:6000
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /r/data1/current/BP-164367671-127.0.0.1-1358201991231 is in an inconsistent state: namespaceID is incompatible with others.
        at org.apache.hadoop.hdfs.server.common.Storage.setNamespaceID(Storage.java:1091)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setFieldsFromProperties(BlockPoolSliceStorage.java:218)
{code}
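For checking this kind of mismatch by hand, a small helper along these lines could list the VERSION files whose namespaceID differs from an expected value. The name {{nsid_mismatches}} is hypothetical, and where the expected ID comes from (for example, the value the rest of the cluster uses) is an assumption left to the caller.

```shell
# Hypothetical helper (not part of Hadoop): report VERSION files whose
# namespaceID entry differs from the expected value supplied by the caller.
# Usage: nsid_mismatches <expected_nsid> <VERSION file>...
nsid_mismatches() {
    local expected=$1
    shift
    local f nsid

    for f in "$@"; do
        # Extract the value after "namespaceID=" from this VERSION file.
        nsid=$(grep '^namespaceID=' "$f" | cut -d= -f2)
        if [ "$nsid" != "$expected" ]; then
            echo "$f: namespaceID=$nsid (expected $expected)"
        fi
    done
}
```

For example, {{nsid_mismatches $old_nsid $r/data*/current/*/current/VERSION}} prints one line per file that was rewritten by the sed command above, and nothing if every file still carries the old ID.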
So I'm not sure under what conditions the DN still starts up when the block
pools fail to initialize. I guess if you change
{{dfs.datanode.failed.volumes.tolerated}} from the default of 0 to something
else, you might be able to get this behavior. But at that point, it seems like
you asked for that behavior. Is it expected that people will change this from
the default? If so, should we make that configuration option not apply at
startup?
> DN continues to start up, even if block pool fails to initialize
> ----------------------------------------------------------------
>
> Key: HDFS-2882
> URL: https://issues.apache.org/jira/browse/HDFS-2882
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.0.2-alpha
> Reporter: Todd Lipcon
> Assignee: Colin Patrick McCabe
> Attachments: hdfs-2882.txt
>
>
> I started a DN on a machine that was completely out of space on one of its
> drives. I saw the following:
> 2012-02-02 09:56:50,499 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-448349972-172.29.5.192-1323816762969 (storage id DS-507718931-172.29.5.194-11072-1297842002148) service to styx01.sf.cloudera.com/172.29.5.192:8021
> java.io.IOException: Mkdirs failed to create /data/1/scratch/todd/styx-datadir/current/BP-448349972-172.29.5.192-1323816762969/tmp
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset$BlockPoolSlice.<init>(FSDataset.java:335)
> but the DN continued to run, spewing NPEs when it tried to do block reports,
> etc. This was on the HDFS-1623 branch but may affect trunk as well.