[jira] [Commented] (HDFS-2882) DN continues to start up, even if block pool fails to initialize

Colin Patrick McCabe (JIRA) Wed, 16 Jan 2013 17:22:21 -0800

    [ 
https://issues.apache.org/jira/browse/HDFS-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555717#comment-13555717
 ]


Colin Patrick McCabe commented on HDFS-2882:
--------------------------------------------

I tried to reproduce this in the following ways:

*Changing the BPID*

I started the cluster, and then killed the datanode and ran this:
{code}
mv $r/data1/current/$old_bpid $r/data1/current/$new_bpid
sed --in-place "s/blockpoolID=$old_bpid/blockpoolID=$new_bpid/" $r/data1/current/$new_bpid/VERSION
mv $r/data2/current/$old_bpid $r/data2/current/$new_bpid
sed --in-place "s/blockpoolID=$old_bpid/blockpoolID=$new_bpid/" $r/data2/current/$new_bpid/VERSION
{code}
Then I restarted the datanode.
Result: DataNode created a new directory for {{$old_bpid}} and started up 
normally.
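
For anyone repeating this with more than two data directories, the same rename can be sketched as a loop. {{rename_bpid}} is a hypothetical helper, not part of Hadoop; it assumes the directory layout used in the commands above and edits VERSION through a temporary file rather than {{sed --in-place}}.

```shell
# Hypothetical helper: rename a block pool directory under every data dir
# below $1 and rewrite blockpoolID in its VERSION file.
# Layout assumed from the commands above: <root>/dataN/current/<bpid>/VERSION
rename_bpid() {
  root=$1
  old_bpid=$2
  new_bpid=$3
  for d in "$root"/data*; do
    # Skip data dirs that do not hold the old block pool.
    [ -d "$d/current/$old_bpid" ] || continue
    mv "$d/current/$old_bpid" "$d/current/$new_bpid"
    version="$d/current/$new_bpid/VERSION"
    # Nothing to rewrite if the VERSION file is missing.
    [ -f "$version" ] || continue
    # In-place edit via a temporary file.
    sed "s/blockpoolID=$old_bpid/blockpoolID=$new_bpid/" "$version" > "$version.tmp" &&
      mv "$version.tmp" "$version"
  done
}
```

Usage would be {{rename_bpid $r $old_bpid $new_bpid}} with the datanode stopped.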

*Simulating your "disk unwritable because it's full" problem.*

{code}
[stop datanode]
rm -rf $r/data[12]/*
sudo chown root $r/data[12]
[restart datanode]
{code} 

Result: DataNode startup failed with this: 
{code}
15:59:55,978 FATAL DataNode:668 - Initialization failed for block pool Block pool BP-164367671-127.0.0.1-1358201991231 (storage id DS-226544464-127.0.0.1-6100-1358380730839) service to localhost/127.0.0.1:6000
java.io.IOException: All specified directories are not accessible or do not exist.
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:137)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:234)
...
{code}

It seems like this is the result of {{dfs.datanode.failed.volumes.tolerated}} 
defaulting to 0.  Since we don't tolerate any failed volumes, and not being 
able to create the directories causes the volume to fail, the DN startup 
aborts.
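
The reasoning above amounts to a simple threshold comparison. The following is only a sketch of that reasoning, not Hadoop's actual check; {{startup_allowed}} and its two arguments are hypothetical names.

```shell
# Hypothetical helper illustrating the reasoning above (not Hadoop code).
# $1 = number of volumes that failed during startup
# $2 = value of dfs.datanode.failed.volumes.tolerated
# Succeeds (exit status 0) when startup may continue, fails otherwise.
startup_allowed() {
  [ "$1" -le "$2" ]
}

# With the default of 0, a single failed volume is already too many.
if startup_allowed 1 0; then
  echo "DN continues"
else
  echo "DN startup aborts"
fi
```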

*Changing the NamespaceID of the Datanode*

{code} 
[start the cluster]
[stop the DN]
sed --in-place "s/namespaceID=$old_nsid/namespaceID=$new_nsid/" $r/data*/current/*/current/VERSION
[restart the DN]
{code}
Result: I got this exception:

{code}
17:17:01,898 FATAL DataNode:668 - Initialization failed for block pool Block pool BP-164367671-127.0.0.1-1358201991231 (storage id DS-358917359-127.0.0.1-6100-1358385257855) service to localhost/127.0.0.1:6000
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /r/data1/current/BP-164367671-127.0.0.1-1358201991231 is in an inconsistent state: namespaceID is incompatible with others.
        at org.apache.hadoop.hdfs.server.common.Storage.setNamespaceID(Storage.java:1091)
        at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.setFieldsFromProperties(BlockPoolSliceStorage.java:218)
{code}
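
To confirm that the edit took effect before restarting, the value recorded in a VERSION file can be read back and compared against an expected ID. This is a minimal sketch; {{read_nsid}} and {{nsid_matches}} are hypothetical helpers, and the expected value is passed in by hand because the trace does not show what the DataNode compares against.

```shell
# Hypothetical helper: print the namespaceID recorded in a VERSION file.
read_nsid() {
  sed -n 's/^namespaceID=//p' "$1"
}

# Succeeds only if the recorded namespaceID equals the expected value in $2.
nsid_matches() {
  [ "$(read_nsid "$1")" = "$2" ]
}
```

For example, {{nsid_matches $r/data1/current/$bpid/current/VERSION $new_nsid}} should succeed after the {{sed}} step, where {{$bpid}} stands for the block pool directory name.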

So I'm not sure under what conditions the DN still starts up when the block 
pools fail to initialize.  I guess if you change 
{{dfs.datanode.failed.volumes.tolerated}} from the default of 0 to something 
else, you might be able to get this behavior.  But at that point, it seems like 
you asked for that behavior.  Is it expected that people will change this from 
the default, and if so, should we make that configuration option not apply on 
startup?
                
> DN continues to start up, even if block pool fails to initialize
> ----------------------------------------------------------------
>
>                 Key: HDFS-2882
>                 URL: https://issues.apache.org/jira/browse/HDFS-2882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Colin Patrick McCabe
>         Attachments: hdfs-2882.txt
>
>
> I started a DN on a machine that was completely out of space on one of its 
> drives. I saw the following:
> 2012-02-02 09:56:50,499 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-448349972-172.29.5.192-1323816762969 (storage id DS-507718931-172.29.5.194-11072-1297842002148) service to styx01.sf.cloudera.com/172.29.5.192:8021
> java.io.IOException: Mkdirs failed to create /data/1/scratch/todd/styx-datadir/current/BP-448349972-172.29.5.192-1323816762969/tmp
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset$BlockPoolSlice.<init>(FSDataset.java:335)
> but the DN continued to run, spewing NPEs when it tried to do block reports, 
> etc. This was on the HDFS-1623 branch but may affect trunk as well.

