[ https://issues.apache.org/jira/browse/HDFS-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829224#comment-13829224 ]

Colin Patrick McCabe commented on HDFS-2882:
--------------------------------------------

If you look at the original description of this JIRA, by Todd, it looks like 
this:

{code}
I started a DN on a machine that was completely out of space on one of its drives. I saw the following:
2012-02-02 09:56:50,499 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-448349972-172.29.5.192-1323816762969 (storage id DS-507718931-172.29.5.194-11072-1297842002148) service to styx01.sf.cloudera.com/172.29.5.192:8021
java.io.IOException: Mkdirs failed to create /data/1/scratch/todd/styx-datadir/current/BP-448349972-172.29.5.192-1323816762969/tmp
        at org.apache.hadoop.hdfs.server.datanode.FSDataset$BlockPoolSlice.<init>(FSDataset.java:335)
but the DN continued to run, spewing NPEs when it tried to do block reports, etc. This was on the HDFS-1623 branch but may affect trunk as well.
{code}
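
For context, the "Mkdirs failed to create" message arises because File#mkdirs simply returns false when the directory can't be created (for example, on a full disk), and the BlockPoolSlice constructor turns that false return into an IOException.  Roughly like this (a simplified illustration, not the exact FSDataset code):

{code}
import java.io.File;
import java.io.IOException;

// Simplified illustration, not the actual FSDataset code: File#mkdirs()
// returns false when the directory cannot be created (e.g. the disk is
// full), and the constructor converts that into the IOException above.
public class MkdirsCheck {
  static void ensureTmpDir(File bpCurrentDir) throws IOException {
    File tmpDir = new File(bpCurrentDir, "tmp");
    if (!tmpDir.exists() && !tmpDir.mkdirs()) {
      throw new IOException("Mkdirs failed to create " + tmpDir.getPath());
    }
  }
}
{code}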

His concern was that the block pool failed to initialize, but the DN 
continued to start up anyway, leading to a system that was not functional.  I 
tried to reproduce this on trunk (as opposed to the HDFS-1623 branch that Todd 
was using).  I was unable to reproduce this behavior: every time I got the 
block pool to fail to initialize, the DN also did not start up.  My theory is 
that this was either a bug that affected only the HDFS-1623 branch, or a bug 
that is related to a race condition that is very hard to reproduce.  You can 
see how I tried to reproduce it here: 
https://issues.apache.org/jira/browse/HDFS-2882?focusedCommentId=13555717&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13555717
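
To be concrete about what "did not start up" means: in every run I tried, the initialization failure propagated out and the DN shut itself down rather than continuing in a half-initialized state.  In outline, something like this (hypothetical names, not the actual DataNode code):

{code}
import java.io.IOException;

// Illustrative sketch (hypothetical names, not the actual DataNode code):
// when block pool initialization throws, the exception propagates and the
// process exits instead of continuing in a half-initialized state.
public class DataNodeStartupSketch {
  void startDataNode() throws IOException {
    try {
      initBlockPools();  // may throw, e.g. "Mkdirs failed to create ..."
    } catch (IOException e) {
      System.err.println("FATAL: Initialization failed for block pool: " + e);
      shutdown();        // stop threads, release resources
      throw e;           // caller exits the process; the DN does not keep running
    }
  }

  void initBlockPools() throws IOException { /* elided */ }
  void shutdown() { /* elided */ }
}
{code}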

Vinay, I don't have a clear idea of what your patch is trying to do.  I can see 
that it adds a retry state machine.  But as you yourself commented, 
BPServiceActor#retrieveNamespaceInfo() already loops until the NameNode 
responds.  So why do we need another retry mechanism?
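
To make that point concrete, the existing loop already amounts to something like this (a simplified, self-contained sketch of the pattern; everything except the method name is illustrative, not the exact BPServiceActor code):

{code}
import java.io.IOException;

// Self-contained sketch of the retry pattern that already exists in
// BPServiceActor#retrieveNamespaceInfo: keep asking the NameNode for its
// namespace info until it answers.
public class RetryUntilNameNodeResponds {
  interface NameNodeProxy {
    String versionRequest() throws IOException;  // stand-in for the real RPC
  }

  static String retrieveNamespaceInfo(NameNodeProxy nn) throws InterruptedException {
    while (true) {
      try {
        return nn.versionRequest();  // succeeds once the NN is reachable
      } catch (IOException e) {
        System.err.println("Problem connecting to server: " + e.getMessage());
      }
      Thread.sleep(5000);            // back off before retrying
    }
  }
}
{code}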

Also, when I asked you whether you had reproduced Todd's problem, I didn't mean 
in a unit test.  I meant: have you actually started up a DN, had it fail to 
initialize a block pool, and seen it continue running anyway?

I also wonder if any of this is addressed in the HDFS-2832 branch, which 
changes the way DataNode storage IDs are handled, among other things.

> DN continues to start up, even if block pool fails to initialize
> ----------------------------------------------------------------
>
>                 Key: HDFS-2882
>                 URL: https://issues.apache.org/jira/browse/HDFS-2882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.0.2-alpha
>            Reporter: Todd Lipcon
>            Assignee: Vinay
>         Attachments: HDFS-2882.patch, HDFS-2882.patch, HDFS-2882.patch, 
> HDFS-2882.patch, HDFS-2882.patch, HDFS-2882.patch, hdfs-2882.txt
>
>
> I started a DN on a machine that was completely out of space on one of its drives. I saw the following:
> 2012-02-02 09:56:50,499 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-448349972-172.29.5.192-1323816762969 (storage id DS-507718931-172.29.5.194-11072-1297842002148) service to styx01.sf.cloudera.com/172.29.5.192:8021
> java.io.IOException: Mkdirs failed to create /data/1/scratch/todd/styx-datadir/current/BP-448349972-172.29.5.192-1323816762969/tmp
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset$BlockPoolSlice.<init>(FSDataset.java:335)
> but the DN continued to run, spewing NPEs when it tried to do block reports, etc. This was on the HDFS-1623 branch but may affect trunk as well.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
