TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on central build or hadoop QA on trunk ("The directory is already locked.")
-------------------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-5163
                 URL: https://issues.apache.org/jira/browse/HBASE-5163
             Project: HBase
          Issue Type: Bug
          Components: test
    Affects Versions: 0.94.0
         Environment: all
            Reporter: nkeywal
            Assignee: nkeywal
            Priority: Minor


The stack trace is typically:
{noformat}
    <error message="Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked." type="java.io.IOException">java.io.IOException: Cannot lock storage /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3. The directory is already locked.
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:290)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
        at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
        at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
        at org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
        // ...
{noformat}

The failure can be reproduced without parallelization and without executing the other
tests in the class. It seems to fail about 5% of the time.

This comes from the naming policy for the directories in
MiniDFSCluster#startDataNodes: the names depend on the number of nodes
*currently* in the cluster, and do not take previous starts/stops into account:
{noformat}
   for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
      if (manageDfsDirs) {
        // i starts at the number of datanodes *currently* in the cluster, so
        // after stopping a non-last datanode, the next start recomputes names
        // that can still belong to a running (and locking) datanode.
        File dir1 = new File(data_dir, "data"+(2*i+1));
        File dir2 = new File(data_dir, "data"+(2*i+2));
        dir1.mkdirs();
        dir2.mkdirs();
      // [...]
{noformat}
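
To make the arithmetic concrete, here is a minimal standalone sketch of the same formula (DirNamingSketch and dirsForNewNode are hypothetical names used for illustration, not Hadoop code):
{noformat}
public class DirNamingSketch {
  // Datanode i (0-based) gets data(2*i+1) and data(2*i+2); the index of a
  // new datanode is simply the number of datanodes currently alive.
  static String[] dirsForNewNode(int liveDatanodes) {
    int i = liveDatanodes;
    return new String[] { "data" + (2 * i + 1), "data" + (2 * i + 2) };
  }

  public static void main(String[] args) {
    // Whichever datanode was stopped, a cluster with 2 live nodes always
    // hands the next datanode the same names; whether they are actually
    // free depends only on *which* datanode died.
    System.out.println(java.util.Arrays.toString(dirsForNewNode(2))); // [data5, data6]
  }
}
{noformat}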

This means that if we want to stop and later restart a datanode, we must always
stop the last one; otherwise the directory names will conflict. The following
test exhibits the behavior:
{noformat}
  @Test
  public void testMiniDFSCluster_startDataNode() throws Exception {
    assertTrue(dfsCluster.getDataNodes().size() == 2);

    // Works: we stop the last datanode, so its directories are free for
    // the datanode started next.
    dfsCluster.stopDataNode(1);
    dfsCluster
      .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);

    // Fails: datanode 0 is not the last one, so the new datanode is
    // assigned the directories still locked by the surviving datanode.
    dfsCluster.stopDataNode(0);
    try {
      dfsCluster
        .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
      fail("There should be an exception because the directory already exists");
    } catch (IOException e) {
      assertTrue(e.getMessage().contains("The directory is already locked."));
      LOG.info("Expected (!) exception caught " + e.getMessage());
    }

    // Works: the only remaining datanode is by definition the last one,
    // so stopping it frees all directories. This brings us back to 2 nodes.
    dfsCluster.stopDataNode(0);
    dfsCluster
      .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
  }
{noformat}


This behavior is then randomly triggered in testLogRollOnDatanodeDeath:
when we do
{noformat}
DatanodeInfo[] pipeline = getPipeline(log);
assertTrue(pipeline.length == fs.getDefaultReplication());
{noformat}

and then kill the datanodes in the pipeline, we will have:
 - most of the time: pipeline = 1 & 2, so after killing 1 & 2 we can start a
new datanode that reuses the directories freed by datanode 2.
 - sometimes: pipeline = 1 & 3. In this case, when we try to launch the new
datanode, it fails because it wants to use the same directories as the
still-alive datanode 2, as traced below.
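
Concretely, assuming three datanodes numbered 1 to 3 holding data1/2, data3/4 and data5/6 respectively, the two cases trace as follows:
{noformat}
kill 1 & 2: one node left -> new datanode gets data3/data4
            -> free (datanode 2 is dead), start succeeds
kill 1 & 3: one node left -> new datanode gets data3/data4
            -> still locked by the live datanode 2, start fails
{noformat}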

There are two ways of fixing the test:
1) Fix the naming rule in MiniDFSCluster#startDataNodes, for example to ensure
that directory names are never reused. But I wonder whether a test case
somewhere (maybe not in HBase) depends on the current behavior.
2) Explicitly kill the first and second datanodes instead of the ones in the
pipeline, so that the names cannot conflict (see the sketch below).
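
For option 2, a minimal sketch against MiniDFSCluster's public API, assuming the cluster starts with three datanodes (this only illustrates the idea; it is not the actual patch):
{noformat}
// Always stop the first two datanodes instead of the ones in the pipeline.
// MiniDFSCluster reindexes its datanode list after a stop, so stopping
// index 0 twice kills the first and the second datanode.
dfsCluster.stopDataNode(0);
dfsCluster.stopDataNode(0);
// One datanode (the former third) is left; the replacement datanode is
// assigned the directories freed by the dead second datanode, so the
// names cannot conflict with a live node.
dfsCluster.startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
{noformat}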

Feedback welcome on which option to choose here...


