[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Zhihong Yu (Commented) (JIRA) Wed, 11 Jan 2012 07:01:06 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13184113#comment-13184113
 ]


Zhihong Yu commented on HBASE-5163:
-----------------------------------

Integrated to TRUNK.

Thanks for the patch, N.

Thanks for the review, Stack.
                
> TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or 
> hadoop QA ("The directory is already locked.")
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5163
>                 URL: https://issues.apache.org/jira/browse/HBASE-5163
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 0.94.0
>         Environment: all
>            Reporter: nkeywal
>            Assignee: nkeywal
>            Priority: Minor
>         Attachments: 5163.patch
>
>
> The stack is typically:
> {noformat}
>     <error message="Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked." 
> type="java.io.IOException">java.io.IOException: Cannot lock storage 
> /tmp/19e3e634-8980-4923-9e72-a5b900a71d63/dfscluster_32a46f7b-24ef-488f-bd33-915959e001f4/dfs/data/data3.
>  The directory is already locked.
>       at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:602)
>       at 
> org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:455)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:376)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.&lt;init&gt;(DataNode.java:290)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1553)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1492)
>       at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1467)
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:417)
>       at 
> org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:460)
>       at 
> org.apache.hadoop.hbase.regionserver.wal.TestLogRolling.testLogRollOnDatanodeDeath(TestLogRolling.java:470)
>         // ...
> {noformat}
> It can be reproduced without parallelization or without executing the other 
> tests in the class. It seems to fail about 5% of the time.
> This comes from the naming policy for the directories in 
> MiniDFSCluster#startDataNode. It depends on the number of nodes *currently* 
> in the cluster, and does not take into account previous starts/stops:
> {noformat}
>    for (int i = curDatanodesNum; i < curDatanodesNum+numDataNodes; i++) {
>       if (manageDfsDirs) {
>         File dir1 = new File(data_dir, "data"+(2*i+1));
>         File dir2 = new File(data_dir, "data"+(2*i+2));
>         dir1.mkdirs();
>         dir2.mkdirs();
>       // [...]
> {noformat}
> This means that it if we want to stop/start a datanode, we should always stop 
> the last one, if not the names will conflict. This test exhibits the behavior:
> {noformat}
>   @Test
>   public void testMiniDFSCluster_startDataNode() throws Exception {
>     assertTrue( dfsCluster.getDataNodes().size() == 2 );
>     // Works, as we kill the last datanode, we can now start a datanode
>     dfsCluster.stopDataNode(1);
>     dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>     // Fails, as it's not the last datanode, the directory will conflict on
>     //  creation
>     dfsCluster.stopDataNode(0);
>     try {
>       dfsCluster
>         .startDataNodes(TEST_UTIL.getConfiguration(), 1, true, null, null);
>       fail("There should be an exception because the directory already 
> exists");
>     } catch (IOException e) {
>       assertTrue( e.getMessage().contains("The directory is already 
> locked."));
>       LOG.info("Expected (!) exception caught " + e.getMessage());
>     }
>     // Works, as we kill the last datanode, we can now restart 2 datanodes
>     // This makes us back with 2 nodes
>     dfsCluster.stopDataNode(0);
>     dfsCluster
>       .startDataNodes(TEST_UTIL.getConfiguration(), 2, true, null, null);
>   }
> {noformat}
> And then this behavior is randomly triggered in testLogRollOnDatanodeDeath 
> because when we do
> {noformat}
> DatanodeInfo[] pipeline = getPipeline(log);
> assertTrue(pipeline.length == fs.getDefaultReplication());
> {noformat}
> and then kill the datanodes in the pipeline, we will have:
>  - most of the time: pipeline = 1 & 2, so after killing 1&2 we can start a 
> new datanode that will reuse the available 2's directory.
>  - sometimes: pipeline = 1 & 3. In this case,when we try to launch the new 
> datanode, it fails because it wants to use the same directory as the still 
> alive '2'.
> There are two ways of fixing the test:
> 1) Fix the naming rule in MiniDFSCluster#startDataNode, for example to ensure 
> that the directory names will not be reused. But I wonder if there is not a 
> testCase somewhere (may be not in HBase) depending on this behavior.
> 2) Kill explicitly the first and second datanode without using the pipeline 
> to be sure that the names won't conflict.
> Feedback welcome and the choice to make here...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-5163) TestLogRolling#testLogRollOnDatanodeDeath fails sometimes on Jenkins or hadoop QA ("The directory is already locked.")

Reply via email to