[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303272#comment-15303272 ]

Vinitha Reddy Gankidi commented on HDFS-10301:
----------------------------------------------

I looked into why the test {{TestAddOverReplicatedStripedBlocks}} fails with 
patch 004. I don't completely understand why the test relies on zombie 
storages being removed while the DN still has stale storages; the test 
probably needs to be modified. Here are my findings:

With the patch, the test fails with the following error:
{code}
java.lang.AssertionError: expected:<10> but was:<11>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:743)
        at org.junit.Assert.assertEquals(Assert.java:118)
        at org.junit.Assert.assertEquals(Assert.java:555)
        at org.junit.Assert.assertEquals(Assert.java:542)
        at org.apache.hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks.testProcessOverReplicatedAndMissingStripedBlock(TestAddOverReplicatedStripedBlocks.java:281)
{code}

In the test, {{DFSUtil.createStripedFile}} is invoked at the beginning.
{code}
  /**
   * Creates the metadata of a file in striped layout. This method only
   * manipulates the NameNode state without injecting data into DataNodes.
   * You should disable periodical heartbeat before using this.
   * @param file Path of the file to create
   * @param dir Parent path of the file
   * @param numBlocks Number of striped block groups to add to the file
   * @param numStripesPerBlk Number of striped cells in each block
   * @param toMkdir
   */
  public static void createStripedFile(MiniDFSCluster cluster, Path file,
      Path dir, int numBlocks, int numStripesPerBlk, boolean toMkdir)
      throws Exception {
{code}

This internally calls {{DFSUtil.addBlockToFile}}, which mimics block 
reports. While processing these mimicked incremental block reports, the 
NameNode updates the datanode storages. In the test output you can see the 
storages being added:
{code}
2016-05-26 17:10:03,330 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 9505a2ad-78f4-45d7-9c13-2ecd92a06866 for DN 127.0.0.1:60835
2016-05-26 17:10:03,331 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID d4bb2f70-4a1e-451f-9d47-a2967f819130 for DN 127.0.0.1:60839
2016-05-26 17:10:03,332 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 841fc92f-fa15-4ced-8487-96ca4e6996d0 for DN 127.0.0.1:60844
2016-05-26 17:10:03,332 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 304aaeeb-e2d0-4427-81c6-c79e4d0b6a4e for DN 127.0.0.1:60849
2016-05-26 17:10:03,332 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 2d046d66-26fc-448f-938c-04dda2ecf34a for DN 127.0.0.1:60853
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 381d3151-e75e-434a-86f8-da5c83f22b19 for DN 127.0.0.1:60857
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 71f72bc9-9c66-478f-a0d7-3f0c7fc23964 for DN 127.0.0.1:60861
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 4dc539f3-b7a9-4145-a313-fa99ca1dd779 for DN 127.0.0.1:60865
2016-05-26 17:10:03,333 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 734ea366-e635-4715-97d5-196bfcdccb18 for DN 127.0.0.1:60869
2016-05-26 17:10:03,334 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID c639de06-e85c-4e93-92d2-506a49d4e41c for DN 127.0.0.1:60835
2016-05-26 17:10:03,343 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID a82ff231-d630-4799-907d-f0a72ff06b38 for DN 127.0.0.1:60839
2016-05-26 17:10:03,343 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 328c3467-0507-45fd-9aac-73a38165f741 for DN 127.0.0.1:60844
2016-05-26 17:10:03,343 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 0b2a3b7f-e065-4e9a-9908-024091393738 for DN 127.0.0.1:60849
2016-05-26 17:10:03,344 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 3654a0ce-8389-40bf-b8d3-08cc49895a7d for DN 127.0.0.1:60853
2016-05-26 17:10:03,344 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 8072cc31-5567-4c04-8f71-7a8ee03c2fe0 for DN 127.0.0.1:60857
2016-05-26 17:10:03,344 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 0202860d-4aad-4996-a325-23a34f052cb2 for DN 127.0.0.1:60861
2016-05-26 17:10:03,345 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 5415d95d-c173-4458-be78-d3fa95652589 for DN 127.0.0.1:60865
2016-05-26 17:10:03,345 [Thread-0] INFO  blockmanagement.DatanodeDescriptor (DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID 14570c81-1dc1-4479-a65a-5b61944d4b94 for DN 127.0.0.1:60869
2016-05-26 17:10:03,359 [IPC Server handler 9 on 60834] INFO  hdfs.StateChange (FSNamesystem.java:completeFile(2663)) - DIR* completeFile: /striped/file is closed by DFSClient_NONMAPREDUCE_865500748_10
{code}
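The implicit registration shown in the logs above can be sketched as follows. This is a minimal model with hypothetical names ({{StorageModel}}, {{Storage}}), not the actual Hadoop classes; only the behavior mirrors {{DatanodeDescriptor.updateStorage}} as described here: a storage ID the NameNode has never seen is added on the fly, with its {{lastBlockReportId}} left at zero.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the real DatanodeDescriptor: models how a
// storage mentioned in a (mimicked) report is registered on first sight,
// with lastBlockReportId == 0, i.e. not yet covered by a full block report.
public class StorageModel {
    static class Storage {
        final String id;
        long lastBlockReportId = 0L;  // zero => no full block report seen yet

        Storage(String id) { this.id = id; }

        boolean isStale() { return lastBlockReportId == 0L; }
    }

    private final Map<String, Storage> storageMap = new HashMap<>();

    // Unknown storage IDs are registered on first mention, as in the test.
    Storage updateStorage(String storageId) {
        return storageMap.computeIfAbsent(storageId, id -> {
            System.out.println("Adding new storage ID " + id);
            return new Storage(id);
        });
    }

    public static void main(String[] args) {
        StorageModel dn = new StorageModel();
        // A faked incremental report mentions a storage the DN never created:
        Storage s = dn.updateStorage("9505a2ad-78f4-45d7-9c13-2ecd92a06866");
        System.out.println("stale=" + s.isStale());  // stale=true
    }
}
```

In this model the storage starts out stale, which is exactly the state the implicitly-added storages are in after {{DFSUtil.addBlockToFile}} runs.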

When these storages are added, {{lastBlockReportId}} is set to zero, so each 
storage is considered stale. Since the DN doesn't actually know about these 
storages, they are not included in the next block report. They are therefore 
treated as zombie storages and removed, and one of them holds a replica. 
Relevant logs:
{code}
2016-05-26 17:10:03,383 [Block report processor] WARN  blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2239)) - processReport 0x6aedc669a6437553: removing zombie storage c639de06-e85c-4e93-92d2-506a49d4e41c, which no longer exists on the DataNode.
2016-05-26 17:10:03,384 [Block report processor] WARN  blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2263)) - processReport 0x6aedc669a6437553: removed 0 replicas from storage c639de06-e85c-4e93-92d2-506a49d4e41c, which no longer exists on the DataNode.

2016-05-26 17:10:03,416 [Block report processor] WARN  blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2239)) - processReport 0xf7e24bf2690ca946: removing zombie storage 0202860d-4aad-4996-a325-23a34f052cb2, which no longer exists on the DataNode.
2016-05-26 17:10:03,416 [Block report processor] WARN  blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2263)) - processReport 0xf7e24bf2690ca946: removed 0 replicas from storage 0202860d-4aad-4996-a325-23a34f052cb2, which no longer exists on the DataNode.

2016-05-26 17:10:04,217 [Block report processor] WARN  blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2239)) - processReport 0xe361b2d0f2b49c0c: removing zombie storage 14570c81-1dc1-4479-a65a-5b61944d4b94, which no longer exists on the DataNode.
2016-05-26 17:10:04,219 [Block report processor] WARN  blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2263)) - processReport 0xe361b2d0f2b49c0c: removed 1 replicas from storage 14570c81-1dc1-4479-a65a-5b61944d4b94, which no longer exists on the DataNode.
{code}
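The pruning rule behind those warnings can be sketched like this. All names here are hypothetical ({{ZombiePruneSketch}} is not Hadoop's {{BlockManager}}); it only illustrates the mechanism described above: after a full block report, any storage that was not stamped with the current report's ID is presumed gone from the DataNode and is removed along with its replicas.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hedged sketch of zombie pruning (hypothetical names, not the real
// BlockManager.removeZombieReplicas): stamp reported storages with the
// report ID, then remove every storage whose stamp doesn't match.
public class ZombiePruneSketch {
    static class Storage {
        final String id;
        long lastBlockReportId;  // zero until a full block report stamps it
        int replicaCount;

        Storage(String id, int replicas) { this.id = id; this.replicaCount = replicas; }
    }

    static List<String> processFullReport(Map<String, Storage> storages,
                                          Set<String> reportedIds, long reportId) {
        // Stamp only the storages the DataNode actually reported.
        for (String sid : reportedIds) {
            Storage s = storages.get(sid);
            if (s != null) s.lastBlockReportId = reportId;
        }
        // Everything left unstamped is treated as a zombie and pruned.
        List<String> removed = new ArrayList<>();
        for (Iterator<Storage> it = storages.values().iterator(); it.hasNext();) {
            Storage s = it.next();
            if (s.lastBlockReportId != reportId) {
                System.out.println("removing zombie storage " + s.id
                        + " with " + s.replicaCount + " replicas");
                removed.add(s.id);
                it.remove();
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        Map<String, Storage> storages = new LinkedHashMap<>();
        storages.put("real-storage", new Storage("real-storage", 5));
        // Implicitly registered via the faked reports; the DN never reports it:
        storages.put("implicit-storage", new Storage("implicit-storage", 1));
        List<String> removed = processFullReport(
                storages, Collections.singleton("real-storage"), 0x6aedc669L);
        System.out.println(removed);  // [implicit-storage]
    }
}
```

Since the implicitly-added storages are never reported by the DN, this rule removes them on the very next full block report, including the one that holds a replica.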

With patch 004, zombie storages are not removed while the DN has stale 
storages. Are there real scenarios where this situation can arise? Since the 
zombie storages are not removed and one of them holds a replica, the block 
ends up with 11 replicas instead of the expected 10 and the assertion fails. 
This test was introduced in HDFS-8827.



> BlockReport retransmissions may lead to storages falsely being declared 
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, 
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch, 
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out while sending a block 
> report, and then it sends the block report again. The NameNode, processing 
> these two reports at the same time, can interleave processing of storages 
> from different reports. This corrupts the blockReportId field, which makes 
> the NameNode think that some storages are zombies. Replicas from zombie 
> storages are immediately removed, causing missing blocks.


