[
https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15303272#comment-15303272
]
Vinitha Reddy Gankidi commented on HDFS-10301:
----------------------------------------------
I looked into why {{TestAddOverReplicatedStripedBlocks}} fails with patch
004. I don't fully understand why the test relies on zombie storages being
removed while the DN still has stale storages; the test probably needs to be
modified. Here are my findings:
With the patch, the test fails with the following error:
{code}
java.lang.AssertionError: expected:<10> but was:<11>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at
org.apache.hadoop.hdfs.server.namenode.TestAddOverReplicatedStripedBlocks.testProcessOverReplicatedAndMissingStripedBlock(TestAddOverReplicatedStripedBlocks.java:281)
{code}
In the test, {{DFSUtil.createStripedFile}} is invoked at the beginning:
{code}
/**
 * Creates the metadata of a file in striped layout. This method only
 * manipulates the NameNode state without injecting data to DataNodes.
 * You should disable the periodic heartbeat before using this.
 * @param file Path of the file to create
 * @param dir Parent path of the file
 * @param numBlocks Number of striped block groups to add to the file
 * @param numStripesPerBlk Number of striped cells in each block
 * @param toMkdir
 */
public static void createStripedFile(MiniDFSCluster cluster, Path file,
    Path dir, int numBlocks, int numStripesPerBlk, boolean toMkdir)
    throws Exception {
{code}
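For reference, a hypothetical invocation of this helper (the {{/striped/file}} path is confirmed by the logs below; the block counts are assumed, not necessarily the test's actual values):
{code}
// Hypothetical set-up mirroring the test (counts are assumed; only the
// /striped/file path is confirmed by the logs below). This creates
// NameNode-side metadata only; no data is written to DataNodes.
Path dir = new Path("/striped");
Path file = new Path(dir, "file");
DFSUtil.createStripedFile(cluster, file, dir,
    2 /* numBlocks, assumed */, 2 /* numStripesPerBlk, assumed */,
    true /* toMkdir */);
{code}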
This internally calls {{DFSUtil.addBlockToFile}}, which mimics block
reports. While processing these mimicked reports, the NameNode updates the
DataNode's storages via {{DatanodeDescriptor#updateStorage}} (a simplified
sketch of that path follows the log excerpt below). In the test output, you
can see the storages being added:
{code}
2016-05-26 17:10:03,330 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
9505a2ad-78f4-45d7-9c13-2ecd92a06866 for DN 127.0.0.1:60835
2016-05-26 17:10:03,331 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
d4bb2f70-4a1e-451f-9d47-a2967f819130 for DN 127.0.0.1:60839
2016-05-26 17:10:03,332 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
841fc92f-fa15-4ced-8487-96ca4e6996d0 for DN 127.0.0.1:60844
2016-05-26 17:10:03,332 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
304aaeeb-e2d0-4427-81c6-c79e4d0b6a4e for DN 127.0.0.1:60849
2016-05-26 17:10:03,332 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
2d046d66-26fc-448f-938c-04dda2ecf34a for DN 127.0.0.1:60853
2016-05-26 17:10:03,333 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
381d3151-e75e-434a-86f8-da5c83f22b19 for DN 127.0.0.1:60857
2016-05-26 17:10:03,333 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
71f72bc9-9c66-478f-a0d7-3f0c7fc23964 for DN 127.0.0.1:60861
2016-05-26 17:10:03,333 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
4dc539f3-b7a9-4145-a313-fa99ca1dd779 for DN 127.0.0.1:60865
2016-05-26 17:10:03,333 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
734ea366-e635-4715-97d5-196bfcdccb18 for DN 127.0.0.1:60869
2016-05-26 17:10:03,334 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
c639de06-e85c-4e93-92d2-506a49d4e41c for DN 127.0.0.1:60835
2016-05-26 17:10:03,343 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
a82ff231-d630-4799-907d-f0a72ff06b38 for DN 127.0.0.1:60839
2016-05-26 17:10:03,343 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
328c3467-0507-45fd-9aac-73a38165f741 for DN 127.0.0.1:60844
2016-05-26 17:10:03,343 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
0b2a3b7f-e065-4e9a-9908-024091393738 for DN 127.0.0.1:60849
2016-05-26 17:10:03,344 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
3654a0ce-8389-40bf-b8d3-08cc49895a7d for DN 127.0.0.1:60853
2016-05-26 17:10:03,344 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
8072cc31-5567-4c04-8f71-7a8ee03c2fe0 for DN 127.0.0.1:60857
2016-05-26 17:10:03,344 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
0202860d-4aad-4996-a325-23a34f052cb2 for DN 127.0.0.1:60861
2016-05-26 17:10:03,345 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
5415d95d-c173-4458-be78-d3fa95652589 for DN 127.0.0.1:60865
2016-05-26 17:10:03,345 [Thread-0] INFO blockmanagement.DatanodeDescriptor
(DatanodeDescriptor.java:updateStorage(912)) - Adding new storage ID
14570c81-1dc1-4479-a65a-5b61944d4b94 for DN 127.0.0.1:60869
2016-05-26 17:10:03,359 [IPC Server handler 9 on 60834] INFO hdfs.StateChange
(FSNamesystem.java:completeFile(2663)) - DIR* completeFile: /striped/file is
closed by DFSClient_NONMAPREDUCE_865500748_10
{code}
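For context, a simplified sketch of the {{DatanodeDescriptor#updateStorage}} path that produces the log lines above (paraphrased from the observed behavior, not copied from the source): a storage ID the NameNode has never seen is simply registered on the fly.
{code}
// Paraphrased sketch of DatanodeDescriptor#updateStorage (an
// approximation, not the literal HDFS source): an unknown storage ID
// is registered on the spot, producing the "Adding new storage ID"
// log lines shown above.
DatanodeStorageInfo updateStorage(DatanodeStorage storage) {
  synchronized (storageMap) {
    DatanodeStorageInfo storageInfo = storageMap.get(storage.getStorageID());
    if (storageInfo == null) {
      LOG.info("Adding new storage ID " + storage.getStorageID()
          + " for DN " + getXferAddr());
      storageInfo = new DatanodeStorageInfo(this, storage);
      storageMap.put(storage.getStorageID(), storageInfo);
    }
    return storageInfo;
  }
}
{code}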
When these storages are added, their lastBlockReportId is set to zero, so
each of them is considered a stale storage. Since the DN doesn't know about
these storages, they are not reported in the next block report; they are
therefore treated as zombie storages and removed (see the sketch after the
log excerpt below). One of these zombie storages has a replica. Relevant
logs:
{code}
2016-05-26 17:10:03,383 [Block report processor] WARN
blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2239)) -
processReport 0x6aedc669a6437553: removing zombie storage
c639de06-e85c-4e93-92d2-506a49d4e41c, which no longer exists on the DataNode.
2016-05-26 17:10:03,384 [Block report processor] WARN
blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2263)) -
processReport 0x6aedc669a6437553: removed 0 replicas from storage
c639de06-e85c-4e93-92d2-506a49d4e41c, which no longer exists on the DataNode.
2016-05-26 17:10:03,416 [Block report processor] WARN
blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2239)) -
processReport 0xf7e24bf2690ca946: removing zombie storage
0202860d-4aad-4996-a325-23a34f052cb2, which no longer exists on the DataNode.
2016-05-26 17:10:03,416 [Block report processor] WARN
blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2263)) -
processReport 0xf7e24bf2690ca946: removed 0 replicas from storage
0202860d-4aad-4996-a325-23a34f052cb2, which no longer exists on the DataNode.
2016-05-26 17:10:04,217 [Block report processor] WARN
blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2239)) -
processReport 0xe361b2d0f2b49c0c: removing zombie storage
14570c81-1dc1-4479-a65a-5b61944d4b94, which no longer exists on the DataNode.
2016-05-26 17:10:04,219 [Block report processor] WARN
blockmanagement.BlockManager (BlockManager.java:removeZombieReplicas(2263)) -
processReport 0xe361b2d0f2b49c0c: removed 1 replicas from storage
14570c81-1dc1-4479-a65a-5b61944d4b94, which no longer exists on the DataNode.
{code}
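A minimal sketch of the stale/zombie bookkeeping as I understand it (paraphrased, not the literal {{BlockManager}} code): each full block report stamps the storages it covers with the report's ID, and any storage left with a stale ID is presumed to no longer exist on the DataNode.
{code}
// Paraphrased sketch of the zombie check in BlockManager (an
// assumption, not the literal source): a storage whose
// lastBlockReportId does not match the current report's ID was not
// covered by this full report, so it is declared a zombie and its
// replicas are removed.
for (DatanodeStorageInfo storage : node.getStorageInfos()) {
  if (storage.getLastBlockReportId() != curBlockReportId) {
    removeZombieReplicas(context, storage); // triggers the WARNs above
  }
}
{code}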
With patch 004, zombie storages are not removed while the DN has stale
storages. Are there real scenarios where this can happen? Since the zombie
storages are not removed and one of them has a replica, the assertion fails
(11 replicas instead of the expected 10). This test was introduced in
HDFS-8827.
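For illustration, the guard I understand patch 004 to add, as a sketch (the {{hasStaleStorages()}} helper here is hypothetical shorthand, not necessarily the patch's actual API):
{code}
// Sketch of the patch-004 behavior (an assumption based on the test
// failure, not the literal diff): while any storage on this DN is
// still stale, defer zombie-storage pruning instead of removing
// storages that simply have not been reported yet.
if (node.hasStaleStorages()) { // hypothetical helper name
  return; // skip zombie removal until every storage has reported
}
{code}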
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10301
> URL: https://issues.apache.org/jira/browse/HDFS-10301
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.1
> Reporter: Konstantin Shvachko
> Assignee: Colin Patrick McCabe
> Priority: Critical
> Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.01.patch,
> HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report
> and sends the block report again. The NameNode, while processing these two
> reports at the same time, can interleave processing of storages from the
> different reports. This screws up the blockReportId field, which makes the
> NameNode think that some storages are zombies. Replicas from zombie
> storages are immediately removed, causing missing blocks.