[
https://issues.apache.org/jira/browse/HDFS-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rushabh S Shah updated HDFS-8245:
---------------------------------
Attachment: HDFS-8245.patch
I have just added a check in BlockManager#removeStoredBlock to add the block to
pendingDNMessages if the namenode is standby.
One test case, TestDNFencing#testDnFencing, fails because of this change.
The test case was checking for postponedMisreplicatedBlocksCount after the
failover.
Since the change adds the deleted block to the pending messages queue, and
during the transition to active the deleted-block request is dequeued,
postponedMisreplicatedBlocksCount will be zero after the failover.
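The check can be illustrated with a minimal, self-contained sketch. All class, field, and method names here are simplified stand-ins for the real BlockManager internals, not the actual Hadoop APIs:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of the proposed check; names are hypothetical stand-ins.
public class RemoveStoredBlockSketch {
    enum HAState { ACTIVE, STANDBY }

    // Stands in for the PendingDataNodeMessages queue that holds datanode
    // reports the standby cannot process yet.
    static final Map<Long, List<String>> pendingDNMessages = new HashMap<>();
    static HAState haState = HAState.STANDBY;

    // On a standby NN a DELETED_BLOCK report may arrive before the block's
    // add has been read from the edit log, so queue the report instead of
    // dropping it; only an active NN removes the replica immediately.
    static void removeStoredBlock(long blockId) {
        if (haState == HAState.STANDBY) {
            pendingDNMessages
                .computeIfAbsent(blockId, k -> new ArrayList<>())
                .add("DELETED_BLOCK");
            return;
        }
        // Active NN path: remove the replica from the blocks map (elided).
    }

    public static void main(String[] args) {
        removeStoredBlock(1570321882L);
        System.out.println(pendingDNMessages.get(1570321882L)); // [DELETED_BLOCK]
    }
}
```

When the standby later transitions to active, the queued DELETED_BLOCK messages are replayed, which is why the test's expected postponedMisreplicatedBlocksCount drops to zero.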
> Standby namenode doesn't process DELETED_BLOCK if the add block request is in
> edit log.
> ---------------------------------------------------------------------------------------
>
> Key: HDFS-8245
> URL: https://issues.apache.org/jira/browse/HDFS-8245
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Rushabh S Shah
> Assignee: Rushabh S Shah
> Attachments: HDFS-8245.patch
>
>
> The following series of events happened on Standby namenode :
> 2015-04-09 07:47:21,735 \[Edit log tailer] INFO ha.EditLogTailer: Triggering
> log roll on remote NameNode Active Namenode (ANN)
> 2015-04-09 07:58:01,858 \[Edit log tailer] INFO ha.EditLogTailer: Triggering
> log roll on remote NameNode ANN
> The following series of events happened on Active Namenode:
> 2015-04-09 07:47:21,747 \[IPC Server handler 99 on 8020] INFO
> namenode.FSNamesystem: Roll Edit Log from Standby NN (SNN)
> 2015-04-09 07:58:01,868 \[IPC Server handler 18 on 8020] INFO
> namenode.FSNamesystem: Roll Edit Log from SNN
> The following series of events happened on datanode ( {color:red} datanodeA
> {color}):
> 2015-04-09 07:52:15,817 \[DataXceiver for client
> DFSClient_attempt_1428022041757_102831_r_000107_0_1139131345_1 at /xxxx:51078
> \[Receiving block
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867]] INFO
> datanode.DataNode: Receiving
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867 src:
> /client:51078 dest: /{color:red}datanodeA:1004{color}
> 2015-04-09 07:52:15,969 \[PacketResponder:
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867,
> type=HAS_DOWNSTREAM_IN_PIPELINE] INFO DataNode.clienttrace: src:
> /client:51078, dest: /{color:red}datanodeA:1004{color}, bytes: 20, op:
> HDFS_WRITE, cliID:
> DFSClient_attempt_1428022041757_102831_r_000107_0_1139131345_1, offset: 0,
> srvID: 356a8a98-826f-446d-8f4c-ce288c1f0a75, blockid:
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867, duration:
> 148948385
> 2015-04-09 07:52:15,969 \[PacketResponder:
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867,
> type=HAS_DOWNSTREAM_IN_PIPELINE] INFO datanode.DataNode: PacketResponder:
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867,
> type=HAS_DOWNSTREAM_IN_PIPELINE terminating
> 2015-04-09 07:52:25,970 \[DataXceiver for client /{color:red}<datanodeB>
> {color}:52827 \[Copying block
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867]] INFO
> datanode.DataNode: Copied
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867 to
> <{color:red}datanodeB{color}>:52827
> 2015-04-09 07:52:28,187 \[DataNode: heartbeating to ANN:8020] INFO
> impl.FsDatasetAsyncDiskService: Scheduling blk_1570321882_1102029183867 file
> <path>/blk_1570321882 for deletion
> 2015-04-09 07:52:28,188 \[Async disk worker #1482 for volume ] INFO
> impl.FsDatasetAsyncDiskService: Deleted BP-595383232-xxxx-1360869396230
> blk_1570321882_1102029183867 file <path>/blk_1570321882
> Then we failed over for an upgrade, and the standby became active.
> When we ran an ls command on this file, we got the following exception:
> 15/04/09 22:07:39 WARN hdfs.BlockReaderFactory: I/O error constructing remote
> block reader.
> java.io.IOException: Got error for OP_READ_BLOCK, self=/client:32947,
> remote={color:red}datanodeA:1004{color}, for file <filename>, for pool
> BP-595383232-xxxx-1360869396230 block 1570321882_1102029183867
> at
> org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)
> at
> org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:815)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:693)
> at
> org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:351)
> at
> org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:576)
> at
> org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:800)
> at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
> at
> org.apache.hadoop.fs.shell.CopyCommands$Merge.processArguments(CopyCommands.java:97)
> at
> org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
> at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
> at org.apache.hadoop.fs.FsShell.run(FsShell.java:287)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> at org.apache.hadoop.fs.FsShell.main(FsShell.java:340)
> Eventually this request was served by {color:red} datanodeB {color} on the
> second attempt:
> 15/04/09 22:07:39 INFO hdfs.DFSClient: Successfully connected to
> {color:red}datanodeB:1004{color} for
> BP-595383232-xxxx-1360869396230:blk_1570321882_1102029183867
> On further analyzing this problem, I found that when {color:red} datanodeA
> {color} notified the namenode after deleting that block, the then-standby
> namenode didn't queue the delete request to PendingDNMessages.
> So when the then SNN read the edits, it didn't know that the block on
> {color:red}datanodeA {color} had been deleted.
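The race described above can be condensed into a tiny simulation. It only mimics the bookkeeping, not the real BlockManager; every name here is hypothetical:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical simulation of the standby's bookkeeping during the race.
public class StaleReplicaTimeline {
    static boolean staleLocationAfterFailover() {
        Set<Long> blocksMap = new HashSet<>();  // blocks the standby knows about
        Set<Long> locations = new HashSet<>();  // blocks it believes datanodeA holds
        long blk = 1570321882L;

        // 1. datanodeA deletes the block and reports DELETED_BLOCK.
        //    The block's add is still in the edit log, so it is not in
        //    blocksMap yet and the report is simply dropped (the bug).
        boolean reportDropped = !blocksMap.contains(blk);

        // 2. The standby tails the edit log, learns about the block, and
        //    records datanodeA as a replica location.
        blocksMap.add(blk);
        locations.add(blk);

        // 3. After failover, clients are sent to datanodeA, which no longer
        //    has the block -> the OP_READ_BLOCK error in the stack trace.
        return reportDropped && locations.contains(blk);
    }

    public static void main(String[] args) {
        System.out.println("stale location served: " + staleLocationAfterFailover());
    }
}
```

Queuing the dropped report in step 1 (the patch's check) is what prevents the stale location from surviving the failover.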
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)