[
https://issues.apache.org/jira/browse/HDFS-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827725#comment-16827725
]
Stephen O'Donnell commented on HDFS-13677:
------------------------------------------
Yes, and I think the reason the problem did not reproduce earlier is that my
new disk was empty. Re-adding a non-empty disk is what causes the problem to
manifest. Steps to reproduce:
1. Single-node local cluster with 1 storage configured. Add 10 files or so,
giving a known number of blocks.
2. Reconfigure to add a second storage. Add, say, 5 more files to put 2 or 3
blocks on the new disk.
3. Reconfigure to remove the second storage, then add it again. Due to this
bug, the DN should then generate an FBR containing only the 2 or 3 blocks on
the re-added disk, i.e. missing those from the original storage. A rough
sketch of these steps as a test follows.
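For reference, here is a rough sketch of those steps as a MiniDFSCluster test.
This is untested: the storagesPerDatanode() builder option and the direct
reconfigurePropertyImpl() call mirror what the existing hot-swap tests do (so
a real version would live in the datanode test package), and the paths and
file sizes are placeholders, not verified code.
{code}
// Sketch only: approximates the manual repro steps above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

public class ReAddNonEmptyVolumeRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(1).storagesPerDatanode(1).build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();
      DataNode dn = cluster.getDataNodes().get(0);

      // Step 1: put a known number of blocks on the original volume.
      for (int i = 0; i < 10; i++) {
        DFSTestUtil.createFile(fs, new Path("/f" + i), 1024L, (short) 1, 0L);
      }

      String oldDirs = dn.getConf().get(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY);
      String newDir = cluster.getDataDirectory() + "/data_readd";

      // Step 2: hot-add a second volume and land a few blocks on it.
      dn.reconfigurePropertyImpl(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, oldDirs + "," + newDir);
      for (int i = 10; i < 15; i++) {
        DFSTestUtil.createFile(fs, new Path("/f" + i), 1024L, (short) 1, 0L);
      }

      // Step 3: remove the second volume, then re-add it while it still
      // holds replicas. The FBR after the re-add should now undercount.
      dn.reconfigurePropertyImpl(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, oldDirs);
      dn.reconfigurePropertyImpl(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, oldDirs + "," + newDir);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}
Running the steps by hand on a local cluster, the logs bear this out: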
{code}
<DN with 1 volume and 10 blocks just restarted>
2019-04-27 19:50:12,801 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c634987, containing 1 storage report(s), of which we sent 1. The
reports had 10 total blocks and used 1 RPC(s). This took 3 msec to generate and
20 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
{code}
Now I add an empty disk and we see no issues; still 10 blocks reported:
{code}
2019-04-27 19:52:18,085 INFO impl.FsDatasetImpl: Added volume -
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2, StorageType: DISK
2019-04-27 19:52:18,085 INFO datanode.DataNode: Successfully added volume:
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2
2019-04-27 19:52:18,086 INFO datanode.DataNode: Block pool
BP-1999061334-192.168.0.24-1556390848658 (Datanode Uuid
242e590a-b1e3-4a1e-9b32-e67cc095bb0f): scheduling a full block report.
2019-04-27 19:52:18,087 INFO datanode.DataNode: Forcing a full block report to
localhost/127.0.0.1:8020
2019-04-27 19:52:18,087 INFO conf.ReconfigurableBase: Property
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not
configurable: old value: org.apache.hadoop.ipc.ProtobufRpcEngine, new value:
null
2019-04-27 19:52:18,089 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c634988, containing 2 storage report(s), of which we sent 2. The
reports had 10 total blocks and used 1 RPC(s). This took 1 msec to generate and
2 msecs for RPC and NN processing. Got back no commands.
{code}
Now add 5 more files, giving 15 blocks in total, and then remove the second
volume. 13 blocks are reported, since 2 are on the removed disk; still as
expected:
{code}
2019-04-27 19:54:19,740 INFO impl.FsDatasetImpl: Removed volume:
/tmp/hadoop-sodonnell/dfs/data2
2019-04-27 19:54:19,740 INFO impl.FsDatasetImpl: Volume reference is released.
2019-04-27 19:54:19,741 INFO common.Storage: Removing block level storage:
/tmp/hadoop-sodonnell/dfs/data2/current/BP-1999061334-192.168.0.24-1556390848658
2019-04-27 19:54:19,743 INFO datanode.DataNode: Block pool
BP-1999061334-192.168.0.24-1556390848658 (Datanode Uuid
242e590a-b1e3-4a1e-9b32-e67cc095bb0f): scheduling a full block report.
2019-04-27 19:54:19,743 INFO datanode.DataNode: Forcing a full block report to
localhost/127.0.0.1:8020
2019-04-27 19:54:19,743 INFO conf.ReconfigurableBase: Property
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not
configurable: old value: org.apache.hadoop.ipc.ProtobufRpcEngine, new value:
null
2019-04-27 19:54:19,746 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c634989, containing 1 storage report(s), of which we sent 1. The
reports had 13 total blocks and used 1 RPC(s). This took 0 msec to generate and
2 msecs for RPC and NN processing. Got back no commands.
{code}
Finally, add the disk back in and the problem appears, as it is no longer an
empty disk. The FBR reports only the 2 blocks from the re-added disk instead
of the 15 it should:
{code}
2019-04-27 19:57:24,710 INFO impl.FsDatasetImpl: Added volume -
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2, StorageType: DISK
2019-04-27 19:57:24,710 INFO datanode.DataNode: Successfully added volume:
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2
2019-04-27 19:57:24,711 INFO datanode.DataNode: Block pool
BP-1999061334-192.168.0.24-1556390848658 (Datanode Uuid
242e590a-b1e3-4a1e-9b32-e67cc095bb0f): scheduling a full block report.
2019-04-27 19:57:24,711 INFO datanode.DataNode: Forcing a full block report to
localhost/127.0.0.1:8020
2019-04-27 19:57:24,711 INFO conf.ReconfigurableBase: Property
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not
configurable: old value: org.apache.hadoop.ipc.ProtobufRpcEngine, new value:
null
2019-04-27 19:57:24,713 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c63498a, containing 2 storage report(s), of which we sent 2. The
reports had 2 total blocks and used 1 RPC(s). This took 0 msec to generate and
2 msecs for RPC and NN processing. Got back no commands.
{code}
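This matches the root cause described in the issue below: the volumeMap ends
up holding only the re-added volume's replicas. Here is a minimal toy sketch
of that overwrite pattern, assuming (as the description says) the map is
keyed by block pool ID; the types and names are simplified stand-ins, not the
actual ReplicaMap source.
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration only: the real ReplicaMap keys per-block-pool replica
// sets by block pool ID; everything here is a simplified stand-in.
public class ReplicaMapOverwriteDemo {
  static class ToyReplicaMap {
    final Map<String, Set<Long>> map = new HashMap<>();

    void add(String bpid, long blockId) {
      map.computeIfAbsent(bpid, k -> new HashSet<>()).add(blockId);
    }

    // Suspect pattern: putAll() replaces the whole per-bpid set with the
    // new volume's set, dropping every replica already known for that pool.
    void addAllOverwriting(ToyReplicaMap other) {
      map.putAll(other.map);
    }

    // Merge instead: union the new volume's replicas into the existing set.
    void addAllMerging(ToyReplicaMap other) {
      for (Map.Entry<String, Set<Long>> e : other.map.entrySet()) {
        map.computeIfAbsent(e.getKey(), k -> new HashSet<>())
            .addAll(e.getValue());
      }
    }

    int size(String bpid) {
      Set<Long> s = map.get(bpid);
      return s == null ? 0 : s.size();
    }
  }

  public static void main(String[] args) {
    String bpid = "BP-1999061334-192.168.0.24-1556390848658";

    ToyReplicaMap volumeMap = new ToyReplicaMap();
    for (long b = 1; b <= 13; b++) {
      volumeMap.add(bpid, b);            // replicas on the remaining volume
    }
    ToyReplicaMap readdedDisk = new ToyReplicaMap();
    readdedDisk.add(bpid, 14);           // the 2 replicas on the re-added disk
    readdedDisk.add(bpid, 15);

    volumeMap.addAllOverwriting(readdedDisk);
    System.out.println(volumeMap.size(bpid)); // 2 -- matches the bad FBR above
  }
}
{code}
If addAll() swaps in the new volume's per-block-pool collection wholesale
instead of merging it, the 13 replicas already in the volumeMap are lost,
which is exactly what the 2-block FBR above shows.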
> Dynamic refresh Disk configuration results in overwriting VolumeMap
> -------------------------------------------------------------------
>
> Key: HDFS-13677
> URL: https://issues.apache.org/jira/browse/HDFS-13677
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0, 3.0.0
> Reporter: xuzq
> Priority: Major
> Attachments:
> 0001-fix-the-bug-of-the-refresh-disk-configuration.patch,
> image-2018-06-14-13-05-54-354.png, image-2018-06-14-13-10-24-032.png
>
>
> When I added a new disk by dynamically refreshing the configuration, a
> "FileNotFound while finding block" exception occurred.
>
> The steps are as follows:
> 1. Change the DataNode's hdfs-site.xml to add a new disk.
> 2. Refresh the configuration with "./bin/hdfs dfsadmin -reconfig datanode
> ****:50020 start".
>
> The error looks like:
> {code}
> VolumeScannerThread(/media/disk5/hdfs/dn): FileNotFound while finding block
> BP-233501496-*.*.*.*-1514185698256:blk_1620868560_547245090 on volume
> /media/disk5/hdfs/dn
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not
> found for BP-1997955181-*.*.*.*-1514186468560:blk_1090885868_17145082
>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:471)
>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:240)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:553)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:254)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> I added some logs for confirmation, as follows:
> The added log code:
> !image-2018-06-14-13-05-54-354.png!
> And the result:
> !image-2018-06-14-13-10-24-032.png!
> The size of the 'VolumeMap' has been reduced; we found that the existing
> 'VolumeMap' entries are overwritten with only the new disk's blocks by the
> method 'ReplicaMap.addAll(ReplicaMap other)'.
>