[
https://issues.apache.org/jira/browse/HDFS-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827725#comment-16827725
]
Stephen O'Donnell commented on HDFS-13677:
------------------------------------------
Yes, and I think the reason the problem did not reproduce earlier is that my
new disk was empty. Re-adding a non-empty disk is what causes the problem to
manifest. Steps to reproduce:
1. Single-node local cluster with 1 storage configured. Add 10 files or so,
giving a known number of blocks.
2. Reconfigure to add a second storage. Add, say, 5 more files to put 2 or 3
blocks on the new disk.
3. Reconfigure to remove the second storage, then add it again. Due to this
bug, the DN should then generate an FBR containing only the 2 or 3 blocks on
the re-added disk, i.e. missing those from the original storage. A rough
sketch of these steps as a test follows.
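For reference, here is a rough sketch of those steps as a MiniDFSCluster test.
This is untested: the storagesPerDatanode() builder option and the direct
reconfigurePropertyImpl() call mirror what the existing hot-swap tests do (so
a real version would live in the datanode test package), and the paths and
file sizes are placeholders, not verified code.
{code}
// Sketch only: approximates the manual repro steps above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;

public class ReAddNonEmptyVolumeRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(1).storagesPerDatanode(1).build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();
      DataNode dn = cluster.getDataNodes().get(0);

      // Step 1: put a known number of blocks on the original volume.
      for (int i = 0; i < 10; i++) {
        DFSTestUtil.createFile(fs, new Path("/f" + i), 1024L, (short) 1, 0L);
      }

      String oldDirs = dn.getConf().get(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY);
      String newDir = cluster.getDataDirectory() + "/data_readd";

      // Step 2: hot-add a second volume and land a few blocks on it.
      dn.reconfigurePropertyImpl(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, oldDirs + "," + newDir);
      for (int i = 10; i < 15; i++) {
        DFSTestUtil.createFile(fs, new Path("/f" + i), 1024L, (short) 1, 0L);
      }

      // Step 3: remove the second volume, then re-add it while it still
      // holds replicas. The FBR after the re-add should now undercount.
      dn.reconfigurePropertyImpl(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, oldDirs);
      dn.reconfigurePropertyImpl(
          DFSConfigKeys.DFS_DATANODE_DATA_DIR_KEY, oldDirs + "," + newDir);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}
Running the steps by hand on a local cluster, the logs bear this out: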
{code}
<DN with 1 volume and 10 blocks just restarted>
2019-04-27 19:50:12,801 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c634987, containing 1 storage report(s), of which we sent 1. The
reports had 10 total blocks and used 1 RPC(s). This took 3 msec to generate and
20 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
{code}
Now I add an empty disk and we see no issues; still 10 blocks reported:
{code}
2019-04-27 19:52:18,085 INFO impl.FsDatasetImpl: Added volume -
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2, StorageType: DISK
2019-04-27 19:52:18,085 INFO datanode.DataNode: Successfully added volume:
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2
2019-04-27 19:52:18,086 INFO datanode.DataNode: Block pool
BP-1999061334-192.168.0.24-1556390848658 (Datanode Uuid
242e590a-b1e3-4a1e-9b32-e67cc095bb0f): scheduling a full block report.
2019-04-27 19:52:18,087 INFO datanode.DataNode: Forcing a full block report to
localhost/127.0.0.1:8020
2019-04-27 19:52:18,087 INFO conf.ReconfigurableBase: Property
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not
configurable: old value: org.apache.hadoop.ipc.ProtobufRpcEngine, new value:
null
2019-04-27 19:52:18,089 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c634988, containing 2 storage report(s), of which we sent 2. The
reports had 10 total blocks and used 1 RPC(s). This took 1 msec to generate and
2 msecs for RPC and NN processing. Got back no commands.
{code}
Now add 5 more files, giving 15 blocks in total, and then remove the second
volume. 13 blocks are reported, since 2 are on the removed disk; still as
expected:
{code}
2019-04-27 19:54:19,740 INFO impl.FsDatasetImpl: Removed volume:
/tmp/hadoop-sodonnell/dfs/data2
2019-04-27 19:54:19,740 INFO impl.FsDatasetImpl: Volume reference is released.
2019-04-27 19:54:19,741 INFO common.Storage: Removing block level storage:
/tmp/hadoop-sodonnell/dfs/data2/current/BP-1999061334-192.168.0.24-1556390848658
2019-04-27 19:54:19,743 INFO datanode.DataNode: Block pool
BP-1999061334-192.168.0.24-1556390848658 (Datanode Uuid
242e590a-b1e3-4a1e-9b32-e67cc095bb0f): scheduling a full block report.
2019-04-27 19:54:19,743 INFO datanode.DataNode: Forcing a full block report to
localhost/127.0.0.1:8020
2019-04-27 19:54:19,743 INFO conf.ReconfigurableBase: Property
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not
configurable: old value: org.apache.hadoop.ipc.ProtobufRpcEngine, new value:
null
2019-04-27 19:54:19,746 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c634989, containing 1 storage report(s), of which we sent 1. The
reports had 13 total blocks and used 1 RPC(s). This took 0 msec to generate and
2 msecs for RPC and NN processing. Got back no commands.
{code}
Finally, add the disk back in and the problem appears, as it is no longer an
empty disk. The FBR reports only the 2 blocks from the re-added disk instead
of the 15 it should:
{code}
2019-04-27 19:57:24,710 INFO impl.FsDatasetImpl: Added volume -
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2, StorageType: DISK
2019-04-27 19:57:24,710 INFO datanode.DataNode: Successfully added volume:
[DISK]file:/tmp/hadoop-sodonnell/dfs/data2
2019-04-27 19:57:24,711 INFO datanode.DataNode: Block pool
BP-1999061334-192.168.0.24-1556390848658 (Datanode Uuid
242e590a-b1e3-4a1e-9b32-e67cc095bb0f): scheduling a full block report.
2019-04-27 19:57:24,711 INFO datanode.DataNode: Forcing a full block report to
localhost/127.0.0.1:8020
2019-04-27 19:57:24,711 INFO conf.ReconfigurableBase: Property
rpc.engine.org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolPB is not
configurable: old value: org.apache.hadoop.ipc.ProtobufRpcEngine, new value:
null
2019-04-27 19:57:24,713 INFO datanode.DataNode: Successfully sent block report
0x230a1c633c63498a, containing 2 storage report(s), of which we sent 2. The
reports had 2 total blocks and used 1 RPC(s). This took 0 msec to generate and
2 msecs for RPC and NN processing. Got back no commands.
{code}
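This matches the root cause described in the issue below: the volumeMap ends
up holding only the re-added volume's replicas. Here is a minimal toy sketch
of that overwrite pattern, assuming (as the description says) the map is
keyed by block pool ID; the types and names are simplified stand-ins, not the
actual ReplicaMap source.
{code}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration only: the real ReplicaMap keys per-block-pool replica
// sets by block pool ID; everything here is a simplified stand-in.
public class ReplicaMapOverwriteDemo {
  static class ToyReplicaMap {
    final Map<String, Set<Long>> map = new HashMap<>();

    void add(String bpid, long blockId) {
      map.computeIfAbsent(bpid, k -> new HashSet<>()).add(blockId);
    }

    // Suspect pattern: putAll() replaces the whole per-bpid set with the
    // new volume's set, dropping every replica already known for that pool.
    void addAllOverwriting(ToyReplicaMap other) {
      map.putAll(other.map);
    }

    // Merge instead: union the new volume's replicas into the existing set.
    void addAllMerging(ToyReplicaMap other) {
      for (Map.Entry<String, Set<Long>> e : other.map.entrySet()) {
        map.computeIfAbsent(e.getKey(), k -> new HashSet<>())
            .addAll(e.getValue());
      }
    }

    int size(String bpid) {
      Set<Long> s = map.get(bpid);
      return s == null ? 0 : s.size();
    }
  }

  public static void main(String[] args) {
    String bpid = "BP-1999061334-192.168.0.24-1556390848658";

    ToyReplicaMap volumeMap = new ToyReplicaMap();
    for (long b = 1; b <= 13; b++) {
      volumeMap.add(bpid, b);            // replicas on the remaining volume
    }
    ToyReplicaMap readdedDisk = new ToyReplicaMap();
    readdedDisk.add(bpid, 14);           // the 2 replicas on the re-added disk
    readdedDisk.add(bpid, 15);

    volumeMap.addAllOverwriting(readdedDisk);
    System.out.println(volumeMap.size(bpid)); // 2 -- matches the bad FBR above
  }
}
{code}
If addAll() swaps in the new volume's per-block-pool collection wholesale
instead of merging it, the 13 replicas already in the volumeMap are lost,
which is exactly what the 2-block FBR above shows.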
> Dynamic refresh Disk configuration results in overwriting VolumeMap
> -------------------------------------------------------------------
>
> Key: HDFS-13677
> URL: https://issues.apache.org/jira/browse/HDFS-13677
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.6.0, 3.0.0
> Reporter: xuzq
> Priority: Major
> Attachments:
> 0001-fix-the-bug-of-the-refresh-disk-configuration.patch,
> image-2018-06-14-13-05-54-354.png, image-2018-06-14-13-10-24-032.png
>
>
> When I added a new disk by dynamically refreshing the configuration, a
> "FileNotFound while finding block" exception occurred.
>
> The steps are as follows:
> 1. Change the DataNode's hdfs-site.xml to add a new disk.
> 2. Refresh the configuration with "./bin/hdfs dfsadmin -reconfig datanode
> ****:50020 start".
>
> The error looks like:
> {code}
> VolumeScannerThread(/media/disk5/hdfs/dn): FileNotFound while finding block
> BP-233501496-*.*.*.*-1514185698256:blk_1620868560_547245090 on volume
> /media/disk5/hdfs/dn
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not
> found for BP-1997955181-*.*.*.*-1514186468560:blk_1090885868_17145082
>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:471)
>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:240)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:553)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
>     at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:254)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> I added some logs for confirmation, as follows:
> The added log code:
> !image-2018-06-14-13-05-54-354.png!
> And the result:
> !image-2018-06-14-13-10-24-032.png!
> The size of the 'VolumeMap' has been reduced; we found that the existing
> 'VolumeMap' entries are overwritten with only the new disk's blocks by the
> method 'ReplicaMap.addAll(ReplicaMap other)'.
>