[jira] [Commented] (HDFS-13677) Dynamic refresh Disk configuration results in overwriting VolumeMap

Stephen O'Donnell (JIRA) Sat, 27 Apr 2019 04:13:38 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827566#comment-16827566
 ]


Stephen O'Donnell commented on HDFS-13677:
------------------------------------------

[~xuzq_zander] This is an interesting find. I have seen this happen a couple of 
times when a disk was removed and then added to the cluster, but was never able 
to reproduce. It seemed to happen somewhat randomly and resulted in the DN 
reporting much fewer than expected blocks to the NN and hence causing missing 
blocks. A DN restarted fixed it.

Are you able to reproduce this problem every time, or is it caused by some sort 
of race condition and only happens sometimes?

Looking at the source just before the activeVolume call, the new temporary map 
should be populated with all the replicas on the volume before it is called by 
getVolumeMap():

{code}
    ReplicaMap tempVolumeMap = new ReplicaMap(datasetLock);
    fsVolume.getVolumeMap(tempVolumeMap, ramDiskReplicaTracker);

    activateVolume(tempVolumeMap, sd, storageLocation.getStorageType(), ref);
{code}

If the new map is populated correctly, do you know why the call to 
volumeMap.addAll() causes the issue? Is it possible the tempVolumeMap object is 
not populated fully somehow?

> Dynamic refresh Disk configuration results in overwriting VolumeMap
> -------------------------------------------------------------------
>
>                 Key: HDFS-13677
>                 URL: https://issues.apache.org/jira/browse/HDFS-13677
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 3.0.0
>            Reporter: xuzq
>            Priority: Major
>         Attachments: 
> 0001-fix-the-bug-of-the-refresh-disk-configuration.patch, 
> image-2018-06-14-13-05-54-354.png, image-2018-06-14-13-10-24-032.png
>
>
> When I added a new disk by dynamically refreshing the configuration, an 
> exception "FileNotFound while finding block" was caused.
>  
> The steps are as follows:
> 1.Change the hdfs-site.xml of DataNode to add a new disk.
> 2.Refresh the configuration by "./bin/hdfs dfsadmin -reconfig datanode 
> ****:50020 start"
>  
> The error is like:
> ```
> VolumeScannerThread(/media/disk5/hdfs/dn): FileNotFound while finding block 
> BP-233501496-*.*.*.*-1514185698256:blk_1620868560_547245090 on volume 
> /media/disk5/hdfs/dn
> org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not 
> found for BP-1997955181-*.*.*.*-1514186468560:blk_1090885868_17145082
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:471)
>  at 
> org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:240)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:553)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:148)
>  at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:103)
>  at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:254)
>  at java.lang.Thread.run(Thread.java:748)
> ```
> I added some logs for confirmation, as follows:
> Log Code like:
> !image-2018-06-14-13-05-54-354.png!
> And the result is like:
> !image-2018-06-14-13-10-24-032.png!  
> The Size of 'VolumeMap' has been reduced, and We found the 'VolumeMap' to be 
> overridden by the new Disk Block by the method 'ReplicaMap.addAll(ReplicaMap 
> other)'.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HDFS-13677) Dynamic refresh Disk configuration results in overwriting VolumeMap

Reply via email to