[ 
https://issues.apache.org/jira/browse/HDFS-11155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-11155:
-----------------------------------
    Description: 
HDFS-10512 fixed a race condition that caused VolumeScanner to terminate 
abruptly when it detects a corrupt replica that is concurrently being updated. 
However, when such a corrupt replica is detected, VolumeScanner still reports 
the old replica generation stamp to the NN. The NN then directs the DN to 
remove the replica with the old generation stamp. Because the generation stamp 
has since been updated, the DN cannot find that replica, so the corrupt 
replica is never deleted.

NN's log shows something similar to the following:
{quote}
2016-11-17 21:08:05,350 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1077571736 added as corrupt on 
192.168.168.58:50010 by /192.168.168.58  because client machine reported it
2016-11-17 21:08:05,350 INFO BlockStateChange: BLOCK* invalidateBlock: 
blk_1077571736_3991953(stored=blk_1077571736_3992018) on 192.168.168.58:50010
{quote}
The DN's log has these:

{noformat}
2016-11-17 21:08:04,815 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Appending 
to FinalizedReplica, blk_1077571736_3991953, FINALIZED
  getNumBytes()     = 39061752
  getBytesOnDisk()  = 39061752
  getVisibleLength()= 39061752
  getVolume()       = /data/3/dfs/dn/current
  getBlockFile()    = 
/data/3/dfs/dn/current/BP-1092022411-192.168.168.55-1474407949037/current/finalized/subdir58/subdir112/blk_1077571736

2016-11-17 21:08:09,158 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed to 
delete replica blk_1077571736_3991953: ReplicaInfo not found.
{noformat}
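
The mismatch in the logs above can be modeled with a small, self-contained Java sketch. It is only an illustration: the class StaleGenStampDemo and its methods are hypothetical and are not the real FsDatasetImpl or VolumeScanner API. It assumes the DN deletes a replica only when both the block ID and the generation stamp match, which is consistent with the "ReplicaInfo not found" message.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a DataNode replica map. Hypothetical; not Hadoop code. */
class StaleGenStampDemo {
    /** Block ID -> generation stamp of the replica currently on this DN. */
    private final Map<Long, Long> replicas = new HashMap<>();

    public void addReplica(long blockId, long genStamp) {
        replicas.put(blockId, genStamp);
    }

    /** An append bumps the generation stamp of an existing replica. */
    public void append(long blockId, long newGenStamp) {
        replicas.computeIfPresent(blockId, (id, old) -> newGenStamp);
    }

    /**
     * Deletes the replica only if the requested generation stamp matches
     * the current one (assumed behavior for this sketch).
     */
    public boolean invalidate(long blockId, long genStamp) {
        Long current = replicas.get(blockId);
        if (current == null || current != genStamp) {
            return false; // corresponds to "ReplicaInfo not found"
        }
        replicas.remove(blockId);
        return true;
    }

    public static void main(String[] args) {
        StaleGenStampDemo dn = new StaleGenStampDemo();
        dn.addReplica(1077571736L, 3991953L);

        // The scanner captures the stamp, then an append updates the replica.
        long scannedGenStamp = 3991953L;
        dn.append(1077571736L, 3992018L);

        // Reporting the stale stamp leaves the corrupt replica in place.
        System.out.println("stale stamp deleted:  "
                + dn.invalidate(1077571736L, scannedGenStamp));
        // Reporting the latest stamp lets the DN find and remove it.
        System.out.println("latest stamp deleted: "
                + dn.invalidate(1077571736L, 3992018L));
    }
}
```

In this model the first invalidate call fails and the second succeeds, which is the behavior the issue title asks for: the scanner should report the latest generation stamp of the bad replica.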

  was:
HDFS-10512 fixed a race condition that caused VolumeScanner to terminate 
abruptly when a corrupt replica is detected. However, when a corrupt replica is 
detected, VolumeScanner still reports the old replica generation stamp to the 
NN. NN then directs DN to remove the older replica, but because the generation 
stamp is updated, DN can not find it, so corrupt replica remains corrupt.

NN's log shows something similar to the following:
{quote}
2016-11-17 21:08:05,350 INFO BlockStateChange: BLOCK 
NameSystem.addToCorruptReplicasMap: blk_1077571736 added as corrupt on 
192.168.168.58:50010 by /192.168.168.58  because client machine reported it
2016-11-17 21:08:05,350 INFO BlockStateChange: BLOCK* invalidateBlock: 
blk_1077571736_3991953(stored=blk_1077571736_3992018) on 192.168.168.58:50010
{quote}
The DN's log has these:

{noformat}
2016-11-17 21:08:04,815 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Appending 
to FinalizedReplica, blk_1077571736_3991953, FINALIZED
  getNumBytes()     = 39061752
  getBytesOnDisk()  = 39061752
  getVisibleLength()= 39061752
  getVolume()       = /data/3/dfs/dn/current
  getBlockFile()    = 
/data/3/dfs/dn/current/BP-1092022411-192.168.168.55-1474407949037/current/finalized/subdir58/subdir112/blk_1077571736

2016-11-17 21:08:09,158 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed to 
delete replica blk_1077571736_3991953: ReplicaInfo not found.
{noformat}


> VolumeScanner should report the latest generation stamp of a bad replica
> ------------------------------------------------------------------------
>
>                 Key: HDFS-11155
>                 URL: https://issues.apache.org/jira/browse/HDFS-11155
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.4
>         Environment: CDH5.7.3
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>
> HDFS-10512 fixed a race condition that caused VolumeScanner to terminate 
> abruptly when it detects a corrupt replica that is concurrently being 
> updated. However, when such a corrupt replica is detected, VolumeScanner 
> still reports the old replica generation stamp to the NN. The NN then 
> directs the DN to remove the replica with the old generation stamp. Because 
> the generation stamp has since been updated, the DN cannot find that 
> replica, so the corrupt replica is never deleted.
> NN's log shows something similar to the following:
> {quote}
> 2016-11-17 21:08:05,350 INFO BlockStateChange: BLOCK 
> NameSystem.addToCorruptReplicasMap: blk_1077571736 added as corrupt on 
> 192.168.168.58:50010 by /192.168.168.58  because client machine reported it
> 2016-11-17 21:08:05,350 INFO BlockStateChange: BLOCK* invalidateBlock: 
> blk_1077571736_3991953(stored=blk_1077571736_3992018) on 192.168.168.58:50010
> {quote}
> The DN's log has these:
> {noformat}
> 2016-11-17 21:08:04,815 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Appending to FinalizedReplica, blk_1077571736_3991953, FINALIZED
>   getNumBytes()     = 39061752
>   getBytesOnDisk()  = 39061752
>   getVisibleLength()= 39061752
>   getVolume()       = /data/3/dfs/dn/current
>   getBlockFile()    = 
> /data/3/dfs/dn/current/BP-1092022411-192.168.168.55-1474407949037/current/finalized/subdir58/subdir112/blk_1077571736
> 2016-11-17 21:08:09,158 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Failed 
> to delete replica blk_1077571736_3991953: ReplicaInfo not found.
> {noformat}


