[ https://issues.apache.org/jira/browse/HDFS-7235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181750#comment-14181750 ]

Colin Patrick McCabe commented on HDFS-7235:
--------------------------------------------

Thanks, Yongjun.  This looks good overall.

{code}
    try {
      data.checkBlock(block, block.getNumBytes(), ReplicaState.FINALIZED);
    }
    catch (ReplicaNotFoundException e) {
      replicaNotExist = true;
    }
    catch (UnexpectedReplicaStateException e) {
      replicaStateNotFinalized = true;
    }
    catch (FileNotFoundException e) {
      blockFileNotExist = true;
    }
    catch (EOFException e) {
      lengthTooShort = true;
    }
{code}
Nit: can you put each catch on the same line as the preceding closing brace?  The 
Oracle Java style guide recommends this (section 7.9), and it saves vertical 
space.  Same comment applies in {{FsDatasetImpl#isValidRbw}} and 
{{SimulatedFSDataset.java}}.
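
For example, something like this (just a sketch of the brace placement, not the exact final code):
{code}
    try {
      data.checkBlock(block, block.getNumBytes(), ReplicaState.FINALIZED);
    } catch (ReplicaNotFoundException e) {
      replicaNotExist = true;
    } catch (UnexpectedReplicaStateException e) {
      replicaStateNotFinalized = true;
    } catch (FileNotFoundException e) {
      blockFileNotExist = true;
    } catch (EOFException e) {
      lengthTooShort = true;
    }
{code}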

{code}
/**
 * Exception indicating that DataNode does not have a replica
 * that matches the target block.
 */
public class UnexpectedReplicaStateException extends IOException {
{code}
Hmm... shouldn't this be "Exception indicating that the replica is in an 
unexpected state"?  "Does not have a replica" is a different error case.
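
For instance, something along these lines would be clearer (a sketch; exact wording up to you):
{code}
/**
 * Exception indicating that the replica on the DataNode is in an
 * unexpected state (e.g., not FINALIZED when FINALIZED was expected).
 */
public class UnexpectedReplicaStateException extends IOException {
{code}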

There are a bunch of unnecessary whitespace changes in this patch... for 
example, in FsDatasetImpl.java.  It's best to avoid those because they can 
confuse people looking through the history.  I like to use "meld" to let me see 
and quickly eliminate spurious whitespace changes in my patch.

Looks good aside from that!

> Can not decommission DN which has invalid block due to bad disk
> ---------------------------------------------------------------
>
>                 Key: HDFS-7235
>                 URL: https://issues.apache.org/jira/browse/HDFS-7235
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.6.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-7235.001.patch, HDFS-7235.002.patch, 
> HDFS-7235.003.patch, HDFS-7235.004.patch
>
>
> When decommissioning a DN, the process hangs. 
> What happens is: when the NN chooses a replica as a source to replicate data on 
> the to-be-decommissioned DN to other DNs, it favors choosing the 
> to-be-decommissioned DN itself as the source of the transfer (see BlockManager.java).  
> However, because of the bad disk, the DN detects the source block to be 
> transferred as an invalid block, with the following logic in FsDatasetImpl.java:
> {code}
>   /** Does the block exist and have the given state? */
>   private boolean isValid(final ExtendedBlock b, final ReplicaState state) {
>     final ReplicaInfo replicaInfo = volumeMap.get(b.getBlockPoolId(),
>         b.getLocalBlock());
>     return replicaInfo != null
>         && replicaInfo.getState() == state
>         && replicaInfo.getBlockFile().exists();
>   }
> {code}
> This method returns false (i.e., detects an invalid block) because the 
> block file doesn't exist, due to the bad disk in this case. 
> The key issue we found here is that after the DN detects an invalid block for the 
> above reason, it doesn't report the invalid block back to the NN, so the NN doesn't 
> know that the block is corrupted and keeps sending the data transfer request 
> to the same to-be-decommissioned DN, again and again. This causes an infinite 
> loop, so the decommission process hangs.
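> One possible direction for a fix (a sketch only, using the existing 
> {{DataNode#reportBadBlocks}} and {{FsDatasetSpi#isValidBlock}} methods; the actual 
> patch may differ): when the DN detects the invalid replica during transferBlock, it 
> could report it back to the NN instead of failing silently:
> {code}
> if (!data.isValidBlock(block)) {
>   // Sketch: report the corrupt replica to the NN so it picks a
>   // different source instead of retrying this DN forever.
>   reportBadBlocks(block);
>   return;
> }
> {code}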
> Thanks [~qwertymaniac] for reporting the issue and for the initial analysis.


