[jira] [Commented] (HDFS-5443) Namenode can stuck in safemode on restart if it crashes just after addblock logsync and after taking snapshot for such file.

Jing Zhao (JIRA) Tue, 05 Nov 2013 00:14:36 -0800

    [ https://issues.apache.org/jira/browse/HDFS-5443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813747#comment-13813747 ]


Jing Zhao commented on HDFS-5443:
---------------------------------

bq. Here the actual problem is not that the 0-sized blocks are kept, but that they 
are also counted in the safemode threshold, because they are loaded as COMPLETE blocks
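A toy model can show the quoted point: the trouble comes from counting these blocks, not from keeping them. This is a minimal sketch under stated assumptions. `SafemodeCountSketch` and `BlockAtStartup` are hypothetical names. Only COMPLETE blocks are counted, and a block counts as "safe" once one replica is reported. The threshold test is a plain ratio, not the NameNode's exact arithmetic.

```java
import java.util.List;

// Hypothetical, simplified model of the startup safemode block count.
// SafemodeCountSketch and BlockAtStartup are illustrative names, not HDFS APIs,
// and the threshold test is a plain ratio, not the NameNode's exact arithmetic.
class SafemodeCountSketch {

    // A block as the NN sees it while starting up.
    // loadedAsComplete: the state the image/edits loader gave the block.
    // reportedReplicas: replicas reported by DataNodes so far.
    record BlockAtStartup(long id, boolean loadedAsComplete, int reportedReplicas) {}

    // Only COMPLETE blocks count toward the total; a COMPLETE block is "safe"
    // once at least one replica has been reported for it.
    static boolean thresholdReached(List<BlockAtStartup> blocks, double thresholdPct) {
        long total = 0;
        long safe = 0;
        for (BlockAtStartup b : blocks) {
            if (!b.loadedAsComplete()) {
                continue; // under-construction blocks stay out of the count
            }
            total++;
            if (b.reportedReplicas() > 0) {
                safe++;
            }
        }
        return safe >= thresholdPct * total;
    }

    public static void main(String[] args) {
        // Block 3 is the 0-sized block allocated by addBlock; no DataNode ever
        // wrote it, so no replica of it will ever be reported.
        List<BlockAtStartup> zeroSizedAsComplete = List.of(
                new BlockAtStartup(1, true, 3),
                new BlockAtStartup(2, true, 3),
                new BlockAtStartup(3, true, 0));
        List<BlockAtStartup> zeroSizedAsUc = List.of(
                new BlockAtStartup(1, true, 3),
                new BlockAtStartup(2, true, 3),
                new BlockAtStartup(3, false, 0));

        System.out.println(thresholdReached(zeroSizedAsComplete, 0.999)); // false: never leaves safemode
        System.out.println(thresholdReached(zeroSizedAsUc, 0.999));       // true
    }
}
```

The block set is the same in both runs. The threshold is met or missed depending only on whether the 0-sized block was loaded as COMPLETE.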

Agreed. But in the meantime, we should also clear these 0-sized blocks: if the 
corresponding file exists only in a snapshot, I guess no one will ever finalize 
the block. That's why I think we should fix this part in a separate JIRA. For 
the safemode part, as Vinay mentioned, the key issue is still that, while 
loading the fsimage, the current code fails to recognize an INodeFileUC if the 
file is in a snapshot and the deletion was done on its parent/ancestor directory. 
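A toy tree model shows why a loader that walks only the live namespace misses such a file. This is a hypothetical sketch. `Node`, `ucFromLiveTree` and `ucIncludingSnapshots` are illustrative names, not the HDFS `INode` classes or the real fsimage loader. A snapshot is modeled as a retained copy of the root's children rather than as snapshot diffs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tree model; not the HDFS INode hierarchy or fsimage loader.
class SnapshotUcSketch {

    static final class Node {
        final String name;
        final boolean underConstruction; // only meaningful for files
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean underConstruction) {
            this.name = name;
            this.underConstruction = underConstruction;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Depth-first walk collecting the paths of under-construction files.
    static void collectUc(Node node, String parentPath, List<String> out) {
        String path = parentPath + "/" + node.name;
        if (node.underConstruction) {
            out.add(path);
        }
        for (Node child : node.children) {
            collectUc(child, path, out);
        }
    }

    // Naive loader: looks only at the live namespace.
    static List<String> ucFromLiveTree(Node root) {
        List<String> out = new ArrayList<>();
        for (Node child : root.children) {
            collectUc(child, "", out);
        }
        return out;
    }

    // Also walks what each snapshot of the root still retains.
    static List<String> ucIncludingSnapshots(Node root, Map<String, Node> snapshots) {
        List<String> out = ucFromLiveTree(root);
        for (Map.Entry<String, Node> s : snapshots.entrySet()) {
            for (Node child : s.getValue().children) {
                collectUc(child, "/.snapshot/" + s.getKey(), out);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Node file = new Node("f", true);                 // UC after addBlock
        Node dir = new Node("d", false).add(file);
        Node liveRoot = new Node("", false).add(dir);

        Map<String, Node> snapshots = new LinkedHashMap<>();
        snapshots.put("s1", new Node("", false).add(dir)); // snapshot s1 of the root

        liveRoot.children.remove(dir);                   // delete the parent directory

        System.out.println(ucFromLiveTree(liveRoot));                  // []
        System.out.println(ucIncludingSnapshots(liveRoot, snapshots)); // [/.snapshot/s1/d/f]
    }
}
```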

I think HDFS-5428 can solve the problem, but it may be overkill, because the 
current HDFS-5428 patch needs to keep records in the lease map and maintain 
them even through snapshot deletion and renaming. Since the safemode issue only 
happens when starting the NN, can we fix the problem by:
1. recording extra information in the fsimage to indicate INodeFileUCs that exist 
only in snapshots,
2. regenerating all the INodeFileUCs when loading the fsimage, and
3. using a workaround similar to the one in HDFS-5283?

For 1 and 2, we need to cover files that are deleted through an ancestor 
directory. To avoid an fsimage incompatibility, we can put the extra 
information in the "under construction files" section of the fsimage.
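Steps 1 and 2 could look roughly like the following round trip. This is a hypothetical sketch, not the real fsimage layout. `UcEntry`, the `(path, snapshotOnly, lastBlockId)` encoding and the helper names are all assumptions. Snapshot-only files, including ones whose ancestor directory was deleted, are written into the same section with an explicit flag. The loader regenerates the records and collects the last-block IDs that should stay out of the safemode count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical encoding of the "under construction files" section; this is
// NOT the real fsimage format. UcEntry and the helpers are illustrative names.
class UcSectionSketch {

    // snapshotOnly is the extra information from step 1: the file is no longer
    // in the live tree (e.g. an ancestor was deleted) and survives only in a snapshot.
    record UcEntry(String path, boolean snapshotOnly, long lastBlockId) {}

    static byte[] write(List<UcEntry> entries) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(entries.size());
            for (UcEntry e : entries) {
                out.writeUTF(e.path());
                out.writeBoolean(e.snapshotOnly());
                out.writeLong(e.lastBlockId());
            }
        }
        return bytes.toByteArray();
    }

    // Step 2: regenerate every UC record when the image is loaded.
    static List<UcEntry> read(byte[] section) throws IOException {
        List<UcEntry> entries = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(section))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                // Java evaluates arguments left to right, matching the write order.
                entries.add(new UcEntry(in.readUTF(), in.readBoolean(), in.readLong()));
            }
        }
        return entries;
    }

    // Last blocks of the regenerated UC files: a loader would keep these in the
    // under-construction state so they are not counted toward the safemode total.
    static Set<Long> lastBlocksToKeepUc(List<UcEntry> entries) {
        Set<Long> ids = new LinkedHashSet<>();
        for (UcEntry e : entries) {
            ids.add(e.lastBlockId());
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        List<UcEntry> entries = List.of(
                new UcEntry("/user/a/open-file", false, 1001L),  // hypothetical live UC file
                new UcEntry("/.snapshot/s1/d/f", true, 1002L));  // UC file only in snapshot s1
        List<UcEntry> reloaded = read(write(entries));
        System.out.println(reloaded.equals(entries));     // true
        System.out.println(lastBlocksToKeepUc(reloaded)); // [1001, 1002]
    }
}
```

Putting the flag inside the existing section, instead of adding a new section, is the compatibility idea from the comment. How an older loader would treat the extra field is not addressed by this sketch.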

> Namenode can stuck in safemode on restart if it crashes just after addblock 
> logsync and after taking snapshot for such file.
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-5443
>                 URL: https://issues.apache.org/jira/browse/HDFS-5443
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: snapshots
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Uma Maheswara Rao G
>            Assignee: sathish
>
> This issue was reported by Prakash and Sathish.
> Looking into the issue, the following things happen:
>
> 1) The client adds a block at the NN and the NN has just done logSync,
>    so the NN has the block ID persisted.
> 2) Before the addBlock response is returned to the client, take a snapshot of
> the root or a parent directory of that file.
> 3) Delete the parent directory of that file.
> 4) Now crash the NN without returning success to the client for that addBlock
> call.
> Now, on restart, the Namenode will be stuck in safemode.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
