[ https://issues.apache.org/jira/browse/HDFS-4015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684329#comment-15684329 ]

Anu Engineer commented on HDFS-4015:
------------------------------------

bq. Should this be marked as incompatible change? 
I think we can argue both sides, so I am fine with either call you make after 
reading through both points of view.

bq. Earlier "dfsadmin -safemode leave" was leaving the safemode.
The behaviour in 99.99999% of cases is exactly the same. So even if we end up 
defining this as an incompatibility, it is a rare one.

bq. Now expects "-forceExit" also if there are any future bytes detected.
Future bytes indicate an error condition that HDFS should have flagged. This 
was a missing error check; if we call this an incompatibility, it would mean 
that copying an old fsImage or old NN metadata was a supported operation. I 
would argue that it never was a supported operation, in the sense that NN 
metadata is sacrosanct and you are not supposed to roll it back.
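To make this concrete, here is a rough sketch of the difference as seen from 
the command line (hedged: the exact subcommand spelling is whatever the 
committed patch uses; I am assuming {{forceExit}} based on the discussion 
above):

{noformat}
# Old behavior: this dropped out of safemode even when orphaned ("future")
# blocks were detected after starting from stale NN metadata.
hdfs dfsadmin -safemode leave

# New behavior (sketch): a plain leave is refused while future blocks are
# detected, and the admin must explicitly acknowledge the possible data loss.
hdfs dfsadmin -safemode forceExit
{noformat}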

So from that point of view, this change just confirms what we always knew and 
avoids booting up with incorrect metadata. But both of us know very well that 
the reason this JIRA was fixed is that people do this and lose data.

With this change, copying/restoring NN metadata has become a supported 
operation (that is, HDFS is aware users are going to do this), and we 
explicitly warn the user of the harm this action can cause. If we were to 
argue that the old behavior was a feature, then we would be saying that 
changing NN metadata and losing data was a supported feature.

While it is still possible to copy an older version of NN metadata, HDFS will 
now warn the end user about data loss. The question you are asking is: should 
we classify that as an incompatibility, or as enforcement of the core axioms 
of HDFS?

My *personal* view is that it is not an incompatible change, since HDFS has 
never officially encouraged people to copy *older* versions of NN metadata. If 
you agree with that, then this change merely formalizes the assumption that NN 
metadata is sacrosanct, and that if you roll it back, we are in an error state 
that needs explicit user intervention.

But I also see that from an end user's point of view (especially someone with 
a lot of HDFS experience), this enforcement of NN metadata integrity takes 
away some of the old, dangerous behavior. Now that we have added detection of 
an error condition which requires explicit action from the user, you can 
syntactically argue that it is an incompatible change, though semantically I 
would suppose it is obvious to any HDFS user that copying old versions of NN 
metadata is a bad idea.


> Safemode should count and report orphaned blocks
> ------------------------------------------------
>
>                 Key: HDFS-4015
>                 URL: https://issues.apache.org/jira/browse/HDFS-4015
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Todd Lipcon
>            Assignee: Anu Engineer
>             Fix For: 2.8.0, 3.0.0-alpha1
>
>         Attachments: HDFS-4015.001.patch, HDFS-4015.002.patch, 
> HDFS-4015.003.patch, HDFS-4015.004.patch, HDFS-4015.005.patch, 
> HDFS-4015.006.patch, HDFS-4015.007.patch
>
>
> The safemode status currently reports the number of unique reported blocks 
> compared to the total number of blocks referenced by the namespace. However, 
> it does not report the inverse: blocks which are reported by datanodes but 
> not referenced by the namespace.
> In the case that an admin accidentally starts up from an old image, this can 
> be confusing: safemode and fsck will show "corrupt files", which are the 
> files which actually have been deleted but got resurrected by restarting from 
> the old image. This will convince them that they can safely force leave 
> safemode and remove these files -- after all, they know that those files 
> should really have been deleted. However, they're not aware that leaving 
> safemode will also unrecoverably delete a bunch of other block files which 
> have been orphaned due to the namespace rollback.
> I'd like to consider reporting something like: "900000 of expected 1000000 
> blocks have been reported. Additionally, 10000 blocks have been reported 
> which do not correspond to any file in the namespace. Forcing exit of 
> safemode will unrecoverably remove those data blocks"
> Whether this statistic is also used for some kind of "inverse safe mode" is 
> the logical next step, but just reporting it as a warning seems easy enough 
> to accomplish and worth doing.



