[ https://issues.apache.org/jira/browse/HDFS-8193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522819#comment-14522819 ]

Zhe Zhang commented on HDFS-8193:
---------------------------------

Thanks [~sureshms] for the helpful comments!

bq. Second use case, NN deleted file and admin wants to restore it (the case of 
NN metadata backup). Going back to an older fsimage is not that straightforward 
and is a solution to be used only in desperate situations. It can cause 
corruption for other applications running on HDFS. It also results in loss of 
newly created data across the file system. Snapshots and trash are solutions 
for this.
You are absolutely right that it's always preferable to protect data at the 
file level instead of the block level. This JIRA is indeed aimed at being a 
last resort for desperate situations, similar to recovering data directly from 
a hard disk drive when the file system is corrupt beyond repair. The mechanism 
is fully controlled by the DN and serves as the last layer of protection when 
all the layers above have failed (trash mistakenly emptied, snapshots not 
correctly set up, etc.).

bq. First use case, NN deletes blocks without deleting files. Have you seen an 
instance of this? It would be great to get a one-pager on how one handles this 
condition.
One possible situation (recently fixed by HDFS-7960) is that the NN mistakenly 
considers some blocks over-replicated because of zombie storages. Even though 
HDFS-7960 has been fixed, we should do something to protect against possible 
future NN bugs. This is the crux of why file-level protections, although always 
desirable, are not always sufficient: the NN itself may get something wrong, 
and then we are left with irrecoverable data loss.

bq. Does NN keep deleting the blocks until it is hot fixed? 
In the above case, the NN will keep deleting every replica it considers 
over-replicated until it is hot-fixed.

bq. Also completing deletion of blocks in a timely manner is important for a 
running cluster.
Yes, this is a valid concern. Empirically, most customer clusters run well 
below full disk capacity, so adding a reasonable grace period shouldn't delay 
the allocation of new blocks. The configured delay window should also be 
enforced under the constraint of available space (e.g., don't delay deletion 
when available disk space drops below 10%). We will also add Web UI and metrics 
support to clearly show the space consumed by deletion-delayed replicas, as in 
the sketch below.
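
To make the intended behavior concrete, here is a minimal sketch of the 
DN-side decision. This is not existing Hadoop code; {{DelayedDeletionPolicy}}, 
{{gracePeriodMs}}, and {{minFreeSpaceRatio}} are hypothetical names for 
illustration:

{code:java}
import java.io.File;

/**
 * Hypothetical sketch (not actual Hadoop code) of the proposed DN-side
 * policy: delay replica deletion for a configured grace period, but fall
 * back to immediate deletion when free disk space is below a threshold.
 */
public class DelayedDeletionPolicy {
  private final long gracePeriodMs;       // e.g. 24 hours, configurable
  private final double minFreeSpaceRatio; // e.g. 0.10 (the 10% example above)

  public DelayedDeletionPolicy(long gracePeriodMs, double minFreeSpaceRatio) {
    this.gracePeriodMs = gracePeriodMs;
    this.minFreeSpaceRatio = minFreeSpaceRatio;
  }

  /** True if the replica should be moved to trash instead of unlinked now. */
  public boolean shouldDelay(File volumeRoot) {
    double freeRatio =
        (double) volumeRoot.getUsableSpace() / volumeRoot.getTotalSpace();
    // Only delay when the volume has comfortable headroom, so foreground
    // writes are never starved by the trash.
    return gracePeriodMs > 0 && freeRatio >= minFreeSpaceRatio;
  }

  /** A trashed replica may be permanently removed after the grace period. */
  public boolean isExpired(long trashedAtMs, long nowMs) {
    return nowMs - trashedAtMs >= gracePeriodMs;
  }
}
{code}

Checking free space at trash-entry time keeps the delay strictly best-effort: 
a volume already under pressure behaves exactly as it does today.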

bq. All files don't require the same reliability. Intermediate data and tmp 
files need to be deleted immediately to free up cluster storage to avoid the 
risk of running out of storage space. At the datanode level, there is no notion 
of whether files are temporary or important ones that need to be preserved. So 
a trash such as this can result in retaining a lot of tmp files and deletes not 
being able to free up storage within the cluster fast enough.
This is a great point. The proposed work (at least in its first phase) is 
intended as a best-effort optimization and will always yield to foreground 
workloads. The goal is to statistically reduce the chance and severity of data 
loss under typical storage consumption conditions. It's certainly still 
possible for a wave of tmp data to flush more important data out of DN trashes; 
we can design smarter eviction algorithms as future work, along the lines 
sketched below.
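
As a strawman for that future work, the simplest baseline would evict the 
oldest trashed replicas first under space pressure; the comparator below is 
the obvious place to plug in something smarter. All names here 
({{TrashEvictor}}, {{TrashedReplica}}) are hypothetical, not Hadoop code:

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

/**
 * Hypothetical strawman for trash eviction: under space pressure,
 * permanently delete the oldest delayed replicas first until enough
 * space is reclaimed.
 */
public class TrashEvictor {
  /** Minimal record of a replica sitting in the DN trash. */
  static final class TrashedReplica {
    final long blockId;
    final long sizeBytes;
    final long trashedAtMs;
    TrashedReplica(long blockId, long sizeBytes, long trashedAtMs) {
      this.blockId = blockId;
      this.sizeBytes = sizeBytes;
      this.trashedAtMs = trashedAtMs;
    }
  }

  // Oldest replicas surface first; a smarter policy (e.g. evicting likely
  // tmp data before everything else) would just swap in a new comparator.
  private final PriorityQueue<TrashedReplica> trash = new PriorityQueue<>(
      Comparator.comparingLong((TrashedReplica r) -> r.trashedAtMs));

  public void add(TrashedReplica replica) {
    trash.add(replica);
  }

  /** Evicts oldest-first until at least bytesNeeded are reclaimed. */
  public long evict(long bytesNeeded) {
    long freed = 0;
    while (freed < bytesNeeded && !trash.isEmpty()) {
      TrashedReplica victim = trash.poll();
      // A real implementation would unlink the replica file here.
      freed += victim.sizeBytes;
    }
    return freed;
  }
}
{code}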

As I [commented | 
https://issues.apache.org/jira/browse/HDFS-8193?focusedCommentId=14505336&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14505336]
above, we are considering a more radical approach as a potential next phase of 
this work, where deletion-delayed replicas would simply be overwritten by 
incoming replicas. In that case we might not even need to count deletion-delayed 
replicas against the space quota, making the feature more transparent to admins.
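
To illustrate the accounting that approach implies, here is a hypothetical 
sketch ({{VolumeSpaceAccounting}} and its fields are made up for illustration, 
not a real Hadoop class):

{code:java}
/**
 * Hypothetical sketch of the space accounting implied by the "overwrite"
 * approach: space held by deletion-delayed replicas is reported as
 * available, so the trash stays invisible to quotas, and an incoming
 * write that needs the space simply reclaims it first.
 */
public class VolumeSpaceAccounting {
  private long actualFreeBytes; // truly unused bytes on the volume
  private long trashBytes;      // bytes held by deletion-delayed replicas

  public VolumeSpaceAccounting(long actualFreeBytes, long trashBytes) {
    this.actualFreeBytes = actualFreeBytes;
    this.trashBytes = trashBytes;
  }

  /** Space reported upward: trash counts as free, so quotas never see it. */
  public long reportedAvailable() {
    return actualFreeBytes + trashBytes;
  }

  /** Reserves space for an incoming replica, overwriting trash as needed. */
  public boolean reserve(long bytes) {
    if (bytes > reportedAvailable()) {
      return false; // genuinely out of space
    }
    if (bytes > actualFreeBytes) {
      long reclaimed = bytes - actualFreeBytes;
      trashBytes -= reclaimed;     // oldest trashed replicas get overwritten
      actualFreeBytes += reclaimed;
    }
    actualFreeBytes -= bytes;
    return true;
  }
}
{code}

Because {{reportedAvailable()}} already includes trash space, admins and the 
quota system would see the volume exactly as if the trash did not exist.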

> Add the ability to delay replica deletion for a period of time
> --------------------------------------------------------------
>
>                 Key: HDFS-8193
>                 URL: https://issues.apache.org/jira/browse/HDFS-8193
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Aaron T. Myers
>            Assignee: Zhe Zhang
>
> When doing maintenance on an HDFS cluster, users may be concerned about the 
> possibility of administrative mistakes or software bugs deleting replicas of 
> blocks that cannot easily be restored. It would be handy if HDFS could be 
> made to optionally not delete any replicas for a configurable period of time.


