Eli Collins created HADOOP-8598:
-----------------------------------

             Summary: Server-side Trash
                 Key: HADOOP-8598
                 URL: https://issues.apache.org/jira/browse/HADOOP-8598
             Project: Hadoop Common
          Issue Type: New Feature
    Affects Versions: 2.0.0-alpha
            Reporter: Eli Collins
            Assignee: Eli Collins
            Priority: Critical


There are a number of problems with Trash that continue to result in permanent 
data loss for users. The primary reasons trash is not used:

- Trash is configured client-side and not enabled by default.
- Trash is shell-only. FileSystem, WebHDFS, HttpFs, etc. never use trash.
- If trash fails, for example, because we can't create the trash directory or 
the move itself fails, trash is bypassed and the data is deleted.

Trash was designed as a feature to help end users via the shell, however in my 
experience the primary use of trash is to help administrators implement data 
retention policies (this was also the motivation for HADOOP-7460).  One could 
argue that (periodic read-only) snapshots are a better solution to this 
problem, however snapshots are not slated for Hadoop 2.x and trash is 
complementary to snapshots (and backup) - e.g. you may create and delete data 
within your snapshot or backup window - so it makes sense to revisit trash's 
design. I think it's worth bringing trash's functionality in line with what 
users need.

I propose we enable trash on a per-filesystem basis and implement it 
server-side, i.e. trash becomes an HDFS feature enabled by administrators. 
Because the trash emptier lives in HDFS and users already have a per-filesystem 
trash directory we're mostly there already. The design preference from 
HADOOP-2514 was for trash to be implemented in "user code"; however, given (a) 
the problems above, (b) that we have a lot more user-facing APIs than the shell, 
and (c) that clients increasingly span file systems (via federation and 
symlinks), this design choice makes less sense. This is why we already use a 
per-filesystem 
trash/home directory instead of the user's client-configured one - otherwise 
trash would not work because renames can't span file systems.

In short, HDFS trash would work similarly to how it does today; the difference 
is that client delete APIs would result in a rename into trash (a la 
TrashPolicyDefault#moveToTrash) if trash is enabled. Like today it would be 
renamed to the trash directory on the file system where the file being removed 
resides. The primary difference is that enablement and policy are configured 
server-side by administrators and apply regardless of the API used to access 
the filesystem. The one exception to this is that I think we should continue to 
support the explicit skipTrash shell option. The rationale for skipTrash 
(HADOOP-6080) is that a move to trash may fail in cases where a rm may not, if 
a user has a home directory quota and does a rmr /tonsOfData, for example. 
Without a way to bypass this the user has no way (unless we revisit quotas, 
permissions or trash paths) to remove a directory they have permissions to 
remove without getting their quota adjusted by an admin. The skip trash API can 
be implemented by adding an explicit FileSystem API that bypasses trash and 
modifying the shell to use it when skipTrash is enabled. Given that users must 
explicitly specify skipTrash the API is less error prone. We could have the 
shell ask for confirmation and annotate the API private to FsShell to discourage 
programmatic use. This is not ideal but can be done compatibly (unlike 
redefining quotas, permissions or trash paths).
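
To make the intended semantics concrete, here is a minimal sketch using a toy 
in-memory namespace. It is not Hadoop code: the class, the method names 
(including deleteSkipTrash) and the trash path layout are assumptions made for 
illustration only. The emptier, quotas, permissions and deletes of paths that 
are already in trash are left out.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy in-memory model of the proposal; none of this is Hadoop code. */
class ServerSideTrashSketch {
    private final Set<String> paths = new HashSet<>(); // namespace of one file system
    private final boolean trashEnabled;                // administrator's server-side setting

    ServerSideTrashSketch(boolean trashEnabled) {
        this.trashEnabled = trashEnabled;
    }

    void create(String path) {
        paths.add(path);
    }

    boolean exists(String path) {
        return paths.contains(path);
    }

    /** Assumed layout: the deleting user's trash directory on this same file system. */
    static String trashPathFor(String user, String path) {
        return "/user/" + user + "/.Trash/Current" + path;
    }

    /** Ordinary delete, reached from any client API: a rename into trash when enabled. */
    boolean delete(String user, String path) {
        if (!paths.remove(path)) {
            return false; // nothing to delete
        }
        if (trashEnabled) {
            paths.add(trashPathFor(user, path)); // rename within the same file system
        }
        return true;
    }

    /** Hypothetical explicit bypass, meant only for the shell's skipTrash option. */
    boolean deleteSkipTrash(String path) {
        return paths.remove(path);
    }
}
```

The point of the sketch is that the trash decision lives behind delete(), so 
every client API gets it for free, while bypassing trash requires calling a 
separate, explicitly named method.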

In terms of compatibility, while this proposal is technically an incompatible 
change (client-side configuration that disables trash and use of skipTrash with 
a previous FsShell release will both now be ignored if server-side trash is 
enabled, and non-HDFS file systems would need to make similar changes) I think 
it's worth targeting for Hadoop 2.x given that the new semantics preserve the 
current semantics. In 2.x I think we should preserve FsShell based trash and 
support both it and server-side trash (defaults to disabled). For trunk/3.x I 
think we should remove the FsShell based trash entirely and enable server-side 
trash by default.
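
The 2.x behavior described above can be summarized as a small decision table. 
This is only a sketch of how I read the proposal; the class, enum and parameter 
names are made up for illustration, and "newShell" stands for an FsShell that 
knows about the hypothetical bypass API.

```java
/** Toy decision table for the compatibility story; names are illustrative only. */
class TrashCompatSketch {
    enum Outcome { TRASHED, DELETED }

    /**
     * @param serverTrash server-side trash enabled by the administrator
     * @param clientTrash client-side (FsShell) trash enabled in client config
     * @param skipTrash   user passed skipTrash to the shell
     * @param newShell    shell uses the hypothetical explicit bypass API
     */
    static Outcome shellRm(boolean serverTrash, boolean clientTrash,
                           boolean skipTrash, boolean newShell) {
        if (serverTrash) {
            // Only a shell that calls the explicit bypass API can skip trash;
            // an older shell's skipTrash and the client config are ignored.
            return (skipTrash && newShell) ? Outcome.DELETED : Outcome.TRASHED;
        }
        // Server-side trash disabled (the proposed 2.x default): today's behavior.
        return (clientTrash && !skipTrash) ? Outcome.TRASHED : Outcome.DELETED;
    }
}
```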


