[ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627909#action_12627909 ]

dhruba borthakur commented on HADOOP-4058:
------------------------------------------

Hadoop users have been using Hadoop clusters as a queryable archive 
warehouse. This means that data, once it gets into the warehouse, is very 
unlikely to be deleted. This puts tremendous pressure on adding additional 
storage capacity to the production cluster.

There could be a set of storage-heavy nodes that cannot be added to the 
production cluster because they do not have enough memory and CPU. One option 
would be to use such an old-cluster to archive old files from the production 
cluster.

A layer of software can scan the file system in the production cluster to find 
files with the earliest access times (HADOOP-1869). These files can be moved to 
the old-cluster, and the original file in the production cluster can be replaced 
by a symbolic link (via HADOOP-4044). A read of the original file still works 
because of the symbolic link. Another piece of software periodically scans the 
old-cluster, finds files that were accessed recently, and moves them back to 
the production cluster.
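To make the archiver side concrete, here is a rough sketch of what such a scan 
could look like (untested; the namenode URIs, the /warehouse and /archive paths 
and the 90-day cutoff are made up, getAccessTime() is the access-time support 
from HADOOP-1869, and createSymlink() is only a stand-in for whatever primitive 
HADOOP-4044 ends up exposing):

{code}
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ColdFileArchiver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Made-up namenode addresses; substitute the real production and archive clusters.
    FileSystem prod = FileSystem.get(new URI("hdfs://prod-nn:8020/"), conf);
    FileSystem archive = FileSystem.get(new URI("hdfs://archive-nn:8020/"), conf);

    // Treat anything not read in the last 90 days as cold (cutoff is arbitrary).
    long cutoff = System.currentTimeMillis() - 90L * 24 * 60 * 60 * 1000;

    for (FileStatus stat : prod.listStatus(new Path("/warehouse"))) {
      if (stat.isDir() || stat.getAccessTime() >= cutoff) {
        continue;                       // only plain files that have gone cold
      }
      Path src = stat.getPath();
      Path dst = new Path("/archive" + src.toUri().getPath());

      // Copy the cold file to the archive cluster and delete the original.
      FileUtil.copy(prod, src, archive, dst, true, conf);

      // Leave a symlink behind so reads of the original path still resolve.
      // createSymlink is just a placeholder for the HADOOP-4044 primitive.
      FileContext.getFileContext(conf).createSymlink(
          new Path("hdfs://archive-nn:8020" + dst.toUri().getPath()), src, false);
    }
  }
}
{code}

The restore scan on the old-cluster would be the mirror image: find recently 
accessed files, copy them back, and replace the symlink with the real file.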

The advantage of this approach is that it is "layered": it is not built into 
HDFS but depends on two artifacts of HDFS, symbolic links and access times. I 
hate to put more and more intelligence into core-hdfs; otherwise the code 
becomes very bloated and difficult to maintain.


> Transparent archival and restore of files from HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-4058
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4058
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> There should be a facility to migrate old files away from a production 
> cluster. Access to those files from applications should continue to work 
> transparently, without changing application code, but maybe with reduced 
> performance. The policy engine that does this could be layered on HDFS 
> rather than being built into HDFS itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
