[
https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627909#action_12627909
]
dhruba borthakur commented on HADOOP-4058:
------------------------------------------
Hadoop users have been using Hadoop clusters as a queryable archive
warehouse. This means that data, once it gets into the warehouse, is very
unlikely to be deleted. This puts tremendous pressure on adding additional
storage capacity to the production cluster.
There could be a set of storage-heavy nodes that cannot be added to the
production cluster because they do not have enough memory and CPU. One option
would be to use this old-cluster to archive old files from the production
cluster.
A layer of software can scan the file system in the production cluster to find
files with the earliest access times (HADOOP-1869). These files can be moved to
the old-cluster, and the original file in the production cluster can be replaced
by a symbolic link (via HADOOP-4044). Reads of the original file continue to
work because of the symbolic link. Another piece of software periodically
scans the old-cluster, finds files that were accessed recently, and tries
to move them back to the production cluster.
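To make the policy concrete, here is a minimal sketch of the candidate-selection step of such a layered engine: given file metadata (paths, sizes, and the access times exposed by HADOOP-1869), pick the coldest files until a target amount of space is freed. All names here (`FileInfo`, `select_archive_candidates`, the thresholds) are hypothetical illustrations, not part of HDFS; the actual move-and-symlink step would use the symbolic links from HADOOP-4044.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400

@dataclass
class FileInfo:
    path: str          # path in the production cluster
    access_time: float # last access, epoch seconds (per HADOOP-1869)
    size: int          # bytes

def select_archive_candidates(files, max_age_days, target_bytes, now):
    """Return the coldest files older than max_age_days, in order of
    increasing access time, until at least target_bytes would be freed."""
    cutoff = now - max_age_days * SECONDS_PER_DAY
    cold = sorted((f for f in files if f.access_time < cutoff),
                  key=lambda f: f.access_time)
    chosen, freed = [], 0
    for f in cold:
        if freed >= target_bytes:
            break
        chosen.append(f)
        freed += f.size
    return chosen
```

The scanner would then move each chosen file to the old-cluster and replace it with a symlink; the reverse scanner applies the same selection with the comparison inverted (most recently accessed first).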
The advantage of this approach is that it is "layered"... it is not built into
HDFS but depends on two artifacts of HDFS: symbolic links and access times. I
hate to put more and more intelligence into core HDFS; otherwise the code
becomes very bloated and difficult to maintain.
> Transparent archival and restore of files from HDFS
> ---------------------------------------------------
>
> Key: HADOOP-4058
> URL: https://issues.apache.org/jira/browse/HADOOP-4058
> Project: Hadoop Core
> Issue Type: New Feature
> Components: dfs
> Reporter: dhruba borthakur
> Assignee: dhruba borthakur
>
> There should be a facility to migrate old files away from a production
> cluster. Access to those files from applications should continue to work
> transparently, without changing application code, but maybe with reduced
> performance. The policy engine that does this could be layered on HDFS
> rather than being built into HDFS itself.