[
https://issues.apache.org/jira/browse/HDFS-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Owen O'Malley resolved HDFS-224.
--------------------------------
Resolution: Duplicate
We have a different version of harchives.
> I propose a tool for creating and manipulating a new abstraction, Hadoop
> Archives.
> ----------------------------------------------------------------------------------
>
> Key: HDFS-224
> URL: https://issues.apache.org/jira/browse/HDFS-224
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Dick King
>
> -- Introduction
> In some hadoop map/reduce and dfs use cases, including a specific case that
> arises in my own work, users would like to populate dfs with a family of
> hundreds or thousands of directory trees, each of which consists of thousands
> of files. In our case, the trees each have perhaps 20 gigabytes; two or
> three 3-10-gigabyte files, a thousand small ones, and a large number of files
> of intermediate size. I am writing this JIRA to encourage discussion of a
> new facility I want to create and contribute to the dfs core.
> -- The problem
> You can't store such families of trees in dfs in the obvious manner. The
> problem is that the name node can't handle the millions or tens of millions of
> files that result from such a family, especially if there are a couple of families.
> I understand that dfs will not be able to accommodate tens of millions of
> files in one instance for quite a while.
> -- Exposed API of my proposed solution
> I would therefore like to produce, and contribute to the dfs core, a new tool
> that implements an abstraction called a Hadoop Archive [or harchive].
> Conceptually, a harchive is a unit, but it manages a space that looks like a
> directory tree. The tool exposes an interface that allows a user to do the
> following:
> * directory-level operations
> ** create a harchive [either empty, or initially populated from a
> locally-stored directory tree]. The namespace for harchives is the same as
> the space of possible dfs directory locators, and a harchive would in fact be
> implemented as a dfs directory with specialized contents.
> ** Add a directory tree to an existing harchive in a specific place within
> the harchive
> ** retrieve a directory tree or subtree at or beneath the root of the
> harchive directory structure, into a local directory tree
> * file-level operations
> ** add a local file to a specific place in the harchive
> ** modify a file image in a specific place in the harchive to match a
> local file
> ** delete a file image in the harchive.
> ** move a file image within the harchive
> ** open a file image in the harchive for reading or writing.
> * stream operations
> ** open a harchive file image for reading or writing as a stream, in a
> manner similar to dfs files, and read or write it [i.e., hdfsRead(...)].
> This would include random access operators for reading.
> * management operations
> ** commit a group of changes [which would be made atomically -- there
> would be no way half of a change could be made to a harchive if a client
> crashes].
> ** clean up a harchive whose performance has degraded because of
> extensive editing
> ** delete a harchive
> We would also implement a command line interface.
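> The operations above could be sketched, very loosely, as an in-memory model; the class and method names here [Harchive, addFile, moveFile, and so on] are illustrative assumptions, not a committed interface:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the proposed file-level harchive
// operations; names and signatures are illustrative only.
class Harchive {
    // Maps a path inside the harchive to that file image's bytes.
    private final Map<String, byte[]> images = new HashMap<>();

    // add a local file to a specific place in the harchive
    void addFile(String path, byte[] contents) {
        images.put(path, contents.clone());
    }

    // open a file image in the harchive for reading
    byte[] readFile(String path) {
        byte[] contents = images.get(path);
        if (contents == null) {
            throw new IllegalArgumentException("no such file image: " + path);
        }
        return contents.clone();
    }

    // move a file image within the harchive
    void moveFile(String from, String to) {
        images.put(to, images.remove(from));
    }

    // delete a file image in the harchive
    void deleteFile(String path) {
        images.remove(path);
    }

    boolean contains(String path) {
        return images.containsKey(path);
    }
}
```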
> -- Brief sketch of internals
> A harchive would be represented as a small collection of files, called
> segments, in a dfs directory at the harchive's location. Each segment would
> contain some of the files of the harchive's file images in a format to be
> determined, plus a harchive index. We may group files by size or some other
> criterion. It is likely that harchives would contain only one segment in
> common cases.
> Changes would be made by adding the contents of the new files, either by
> rewriting an existing segment that contains not much more data than the size
> of the changes, or by creating a new segment, complete with a new index. When
> dfs comes to be enhanced to allow appends to dfs files, as requested by
> HADOOP-1700 , we would be able to take advantage of that.
> Often, when a harchive is initially populated, it could be a single segment,
> and a file it contains could be accessed with two random accesses into the
> segment. The first access retrieves the index, and the second access
> retrieves the beginning of the file. We could choose to put smaller files
> closer to the index to allow lower average amortized costs per byte.
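> A minimal sketch of that two-access read path, assuming a hypothetical in-memory segment layout [the SegmentIndex class and its {offset, length} entries are illustrative, not an actual on-disk format]:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical segment index: maps a path to the {offset, length} of that
// file image within the segment's data region.
class SegmentIndex {
    final Map<String, long[]> entries = new LinkedHashMap<>();
}

// Illustrative in-memory segment: file images concatenated into one buffer,
// located through the index. In a real segment the index would sit at a
// known position so one access retrieves it and a second retrieves the file.
class Segment {
    private final byte[] data;
    private final SegmentIndex index;

    Segment(Map<String, byte[]> files) {
        SegmentIndex idx = new SegmentIndex();
        int total = 0;
        for (byte[] f : files.values()) total += f.length;
        byte[] buf = new byte[total];
        int off = 0;
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            idx.entries.put(e.getKey(), new long[] { off, e.getValue().length });
            System.arraycopy(e.getValue(), 0, buf, off, e.getValue().length);
            off += e.getValue().length;
        }
        this.data = buf;
        this.index = idx;
    }

    // Access 1: consult the index. Access 2: read at the recorded offset.
    byte[] read(String path) {
        long[] entry = index.entries.get(path);
        byte[] out = new byte[(int) entry[1]];
        System.arraycopy(data, (int) entry[0], out, 0, out.length);
        return out;
    }
}
```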
> We might instead choose to represent a harchive as one or a few files holding
> the large member files, plus separate smaller files holding the smaller
> members. That would let us make modifications by copying at lower cost.
> The segment containing the index is found by a naming convention. Atomicity
> is obtained by creating indices and renaming the files containing them
> according to the convention, when a change is committed.
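> The commit-by-rename step could be sketched against a local filesystem with java.nio [the names "_index" and "_index.tmp" are assumed purely for illustration, and dfs rename semantics would of course differ from a local atomic move]:

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical commit step: write the new index to a temporary file, then
// rename it onto the name the naming convention designates as current.
class IndexCommitter {
    static void commit(Path harchiveDir, byte[] newIndexBytes) throws IOException {
        Path tmp = harchiveDir.resolve("_index.tmp");
        Path current = harchiveDir.resolve("_index");
        // A crash before the rename leaves the old index untouched,
        // so readers never observe a half-committed change.
        Files.write(tmp, newIndexBytes);
        // The rename publishes the whole group of changes in one step.
        Files.move(tmp, current, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }
}
```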
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira