[ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592164#action_12592164 ]

mahadev edited comment on HADOOP-3307 at 4/24/08 1:36 PM:
----------------------------------------------------------------

Here is the design for the archives. 

Archiving files in HDFS

- *Motivation* 

The Namenode is a limited resource, and we usually end up with lots of small 
files that users do not access very often. We would like to create an archiving 
utility that can pack these files into archives that are semi-transparent and 
usable by map/reduce. 

- Why not just concatenate the files?
 Concatenating files might be useful, but it is not a full-fledged solution 
for archiving. Users want to keep their files as distinct files and would 
sometimes like to unarchive without losing the file layout.

-  *Requirements* 
 - Transparent or semi-transparent usage of archives. 
 - Must be able to archive and unarchive in parallel. 
 - Mutable archives are not a requirement, but the design should not prevent 
them from being implemented later.
 - Compression is not a goal.

-  *Archive Format*
- Conventional archive formats like tar are not convenient for parallel archive 
creation. 
- Here is a proposal that allows archives to be created in parallel.

An archive appears as filesystem paths of the form: 

/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-* 

The indexes store the filenames and their offsets within the part files.
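
To make the lookup concrete, here is a minimal Java sketch. The index layout 
assumed below (one line per archived file: name, part file, offset, length) is 
only an illustration; this proposal does not fix the on-disk format, and the 
class and field names are made up.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    class HarIndexSketch {
      // Location of one archived file inside the part files.
      static class Entry {
        final String partFile;
        final long offset;
        final long length;
        Entry(String partFile, long offset, long length) {
          this.partFile = partFile;
          this.offset = offset;
          this.length = length;
        }
      }

      // Parse the index into a map from archived filename to its location.
      // Assumes one line per file: "<filename> <part-file> <offset> <length>".
      static Map<String, Entry> readIndex(BufferedReader index) throws IOException {
        Map<String, Entry> entries = new HashMap<String, Entry>();
        String line;
        while ((line = index.readLine()) != null) {
          String[] f = line.split(" ");
          entries.put(f[0],
              new Entry(f[1], Long.parseLong(f[2]), Long.parseLong(f[3])));
        }
        return entries;
      }
    }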

-  *URI Syntax*
The HAR FileSystem is a client-side filesystem and is semi-transparent. 
- har:<archivePath>!<fileInArchive> (similar to a jar URI)
example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive
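
For illustration, resolving such a URI into its two components is a simple 
split on '!'. The class below is a made-up sketch, not part of the proposal:

    // Sketch: split the proposed har URI on '!' into the archive path and
    // the path inside the archive.
    class HarUriSketch {
      public static void main(String[] args) {
        String uri = "har:hdfs://host:port/pathinfilesystem/foo.har"
            + "!path_inside_thearchive";
        String rest = uri.substring("har:".length());
        int bang = rest.indexOf('!');
        String archivePath = rest.substring(0, bang);    // the .har directory
        String fileInArchive = rest.substring(bang + 1); // path inside it
        System.out.println(archivePath + " -> " + fileInArchive);
      }
    }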

- How will map/reduce work with this new filesystem?
   No changes to map/reduce are required to use archives as input to 
map/reduce jobs.
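
Concretely, pointing a job at an archive might look like the sketch below. 
The class name, host:port, and paths are placeholders, and this assumes the 
proposed har:...!... form is accepted as a Path:

    // Sketch: an archive is handed to a job as an ordinary input path, so
    // no map/reduce changes are needed.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class HarJobSketch {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(HarJobSketch.class);
        FileInputFormat.setInputPaths(conf,
            new Path("har:hdfs://host:port/user/mahadev/foo.har!logs"));
        // ... configure mapper, reducer, and output path as usual, then
        // submit with JobClient.runJob(conf). Nothing HAR-specific needed.
      }
    }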

- How will the dfs commands work?

   The dfs commands will have to specify the whole URI when operating on files 
in an archive, e.g. hadoop dfs -ls 'har:hdfs://host:port/user/mahadev/foo.har!somedir'. 
Archives are immutable, so renames, deletes, and creates will throw an 
exception in the initial versions of archives. 
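
The programmatic equivalent might look like the following sketch. Again, names 
and paths are placeholders; a HAR filesystem implementation registered for the 
har: scheme is assumed, as is parseability of the proposed URI form:

    // Sketch: read-only access goes through the normal FileSystem API;
    // mutating calls would throw, since archives are immutable at first.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HarListSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path dir = new Path("har:hdfs://host:port/user/mahadev/foo.har!somedir");
        FileSystem fs = FileSystem.get(dir.toUri(), conf); // picks har filesystem
        for (FileStatus status : fs.listStatus(dir)) {     // read-only: allowed
          System.out.println(status.getPath());
        }
        // fs.delete(dir, true);  // would throw: archives are immutable
      }
    }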

- How will permissions work with archives?
   In the first version of HAR, all files archived into a HAR will lose the 
permissions they originally had. In later versions of HAR, permissions can be 
stored in the metadata, making it possible to unarchive without losing 
permissions.

- *Future Work*

- Transparent use of archives. 
   This will require changes to the Hadoop FileSystem to support mounts that 
point to archives, and changes to the DFSClient to transparently walk such a 
mount to the real archive, allowing fully transparent use of archives.
 
Comments?





> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 
