[ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mahadev konar updated HADOOP-3307:
----------------------------------
    Attachment: hadoop-3307_1.patch

This patch addresses the archives issue. It includes the following:

- har:///user/mahadev/foo.har denotes a Hadoop archive. This is the default URI form; it uses the default underlying filesystem specified in your conf. To be explicit, or to use an HDFS other than the default one, the URI is:

    har://hdfs-host:port/user/mahadev/foo.har

  The URIs carry an implicit assumption about which part of the path denotes the archive directory: the code scans the path from the end and takes the component matching *.har to be the archive directory.
- It has a filesystem layer, so commands like

    hadoop fs -ls har:///user/mahadev/foo.har

  work. Most of the mutating commands are not implemented for archives. -cat and -copyToLocal work as expected.
- It works with map reduce, so the input to a map reduce job can be har:///user/mahadev/foo.har and this works fine.

Code design and explanation:

- There are two index files. The _index file contains entries of the form

    filename <dir>/<file> partfile startindex size childpathnames_if_directory

  and is sorted by the hashcode of the filenames. The second index file, _masterindex, contains pointers into the _index file to speed up the lookup of files inside it.
- To create an archive, the user needs to run

    bin/hadoop archives -archiveName foo.har inputpaths outputdir

  This is a map reduce job in which all the files are distributed amongst the maps, which create part files of around 2GB or so. The reduce then gets the startindex and size from the maps for all the files and creates the _index and _masterindex files.
- Permissions are not persisted, so the permissions returned by the Har filesystem are the same as those of the index files.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
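
For readers who want the har URI convention from the comment in concrete form, here is a minimal sketch in Python (not the patch's Java code) of scanning the path from the end for the component matching *.har. The function name split_har_uri, the returned tuple layout, and the port 8020 in the usage note are illustrative assumptions and do not come from the patch. The authority part is returned unmodified rather than interpreted.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_har_uri(uri: str) -> Optional[Tuple[str, str, str]]:
    """Split a har URI into (authority, archive_path, path_inside_archive).

    Hypothetical helper: returns None when the scheme is not "har" or when
    no path component ends in ".har".
    """
    parts = urlsplit(uri)
    if parts.scheme != "har":
        return None

    # Drop empty components produced by leading or repeated slashes.
    components = [c for c in parts.path.split("/") if c]

    # Walk from the last component backwards; the first match found this way
    # is the component closest to the end of the path.
    for i in range(len(components) - 1, -1, -1):
        if components[i].endswith(".har"):
            archive_path = "/" + "/".join(components[: i + 1])
            inner_path = "/".join(components[i + 1:])
            return parts.netloc, archive_path, inner_path

    return None
```

Under these assumptions, split_har_uri("har://hdfs-host:8020/user/mahadev/foo.har/dir/file.txt") yields the authority "hdfs-host:8020", the archive path "/user/mahadev/foo.har", and the inner path "dir/file.txt". A URI with an empty authority, such as har:///user/mahadev/foo.har, corresponds to the default-filesystem case in the comment.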