[ https://issues.apache.org/jira/browse/HDFS-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564043#comment-14564043 ]
Sanjay Radia commented on HDFS-8401:
------------------------------------

Consider the following use case: one wants to run a few jobs and cache the input and the intermediate output just for the duration of those jobs. Today the user has to pin such data by changing the directory or file attributes, and then reset the attributes when the jobs finish. It is easier to say "jobxxx input=memfs://.../input tmp=memfs://.../tmpdir output=". Here setting the scheme is not inconvenient, since it is part of the parameters to a program. Further, this works with any existing application (Hive, Pig, etc.) since the hint to cache is in the scheme of the pathname.

Our existing policies and directory-level settings work when things are semi-permanent (i.e. this directory has dimension tables, please cache them, and all jobs will benefit). In addition, we could add, or already have, programmatic APIs to indicate that a file being read or written needs to be cached. But this requires changes to the application code.

Once we get fully automated memory caching working, we will need neither our existing storage policies nor layers like memfs, since the system will just take care of it all. But it will take us some time to get there.

I think both approaches have their own strengths and are complementary. Note that spark-tachyon uses a layered file system, and the approach is viewed as a simple way to control which files get cached on a per-job basis. Further, one can also cache specific Hive tables in the Hive metastore by giving a pathname that has the memfs scheme. Here the memfs pathname and setting the directory's attributes are roughly equal from an ease-of-use perspective.

An additional point about memfs for non-HDFS systems: the memfs *abstraction* allows caching S3 data in a very similar fashion. Of course, one will have to build a full caching implementation of memfs for S3, because the memfs proposed in this Jira is a very thin layer over HDFS: all of the caching mechanism is already in HDFS.
So I expect several implementations of the memfs interface for HCFS file systems.

> Memfs - a layered file system for in-memory storage in HDFS
> -----------------------------------------------------------
>
>                 Key: HDFS-8401
>                 URL: https://issues.apache.org/jira/browse/HDFS-8401
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>
> We propose creating a layered filesystem that can provide in-memory storage
> using existing features within HDFS. memfs will use lazy persist writes
> introduced by HDFS-6581. For reads, memfs can use the Centralized Cache
> Management feature introduced in HDFS-4949 to load hot data to memory.
> Paths in memfs and hdfs will correspond 1:1 so memfs will require no
> additional metadata and it can be implemented entirely as a client-side
> library.
> The advantage of a layered file system is that it requires little or no
> changes to existing applications. For example, applications can use something
> like {{memfs://}} instead of {{hdfs://}} for files targeted to memory storage.
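
As a rough illustration of the 1:1 path correspondence described in the issue, the sketch below rewrites a memfs:// URI into the hdfs:// URI with the same authority and path, using only java.net.URI. The class name MemfsPathSketch, the method toHdfs, and the host nn1:8020 are hypothetical and are not taken from the proposal or from any Hadoop API. A real client-side layer would presumably also apply the lazy-persist or cache hints, which this sketch does not attempt.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only; not part of Hadoop or of HDFS-8401.
final class MemfsPathSketch {

    // Map a memfs:// URI to the hdfs:// URI with the same authority, path,
    // query and fragment. Only the scheme changes, so no extra metadata is needed.
    static URI toHdfs(URI memfsUri) {
        if (!"memfs".equals(memfsUri.getScheme())) {
            throw new IllegalArgumentException("not a memfs URI: " + memfsUri);
        }
        try {
            return new URI("hdfs", memfsUri.getAuthority(), memfsUri.getPath(),
                           memfsUri.getQuery(), memfsUri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // nn1:8020 is a made-up namenode address.
        URI in = URI.create("memfs://nn1:8020/jobs/input");
        System.out.println(toHdfs(in));
    }
}
```

Because the mapping touches nothing but the scheme, a job can opt in to memory storage purely through the pathnames it is given, which is the per-job convenience argued for in the comment.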