[ 
https://issues.apache.org/jira/browse/HDFS-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564043#comment-14564043
 ] 

Sanjay Radia commented on HDFS-8401:
------------------------------------

Consider the following use case: one wants to run a few jobs and cache the 
input and the intermediate output just for the duration of these jobs. Today 
the user has to pin such data by changing the directory/file attributes, and 
reset those attributes once the jobs are finished. It is easier to say "jobxxx 
input = memfs://.../input tmp=memfs://.../tmpdir  output=". Here setting the 
scheme is not inconvenient, since it is part of the parameters to a program. 
Furthermore, this works with any existing application (Hive, Pig, etc.), since 
the hint to cache is in the scheme of the pathname. Our existing policies and 
dir-level settings work when things are semi-permanent (i.e., this dir has 
dimension tables, so please cache them, and all jobs will benefit). In 
addition, we could add, or already have, programmatic APIs to indicate that a 
file being read or written needs to be cached, but this requires changes to 
the application code. Once we get fully automated memory caching working, we 
will need neither our existing storage policies nor layers like memfs, since 
the system will just take care of it all. But it will take us some time to get 
there.
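
To make the scheme-as-hint idea concrete, here is a minimal sketch of how a 
client-side layer might translate a memfs path to the corresponding hdfs path, 
relying on the 1:1 path correspondence stated in the issue description. The 
class name MemfsPathSketch, the methods toHdfs and wantsMemory, and the host 
"namenode:8020" are hypothetical names chosen for illustration; this is not 
the proposed implementation and it uses only java.net.URI, not any Hadoop API.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch only: all names here are assumptions, not part of HDFS-8401.
public class MemfsPathSketch {
    static final String MEMFS_SCHEME = "memfs";
    static final String HDFS_SCHEME = "hdfs";

    // True when the path's scheme carries the "keep this in memory" hint.
    static boolean wantsMemory(URI path) {
        return MEMFS_SCHEME.equalsIgnoreCase(path.getScheme());
    }

    // Map a memfs URI to the hdfs URI with the same authority and path.
    // URIs with any other scheme (or no scheme) are returned unchanged.
    static URI toHdfs(URI path) throws URISyntaxException {
        if (!wantsMemory(path)) {
            return path;
        }
        return new URI(HDFS_SCHEME, path.getAuthority(), path.getPath(),
                       path.getQuery(), path.getFragment());
    }

    public static void main(String[] args) throws URISyntaxException {
        URI input = new URI("memfs://namenode:8020/jobs/jobxxx/input");
        // The scheme is the only thing that changes between the two views.
        System.out.println(wantsMemory(input) + " " + toHdfs(input));
        // prints "true hdfs://namenode:8020/jobs/jobxxx/input"
    }
}
```

Under this sketch the application only sees a different scheme in its job 
parameters; whether the underlying data is then written with lazy persist or 
loaded through cache directives is left to the layer and is not shown here.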

I think both approaches have their own strengths and are complementary. Note 
that Spark-Tachyon uses a layered file system, and that approach is viewed as 
a simple way to control which files get cached on a per-job basis.

Further, one can also cache specific Hive tables in the Hive metastore by 
giving a path name that has the memfs scheme. Here the memfs pathname and 
setting the dir's attributes are roughly equal from an ease-of-use perspective.

An additional point about memfs for non-HDFS systems: the Memfs *abstraction* 
allows caching S3 data in a very similar fashion. Of course, one will have to 
build a full caching implementation of memfs for S3, because the memfs 
proposed in this Jira is a very thin layer over HDFS: ALL the caching 
mechanism is already in HDFS. So I expect several implementations of the memfs 
interface for HCFS file systems.

> Memfs - a layered file system for in-memory storage in HDFS
> -----------------------------------------------------------
>
>                 Key: HDFS-8401
>                 URL: https://issues.apache.org/jira/browse/HDFS-8401
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Arpit Agarwal
>            Assignee: Arpit Agarwal
>
> We propose creating a layered filesystem that can provide in-memory storage 
> using existing features within HDFS. memfs will use lazy persist writes 
> introduced by HDFS-6581. For reads, memfs can use the Centralized Cache 
> Management feature introduced in HDFS-4949 to load hot data to memory.
> Paths in memfs and hdfs will correspond 1:1 so memfs will require no 
> additional metadata and it can be implemented entirely as a client-side 
> library.
> The advantage of a layered file system is that it requires little or no 
> change to existing applications. For example, applications can use something 
> like {{memfs://}} instead of {{hdfs://}} for files targeted to memory storage. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
