[ https://issues.apache.org/jira/browse/HDFS-8401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564043#comment-14564043 ]
Sanjay Radia commented on HDFS-8401:
------------------------------------
Consider the following use case: one wants to run a few jobs and cache the
input and the intermediate output just for the duration of these jobs. Today
the user has to pin such data by changing the directory or file attributes, and
when the jobs are finished they have to reset those attributes. It is easier to
say "jobxxx input = memfs://.../input tmp=memfs://.../tmpdir output=". Here
setting the scheme is not inconvenient since it is part of the parameters to a
program. Further, this works with any existing application (Hive, Pig, etc.)
since the hint to cache is in the scheme of the pathname. Our existing policies
and directory-level settings work when things are semi-permanent (e.g. this
directory has dimension tables, so please cache them and all jobs will
benefit). In addition, we could add, or already have, programmatic APIs to
indicate that a file being read or written needs to be cached, but these
require changes to the application code. Once we get fully automated memory
caching working we will need neither our existing storage policies nor layers
like memfs, since the system will just take care of it all, but it will take us
some time to get there.
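To make the scheme-as-hint idea concrete, here is a minimal sketch (not code
from any patch on this issue) of how a job launcher might derive a memfs path
from an existing hdfs path by swapping only the scheme. The class name
MemfsPaths, its helper methods, and the host nn.example.com:8020 are
assumptions made up for illustration; only the memfs:// and hdfs:// scheme
names come from this issue.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only; not part of Hadoop.
final class MemfsPaths {
    static final String HDFS_SCHEME = "hdfs";
    static final String MEMFS_SCHEME = "memfs"; // scheme name as used in this issue

    // Replace only the scheme; authority and path are kept unchanged,
    // so the two names refer to the same file.
    static URI withScheme(URI uri, String scheme) {
        try {
            return new URI(scheme, uri.getAuthority(), uri.getPath(),
                           uri.getQuery(), uri.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot rewrite " + uri, e);
        }
    }

    // hdfs://host/path -> memfs://host/path
    static URI toMemfs(URI hdfsUri) {
        if (!HDFS_SCHEME.equals(hdfsUri.getScheme())) {
            throw new IllegalArgumentException("Not an hdfs URI: " + hdfsUri);
        }
        return withScheme(hdfsUri, MEMFS_SCHEME);
    }

    // memfs://host/path -> hdfs://host/path
    static URI toHdfs(URI memfsUri) {
        if (!MEMFS_SCHEME.equals(memfsUri.getScheme())) {
            throw new IllegalArgumentException("Not a memfs URI: " + memfsUri);
        }
        return withScheme(memfsUri, HDFS_SCHEME);
    }

    public static void main(String[] args) {
        // Assumed example location; any hdfs URI would do.
        URI input = URI.create("hdfs://nn.example.com:8020/warehouse/input");
        System.out.println(toMemfs(input));
    }
}
```

Because nothing but the scheme changes, the application itself needs no code
change; only the path strings passed to the job differ.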
I think both approaches have their own strengths and are complementary. Note
that spark-tachyon uses a layered file system, and that approach is viewed as a
simple way to control which files get cached on a per-job basis.
Further, one can also cache specific Hive tables in the Hive metastore by giving
them a path name that has the memfs scheme. Here, using a memfs pathname and
setting the directory's attributes are roughly equal from an ease-of-use
perspective.
An additional point about memfs for non-HDFS systems: the memfs *abstraction*
allows caching S3 data in a very similar fashion. Of course, one will have to
build a full caching implementation of memfs for S3, because the memfs proposed
in this Jira is a very thin layer over HDFS: ALL of the caching mechanism is
already in HDFS. So I expect several implementations of the memfs interface for
HCFS file systems.
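The difference between a thin layer and a full caching implementation can be
sketched roughly as follows. This is only an illustration under loud
assumptions: the Memfs interface, both classes, and the use of a plain function
as a stand-in for the backing store are all hypothetical, and none of this is
the API proposed in this Jira.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, for illustration only; not a proposed Hadoop API.
final class MemfsSketch {
    // Assumed abstraction: read the bytes stored at a path.
    interface Memfs {
        byte[] read(String path);
    }

    // Thin layer: the backing store (standing in for HDFS) already does
    // the caching, so this implementation only delegates.
    static final class ThinMemfs implements Memfs {
        private final Function<String, byte[]> backing;

        ThinMemfs(Function<String, byte[]> backing) {
            this.backing = backing;
        }

        @Override
        public byte[] read(String path) {
            return backing.apply(path);
        }
    }

    // Full caching layer: the backing store (standing in for S3) has no
    // cache of its own, so the layer must keep one itself.
    static final class CachingMemfs implements Memfs {
        private final Function<String, byte[]> backing;
        private final Map<String, byte[]> cache = new HashMap<>();
        int backingReads = 0; // counts reads that reached the backing store

        CachingMemfs(Function<String, byte[]> backing) {
            this.backing = backing;
        }

        @Override
        public byte[] read(String path) {
            // Only the first read of a path goes to the backing store.
            return cache.computeIfAbsent(path, p -> {
                backingReads++;
                return backing.apply(p);
            });
        }
    }

    public static void main(String[] args) {
        CachingMemfs s3Like =
            new CachingMemfs(p -> p.getBytes(StandardCharsets.UTF_8));
        s3Like.read("/input");
        s3Like.read("/input"); // second read is served from the cache
        System.out.println("backing reads: " + s3Like.backingReads);
    }
}
```

A real implementation would also have to deal with eviction, invalidation, and
writes, all of which this toy cache ignores.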
> Memfs - a layered file system for in-memory storage in HDFS
> -----------------------------------------------------------
>
> Key: HDFS-8401
> URL: https://issues.apache.org/jira/browse/HDFS-8401
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Arpit Agarwal
> Assignee: Arpit Agarwal
>
> We propose creating a layered filesystem that can provide in-memory storage
> using existing features within HDFS. memfs will use lazy persist writes
> introduced by HDFS-6581. For reads, memfs can use the Centralized Cache
> Management feature introduced in HDFS-4949 to load hot data to memory.
> Paths in memfs and hdfs will correspond 1:1 so memfs will require no
> additional metadata and it can be implemented entirely as a client-side
> library.
> The advantage of a layered file system is that it requires few or no
> changes to existing applications. For example, applications can use something
> like {{memfs://}} instead of {{hdfs://}} for files targeted to memory storage.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)