My intention isn't to make it a mandatory feature, just an option. Keeping data locally on a filesystem as a form of Lx cache is far better than fetching it over the network, and a hit in the fs buffer cache is much cheaper than an RPC call.
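To make the idea concrete, here is a rough sketch (in Java) of what a read-through client-side cache could look like. To be clear about assumptions: dfs.client.cachedirectory is the hypothetical property from the proposal quoted below, CachingHdfsReader is a made-up class, and eviction against dfs.client.cachesize is left out to keep it short.

// Sketch only: dfs.client.cachedirectory is a hypothetical property,
// not part of the real HDFS client configuration.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CachingHdfsReader {
  private final FileSystem fs;
  private final File cacheDir;

  public CachingHdfsReader(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
    this.cacheDir = new File(conf.get("dfs.client.cachedirectory", "/var/cache/hdfs"));
    if (!cacheDir.exists() && !cacheDir.mkdirs()) {
      throw new IOException("cannot create cache directory " + cacheDir);
    }
  }

  /** Open an HDFS path, serving repeat reads from the local cache. */
  public InputStream open(Path hdfsPath) throws IOException {
    File cached = new File(cacheDir, hdfsPath.toUri().getPath().replace('/', '_'));
    if (!cached.exists()) {
      // Cache miss: copy the file down once; subsequent opens read from the
      // local filesystem (and the OS buffer cache) instead of the network.
      FileUtil.copy(fs, hdfsPath, cached, false, fs.getConf());
    }
    return new FileInputStream(cached);
  }
}

Eviction against dfs.client.cachesize and invalidation when the HDFS file changes would of course be the hard parts; this just shows where the config knobs would plug in.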
On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo <[email protected]> wrote:

> The challenge of this design is that accessing the same data over and
> over again is the uncommon use case for Hadoop. Hadoop's bread and butter
> is streaming through large datasets that do not fit in memory. Also, your
> shuffle-sort-spill is going to play havoc on any filesystem-based cache.
> The distributed cache roughly fits this role, except that it does not
> persist after a job.
>
> Replicating content to N nodes also is not a hard problem to tackle (you
> can hack up a content delivery system with ssh+rsync) and get similar
> results. The approach often taken has been to keep data that is accessed
> repeatedly and fits in memory in some other system
> (hbase/cassandra/mysql/whatever).
>
> Edward
>
> On Mon, Jan 16, 2012 at 11:33 AM, Rita <[email protected]> wrote:
>
> > Thanks. I believe this is a good feature to have for clients, especially
> > if you are reading the same large file over and over.
> >
> > On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon <[email protected]> wrote:
> >
> > > There is some work being done in this area by some folks over at UC
> > > Berkeley's AMP Lab in coordination with Facebook. I don't believe it
> > > has been published quite yet, but the title of the project is "PACMan"
> > > -- I expect it will be published soon.
> > >
> > > -Todd
> > >
> > > On Sat, Jan 14, 2012 at 5:30 PM, Rita <[email protected]> wrote:
> > >
> > > > After reading this article,
> > > > http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I
> > > > was wondering if there was a filesystem cache for HDFS. For example,
> > > > if a large file (10 gigabytes) keeps getting accessed on the cluster,
> > > > instead of fetching it from the network each time, why not store the
> > > > contents of the file locally on the client itself? A use case on the
> > > > client would look like this:
> > > >
> > > > <property>
> > > >   <name>dfs.client.cachedirectory</name>
> > > >   <value>/var/cache/hdfs</value>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>dfs.client.cachesize</name>
> > > >   <description>in megabytes</description>
> > > >   <value>100000</value>
> > > > </property>
> > > >
> > > > Any thoughts on a feature like this?
> > > >
> > > > --
> > > > --- Get your facts first, then you can distort them as you please.--
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--

--
--- Get your facts first, then you can distort them as you please.--
