My intention isn't to make it a mandatory feature, just an option. Keeping data locally on a filesystem as a form of Lx cache is far better than fetching it over the network, and a hit in the fs buffer cache is much cheaper than an RPC call.
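To make the idea concrete, here is a rough sketch (in Java) of what a read-through client-side cache could look like. To be clear about assumptions: dfs.client.cachedirectory is the hypothetical property from the proposal quoted below, CachingHdfsReader is a made-up class, and eviction against dfs.client.cachesize is left out to keep it short.

// Sketch only: dfs.client.cachedirectory is a hypothetical property,
// not part of the real HDFS client configuration.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CachingHdfsReader {
  private final FileSystem fs;
  private final File cacheDir;

  public CachingHdfsReader(Configuration conf) throws IOException {
    this.fs = FileSystem.get(conf);
    this.cacheDir = new File(conf.get("dfs.client.cachedirectory", "/var/cache/hdfs"));
    if (!cacheDir.exists() && !cacheDir.mkdirs()) {
      throw new IOException("cannot create cache directory " + cacheDir);
    }
  }

  /** Open an HDFS path, serving repeat reads from the local cache. */
  public InputStream open(Path hdfsPath) throws IOException {
    File cached = new File(cacheDir, hdfsPath.toUri().getPath().replace('/', '_'));
    if (!cached.exists()) {
      // Cache miss: copy the file down once; subsequent opens read from the
      // local filesystem (and the OS buffer cache) instead of the network.
      FileUtil.copy(fs, hdfsPath, cached, false, fs.getConf());
    }
    return new FileInputStream(cached);
  }
}

Eviction against dfs.client.cachesize and invalidation when the HDFS file changes would of course be the hard parts; this just shows where the config knobs would plug in.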
On Mon, Jan 16, 2012 at 1:07 PM, Edward Capriolo <[email protected]> wrote:

> The challenge of this design is that accessing the same data over and
> over again is the uncommon use case for Hadoop. Hadoop's bread and butter
> is streaming through large datasets that do not fit in memory. Also, your
> shuffle-sort-spill is going to play havoc on any filesystem-based cache.
> The distributed cache roughly fits this role, except that it does not
> persist after a job.
>
> Replicating content to N nodes also is not a hard problem to tackle (you
> can hack up a content delivery system with ssh+rsync) and get similar
> results. The approach often taken has been to keep data that is accessed
> repeatedly and fits in memory in some other system
> (hbase/cassandra/mysql/whatever).
>
> Edward
>
> On Mon, Jan 16, 2012 at 11:33 AM, Rita <[email protected]> wrote:
>
> > Thanks. I believe this is a good feature to have for clients, especially
> > if you are reading the same large file over and over.
> >
> > On Sun, Jan 15, 2012 at 7:33 PM, Todd Lipcon <[email protected]> wrote:
> >
> > > There is some work being done in this area by some folks over at UC
> > > Berkeley's AMP Lab in coordination with Facebook. I don't believe it
> > > has been published quite yet, but the title of the project is "PACMan"
> > > -- I expect it will be published soon.
> > >
> > > -Todd
> > >
> > > On Sat, Jan 14, 2012 at 5:30 PM, Rita <[email protected]> wrote:
> > >
> > > > After reading this article,
> > > > http://www.cloudera.com/blog/2012/01/caching-in-hbase-slabcache/ , I
> > > > was wondering if there was a filesystem cache for HDFS. For example,
> > > > if a large file (10 gigabytes) keeps getting accessed on the cluster,
> > > > instead of fetching it from the network each time, why not store the
> > > > contents of the file locally on the client itself? A use case on the
> > > > client would look like this:
> > > >
> > > > <property>
> > > >   <name>dfs.client.cachedirectory</name>
> > > >   <value>/var/cache/hdfs</value>
> > > > </property>
> > > >
> > > > <property>
> > > >   <name>dfs.client.cachesize</name>
> > > >   <description>in megabytes</description>
> > > >   <value>100000</value>
> > > > </property>
> > > >
> > > > Any thoughts on a feature like this?
> > > >
> > > > --
> > > > --- Get your facts first, then you can distort them as you please.--
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> >
> > --
> > --- Get your facts first, then you can distort them as you please.--

--
--- Get your facts first, then you can distort them as you please.--
