After looking at the HBaseRegionServer and its functionality, I began
wondering if there is a more general use case for memory caching of
HDFS blocks/files. In many use cases people wish to store data on
Hadoop indefinitely; however, the data from the last day, last week,
or last month is probably the most actively used. For some Hadoop
clusters the amount of raw new data could be less than the total RAM
in the cluster.

Also, some data will be used repeatedly: the same source data may be
used to generate multiple result sets, and those results may be used
as the input to other processes.

I am thinking an answer could be to dedicate an amount of physical
memory on each DataNode, or on several dedicated nodes, to a
distributed memcached-like layer. Managing this cache should be
straightforward since Hadoop blocks are pretty much static. (So, say,
for a DataNode with 8 GB of memory, dedicate 1 GB to a
HadoopCacheServer.) If you had 1000 nodes, that cache would be quite
large: roughly 1 TB in aggregate.
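
To make the per-node part concrete, here is a rough sketch of a
fixed-budget block cache. This is not existing Hadoop code: the class
name BlockCache, the use of a long block ID as the key, and the LRU
eviction policy are all assumptions for illustration. Because blocks
are treated as immutable, there is no invalidation logic, only
eviction when the byte budget is exceeded.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-node cache: holds immutable block bytes up to a byte budget. */
class BlockCache {
    private final long capacityBytes;
    private long usedBytes = 0;
    // accessOrder=true keeps iteration order least-recently-used first.
    private final LinkedHashMap<Long, byte[]> blocks =
            new LinkedHashMap<>(16, 0.75f, true);

    BlockCache(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Returns the cached bytes, or null on a miss (the caller then reads from disk). */
    synchronized byte[] get(long blockId) {
        return blocks.get(blockId);
    }

    /** Caches a block, evicting least-recently-used blocks until it fits. */
    synchronized void put(long blockId, byte[] data) {
        if (data.length > capacityBytes) {
            return; // larger than the whole budget, so never cache it
        }
        byte[] previous = blocks.remove(blockId);
        if (previous != null) {
            usedBytes -= previous.length;
        }
        Iterator<Map.Entry<Long, byte[]>> it = blocks.entrySet().iterator();
        while (usedBytes + data.length > capacityBytes && it.hasNext()) {
            usedBytes -= it.next().getValue().length;
            it.remove();
        }
        blocks.put(blockId, data);
        usedBytes += data.length;
    }

    synchronized long usedBytes() {
        return usedBytes;
    }
}
```

With the 1 GB figure above, a node would construct this as
new BlockCache(1L << 30). A real version would also need a way for
clients to find out which node holds which cached block.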

Additionally we could create a new file system type cachedhdfs
implemented as a facade, or possibly implement CachedInputFormat or
CachedOutputFormat.
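
The facade idea could look something like the sketch below. Everything
here is hypothetical: FileSource stands in for whatever read interface
the real file system exposes, and CachedFileSource is the wrapper that
checks memory before falling through to it. The unbounded map is only
there to keep the example short; a real layer would cap its size.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the underlying file system's read path. */
interface FileSource {
    byte[] read(String path);
}

/** Facade: same interface as the wrapped source, but repeat reads come from memory. */
class CachedFileSource implements FileSource {
    private final FileSource delegate;
    // Unbounded for brevity; assumes file contents never change once written.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    CachedFileSource(FileSource delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] read(String path) {
        // The delegate is consulted only when the path is not cached yet.
        return cache.computeIfAbsent(path, delegate::read);
    }
}
```

Callers would keep using the same read call; only the construction
changes, which is the appeal of doing this as a facade rather than
changing each job.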

I know that the underlying filesystems have their own cache, but I
think Hadoop writing intermediate data is going to evict some of the
data that "should be" semi-permanent.

So has anyone looked into something like this? This was the closest
thing I found.

http://issues.apache.org/jira/browse/HADOOP-288

My goal here is to keep recent data in memory so that tools like Hive
can get a big boost on queries for new data.

Does anyone have any ideas?
