After looking at the HBaseRegionServer and its functionality, I began wondering whether there is a more general use case for in-memory caching of HDFS blocks/files. In many use cases people wish to store data on Hadoop indefinitely; however, the data from the last day, last week, or last month is probably the most actively used. For some Hadoop clusters the amount of raw new data could be less than the total RAM in the cluster.
Also, some data will be used repeatedly: the same source data may be used to generate multiple result sets, and those results may be used as the input to other processes. I am thinking an answer could be to dedicate an amount of physical memory on each DataNode, or on several dedicated nodes, to a distributed memcache-like layer. Managing this cache should be straightforward since Hadoop blocks are pretty much static, so cached copies should rarely need invalidating. (So say, for a DataNode with 8 GB of memory, dedicate 1 GB to a HadoopCacheServer.) If you had 1000 nodes, that cache would be quite large: roughly 1 TB in aggregate. Additionally, we could create a new file system type, cachedhdfs, implemented as a facade, or possibly implement a CachedInputFormat or CachedOutputFormat.

I know that the underlying filesystems have a cache, but I think Hadoop writing intermediate data is going to evict some of the data which "should be" semi-permanent.

So has anyone looked into something like this? This was the closest thing I found: http://issues.apache.org/jira/browse/HADOOP-288 My goal here is to keep recent data in memory so that tools like Hive can get a big boost on queries for new data. Does anyone have any ideas?
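
To make the idea a bit more concrete, here is a rough sketch of what the per-node cache could look like. It is written in Python only for brevity and is not based on any existing Hadoop API: the names BlockCache and read_block, the LRU eviction policy, and the read_from_disk callback are all assumptions for illustration. It relies on the point above that blocks are static, so an entry never needs to be invalidated, only evicted when the memory budget is exceeded.

```python
from collections import OrderedDict


class BlockCache:
    """Hypothetical per-DataNode block cache with a fixed byte budget.

    Least-recently-used blocks are evicted first. Blocks are assumed to be
    immutable, so there is no invalidation path, only eviction.
    """

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._blocks = OrderedDict()  # block_id -> bytes, oldest first

    def get(self, block_id):
        data = self._blocks.get(block_id)
        if data is None:
            return None
        # Mark the block as most recently used.
        self._blocks.move_to_end(block_id)
        return data

    def put(self, block_id, data):
        if len(data) > self.capacity_bytes:
            return False  # block can never fit in this cache
        if block_id in self._blocks:
            # Same id implies same bytes (immutable); just refresh recency.
            self._blocks.move_to_end(block_id)
            return True
        # Evict least-recently-used blocks until the new one fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._blocks.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._blocks[block_id] = data
        self.used_bytes += len(data)
        return True


def read_block(cache, block_id, read_from_disk):
    """Read-through facade: serve from memory, fall back to disk on a miss."""
    data = cache.get(block_id)
    if data is None:
        data = read_from_disk(block_id)
        cache.put(block_id, data)
    return data
```

A cachedhdfs facade or CachedInputFormat would presumably wrap the normal read path in something like read_block, with capacity_bytes set to the 1 GB per DataNode mentioned above. A plain LRU policy is only one option; it would not by itself stop large one-off scans from pushing out recent data, which is the same eviction problem I described for the OS cache.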
