Given that HBase has its own cache (block cache and bloom filters) and that
all the table data is stored in HDFS, I'm wondering whether HBase benefits from
the OS page cache at all. In my setup, the HBase Region Servers run on the same
boxes as the HDFS data nodes. In such a scenario, if the underlying HLog files
live on the same machine, then having a healthy memory surplus may mean that
the data node can serve the underlying files from the page cache, thus
improving HBase performance. Is this really the case? (I guess the page cache
should also help when the HLog file lives on a different machine, but in that
case network I/O will probably drown out the speedup gained from not hitting
the disk.)

I'm asking because if the page cache is useful, then in an HBase setup, not
giving all the memory on the machine to the region server may not be that
bad. The reason one would not want to use all the memory for the region server
is the long garbage collection pauses that a large heap may induce. I
understand that work has been done to fix the long pauses caused by memory
fragmentation in the old generation with the mostly concurrent garbage
collector, by using a slab cache allocator for the memstore, but that feature
is marked experimental and we're not ready to take risks yet. So if the page
cache is useful in any way on Region Servers, we could give less memory to the
RegionServer process, with the understanding that the free memory on the
machine is not going completely to waste. Hence my curiosity about the
usefulness of the OS page cache to HBase performance.

Thanks in advance,
Pankaj
