On Wed, Oct 7, 2009 at 7:45 AM, Edward Capriolo <[email protected]> wrote:
> Todd,
>
> I do think it could be an inherent problem. With all the reading and
> writing of intermediate data Hadoop does, the file system cache would
> likely never contain the initial raw data you want to work with. The
> HBase RegionServer seems to be successful, so there must be some place
> for caching.
>
> Once I get something in HDFS, like the last hour's log data, about 40
> different processes are going to repeatedly re-read it from disk. I
> think if I can force that data into a cache I can get much faster
> processing.

In cases like this, we should expose access-type hints like posix_fadvise
POSIX_FADV_DONTNEED for the data we don't want to end up in the cache.
There's already a JIRA out there for a JNI library for platform-specific
optimizations, and I think this is one that will be worth doing.

-Todd
