Hey George

Any comments on how likely it currently is that a Task's reads go over the network rather than being "local", as seen in your tests? That is, are 10% of block reads over the network, or 90%?

I haven't looked, but am wondering if this metric is recorded somewhere by Hadoop...

ckw


On Jan 8, 2009, at 10:13 AM, George Porter wrote:

Hi Jun,

The earlier responses to your email reference the JIRA that I opened about this issue. Short-circuiting the primary HDFS datapath does improve throughput, and the amount depends on your workload (random reads especially). Some initial experimental results are posted to that JIRA. A second advantage is that since the JVM hosting the HDFS client is doing the reading, the O/S will satisfy future disk requests from the cache, which isn't really possible when you read over the network (even to another JVM on the same host).

There are several real disadvantages, the largest of which are that 1) it adds a new datapath, and 2) it bypasses various security and auditing features of HDFS. I would certainly like to think through a cleaner interface for achieving this goal, especially since reading local data should be the common case. Any thoughts you might have would be appreciated.

Thanks,
George

Jun Rao wrote:
Hi,

Today, HDFS always reads through a socket even when the data is local to the client. This adds a lot of overhead, especially for warm reads. It should be possible for a DFS client to test whether a block to be read is local and, if so, bypass the socket and read through the local FS API directly. This should improve random access performance significantly (e.g., for HBase). Has this been considered in HDFS? Thanks,
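
The check described above can be sketched roughly as follows. This is only an illustration, not HDFS code: `is_local`, `read_block`, the `block_hosts` list, the `local_path` argument, and the `remote_read` callable are all hypothetical names, and the sketch assumes the client can somehow learn both the hosts holding a block's replicas and the on-disk path of the local replica. Note that it skips the security and auditing checks that the normal datapath performs.

```python
import os
import socket


def is_local(block_hosts, local_host=None):
    """Return True if one replica of the block is on this host (hypothetical helper)."""
    local_host = local_host or socket.gethostname()
    return local_host in block_hosts


def read_block(block_hosts, local_path, offset, length, remote_read, local_host=None):
    """Read `length` bytes at `offset`, preferring a direct local file read.

    `remote_read(offset, length)` stands in for the usual socket-based
    datapath; it is an assumption of this sketch, not an HDFS API.
    """
    if is_local(block_hosts, local_host) and os.path.exists(local_path):
        # Short-circuit: read the block file directly, no socket involved.
        with open(local_path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    # Fall back to the normal network datapath.
    return remote_read(offset, length)
```

A caller would pass the replica hosts reported for the block and fall back to the socket path whenever the block is not on the local machine or the file cannot be found.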

Jun



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/
