Re: short-circuiting HDFS reads

Chris K Wensel Thu, 08 Jan 2009 10:26:18 -0800

Hey George

Any comments on the probability (currently) that reads by a Task are over the network vs. being "local", as seen in your tests? That is, are 10% of block reads over the network, or 90% of reads?

I haven't looked, but am wondering if this metric is stuffed somewhere by Hadoop...


ckw


On Jan 8, 2009, at 10:13 AM, George Porter wrote:

Hi Jun,
The earlier responses to your email reference the JIRA that I opened about this issue. Short-circuiting the primary HDFS datapath does improve throughput, and the amount depends on your workload (random reads especially). Some initial experimental results are posted to that JIRA. A second advantage is that since the JVM hosting the HDFS client is doing the reading, the O/S will satisfy future disk requests from the cache, which isn't really possible when you read over the network (even to another JVM on the same host).
There are several real disadvantages, the largest of which are that 1) it adds a new datapath, and 2) it bypasses various security and auditing features of HDFS. I would certainly like to think through a cleaner interface for achieving this goal, especially since reading local data should be the common case. Any thoughts you might have would be appreciated.
Thanks,
George

Jun Rao wrote:
Hi,
Today, HDFS always reads through a socket even when the data is local to the client. This adds a lot of overhead, especially for warm reads. It should be possible for a dfs client to test if a block to be read is local and if so, bypass socket and read through local FS api directly. This should improve random access performance significantly (e.g., for HBase).
Has this been considered in HDFS? Thanks,

Jun
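
For readers of the archive, a minimal sketch of the check Jun describes: test whether a replica of the block lives on the client's host and, if so, read the block file through the local filesystem instead of a socket. This is not HDFS code. The class LocalReadSketch, the RemoteReader interface, and the idea that the client already knows the local path of the block file are all assumptions made for illustration, and the sketch ignores the security and auditing concerns George raises.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class LocalReadSketch {

    /** Hypothetical stand-in for the normal socket-based datapath. */
    interface RemoteReader {
        byte[] read(String host, long offset, int length) throws IOException;
    }

    /** True if any replica host is this machine (simple name comparison). */
    static boolean isLocal(List<String> replicaHosts, String localHost) {
        for (String host : replicaHosts) {
            if (host.equalsIgnoreCase(localHost)) {
                return true;
            }
        }
        return false;
    }

    /** Reads length bytes at offset straight from the local block file. */
    static byte[] readLocal(Path blockFile, long offset, int length) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(blockFile.toFile(), "r")) {
            raf.seek(offset);
            byte[] buf = new byte[length];
            raf.readFully(buf); // throws EOFException if the file is too short
            return buf;
        }
    }

    /**
     * Short-circuits to the local file when a replica is local and readable;
     * otherwise falls back to the remote reader on the first replica host.
     */
    static byte[] read(List<String> replicaHosts, String localHost, Path localBlockFile,
                       long offset, int length, RemoteReader remote) throws IOException {
        if (isLocal(replicaHosts, localHost)
                && localBlockFile != null
                && Files.isReadable(localBlockFile)) {
            return readLocal(localBlockFile, offset, length);
        }
        return remote.read(replicaHosts.get(0), offset, length);
    }

    public static void main(String[] args) throws IOException {
        // Fake "block file" on the local disk, for demonstration only.
        Path block = Files.createTempFile("blk_", ".data");
        Files.write(block, "hello short-circuit".getBytes("UTF-8"));

        RemoteReader remote = (host, off, len) -> {
            throw new IOException("remote read not expected in this demo");
        };
        byte[] data = read(Arrays.asList("node-a", "node-b"), "node-a", block, 6, 5, remote);
        System.out.println(new String(data, "UTF-8"));
        Files.delete(block);
    }
}
```

The fallback path is kept behind an interface so the local read stays an optimization: if the file is missing or unreadable, the client behaves exactly as it does today.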


--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/
