Hey George
Any comments on the probability (currently) that reads by a Task are
over the network vs. being "local", as seen in your tests? That is,
are 10% of block reads over the network, or 90% of reads?
I haven't looked, but I'm wondering whether this metric is recorded
somewhere by Hadoop...
ckw
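For illustration, if each block read were logged together with the host
that ran the Task and the host that served the block, the local vs.
network split asked about above could be estimated roughly as follows.
This is only a sketch: the (task_host, serving_host) record format and
the function name are assumptions made here, not something Hadoop is
known to expose.

```python
def local_read_fraction(reads):
    """Return the fraction of block reads served from the Task's own host.

    `reads` is an iterable of (task_host, serving_host) pairs, one per
    block read. This record format is hypothetical.
    """
    reads = list(reads)
    if not reads:
        return None  # no reads observed, so no meaningful fraction
    local = sum(1 for task_host, serving_host in reads
                if task_host == serving_host)
    return local / len(reads)
```

A result near 0.9 would mean most reads are already local, so a local
fast path would apply to most of them; a result near 0.1 would mean it
rarely applies.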
On Jan 8, 2009, at 10:13 AM, George Porter wrote:
Hi Jun,
The earlier responses to your email reference the JIRA that I opened
about this issue. Short-circuiting the primary HDFS datapath does
improve throughput, and the amount depends on your workload (random
reads especially). Some initial experimental results are posted to
that JIRA. A second advantage is that since the JVM hosting the
HDFS client is doing the reading, the O/S will satisfy future disk
requests from the cache, which isn't really possible when you read
over the network (even to another JVM on the same host).
There are several real disadvantages, the largest being that it
1) adds a new datapath and 2) bypasses various security and
auditing features of HDFS. I would certainly like to think through
a cleaner interface for achieving this goal, especially since
reading local data should be the common case. Any thoughts you
might have would be appreciated.
Thanks,
George
Jun Rao wrote:
Hi,
Today, HDFS always reads through a socket even when the data is local to
the client. This adds a lot of overhead, especially for warm reads. It
should be possible for a DFS client to test whether a block to be read is
local and, if so, bypass the socket and read through the local FS API
directly. This should improve random access performance significantly
(e.g., for HBase).
Has this been considered in HDFS? Thanks,
Jun
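The idea in Jun's message (check whether a replica of the block is on
the client's own host and, if so, read the file directly instead of
going through a socket) can be sketched as below. This is a minimal
illustration under stated assumptions, not the HDFS client code: every
name here (read_block, replica_hosts, local_path, remote_read) is
hypothetical, and the real client would have to learn the block's
on-disk path somehow.

```python
import os
import socket


def read_block(replica_hosts, local_path, offset, length, remote_read,
               local_host=None):
    """Read `length` bytes of a block, preferring a replica on this host.

    replica_hosts: hostnames holding a replica of the block (hypothetical).
    local_path:    where the block file would live locally (hypothetical).
    remote_read:   callable(offset, length) -> bytes, standing in for the
                   normal socket-based datapath.
    Returns (data, source), where source is "local" or "remote".
    """
    local_host = local_host or socket.gethostname()
    if (local_host in replica_hosts and local_path
            and os.path.exists(local_path)):
        # Short-circuit path: plain file I/O, no socket involved.
        with open(local_path, "rb") as f:
            f.seek(offset)
            return f.read(length), "local"
    # Fallback: the usual datapath through the datanode.
    return remote_read(offset, length), "remote"
```

Note that the local branch is exactly the "new datapath" George
mentions: it skips whatever checks the remote path performs, which is
why security and auditing would need separate thought.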
--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/