On Fri, Oct 11, 2013 at 11:12 AM, Ravikumar Govindarajan <[email protected]> wrote:
> I came across this interesting JIRA in hadoop
> https://issues.apache.org/jira/browse/HDFS-385
>
> In essence, it allows us more control over where blocks are placed.
>
> A BlockPlacementPolicy to optimistically write all blocks of a given
> index-file into the same set of datanodes could be highly helpful.
>
> Co-locating shard-servers and datanodes along with short-circuit reads
> should improve things greatly. We can always utilize the file-system
> cache for local files. In case a given file is not served locally, then
> shard-servers can use the block-cache.
>
> Do you see some positives in such an approach?
>
> --
> Ravi

I'm sure that there will be some performance improvements by using local
reads for accessing HDFS. The areas where I would expect the biggest
increases in performance would be merging and fetching data for retrieval.
Even with access to local drives, the search time would likely not improve,
assuming the index is hot. One of the reasons for doing short-circuit reads
is to make use of the OS file system cache, and since Blur already uses a
filesystem cache it might just be duplicating functionality. Another big
reason for short-circuit reads is to reduce network traffic.

Consider the scenario where another server has gone down. If the shard
server is in an opening state, we would have to change the Blur layout
system to fail over only to the server that contains the data for the
index. This might have some problems, because the layout of the shards is
based on how many shard servers are online, not on where the data is
located. So in a perfect world, if the shard fails over to the correct
server, it would reduce the amount of network traffic (to near 0 if it was
perfect). In short it could work, but it might be very hard.

Assuming for a moment that the system is not dealing with a failure, and
that the shard server is also running a datanode: the first replica in HDFS
is written to the local datanode, so if we just configure Hadoop for
short-circuit reads we would already get the benefits for data fetching and
merging.

I had another idea that I wanted to run by you. What if, instead of
actively placing blocks during writes with a Hadoop block placement policy,
we write something like the Hadoop balancer? We get Hadoop to move the
blocks to the shard server (datanode) that is hosting the shard. That way
it would be asynchronous, and after a failure / shard movement it would
relocate the blocks and we could still use short-circuit reads. Kind of
like the blocks following the shard around, instead of deciding up front
where the data for the shard should be located. (A rough sketch of the kind
of check such a tool might start with is at the end of this mail.) If this
is what you were thinking, I'm sorry for not understanding your suggestion.

Aaron
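
To make the "blocks follow the shard" idea a little more concrete, here is
a minimal sketch of the detection half of such a tool. It only reports
which blocks of a shard's files have no replica on the local datanode; the
shard path and class name below are made up for the example, and actually
moving replicas is left out.

import java.net.InetAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShardBlockLocalityCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical shard directory, passed in or hard coded for the example.
    Path shardDir = new Path(args.length > 0 ? args[0]
        : "/blur/tables/table1/shard-00000000");
    String localHost = InetAddress.getLocalHost().getHostName();

    FileSystem fileSystem = FileSystem.get(new Configuration());
    for (FileStatus file : fileSystem.listStatus(shardDir)) {
      if (file.isDir()) {
        continue;
      }
      BlockLocation[] locations =
          fileSystem.getFileBlockLocations(file, 0, file.getLen());
      for (BlockLocation location : locations) {
        boolean local = false;
        for (String host : location.getHosts()) {
          if (host.equals(localHost)) {
            local = true;
            break;
          }
        }
        if (!local) {
          // A balancer-like mover would ask HDFS to copy one replica of
          // this block to the local datanode and drop a remote one.
          System.out.println(file.getPath() + " [" + location.getOffset()
              + "," + location.getLength() + ") has no local replica");
        }
      }
    }
  }
}

Actually moving a replica is the hard part; as far as I understand, the
balancer does it by copying the block between datanodes and letting the
namenode delete the over-replicated copy, and a shard-aware mover would
presumably have to do something similar.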
