On Fri, Oct 11, 2013 at 11:12 AM, Ravikumar Govindarajan <[email protected]> wrote:
> I came across this interesting JIRA in hadoop
> https://issues.apache.org/jira/browse/HDFS-385
>
> In essence, it allows us more control over where blocks are placed.
>
> A BlockPlacementPolicy to optimistically write all blocks of a given
> index-file into the same set of datanodes could be highly helpful.
>
> Co-locating shard-servers and datanodes along with short-circuit reads
> should improve things greatly. We can always utilize the file-system
> cache for local files. In case a given file is not served locally, then
> shard-servers can use the block-cache.
>
> Do you see some positives in such an approach?
>
> --
> Ravi

I'm sure that there will be some performance improvements by using local
reads for accessing HDFS. The areas where I would expect the biggest
increases in performance would be merging and fetching data for retrieval.
Even with access to local drives, the search time would likely not improve,
assuming the index is hot. One of the reasons for doing short-circuit reads
is to make use of the OS file system cache, and since Blur already uses a
filesystem cache it might just be duplicating functionality. Another big
reason for short-circuit reads is to reduce network traffic.

Consider the scenario where another server has gone down. If the shard
server is in an opening state, we would have to change the Blur layout
system to fail over only to the server that contains the data for the
index. This might have some problems, because the layout of the shards is
based on how many shard servers are online, not on where the data is
located. So in a perfect world, if the shard fails over to the correct
server, it would reduce the amount of network traffic (to near 0 if it was
perfect). In short it could work, but it might be very hard.

Assuming for a moment that the system is not dealing with a failure, and
that the shard server is also running a datanode: the first replica in HDFS
is written to the local datanode, so if we just configure Hadoop for
short-circuit reads we would already get the benefits for data fetching and
merging.

I had another idea that I wanted to run by you. What if, instead of
actively placing blocks during writes with a Hadoop block placement policy,
we write something like the Hadoop balancer? We get Hadoop to move the
blocks to the shard server (datanode) that is hosting the shard. That way
it would be asynchronous, and after a failure / shard movement it would
relocate the blocks and we could still use short-circuit reads. Kind of
like the blocks following the shard around, instead of deciding up front
where the data for the shard should be located. (A rough sketch of the kind
of check such a tool might start with is at the end of this mail.) If this
is what you were thinking, I'm sorry for not understanding your suggestion.

Aaron
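
To make the "blocks follow the shard" idea a little more concrete, here is
a minimal sketch of the detection half of such a tool. It only reports
which blocks of a shard's files have no replica on the local datanode; the
shard path and class name below are made up for the example, and actually
moving replicas is left out.

import java.net.InetAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShardBlockLocalityCheck {
  public static void main(String[] args) throws Exception {
    // Hypothetical shard directory, passed in or hard coded for the example.
    Path shardDir = new Path(args.length > 0 ? args[0]
        : "/blur/tables/table1/shard-00000000");
    String localHost = InetAddress.getLocalHost().getHostName();

    FileSystem fileSystem = FileSystem.get(new Configuration());
    for (FileStatus file : fileSystem.listStatus(shardDir)) {
      if (file.isDir()) {
        continue;
      }
      BlockLocation[] locations =
          fileSystem.getFileBlockLocations(file, 0, file.getLen());
      for (BlockLocation location : locations) {
        boolean local = false;
        for (String host : location.getHosts()) {
          if (host.equals(localHost)) {
            local = true;
            break;
          }
        }
        if (!local) {
          // A balancer-like mover would ask HDFS to copy one replica of
          // this block to the local datanode and drop a remote one.
          System.out.println(file.getPath() + " [" + location.getOffset()
              + "," + location.getLength() + ") has no local replica");
        }
      }
    }
  }
}

Actually moving a replica is the hard part; as far as I understand, the
balancer does it by copying the block between datanodes and letting the
namenode delete the over-replicated copy, and a shard-aware mover would
presumably have to do something similar.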
