[ 
https://issues.apache.org/jira/browse/HDFS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571234#comment-14571234
 ] 

Ahmed Mahran commented on HDFS-8416:
------------------------------------

Frankly, I find this a hard question to answer. HDFS is first and foremost a 
distributed file system, while shared storage is a centralized storage system; 
that is where the apparent contradiction comes from. However, I see the shared 
storage only as a storage medium that is separate from the filesystem, so 
centralizing the storage medium does not centralize the filesystem. HDFS would 
still serve as a distributed file system for Hadoop. The filesystem’s metadata 
is still handled in a distributed manner, enabling efficient access to multiple 
pieces of data on the storage medium (as opposed to a centralized filesystem). 
In addition, the data locality principle would still hold: code is moved to the 
datanode that holds the data, which enables better scheduling.

One might also think of shared storage as a kind of archival storage, which is 
one of the heterogeneous storage types that HDFS supports.

Moreover, enterprise shared storage providers are entering the Hadoop space in 
collaboration with enterprise Hadoop distribution providers, for example Nimble 
with Hortonworks and EMC2 (Isilon) with Cloudera. These kinds of collaborations 
point toward deploying HDFS on enterprise-level shared storage, combining the 
benefits of both.

> Short circuit remote reads from shared storage
> ----------------------------------------------
>
>                 Key: HDFS-8416
>                 URL: https://issues.apache.org/jira/browse/HDFS-8416
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, hdfs-client, nfs, performance
>            Reporter: Ahmed Mahran
>
> In a Hadoop cluster configuration that employs a shared storage system, HDFS 
> read and write operations are very expensive in terms of network bandwidth 
> consumption.
> For a DFS client to read a block from a remote datanode, the block is 
> transmitted first from the shared storage to the datanode, then from the 
> datanode to the DFS client. Short-circuiting the shared-storage-to-datanode 
> hop and allowing the client to access the shared storage directly would 
> improve the performance substantially.
> The following blog post describes the issue and provides a hack for the 
> remote read:
> http://www.badrit.com/blog/2015/3/20/hdfs-short-circuit-shared-storage-remote-read-hacking-the-hdfs-short-circuit-local-read-for-short-circuiting-remote-reads-from-a-shared-storage
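
The read path described in the issue can be illustrated with a rough count of 
the bytes that cross the network for one block. This is only a sketch, not 
HDFS code: the function name and the 128 MiB block size are assumptions made 
for the example, and it ignores protocol overhead, caching and replication.

```python
# Hypothetical model of the two read paths described in the issue.
# Assumption: a block of 128 MiB; real clusters may configure another size.
BLOCK_SIZE = 128 * 1024 * 1024


def network_bytes(block_size: int, short_circuit: bool) -> int:
    """Bytes moved over the network to deliver one block to a remote client."""
    # Without short circuit, the block first travels from the shared storage
    # to the datanode that serves it.
    storage_to_datanode = 0 if short_circuit else block_size
    # The block always has to reach the client once: from the datanode in the
    # normal path, or straight from the shared storage when short-circuited.
    delivery_to_client = block_size
    return storage_to_datanode + delivery_to_client


if __name__ == "__main__":
    print("via datanode: ", network_bytes(BLOCK_SIZE, short_circuit=False))
    print("short circuit:", network_bytes(BLOCK_SIZE, short_circuit=True))
```

Under these assumptions the normal remote read moves each block twice over the 
network, while the short-circuited read moves it once.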



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
