[ https://issues.apache.org/jira/browse/HDFS-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14571234#comment-14571234 ]
Ahmed Mahran commented on HDFS-8416:
------------------------------------
Frankly, I find this a hard question to answer. HDFS is first and foremost a
distributed file system, while shared storage is a centralized storage system,
and that is where the apparent contradiction comes from. However, I see the
shared storage as only a storage medium that is separate from the filesystem.
Centralizing the storage medium is therefore not the same as centralizing the
filesystem, so HDFS would still serve as a distributed file system for Hadoop.
The filesystem’s metadata is still handled in a distributed manner, enabling
efficient access to multiple pieces of data on the storage medium (as opposed
to a centralized filesystem). In addition, the data locality principle would
still hold: code is moved to the datanode that holds the data, which enables
better scheduling.
Shared storage can also be thought of as a kind of archival storage, which is
one of the heterogeneous storage types that HDFS supports.
Moreover, enterprise shared storage providers are entering the Hadoop space in
collaboration with enterprise Hadoop distribution providers, for example
Nimble with Hortonworks and EMC2 (Isilon) with Cloudera. These kinds of
collaborations point toward deploying HDFS on enterprise-level shared storage,
combining the benefits of both.
> Short circuit remote reads from shared storage
> ----------------------------------------------
>
> Key: HDFS-8416
> URL: https://issues.apache.org/jira/browse/HDFS-8416
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode, hdfs-client, nfs, performance
> Reporter: Ahmed Mahran
>
> In a Hadoop cluster configuration that employs a shared storage system, HDFS
> read and write operations are very expensive in terms of network bandwidth
> consumption.
> For a DFS client to read a block from a remote datanode, the block is
> transmitted first from the shared storage to the datanode, then from the
> datanode to the DFS client. Short-circuiting the shared-storage-to-datanode
> hop and allowing the client to access the shared storage directly would
> improve performance substantially.
> This blog post describes the issue and provides a hack for the remote read.
> http://www.badrit.com/blog/2015/3/20/hdfs-short-circuit-shared-storage-remote-read-hacking-the-hdfs-short-circuit-local-read-for-short-circuiting-remote-reads-from-a-shared-storage
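
To make the cost of the extra hop in the description above concrete, here is a
rough back-of-the-envelope sketch. It assumes the shared storage is reached
over the network (so the storage-to-datanode hop counts as network traffic)
and ignores protocol overhead. The function name and the 128 MiB block size
are illustrative assumptions, not taken from the issue or the blog post.

```python
def network_bytes(block_bytes: int, short_circuit: bool) -> int:
    """Approximate bytes moved over the network to serve one remote block read."""
    # Without short circuiting there are two hops:
    #   shared storage -> datanode, then datanode -> DFS client.
    # With short circuiting the client reads from the shared storage directly,
    # so only one hop remains: shared storage -> DFS client.
    hops = 1 if short_circuit else 2
    return hops * block_bytes


MIB = 1024 * 1024
BLOCK = 128 * MIB  # assumed block size, chosen only for illustration

via_datanode = network_bytes(BLOCK, short_circuit=False)
direct = network_bytes(BLOCK, short_circuit=True)

print(f"via datanode: {via_datanode // MIB} MiB, direct: {direct // MIB} MiB")
```

Under these assumptions, a remote read through the datanode moves each block
across the network twice, so reading directly from the shared storage roughly
halves the traffic for that read.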