Hi,

The main reason Hadoop scales so well is that all of its components try to
adhere to the principle of data locality.
In general this means that you run the processing/query software on the
same node where the actual data is already present on the local disk.
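
To make that concrete, here is a minimal sketch (the class name and the
idea of probing a single path are mine; the calls are the standard
org.apache.hadoop.fs API) of the locality information a scheduler relies on:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Sketch: print where each block of a file physically lives.
  public class LocalityProbe {
    public static void main(String[] args) throws Exception {
      Path path = new Path(args[0]);
      FileSystem fs = path.getFileSystem(new Configuration());
      FileStatus stat = fs.getFileStatus(path);
      // On HDFS every BlockLocation names the DataNodes that hold a
      // replica, which is what lets the scheduler run a task next to
      // its data. A FileSystem backed by a remote NFS share has no
      // meaningful placement to report, so every read crosses the network.
      for (BlockLocation loc : fs.getFileBlockLocations(stat, 0, stat.getLen())) {
        System.out.printf("offset=%d len=%d hosts=%s%n",
            loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
      }
    }
  }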

To me this NFS solution sounds like hooking the processing nodes up to a
shared storage system.
This may work for small clusters (say 5 nodes or so), but for large clusters
this shared storage will be the main bottleneck for processing/query
speed.

We currently have more than 20 nodes with 12 hard disks each, giving us
over 50 GB/sec [1] of disk-to-query-engine bandwidth. That means our setup
already goes much faster than any network connection to any NFS solution
can provide. We can simply grow to, say, 50 nodes and easily exceed
100 GB/sec.
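
To make the back-of-the-envelope math explicit (the ~200 MB/sec per-disk
figure is from [1]; the 2x10GbE uplink for the filer is my own assumption,
purely for illustration):

   5 nodes:   5 * 12 * 200 MB/sec =  12 GB/sec of local disk bandwidth
  20 nodes:  20 * 12 * 200 MB/sec =  48 GB/sec
  50 nodes:  50 * 12 * 200 MB/sec = 120 GB/sec
  one NFS filer on 2x10GbE:       ~ 2.5 GB/sec, however many nodes you add

The local-disk side grows linearly with the number of nodes; the filer's
network link does not.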

So to me this sounds like hooking a scalable processing platform to a
non-scalable storage system (mainly because the network to this storage
doesn't scale).

So far I have only seen vendors of legacy storage solutions going in this
direction ... oh wait ... you are NetApp ... that explains it.

I am not a committer on any of the Hadoop tools, but I vote against having
such a "core concept breaking" piece in the main codebase. Newcomers might
start to think doing this is a good idea.

So I say you should simply make this plugin available to your customers,
just not as a core part of Hadoop.

Niels Basjes

[1]  20 nodes * 12 disks * 200 MB/sec = 48 GB/sec, approx. 50 GB/sec.
      This page shows maximum read speeds in the 200 MB/sec range:

http://www.tomshardware.com/charts/enterprise-hdd-charts/-02-Read-Throughput-Maximum-h2benchw-3.16,3372.html


On Tue, Jan 13, 2015 at 10:35 PM, Gokul Soundararajan <
gokulsoun...@gmail.com> wrote:

> Hi,
>
> We (Jingxin Feng, Xing Lin, and I) have been working on providing a
> FileSystem implementation that allows Hadoop to utilize an NFSv3 storage
> server as a filesystem. It leverages code from the hadoop-nfs project for
> all the request/response handling. We would like your help to add it as
> part of hadoop tools (similar to the way hadoop-aws and hadoop-azure are
> included).
>
> In more detail, the Hadoop NFS Connector allows Apache Hadoop (2.2+) and
> Apache Spark (1.2+) to use an NFSv3 storage server as a storage endpoint.
> The NFS Connector can be run in two modes: (1) secondary filesystem, where
> Hadoop/Spark runs using HDFS as its primary storage and can use NFS as a
> secondary storage endpoint, and (2) primary filesystem, where Hadoop/Spark
> runs entirely on an NFSv3 storage server.
>
> The code is written in such a way that existing applications do not have
> to change. All one has to do is copy the connector jar into the lib/
> directory of Hadoop/Spark and then modify core-site.xml to provide the
> necessary details.
>
> The current version can be seen at:
> https://github.com/NetApp/NetApp-Hadoop-NFS-Connector
>
> This is my first time contributing to the Hadoop codebase. It would be
> great if someone on the Hadoop team could guide us through this process.
> I'm willing to make the necessary changes to integrate the code. What are
> the next steps? Should I create a JIRA entry?
>
> Thanks,
>
> Gokul
>



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
