If your workload is very ad hoc, the more bandwidth the merrier!

James


On 2011-06-28, at 5:03 PM, Matei Zaharia <[email protected]> wrote:

> Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile 
> your target Hadoop workload and see whether it's communication-bound. Hadoop 
> jobs can definitely be communication-bound if you shuffle a lot of data 
> between map and reduce, but I've also seen a lot of clusters that are 
> CPU-bound (due to decompression, running python, or just running expensive 
> user code) or disk-IO-bound. You might be surprised at what your bottleneck 
> is.
>
> Matei
>
> On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:
>
>> Matt,
>> Thanks, this is helpful. I was wondering whether you have any thoughts
>> on the list of other potential benefits of 10GbE NICs for Hadoop
>> (listed in my original e-mail to the list)?
>>
>> regards,
>> Saqib
>>
>> -----Original Message-----
>> From: Matthew Foley [mailto:[email protected]]
>> Sent: Tuesday, June 28, 2011 12:04 PM
>> To: [email protected]
>> Cc: Matthew Foley
>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>
>> Hadoop common provides an abstract FileSystem class, and Hadoop applications
>> should be designed to run on that.  HDFS is just one implementation of a
>> valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
>> LocalFileSystem are provided in Hadoop common.  Use of NFS-mounted storage
>> would fall under the LocalFileSystem model.
>>
>> However, one of the core values of Hadoop is the model of "bring the
>> computation to the data".  This does not seem viable with an NFS-based
>> NAS-model storage subsystem.  Thus, while it will "work" for small clusters
>> and small jobs, it is unlikely to scale with high performance to thousands
>> of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
>>
>> --Matt
>>
>>
>> On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
>>
>> I see. However, Hadoop is designed to operate best with HDFS because of its
>> inherent striping and blocking strategy - which is tracked by Hadoop.
>> Going outside of that mechanism will probably yield poor results and/or
>> confuse Hadoop.
>>
>> Just my thoughts.
>>
>> On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
>>> Darren,
>>> Thanks. The last point was about 10GbE potentially allowing the use
>>> of a network file system (e.g. via NFS) as an alternative to HDFS;
>>> the question is whether there is any merit in this. Basically, I was
>>> exploring whether commercial clustered NAS products offer any
>>> high-availability or data management benefits for use with Hadoop.
>>>
>>> Saqib
>>>
>>> -----Original Message-----
>>> From: Darren Govoni [mailto:[email protected]]
>>> Sent: Tuesday, June 28, 2011 10:21 AM
>>> To: [email protected]
>>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>>
>>> Hadoop, like other parallel networked computation architectures, is
>>> predominantly I/O bound. This means any increase in network bandwidth
>>> is "A Good Thing" and can have drastic positive effects on
>>> performance. All your points stem from this simple realization.
>>>
>>> I'm confused by your #6, though. Hadoop already uses a distributed
>>> file system: HDFS.
>>>
>>> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
>>>> Folks,
>>>>
>>>> I've been digging into the potential benefits of using 10 Gigabit
>>>> Ethernet (10GbE) NIC server connections for Hadoop and wanted to run
>>>> what I've come up with through initial research by the list for
>>>> 'sanity check' feedback. I'd very much appreciate your input on the
>>>> importance (or lack of it) of the following potential benefits of
>>>> 10GbE server connectivity as well as other thoughts regarding 10GbE
>>>> and Hadoop (My interest is specifically in the value of 10GbE server
>>>> connections and 10GbE switching infrastructure, over scenarios such
>>>> as bonded 1GbE server connections with 10GbE switching).
>>>>
>>>> 1. HDFS Data Loading. The higher throughput enabled by 10GbE server
>>>>    and switching infrastructure allows faster processing and
>>>>    distribution of data.
>>>>
>>>> 2. Hadoop Cluster Scalability. High performance for initial data
>>>>    processing and distribution directly impacts the degree of
>>>>    parallelism or scalability supported by the cluster.
>>>>
>>>> 3. HDFS Replication. Higher-speed server connections allow faster
>>>>    file replication.
>>>>
>>>> 4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and
>>>>    latency directly impact the shuffle phase of a data set
>>>>    reduction, especially for tasks at the document level (including
>>>>    large documents) that generate lots of metadata, as well as for
>>>>    video analytics and images.
>>>>
>>>> 5. Data Reporting. 10GbE server network performance can improve data
>>>>    reporting performance, especially if the Hadoop cluster is
>>>>    running multiple data reductions.
>>>>
>>>> 6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be
>>>>    reorganized to use a cluster or network file system. This would
>>>>    allow Hadoop, even with its Java implementation, to have
>>>>    higher-performance I/O and not have to be so concerned with disk
>>>>    drive density in the same server.
>>>>
>>>> 7. Others?
>>>>
>>>> thanks,
>>>> Saqib
>>>>
>>>> Saqib Jang
>>>> Principal/Founder
>>>> Margalla Communications, Inc.
>>>> 1339 Portola Road, Woodside, CA 94062
>>>> (650) 274 8745
>>>> www.margallacomm.com
>>>>
>>>
>>
>>
>>
>
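
The thread's advice to check whether a workload is communication-bound
or disk-IO-bound before buying 10GbE NICs can be roughed out with a
back-of-envelope calculation. The sketch below is only an illustration,
not a substitute for profiling a real job: the function names are
hypothetical, and the 90% link efficiency, 100 MB/s per-disk rate, and
4 disks per node are assumed figures, not numbers from this discussion.
The only hard fact used is that 1 Gbit/s equals 125 MB/s of raw
bandwidth.

```python
# Back-of-envelope comparison of per-node network vs. disk throughput.
# All default figures are illustrative assumptions, not measurements.


def link_mb_per_s(gigabits_per_s, efficiency=0.9):
    """Usable rate of a link in MB/s (1 Gbit/s = 125 MB/s raw).

    `efficiency` is an assumed allowance for protocol overhead.
    """
    return gigabits_per_s * 125.0 * efficiency


def transfer_seconds(data_mb, rate_mb_per_s):
    """Time to move `data_mb` megabytes at a sustained rate."""
    return data_mb / rate_mb_per_s


def bottleneck(nic_gbps, disks, disk_mb_per_s=100.0, efficiency=0.9):
    """Return (limiting resource, network MB/s, aggregate disk MB/s).

    `disk_mb_per_s` is an assumed sequential rate per drive.
    """
    net = link_mb_per_s(nic_gbps, efficiency)
    disk = disks * disk_mb_per_s
    limit = "network" if net < disk else "disk"
    return limit, net, disk


if __name__ == "__main__":
    shuffle_mb = 100_000  # hypothetical 100 GB shuffled through one node
    # 2 Gbit/s stands in for an idealized bonded pair of 1GbE links.
    for gbps in (1, 2, 10):
        limit, net, disk = bottleneck(gbps, disks=4)
        secs = transfer_seconds(shuffle_mb, min(net, disk))
        print(f"{gbps:>2} Gbit/s: net={net:.0f} MB/s, disk={disk:.0f} MB/s, "
              f"limit={limit}, ~{secs:.0f} s for {shuffle_mb} MB")
```

Under these assumed figures a single 1GbE link is the limiting resource
for a node with four disks, while with 10GbE the disks become the limit.
A CPU-bound job, as noted earlier in the thread, would not show up in
this model at all, which is why measuring the actual workload matters.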
