If you are very ad hoc-y, the more bandwidth the merrier!

James
Sent from my mobile. Please excuse the typos.

On 2011-06-28, at 5:03 PM, Matei Zaharia <[email protected]> wrote:

> Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile
> your target Hadoop workload and see whether it's communication-bound. Hadoop
> jobs can definitely be communication-bound if you shuffle a lot of data
> between map and reduce, but I've also seen a lot of clusters that are
> CPU-bound (due to decompression, running python, or just running expensive
> user code) or disk-IO-bound. You might be surprised at what your bottleneck
> is.
>
> Matei
>
> On Jun 28, 2011, at 3:06 PM, Saqib Jang -- Margalla Communications wrote:
>
>> Matt,
>> Thanks, this is helpful, I was wondering if you may have some thoughts
>> on the list of other potential benefits of 10GbE NICs for Hadoop
>> (listed in my original e-mail to the list)?
>>
>> regards,
>> Saqib
>>
>> -----Original Message-----
>> From: Matthew Foley [mailto:[email protected]]
>> Sent: Tuesday, June 28, 2011 12:04 PM
>> To: [email protected]
>> Cc: Matthew Foley
>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>
>> Hadoop common provides an abstract FileSystem class, and Hadoop applications
>> should be designed to run on that. HDFS is just one implementation of a
>> valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported
>> LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storage
>> would fall under the LocalFileSystem model.
>>
>> However, one of the core values of Hadoop is the model of "bring the
>> computation to the data". This does not seem viable with an NFS-based
>> NAS-model storage subsystem. Thus, while it will "work" for small clusters
>> and small jobs, it is unlikely to scale with high performance to thousands
>> of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.
>>
>> --Matt
>>
>>
>> On Jun 28, 2011, at 10:41 AM, Darren Govoni wrote:
>>
>> I see.
>> However, Hadoop is designed to operate best with HDFS because of its
>> inherent striping and blocking strategy - which is tracked by Hadoop.
>> Going outside of that mechanism will probably yield poor results and/or
>> confuse Hadoop.
>>
>> Just my thoughts.
>>
>> On 06/28/2011 01:27 PM, Saqib Jang -- Margalla Communications wrote:
>>> Darren,
>>> Thanks, the last pt was basically about 10GbE potentially allowing the
>>> use of a network file system e.g. via NFS as an alternative to HDFS,
>>> the question is there any merit in this. Basically, I was exploring if
>>> the commercial clustered NAS products offer any high-availability or
>>> data management benefits for use with Hadoop?
>>>
>>> Saqib
>>>
>>> -----Original Message-----
>>> From: Darren Govoni [mailto:[email protected]]
>>> Sent: Tuesday, June 28, 2011 10:21 AM
>>> To: [email protected]
>>> Subject: Re: Sanity check re: value of 10GbE NICs for Hadoop?
>>>
>>> Hadoop, like other parallel networked computation architectures is I/O
>>> bound, predominantly.
>>> This means any increase in network bandwidth is "A Good Thing" and can
>>> have drastic positive effects on performance. All your points stem
>>> from this simple realization.
>>>
>>> Although I'm confused by your #6. Hadoop already uses a distributed
>>> file system. HDFS.
>>>
>>> On 06/28/2011 01:16 PM, Saqib Jang -- Margalla Communications wrote:
>>>> Folks,
>>>>
>>>> I've been digging into the potential benefits of using
>>>> 10 Gigabit Ethernet (10GbE) NIC server connections for
>>>> Hadoop and wanted to run what I've come up with
>>>> through initial research by the list for 'sanity check'
>>>> feedback.
>>>> I'd very much appreciate your input on
>>>> the importance (or lack of it) of the following potential benefits of
>>>> 10GbE server connectivity as well as other thoughts regarding
>>>> 10GbE and Hadoop (My interest is specifically in the value
>>>> of 10GbE server connections and 10GbE switching infrastructure,
>>>> over scenarios such as bonded 1GbE server connections with
>>>> 10GbE switching).
>>>>
>>>> 1. HDFS Data Loading. The higher throughput enabled by 10GbE
>>>> server and switching infrastructure allows faster processing and
>>>> distribution of data.
>>>>
>>>> 2. Hadoop Cluster Scalability. High-performance for initial data
>>>> processing and distribution directly impacts the degree of parallelism
>>>> or scalability supported by the cluster.
>>>>
>>>> 3. HDFS Replication. Higher speed server connections allows faster
>>>> file replication.
>>>>
>>>> 4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and
>>>> latency directly impact the shuffle phase of a data set reduction
>>>> especially for tasks that are at the document level (including large
>>>> documents) and lots of metadata generated by those documents as well
>>>> as video analytics and images.
>>>>
>>>> 5. Data Reporting. 10GbE server network performance can
>>>> improve data reporting performance, especially if the Hadoop cluster
>>>> is running multiple data reductions.
>>>>
>>>> 6. Support of Cluster File Systems. With 10 GbE NICs, Hadoop could be
>>>> reorganized to use a cluster or network file system. This would allow
>>>> Hadoop even with its Java implementation to have higher performance
>>>> I/O and not have to be so concerned with disk drive density in the
>>>> same server.
>>>>
>>>> 7. Others?
>>>>
>>>> thanks,
>>>> Saqib
>>>>
>>>> Saqib Jang
>>>> Principal/Founder
>>>> Margalla Communications, Inc.
>>>> 1339 Portola Road, Woodside, CA 94062
>>>> (650) 274 8745
>>>> www.margallacomm.com
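
For anyone wanting a quick back-of-envelope check before profiling, here is a minimal sketch of the "is the job communication-bound, and how much would 10GbE help?" question raised in the thread. The function names, the `efficiency` parameter, the sample numbers, and the assumption that network transfer does not overlap with CPU or disk work are all illustrative; none of them come from the thread or from Hadoop itself, and they are no substitute for profiling a real workload.

```python
def transfer_seconds(num_bytes, link_gbps, efficiency=1.0):
    """Idealized time to push num_bytes through one NIC.

    link_gbps is the nominal link rate in decimal gigabits per second
    (1 for 1GbE, 10 for 10GbE). efficiency is a hypothetical fudge factor
    in (0, 1] for protocol overhead and contention.
    """
    if num_bytes < 0 or link_gbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("invalid byte count, link rate, or efficiency")
    return (num_bytes * 8) / (link_gbps * 1e9 * efficiency)


def network_share(shuffle_bytes, link_gbps, other_seconds, efficiency=1.0):
    """Fraction of per-node job time spent on the wire.

    Assumes (pessimistically) that network transfer and the remaining
    CPU/disk work, given as other_seconds, do not overlap.
    """
    net = transfer_seconds(shuffle_bytes, link_gbps, efficiency)
    total = net + other_seconds
    return net / total if total > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical per-node figures; replace with measured values.
    shuffle_bytes = 50e9   # bytes shuffled per node
    other_seconds = 300.0  # CPU + disk time per node
    for gbps in (1, 10):
        net = transfer_seconds(shuffle_bytes, gbps)
        share = network_share(shuffle_bytes, gbps, other_seconds)
        print(f"{gbps:>2} GbE: {net:7.1f} s on the wire, {share:.0%} of job time")
```

If the network share is already small at 1GbE, the bottleneck is more likely CPU or disk, which is the case Matei warns about.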
