Folks,

I've been digging into the potential benefits of 10 Gigabit Ethernet (10GbE) NIC server connections for Hadoop, and wanted to run my initial findings by the list for a sanity check. I'd very much appreciate your input on the importance (or lack of it) of the potential benefits below, as well as any other thoughts on 10GbE and Hadoop. My interest is specifically in the value of 10GbE server connections with 10GbE switching infrastructure, compared with scenarios such as bonded 1GbE server connections with 10GbE switching.

1. HDFS Data Loading. The higher throughput enabled by 10GbE server and switching infrastructure allows faster processing and distribution of data.

2. Hadoop Cluster Scalability. High performance for initial data processing and distribution directly impacts the degree of parallelism or scalability supported by the cluster.

3. HDFS Replication. Higher-speed server connections allow faster file replication.

4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency directly impact the shuffle phase of a data set reduction, especially for tasks that operate at the document level (including large documents) and generate lots of metadata from those documents, as well as for video analytics and images.

5. Data Reporting. 10GbE server network performance can improve data reporting performance, especially if the Hadoop cluster is running multiple data reductions.

6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be reorganized to use a cluster or network file system. This would allow Hadoop, even with its Java implementation, to achieve higher-performance I/O and to be less concerned with disk drive density in the same server.

7. Others?

thanks,
Saqib

Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com
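
P.S. For a rough sense of scale on the data loading and replication points, here is a back-of-envelope sketch in Python. The 1 TB data size, the link list, and the helper name transfer_seconds are illustrative assumptions, not measurements. It counts only ideal wire time and ignores disk throughput, protocol overhead, and the fact that bonded links often do not spread a single TCP flow across all members, so treat the numbers as best-case lower bounds on transfer time.

```python
# Back-of-envelope only: ideal wire time at full line rate.
# Ignores disk speed, TCP/HDFS overhead, and imperfect load balancing
# across bonded links. All figures below are hypothetical.

def transfer_seconds(data_bytes, link_gbps):
    """Ideal time to move data_bytes over a link of link_gbps (10^9 bits/s)."""
    return data_bytes * 8 / (link_gbps * 1e9)

DATA_BYTES = 10**12  # assumed: 1 TB to load or re-replicate through one server

LINKS_GBPS = {
    "1GbE": 1,
    "4 x 1GbE bonded (ideal)": 4,
    "10GbE": 10,
}

for name, gbps in LINKS_GBPS.items():
    secs = transfer_seconds(DATA_BYTES, gbps)
    print(f"{name:26s} {secs / 60:7.1f} min")
```

Under these assumptions the gap between ideally bonded 4 x 1GbE and 10GbE is 2.5x in wire time; whether a real cluster sees anything close to that depends on whether the disks and the job mix can keep the link busy.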
