"commodity" really means x86 parts, non-RAID storage, no InfiniBand-connected storage array, no esoteric OS (just Linux) and commodity gigabit Ethernet, with nothing fancy like 10GbE except on a heavily utilised backbone :) With that kind of configuration you reduce your capital costs, leaving you more money to spend on the electricity bill. I'd still go for RAID and/or NFS-mounted RAID for bits of the namenode/secondary namenode if you care about the data.
Taeho Kang wrote:
If your "commodity" PCs don't have a whole lot of storage space, then you would have to run your HDFS datanodes elsewhere. In that case a lot of data traffic will occur (e.g. sending data from the datanodes to where the processing occurs), meaning MapReduce performance will be slowed down. It's always good to have the actual data on the same machine where the processing will occur, or there will be extra network I/O involved. If you decide to host datanodes on PCs, then you also have to be able to protect the data (e.g. make sure people don't accidentally delete data blocks). Well, there are lots and lots of possibilities, and I would like to hear how your plan goes, too!
I would go for storing data off the desktop machines, and just using them as compute nodes (tasktrackers). This reduces the impact of them going offline without warning but still lets them do useful work. It will bump up their bandwidth needs though, since every byte of input then has to come over the network.
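To get a feel for that bandwidth cost, here is a rough back-of-envelope sketch. The function name, the 80% link efficiency and the example sizes are all my own illustrative assumptions, not measurements; the only idea it encodes is that compute-only nodes behind one shared link split that link's capacity between them.

```python
def remote_read_seconds(input_bytes, link_gbit_per_s=1.0, sharers=1, efficiency=0.8):
    """Estimate the seconds one node needs to pull its input over a shared link.

    input_bytes     -- bytes this node has to read from remote storage
    link_gbit_per_s -- raw line rate of the shared link, in Gbit/s
    sharers         -- number of nodes reading over that link at the same time
    efficiency      -- assumed fraction of the line rate actually achieved
    """
    if sharers < 1:
        raise ValueError("sharers must be at least 1")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    link_bytes_per_s = link_gbit_per_s * 1e9 / 8 * efficiency
    per_node_bytes_per_s = link_bytes_per_s / sharers  # assumes an even split
    return input_bytes / per_node_bytes_per_s


if __name__ == "__main__":
    ten_gb = 10 * 10**9  # hypothetical per-node input size
    for nodes in (1, 10, 40):
        seconds = remote_read_seconds(ten_gb, link_gbit_per_s=1.0, sharers=nodes)
        print(f"{nodes:3d} nodes sharing 1 Gbit/s: ~{seconds:.0f} s per node")
```

Plug in your own link speed and node count; if the result dwarfs the CPU time of a task, the desktops will mostly sit waiting on the network.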
This still leaves you with the problem of configuring the Hadoop cluster for all these machines, especially if they are different. To work around that, why not create a VirtualBox or VMware OS image containing the Hadoop binaries and configuration files? Everyone who runs the OS image joins the cluster, but as soon as they pause it, that tasktracker goes away.
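As a minimal sketch of the "same configuration files in every image" idea, something like the following could generate the site file you bake into the image. It assumes the older single-file hadoop-site.xml layout with the fs.default.name and mapred.job.tracker properties, so check the names against the Hadoop version you actually run. The hostnames, ports and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_site_config(namenode_uri, jobtracker_addr):
    """Return the text of a <configuration> document with two properties.

    The property names follow the older hadoop-site.xml convention; treat
    them as an assumption and verify them for your Hadoop release.
    """
    root = ET.Element("configuration")
    settings = (
        ("fs.default.name", namenode_uri),        # where the filesystem lives
        ("mapred.job.tracker", jobtracker_addr),  # where tasktrackers report in
    )
    for name, value in settings:
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical hosts: replace with your own namenode and jobtracker.
    print(build_site_config("hdfs://namenode.example.org:9000",
                            "jobtracker.example.org:9001"))
```

Because every copy of the image carries the same generated file, a desktop only has to boot the VM to find the jobtracker; nothing is configured per machine.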
When run virtualized, disk and network I/O are slower, but if you are only connecting to network storage, that network throttling could be useful: it will cut back on LAN bandwidth. CPU performance can often be comparable, so if your code is CPU-intensive, this can work.
