On Tue, Sep 29, 2009 at 6:09 AM, Xiance SI(司宪策) <[email protected]> wrote: > Virtualized nodes is a brilliant idea :) This greatly reduced the efforts, > especially when the PCs are not fully in your control. > Xiance > > On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran <[email protected]> wrote: > >> >> "commodity" really means x86 parts, non-RAID storage, no >> infiniband-connected storage array, no esoteric OS -just Linux- and >> commodity gigabit ether, nothing fancy like 10GBE except on a heavy-utilised >> backbone :) With those kind of configurations, you reduce your capital >> costs, leaving you more money to spend on the electricity bill. I'd still go >> for RAID and/or NFS-mounted RAID for bits of the namenode/2ary namenode if >> you care about the data. >> >> Taeho Kang wrote: >> >>> If your "commodity" pc's don't have a whole lot of storage space, then you >>> would have to run your HDFS datanodes elsewhere. In that case, a lot of >>> data >>> traffic will occur (e.g. sending data from datanodes to where data >>> processing occurs), meaning map reduce performance will be slowed down. >>> It's >>> always good to have the actual data on the same machine where the >>> processing >>> will occur, or there will be extra network i/o involved. >>> >>> If you decide to host datanodes on pc's, then you also have to be able to >>> protect the data. (e.g. make sure people don't accidentally delete data >>> blocks.) >>> >>> Well, there are lots and lots of possibilities, and I would like to hear >>> how >>> your plan goes, too! >>> >> >> I would go for storing data off the desktop machines, and just using them >> as compute nodes -tasktrackers. This reduces the impact of them going >> offline without warning but lets them do useful work. This will bump up >> their bandwidth needs though. >> >> This still leaves you with the problem of configuring the hadoop cluster >> for all these machines, especially if they are different. To work around >> that, why not creating a VirtualBox or VMWare OS image containing the hadoop >> binaries and configuration files. Everyone who runs the OS image joins the >> cluster, but as soon as they pause it, that tasktracker goes away. >> >> When run Virtualized, HDD and network IO is slower, but if you are only >> connecting to network storage, that network throttling could be useful, it >> will cut back on LAN bandwidth. CPU performance can often be comparable, so >> if your code is CPU-intensive, this can work >> >
In hadoop terms commodity means "Not super computer". If you look around most large deployments have DataNodes with duel quad core processors 8+GB ram and numerous disks, that is hardly the PC you find under your desk. For example, we had a dev clusters is a very modest setup, 5 HP DL 145. Duel Core 4 GB RAM, 2 SATA DISKS. I did not do any elaborate testing, but i found that: !!!!One!!!! DL180G5 (2x Quad Code) (8GB ram) 8 SATA disk crushed the 5 node cluster. So it is possible but it might be more administrative work then it is worth.
