Virtualized nodes is a brilliant idea :) This greatly reduced the efforts, especially when the PCs are not fully in your control. Xiance
On Tue, Sep 29, 2009 at 6:01 PM, Steve Loughran <[email protected]> wrote: > > "commodity" really means x86 parts, non-RAID storage, no > infiniband-connected storage array, no esoteric OS -just Linux- and > commodity gigabit ether, nothing fancy like 10GBE except on a heavy-utilised > backbone :) With those kind of configurations, you reduce your capital > costs, leaving you more money to spend on the electricity bill. I'd still go > for RAID and/or NFS-mounted RAID for bits of the namenode/2ary namenode if > you care about the data. > > Taeho Kang wrote: > >> If your "commodity" pc's don't have a whole lot of storage space, then you >> would have to run your HDFS datanodes elsewhere. In that case, a lot of >> data >> traffic will occur (e.g. sending data from datanodes to where data >> processing occurs), meaning map reduce performance will be slowed down. >> It's >> always good to have the actual data on the same machine where the >> processing >> will occur, or there will be extra network i/o involved. >> >> If you decide to host datanodes on pc's, then you also have to be able to >> protect the data. (e.g. make sure people don't accidentally delete data >> blocks.) >> >> Well, there are lots and lots of possibilities, and I would like to hear >> how >> your plan goes, too! >> > > I would go for storing data off the desktop machines, and just using them > as compute nodes -tasktrackers. This reduces the impact of them going > offline without warning but lets them do useful work. This will bump up > their bandwidth needs though. > > This still leaves you with the problem of configuring the hadoop cluster > for all these machines, especially if they are different. To work around > that, why not creating a VirtualBox or VMWare OS image containing the hadoop > binaries and configuration files. Everyone who runs the OS image joins the > cluster, but as soon as they pause it, that tasktracker goes away. > > When run Virtualized, HDD and network IO is slower, but if you are only > connecting to network storage, that network throttling could be useful, it > will cut back on LAN bandwidth. CPU performance can often be comparable, so > if your code is CPU-intensive, this can work >
