Hey James,

I think this would be a fun project, but be prepared to have the desktop portion not work out in the end. I would recommend focusing on prototyping your application in MapReduce, and consider that you might be able to reuse your desktops as sugar-coating (remember there may be other concerns, such as how reliable your desktops are versus how much it costs you if a processing job fails).
Basically, the project is only likely to be successful if MapReduce makes sense even without running it on desktops. I would consider whether doing the data processing with MapReduce is, in itself, the success criterion.
FWIW, our students run on nodes that are considered "crap" by today's desktop standards (dual-processor Athlon MP, 1GB RAM, POS hard drive). Whether Hadoop is I/O-intensive, memory-intensive, or CPU-intensive is 100% a function of your application. It's somewhat like asking how intensive a programming language is without specifying what you're going to do with it.
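For example, here is a hypothetical pair of mappers (GrepMapper and HashMapper are made-up names, sketched against the org.apache.hadoop.mapreduce API) sharing the same skeleton: the first is limited by how fast the disks can feed it records, the second by the CPU:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Grep-style mapper: almost no work per record, so the disks are the
  // bottleneck and the job is I/O-intensive.
  class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      if (value.toString().contains("ERROR")) {
        ctx.write(value, NullWritable.get());
      }
    }
  }

  // Hash-heavy mapper: the same skeleton, but thousands of hashing passes
  // per record make it CPU-intensive while the disks sit mostly idle.
  class HashMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      byte[] b = value.toString().getBytes();
      int h = 0;
      for (int i = 0; i < 10000; i++) {
        for (int j = 0; j < b.length; j++) {
          h = 31 * h + b[j];  // burn CPU on every byte, many times over
        }
      }
      ctx.write(new Text(Integer.toHexString(h)), NullWritable.get());
    }
  }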
Brian

On Sep 29, 2009, at 11:05 AM, Edward Capriolo wrote:
On Tue, Sep 29, 2009 at 11:15 AM, Steve Loughran <[email protected]> wrote:
> Edward Capriolo wrote:
>> In Hadoop terms, commodity means "not a supercomputer". If you look
>> around, most large deployments have DataNodes with dual quad-core
>> processors, 8+ GB RAM, and numerous disks; that is hardly the PC you
>> find under your desk.
>
> I have 4 cores and 6GB RAM, but only one HDD on the desk. That RAM gets
> sucked up by the memory hogs: the IDE, Firefox and VMware. The fact that
> VMware uses less RAM to host an entire OS image than Firefox shows that
> you can get away with virtualised work, or that Firefox is overweight.
>
>  18862 20 0 2139m 1.4g  48m S  0 25.0  46:57.81 java
>  12822 20 0 1050m 476m  18m S  3  8.1  53:34.39 firefox
>  13037 20 0  949m 386m 365m S 11  6.5 206:34.70 vmware-vmx
>  20932 20 0  688m 374m  11m S  0  6.4   2:05.18 java
>
>> For example, our dev cluster was a very modest setup: 5 HP DL145s
>> (dual-core, 4GB RAM, 2 SATA disks). I did not do any elaborate testing,
>> but I found that !!!!One!!!! DL180 G5 (2x quad-core, 8GB RAM, 8 SATA
>> disks) crushed the 5-node cluster. So it is possible, but it might be
>> more administrative work than it is worth.
>
> Interesting. Is this CPU- or IO-intensive work?

I think this was a Hive query. By nature Hadoop is very CPU- and IO-intensive: MapReduce is always spilling files and writing to disk, and there is a decent amount of traffic in the shuffle/sort. The way I view it, if your DataNode/TaskTracker nodes don't have enough 'pop', your ratios are bad. By ratios I mean: administrators to PCs, power utilization, and map/reduce overhead versus actual data processing.

There are many ways to look at it, but if you are just running TaskTrackers and not DataNodes on your workstations, you have zero data locality (see the counter sketch at the end of this mail). That is a bad thing, because after all Hadoop wants to move the processing close to the data. And if the disks backing the DataNode and TaskTracker are not fast and multi-threaded, that node is not going to perform well. Now, it does sound like your workstations have more processing power than my test cluster, so you might have better results.

Personally, I would probably try Hadoop on Windows or coLinux instead of VMware. VMware has to emulate disk drives, kernels, and interrupts; IMHO that overhead was too much, though I do not know if anyone has numbers on it. This paper, http://www.cca08.org/papers/Poster10-Simone-Leo.pdf, mentions that Xen overhead seems to be about 5%. I would think that VMware virtualization would perform worse, but try it yourself and let me know!
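If you want to see how much locality you are actually getting, the framework counts it for you. A rough sketch, assuming a job that has already completed and the newer org.apache.hadoop.mapreduce.JobCounter enum (in 0.20 the equivalent counters live under JobInProgress.Counter):

  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.JobCounter;

  public class LocalityReport {
    // 'job' must already have finished, e.g. via job.waitForCompletion(true).
    static void print(Job job) throws Exception {
      Counters c = job.getCounters();
      long dataLocal = c.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
      long rackLocal = c.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
      long launched  = c.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
      // TaskTrackers with no co-located DataNode drive dataLocal toward
      // zero: every map has to pull its input split over the network.
      System.out.printf("maps: %d launched, %d data-local, %d rack-local%n",
          launched, dataLocal, rackLocal);
    }
  }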