On Tue, Sep 29, 2009 at 11:15 AM, Steve Loughran <[email protected]> wrote:
> Edward Capriolo wrote:
>
>> In hadoop terms commodity means "Not super computer". If you look
>> around, most large deployments have DataNodes with dual quad-core
>> processors, 8+GB RAM and numerous disks; that is hardly the PC you find
>> under your desk.
>
> I have 4 cores and 6GB RAM, but only one HDD on the desk. That RAM gets
> sucked up by the memory hogs: IDE, Firefox and VMware. The fact that VMware
> uses less RAM to host an entire OS image than Firefox shows that you can get
> away with virtualised work, or that Firefox is overweight.
>
>  18862 20 0 2139m 1.4g  48m S  0 25.0  46:57.81 java
>  12822 20 0 1050m 476m  18m S  3  8.1  53:34.39 firefox
>  13037 20 0  949m 386m 365m S 11  6.5 206:34.70 vmware-vmx
>  20932 20 0  688m 374m  11m S  0  6.4   2:05.18 java
>
>> For example, we had a dev cluster with a very modest setup: 5 HP DL145,
>> dual-core, 4GB RAM, 2 SATA disks.
>>
>> I did not do any elaborate testing, but I found that:
>>
>> !!!!One!!!! DL180 G5 (2x quad-core, 8GB RAM, 8 SATA disks) crushed the
>> 5-node cluster. So it is possible, but it might be more administrative
>> work than it is worth.
>
> Interesting. Is this CPU- or IO-intensive work?
>
>> Interesting. Is this CPU- or IO-intensive work?

I think this was a Hive query. By nature Hadoop is both CPU- and IO-intensive: MapReduce is always spilling files and writing to disk, and there is a decent amount of network traffic in the shuffle/sort phase. The way I view it, if your DataNode/TaskTracker nodes don't have enough 'pop', your ratios are bad. By ratios I mean: administrators to machines, power utilization, and map/reduce overhead vs. actual data processing.

There are many ways to look at it, but if you are just running TaskTrackers and not DataNodes on your workstations, you have zero data locality. That is a bad thing, because after all Hadoop wants to move the processing close to the data. And if the disks backing the DataNode and TaskTracker are not fast and able to handle concurrent access, that node is not going to perform well.

Now, it does sound like your workstations have more processing power than my test cluster, so you might have better results. Personally, I would probably try Hadoop on Windows or coLinux instead of VMware. VMware has to emulate disk drives, kernels, interrupts; IMHO that overhead was too much, though I do not know if anyone has hard numbers on it. In this paper:

http://www.cca08.org/papers/Poster10-Simone-Leo.pdf

they mention Xen overhead seems to be around 5%. I would think that VMware's virtualization would perform worse, but try it yourself and let me know!
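If you do want to compare native vs. virtualized disk throughput before committing to a setup, here is a rough sketch of the kind of quick-and-dirty test I have in mind (the function name and parameters are just mine, not from any benchmark suite). Run it once on the bare host and once inside the VM and compare the numbers; it only measures sequential writes, which is the pattern MapReduce spills mostly follow:

```python
import os
import tempfile
import time

def sequential_write_mb_s(size_mb=256, block_kb=64):
    """Write size_mb of zeros in block_kb chunks to a temp file,
    fsync, and return throughput in MB/s.

    Hypothetical helper for a rough native-vs-VM comparison;
    not a substitute for a real benchmark.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (size_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure data actually hit the disk
        elapsed = time.time() - start
        return size_mb / elapsed
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print("%.1f MB/s" % sequential_write_mb_s())
```

A gap much bigger than that ~5% Xen figure would tell you the hypervisor's emulated disk path is the bottleneck before you ever start a TaskTracker on it.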
