On Tue, Sep 29, 2009 at 11:15 AM, Steve Loughran <[email protected]> wrote:
> Edward Capriolo wrote:
>
>> In Hadoop terms, commodity means "not a supercomputer". If you look
>> around, most large deployments have DataNodes with dual quad-core
>> processors, 8+ GB RAM, and numerous disks; that is hardly the PC you find
>> under your desk.
>
> I have 4 cores and 6GB RAM, but only one HDD on the desk. That RAM gets
> sucked up by the memory hogs: the IDE, Firefox and VMware. The fact that
> VMware uses less RAM to host an entire OS image than Firefox shows that you
> can get away with virtualised work, or that Firefox is overweight.
>
> 18862        20   0 2139m 1.4g  48m S    0 25.0  46:57.81 java
> 12822        20   0 1050m 476m  18m S    3  8.1  53:34.39 firefox
> 13037        20   0  949m 386m 365m S   11  6.5 206:34.70 vmware-vmx
> 20932        20   0  688m 374m  11m S    0  6.4   2:05.18 java
>
>> For example, our dev cluster was a very modest setup: 5 HP DL145s,
>> each dual-core with 4 GB RAM and 2 SATA disks.
>>
>> I did not do any elaborate testing, but I found that:
>>
>> !!!!One!!!! DL180G5 (2x quad-core, 8 GB RAM, 8 SATA disks) crushed the
>> 5-node cluster. So it is possible, but it might be more administrative
>> work than it is worth.
>
> Interesting. Is this CPU or IO intensive work?
>

>> Interesting. Is this CPU or IO intensive work?

I think this was a hive query.

By nature Hadoop is both CPU and IO intensive. MapReduce is constantly
spilling map output to disk, and there is a decent amount of network and
disk traffic in the shuffle/sort phase.
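Back-of-the-envelope, you can guess how much spilling a job will do from
the sort buffer size (io.sort.mb, the old default is 100 MB) and the
expected map output; the numbers below are made up for illustration:

```shell
# Rough spill estimate per map task. io.sort.mb is the in-memory sort
# buffer; once map output exceeds it, extra spill files hit the disk.
# All numbers here are hypothetical, plug in your own.
SORT_BUFFER_MB=100      # io.sort.mb
MAP_OUTPUT_MB=750       # expected output of one map task
SPILLS=$(( (MAP_OUTPUT_MB + SORT_BUFFER_MB - 1) / SORT_BUFFER_MB ))
echo "estimated spill files per map: $SPILLS"
```

More spills means more disk traffic before the shuffle even starts,
which is part of why slow disks hurt so much.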

The way I view it, if your DataNode/TaskTracker nodes don't have
enough 'pop', your ratios are bad.

By ratios I mean:
administrators to machines,
power utilization,
map/reduce overhead vs. actual data processing.

There are many ways to look at it, but for one: if you are just running
TaskTrackers and not DataNodes on your workstations, you have zero data
locality. That is a bad thing, because after all Hadoop wants to move
the processing close to the data. And if the disks backing the DataNode
and TaskTracker are not fast and able to handle concurrent access, that
node is not going to perform well.
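You can see how a job did on locality from the counters on the
JobTracker job page (data-local map tasks vs. the total number of
launched maps). A quick sanity check with made-up numbers:

```shell
# Data locality ratio from job counters. With TaskTrackers but no
# co-located DataNodes, DATA_LOCAL_MAPS would be 0. Both numbers are
# hypothetical; substitute what your job actually reports.
DATA_LOCAL_MAPS=35
TOTAL_MAPS=40
PCT=$(( 100 * DATA_LOCAL_MAPS / TOTAL_MAPS ))
echo "data-local maps: ${PCT}%"
```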

Now, it does sound like your workstations have more processing power
than my test cluster, so you might have better results.

Personally, I would probably try Hadoop on Windows or coLinux instead
of VMware. VMware has to virtualize disk drives, interrupts, and an
entire kernel. IMHO that overhead is too much, though I do not know if
anyone has hard numbers on it.
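If you want a crude number before benchmarking a real job, compare raw
sequential write speed on bare metal vs. inside the VM; dd is rough but
gives you a first-order answer:

```shell
# Crude sequential-write check: run once on the host and once in the
# guest, then compare the MB/s figure dd reports. Real MapReduce IO is
# more mixed than this, so treat it as a rough proxy only.
TESTFILE=/tmp/hadoop_disk_probe
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 2>&1 | tail -n 1
SIZE=$(wc -c < "$TESTFILE")
echo "wrote $SIZE bytes"
rm -f "$TESTFILE"
```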

In this paper:
http://www.cca08.org/papers/Poster10-Simone-Leo.pdf
they mention that Xen overhead seems to be around 5%. I would expect
VMware's full virtualization to perform worse, but try it yourself and
let me know!
