Hey James,

I think this would be a fun project, but be prepared for the desktop portion to not work out in the end. I would recommend focusing on prototyping your application in MapReduce first, and treat reusing your desktops as a possible bonus (remember there may be other concerns, such as how reliable your desktops are versus how much it costs you if a processing job fails).

Basically, if the project only makes sense when MapReduce runs on your desktops, it might not be successful. I would consider whether doing the data processing with MapReduce is a success in itself.

FWIW, our students run on nodes that are considered "crap" by today's desktop standards (dual-processor Athlon MP, 1GB RAM, POS hard drive). Whether Hadoop is I/O intensive, memory intensive, or CPU intensive is 100% a function of your application. It's somewhat like asking how intensive a programming language is without specifying how you're going to use it.
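
To make that concrete, here is a minimal sketch (using the 0.20-era org.apache.hadoop.mapred API; both mapper bodies are made-up placeholders, not anyone's real job). The framework runs these two identically; only the per-record work decides whether the job is I/O-bound or CPU-bound:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// I/O-bound flavour: a pass-through mapper. Almost no computation,
// so disks and network set the pace.
class PassThroughMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> out,
                  Reporter reporter) throws IOException {
    out.collect(key, value);  // just move the bytes along
  }
}

// CPU-bound flavour: the identical skeleton, but per-record work
// dominates. (The loop is busywork standing in for a real kernel.)
class ComputeHeavyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> out,
                  Reporter reporter) throws IOException {
    long h = value.hashCode();
    for (int i = 0; i < 100000; i++) {
      h = 31 * h + i;          // burns CPU, touches no disk
    }
    out.collect(new LongWritable(h), value);
  }
}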

Brian

On Sep 29, 2009, at 11:05 AM, Edward Capriolo wrote:

On Tue, Sep 29, 2009 at 11:15 AM, Steve Loughran <[email protected]> wrote:
Edward Capriolo wrote:

In Hadoop terms, commodity means "not a supercomputer". If you look
around, most large deployments have DataNodes with dual quad-core
processors, 8+ GB RAM, and numerous disks; that is hardly the PC you
find under your desk.

I have 4 cores and 6 GB RAM, but only one HDD on the desk. That RAM gets sucked up by the memory hogs: the IDE, Firefox and VMware. The fact that VMware uses less RAM to host an entire OS image than Firefox does shows that you can get
away with virtualised work, or that Firefox is overweight.

  PID PR NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
18862 20  0 2139m 1.4g  48m S    0 25.0  46:57.81 java
12822 20  0 1050m 476m  18m S    3  8.1  53:34.39 firefox
13037 20  0  949m 386m 365m S   11  6.5 206:34.70 vmware-vmx
20932 20  0  688m 374m  11m S    0  6.4   2:05.18 java

For example, our dev cluster was a very modest setup: 5 HP DL145s,
dual-core, 4 GB RAM, 2 SATA disks each.

I did not do any elaborate testing, but I found that:

*One* DL180 G5 (2x quad-core, 8 GB RAM, 8 SATA disks) crushed the
5-node cluster. So it is possible, but it might be more administrative
work than it is worth.

Interesting. Is this CPU or IO intensive work?


I think this was a Hive query.

By nature Hadoop is both CPU and IO intensive. MapReduce is always
spilling files and writing to disk, and there is a decent amount of
network traffic in the shuffle/sort phase.
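
For what it's worth, the knobs that govern that spilling are plain job configuration. A small sketch, assuming the 0.20-era property names io.sort.mb and io.sort.spill.percent:

import org.apache.hadoop.mapred.JobConf;

public class SpillTuning {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // io.sort.mb: the in-memory buffer that map output is collected
    // into before being spilled to disk (0.20-era default: 100 MB)
    conf.setInt("io.sort.mb", 200);
    // io.sort.spill.percent: how full that buffer gets before a
    // background spill to disk kicks in (default: 0.80)
    conf.setFloat("io.sort.spill.percent", 0.80f);
    System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
  }
}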

The way I view it, if your DataNode/TaskTracker nodes don't have
enough 'pop', your ratios are bad.

By ratios I mean:
administrators per machine,
power utilization,
MapReduce overhead vs. actual data processing.

There are many ways to look at it, but for one: if you are just
running TaskTrackers and not DataNodes on your workstations, you have
zero data locality. That is a bad thing, because after all Hadoop
wants to move the processing close to the data. If the disks backing
the DataNode and TaskTracker are not fast and able to serve concurrent
readers, that node is not going to perform well.
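
If you want to see the locality picture for yourself, something like this should work (a sketch against the standard FileSystem API; WhereAreMyBlocks is just a made-up name):

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FileStatus stat = fs.getFileStatus(new Path(args[0]));
    // Ask the NameNode which hosts hold a replica of each block.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      // If none of these hosts run your TaskTrackers, every map
      // task for this block reads its input over the network.
      System.out.println("offset " + b.getOffset() + ": "
          + Arrays.toString(b.getHosts()));
    }
  }
}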

Now it does sound like your workstations have more processing power
than my test cluster, so you might have better results.

Personally, I would probably try Hadoop on Windows or coLinux instead
of VMware. VMware has to emulate disk drives, kernels, and interrupts.
IMHO that overhead is too much, though I do not know if anyone has
hard numbers on it.

In this paper
http://www.cca08.org/papers/Poster10-Simone-Leo.pdf
they mention Xen overhead seems to be about 5%. I would think that
VMware virtualization would perform worse, but try it yourself and let
me know!
