On Sun, Apr 18, 2010 at 10:29 AM, Colin Yates <[email protected]> wrote:
> Many thanks for an excellent and thorough answer - Cloudera certainly
> looks very interesting as a developer resource.
>
> I have a few more questions if that is OK.
>
> So I have a directory filled with .CSV files representing each dump of
> data, or alternatively I have a single SequenceFile. How do I ensure a
> job is executed on as many nodes (i.e. all) as possible? If I set an
> HDFS replication factor of N, does that restrict me to N parallel jobs?
Replication is not what matters for parallelization. Hadoop will process
many "chunks" of one file in parallel: the input is broken into splits,
and one map task runs per split. If you use SequenceFiles, Hadoop knows
how to split them up and work on them in parallel. The same is true for
text / CSV files. You should check out Tom White's excellent book,
Hadoop: The Definitive Guide. (There's a minimal driver sketch at the end
of this mail showing where the split count comes from.)

> I want to get as many jobs running in parallel as possible.
>
> The most common map/reduce job will be the equivalent of:
>
> "select the aggregation of some columns from this table where
> columnX=this and columnY=that group by columnP"
>
> where columnX, columnY and columnP change from job to job :)
>
> I am absolutely trying to get the most out of divide and conquer - tens,
> or even hundreds of little virtual machines (I take your point about
> virtualisation) throwing their CPUs at this job.

Again, virtualization will almost certainly hurt performance rather than
help it. (A rough map/reduce sketch of that filter-and-group-by query is
at the end of this mail as well.)

> My second question is the number of jobs per node. If I run this on a
> (non-virtual) machine with 2 quad-core CPUs (each hyper-threaded), so 16
> hardware threads, that (simplistically) means that there could be at
> least 16 map and/or reduce jobs in parallel on each machine. How do I
> ensure this, or will Hadoop do this automatically? My only motivation
> for using VMs was to achieve the best utilisation of CPU. (I get life
> isn't as simple as this and the point about IO wait and CPU wait etc.)

The maximum number of allowed tasks per node is a user-configured
parameter; Hadoop trusts you to set it correctly. You configure map and
reduce slots per node and Hadoop assigns tasks to run in those slots (see
the mapred-site.xml snippet at the end of this mail). How many tasks a
node can run depends on the kind of tasks and on the machine
configuration: if your jobs are CPU-heavy you'll want more cores per
disk, I/O-heavy jobs want the opposite, and every task consumes RAM
because each one runs in its own JVM. Cloudera's baseline hardware
recommendations are described in detail at
http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/.

Again, avoid virtualization in data-intensive processing systems like
this. If you still don't believe me, ask yourself this: would you
increase your Oracle 10g capacity by taking the same hardware and
splitting it up into smaller machines? If anyone answers yes to this,
email me off list. I have a bridge to sell you. ;)

HTH and good luck!

--
Eric Sammer
phone: +1-917-287-2675
twitter: esammer
data: www.cloudera.com
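
To make the splits point concrete, here is a minimal, hypothetical driver
against the 0.20-style org.apache.hadoop.mapreduce API. The class, job,
and path names are placeholders; the point is that the number of map
tasks falls out of the input splits FileInputFormat computes over the
input directory, not out of the replication factor.

// CsvAggregateDriver.java -- hypothetical driver, 0.20-style "new" API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CsvAggregateDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "csv-aggregate");
    job.setJarByClass(CsvAggregateDriver.class);

    // Point the job at the whole directory of .CSV dumps. Hadoop computes
    // roughly one input split per block and runs one map task per split;
    // the replication factor has no bearing on that count.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setInputFormatClass(TextInputFormat.class);
    // For a SequenceFile you would use SequenceFileInputFormat instead;
    // it is splittable in exactly the same way.

    job.setMapperClass(FilterGroupByMapper.class);  // defined in the next sketch
    job.setReducerClass(SumReducer.class);          // defined in the next sketch
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with the input directory and an output directory as arguments;
every plain-text CSV (or SequenceFile) in the input directory gets split
and mapped in parallel across the cluster.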
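
And here is a rough sketch of the mapper and reducer for the quoted
"select ... where columnX=this and columnY=that group by columnP"
pattern. The comma delimiter, the column positions, and the aggregate (a
plain sum) are assumptions for illustration, not anything from the
original question.

// FilterGroupByMapper.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterGroupByMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] cols = line.toString().split(",");
    // WHERE columnX = 'this' AND columnY = 'that'
    // (columns 2, 5, 0 and 7 are placeholder positions)
    if (cols.length > 7 && "this".equals(cols[2]) && "that".equals(cols[5])) {
      // Emit (columnP, value to aggregate); the shuffle does the GROUP BY.
      context.write(new Text(cols[0]),
                    new LongWritable(Long.parseLong(cols[7].trim())));
    }
  }
}

// SumReducer.java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text group, Iterable<LongWritable> values,
      Context context) throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable v : values) {
      sum += v.get();
    }
    context.write(group, new LongWritable(sum));
  }
}

Changing columnX, columnY and columnP from job to job is then just a
matter of passing the column indexes in through the job Configuration
instead of hard-coding them.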
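
Finally, the slot configuration lives in mapred-site.xml on each
tasktracker (0.20/1.x-era property names). The values below are purely
illustrative; pick them based on the core/disk/RAM trade-offs above and
the Cloudera post.

<!-- mapred-site.xml on each worker node; values are examples only -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>    <!-- concurrent map slots on this node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>4</value>    <!-- concurrent reduce slots on this node -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1g</value>  <!-- heap for each task JVM; every slot costs RAM -->
  </property>
</configuration>

The tasktracker has to be restarted to pick up new slot counts, if I
remember right.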
