> How long are your tasks, and which version of Hadoop are you using? 

Hadoop-0.20.1 (eventually we're looking to upgrade to Brisk).
Tasks take 5-30 seconds.

> In older versions (0.20.*), the fair scheduler doesn't launch multiple
> tasks per heartbeat, so it performs poorly when your tasks are small
> (less than 5-10 seconds). You may be able to improve it a bit by setting
> mapred.fairscheduler.assignmultiple to true in your mapred-site.xml.
> However, even this will assign too few tasks per heartbeat. I recommend
> using either Hadoop 0.21, where this issue is fixed by default, or
> Cloudera's Hadoop distribution, which is the only 0.20-based version that
> has backported the relevant fair scheduler improvements from 0.21.

Setting mapred.fairscheduler.assignmultiple to true gave no apparent benefit.
The number of running maps is still too low (4 or 5 now).
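
For reference, a minimal sketch of the setting that was tried, as it would appear in mapred-site.xml. The property name and value come from the advice quoted above; placing it inside the file's existing <configuration> element is an assumption about a standard 0.20 layout.

```xml
<!-- mapred-site.xml fragment (assumed to sit inside <configuration>) -->
<property>
  <!-- let the fair scheduler hand out more than one task per heartbeat -->
  <name>mapred.fairscheduler.assignmultiple</name>
  <value>true</value>
</property>
```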

[OT] Does Brisk-1.0-beta2 include this fix you mention?

> If you have short tasks though, you should beware that Hadoop as a whole
> will be inefficient because it will spend most of its time launching JVMs
> and waiting on heartbeats to send back status updates.

I'm running with mapred.job.reuse.jvm.num.tasks=-1, so JVMs are reused without limit.
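
As a sketch, that setting in property form looks like the following. Only the name and value are taken from the line above; whether it lives in mapred-site.xml or in the per-job configuration is an assumption.

```xml
<!-- assumed location: mapred-site.xml or the job's own configuration -->
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <!-- -1 = no limit on how many tasks of the same job one JVM may run -->
  <value>-1</value>
</property>
```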

> You should try to tune your task size (number of input records per task)
> so that each task takes at least 30-60 seconds, or you won't be running
> at the maximum efficiency possible for your cluster.

The default here is 
 cassandra.input.split.size=65536

Raising it to 262144 (default x 4) indeed fixes the problem :-)
 (Now I just need to check this doesn't send memory through the roof...)
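
A minimal sketch of the change in property form, assuming the job picks the value up from its Hadoop configuration (where exactly it is set depends on how the job is launched):

```xml
<!-- job configuration fragment; location is an assumption -->
<property>
  <name>cassandra.input.split.size</name>
  <!-- 4 x the 65536 default, so each map task covers roughly 4x the rows -->
  <value>262144</value>
</property>
```

With tasks previously taking 5-30 seconds, a 4x larger split should put them at very roughly 20-120 seconds if run time scales linearly with split size, which is in line with the suggested 30-60 second minimum.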


Thanks for your valuable and quick help, Matei.


~mck


-- 
"Physics is to math what sex is to masturbation." Richard Feynman 

| http://semb.wever.org | http://sesat.no |
| http://tech.finn.no   | Java XSS Filter |
