How long are your tasks, and which version of Hadoop are you using? In older 
versions (0.20.*), the fair scheduler doesn't launch multiple tasks per 
heartbeat, so it performs poorly when your tasks are small (less than 5-10 
seconds). You may be able to improve throughput a bit by setting 
mapred.fairscheduler.assignmultiple to true in your mapred-site.xml. However, 
even with that setting it will still assign too few tasks per heartbeat. I 
recommend using either 
Hadoop 0.21, where this issue is fixed by default, or Cloudera's Hadoop 
distribution, which is the only 0.20-based version that has backported the 
relevant fair scheduler improvements from 0.21.
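
For reference, a minimal sketch of the mapred-site.xml fragment for the setting 
mentioned above. I'm assuming here that the file is the one read by the 
JobTracker and that the JobTracker is restarted afterwards so the scheduler 
picks the value up; the rest of your existing configuration is omitted.

```xml
<configuration>
  <!-- Assumption: this is the mapred-site.xml used by the JobTracker. -->
  <!-- Lets the fair scheduler hand out more than one task per heartbeat. -->
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>
</configuration>
```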

If you have short tasks, though, you should be aware that Hadoop as a whole will 
be inefficient because it will spend most of its time launching JVMs and 
waiting on heartbeats to send back status updates. You should try to tune your 
task size (number of input records per task) so that each task takes at least 
30-60 seconds, or you won't be running at the maximum efficiency possible for 
your cluster.
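
As a rough way to pick a task size, here is a small sketch of the arithmetic. 
The function names and the sample numbers are made up for illustration; the 
only inputs are the per-task throughput you measure on your own jobs and a 
target task duration somewhere in the 30-60 second range.

```python
import math


def records_per_task(records_per_second, target_seconds=45.0):
    """Number of input records to give each task so that it runs for
    roughly target_seconds, given a measured per-task throughput."""
    if records_per_second <= 0 or target_seconds <= 0:
        raise ValueError("throughput and target duration must be positive")
    return math.ceil(records_per_second * target_seconds)


def task_count(total_records, per_task):
    """Number of tasks the job splits into at that task size."""
    if per_task <= 0:
        raise ValueError("per_task must be positive")
    return math.ceil(total_records / per_task)


if __name__ == "__main__":
    # Hypothetical measurement: one task processed 2000 records in 4 seconds.
    throughput = 2000 / 4.0
    size = records_per_task(throughput, target_seconds=45.0)
    print(size, task_count(9_000_000, size))
```

How you actually apply the resulting records-per-task figure depends on your 
input format's split settings, so treat the output as a starting point and 
re-measure after changing it.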

Matei

On Aug 18, 2011, at 10:14 AM, Mick Semb Wever wrote:

> I'm new to setting up hadoop's scheduler and i'm trying to set up
> Fairscheduler on a 3-node cluster. The initial setup is fine but
> throughput is abysmal.
> 
> Each node is configured with 16 map task capacity and 8 reduce task
> capacity. Most jobs being run are reading data from cassandra installed
> on the same nodes using ColumnFamilyInputFormat.
> 
> With the default scheduler these jobs take from 5 to 15 minutes.
> 
> When i plug in the fairscheduler they take from one to many hours.
> 
> What i see is that the map task capacity is not being used. Jobs now
> only run 3 map tasks at a time whereas before they would always run all
> 48 map tasks.
> 
> This is without any custom fair-scheduler.xml configuration. But i've also
> tried configuring userMaxJobsDefault, maxRunningJobs, and weight
> without any luck.
> 
> I've also tried adding mapred.fairscheduler.locality.delay=0 without any
> luck.
> 
> Is it possible with fairscheduler to get the same throughput when only
> one job is running as it is with hadoop's default scheduler? Am i
> missing something obvious?
> 
> ~mck
> 
> 
> -- 
> Linux, because I'd rather own a free OS than steal one that's not worth
> paying for. 
> 
> | http://semb.wever.org | http://sesat.no |
> | http://tech.finn.no   | Java XSS Filter |
> 
