We have built a Hadoop prototype that uses MapReduce on HDFS and HBase to analyze web traffic data. It is deployed on a 4-node cluster of commodity 4-core Linux machines. It works, but performance is pitiful.
The main problem is the way Hadoop allocates tasks to machines: one simply configures a maximum number of tasks per machine, and the JobTracker blindly assigns tasks to a TaskTracker until that limit is reached. The trouble is that we have to set these limits artificially low to keep a machine from being crushed when it has other work to do.

For example, consider a MapReduce job that loads data into HBase. As the HBase table grows, it splits new regions onto other servers, which are the same servers running MapReduce tasks. Whatever server happens to host the region that all the MapReduce tasks are writing to gets overloaded, the HBase process becomes slow to respond, and the MapReduce job fails.

Another example: not all tasks are created equal. Sometimes 10 tasks per machine is ideal, sometimes only 2. Configuring each job individually for how many tasks it should run per server is tedious and clumsy.

The only workaround we've found is to set the MapReduce task limit so low that a machine can still absorb any extra load that comes its way (for example, becoming an HBase region server). But that leaves all but one machine in the cluster mostly idle (just in case it becomes the region server), which gives poor overall throughput. I would expect simple load-based task distribution to run 3-10 times faster, since Hadoop currently uses only a small fraction of the processing power in each machine.

We could move HBase onto separate servers instead of sharing machines with MapReduce tasks, but at best that reduces the magnitude of the problem: it is still a static allocation that wastes machines with idle CPUs. Load-based task distribution would compensate automatically and adjust dynamically as the job runs. Imagine if Hadoop assigned tasks to machines based on each machine's load average.
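To make that concrete, the decision could be as simple as comparing a machine's one-minute load average against a configured target before accepting another task. Here is a minimal sketch; the class and method names are my own invention, not part of any Hadoop API, and a real implementation would need the load figure reported from each worker rather than read locally:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: decide whether a machine should accept another
// task based on its load average instead of a static slot count.
public class LoadBasedGate {

    // Pure decision logic: accept a new task while the load average is
    // below the target. Always allow at least minTasks so a machine
    // whose load is high for unrelated reasons cannot starve forever.
    public static boolean canAssignTask(double loadAvg, double targetLoad,
                                        int runningTasks, int minTasks) {
        if (runningTasks < minTasks) {
            return true;
        }
        return loadAvg < targetLoad;
    }

    public static void main(String[] args) {
        // getSystemLoadAverage() returns the 1-minute load average, or a
        // negative value if the platform does not support it.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        double target = 2.0; // target load average instead of a max task count
        System.out.println("load=" + load
                + " accept=" + canAssignTask(load, target, 3, 1));
    }
}
```

Note that whatever component makes this decision runs on the JobTracker, so the load average would have to travel with the TaskTracker heartbeat; the sketch above only shows the decision itself.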
Instead of configuring a maximum task count in mapred-site.xml, we'd like to configure a target load average (for example, 2.0). This would give dynamic task balancing that automatically compensates for different tasks, different hardware, and so on. As the HBase region server moved from machine A to machine B, B's load average would rise and Hadoop would stop giving it tasks; meanwhile, A's load average would drop and it would receive more. Every machine in the cluster would run steadily at the target load average, which maximizes throughput.

I don't believe the Fair Scheduler solves this problem, because it shares jobs across the cluster, not the tasks within each job. My understanding is that it still uses the same static approach to allocating tasks. However, within the FairScheduler source code I found a LoadManager interface. It has one implementation, CapBasedLoadManager, which allocates tasks based on maximum counts. We could implement our own LoadManager that uses the system load average (for example, via java.lang.management.OperatingSystemMXBean.getSystemLoadAverage()).

But this is specific to the FairScheduler. Ideally, such a task balancer could be used with ANY job scheduler, not just the Fair Scheduler. That is, one shouldn't have to adopt the Fair Scheduler just to get load-average-based task distribution for each job. This should work independently of how jobs are prioritized in the cluster.

Does the community have a good solution to this problem? Does the LoadManager approach make sense? How could it be applied more generally (not just within FairScheduler)?

Thanks,

Michael Clements
Solutions Architect
[email protected]
206 664-4374 office
360 317 5051 mobile
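P.S. For concreteness, here is roughly what I have in mind on the configuration side. The static-cap property is the standard one from mapred-site.xml; if I'm reading the fair scheduler docs correctly, it also reads a mapred.fairscheduler.loadmanager property for plugging in a LoadManager class. The LoadAverageLoadManager class name and the target-load property are hypothetical, just to illustrate:

```xml
<!-- Today: static per-machine slot caps in mapred-site.xml -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>

<!-- Proposed: plug a custom LoadManager into the fair scheduler.
     LoadAverageLoadManager and the target property are hypothetical. -->
<property>
  <name>mapred.fairscheduler.loadmanager</name>
  <value>org.example.LoadAverageLoadManager</value>
</property>
<property>
  <name>loadmanager.target.load.average</name>
  <value>2.0</value>
</property>
```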
