We have built a Hadoop prototype that uses MapReduce on HDFS and HBase to analyze web traffic data. It is deployed on a 4-node cluster of commodity 4-core Linux machines. It works, but performance is pitiful.
The main problem is the way Hadoop allocates tasks to machines: one simply configures a maximum number of tasks per machine, and the JobTracker blindly assigns tasks to a TaskTracker until that limit is reached. The trouble is that we have to set these limits artificially low to keep a machine from being crushed when it has other work to do.

For example, consider a MapReduce job that loads data into HBase. As the HBase table grows, it splits new regions onto other servers, which are the same servers running MapReduce tasks. Whatever server happens to host the region that all the MapReduce tasks are writing to gets overloaded, the HBase process becomes slow to respond, and the MapReduce job fails.

Another example: not all tasks are created equal. Sometimes 10 tasks per machine is ideal, sometimes only 2. Configuring each job individually for how many tasks it should run per server is tedious and clumsy.

The only workaround we've found is to set the MapReduce task limit so low that a machine can still absorb any extra load that comes its way (for example, becoming an HBase region server). But that leaves all but one machine in the cluster mostly idle (just in case it becomes the region server), which gives poor overall throughput. I would expect simple load-based task distribution to run 3-10 times faster, since Hadoop currently uses only a small fraction of the processing power in each machine.

We could move HBase onto separate servers instead of sharing machines with MapReduce tasks, but at best that reduces the magnitude of the problem: it is still a static allocation that wastes machines with idle CPUs. Load-based task distribution would compensate automatically and adjust dynamically as the job runs. Imagine if Hadoop assigned tasks to machines based on each machine's load average.
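To make that concrete, the decision could be as simple as comparing a machine's one-minute load average against a configured target before accepting another task. Here is a minimal sketch; the class and method names are my own invention, not part of any Hadoop API, and a real implementation would need the load figure reported from each worker rather than read locally:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: decide whether a machine should accept another
// task based on its load average instead of a static slot count.
public class LoadBasedGate {

    // Pure decision logic: accept a new task while the load average is
    // below the target. Always allow at least minTasks so a machine
    // whose load is high for unrelated reasons cannot starve forever.
    public static boolean canAssignTask(double loadAvg, double targetLoad,
                                        int runningTasks, int minTasks) {
        if (runningTasks < minTasks) {
            return true;
        }
        return loadAvg < targetLoad;
    }

    public static void main(String[] args) {
        // getSystemLoadAverage() returns the 1-minute load average, or a
        // negative value if the platform does not support it.
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();
        double target = 2.0; // target load average instead of a max task count
        System.out.println("load=" + load
                + " accept=" + canAssignTask(load, target, 3, 1));
    }
}
```

Note that whatever component makes this decision runs on the JobTracker, so the load average would have to travel with the TaskTracker heartbeat; the sketch above only shows the decision itself.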
Instead of configuring a maximum task count in mapred-site.xml, we'd like to configure a target load average (for example, 2.0). This would give dynamic task balancing that automatically compensates for different tasks, different hardware, and so on. As the HBase region server moved from machine A to machine B, B's load average would rise and Hadoop would stop giving it tasks; meanwhile, A's load average would drop and it would receive more. Every machine in the cluster would run steadily at the target load average, which maximizes throughput.

I don't believe the Fair Scheduler solves this problem, because it shares jobs across the cluster, not the tasks within each job. My understanding is that it still uses the same static approach to allocating tasks. However, within the FairScheduler source code I found a LoadManager interface. It has one implementation, CapBasedLoadManager, which allocates tasks based on maximum counts. We could implement our own LoadManager that uses the system load average (for example, via java.lang.management.OperatingSystemMXBean.getSystemLoadAverage()).

But this is specific to the FairScheduler. Ideally, such a task balancer could be used with ANY job scheduler, not just the Fair Scheduler. That is, one shouldn't have to adopt the Fair Scheduler just to get load-average-based task distribution for each job. This should work independently of how jobs are prioritized in the cluster.

Does the community have a good solution to this problem? Does the LoadManager approach make sense? How could it be applied more generally (not just within FairScheduler)?

Thanks,

Michael Clements
Solutions Architect
[email protected]
206 664-4374 office
360 317 5051 mobile
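P.S. For concreteness, here is roughly what I have in mind on the configuration side. The static-cap property is the standard one from mapred-site.xml; if I'm reading the fair scheduler docs correctly, it also reads a mapred.fairscheduler.loadmanager property for plugging in a LoadManager class. The LoadAverageLoadManager class name and the target-load property are hypothetical, just to illustrate:

```xml
<!-- Today: static per-machine slot caps in mapred-site.xml -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>

<!-- Proposed: plug a custom LoadManager into the fair scheduler.
     LoadAverageLoadManager and the target property are hypothetical. -->
<property>
  <name>mapred.fairscheduler.loadmanager</name>
  <value>org.example.LoadAverageLoadManager</value>
</property>
<property>
  <name>loadmanager.target.load.average</name>
  <value>2.0</value>
</property>
```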
