Adding new filesystem to Hadoop causing too many Map tasks

Esteban Molina-Estolano Fri, 01 Jun 2007 01:15:17 -0700

I'm adding support in Hadoop for Ceph (http://ceph.sourceforge.net/), a distributed filesystem developed at UC Santa Cruz (http://ssrc.cse.ucsc.edu/). Ceph runs entirely in userspace and is written in C++. My current implementation is a subclass of FileSystem that uses a bit of JNI glue to invoke the C++ Ceph client code.

I'm having trouble with a small test: RandomWriter, 4 TaskTracker nodes, 5 maps per node, 10 MB per map, for a total of 200 MB over 20 Map tasks. I tried it on Hadoop with DFS, and it took about 30 seconds. Then, I ran the same test using Ceph. I changed fs.default.name to "ceph:///"; added fs.ceph.impl as org.apache.hadoop.fs.ceph.CephFileSystem; and left all other configuration settings untouched. It ran horrifically slowly.
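
For reference, the two changes described above would look roughly like this as site configuration properties. This is only a sketch: the property names and values are the ones given above, but the file name (hadoop-site.xml) and the exact layout are assumptions.

```xml
<!-- Sketch of the two overrides, assumed to live in conf/hadoop-site.xml -->
<configuration>
  <!-- Make Ceph the default filesystem instead of DFS -->
  <property>
    <name>fs.default.name</name>
    <value>ceph:///</value>
  </property>
  <!-- Map the ceph:// scheme to the FileSystem subclass -->
  <property>
    <name>fs.ceph.impl</name>
    <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
  </property>
</configuration>
```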

I ran the JobTracker and each TaskTracker in a separate terminal to watch the output. One of the TaskTracker nodes gave me this:

07/06/01 00:16:49 INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map output(s)
07/06/01 00:16:49 INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map output location(s)


Then the JobTracker spawned 400 Map tasks:

07/06/01 00:23:11 INFO mapred.JobTracker: Adding task 'task_0001_m_000397_0' to tip tip_0001_m_000397, for tracker 'tracker_issdm-11.cse.ucsc.edu:50050'
07/06/01 00:23:12 INFO mapred.JobInProgress: Task 'task_0001_m_000396_0' has completed tip_0001_m_000396 successfully.
07/06/01 00:23:12 INFO mapred.TaskInProgress: Task 'task_0001_m_000396_0' has completed.
07/06/01 00:23:12 INFO mapred.JobInProgress: Choosing normal task tip_0001_m_000398
07/06/01 00:23:12 INFO mapred.JobTracker: Adding task 'task_0001_m_000398_0' to tip tip_0001_m_000398, for tracker 'tracker_issdm-8.cse.ucsc.edu:50050'
07/06/01 00:23:13 INFO mapred.JobInProgress: Task 'task_0001_m_000397_0' has completed tip_0001_m_000397 successfully.
07/06/01 00:23:13 INFO mapred.TaskInProgress: Task 'task_0001_m_000397_0' has completed.
07/06/01 00:23:13 INFO mapred.JobInProgress: Choosing normal task tip_0001_m_000399
07/06/01 00:23:13 INFO mapred.JobTracker: Adding task 'task_0001_m_000399_0' to tip tip_0001_m_000399, for tracker 'tracker_issdm-11.cse.ucsc.edu:50050'

I'm ending up with way too many Map tasks, and as a result the job takes way too long to run.

I strongly suspect this is a problem with my implementation, but I'm not sure where to start looking. What sort of problem on the FileSystem side could cause MapReduce to spawn so many extra tasks? How can I pin down the cause?


Thanks,
    ~ Esteban
