RE: Adding new filesystem to Hadoop causing too many Map tasks

Devaraj Das Fri, 01 Jun 2007 05:06:25 -0700

Moving this to hadoop-user.
Just to clarify, did you set test.randomwrite.maps_per_host to 5 in the run
with Ceph?

-----Original Message-----
From: Esteban Molina-Estolano [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 01, 2007 1:45 PM
To: [EMAIL PROTECTED]
Subject: Adding new filesystem to Hadoop causing too many Map tasks

I'm adding support in Hadoop for Ceph (http://ceph.sourceforge.net/), a
distributed filesystem developed at UC Santa Cruz
(http://ssrc.cse.ucsc.edu/). Ceph runs entirely in userspace and is written in C++.
My current implementation is a subclass of FileSystem that uses a bit of JNI
glue to invoke the C++ Ceph client code.

I'm having trouble with a small test: RandomWriter, 4 TaskTracker nodes, 5
maps per node, 10 MB per map, for a total of 200 MB over 20 Map tasks. I
tried it on Hadoop with DFS, and it took about 30 seconds. Then, I ran the
same test using Ceph. I changed fs.default.name to "ceph:///"; added
fs.ceph.impl as org.apache.hadoop.fs.ceph.CephFileSystem; and left all other
configuration settings untouched. It ran horrifically slowly.

I ran the JobTracker and each TaskTracker in a separate terminal to watch
the output. One of the TaskTracker nodes gave me this:
07/06/01 00:16:49 INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map
output(s)
07/06/01 00:16:49 INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map
output location(s)

Then the JobTracker spawned 400 Map tasks:
07/06/01 00:23:11 INFO mapred.JobTracker: Adding task 'task_0001_m_000397_0'
to tip tip_0001_m_000397, for tracker 'tracker_issdm-11.cse.ucsc.edu:50050'
07/06/01 00:23:12 INFO mapred.JobInProgress: Task 'task_0001_m_000396_0' has
completed tip_0001_m_000396 successfully.
07/06/01 00:23:12 INFO mapred.TaskInProgress: Task 'task_0001_m_000396_0'
has completed.
07/06/01 00:23:12 INFO mapred.JobInProgress: Choosing normal task
tip_0001_m_000398
07/06/01 00:23:12 INFO mapred.JobTracker: Adding task 'task_0001_m_000398_0'
to tip tip_0001_m_000398, for tracker 'tracker_issdm-8.cse.ucsc.edu:50050'
07/06/01 00:23:13 INFO mapred.JobInProgress: Task 'task_0001_m_000397_0' has
completed tip_0001_m_000397 successfully.
07/06/01 00:23:13 INFO mapred.TaskInProgress: Task 'task_0001_m_000397_0'
has completed.
07/06/01 00:23:13 INFO mapred.JobInProgress: Choosing normal task
tip_0001_m_000399
07/06/01 00:23:13 INFO mapred.JobTracker: Adding task 'task_0001_m_000399_0'
to tip tip_0001_m_000399, for tracker 'tracker_issdm-11.cse.ucsc.edu:50050'

I'm ending up with 400 Map tasks instead of the expected 20, and as a result
the job takes far too long to run.
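
To confirm the count independently of the "Need 400 map output(s)" message, the
distinct map task IDs in the JobTracker log can be tallied and compared with
the 4 x 5 = 20 maps the test should produce. The following is only a rough
standalone sketch: the class MapTaskCounter and its methods are hypothetical
(not part of Hadoop), and the task ID pattern is an assumption inferred from
the log lines quoted above rather than a documented format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: counts the distinct map tasks
// named in JobTracker log lines such as the ones quoted above.
public class MapTaskCounter {
    // Assumption: map task IDs look like task_<job>_m_<index>_<attempt>,
    // as in the log excerpt. Reduce tasks use "_r_" and are not matched.
    private static final Pattern MAP_TASK =
        Pattern.compile("task_(\\d+)_m_(\\d+)_(\\d+)");

    // Returns the distinct map task indices found in the given log lines.
    static Set<Integer> distinctMapIndices(List<String> lines) {
        Set<Integer> indices = new TreeSet<>();
        for (String line : lines) {
            Matcher m = MAP_TASK.matcher(line);
            while (m.find()) {
                indices.add(Integer.parseInt(m.group(2)));
            }
        }
        return indices;
    }

    // Map count the test is expected to produce: one batch of
    // mapsPerHost maps for each TaskTracker node.
    static int expectedMaps(int taskTrackers, int mapsPerHost) {
        return taskTrackers * mapsPerHost;
    }

    public static void main(String[] args) {
        // Sample lines abbreviated from the JobTracker output above.
        List<String> sample = Arrays.asList(
            "INFO mapred.JobTracker: Adding task 'task_0001_m_000397_0'",
            "INFO mapred.JobInProgress: Task 'task_0001_m_000396_0' has",
            "INFO mapred.TaskRunner: task_0001_r_000000_0 Need 400 map");
        System.out.println("distinct map tasks seen: "
            + distinctMapIndices(sample).size());
        System.out.println("expected map tasks: " + expectedMaps(4, 5));
    }
}
```

Feeding the full JobTracker log through distinctMapIndices, instead of the
three sample lines, would show whether the number of distinct map tasks really
is 400 or whether some of those entries are retries of the same task.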

I strongly suspect this is a problem with my implementation, but I'm not
sure where to start looking. What sort of problem on the FileSystem side
could cause MapReduce to spawn so many extra tasks? How can I pin down the
cause?

Thanks,
     ~ Esteban
