On Mar 8, 2007, at 4:29 PM, R. James Firby wrote:

> However, now the JobClient computes the task splits on the submitting
> machine rather than at the JobTracker. That step involves looking up the
> default number of map tasks in the cluster configuration (i.e.,
> mapred.map.tasks). Unfortunately, the cluster configuration isn't
> available where the JobClient runs; it is only available on the cluster.
> In the past this didn't matter, because all the JobClient really needed
> from the configuration was the information for communicating with the
> cluster.

The computation of the splits was moved from the JobTracker to the client to offload the JobTracker and, more importantly, to remove the need to load user code in the JobTracker.
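
Concretely, the client-side flow is roughly this (a simplified sketch using the mapred API; the real JobClient does more bookkeeping around serializing the splits for the JobTracker):

    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitSketch {
      // Sketch: the JobClient loads the user's InputFormat locally and
      // computes the splits before the job is submitted.
      static InputSplit[] computeSplits(JobConf job) throws java.io.IOException {
        InputFormat inputFormat = job.getInputFormat();
        // The second argument is only a hint; the InputFormat is free to
        // return more or fewer splits than requested.
        return inputFormat.getSplits(job, job.getNumMapTasks());
      }
    }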

I agree that, since the cluster size and composition are defined by the cluster, it would make sense to pass the capacity of the cluster back via the JobSubmissionProtocol, just as the name of the default file system is. (I created HADOOP-1100 for this.) I would pull the default values for mapred.{map,reduce}.tasks out of hadoop-default.xml and have JobConf return a number based on the cluster capacity when the user hasn't given a specific value.
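
To illustrate the idea (a hypothetical sketch of the proposal, not current code; I'm assuming ClusterStatus exposes the cluster's task capacity as getMaxTasks()):

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobConf;

    public class DefaultMaps {
      // Hypothetical: if the user hasn't set mapred.map.tasks explicitly,
      // derive a default from the cluster capacity reported back over the
      // JobSubmissionProtocol, instead of a hard-coded value from
      // hadoop-default.xml.
      static int getNumMapTasks(JobConf job, ClusterStatus cluster) {
        String userValue = job.get("mapred.map.tasks");
        if (userValue != null) {
          return Integer.parseInt(userValue);  // explicit user setting wins
        }
        // e.g. a couple of waves of maps across the cluster's task slots;
        // the multiplier here is arbitrary.
        return 2 * cluster.getMaxTasks();
      }
    }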

> In addition, doing the splits in the JobClient lets a locally set
> mapred.map.tasks value override the value set in hadoop-site.xml on the
> cluster, which seems like a bug.

Once the input splits are generated, the number of splits defines the number of maps. In my opinion, it is far less confusing for users to have conf.getNumMapTasks() return the real number of maps rather than the original hint.
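
That is, as a sketch of the behavior I'm arguing for (a fragment, with MyJob standing in for the user's job class):

    JobConf job = new JobConf(MyJob.class);  // MyJob is a placeholder
    job.setNumMapTasks(10);                  // the user's hint
    JobClient.runJob(job);
    // After the splits have been computed, getNumMapTasks() should report
    // the actual number of splits, which may well differ from the hint of
    // 10, rather than the original hint.
    int realMaps = job.getNumMapTasks();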

-- Owen
