We finally upgraded our Hadoop install from 9.2 to 12.0. It went pretty smoothly. Kudos to all. However, one change in behavior of the JobClient seems like a problem.
In our install we have several clusters running at different locations, with different sizes and characteristics. We submit jobs to these clusters from a separate, central point using JobClient. In the past, all we had to do was point JobClient at the right JobTracker and submit the job; the -jt flag to JobClient makes this simple.

However, the JobClient now computes the task splits at the central point rather than at the JobTracker. That step involves looking up the default number of map tasks in the cluster configuration (i.e. mapred.map.tasks). Unfortunately, the cluster configuration isn't available where we run the JobClient; it is only available at the cluster. In the past this didn't matter, because all the JobClient really needed from the configuration was the information required to communicate with the JobTracker.

For things to work right, we now need to maintain a separate configuration for every cluster at the central point, and at every other place where we might want to use JobClient. It was much simpler when we could use a single central config to submit jobs to all clusters. It might be good to keep cluster-specific configuration parameters from being needed to submit a job using JobClient.

In addition, doing the splits in the JobClient lets a locally set mapred.map.tasks value override the value set in hadoop-site.xml on the cluster, which seems like a bug.

Jim Firby
Powerset Inc.
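
The effect described above can be illustrated with a small, self-contained sketch. It deliberately does not use the Hadoop API: java.util.Properties stands in for Hadoop's layered configuration, and the class name, the method names, and the values 2, 40, and 7 are all hypothetical, chosen only to show which value wins depending on where the splits are computed.

```java
import java.util.Properties;

public class SplitConfigSketch {

    // Hypothetical stand-in for a layered configuration:
    // any value present in `site` wins over the one in `defaults`.
    public static Properties layered(Properties defaults, Properties site) {
        Properties merged = new Properties(defaults);
        for (String name : site.stringPropertyNames()) {
            merged.setProperty(name, site.getProperty(name));
        }
        return merged;
    }

    // The map-task count that a split computation would read from `conf`.
    public static int mapTaskHint(Properties conf) {
        return Integer.parseInt(conf.getProperty("mapred.map.tasks"));
    }

    public static void main(String[] args) {
        // Assumed built-in default, for illustration only.
        Properties defaults = new Properties();
        defaults.setProperty("mapred.map.tasks", "2");

        // Hypothetical value set in hadoop-site.xml on the cluster.
        Properties clusterSite = new Properties();
        clusterSite.setProperty("mapred.map.tasks", "40");

        // The central submission point has no cluster-specific settings.
        Properties centralSite = new Properties();

        // A submission point that happens to set its own value locally.
        Properties localOverride = new Properties();
        localOverride.setProperty("mapred.map.tasks", "7");

        System.out.println("splits computed with the cluster's config: "
                + mapTaskHint(layered(defaults, clusterSite)));
        System.out.println("splits computed at the central point: "
                + mapTaskHint(layered(defaults, centralSite)));
        System.out.println("splits computed with a local override: "
                + mapTaskHint(layered(defaults, localOverride)));
    }
}
```

In this sketch the cluster's value is only seen when the split computation reads the cluster's own configuration; at the central point the result depends entirely on whatever happens to be configured locally.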
