Jeremy Bensley (sent by Nabble.com) wrote:
I have been experimenting with MapReduce to perform some distributed tasks
aside from the normal fetch/index routine of Nutch, and overall have had much
success.
I'm glad to hear this!
Today I have been experimenting with running extended-duration tasks, but have
run into issues with the tasks timing out. I tried overriding the
mapred.tasks.timeout option both in mapred-default.xml and in the code for my
Mapper class, but the timeout stayed at the default of 10 minutes.
I looked at TaskTracker and saw that it assigns some of the configuration
options to static variables and then uses those variables for comparison. I
have also seen that TaskTracker parses the configuration XML files each time a
new task is assigned; I assume this is so that the TaskTracker options can be
updated without restarting the process.
Code examples (from TaskTracker.java):

    private static final int MAX_CURRENT_TASKS =
        NutchConf.get().getInt("mapred.tasktracker.tasks.maximum", 2);

    static final long TASK_TIMEOUT =
        NutchConf.get().getInt("mapred.task.timeout", 10 * 60 * 1000);
It seems to me that these parameters should be fetched each time they are
needed instead of being stored in static fields and loaded only once. I am just
getting my feet wet with the whole MapReduce thing, so if this is the intended
behaviour then I apologise.
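As a rough illustration of the difference between the two approaches, here is
a sketch only. It assumes java.util.Properties as a stand-in for NutchConf, and
the TimeoutLookup class and its method names are made up for the example; none
of this is taken from TaskTracker itself.

```java
import java.util.Properties;

/** Hypothetical sketch; java.util.Properties stands in for NutchConf. */
class TimeoutLookup {
    static final String KEY = "mapred.task.timeout";
    static final long DEFAULT_MS = 10 * 60 * 1000L; // 10 minutes

    private final Properties conf;
    // Read once at construction, mimicking a value cached in a static field.
    private final long cachedTimeoutMs;

    TimeoutLookup(Properties conf) {
        this.conf = conf;
        this.cachedTimeoutMs = read(conf);
    }

    /** Returns the value captured at construction; later changes are not seen. */
    long cachedTimeout() {
        return cachedTimeoutMs;
    }

    /** Re-reads the configuration on every call, so overrides take effect. */
    long currentTimeout() {
        return read(conf);
    }

    private static long read(Properties conf) {
        String value = conf.getProperty(KEY);
        return value == null ? DEFAULT_MS : Long.parseLong(value.trim());
    }
}
```

With this sketch, setting the property after the object is built changes what
currentTimeout() returns but not what cachedTimeout() returns, which matches
the behaviour described above for the cached value.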
For the task timeout, I agree, this would be a good idea. It would
require some changes to the TaskTracker, so that a separate timeout
could be kept for each running task.
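
A minimal sketch of what keeping a separate timeout per task might look like
follows. The TaskTimeouts class, its methods, and the use of string task ids
are all assumptions made for illustration, not the existing TaskTracker code;
the caller is assumed to pass in the current time in milliseconds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: tracks a separate timeout for each running task. */
class TaskTimeouts {
    /** Bookkeeping for one task: its own timeout and last progress report. */
    private static final class Entry {
        final long timeoutMs;
        volatile long lastProgressMs;

        Entry(long timeoutMs, long nowMs) {
            this.timeoutMs = timeoutMs;
            this.lastProgressMs = nowMs;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    /** Called when a task starts, with the timeout taken from that task's job. */
    void register(String taskId, long timeoutMs, long nowMs) {
        entries.put(taskId, new Entry(timeoutMs, nowMs));
    }

    /** Called whenever the task reports progress; resets its clock. */
    void reportProgress(String taskId, long nowMs) {
        Entry e = entries.get(taskId);
        if (e != null) {
            e.lastProgressMs = nowMs;
        }
    }

    /** True if the task is known and has been silent longer than its timeout. */
    boolean isTimedOut(String taskId, long nowMs) {
        Entry e = entries.get(taskId);
        return e != null && nowMs - e.lastProgressMs > e.timeoutMs;
    }

    /** Called when a task finishes or is killed. */
    void remove(String taskId) {
        entries.remove(taskId);
    }
}
```

The point of the design is only that the comparison uses a value stored with
each task rather than one shared constant, so two tasks from different jobs
can time out after different intervals.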
I'm not so sure about the maximum tasks per TaskTracker. The best value is
probably node-specific (typically something a bit larger than the number
of processors). Even if it were job-specific, a TaskTracker can, in
theory, be running tasks from different jobs at the same time. Unless
we want to prohibit that, a single limit on the number of tasks to run
concurrently is required. How would you vary this by job?
Also, is this the proper place to report (possible) bugs, or should I just go
directly to the bug reporting system, even if it's not a verified issue?
This is a fine place. Typically one should first check the bug
database, then, if nothing is found, either file a bug or send an
inquiry to the list. The best way to get a bug fixed is to submit a
patch that fixes it.
Cheers,
Doug