On Jul 17, 2006, at 1:36 AM, Thomas FRIOL wrote:
Hi all,
I am a new Hadoop user and I am now writing my own map/reduce
operations, but it is hard for me to find out where the problem
comes from when a job fails.
So my question is: what is the best way to debug a map/reduce job?
OK, I should probably put this on a wiki page, but my short answer is:
1. Start by getting everything running (likely on a small input) in the
local runner. You do this by setting your
job tracker to "local" in your config. The local runner can run under
the debugger and is not distributed.
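For example, step 1 in a job driver might look roughly like the sketch
below. This is only a sketch against the old org.apache.hadoop.mapred
API; LocalDebugDriver, MyMapper and MyReducer are hypothetical stand-ins
for your own classes, and the property names are the ones this era of
Hadoop understands:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LocalDebugDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LocalDebugDriver.class);
        conf.set("mapred.job.tracker", "local");  // run map and reduce in this JVM
        conf.set("fs.default.name", "local");     // use the local filesystem, too
        conf.setMapperClass(MyMapper.class);      // your own mapper class
        conf.setReducerClass(MyReducer.class);    // your own reducer class
        // set input/output paths, key/value classes, etc. as usual, then:
        JobClient.runJob(conf);  // runs inline, so breakpoints in map()/reduce() fire
      }
    }

With the tracker set to "local" you can launch main() straight from your
IDE's debugger.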
2. Run the small input on a 1 node cluster. This will smoke out all of
the issues that happen with distribution and the "real" task runner,
but you only have a single place to look at logs. Most useful are the
task and job tracker logs. Make sure you are logging at the INFO level
or you will miss clues like the output of your tasks.
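As an illustration of the logging point, a map task can write to the
task logs through the same commons-logging Log that Hadoop uses
internally. The class below is a made-up example (the word-count-ish
types are just for concreteness), written against the later, generic
form of the org.apache.hadoop.mapred API:

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LoggingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        // shows up in the task logs on the task tracker, but only if the
        // log level is INFO or finer
        LOG.info("processing record at offset " + key);
        output.collect(value, new LongWritable(1));
      }
    }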
3. Run on a big cluster. Recently, I added the keep.failed.task.files
config variable that tells the system to keep files for tasks that
fail. This leaves "dead" files around that you can debug with. On the
node with the failed task, go to the task tracker's local directory and
cd to <local>/taskTracker/<taskid> and run
% hadoop org.apache.hadoop.mapred.IsolationRunner job.xml
This will run the failed task in a single jvm, which can be in the
debugger, over precisely the same input.
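If it helps, keep.failed.task.files can be set on the job like any other
config property; a minimal sketch, with a hypothetical helper name:

    import org.apache.hadoop.mapred.JobConf;

    public class KeepFailedTaskFiles {
      // call this on your JobConf before submitting the job
      static JobConf keepFilesOnFailure(JobConf conf) {
        // tells the task tracker not to clean up the working directory
        // (job.xml, split data, etc.) of a task that fails, so that
        // IsolationRunner can re-run it later in a single JVM
        conf.set("keep.failed.task.files", "true");
        return conf;
      }
    }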
I also have a patch that will let you specify a task to keep, even if
it doesn't fail. Other than that, logging is your friend.
I don't have issues with my log messages getting through, so you might
check your filters. Exceptions are mostly handled right, but we've
found and fixed spots where they weren't, so that is possible. Usually
it involves someone throwing an unchecked exception like RuntimeException
and the catch only catching checked exceptions.
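For what it's worth, the pattern usually looks something like the
contrived example below (nothing here is Hadoop code): the catch only
accounts for the checked exception the method declares, so the unchecked
one sails straight through.

    import java.io.IOException;

    public class SwallowedException {
      // pretends to do I/O, but what it actually throws on bad input is
      // an unchecked NumberFormatException from parseInt
      static int parse(String record) throws IOException {
        return Integer.parseInt(record);
      }

      public static void main(String[] args) {
        try {
          parse("not-a-number");
        } catch (IOException e) {
          // only the checked exception is handled; the RuntimeException
          // escapes this block entirely, so nothing here ever logs it
          System.err.println("handled: " + e);
        }
      }
    }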
-- Owen