On Jul 17, 2006, at 1:36 AM, Thomas FRIOL wrote:

Hi all,

I am a new Hadoop user and I am now writing my own map/reduce operations, but it is hard for me to find out where the problem comes from when a job fails.
So my question is: what is the best way to debug a map/reduce job?

Ok, I should probably put this onto a wiki page, but my short answer is:

1. Start by getting everything running (likely on a small input) in the local runner. You do this by setting your job tracker to "local" in your config. The local runner can run under the debugger and is not distributed.
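Roughly, the driver for that looks like the sketch below. MyJob, MyMapper and MyReducer are stand-ins for your own classes, and the exact config keys can differ a bit between releases; the point is just setting the job tracker to "local":

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch of a debug driver: the whole job runs inside one JVM that you
// can attach a debugger to, reading and writing the local filesystem.
public class LocalDebugDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);
    conf.set("mapred.job.tracker", "local"); // use the local, non-distributed runner
    conf.set("fs.default.name", "local");    // local filesystem instead of DFS
    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    JobClient.runJob(conf);                  // blocks until the job finishes
  }
}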

2. Run the small input on a 1 node cluster. This will smoke out all of the issues that happen with distribution and the "real" task runner, but you only have a single place to look at logs. Most useful are the task and job tracker logs. Make sure you are logging at the INFO level or you will miss clues like the output of your tasks.
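For example, a mapper that logs what it is doing looks roughly like this; the messages end up in the per-task logs on the tasktracker. Treat it as a sketch, since the exact Mapper signature varies a little between releases:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch of a mapper that logs what it sees, so the task logs contain
// real clues when something goes wrong.
public class LoggingMapper extends MapReduceBase implements Mapper {

  private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    LOG.info("map input " + key + " -> " + value); // shows up as long as the level is INFO or lower
    output.collect(key, value);                    // identity map, just for the sketch
  }
}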

3. Run on a big cluster. Recently, I added the keep.failed.task.files config variable that tells the system to keep files for tasks that fail. This leaves "dead" files around that you can debug with. On the node with the failed task, go to the tasktracker's local directory, cd to <local>/taskTracker/<taskid>, and run:
% hadoop org.apache.hadoop.mapred.IsolationRunner job.xml
This will run the failed task in a single JVM, which you can put under the debugger, over precisely the same input.
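Turning that on only takes setting the config variable before you submit; the rest of this sketch is placeholder:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch: keep the working files of failed tasks around so they can be
// replayed with IsolationRunner afterwards.
public class KeepFailedFilesDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);     // MyJob is a placeholder
    conf.set("keep.failed.task.files", "true");  // keep files for failed tasks
    // ... set the mapper, reducer, input and output as usual ...
    JobClient.runJob(conf);
  }
}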

I also have a patch that will let you specify a task to keep, even if it doesn't fail. Other than that, logging is your friend.

I don't have issues with my log messages getting through, so you might check your filters. Exceptions are mostly handled right, but we've found and fixed spots where they weren't, so that is possible. Usually it involves someone throwing an unchecked exception like RuntimeException and a catch clause that only catches checked exceptions.
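For what it's worth, the pattern looks roughly like this (purely illustrative code, not from Hadoop itself): the catch only names the checked exception, so the unchecked one flies right past it.

import java.io.IOException;

// Illustration of the trap: lookup() can throw both a checked IOException
// and an unchecked IllegalArgumentException, but the caller only catches
// the checked one, so the unchecked exception escapes uncaught.
public class UncheckedEscapes {

  static String lookup(String key) throws IOException {
    if (key == null) {
      throw new IllegalArgumentException("null key"); // unchecked, not declared
    }
    if (key.length() == 0) {
      throw new IOException("empty key");             // checked, declared
    }
    return key.trim();
  }

  public static void main(String[] args) {
    String key = (args.length > 0) ? args[0] : null;
    try {
      System.out.println(lookup(key));
    } catch (IOException e) {             // only the checked exception is caught
      System.err.println("recovered from: " + e);
    }
    // Run with no arguments and the IllegalArgumentException escapes main,
    // just like an unchecked exception escaping a task's exception handling.
  }
}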

-- Owen
