On 28-Jul-08, at 6:33 PM, charles du wrote:
Hi: I tried to run one of my map/reduce jobs on a cluster (Hadoop 0.17.0). I used 10 reducers. Nine of them return quickly (in a few seconds), but one has been running for several hours with no sign of completion. Do you know how I can debug it or find out what is going on with this reducer?
You can log, and set the task's status message, which shows up per task in the job tracker's web UI. If you're using streaming, I think you're limited to writing to stderr. The only way I've found to read the logs on a distributed run is to ssh to the actual task box and look at the log directory. I've almost gotten frustrated enough to have my tasks send email, but not quite.
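For a Java job, something like this is what I mean (a minimal sketch against the old org.apache.hadoop.mapred API that 0.17 uses; the key/value types and the per-100k-values counting are made up for illustration, the Reporter calls are the real thing):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DebuggableReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    long seen = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      seen++;
      if (seen % 100000 == 0) {
        // Shows up in the task's status column in the web UI, so you can
        // see which key a stuck reducer is grinding on.
        reporter.setStatus("key=" + key + " values=" + seen);
        // Tells the framework the task is alive so it isn't killed for
        // failing to report progress.
        reporter.progress();
      }
    }
    // Goes to the task's stderr log on the tasktracker node.
    System.err.println("done key=" + key + " after " + seen + " values");
    output.collect(key, new LongWritable(sum));
  }
}

If the one slow reducer is chewing on a single giant key, the status messages will make that obvious without any log-spelunking.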
Debugging is easier on a single pseudo-distributed box because all the logs and stderr are right there, so try that if you can.
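If you haven't set one up, pseudo-distributed mode on 0.17 is just a few properties in conf/hadoop-site.xml (a sketch; localhost and these ports are the conventional choices, adjust to taste):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

All the daemons run on one machine, so every task log ends up under the local logs directory instead of scattered across the cluster.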