The task subdirectories are being deleted, but the job directory and its work subdirectory are not. This is causing a problem since disk space is filling up over time, and restarting the cluster after a long time is very slow as the tasktrackers clear out the jobcache directories.
This doesn't happen for every task run by a tasktracker, but it is happening to a significant number. I think it has something to do with the KillJobAction not being called because if I grep the log for lines from the relevant job containing "Kill" I see this: 2008-07-01 10:15:04,046 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction: task_200806300936_0279_r_000000_0 2008-07-01 10:15:16,223 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction: task_200806300936_0279_r_000001_0 2008-07-01 10:15:31,556 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction: task_200806300936_0279_r_000003_0 2008-07-01 10:15:39,882 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction: task_200806300936_0279_r_000002_0 2008-07-01 10:15:41,863 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction: task_200806300936_0279_r_000004_0 2008-07-01 10:15:51,484 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction: task_200806300936_0279_r_000006_0 2008-07-01 10:15:51,939 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0214:localhost/127.0.0.1:41484 -> KillTaskAction: task_200806300936_0279_r_000007_0 2008-07-01 10:15:59,695 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0207:localhost/127.0.0.1:37241 -> KillTaskAction: task_200806300936_0279_r_000008_0 2008-07-01 10:16:45,620 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0202:localhost/127.0.0.1:47183 -> KillTaskAction: task_200806300936_0279_r_000005_0 2008-07-01 10:16:47,328 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0216:localhost/127.0.0.1:37282 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:47,334 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f020c:localhost/127.0.0.1:52033 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:47,453 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0210:localhost/127.0.0.1:35235 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:47,768 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f020d:localhost/127.0.0.1:41562 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:48,652 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0203:localhost/127.0.0.1:65277 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:49,005 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0205:localhost/127.0.0.1:48747 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:49,365 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0209:localhost/127.0.0.1:59538 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:49,563 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0214:localhost/127.0.0.1:41484 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:49,747 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f020a:localhost/127.0.0.1:40410 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:50,321 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0212:localhost/127.0.0.1:33514 -> KillJobAction: job_200806300936_0279 2008-07-01 10:16:50,352 DEBUG org.apache.hadoop.mapred.JobTracker: tracker_m0f0207:localhost/127.0.0.1:37241 -> KillJobAction: job_200806300936_0279 Notice that tracker_m0f0202 receives a KillTaskAction (which removes the task working directory), but not a KillJobAction (which would remove the job directory). All the other trackers received KillJobAction. I'm not sure what's happening here to cause this. This is on 0.16.4. Anyone else seen this? Tom
