Hi folks,

During our tests with the Hadoop framework, I found that one of the slaves does not report task status correctly to the master. As a result, the slave hangs and cannot launch new TaskTrackers for incoming jobs.
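For context, my understanding (please correct me if I have this wrong) is that the executor is supposed to report a terminal state for a finished TaskTracker back through the Mesos executor API, roughly like the sketch below (the class and method names are just mine for illustration):

    import org.apache.mesos.ExecutorDriver;
    import org.apache.mesos.Protos;

    // Sketch only: how I understand an executor reports a terminal state
    // for a task (e.g. Task_Tracker_93) before it exits. If this update
    // is lost or never acknowledged, the slave and master can keep
    // believing the task is still running.
    class ShutdownReporter {
        static void reportFinished(ExecutorDriver driver, Protos.TaskID taskId) {
            Protos.TaskStatus status = Protos.TaskStatus.newBuilder()
                .setTaskId(taskId)
                .setState(Protos.TaskState.TASK_FINISHED)
                .build();
            driver.sendStatusUpdate(status);
        }
    }

If that is the path these updates take, then for Task_Tracker_93 and Task_Tracker_102 either the update was never sent or it was dropped somewhere between executor and master.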
The slave page of the web UI (http://hd1dz.prod.mediav.com:5050/) looks like this; the slave believes Task_Tracker_93 and Task_Tracker_102 are still running, but both of those TaskTrackers have actually shut down:

    ID                         Name                Source            Active Tasks  Queued Tasks  CPUs (Used / Allocated)  Mem (Used / Allocated)
    executor_Task_Tracker_93   Hadoop TaskTracker  Task_Tracker_93   0             0             6.635 / 1                5 GB / 1 GB
    executor_Task_Tracker_124  Hadoop TaskTracker  Task_Tracker_124  0             1             / 1                      / 1 GB
    executor_Task_Tracker_115  Hadoop TaskTracker  Task_Tracker_115  0             1             / 1                      / 1 GB
    executor_Task_Tracker_102  Hadoop TaskTracker  Task_Tracker_102  0             0             8.886 / 1                6 GB / 1 GB
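To cross-check from the master's side, I also dumped the master's state and grepped for the stuck trackers with something like the command below (I am assuming /state.json is the right endpoint in this release):

    curl -s http://hd1dz.prod.mediav.com:5050/state.json | grep Task_Tracker_93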
I checked the log for Task_Tracker_93, and I am sure it shows the TaskTracker shutting down gracefully. It is as follows:

    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000233_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000075_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000041_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000003_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000226_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000223_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000246_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000124_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000019_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000169_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000134_0 not found in cache
    13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000076_0 not found in cache
    13/05/23 09:50:30 INFO mapred.UserLogCleaner: Adding job_201305221443_0242 for user-log deletion with retainTimeStamp:1369360230977
    13/05/23 09:50:30 INFO util.AsyncDiskService: Shutting down all AsyncDiskService threads...
    13/05/23 09:50:30 INFO util.AsyncDiskService: All AsyncDiskService threads are terminated.
    13/05/23 09:50:30 INFO util.MRAsyncDiskService: Deleting toBeDeleted directory.

Is this related to the "--executor_shutdown_grace_period" parameter? I see its default value is 5 seconds. If the executor only finishes shutting down after those 5 seconds have passed, what happens then?
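If the grace period turns out to be the problem, I assume we can raise it when starting the slave, something like this (60secs is just an example value, using the Duration syntax as I understand it):

    mesos-slave --master=<master> --executor_shutdown_grace_period=60secs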
Thanks,
Guodong