Hi folks,

While testing the Hadoop-on-Mesos framework, I found that one of the slaves
does not report task status correctly to the master. As a result, the slave
hangs and cannot launch new TaskTrackers for incoming jobs.

The slave's web UI looks like this: the slave believes Task_Tracker_93 and
Task_Tracker_102 are still running, but both of those TaskTrackers have
actually shut down.

Executor ID                Name                Source            Active Tasks  Queued Tasks  CPUs (Used / Allocated)  Mem (Used / Allocated)
executor_Task_Tracker_93   Hadoop TaskTracker  Task_Tracker_93   0             0             6.635 / 1                5 GB / 1 GB
executor_Task_Tracker_124  Hadoop TaskTracker  Task_Tracker_124  0             1             / 1                      / 1 GB
executor_Task_Tracker_115  Hadoop TaskTracker  Task_Tracker_115  0             1             / 1                      / 1 GB
executor_Task_Tracker_102  Hadoop TaskTracker  Task_Tracker_102  0             0             8.886 / 1                6 GB / 1 GB
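The same executor list can be cross-checked outside the web UI. A sketch,
assuming a Mesos slave of this era serving its state on the default slave
port 5051 (the host name is taken from the UI above; the `python` one-liner
is just illustrative JSON pretty-printing):

```shell
# Fetch the slave's state and list the executors it still considers active.
# /state.json is the slave's introspection endpoint; adjust host/port to
# match your deployment.
curl -s http://hd1dz.prod.mediav.com:5051/state.json | python -m json.tool
```

If the slave still reports executor_Task_Tracker_93 and
executor_Task_Tracker_102 here after the TaskTrackers exited, the stale
state is on the slave side rather than a UI caching artifact.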

Checking the log for Task_Tracker_93, I am sure it shows the TaskTracker
shut down gracefully. It reads as follows:


13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000233_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000075_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000041_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000003_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000226_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000223_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000246_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000124_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000019_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000169_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000134_0 not found in cache
13/05/23 09:50:30 INFO mapred.IndexCache: Map ID attempt_201305221443_0242_m_000076_0 not found in cache
13/05/23 09:50:30 INFO mapred.UserLogCleaner: Adding job_201305221443_0242 for user-log deletion with retainTimeStamp:1369360230977
13/05/23 09:50:30 INFO util.AsyncDiskService: Shutting down all AsyncDiskService threads...
13/05/23 09:50:30 INFO util.AsyncDiskService: All AsyncDiskService threads are terminated.
13/05/23 09:50:30 INFO util.MRAsyncDiskService: Deleting toBeDeleted directory.


Is this related to the "--executor_shutdown_grace_period" parameter? I see
its default value is 5 seconds. If the executor takes longer than 5 seconds
to shut down, what happens then?
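In case it matters for reproducing this, the grace period is set when
starting the slave. A sketch, assuming the standard mesos-slave binary of
this era; the 30-second value is purely illustrative, not a recommendation:

```shell
# Start the slave with a longer executor shutdown grace period, so executors
# that take more than the default 5 seconds to exit are not killed early.
# Mesos duration flags take values like "5secs", "30secs", "1mins".
mesos-slave --master=hd1dz.prod.mediav.com:5050 \
            --executor_shutdown_grace_period=30secs
```

If the TaskTracker's graceful shutdown regularly exceeds the grace period,
I would expect the slave to escalate and kill the executor, but I would
like to confirm what status update (if any) the master receives in that case.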


Thanks.

Guodong
