Hi, so I have that strange behavior that once in a blue moon, the mesos executor decides to not send a "launchTask" message to my executor.
This is what I see (mesos-slave.INFO): I0722 02:09:34.524090 6769 slave.cpp:933] Got assigned task 83ea269c-8988-49bd-9d23-034c33858352 for framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:34.524222 6769 slave.cpp:1043] Launching task 83ea269c-8988-49bd-9d23-034c33858352 for framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:34.524915 6769 slave.cpp:1153] Queuing task '83ea269c-8988-49bd-9d23-034c33858352' for executor candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4 of framework '20140711-183251-3078776074-5050-5240-0083 I0722 02:09:34.524968 6772 mesos_containerizer.cpp:537] Starting container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework '20140711-183251-3078776074-5050-5240-0083' I0722 02:09:34.527037 6772 launcher.cpp:117] Forked child with pid '428' for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' I0722 02:09:34.527452 6772 mesos_containerizer.cpp:647] Fetching URIs for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' using command '/usr/local/libexec/mesos/mesos-fetcher' I0722 02:09:35.237933 6767 slave.cpp:2355] Monitoring executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework '20140711-183251-3078776074-5050-5240-0083' in container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' I0722 02:09:36.131356 6767 slave.cpp:1664] Got registration for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:36.131597 6767 slave.cpp:1783] Flushing queued task 83ea269c-8988-49bd-9d23-034c33858352 for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework 20140711-183251-3078776074-5050-5240-0083 For a working task, it looks like this: I0722 02:08:34.484169 6773 slave.cpp:933] Got assigned task c448be94-cd7f-4623-8348-21cf95651ee8 for framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:34.484282 6773 slave.cpp:1043] Launching task c448be94-cd7f-4623-8348-21cf95651ee8 for framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:34.486240 6774 mesos_containerizer.cpp:537] Starting container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' for executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework '20140711-183251-3078776074-5050-5240-0083' I0722 02:08:34.487082 6774 launcher.cpp:117] Forked child with pid '32644' for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' I0722 02:08:34.487540 6774 mesos_containerizer.cpp:647] Fetching URIs for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' using command '/usr/local/libexec/mesos/mesos-fetcher' I0722 02:08:34.489589 6773 slave.cpp:1153] Queuing task 'c448be94-cd7f-4623-8348-21cf95651ee8' for executor candyland_e8741342-35af-407b-9986-fef48f432ae8 of framework '20140711-183251-3078776074-5050-5240-0083 I0722 02:08:35.193210 6771 slave.cpp:2355] Monitoring executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework '20140711-183251-3078776074-5050-5240-0083' in container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' I0722 02:08:35.970985 6768 slave.cpp:1664] Got registration for executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:35.971294 6768 slave.cpp:1783] Flushing queued task c448be94-cd7f-4623-8348-21cf95651ee8 for executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:36.099797 6771 slave.cpp:2018] Handling status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 from [email protected]:37954 I0722 02:08:36.099890 6771 status_update_manager.cpp:320] Received status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:36.100342 6771 status_update_manager.cpp:373] Forwarding status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:5050 I0722 02:08:36.100461 6771 slave.cpp:2145] Sending acknowledgement for status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:37954 I0722 02:08:36.103785 6769 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:36.158952 6769 slave.cpp:2018] Handling status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 from [email protected]:37954 I0722 02:08:36.159203 6769 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:08:36.159416 6769 status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:5050 I0722 02:08:36.159515 6769 slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:37954 I0722 02:08:36.162083 6774 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:40.064846 6772 slave.cpp:2018] Handling status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 from [email protected]:37954 I0722 02:09:40.065062 6772 status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:40.065203 6772 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:5050 I0722 02:09:40.065286 6772 slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:37954 I0722 02:09:40.068207 6768 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:40.241500 6768 mesos_containerizer.cpp:1108] Executor for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' has exited I0722 02:09:40.243710 6768 mesos_containerizer.cpp:992] Destroying container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' On the executor side, for the failed task, I see 02:09:35.509 [main] INFO c.g.c.mesos.CandylandExecutor - *** Candyland Executor Main running 02:09:35.620 [main] INFO c.g.c.mesos.CandylandExecutor - *** Entering Driver loop 02:09:35.663 [main] INFO org.xnio - XNIO version 3.2.0.Final 02:09:35.754 [main] INFO org.xnio.nio - XNIO NIO Implementation Version 3.2.0.Final 02:09:35.894 [executor-driver-0] DEBUG c.g.mesos.util.HttpProtocolSender - Sending to URL http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage: framework_id { value: "20140711-183251-3078776074-5050-5240-0083" } executor_id { value: "candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4" } 02:09:36.146 [OkHttp http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage] DEBUG c.g.mesos.util.HttpProtocolSender - Response POST http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage: 202 but nothing else. For the succeeding task, I see 02:08:35.437 [main] INFO c.g.c.mesos.CandylandExecutor - *** Candyland Executor Main running 02:08:35.571 [main] INFO c.g.c.mesos.CandylandExecutor - *** Entering Driver loop 02:08:35.614 [main] INFO org.xnio - XNIO version 3.2.0.Final 02:08:35.673 [main] INFO org.xnio.nio - XNIO NIO Implementation Version 3.2.0.Final 02:08:35.820 [executor-driver-0] DEBUG c.g.mesos.util.HttpProtocolSender - Sending to URL http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage: framework_id { value: "20140711-183251-3078776074-5050-5240-0083" } executor_id { value: "candyland_e8741342-35af-407b-9986-fef48f432ae8" } 02:08:35.979 [OkHttp http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage] DEBUG c.g.mesos.util.HttpProtocolSender - Response POST http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage: 202 02:08:36.062 [executor-callback-0] INFO c.g.c.mesos.CandylandExecutor - *** Launch Task, driver is com.groupon.mesos.JesosExecutorDriver@10ae9677 02:08:36.075 [executor-callback-1] INFO c.g.c.mesos.CandylandExecutor - *** Registered, driver is com.groupon.mesos.JesosExecutorDriver@10ae9677 02:08:36.096 [executor-driver-0] DEBUG c.g.mesos.util.HttpProtocolSender - Sending to URL http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage: update { framework_id { value: "20140711-183251-3078776074-5050-5240-0083" } executor_id { value: "candyland_e8741342-35af-407b-9986-fef48f432ae8" } slave_id { value: "20140711-183251-3078776074-5050-5240-0" } status { task_id { value: "c448be94-cd7f-4623-8348-21cf95651ee8" } state: TASK_STARTING timestamp: 1.405994916E9 } timestamp: 1.405994916E9 uuid: "\001\226\302}qiN-\225\264,\325\a\nh\365" } pid: "[email protected]:37954" 02:08:36.100 [OkHttp http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage] DEBUG c.g.mesos.util.HttpProtocolSender - Response POST http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage: 202 For a successful task, my executor receives a launchTask message right after the registerExecutor message. For the failing one, I never receive launchTask from the slave. And the task hangs around in STARTING forever. I can somewhat reliably reproduce this with mesos 0.19.0 running in vagrant (it will show up running my test within an hour). Running mesos 0.19.1 on bare metal, I could not reproduce it in many hours of starting and stopping tasks, so it might be gone in 0.19.1. If that behaviour is to be expected (with messages being unreliable), should I more aggressively try to reconcile these starting tasks? Right now, I only reconcile tasks that at least returned a single status update message from the master. The only data point that I have is that for a failed taks, I see additional messages from other tasks interspersed in the log. For a succeeding task, there is no other task doing updates. This is what I saw interspersed with the log messages of the failed task. I0722 02:09:35.071388 6767 slave.cpp:1664] Got registration for executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:35.071637 6767 slave.cpp:1783] Flushing queued task dc5677b3-5b3b-4c3d-b750-19b5edea42eb for executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:35.237853 6771 mesos_containerizer.cpp:1108] Executor for container '97af22bb-cc60-4e9b-93ca-22fab235fd39' has exited I0722 02:09:35.240733 6771 mesos_containerizer.cpp:992] Destroying container '97af22bb-cc60-4e9b-93ca-22fab235fd39' I0722 02:09:35.251838 6769 slave.cpp:2413] Executor 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework 20140711-183251-3078776074-5050-5240-0083 has exited with status 0 I0722 02:09:35.252343 6769 slave.cpp:2552] Cleaning up executor 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework 20140711-183251-3078776074-5050-5240-0083 I0722 02:09:35.252504 6769 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5/runs/97af22bb-cc60-4e9b-93ca-22fab235fd39' for gc 6.99999707835556days in the future I0722 02:09:35.252594 6769 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' for gc 6.9999970780563days in the future Sorry for being somewhat vague, I am chasing this bug for while now and I would like to see whether this is me not understanding what is going on or an actual bug. Thanks, Henning (I can make the candyland code available if that helps but it is really ugly test hacking code and it shows how little I know about Mesos. ;-) )
