There were no bugs/fixes related to slave<->executor message passing in 0.19.1. So, I'm surprised you weren't able to repro this behavior in 0.19.1.
Looks like you are using pure scheduler/executor clients to talk to mesos? If yes, there might be a bug in your executor? Although, if it's in your executor that should be reproed in 0.19.1 as well. I would recommend running the default command executor to see if you are having issues with message passing in 0.19.0. If no issues, I would dig more closely into your executor client. On Wed, Jul 23, 2014 at 12:17 PM, Henning Schmiedehausen < [email protected]> wrote: > Hi, > > so I have that strange behavior that once in a blue moon, the mesos > executor decides to not send a "launchTask" message to my executor. > > This is what I see (mesos-slave.INFO): > > I0722 02:09:34.524090 6769 slave.cpp:933] Got assigned task > 83ea269c-8988-49bd-9d23-034c33858352 for framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:34.524222 6769 slave.cpp:1043] Launching task > 83ea269c-8988-49bd-9d23-034c33858352 for framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:34.524915 6769 slave.cpp:1153] Queuing task > '83ea269c-8988-49bd-9d23-034c33858352' for executor > candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4 of framework > '20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:34.524968 6772 mesos_containerizer.cpp:537] Starting > container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' for executor > 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework > '20140711-183251-3078776074-5050-5240-0083' > I0722 02:09:34.527037 6772 launcher.cpp:117] Forked child with pid > '428' for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' > I0722 02:09:34.527452 6772 mesos_containerizer.cpp:647] Fetching URIs > for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' using command > '/usr/local/libexec/mesos/mesos-fetcher' > I0722 02:09:35.237933 6767 slave.cpp:2355] Monitoring executor > 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework > '20140711-183251-3078776074-5050-5240-0083' in container > 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' > I0722 02:09:36.131356 6767 slave.cpp:1664] Got registration for > executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:36.131597 6767 slave.cpp:1783] Flushing queued task > 83ea269c-8988-49bd-9d23-034c33858352 for executor > 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework > 20140711-183251-3078776074-5050-5240-0083 > > > For a working task, it looks like this: > > I0722 02:08:34.484169 6773 slave.cpp:933] Got assigned task > c448be94-cd7f-4623-8348-21cf95651ee8 for framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:34.484282 6773 slave.cpp:1043] Launching task > c448be94-cd7f-4623-8348-21cf95651ee8 for framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:34.486240 6774 mesos_containerizer.cpp:537] Starting > container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' for executor > 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework > '20140711-183251-3078776074-5050-5240-0083' > I0722 02:08:34.487082 6774 launcher.cpp:117] Forked child with pid > '32644' for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' > I0722 02:08:34.487540 6774 mesos_containerizer.cpp:647] Fetching URIs > for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' using command > '/usr/local/libexec/mesos/mesos-fetcher' > I0722 02:08:34.489589 6773 slave.cpp:1153] Queuing task > 'c448be94-cd7f-4623-8348-21cf95651ee8' for executor > candyland_e8741342-35af-407b-9986-fef48f432ae8 of framework > '20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:35.193210 6771 slave.cpp:2355] Monitoring executor > 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework > '20140711-183251-3078776074-5050-5240-0083' in container > '1db76576-e0fb-42b5-b9c7-3877e79e35d5' > I0722 02:08:35.970985 6768 slave.cpp:1664] Got registration for > executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:35.971294 6768 slave.cpp:1783] Flushing queued task > c448be94-cd7f-4623-8348-21cf95651ee8 for executor > 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:36.099797 6771 slave.cpp:2018] Handling status update > TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 from > [email protected]:37954 > I0722 02:08:36.099890 6771 status_update_manager.cpp:320] Received > status update TASK_STARTING (UUID: > 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:36.100342 6771 status_update_manager.cpp:373] Forwarding > status update TASK_STARTING (UUID: > 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 to > [email protected]:5050 > I0722 02:08:36.100461 6771 slave.cpp:2145] Sending acknowledgement > for status update TASK_STARTING (UUID: > 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 to > [email protected]:37954 > I0722 02:08:36.103785 6769 status_update_manager.cpp:398] Received > status update acknowledgement (UUID: > 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:36.158952 6769 slave.cpp:2018] Handling status update > TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 from > [email protected]:37954 > I0722 02:08:36.159203 6769 status_update_manager.cpp:320] Received > status update TASK_RUNNING (UUID: > 888db1de-7a44-43f8-8025-f19f80e8d499) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:08:36.159416 6769 status_update_manager.cpp:373] Forwarding > status update TASK_RUNNING (UUID: > 888db1de-7a44-43f8-8025-f19f80e8d499) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 to > [email protected]:5050 > I0722 02:08:36.159515 6769 slave.cpp:2145] Sending acknowledgement > for status update TASK_RUNNING (UUID: > 888db1de-7a44-43f8-8025-f19f80e8d499) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 to > [email protected]:37954 > I0722 02:08:36.162083 6774 status_update_manager.cpp:398] Received > status update acknowledgement (UUID: > 888db1de-7a44-43f8-8025-f19f80e8d499) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:40.064846 6772 slave.cpp:2018] Handling status update > TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 from > [email protected]:37954 > I0722 02:09:40.065062 6772 status_update_manager.cpp:320] Received > status update TASK_FINISHED (UUID: > 40880600-8e57-4d5c-ae28-89ccb347aab1) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:40.065203 6772 status_update_manager.cpp:373] Forwarding > status update TASK_FINISHED (UUID: > 40880600-8e57-4d5c-ae28-89ccb347aab1) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 to > [email protected]:5050 > I0722 02:09:40.065286 6772 slave.cpp:2145] Sending acknowledgement > for status update TASK_FINISHED (UUID: > 40880600-8e57-4d5c-ae28-89ccb347aab1) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 to > [email protected]:37954 > I0722 02:09:40.068207 6768 status_update_manager.cpp:398] Received > status update acknowledgement (UUID: > 40880600-8e57-4d5c-ae28-89ccb347aab1) for task > c448be94-cd7f-4623-8348-21cf95651ee8 of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:40.241500 6768 mesos_containerizer.cpp:1108] Executor for > container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' has exited > I0722 02:09:40.243710 6768 mesos_containerizer.cpp:992] Destroying > container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' > > > On the executor side, for the failed task, I see > > 02:09:35.509 [main] INFO c.g.c.mesos.CandylandExecutor - *** > Candyland Executor Main running > 02:09:35.620 [main] INFO c.g.c.mesos.CandylandExecutor - *** Entering > Driver loop > 02:09:35.663 [main] INFO org.xnio - XNIO version 3.2.0.Final > 02:09:35.754 [main] INFO org.xnio.nio - XNIO NIO Implementation > Version 3.2.0.Final > 02:09:35.894 [executor-driver-0] DEBUG > c.g.mesos.util.HttpProtocolSender - Sending to URL > http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage > : > framework_id { > value: "20140711-183251-3078776074-5050-5240-0083" > } > executor_id { > value: "candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4" > } > > 02:09:36.146 [OkHttp > http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage > ] > DEBUG c.g.mesos.util.HttpProtocolSender - Response POST > http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage > : > 202 > > but nothing else. For the succeeding task, I see > > 02:08:35.437 [main] INFO c.g.c.mesos.CandylandExecutor - *** > Candyland Executor Main running > 02:08:35.571 [main] INFO c.g.c.mesos.CandylandExecutor - *** Entering > Driver loop > 02:08:35.614 [main] INFO org.xnio - XNIO version 3.2.0.Final > 02:08:35.673 [main] INFO org.xnio.nio - XNIO NIO Implementation > Version 3.2.0.Final > 02:08:35.820 [executor-driver-0] DEBUG > c.g.mesos.util.HttpProtocolSender - Sending to URL > http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage > : > framework_id { > value: "20140711-183251-3078776074-5050-5240-0083" > } > executor_id { > value: "candyland_e8741342-35af-407b-9986-fef48f432ae8" > } > > 02:08:35.979 [OkHttp > http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage > ] > DEBUG c.g.mesos.util.HttpProtocolSender - Response POST > http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage > : > 202 > 02:08:36.062 [executor-callback-0] INFO c.g.c.mesos.CandylandExecutor > - *** Launch Task, driver is > com.groupon.mesos.JesosExecutorDriver@10ae9677 > 02:08:36.075 [executor-callback-1] INFO c.g.c.mesos.CandylandExecutor > - *** Registered, driver is > com.groupon.mesos.JesosExecutorDriver@10ae9677 > 02:08:36.096 [executor-driver-0] DEBUG > c.g.mesos.util.HttpProtocolSender - Sending to URL > http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage: > update { > framework_id { > value: "20140711-183251-3078776074-5050-5240-0083" > } > executor_id { > value: "candyland_e8741342-35af-407b-9986-fef48f432ae8" > } > slave_id { > value: "20140711-183251-3078776074-5050-5240-0" > } > status { > task_id { > value: "c448be94-cd7f-4623-8348-21cf95651ee8" > } > state: TASK_STARTING > timestamp: 1.405994916E9 > } > timestamp: 1.405994916E9 > uuid: "\001\226\302}qiN-\225\264,\325\a\nh\365" > } > pid: "[email protected]:37954" > > 02:08:36.100 [OkHttp > http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage] > DEBUG c.g.mesos.util.HttpProtocolSender - Response POST > http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage: > 202 > > > For a successful task, my executor receives a launchTask message right > after the registerExecutor message. For the failing one, I never > receive launchTask from the slave. And the task hangs around in > STARTING forever. > > I can somewhat reliably reproduce this with mesos 0.19.0 running in > vagrant (it will show up running my test within an hour). Running > mesos 0.19.1 on bare metal, I could not reproduce it in many hours of > starting and stopping tasks, so it might be gone in 0.19.1. > > If that behaviour is to be expected (with messages being unreliable), > should I more aggressively try to reconcile these starting tasks? > Right now, I only reconcile tasks that at least returned a single > status update message from the master. > > The only data point that I have is that for a failed taks, I see > additional messages from other tasks interspersed in the log. For a > succeeding task, there is no other task doing updates. This is what I > saw interspersed with the log messages of the failed task. > > I0722 02:09:35.071388 6767 slave.cpp:1664] Got registration for > executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:35.071637 6767 slave.cpp:1783] Flushing queued task > dc5677b3-5b3b-4c3d-b750-19b5edea42eb for executor > 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:35.237853 6771 mesos_containerizer.cpp:1108] Executor for > container '97af22bb-cc60-4e9b-93ca-22fab235fd39' has exited > I0722 02:09:35.240733 6771 mesos_containerizer.cpp:992] Destroying > container '97af22bb-cc60-4e9b-93ca-22fab235fd39' > I0722 02:09:35.251838 6769 slave.cpp:2413] Executor > 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework > 20140711-183251-3078776074-5050-5240-0083 has exited with status 0 > I0722 02:09:35.252343 6769 slave.cpp:2552] Cleaning up executor > 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework > 20140711-183251-3078776074-5050-5240-0083 > I0722 02:09:35.252504 6769 gc.cpp:56] Scheduling > > '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5/runs/97af22bb-cc60-4e9b-93ca-22fab235fd39' > for gc 6.99999707835556days in the future > I0722 02:09:35.252594 6769 gc.cpp:56] Scheduling > > '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' > for gc 6.9999970780563days in the future > > Sorry for being somewhat vague, I am chasing this bug for while now > and I would like to see whether this is me not understanding what is > going on or an actual bug. > > Thanks, > Henning > > (I can make the candyland code available if that helps but it is > really ugly test hacking code and it shows how little I know about > Mesos. ;-) ) >
