Hey Vinod, thanks for looking into this. I am not sure that this is related to the actual executor implementation, because what I am not seeing is the "launchTask" message from the slave. The only significant change over using the native code is that the slave must use "real" HTTP to launch the task on my executor (which should not matter: in the worst case it would ignore the 202 ACCEPTED response from the executor implementation). As the slave does not wait for any acknowledgement from the executor anyway, a dropped or ignored response should not affect delivery.
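
Just for context, this is roughly what happens on the wire when my executor registers. It is a simplified sketch, not the actual jesos code (the class and method names are made up), and the headers reflect my understanding of the libprocess HTTP transport (sender pid carried in the User-Agent, protobuf body POSTed to /<target-process>/<fully.qualified.MessageName>) rather than anything Mesos documents:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal sketch of delivering a libprocess message to the slave over plain HTTP.
// sendMessage/LibprocessHttpSender are illustrative names, not the jesos API.
public final class LibprocessHttpSender
{
    public static void sendMessage(final String slaveHostPort,   // e.g. "10.101.130.183:5051"
                                   final String messageName,     // e.g. "mesos.internal.RegisterExecutorMessage"
                                   final String executorPid,     // e.g. "executor(1)@<executor-host>:<port>"
                                   final byte[] serializedProtobuf) throws Exception
    {
        final URL url = new URL("http://" + slaveHostPort + "/slave(1)/" + messageName);
        final HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setDoOutput(true);
        // Assumption: libprocess identifies the sender from a "libprocess/<pid>" User-Agent.
        connection.setRequestProperty("User-Agent", "libprocess/" + executorPid);
        connection.setRequestProperty("Content-Type", "application/x-protobuf");

        try (OutputStream out = connection.getOutputStream()) {
            out.write(serializedProtobuf);
        }

        // libprocess answers "202 Accepted" and that is all we ever get back;
        // dispatch into the slave actor is asynchronous.
        final int responseCode = connection.getResponseCode();
        if (responseCode != 202) {
            throw new IllegalStateException("unexpected response: " + responseCode);
        }
        connection.disconnect();
    }
}

The 202 only tells me that libprocess accepted the request; whether the slave actor ever processes the message is entirely asynchronous, which is why I cannot tell from the executor side whether the registration actually went through.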
The trace seems to imply that the slave, even though it acknowledged the executor registration, fails to schedule the task. So I was wondering whether the registerExecutor message might be hitting the slave at just the wrong moment and somehow getting lost inside it. My understanding is that, once the slave processes the registerExecutor message, it will then launch the queued task on that executor. And that is the part that never happens. But the slave did see the executor registration, because its INFO log contains

I0722 02:09:36.131356 6767 slave.cpp:1664] Got registration for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework 20140711-183251-3078776074-5050-5240-0083

That is why I assume that this is not related to the executor implementation but may be an actual Mesos bug. The 0.19 vs 0.19.1 difference might be a red herring, I agree.

On Wed, Jul 23, 2014 at 12:43 PM, Vinod Kone <[email protected]> wrote:

> There were no bugs/fixes related to slave<->executor message passing in 0.19.1. So, I'm surprised you weren't able to repro this behavior in 0.19.1.
>
> Looks like you are using pure scheduler/executor clients to talk to mesos? If yes, there might be a bug in your executor? Although, if it's in your executor that should be reproed in 0.19.1 as well.
>
> I would recommend running the default command executor to see if you are having issues with message passing in 0.19.0. If no issues, I would dig more closely into your executor client.
>
> On Wed, Jul 23, 2014 at 12:17 PM, Henning Schmiedehausen <[email protected]> wrote:
>
>> Hi,
>>
>> so I have that strange behavior that once in a blue moon, the mesos slave decides to not send a "launchTask" message to my executor.
>>
>> This is what I see (mesos-slave.INFO):
>>
>> I0722 02:09:34.524090 6769 slave.cpp:933] Got assigned task 83ea269c-8988-49bd-9d23-034c33858352 for framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:34.524222 6769 slave.cpp:1043] Launching task 83ea269c-8988-49bd-9d23-034c33858352 for framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:34.524915 6769 slave.cpp:1153] Queuing task '83ea269c-8988-49bd-9d23-034c33858352' for executor candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4 of framework '20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:34.524968 6772 mesos_containerizer.cpp:537] Starting container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework '20140711-183251-3078776074-5050-5240-0083'
>> I0722 02:09:34.527037 6772 launcher.cpp:117] Forked child with pid '428' for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
>> I0722 02:09:34.527452 6772 mesos_containerizer.cpp:647] Fetching URIs for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' using command '/usr/local/libexec/mesos/mesos-fetcher'
>> I0722 02:09:35.237933 6767 slave.cpp:2355] Monitoring executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework '20140711-183251-3078776074-5050-5240-0083' in container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
>> I0722 02:09:36.131356 6767 slave.cpp:1664] Got registration for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:36.131597 6767 slave.cpp:1783] Flushing queued task 83ea269c-8988-49bd-9d23-034c33858352 for executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework 20140711-183251-3078776074-5050-5240-0083
>>
>> For a working task, it looks like this:
>>
>> I0722 02:08:34.484169 6773 slave.cpp:933] Got assigned task c448be94-cd7f-4623-8348-21cf95651ee8 for framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:34.484282 6773 slave.cpp:1043] Launching task c448be94-cd7f-4623-8348-21cf95651ee8 for framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:34.486240 6774 mesos_containerizer.cpp:537] Starting container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' for executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework '20140711-183251-3078776074-5050-5240-0083'
>> I0722 02:08:34.487082 6774 launcher.cpp:117] Forked child with pid '32644' for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>> I0722 02:08:34.487540 6774 mesos_containerizer.cpp:647] Fetching URIs for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' using command '/usr/local/libexec/mesos/mesos-fetcher'
>> I0722 02:08:34.489589 6773 slave.cpp:1153] Queuing task 'c448be94-cd7f-4623-8348-21cf95651ee8' for executor candyland_e8741342-35af-407b-9986-fef48f432ae8 of framework '20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:35.193210 6771 slave.cpp:2355] Monitoring executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework '20140711-183251-3078776074-5050-5240-0083' in container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>> I0722 02:08:35.970985 6768 slave.cpp:1664] Got registration for executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:35.971294 6768 slave.cpp:1783] Flushing queued task c448be94-cd7f-4623-8348-21cf95651ee8 for executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.099797 6771 slave.cpp:2018] Handling status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 from [email protected]:37954
>> I0722 02:08:36.099890 6771 status_update_manager.cpp:320] Received status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.100342 6771 status_update_manager.cpp:373] Forwarding status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:5050
>> I0722 02:08:36.100461 6771 slave.cpp:2145] Sending acknowledgement for status update TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:37954
>> I0722 02:08:36.103785 6769 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.158952 6769 slave.cpp:2018] Handling status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 from [email protected]:37954
>> I0722 02:08:36.159203 6769 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.159416 6769 status_update_manager.cpp:373] Forwarding status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:5050
>> I0722 02:08:36.159515 6769 slave.cpp:2145] Sending acknowledgement for status update TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:37954
>> I0722 02:08:36.162083 6774 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:40.064846 6772 slave.cpp:2018] Handling status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 from [email protected]:37954
>> I0722 02:09:40.065062 6772 status_update_manager.cpp:320] Received status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:40.065203 6772 status_update_manager.cpp:373] Forwarding status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:5050
>> I0722 02:09:40.065286 6772 slave.cpp:2145] Sending acknowledgement for status update TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083 to [email protected]:37954
>> I0722 02:09:40.068207 6768 status_update_manager.cpp:398] Received status update acknowledgement (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task c448be94-cd7f-4623-8348-21cf95651ee8 of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:40.241500 6768 mesos_containerizer.cpp:1108] Executor for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' has exited
>> I0722 02:09:40.243710 6768 mesos_containerizer.cpp:992] Destroying container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>>
>> On the executor side, for the failed task, I see
>>
>> 02:09:35.509 [main] INFO c.g.c.mesos.CandylandExecutor - *** Candyland Executor Main running
>> 02:09:35.620 [main] INFO c.g.c.mesos.CandylandExecutor - *** Entering Driver loop
>> 02:09:35.663 [main] INFO org.xnio - XNIO version 3.2.0.Final
>> 02:09:35.754 [main] INFO org.xnio.nio - XNIO NIO Implementation Version 3.2.0.Final
>> 02:09:35.894 [executor-driver-0] DEBUG c.g.mesos.util.HttpProtocolSender - Sending to URL http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage :
>> framework_id {
>>   value: "20140711-183251-3078776074-5050-5240-0083"
>> }
>> executor_id {
>>   value: "candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4"
>> }
>>
>> 02:09:36.146 [OkHttp http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage] DEBUG c.g.mesos.util.HttpProtocolSender - Response POST http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage : 202
>>
>> but nothing else. For the succeeding task, I see
>>
>> 02:08:35.437 [main] INFO c.g.c.mesos.CandylandExecutor - *** Candyland Executor Main running
>> 02:08:35.571 [main] INFO c.g.c.mesos.CandylandExecutor - *** Entering Driver loop
>> 02:08:35.614 [main] INFO org.xnio - XNIO version 3.2.0.Final
>> 02:08:35.673 [main] INFO org.xnio.nio - XNIO NIO Implementation Version 3.2.0.Final
>> 02:08:35.820 [executor-driver-0] DEBUG c.g.mesos.util.HttpProtocolSender - Sending to URL http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage :
>> framework_id {
>>   value: "20140711-183251-3078776074-5050-5240-0083"
>> }
>> executor_id {
>>   value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
>> }
>>
>> 02:08:35.979 [OkHttp http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage] DEBUG c.g.mesos.util.HttpProtocolSender - Response POST http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage : 202
>> 02:08:36.062 [executor-callback-0] INFO c.g.c.mesos.CandylandExecutor - *** Launch Task, driver is com.groupon.mesos.JesosExecutorDriver@10ae9677
>> 02:08:36.075 [executor-callback-1] INFO c.g.c.mesos.CandylandExecutor - *** Registered, driver is com.groupon.mesos.JesosExecutorDriver@10ae9677
>> 02:08:36.096 [executor-driver-0] DEBUG c.g.mesos.util.HttpProtocolSender - Sending to URL http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
>> update {
>>   framework_id {
>>     value: "20140711-183251-3078776074-5050-5240-0083"
>>   }
>>   executor_id {
>>     value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
>>   }
>>   slave_id {
>>     value: "20140711-183251-3078776074-5050-5240-0"
>>   }
>>   status {
>>     task_id {
>>       value: "c448be94-cd7f-4623-8348-21cf95651ee8"
>>     }
>>     state: TASK_STARTING
>>     timestamp: 1.405994916E9
>>   }
>>   timestamp: 1.405994916E9
>>   uuid: "\001\226\302}qiN-\225\264,\325\a\nh\365"
>> }
>> pid: "[email protected]:37954"
>>
>> 02:08:36.100 [OkHttp http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage] DEBUG c.g.mesos.util.HttpProtocolSender - Response POST http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage: 202
>>
>> For a successful task, my executor receives a launchTask message right after the registerExecutor message. For the failing one, I never receive launchTask from the slave. And the task hangs around in STARTING forever.
>>
>> I can somewhat reliably reproduce this with mesos 0.19.0 running in vagrant (it will show up running my test within an hour). Running mesos 0.19.1 on bare metal, I could not reproduce it in many hours of starting and stopping tasks, so it might be gone in 0.19.1.
>>
>> If that behaviour is to be expected (with messages being unreliable), should I more aggressively try to reconcile these starting tasks? Right now, I only reconcile tasks that at least returned a single status update message from the master.
>>
>> The only data point that I have is that for a failed task, I see additional messages from other tasks interspersed in the log. For a succeeding task, there is no other task doing updates. This is what I saw interspersed with the log messages of the failed task.
>>
>> I0722 02:09:35.071388 6767 slave.cpp:1664] Got registration for executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:35.071637 6767 slave.cpp:1783] Flushing queued task dc5677b3-5b3b-4c3d-b750-19b5edea42eb for executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:35.237853 6771 mesos_containerizer.cpp:1108] Executor for container '97af22bb-cc60-4e9b-93ca-22fab235fd39' has exited
>> I0722 02:09:35.240733 6771 mesos_containerizer.cpp:992] Destroying container '97af22bb-cc60-4e9b-93ca-22fab235fd39'
>> I0722 02:09:35.251838 6769 slave.cpp:2413] Executor 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework 20140711-183251-3078776074-5050-5240-0083 has exited with status 0
>> I0722 02:09:35.252343 6769 slave.cpp:2552] Cleaning up executor 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:35.252504 6769 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5/runs/97af22bb-cc60-4e9b-93ca-22fab235fd39' for gc 6.99999707835556days in the future
>> I0722 02:09:35.252594 6769 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' for gc 6.9999970780563days in the future
>>
>> Sorry for being somewhat vague, I have been chasing this bug for a while now and I would like to see whether this is me not understanding what is going on or an actual bug.
>>
>> Thanks,
>> Henning
>>
>> (I can make the candyland code available if that helps but it is really ugly test hacking code and it shows how little I know about Mesos. ;-) )
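
Following up on my own reconciliation question from the quoted mail: what I have in mind is roughly the following. This is a sketch only; SchedulerDriver#reconcileTasks is the real API, but the class, the bookkeeping map and the two-minute timeout are purely illustrative, and I am not sure how much 0.19 actually answers for tasks the master never received an update for.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.mesos.Protos.TaskID;
import org.apache.mesos.Protos.TaskState;
import org.apache.mesos.Protos.TaskStatus;
import org.apache.mesos.SchedulerDriver;

// Illustrative bookkeeping: remember when each task was launched and ask the
// master about any task that has not produced a single status update in time.
public final class StartingTaskReconciler
{
    private static final long TIMEOUT_MS = TimeUnit.MINUTES.toMillis(2);

    private final SchedulerDriver driver;
    private final Map<TaskID, Long> pendingTasks = new ConcurrentHashMap<TaskID, Long>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public StartingTaskReconciler(final SchedulerDriver driver)
    {
        this.driver = driver;
        timer.scheduleAtFixedRate(new Runnable() {
            @Override
            public void run()
            {
                reconcileStaleTasks();
            }
        }, 60, 60, TimeUnit.SECONDS);
    }

    // Call from the scheduler right after launchTasks().
    public void taskLaunched(final TaskID taskId)
    {
        pendingTasks.put(taskId, System.currentTimeMillis());
    }

    // Call from statusUpdate(); once any update arrives, normal reconciliation
    // of known tasks takes over.
    public void statusReceived(final TaskID taskId)
    {
        pendingTasks.remove(taskId);
    }

    private void reconcileStaleTasks()
    {
        final long now = System.currentTimeMillis();
        final List<TaskStatus> stale = new ArrayList<TaskStatus>();
        for (final Map.Entry<TaskID, Long> entry : pendingTasks.entrySet()) {
            if (now - entry.getValue() > TIMEOUT_MS) {
                // TASK_STAGING is simply the last state we know about; the
                // master replies with its own view (or possibly nothing, if it
                // never heard anything about the task).
                stale.add(TaskStatus.newBuilder()
                    .setTaskId(entry.getKey())
                    .setState(TaskState.TASK_STAGING)
                    .build());
            }
        }
        if (!stale.isEmpty()) {
            driver.reconcileTasks(stale);
        }
    }
}

The idea is simply to track every launched task until its first status update arrives and to ask the master about anything that stays silent, instead of only reconciling tasks that already produced at least one update.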
