Hey Vinod,

thanks for looking into this. I am not sure that this is related to
the actual executor implementation because what I am not seeing is the
"launchTasks" message from the slave. The only significant change over
using the native code is that the slave must use "real" http to launch
the task (but it actually does not matter because in a worst case
scenario it would ignore the 202 ACCEPTED line from the executor
implementation). As the slave does not wait for any acknowledgement
from the slave

The trace seems to imply that the slave, even though it acknowledged
the executor registration, fails to schedule the task. So I was
wondering if that might be a problem with the registerExecutor message
hitting just at a bad moment and somehow getting lost inside the
slave. My understanding is that, if the slave processes the
registerExecutor message, it will then launch the task on it. And that
is the part that never happens. But the slave saw the executor
registration, because it logs

I0722 02:09:36.131356  6767 slave.cpp:1664] Got registration for
executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
20140711-183251-3078776074-5050-5240-0083

in its INFO log. That is why I assume that this is not related to the
executor implementation but may an actual mesos bug.

The 0.19 vs 0.19.1 might be a red herring, I agree.

On Wed, Jul 23, 2014 at 12:43 PM, Vinod Kone <[email protected]> wrote:
> There were no bugs/fixes related to slave<->executor message passing in
> 0.19.1. So, I'm surprised you weren't able to repro this behavior in 0.19.1.
>
> Looks like you are using pure scheduler/executor clients to talk to mesos?
> If yes, there might be a bug in your executor? Although, if it's in your
> executor that should be reproed in 0.19.1 as well.
>
> I would recommend running the default command executor to see if you are
> having issues with message passing in 0.19.0. If no issues, I would dig
> more closely into your executor client.
>
>
> On Wed, Jul 23, 2014 at 12:17 PM, Henning Schmiedehausen <
> [email protected]> wrote:
>
>> Hi,
>>
>> so I have that strange behavior that once in a blue moon, the mesos
>> executor decides to not send a "launchTask" message to my executor.
>>
>> This is what I see (mesos-slave.INFO):
>>
>> I0722 02:09:34.524090  6769 slave.cpp:933] Got assigned task
>> 83ea269c-8988-49bd-9d23-034c33858352 for framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:34.524222  6769 slave.cpp:1043] Launching task
>> 83ea269c-8988-49bd-9d23-034c33858352 for framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:34.524915  6769 slave.cpp:1153] Queuing task
>> '83ea269c-8988-49bd-9d23-034c33858352' for executor
>> candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4 of framework
>> '20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:34.524968  6772 mesos_containerizer.cpp:537] Starting
>> container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' for executor
>> 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
>> '20140711-183251-3078776074-5050-5240-0083'
>> I0722 02:09:34.527037  6772 launcher.cpp:117] Forked child with pid
>> '428' for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
>> I0722 02:09:34.527452  6772 mesos_containerizer.cpp:647] Fetching URIs
>> for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' using command
>> '/usr/local/libexec/mesos/mesos-fetcher'
>> I0722 02:09:35.237933  6767 slave.cpp:2355] Monitoring executor
>> 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
>> '20140711-183251-3078776074-5050-5240-0083' in container
>> 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
>> I0722 02:09:36.131356  6767 slave.cpp:1664] Got registration for
>> executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:36.131597  6767 slave.cpp:1783] Flushing queued task
>> 83ea269c-8988-49bd-9d23-034c33858352 for executor
>> 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>>
>>
>> For a working task, it looks like this:
>>
>> I0722 02:08:34.484169  6773 slave.cpp:933] Got assigned task
>> c448be94-cd7f-4623-8348-21cf95651ee8 for framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:34.484282  6773 slave.cpp:1043] Launching task
>> c448be94-cd7f-4623-8348-21cf95651ee8 for framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:34.486240  6774 mesos_containerizer.cpp:537] Starting
>> container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' for executor
>> 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
>> '20140711-183251-3078776074-5050-5240-0083'
>> I0722 02:08:34.487082  6774 launcher.cpp:117] Forked child with pid
>> '32644' for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>> I0722 02:08:34.487540  6774 mesos_containerizer.cpp:647] Fetching URIs
>> for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' using command
>> '/usr/local/libexec/mesos/mesos-fetcher'
>> I0722 02:08:34.489589  6773 slave.cpp:1153] Queuing task
>> 'c448be94-cd7f-4623-8348-21cf95651ee8' for executor
>> candyland_e8741342-35af-407b-9986-fef48f432ae8 of framework
>> '20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:35.193210  6771 slave.cpp:2355] Monitoring executor
>> 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
>> '20140711-183251-3078776074-5050-5240-0083' in container
>> '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>> I0722 02:08:35.970985  6768 slave.cpp:1664] Got registration for
>> executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:35.971294  6768 slave.cpp:1783] Flushing queued task
>> c448be94-cd7f-4623-8348-21cf95651ee8 for executor
>> 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.099797  6771 slave.cpp:2018] Handling status update
>> TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 from
>> [email protected]:37954
>> I0722 02:08:36.099890  6771 status_update_manager.cpp:320] Received
>> status update TASK_STARTING (UUID:
>> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.100342  6771 status_update_manager.cpp:373] Forwarding
>> status update TASK_STARTING (UUID:
>> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 to
>> [email protected]:5050
>> I0722 02:08:36.100461  6771 slave.cpp:2145] Sending acknowledgement
>> for status update TASK_STARTING (UUID:
>> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 to
>> [email protected]:37954
>> I0722 02:08:36.103785  6769 status_update_manager.cpp:398] Received
>> status update acknowledgement (UUID:
>> 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.158952  6769 slave.cpp:2018] Handling status update
>> TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 from
>> [email protected]:37954
>> I0722 02:08:36.159203  6769 status_update_manager.cpp:320] Received
>> status update TASK_RUNNING (UUID:
>> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:08:36.159416  6769 status_update_manager.cpp:373] Forwarding
>> status update TASK_RUNNING (UUID:
>> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 to
>> [email protected]:5050
>> I0722 02:08:36.159515  6769 slave.cpp:2145] Sending acknowledgement
>> for status update TASK_RUNNING (UUID:
>> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 to
>> [email protected]:37954
>> I0722 02:08:36.162083  6774 status_update_manager.cpp:398] Received
>> status update acknowledgement (UUID:
>> 888db1de-7a44-43f8-8025-f19f80e8d499) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:40.064846  6772 slave.cpp:2018] Handling status update
>> TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 from
>> [email protected]:37954
>> I0722 02:09:40.065062  6772 status_update_manager.cpp:320] Received
>> status update TASK_FINISHED (UUID:
>> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:40.065203  6772 status_update_manager.cpp:373] Forwarding
>> status update TASK_FINISHED (UUID:
>> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 to
>> [email protected]:5050
>> I0722 02:09:40.065286  6772 slave.cpp:2145] Sending acknowledgement
>> for status update TASK_FINISHED (UUID:
>> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083 to
>> [email protected]:37954
>> I0722 02:09:40.068207  6768 status_update_manager.cpp:398] Received
>> status update acknowledgement (UUID:
>> 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
>> c448be94-cd7f-4623-8348-21cf95651ee8 of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:40.241500  6768 mesos_containerizer.cpp:1108] Executor for
>> container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' has exited
>> I0722 02:09:40.243710  6768 mesos_containerizer.cpp:992] Destroying
>> container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
>>
>>
>> On the executor side, for the failed task, I see
>>
>> 02:09:35.509 [main] INFO  c.g.c.mesos.CandylandExecutor - ***
>> Candyland Executor Main running
>> 02:09:35.620 [main] INFO  c.g.c.mesos.CandylandExecutor - *** Entering
>> Driver loop
>> 02:09:35.663 [main] INFO  org.xnio - XNIO version 3.2.0.Final
>> 02:09:35.754 [main] INFO  org.xnio.nio - XNIO NIO Implementation
>> Version 3.2.0.Final
>> 02:09:35.894 [executor-driver-0] DEBUG
>> c.g.mesos.util.HttpProtocolSender - Sending to URL
>> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
>> :
>> framework_id {
>>   value: "20140711-183251-3078776074-5050-5240-0083"
>> }
>> executor_id {
>>   value: "candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4"
>> }
>>
>> 02:09:36.146 [OkHttp
>> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
>> ]
>> DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
>> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
>> :
>> 202
>>
>> but nothing else. For the succeeding task, I see
>>
>> 02:08:35.437 [main] INFO  c.g.c.mesos.CandylandExecutor - ***
>> Candyland Executor Main running
>> 02:08:35.571 [main] INFO  c.g.c.mesos.CandylandExecutor - *** Entering
>> Driver loop
>> 02:08:35.614 [main] INFO  org.xnio - XNIO version 3.2.0.Final
>> 02:08:35.673 [main] INFO  org.xnio.nio - XNIO NIO Implementation
>> Version 3.2.0.Final
>> 02:08:35.820 [executor-driver-0] DEBUG
>> c.g.mesos.util.HttpProtocolSender - Sending to URL
>> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
>> :
>> framework_id {
>>   value: "20140711-183251-3078776074-5050-5240-0083"
>> }
>> executor_id {
>>   value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
>> }
>>
>> 02:08:35.979 [OkHttp
>> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
>> ]
>> DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
>> http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage
>> :
>> 202
>> 02:08:36.062 [executor-callback-0] INFO  c.g.c.mesos.CandylandExecutor
>> - *** Launch Task, driver is
>> com.groupon.mesos.JesosExecutorDriver@10ae9677
>> 02:08:36.075 [executor-callback-1] INFO  c.g.c.mesos.CandylandExecutor
>> - *** Registered, driver is
>> com.groupon.mesos.JesosExecutorDriver@10ae9677
>> 02:08:36.096 [executor-driver-0] DEBUG
>> c.g.mesos.util.HttpProtocolSender - Sending to URL
>> http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
>> update {
>>   framework_id {
>>     value: "20140711-183251-3078776074-5050-5240-0083"
>>   }
>>   executor_id {
>>     value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
>>   }
>>   slave_id {
>>     value: "20140711-183251-3078776074-5050-5240-0"
>>   }
>>   status {
>>     task_id {
>>       value: "c448be94-cd7f-4623-8348-21cf95651ee8"
>>     }
>>     state: TASK_STARTING
>>     timestamp: 1.405994916E9
>>   }
>>   timestamp: 1.405994916E9
>>   uuid: "\001\226\302}qiN-\225\264,\325\a\nh\365"
>> }
>> pid: "[email protected]:37954"
>>
>> 02:08:36.100 [OkHttp
>> http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage]
>> DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
>> http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
>> 202
>>
>>
>> For a successful task, my executor receives a launchTask message right
>> after the registerExecutor message. For the failing one, I never
>> receive launchTask from the slave. And the task hangs around in
>> STARTING forever.
>>
>> I can somewhat reliably reproduce this with mesos 0.19.0 running in
>> vagrant (it will show up running my test within an hour). Running
>> mesos 0.19.1 on bare metal, I could not reproduce it in many hours of
>> starting and stopping tasks, so it might be gone in 0.19.1.
>>
>> If that behaviour is to be expected (with messages being unreliable),
>> should I more aggressively try to reconcile these starting tasks?
>> Right now, I only reconcile tasks that at least returned a single
>> status update message from the master.
>>
>> The only data point that I have is that for a failed taks, I see
>> additional messages from other tasks interspersed in the log. For a
>> succeeding task, there is no other task doing updates. This is what I
>> saw interspersed with the log messages of the failed task.
>>
>> I0722 02:09:35.071388  6767 slave.cpp:1664] Got registration for
>> executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:35.071637  6767 slave.cpp:1783] Flushing queued task
>> dc5677b3-5b3b-4c3d-b750-19b5edea42eb for executor
>> 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:35.237853  6771 mesos_containerizer.cpp:1108] Executor for
>> container '97af22bb-cc60-4e9b-93ca-22fab235fd39' has exited
>> I0722 02:09:35.240733  6771 mesos_containerizer.cpp:992] Destroying
>> container '97af22bb-cc60-4e9b-93ca-22fab235fd39'
>> I0722 02:09:35.251838  6769 slave.cpp:2413] Executor
>> 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework
>> 20140711-183251-3078776074-5050-5240-0083 has exited with status 0
>> I0722 02:09:35.252343  6769 slave.cpp:2552] Cleaning up executor
>> 'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework
>> 20140711-183251-3078776074-5050-5240-0083
>> I0722 02:09:35.252504  6769 gc.cpp:56] Scheduling
>>
>> '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5/runs/97af22bb-cc60-4e9b-93ca-22fab235fd39'
>> for gc 6.99999707835556days in the future
>> I0722 02:09:35.252594  6769 gc.cpp:56] Scheduling
>>
>> '/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5'
>> for gc 6.9999970780563days in the future
>>
>> Sorry for being somewhat vague, I am chasing this bug for while now
>> and I would like to see whether this is me not understanding what is
>> going on or an actual bug.
>>
>> Thanks,
>>     Henning
>>
>> (I can make the candyland code available if that helps but it is
>> really ugly test hacking code and it shows how little I know about
>> Mesos. ;-) )
>>

Reply via email to