Hi,

so I have that strange behavior that once in a blue moon, the mesos
executor decides to not send a "launchTask" message to my executor.

This is what I see (mesos-slave.INFO):

I0722 02:09:34.524090  6769 slave.cpp:933] Got assigned task
83ea269c-8988-49bd-9d23-034c33858352 for framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:34.524222  6769 slave.cpp:1043] Launching task
83ea269c-8988-49bd-9d23-034c33858352 for framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:34.524915  6769 slave.cpp:1153] Queuing task
'83ea269c-8988-49bd-9d23-034c33858352' for executor
candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4 of framework
'20140711-183251-3078776074-5050-5240-0083
I0722 02:09:34.524968  6772 mesos_containerizer.cpp:537] Starting
container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' for executor
'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
'20140711-183251-3078776074-5050-5240-0083'
I0722 02:09:34.527037  6772 launcher.cpp:117] Forked child with pid
'428' for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
I0722 02:09:34.527452  6772 mesos_containerizer.cpp:647] Fetching URIs
for container 'f4d1c44b-63e4-4409-be6e-6d5983e01d8c' using command
'/usr/local/libexec/mesos/mesos-fetcher'
I0722 02:09:35.237933  6767 slave.cpp:2355] Monitoring executor
'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
'20140711-183251-3078776074-5050-5240-0083' in container
'f4d1c44b-63e4-4409-be6e-6d5983e01d8c'
I0722 02:09:36.131356  6767 slave.cpp:1664] Got registration for
executor 'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:36.131597  6767 slave.cpp:1783] Flushing queued task
83ea269c-8988-49bd-9d23-034c33858352 for executor
'candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4' of framework
20140711-183251-3078776074-5050-5240-0083


For a working task, it looks like this:

I0722 02:08:34.484169  6773 slave.cpp:933] Got assigned task
c448be94-cd7f-4623-8348-21cf95651ee8 for framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:34.484282  6773 slave.cpp:1043] Launching task
c448be94-cd7f-4623-8348-21cf95651ee8 for framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:34.486240  6774 mesos_containerizer.cpp:537] Starting
container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' for executor
'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
'20140711-183251-3078776074-5050-5240-0083'
I0722 02:08:34.487082  6774 launcher.cpp:117] Forked child with pid
'32644' for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'
I0722 02:08:34.487540  6774 mesos_containerizer.cpp:647] Fetching URIs
for container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' using command
'/usr/local/libexec/mesos/mesos-fetcher'
I0722 02:08:34.489589  6773 slave.cpp:1153] Queuing task
'c448be94-cd7f-4623-8348-21cf95651ee8' for executor
candyland_e8741342-35af-407b-9986-fef48f432ae8 of framework
'20140711-183251-3078776074-5050-5240-0083
I0722 02:08:35.193210  6771 slave.cpp:2355] Monitoring executor
'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
'20140711-183251-3078776074-5050-5240-0083' in container
'1db76576-e0fb-42b5-b9c7-3877e79e35d5'
I0722 02:08:35.970985  6768 slave.cpp:1664] Got registration for
executor 'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:35.971294  6768 slave.cpp:1783] Flushing queued task
c448be94-cd7f-4623-8348-21cf95651ee8 for executor
'candyland_e8741342-35af-407b-9986-fef48f432ae8' of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:36.099797  6771 slave.cpp:2018] Handling status update
TASK_STARTING (UUID: 0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 from
[email protected]:37954
I0722 02:08:36.099890  6771 status_update_manager.cpp:320] Received
status update TASK_STARTING (UUID:
0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:36.100342  6771 status_update_manager.cpp:373] Forwarding
status update TASK_STARTING (UUID:
0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 to
[email protected]:5050
I0722 02:08:36.100461  6771 slave.cpp:2145] Sending acknowledgement
for status update TASK_STARTING (UUID:
0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 to
[email protected]:37954
I0722 02:08:36.103785  6769 status_update_manager.cpp:398] Received
status update acknowledgement (UUID:
0196c27d-7169-4e2d-95b4-2cd5070a68f5) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:36.158952  6769 slave.cpp:2018] Handling status update
TASK_RUNNING (UUID: 888db1de-7a44-43f8-8025-f19f80e8d499) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 from
[email protected]:37954
I0722 02:08:36.159203  6769 status_update_manager.cpp:320] Received
status update TASK_RUNNING (UUID:
888db1de-7a44-43f8-8025-f19f80e8d499) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:08:36.159416  6769 status_update_manager.cpp:373] Forwarding
status update TASK_RUNNING (UUID:
888db1de-7a44-43f8-8025-f19f80e8d499) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 to
[email protected]:5050
I0722 02:08:36.159515  6769 slave.cpp:2145] Sending acknowledgement
for status update TASK_RUNNING (UUID:
888db1de-7a44-43f8-8025-f19f80e8d499) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 to
[email protected]:37954
I0722 02:08:36.162083  6774 status_update_manager.cpp:398] Received
status update acknowledgement (UUID:
888db1de-7a44-43f8-8025-f19f80e8d499) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:40.064846  6772 slave.cpp:2018] Handling status update
TASK_FINISHED (UUID: 40880600-8e57-4d5c-ae28-89ccb347aab1) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 from
[email protected]:37954
I0722 02:09:40.065062  6772 status_update_manager.cpp:320] Received
status update TASK_FINISHED (UUID:
40880600-8e57-4d5c-ae28-89ccb347aab1) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:40.065203  6772 status_update_manager.cpp:373] Forwarding
status update TASK_FINISHED (UUID:
40880600-8e57-4d5c-ae28-89ccb347aab1) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 to
[email protected]:5050
I0722 02:09:40.065286  6772 slave.cpp:2145] Sending acknowledgement
for status update TASK_FINISHED (UUID:
40880600-8e57-4d5c-ae28-89ccb347aab1) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083 to
[email protected]:37954
I0722 02:09:40.068207  6768 status_update_manager.cpp:398] Received
status update acknowledgement (UUID:
40880600-8e57-4d5c-ae28-89ccb347aab1) for task
c448be94-cd7f-4623-8348-21cf95651ee8 of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:40.241500  6768 mesos_containerizer.cpp:1108] Executor for
container '1db76576-e0fb-42b5-b9c7-3877e79e35d5' has exited
I0722 02:09:40.243710  6768 mesos_containerizer.cpp:992] Destroying
container '1db76576-e0fb-42b5-b9c7-3877e79e35d5'


On the executor side, for the failed task, I see

02:09:35.509 [main] INFO  c.g.c.mesos.CandylandExecutor - ***
Candyland Executor Main running
02:09:35.620 [main] INFO  c.g.c.mesos.CandylandExecutor - *** Entering
Driver loop
02:09:35.663 [main] INFO  org.xnio - XNIO version 3.2.0.Final
02:09:35.754 [main] INFO  org.xnio.nio - XNIO NIO Implementation
Version 3.2.0.Final
02:09:35.894 [executor-driver-0] DEBUG
c.g.mesos.util.HttpProtocolSender - Sending to URL
http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage:
framework_id {
  value: "20140711-183251-3078776074-5050-5240-0083"
}
executor_id {
  value: "candyland_e3030c10-d154-4a34-a72f-aba07e1a84d4"
}

02:09:36.146 [OkHttp
http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage]
DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage:
202

but nothing else. For the succeeding task, I see

02:08:35.437 [main] INFO  c.g.c.mesos.CandylandExecutor - ***
Candyland Executor Main running
02:08:35.571 [main] INFO  c.g.c.mesos.CandylandExecutor - *** Entering
Driver loop
02:08:35.614 [main] INFO  org.xnio - XNIO version 3.2.0.Final
02:08:35.673 [main] INFO  org.xnio.nio - XNIO NIO Implementation
Version 3.2.0.Final
02:08:35.820 [executor-driver-0] DEBUG
c.g.mesos.util.HttpProtocolSender - Sending to URL
http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage:
framework_id {
  value: "20140711-183251-3078776074-5050-5240-0083"
}
executor_id {
  value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
}

02:08:35.979 [OkHttp
http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage]
DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
http://10.101.130.183:5051/slave(1)/mesos.internal.RegisterExecutorMessage:
202
02:08:36.062 [executor-callback-0] INFO  c.g.c.mesos.CandylandExecutor
- *** Launch Task, driver is
com.groupon.mesos.JesosExecutorDriver@10ae9677
02:08:36.075 [executor-callback-1] INFO  c.g.c.mesos.CandylandExecutor
- *** Registered, driver is
com.groupon.mesos.JesosExecutorDriver@10ae9677
02:08:36.096 [executor-driver-0] DEBUG
c.g.mesos.util.HttpProtocolSender - Sending to URL
http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
update {
  framework_id {
    value: "20140711-183251-3078776074-5050-5240-0083"
  }
  executor_id {
    value: "candyland_e8741342-35af-407b-9986-fef48f432ae8"
  }
  slave_id {
    value: "20140711-183251-3078776074-5050-5240-0"
  }
  status {
    task_id {
      value: "c448be94-cd7f-4623-8348-21cf95651ee8"
    }
    state: TASK_STARTING
    timestamp: 1.405994916E9
  }
  timestamp: 1.405994916E9
  uuid: "\001\226\302}qiN-\225\264,\325\a\nh\365"
}
pid: "[email protected]:37954"

02:08:36.100 [OkHttp
http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage]
DEBUG c.g.mesos.util.HttpProtocolSender - Response POST
http://10.101.130.183:5051/slave(1)/mesos.internal.StatusUpdateMessage:
202


For a successful task, my executor receives a launchTask message right
after the registerExecutor message. For the failing one, I never
receive launchTask from the slave. And the task hangs around in
STARTING forever.

I can somewhat reliably reproduce this with mesos 0.19.0 running in
vagrant (it will show up running my test within an hour). Running
mesos 0.19.1 on bare metal, I could not reproduce it in many hours of
starting and stopping tasks, so it might be gone in 0.19.1.

If that behaviour is to be expected (with messages being unreliable),
should I more aggressively try to reconcile these starting tasks?
Right now, I only reconcile tasks that at least returned a single
status update message from the master.

The only data point that I have is that for a failed taks, I see
additional messages from other tasks interspersed in the log. For a
succeeding task, there is no other task doing updates. This is what I
saw interspersed with the log messages of the failed task.

I0722 02:09:35.071388  6767 slave.cpp:1664] Got registration for
executor 'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:35.071637  6767 slave.cpp:1783] Flushing queued task
dc5677b3-5b3b-4c3d-b750-19b5edea42eb for executor
'candyland_810ffa40-5572-4619-ba3c-680a8d12166c' of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:35.237853  6771 mesos_containerizer.cpp:1108] Executor for
container '97af22bb-cc60-4e9b-93ca-22fab235fd39' has exited
I0722 02:09:35.240733  6771 mesos_containerizer.cpp:992] Destroying
container '97af22bb-cc60-4e9b-93ca-22fab235fd39'
I0722 02:09:35.251838  6769 slave.cpp:2413] Executor
'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework
20140711-183251-3078776074-5050-5240-0083 has exited with status 0
I0722 02:09:35.252343  6769 slave.cpp:2552] Cleaning up executor
'candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5' of framework
20140711-183251-3078776074-5050-5240-0083
I0722 02:09:35.252504  6769 gc.cpp:56] Scheduling
'/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5/runs/97af22bb-cc60-4e9b-93ca-22fab235fd39'
for gc 6.99999707835556days in the future
I0722 02:09:35.252594  6769 gc.cpp:56] Scheduling
'/tmp/mesos/slaves/20140711-183251-3078776074-5050-5240-0/frameworks/20140711-183251-3078776074-5050-5240-0083/executors/candyland_7a83d24a-2449-4514-84f0-d960ee3c1fd5'
for gc 6.9999970780563days in the future

Sorry for being somewhat vague, I am chasing this bug for while now
and I would like to see whether this is me not understanding what is
going on or an actual bug.

Thanks,
    Henning

(I can make the candyland code available if that helps but it is
really ugly test hacking code and it shows how little I know about
Mesos. ;-) )

Reply via email to