James DeFelice created MESOS-4679:
-------------------------------------
Summary: slave dies unexpectedly: Mismatched checkpoint value for status update TASK_LOST
Key: MESOS-4679
URL: https://issues.apache.org/jira/browse/MESOS-4679
Project: Mesos
Issue Type: Bug
Reporter: James DeFelice
It looks like the custom executor is sending multiple terminal status
updates for the same task, and that is crashing the slave (status-update
UUIDs are possibly being mishandled as well?). In any event, I think the slave
should handle this case with a bit more aplomb.
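For illustration only, here is a minimal sketch of how an executor could avoid emitting a second terminal update for a task. It is not the k8sm executor's code and not a mesos-go API: `terminalGuard`, `newTerminalGuard`, `shouldSend`, and `isTerminal` are hypothetical names, and task states are modeled as plain strings. The sequence in `main` mirrors the four updates the executor log shows for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f.

```go
package main

import (
	"fmt"
	"sync"
)

// isTerminal reports whether a Mesos task state name is a terminal state.
func isTerminal(state string) bool {
	switch state {
	case "TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST", "TASK_ERROR":
		return true
	}
	return false
}

// terminalGuard (hypothetical helper) remembers which tasks have already
// reported a terminal state so later updates for them can be dropped.
type terminalGuard struct {
	mu   sync.Mutex
	done map[string]string // task ID -> first terminal state recorded
}

func newTerminalGuard() *terminalGuard {
	return &terminalGuard{done: make(map[string]string)}
}

// shouldSend returns false once a terminal update has been recorded for
// taskID; otherwise it returns true and records state if it is terminal.
func (g *terminalGuard) shouldSend(taskID, state string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, seen := g.done[taskID]; seen {
		return false
	}
	if isTerminal(state) {
		g.done[taskID] = state
	}
	return true
}

func main() {
	g := newTerminalGuard()
	task := "pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f"
	// Same order of states as in the executor log for this task.
	for _, s := range []string{"TASK_LOST", "TASK_KILLED", "TASK_LOST", "TASK_LOST"} {
		fmt.Printf("%s send=%v\n", s, g.shouldSend(task, s))
	}
}
```

With a guard like this, only the first TASK_LOST would leave the executor. That would work around the crash on the executor side, but the slave should arguably still tolerate duplicate terminal updates.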
Custom executor logs:
{code}
I0215 20:43:59.551657 11068 executor.go:426] Executor driver killTask
I0215 20:43:59.551719 11068 executor.go:436] Executor driver is asked to kill
task
'&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],}'
I0215 20:43:59.552189 11068 executor.go:687] Executor sending status update
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
253 145 223 212 36 17 229 158 224 82 84 0 231 66
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.552599 11068 executor.go:687] Executor sending status update
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
253 162 110 212 36 17 229 158 224 82 84 0 231 66
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.557376 11068 suicide.go:51] stopping suicide watch
I0215 20:43:59.559077 11068 executor.go:445] Executor
statusUpdateAcknowledgement
I0215 20:43:59.559129 11068 executor.go:448] Receiving status update
acknowledgement
&StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
I0215 20:43:59.562016 11068 executor.go:470] Executor driver received
frameworkMessage
I0215 20:43:59.562073 11068 executor.go:480] Executor driver receives
framework message
I0215 20:43:59.562100 11068 executor.go:445] Executor
statusUpdateAcknowledgement
I0215 20:43:59.562112 11068 executor.go:448] Receiving status update
acknowledgement
&StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
I0215 20:43:59.562173 11068 executor.go:579] Receives message from framework
task-lost:pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f
I0215 20:43:59.562292 11068 executor.go:687] Executor sending status update
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*task-lost-ack,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
255 28 217 212 36 17 229 158 224 82 84 0 231 66
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.562463 11068 executor.go:687] Executor sending status update
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
255 35 27 212 36 17 229 158 224 82 84 0 231 66
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.568237 11068 executor.go:445] Executor
statusUpdateAcknowledgement
I0215 20:43:59.568286 11068 executor.go:448] Receiving status update
acknowledgement
&StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
255 28 217 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
I0215 20:43:59.588373 11068 suicide.go:51] stopping suicide watch
I0215 20:43:59.588566 11068 executor.go:687] Executor sending status update
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6ce1b7db-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
3 30 254 212 36 17 229 158 224 82 84 0 231 66
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.595983 11068 executor.go:260] slave disconnected, will wait
for recovery
I0215 20:43:59.596040 11068 executor.go:328] Slave is disconnected
I0215 20:43:59.623678 11068 suicide.go:51] stopping suicide watch
I0215 20:43:59.623841 11068 executor.go:687] Executor sending status update
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6d006a26-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
8 128 159 212 36 17 229 158 224 82 84 0 231 66
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.624399 11068 executor.go:284] slave exited ... shutting down
I0215 20:43:59.624442 11068 executor.go:613] Aborting the executor driver
{code}
Slave logs:
{code}
I0215 20:43:59.564084 15780 slave.cpp:2762] Handling status update TASK_LOST
(UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006 from executor(1)@10.2.0.6:40672
W0215 20:43:59.564115 15780 slave.cpp:2856] Could not find the executor for
status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006
I0215 20:43:59.564321 15782 status_update_manager.cpp:826] Checkpointing UPDATE
for status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for
task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006
I0215 20:43:59.566783 15782 status_update_manager.cpp:322] Received status
update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006
I0215 20:43:59.566879 15782 slave.cpp:3087] Forwarding the update TASK_LOST
(UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006 to [email protected]:5050
I0215 20:43:59.566952 15782 slave.cpp:3011] Sending acknowledgement for status
update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006 to executor(1)@10.2.0.6:40672
F0215 20:43:59.567073 15782 slave.cpp:3003] CHECK_READY(future): is FAILED:
Mismatched checkpoint value for status update TASK_LOST (UUID:
d6ff231b-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006 (expected checkpoint=true actual
checkpoint=false) Failed to handle status update TASK_LOST (UUID:
d6ff231b-d424-11e5-9ee0-525400e74246) for task
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
df95a79b-d6d4-4b96-853e-55686628e898-0006
{code}
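The executor log prints each UUID as a Go byte slice, while the slave log prints it in hex, which makes the two hard to line up by eye. Below is a small sketch for converting one form to the other; `formatUUID` is a hypothetical helper, not part of mesos-go. The bytes in `main` are copied from the last TASK_LOST (kill-pod-task) update in the executor log.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// formatUUID renders 16 raw bytes in the canonical 8-4-4-4-12 hex form
// used by the slave log.
func formatUUID(b [16]byte) string {
	h := hex.EncodeToString(b[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

func main() {
	// Uuid of the last TASK_LOST (kill-pod-task) update in the executor log.
	u := [16]byte{214, 255, 35, 27, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}
	fmt.Println(formatUUID(u)) // d6ff231b-d424-11e5-9ee0-525400e74246
}
```

Converted this way, that update is the one named in the fatal CHECK_READY message (d6ff231b-...). The preceding TASK_LOST (task-lost-ack) update, whose bytes start with 214 255 28 217, is d6ff1cd9-..., which is the one the slave checkpointed, forwarded, and acknowledged.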