James DeFelice created MESOS-4679:
-------------------------------------

             Summary: slave dies unexpectedly: Mismatched checkpoint value for status update TASK_LOST
                 Key: MESOS-4679
                 URL: https://issues.apache.org/jira/browse/MESOS-4679
             Project: Mesos
          Issue Type: Bug
            Reporter: James DeFelice


It looks like the custom executor is sending multiple terminal status updates 
for the same task, and that is crashing the slave (the slave may also be 
mishandling status-update UUIDs). In any event, I think the slave should handle 
this case with a bit more aplomb instead of dying.
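
One way to avoid the trigger on the executor side would be to make sure that at most one terminal update ever leaves the executor for a given task. The Go sketch below is only an illustration under that assumption: `terminalGuard` and `allow` are hypothetical names, not part of mesos-go or of this executor, and task states are passed as plain strings rather than the protobuf enum. It does not address the slave crash itself.

```go
package main

import (
	"fmt"
	"sync"
)

// terminalStates lists the Mesos task states after which no further
// update should be sent for a task.
var terminalStates = map[string]bool{
	"TASK_FINISHED": true,
	"TASK_FAILED":   true,
	"TASK_KILLED":   true,
	"TASK_LOST":     true,
	"TASK_ERROR":    true,
}

// terminalGuard is a hypothetical helper (not part of mesos-go) that
// remembers which tasks have already reported a terminal state.
type terminalGuard struct {
	mu   sync.Mutex
	done map[string]string // task ID -> first terminal state sent
}

func newTerminalGuard() *terminalGuard {
	return &terminalGuard{done: make(map[string]string)}
}

// allow reports whether an update for taskID in the given state should be
// sent. Once a terminal state has been recorded for a task, every later
// update for that task is rejected.
func (g *terminalGuard) allow(taskID, state string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, sent := g.done[taskID]; sent {
		return false
	}
	if terminalStates[state] {
		g.done[taskID] = state
	}
	return true
}

func main() {
	g := newTerminalGuard()
	task := "pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f"
	// Same order of states the executor log shows for this task.
	for _, state := range []string{"TASK_LOST", "TASK_KILLED", "TASK_LOST", "TASK_LOST"} {
		fmt.Printf("%s -> send=%v\n", state, g.allow(task, state))
	}
}
```

Fed the sequence the executor log shows for this task (TASK_LOST, TASK_KILLED, TASK_LOST, TASK_LOST), only the first update would pass the guard.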

Custom executor logs:
{code}
I0215 20:43:59.551657   11068 executor.go:426] Executor driver killTask
I0215 20:43:59.551719   11068 executor.go:436] Executor driver is asked to kill 
task 
'&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],}'
I0215 20:43:59.552189   11068 executor.go:687] Executor sending status update 
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
 253 145 223 212 36 17 229 158 224 82 84 0 231 66 
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.552599   11068 executor.go:687] Executor sending status update 
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
 253 162 110 212 36 17 229 158 224 82 84 0 231 66 
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.557376   11068 suicide.go:51] stopping suicide watch
I0215 20:43:59.559077   11068 executor.go:445] Executor 
statusUpdateAcknowledgement
I0215 20:43:59.559129   11068 executor.go:448] Receiving status update 
acknowledgement 
&StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
 253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
I0215 20:43:59.562016   11068 executor.go:470] Executor driver received 
frameworkMessage
I0215 20:43:59.562073   11068 executor.go:480] Executor driver receives 
framework message
I0215 20:43:59.562100   11068 executor.go:445] Executor 
statusUpdateAcknowledgement
I0215 20:43:59.562112   11068 executor.go:448] Receiving status update 
acknowledgement 
&StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
 253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
I0215 20:43:59.562173   11068 executor.go:579] Receives message from framework 
task-lost:pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f
I0215 20:43:59.562292   11068 executor.go:687] Executor sending status update 
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*task-lost-ack,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
 255 28 217 212 36 17 229 158 224 82 84 0 231 66 
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.562463   11068 executor.go:687] Executor sending status update 
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
 255 35 27 212 36 17 229 158 224 82 84 0 231 66 
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.568237   11068 executor.go:445] Executor 
statusUpdateAcknowledgement
I0215 20:43:59.568286   11068 executor.go:448] Receiving status update 
acknowledgement 
&StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
 255 28 217 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
I0215 20:43:59.588373   11068 suicide.go:51] stopping suicide watch
I0215 20:43:59.588566   11068 executor.go:687] Executor sending status update 
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6ce1b7db-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
 3 30 254 212 36 17 229 158 224 82 84 0 231 66 
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.595983   11068 executor.go:260] slave disconnected, will wait 
for recovery
I0215 20:43:59.596040   11068 executor.go:328] Slave is disconnected
I0215 20:43:59.623678   11068 suicide.go:51] stopping suicide watch
I0215 20:43:59.623841   11068 executor.go:687] Executor sending status update 
&StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6d006a26-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
 8 128 159 212 36 17 229 158 224 82 84 0 231 66 
70],LatestState:nil,XXX_unrecognized:[],}
I0215 20:43:59.624399   11068 executor.go:284] slave exited ... shutting down
I0215 20:43:59.624442   11068 executor.go:613] Aborting the executor driver
{code}

Slave logs:
{code}
I0215 20:43:59.564084 15780 slave.cpp:2762] Handling status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 from executor(1)@10.2.0.6:40672
W0215 20:43:59.564115 15780 slave.cpp:2856] Could not find the executor for status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
I0215 20:43:59.564321 15782 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
I0215 20:43:59.566783 15782 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
I0215 20:43:59.566879 15782 slave.cpp:3087] Forwarding the update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 to [email protected]:5050
I0215 20:43:59.566952 15782 slave.cpp:3011] Sending acknowledgement for status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 to executor(1)@10.2.0.6:40672
F0215 20:43:59.567073 15782 slave.cpp:3003] CHECK_READY(future): is FAILED: Mismatched checkpoint value for status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006 (expected checkpoint=true actual checkpoint=false) Failed to handle status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework df95a79b-d6d4-4b96-853e-55686628e898-0006
{code}
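
The executor prints each update's UUID as a list of decimal bytes, while the slave prints the canonical hex form, which makes the two logs hard to line up. The following Go sketch converts the four byte lists the executor logged for task pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f so they can be compared with the slave log. The helper name `formatUUID` is made up for this illustration.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// formatUUID renders 16 raw bytes in the canonical 8-4-4-4-12 hex form.
func formatUUID(b [16]byte) string {
	h := hex.EncodeToString(b[:])
	return h[0:8] + "-" + h[8:12] + "-" + h[12:16] + "-" + h[16:20] + "-" + h[20:32]
}

func main() {
	// Byte values copied from the Uuid fields of the executor's four
	// status updates for the task, in the order they were logged.
	updates := []struct {
		message string
		uuid    [16]byte
	}{
		{"TASK_LOST kill-pod-task", [16]byte{214, 253, 145, 223, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
		{"TASK_KILLED pod-deleted", [16]byte{214, 253, 162, 110, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
		{"TASK_LOST task-lost-ack", [16]byte{214, 255, 28, 217, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
		{"TASK_LOST kill-pod-task", [16]byte{214, 255, 35, 27, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
	}
	for _, u := range updates {
		fmt.Printf("%-24s %s\n", u.message, formatUUID(u.uuid))
	}
}
```

With these inputs the third and fourth lists come out as d6ff1cd9-d424-11e5-9ee0-525400e74246 and d6ff231b-d424-11e5-9ee0-525400e74246, the two UUIDs that appear in the slave log: the slave checkpoints, forwards and acknowledges the former, and hits the CHECK failure on the latter.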



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
