[
https://issues.apache.org/jira/browse/MESOS-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169214#comment-15169214
]
James DeFelice commented on MESOS-4679:
---------------------------------------
Pretty sure that prior versions of Mesos are affected too, but I've since
upgraded and don't remember which earlier versions I saw this on.
> slave dies unexpectedly: Mismatched checkpoint value for status update
> TASK_LOST
> --------------------------------------------------------------------------------
>
> Key: MESOS-4679
> URL: https://issues.apache.org/jira/browse/MESOS-4679
> Project: Mesos
> Issue Type: Bug
> Components: slave
> Affects Versions: 0.26.0
> Reporter: James DeFelice
> Labels: mesosphere
>
> It looks like the custom executor is sending out multiple terminal status
> updates for a specific task and that's crashing the slave (as well as
> possibly mishandling status-update UUIDs?). In any event, I think that the
> slave should handle this case with a bit more aplomb.
> Custom executor logs:
> {code}
> I0215 20:43:59.551657 11068 executor.go:426] Executor driver killTask
> I0215 20:43:59.551719 11068 executor.go:436] Executor driver is asked to
> kill task
> '&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],}'
> I0215 20:43:59.552189 11068 executor.go:687] Executor sending status update
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
> 253 145 223 212 36 17 229 158 224 82 84 0 231 66
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.552599 11068 executor.go:687] Executor sending status update
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
> 253 162 110 212 36 17 229 158 224 82 84 0 231 66
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.557376 11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.559077 11068 executor.go:445] Executor
> statusUpdateAcknowledgement
> I0215 20:43:59.559129 11068 executor.go:448] Receiving status update
> acknowledgement
> &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
> 253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562016 11068 executor.go:470] Executor driver received
> frameworkMessage
> I0215 20:43:59.562073 11068 executor.go:480] Executor driver receives
> framework message
> I0215 20:43:59.562100 11068 executor.go:445] Executor
> statusUpdateAcknowledgement
> I0215 20:43:59.562112 11068 executor.go:448] Receiving status update
> acknowledgement
> &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
> 253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562173 11068 executor.go:579] Receives message from
> framework task-lost:pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f
> I0215 20:43:59.562292 11068 executor.go:687] Executor sending status update
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*task-lost-ack,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
> 255 28 217 212 36 17 229 158 224 82 84 0 231 66
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.562463 11068 executor.go:687] Executor sending status update
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
> 255 35 27 212 36 17 229 158 224 82 84 0 231 66
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.568237 11068 executor.go:445] Executor
> statusUpdateAcknowledgement
> I0215 20:43:59.568286 11068 executor.go:448] Receiving status update
> acknowledgement
> &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
> 255 28 217 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.588373 11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.588566 11068 executor.go:687] Executor sending status update
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6ce1b7db-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
> 3 30 254 212 36 17 229 158 224 82 84 0 231 66
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.595983 11068 executor.go:260] slave disconnected, will wait
> for recovery
> I0215 20:43:59.596040 11068 executor.go:328] Slave is disconnected
> I0215 20:43:59.623678 11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.623841 11068 executor.go:687] Executor sending status update
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6d006a26-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
> 8 128 159 212 36 17 229 158 224 82 84 0 231 66
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.624399 11068 executor.go:284] slave exited ... shutting down
> I0215 20:43:59.624442 11068 executor.go:613] Aborting the executor driver
> {code}
> Slave logs:
> {code}
> I0215 20:43:59.564084 15780 slave.cpp:2762] Handling status update TASK_LOST
> (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006 from executor(1)@10.2.0.6:40672
> W0215 20:43:59.564115 15780 slave.cpp:2856] Could not find the executor for
> status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.564321 15782 status_update_manager.cpp:826] Checkpointing
> UPDATE for status update TASK_LOST (UUID:
> d6ff1cd9-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.566783 15782 status_update_manager.cpp:322] Received status
> update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.566879 15782 slave.cpp:3087] Forwarding the update TASK_LOST
> (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006 to [email protected]:5050
> I0215 20:43:59.566952 15782 slave.cpp:3011] Sending acknowledgement for
> status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006 to executor(1)@10.2.0.6:40672
> F0215 20:43:59.567073 15782 slave.cpp:3003] CHECK_READY(future): is FAILED:
> Mismatched checkpoint value for status update TASK_LOST (UUID:
> d6ff231b-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006 (expected checkpoint=true actual
> checkpoint=false) Failed to handle status update TASK_LOST (UUID:
> d6ff231b-d424-11e5-9ee0-525400e74246) for task
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> {code}
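
The description above suspects that the executor sends more than one terminal status update for the same task. As a rough illustration of how an executor could avoid doing that, here is a minimal Go sketch that drops any update for a task once a terminal state has been sent. It is not the Mesos fix and not the actual k8sm executor code: `TaskState`, `terminalGuard`, `newTerminalGuard`, and `allow` are hypothetical names made up for this example, and the task states are modeled as plain strings rather than the real protobuf enum.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskState is a hypothetical stand-in for the Mesos task state enum.
type TaskState string

const (
	TaskRunning  TaskState = "TASK_RUNNING"
	TaskFinished TaskState = "TASK_FINISHED"
	TaskFailed   TaskState = "TASK_FAILED"
	TaskKilled   TaskState = "TASK_KILLED"
	TaskLost     TaskState = "TASK_LOST"
)

// isTerminal reports whether a state ends the task's lifecycle.
func isTerminal(s TaskState) bool {
	switch s {
	case TaskFinished, TaskFailed, TaskKilled, TaskLost:
		return true
	}
	return false
}

// terminalGuard (hypothetical) remembers, per task ID, the first
// terminal state that was allowed through.
type terminalGuard struct {
	mu   sync.Mutex
	done map[string]TaskState
}

func newTerminalGuard() *terminalGuard {
	return &terminalGuard{done: make(map[string]TaskState)}
}

// allow returns true if an update for taskID may still be sent.
// Once a terminal update has been allowed for a task, every later
// update for that task is rejected.
func (g *terminalGuard) allow(taskID string, s TaskState) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, sent := g.done[taskID]; sent {
		return false
	}
	if isTerminal(s) {
		g.done[taskID] = s
	}
	return true
}

func main() {
	g := newTerminalGuard()
	task := "pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f"
	// Same order of states as the updates for this task in the executor log.
	for _, s := range []TaskState{TaskLost, TaskKilled, TaskLost, TaskLost} {
		fmt.Println(s, g.allow(task, s))
	}
}
```

With a guard like this in front of the send path, only the first TASK_LOST for the task would go out and the three later terminal updates would be dropped. Whether that alone avoids the slave crash is an assumption; the slave-side handling requested in the description would still be needed.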