[ 
https://issues.apache.org/jira/browse/MESOS-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169214#comment-15169214
 ] 

James DeFelice commented on MESOS-4679:
---------------------------------------

I'm fairly sure that earlier versions of Mesos are affected too, but I've since 
upgraded and no longer remember which versions I saw this on.

> slave dies unexpectedly: Mismatched checkpoint value for status update 
> TASK_LOST
> --------------------------------------------------------------------------------
>
>                 Key: MESOS-4679
>                 URL: https://issues.apache.org/jira/browse/MESOS-4679
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.26.0
>            Reporter: James DeFelice
>              Labels: mesosphere
>
> It looks like the custom executor is sending multiple terminal status 
> updates for the same task, and that is crashing the slave (status-update 
> UUIDs may also be mishandled somewhere?). In any event, I think the slave 
> should handle this case with a bit more aplomb.
> Custom executor logs:
> {code}
> I0215 20:43:59.551657   11068 executor.go:426] Executor driver killTask
> I0215 20:43:59.551719   11068 executor.go:436] Executor driver is asked to 
> kill task 
> '&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],}'
> I0215 20:43:59.552189   11068 executor.go:687] Executor sending status update 
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
>  253 145 223 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.552599   11068 executor.go:687] Executor sending status update 
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
>  253 162 110 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.557376   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.559077   11068 executor.go:445] Executor 
> statusUpdateAcknowledgement
> I0215 20:43:59.559129   11068 executor.go:448] Receiving status update 
> acknowledgement 
> &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
>  253 145 223 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562016   11068 executor.go:470] Executor driver received 
> frameworkMessage
> I0215 20:43:59.562073   11068 executor.go:480] Executor driver receives 
> framework message
> I0215 20:43:59.562100   11068 executor.go:445] Executor 
> statusUpdateAcknowledgement
> I0215 20:43:59.562112   11068 executor.go:448] Receiving status update 
> acknowledgement 
> &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
>  253 162 110 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.562173   11068 executor.go:579] Receives message from 
> framework task-lost:pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f
> I0215 20:43:59.562292   11068 executor.go:687] Executor sending status update 
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*task-lost-ack,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
>  255 28 217 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.562463   11068 executor.go:687] Executor sending status update 
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*kill-pod-task,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[214
>  255 35 27 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.568237   11068 executor.go:445] Executor 
> statusUpdateAcknowledgement
> I0215 20:43:59.568286   11068 executor.go:448] Receiving status update 
> acknowledgement 
> &StatusUpdateAcknowledgementMessage{SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},TaskId:&TaskID{Value:*pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},Uuid:*[214
>  255 28 217 212 36 17 229 158 224 82 84 0 231 66 70],XXX_unrecognized:[],}
> I0215 20:43:59.588373   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.588566   11068 executor.go:687] Executor sending status update 
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6ce1b7db-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
>  3 30 254 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.595983   11068 executor.go:260] slave disconnected, will wait 
> for recovery
> I0215 20:43:59.596040   11068 executor.go:328] Slave is disconnected
> I0215 20:43:59.623678   11068 suicide.go:51] stopping suicide watch
> I0215 20:43:59.623841   11068 executor.go:687] Executor sending status update 
> &StatusUpdate{FrameworkId:&FrameworkID{Value:*df95a79b-d6d4-4b96-853e-55686628e898-0006,XXX_unrecognized:[],},ExecutorId:&ExecutorID{Value:*31df9d040f057abd_k8sm-executor,XXX_unrecognized:[],},SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Status:&TaskStatus{TaskId:&TaskID{Value:*pod.6d006a26-d1db-11e5-8a9a-525400309a8f,XXX_unrecognized:[],},State:*TASK_KILLED,Data:nil,Message:*pod-deleted,SlaveId:&SlaveID{Value:*20150628-154106-117441034-5050-1315-S2,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,ExecutorId:nil,Healthy:nil,Source:nil,Reason:nil,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],},Timestamp:*1.455569039e+09,Uuid:*[215
>  8 128 159 212 36 17 229 158 224 82 84 0 231 66 
> 70],LatestState:nil,XXX_unrecognized:[],}
> I0215 20:43:59.624399   11068 executor.go:284] slave exited ... shutting down
> I0215 20:43:59.624442   11068 executor.go:613] Aborting the executor driver
> {code}
> Slave logs:
> {code}
> I0215 20:43:59.564084 15780 slave.cpp:2762] Handling status update TASK_LOST 
> (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006 from executor(1)@10.2.0.6:40672
> W0215 20:43:59.564115 15780 slave.cpp:2856] Could not find the executor for 
> status update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.564321 15782 status_update_manager.cpp:826] Checkpointing 
> UPDATE for status update TASK_LOST (UUID: 
> d6ff1cd9-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.566783 15782 status_update_manager.cpp:322] Received status 
> update TASK_LOST (UUID: d6ff231b-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> I0215 20:43:59.566879 15782 slave.cpp:3087] Forwarding the update TASK_LOST 
> (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006 to [email protected]:5050
> I0215 20:43:59.566952 15782 slave.cpp:3011] Sending acknowledgement for 
> status update TASK_LOST (UUID: d6ff1cd9-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006 to executor(1)@10.2.0.6:40672
> F0215 20:43:59.567073 15782 slave.cpp:3003] CHECK_READY(future): is FAILED: 
> Mismatched checkpoint value for status update TASK_LOST (UUID: 
> d6ff231b-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006 (expected checkpoint=true actual 
> checkpoint=false) Failed to handle status update TASK_LOST (UUID: 
> d6ff231b-d424-11e5-9ee0-525400e74246) for task 
> pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f of framework 
> df95a79b-d6d4-4b96-853e-55686628e898-0006
> {code}
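
To correlate the two logs: the executor prints each status-update UUID as a 
list of decimal bytes, while the slave prints the same UUID in hex. The sketch 
below converts the four UUIDs that the executor log shows for task 
pod.1e4f9fbe-d1db-11e5-8a9a-525400309a8f. The helper name formatUUID and the 
labels are mine, for illustration only; they are not part of mesos-go or Mesos.

```go
package main

import "fmt"

// formatUUID renders 16 raw bytes in the canonical 8-4-4-4-12 hex form,
// which is how the slave log prints status-update UUIDs.
// (Hypothetical helper, written for this comment only.)
func formatUUID(b [16]byte) string {
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

func main() {
	// Byte values copied from the executor log above, in the order sent.
	updates := []struct {
		label string
		uuid  [16]byte
	}{
		{"TASK_LOST   kill-pod-task", [16]byte{214, 253, 145, 223, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
		{"TASK_KILLED pod-deleted  ", [16]byte{214, 253, 162, 110, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
		{"TASK_LOST   task-lost-ack", [16]byte{214, 255, 28, 217, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
		{"TASK_LOST   kill-pod-task", [16]byte{214, 255, 35, 27, 212, 36, 17, 229, 158, 224, 82, 84, 0, 231, 66, 70}},
	}
	for _, u := range updates {
		fmt.Printf("%s  %s\n", u.label, formatUUID(u.uuid))
	}
}
```

By this conversion, the third update (task-lost-ack) is 
d6ff1cd9-d424-11e5-9ee0-525400e74246, the one the slave checkpoints, forwards 
and acknowledges, and the fourth (kill-pod-task) is 
d6ff231b-d424-11e5-9ee0-525400e74246, the one named in the "Could not find the 
executor" warning and in the failed CHECK_READY. That seems consistent with 
repeated terminal updates for one task being the trigger.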



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
