James DeFelice created MESOS-4565:
-------------------------------------
Summary: slave recovers and attempts to destroy executor's child
containers, then begins rejecting task status updates
Key: MESOS-4565
URL: https://issues.apache.org/jira/browse/MESOS-4565
Project: Mesos
Issue Type: Bug
Affects Versions: 0.26.0
Reporter: James DeFelice
AFAICT the slave is doing this:
1) recovering from some kind of failure
2) checking the containers that it pulled from its state store
3) complaining about cgroup children hanging off of executor containers
4) rejecting task status updates related to the executor container, the first
of which in the logs is:
{code}
E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for
container 1d965a20-849c-40d8-9446-27cb723220a9 of executor
'd701ab48a0c0f13_k8sm-executor' running task
pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task,
destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not found
{code}
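For triage, errors of this kind can be pulled out of a slave log with a small script. The following is a minimal sketch, not part of Mesos: the function name `find_missing_containers` is hypothetical, the pattern is derived only from the single message quoted above, and it assumes each log record sits on one line in the actual log file (the quote above is wrapped by the mailer).

```python
import re

# Pattern built from the one error message quoted in this report.
# Assumption: in the real log file the whole record is on a single line.
PATTERN = re.compile(
    r"Failed to update resources for container (?P<container>[0-9a-f-]+) "
    r"of executor '(?P<executor>[^']+)' running task (?P<task>\S+) "
    r"on status update for terminal task, destroying container: "
    r"Container '(?P<missing>[0-9a-f-]+)' not found"
)


def find_missing_containers(lines):
    """Return (container, executor, task) for every matching log line."""
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            hits.append((m.group("container"), m.group("executor"), m.group("task")))
    return hits


# The quoted message, re-joined into a single line for the example.
sample = (
    "E0130 02:22:21.979852 12683 slave.cpp:2963] Failed to update resources for "
    "container 1d965a20-849c-40d8-9446-27cb723220a9 of executor "
    "'d701ab48a0c0f13_k8sm-executor' running task "
    "pod.f2dc2c43-c6f7-11e5-ad28-0ad18c5e6c7f on status update for terminal task, "
    "destroying container: Container '1d965a20-849c-40d8-9446-27cb723220a9' not found"
)

if __name__ == "__main__":
    for container, executor, task in find_missing_containers([sample]):
        print(container, executor, task)
```

Passing an open log file instead of `[sample]` would list every rejected update, which makes it easier to check whether they all refer to the same container.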
To be fair, I don't believe that my custom executor is re-registering properly
with the slave prior to attempting to send these (failing) status updates. But
the slave doesn't complain about that; instead, it complains that it can't find
the **container**.
Full slave log:
https://gist.github.com/jdef/265663461156b7a7ed4e