A. Dukhovniy created MESOS-8756:
-----------------------------------
Summary: Missing reasons for early task failures
Key: MESOS-8756
URL: https://issues.apache.org/jira/browse/MESOS-8756
Project: Mesos
Issue Type: Bug
Components: executor, master
Affects Versions: 1.6.0
Reporter: A. Dukhovniy
The reasons for some early task failures are not propagated to the framework.
Here is an example of a Marathon pod (Mesos containerizer) definition with a
non-existing image:
{code:java}
{
  "id": "/fail",
  "containers": [
    {
      "name": "container-1",
      "resources": {
        "cpus": 0.1,
        "mem": 128
      },
      "image": {
        "id": "non-existing-image-56789",
        "kind": "DOCKER"
      }
    }
  ],
  "scaling": {
    "instances": 1,
    "kind": "fixed"
  },
  "networks": [
    {
      "mode": "host"
    }
  ],
  "volumes": [],
  "fetch": [],
  "scheduling": {
    "placement": {
      "constraints": []
    }
  }
}
{code}
In this case, the status update the framework receives is
{{TASK_FAILED (Executor terminated)}}.
Here is another example, where a non-existing artifact is being fetched:
{code:java}
{
  "id": "/fail2",
  "containers": [
    {
      "name": "container-1",
      "resources": {
        "cpus": 0.1,
        "mem": 128
      },
      "image": {
        "id": "nginx",
        "kind": "DOCKER",
        "forcePull": false
      },
      "artifacts": [
        {
          "uri": "http://example.com/smth-non-existing-12345.tar.gz"
        }
      ]
    }
  ],
  "scaling": {
    "instances": 1,
    "kind": "fixed"
  },
  "networks": [
    {
      "mode": "host"
    }
  ],
  "volumes": [],
  "fetch": [],
  "scheduling": {
    "placement": {
      "constraints": []
    }
  }
}
{code}
This results in the same status update as above.
Frameworks (and their users) should always receive meaningful task failure
reasons, no matter where those failures happen. Otherwise, the only way to
find out what happened is to grep the agent logs.
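
To illustrate the problem from the framework's side, here is a minimal sketch of how a scheduler might summarize a failed task. The function name {{describe_failure}}, the dict layout (loosely modelled on the {{state}}, {{message}} and {{reason}} fields of a task status), and the sample values are assumptions made for illustration, not captured output. Because both pods above report the same generic message, the framework cannot tell a missing image from a missing artifact.

```python
# Hypothetical sketch: how a framework might summarize a failed task's status.
# The dict layout and the sample values are illustrative assumptions; only the
# "TASK_FAILED (Executor terminated)" text comes from the report above.

GENERIC_MESSAGES = {"Executor terminated"}


def describe_failure(status):
    """Return (summary, actionable) for a task status given as a dict."""
    state = status.get("state", "UNKNOWN")
    message = status.get("message", "")
    reason = status.get("reason", "")
    # The failure is only actionable if the message says more than the
    # generic text that every early failure currently produces.
    actionable = bool(message) and message not in GENERIC_MESSAGES
    summary = f"{state} ({message or 'no message'})"
    if reason:
        summary += f" [{reason}]"
    return summary, actionable


# Both failing pods from the report yield the same status update.
missing_image = {"state": "TASK_FAILED", "message": "Executor terminated"}
missing_artifact = {"state": "TASK_FAILED", "message": "Executor terminated"}

if __name__ == "__main__":
    for pod_id, status in [("/fail", missing_image), ("/fail2", missing_artifact)]:
        summary, actionable = describe_failure(status)
        print(pod_id, summary, "actionable" if actionable else "not actionable")
```

With a more specific message or reason in the status update, the same function would flag the failure as actionable without anyone having to read the agent logs.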