Bruce Merry created MESOS-8129:
----------------------------------
Summary: Very large resource value crashes master
Key: MESOS-8129
URL: https://issues.apache.org/jira/browse/MESOS-8129
Project: Mesos
Issue Type: Bug
Components: agent, master
Affects Versions: 1.4.0
Environment: Ubuntu 14.04
Both apt packages from Mesosphere repo and Docker images
Reporter: Bruce Merry
Priority: Minor
I ran into a master that kept failing on this CHECK when destroying a task:
https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
I found that a combination of a misconfiguration and a suboptimal choice of
units had led to an agent with a custom scalar resource of capacity
4294967295000000. I believe what is happening is that the pseudo-fixed-point
arithmetic can't cope with such large numbers, because rounding errors
after arithmetic are bigger than 0.001. Examining the values in the debugger
showed that the CHECK failed due to a rounding error on the order of 0.2.
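To illustrate the rounding problem, here is a minimal Python sketch. It is a simplified model and not the actual Mesos code: it assumes scalars are rounded to three decimal places for arithmetic and then stored back as doubles. The names AGENT_TOTAL, TASK_AMOUNT and to_fixed are made up for the example.

```python
import math
from fractions import Fraction

AGENT_TOTAL = 4294967295000000.0  # "thing" capacity from this report
TASK_AMOUNT = 12465430.06012024   # "thing" amount used by the task

def to_fixed(value: float) -> int:
    # Assumed model: keep three decimal places as an integer count of 0.001.
    return round(value * 1000)

# Gap between neighbouring doubles at the agent's capacity (Python 3.9+).
spacing = math.ulp(AGENT_TOTAL)

# Subtract in exact integer arithmetic, then store the result as a double.
remaining_fixed = to_fixed(AGENT_TOTAL) - to_fixed(TASK_AMOUNT)
stored = remaining_fixed / 1000

# Difference between the stored double and the exact fixed-point result.
error = abs(Fraction(stored) - Fraction(remaining_fixed, 1000))

print(f"spacing between doubles: {spacing}")
print(f"error after one subtraction: {float(error):.3f}")
```

At this magnitude neighbouring doubles are 0.5 apart, so a result stored as a double cannot be accurate to 0.001 even when the intermediate integer arithmetic is exact.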
While this is probably a fundamental limitation of the fixed-point
implementation and such large resource values are probably a bad idea, it would
have helped if the agent had complained on startup, rather than having to debug
an internal assertion failure. I'd suggest that values larger than, say, 10^12
should be rejected when the agent starts (which is why I've added the agent
component), although someone familiar with the details of the fixed-point
implementation should probably verify that number.
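As a rough illustration of the kind of startup check I have in mind, here is a Python sketch. It assumes a precision of 0.001 for scalar resources; is_representable and PRECISION are hypothetical names, not part of Mesos, and a real limit would probably need extra headroom for accumulated error.

```python
import math

PRECISION = 0.001  # assumed granularity of scalar resources

def is_representable(value: float, precision: float = PRECISION) -> bool:
    # True if neighbouring doubles near `value` are closer together than
    # `precision`, so a value stored as a double can still resolve it.
    # math.ulp requires Python 3.9 or later.
    return math.ulp(value) < precision

for capacity in (1e12, 4294967295000000.0):
    verdict = "accept" if is_representable(capacity) else "reject"
    print(f"{capacity:.0f}: {verdict}")
```

With these assumptions 10^12 passes and 4294967295000000 is rejected, which is consistent with the limit suggested above, but the exact cut-off still depends on how the fixed-point code actually rounds.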
I'm not sure where this needs to be fixed, e.g. whether it can just be
validated on agent startup or whether it should be baked into the Resource
class to prevent accidents in requests from the user.
To reproduce the issue, start a master and an agent with a custom scalar
resource "thing:4294967295000000", then use mesos-execute to throw the
following task at it (it'll probably also work with a smaller Docker image -
that's just one I already had on the agent). When the sleep ends, the master
crashes.
{code:javascript}
{
  "container": {
    "docker": {
      "image": "ubuntu:xenial-20161010"
    },
    "type": "DOCKER"
  },
  "name": "test-task",
  "task_id": {
    "value": "00000001"
  },
  "command": {
    "shell": false,
    "value": "sleep",
    "arguments": [
      "10"
    ]
  },
  "agent_id": {
    "value": ""
  },
  "resources": [
    {
      "scalar": {
        "value": 1
      },
      "type": "SCALAR",
      "name": "cpus"
    },
    {
      "scalar": {
        "value": 4106.0
      },
      "type": "SCALAR",
      "name": "mem"
    },
    {
      "scalar": {
        "value": 12465430.06012024
      },
      "type": "SCALAR",
      "name": "thing"
    }
  ]
}
{code}