Bruce Merry created MESOS-8129:
----------------------------------

             Summary: Very large resource value crashes master
                 Key: MESOS-8129
                 URL: https://issues.apache.org/jira/browse/MESOS-8129
             Project: Mesos
          Issue Type: Bug
          Components: agent, master
    Affects Versions: 1.4.0
         Environment: Ubuntu 14.04
Both apt packages from Mesosphere repo and Docker images
            Reporter: Bruce Merry
            Priority: Minor


I ran into a master that kept failing on this CHECK when destroying a task:
https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367

I found that a combination of a misconfiguration and a suboptimal choice of 
units had let to an agent with a custom scalar resource of capacity 
4294967295000000. I believe what is happening is the pseudo-fixed-point 
arithmetic isn't able to cope with such large numbers, because rounding errors 
after arithmetic are bigger than 0.001. Examining the values in the debugger 
that the CHECK failed due to a rounding error on the order of 0.2.

While this is probably a fundamental limitation of the fixed-point 
implementation and such large resource values are probably a bad idea, it would 
have helped if the agent had complained on startup, rather than having to debug 
an internal assertion failure. I'd suggest that values larger than, say, 10^12 
should be rejected when the agent starts (which is why I've added the agent 
component), although someone familiar with the details of the fixed-point 
implementation should probably verify that number.

I'm not sure where this needs to be fixed e.g. if it can just be validated on 
agent startup or if it should be baked into the Resource class to prevent 
accidents in requests from the user.

To reproduce the issue, start a master and an agent with a custom scalar 
resource "thing:4294967295000000", then use mesos-execute to throw the 
following task at it (it'll probably also work with a smaller Docker image - 
that's just one I already had on the agent). When the sleep ends, the master 
crashes.

{code:javascript}
{
  "container": {
    "docker": {
      "image": "ubuntu:xenial-20161010"
    }, 
    "type": "DOCKER"
  }, 
  "name": "test-task", 
  "task_id": {
    "value": "00000001"
  }, 
  "command": {
    "shell": false, 
    "value": "sleep", 
    "arguments": [
      "10"
    ]
  }, 
  "agent_id": {
    "value": ""
  }, 
  "resources": [
    {
      "scalar": {
        "value": 1
      }, 
      "type": "SCALAR", 
      "name": "cpus"
    }, 
    {
      "scalar": {
        "value": 4106.0
      }, 
      "type": "SCALAR", 
      "name": "mem"
    }, 
    {
      "scalar": {
        "value": 12465430.06012024
      }, 
      "type": "SCALAR", 
      "name": "thing"
    }
  ]
}
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to