[ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kone reassigned MESOS-8129:
---------------------------------

    Assignee: Benjamin Mahler

> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of
> units had led to an agent with a custom scalar resource of capacity
> 4294967295000000. I believe what is happening is that the pseudo-fixed-point
> arithmetic isn't able to cope with such large numbers, because rounding
> errors after arithmetic are bigger than 0.001. Examining the values in the
> debugger showed that the CHECK failed due to a rounding error on the order
> of 0.2.
>
> While this is probably a fundamental limitation of the fixed-point
> implementation and such large resource values are probably a bad idea, it
> would have helped if the agent had complained on startup, rather than my
> having to debug an internal assertion failure. I'd suggest that values
> larger than, say, 10^12 should be rejected when the agent starts (which is
> why I've added the agent component), although someone familiar with the
> details of the fixed-point implementation should probably verify that number.
>
> I'm not sure where this needs to be fixed, e.g. whether it can just be
> validated on agent startup or whether it should be baked into the Resource
> class to prevent accidents in requests from the user.
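To make the rounding argument above concrete, here is a minimal sketch in Python. It is not the Mesos code. It assumes a fixed-point scheme that keeps three decimal digits (scale 1000, inferred from the 0.001 figure in the report) and stores every result back in a double. The helper names `to_fixed`, `to_floating`, `subtract` and `check_scalar` are hypothetical, and `check_scalar` just applies the 10^12 limit proposed in the report, which is still unverified against the real implementation.

```python
import math
from fractions import Fraction

SCALE = 1000  # assumed: three decimal digits of fixed-point precision


def to_fixed(value: float) -> int:
    """Hypothetical conversion to an integer count of thousandths."""
    return round(value * SCALE)


def to_floating(fixed: int) -> float:
    """Hypothetical inverse: the result has to fit back into a double."""
    return float(fixed // SCALE) + (fixed % SCALE) / SCALE


def subtract(a: float, b: float) -> float:
    """Subtract in fixed point, then store the result as a double."""
    return to_floating(to_fixed(a) - to_fixed(b))


def check_scalar(value: float, limit: float = 1e12) -> None:
    """Hypothetical startup check using the 10^12 limit from the report."""
    if abs(value) > limit:
        raise ValueError(f"scalar resource {value!r} exceeds {limit!r}")


total = 4294967295000000.0   # agent capacity from the report
task = 12465430.06012024     # "thing" requested by the task

# Spacing between adjacent doubles at each magnitude (Python 3.9+).
spacing_at_total = math.ulp(total)
spacing_at_limit = math.ulp(1e12)

# Exact fixed-point difference versus what survives the trip through a double.
exact = Fraction(to_fixed(total) - to_fixed(task), SCALE)
stored = subtract(total, task)
error = abs(Fraction(stored) - exact)

print("double spacing near capacity:", spacing_at_total)
print("double spacing near 10^12:   ", spacing_at_limit)
print("error after one subtraction: ", float(error))
```

Under these assumptions the spacing between adjacent doubles near the reported capacity is far coarser than 0.001, so a single subtraction already loses more than the precision the arithmetic is meant to preserve. Doubles only resolve 0.001 below roughly 2^43 (about 8.8e12), so a 10^12 limit would at least satisfy that necessary condition. Whether that is sufficient for the real implementation still needs checking by someone who knows it.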
>
> To reproduce the issue, start a master and an agent with a custom scalar
> resource "thing:4294967295000000", then use mesos-execute to throw the
> following task at it (it'll probably also work with a smaller Docker image -
> that's just one I already had on the agent). When the sleep ends, the master
> crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     },
>     "type": "DOCKER"
>   },
>   "name": "test-task",
>   "task_id": {
>     "value": "00000001"
>   },
>   "command": {
>     "shell": false,
>     "value": "sleep",
>     "arguments": [
>       "10"
>     ]
>   },
>   "agent_id": {
>     "value": ""
>   },
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       },
>       "type": "SCALAR",
>       "name": "cpus"
>     },
>     {
>       "scalar": {
>         "value": 4106.0
>       },
>       "type": "SCALAR",
>       "name": "mem"
>     },
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       },
>       "type": "SCALAR",
>       "name": "thing"
>     }
>   ]
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)