[ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kone reassigned MESOS-8129:
---------------------------------
Assignee: Benjamin Mahler
> Very large resource value crashes master
> ----------------------------------------
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
> Issue Type: Bug
> Components: agent, master
> Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
> Reporter: Bruce Merry
> Assignee: Benjamin Mahler
> Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of
> units had led to an agent with a custom scalar resource of capacity
> 4294967295000000. I believe what is happening is the pseudo-fixed-point
> arithmetic isn't able to cope with such large numbers, because rounding
> errors after arithmetic are bigger than 0.001. Examining the values in the
> debugger showed that the CHECK failed due to a rounding error on the order
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point
> implementation and such large resource values are probably a bad idea, it
> would have helped if the agent had complained on startup, rather than having
> to debug an internal assertion failure. I'd suggest that values larger than,
> say, 10^12 should be rejected when the agent starts (which is why I've added
> the agent component), although someone familiar with the details of the
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed, e.g. whether it can just be
> validated on agent startup or whether it should be baked into the Resource
> class to prevent accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar
> resource "thing:4294967295000000", then use mesos-execute to throw the
> following task at it (it'll probably also work with a smaller Docker image -
> that's just one I already had on the agent). When the sleep ends, the master
> crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     },
>     "type": "DOCKER"
>   },
>   "name": "test-task",
>   "task_id": {
>     "value": "00000001"
>   },
>   "command": {
>     "shell": false,
>     "value": "sleep",
>     "arguments": [
>       "10"
>     ]
>   },
>   "agent_id": {
>     "value": ""
>   },
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       },
>       "type": "SCALAR",
>       "name": "cpus"
>     },
>     {
>       "scalar": {
>         "value": 4106.0
>       },
>       "type": "SCALAR",
>       "name": "mem"
>     },
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       },
>       "type": "SCALAR",
>       "name": "thing"
>     }
>   ]
> }
> {code}
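
The precision argument in the description can be checked numerically. What follows is a minimal Python sketch, not Mesos code. It assumes that scalar values are held as IEEE 754 doubles and that the fixed-point scheme keeps three decimal places (the 0.001 figure quoted above). The names PRECISION, double_spacing and fits_fixed_point are hypothetical, and math.ulp requires Python 3.9 or later.

```python
import math

# Assumed precision of the fixed-point representation (three decimal
# places), taken from the 0.001 figure in the report. This is an
# assumption for illustration, not a constant read from the Mesos source.
PRECISION = 0.001


def double_spacing(value: float) -> float:
    """Gap between `value` and the next larger representable double."""
    return math.ulp(value)


def fits_fixed_point(value: float, precision: float = PRECISION) -> bool:
    """True if doubles near `value` are spaced no wider than `precision`."""
    return double_spacing(value) <= precision


if __name__ == "__main__":
    # The proposed 10^12 limit and the capacity from the report.
    for capacity in (1e12, 4294967295000000.0):
        print(capacity, double_spacing(capacity), fits_fixed_point(capacity))
```

For 4294967295000000 the spacing between adjacent doubles is 0.5, so a single rounding step can be off by up to 0.25, which is consistent with the error of roughly 0.2 seen in the debugger. At 10^12 the spacing is well below 0.001, so the suggested limit looks plausible under these assumptions, although the actual bound depends on how the fixed-point conversion is implemented.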