[ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8129:
---------------------------------

    Assignee: Benjamin Mahler

> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 4294967295000000. I believe what is happening is that the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order 
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed, e.g. whether it can just be 
> validated on agent startup or whether it should be baked into the Resource 
> class to prevent accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:4294967295000000", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     }, 
>     "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
>     "value": "00000001"
>   }, 
>   "command": {
>     "shell": false, 
>     "value": "sleep", 
>     "arguments": [
>       "10"
>     ]
>   }, 
>   "agent_id": {
>     "value": ""
>   }, 
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       }, 
>       "type": "SCALAR", 
>       "name": "cpus"
>     }, 
>     {
>       "scalar": {
>         "value": 4106.0
>       }, 
>       "type": "SCALAR", 
>       "name": "mem"
>     }, 
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       }, 
>       "type": "SCALAR", 
>       "name": "thing"
>     }
>   ]
> }
> {code}
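
The rounding argument in the description can be checked with a small sketch. It is a simplified model in Python, not the Mesos implementation: it assumes resource scalars are held as IEEE 754 doubles and that the fixed-point scheme keeps three decimal places (the 0.001 mentioned in the report). The helper names `double_spacing`, `loses_fixed_point_precision` and `subtraction_error` are hypothetical.

```python
import math
from fractions import Fraction

# Assumption taken from the report: scalars are compared at a precision of 0.001.
PRECISION = 0.001


def double_spacing(value: float) -> float:
    """Gap between adjacent doubles near a positive finite value."""
    _, exponent = math.frexp(value)  # value = m * 2**exponent, 0.5 <= m < 1
    return math.ldexp(1.0, exponent - 53)  # doubles carry a 53-bit significand


def loses_fixed_point_precision(value: float) -> bool:
    """True when doubles near value cannot resolve steps of PRECISION."""
    return double_spacing(value) > PRECISION


def subtraction_error(total: float, used: float) -> float:
    """Absolute difference between the exact and the double result of total - used."""
    exact = Fraction(total) - Fraction(used)  # Fraction(float) is exact
    return float(abs(exact - Fraction(total - used)))


total = 4294967295000000.0  # agent capacity from the report
task = 12465430.06012024    # "thing" request from the reproduction task

print("spacing near total:", double_spacing(total))
print("spacing near 10^12:", double_spacing(1e12))
print("error in total - task:", subtraction_error(total, task))
```

Under these assumptions, adjacent doubles near the reported capacity are 0.5 apart, so a single subtraction can be off by up to 0.25, which is consistent with the error "on the order of 0.2" seen in the debugger. Near 10^12 the spacing is still below 0.001, although this model does not account for any intermediate scaling inside the real fixed-point code.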

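The startup validation suggested in the description could look roughly like the following. This is a hypothetical Python sketch, not Mesos code: the function name `parse_scalar_resources` is invented for illustration, only scalar entries of the form `name:value` separated by `;` are handled, and the 10^12 limit is the reporter's unverified suggestion.

```python
# Limit suggested in the report; it has not been checked against the
# actual fixed-point implementation.
MAX_SCALAR = 10 ** 12


def parse_scalar_resources(spec: str) -> dict:
    """Parse 'name:value;name:value' and reject oversized scalar values."""
    resources = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        name, _, raw = item.partition(":")
        value = float(raw)  # raises ValueError on malformed numbers
        if value > MAX_SCALAR:
            raise ValueError(
                f"resource '{name.strip()}' has value {raw.strip()}, "
                f"which exceeds the limit of {MAX_SCALAR}"
            )
        resources[name.strip()] = value
    return resources


if __name__ == "__main__":
    print(parse_scalar_resources("cpus:4;mem:4106"))
    try:
        parse_scalar_resources("thing:4294967295000000")
    except ValueError as error:
        print("rejected:", error)
```

Failing at parse time would surface the misconfiguration when the agent starts, instead of as a CHECK failure in the master later on. Whether the check belongs in the agent or in the Resource class is left open, as in the description.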


