[jira] [Commented] (MESOS-8129) Very large resource value crashes master

Bruce Merry (JIRA) Mon, 13 Nov 2017 22:05:25 -0800

    [ 
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250928#comment-16250928
 ]


Bruce Merry commented on MESOS-8129:
------------------------------------

> Also, I'm curious what your use case is, can you tell me?

We have tasks with high but very predictable network bandwidth. I create a 
resource for incoming and outgoing bandwidth on each interface (not isolated). 
I used bits/second as the unit, because I was tired of writing multiplications 
and divisions by 10^6 for "mem" resources, which is why the numbers get high. 
Then a script sets the resources for each agent by reading 
/sys/class/net/<interface>/speed and multiplying by 10^6. It turns out that 
unplugging the NIC causes that file to contain 4294967295 (2^32 - 1), which 
resulted in a resource being set as 4294967295000000.
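
A minimal sketch of a guard such a script could apply, in Python. The function 
name and the choice to return None are assumptions made for illustration, not 
the actual script; the only facts taken from the text above are the Mb/s to 
bits/second multiplication and the observed value 4294967295.

```python
from typing import Optional

# Value observed in /sys/class/net/<interface>/speed with the NIC unplugged
# (2**32 - 1, i.e. -1 read as an unsigned 32-bit integer).
UNKNOWN_SPEED = 4294967295


def bandwidth_bits_per_second(raw_speed: str) -> Optional[int]:
    """Convert the contents of a 'speed' file (Mb/s) to bits/second.

    Returns None when the link speed is unknown, so the caller can skip
    the resource instead of advertising a bogus capacity.
    (Hypothetical helper, not part of Mesos.)
    """
    try:
        mbps = int(raw_speed.strip())
    except ValueError:
        return None
    # Treat the unsigned sentinel and any non-positive value as "unknown".
    if mbps <= 0 or mbps == UNKNOWN_SPEED:
        return None
    return mbps * 10**6
```

With this check, a 40Gb/s link reporting "40000" yields 40000000000, while the 
unplugged-NIC value yields None rather than 4294967295000000.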

Here's my theoretical analysis, assuming values are stored as IEEE 754 doubles 
(53-bit significand). Let's say that we set the limit to X, which is a power of 
2. Numbers slightly less than X have X/2 as the implicit 1, and X/2^53 as ULP. 
Thus, rounding a multiple of 0.001 to the nearest float introduces error up to 
X/2^54. Adding/subtracting two such values then has error up to X/2^53. This 
needs to be less than 0.0005, so that the result can still be rounded to the 
proper multiple of 0.001. That means X < 0.0005 * 2^53, or about 4.5 * 10^12, 
and the largest power-of-2 X satisfying this is 2^42 (about 4.4 * 10^12).
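
The bound can be checked numerically. The following is only a sketch in 
Python, assuming IEEE 754 doubles; the helper names are made up for this 
example. It finds the largest power of 2 satisfying X/2^53 < 0.0005 and also 
shows the ULP at the value that triggered the crash.

```python
import math
from fractions import Fraction

MANTISSA_BITS = 53  # significand precision of an IEEE 754 double


def ulp(x: float) -> float:
    """Spacing between x and the next larger double (x > 0, normal)."""
    _, exponent = math.frexp(x)  # x == m * 2**exponent with 0.5 <= m < 1
    return math.ldexp(1.0, exponent - MANTISSA_BITS)


def largest_safe_exponent(tolerance: Fraction = Fraction(1, 2000)) -> int:
    """Largest k such that X = 2**k satisfies X / 2**53 < tolerance."""
    k = 0
    while Fraction(2 ** (k + 1), 2 ** MANTISSA_BITS) < tolerance:
        k += 1
    return k


print(largest_safe_exponent())   # 42
print(ulp(4294967295000000.0))   # 0.5
```

A ULP of 0.5 at 4294967295000000 means a single rounding can be off by up to 
0.25, far more than the 0.0005 budget.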

I agree that a more conservative limit of 2^40 (about 1.1 * 10^12) is probably 
a better idea, and would still be large enough for my use case (we have 40Gb/s 
NICs, so we'd still have more than 10x headroom).
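
For illustration, a check against that limit could look roughly like the 
Python sketch below. This is not Mesos code (Mesos is written in C++); the 
function name, the error message, and the decision to accept the limit itself 
are all assumptions.

```python
MAX_SCALAR = 2.0 ** 40  # conservative limit discussed above


def validate_scalar(name: str, value: float) -> None:
    """Raise ValueError if a scalar resource is outside the safe range.

    Hypothetical helper: rejects negative values, NaN, and anything
    larger than 2^40.
    """
    if not (0.0 <= value <= MAX_SCALAR):
        raise ValueError(
            f"resource {name!r}: value {value!r} is outside [0, 2^40]"
        )
```

A 40Gb/s link expressed in bits/second (40e9) passes this check, while 
4294967295000000 is rejected up front instead of reaching the allocator.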

> Bruce Merry would you be able to send a pull request with the added 
> validation?

Eventually maybe, but the next several months are a crunch time for us so I 
definitely won't be able to until April, and there is still another issue where 
I want to contribute code first. There is also still the question of where the 
validation should happen: only in agent startup, or also in processing client 
requests?

> Very large resource value crashes master
> ----------------------------------------
>
>                 Key: MESOS-8129
>                 URL: https://issues.apache.org/jira/browse/MESOS-8129
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.4.0
>         Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
>            Reporter: Bruce Merry
>            Assignee: Benjamin Mahler
>            Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of 
> units had led to an agent with a custom scalar resource of capacity 
> 4294967295000000. I believe what is happening is the pseudo-fixed-point 
> arithmetic isn't able to cope with such large numbers, because rounding 
> errors after arithmetic are bigger than 0.001. Examining the values in the 
> debugger showed that the CHECK failed due to a rounding error on the order 
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point 
> implementation and such large resource values are probably a bad idea, it 
> would have helped if the agent had complained on startup, rather than having 
> to debug an internal assertion failure. I'd suggest that values larger than, 
> say, 10^12 should be rejected when the agent starts (which is why I've added 
> the agent component), although someone familiar with the details of the 
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on 
> agent startup or if it should be baked into the Resource class to prevent 
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar 
> resource "thing:4294967295000000", then use mesos-execute to throw the 
> following task at it (it'll probably also work with a smaller Docker image - 
> that's just one I already had on the agent). When the sleep ends, the master 
> crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     }, 
>     "type": "DOCKER"
>   }, 
>   "name": "test-task", 
>   "task_id": {
>     "value": "00000001"
>   }, 
>   "command": {
>     "shell": false, 
>     "value": "sleep", 
>     "arguments": [
>       "10"
>     ]
>   }, 
>   "agent_id": {
>     "value": ""
>   }, 
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       }, 
>       "type": "SCALAR", 
>       "name": "cpus"
>     }, 
>     {
>       "scalar": {
>         "value": 4106.0
>       }, 
>       "type": "SCALAR", 
>       "name": "mem"
>     }, 
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       }, 
>       "type": "SCALAR", 
>       "name": "thing"
>     }
>   ]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
