[
https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250928#comment-16250928
]
Bruce Merry commented on MESOS-8129:
------------------------------------
> Also, I'm curious what your use case is, can you tell me?
We have tasks with high but very predictable network bandwidth. I create a
resource for incoming and outgoing bandwidth on each interface (not isolated).
I used bits/second as the unit, because I was tired of writing multiplications
and divisions by 10^6 for "mem" resources, which is why the numbers get high.
Then a script sets the resources for each agent by reading
/sys/class/net/<interface>/speed and multiplying by 10^6. It turns out that
unplugging the NIC causes that file to contain 4294967295, which resulted in a
resource being set as 4294967295000000.
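A guard in that script would have avoided advertising the bogus value. The following is only a sketch of such a conversion: `bandwidth_bps` and `UNKNOWN_SPEED` are hypothetical names, and treating non-positive readings as "unknown" is an assumption beyond what is described above.

```python
from typing import Optional

# Value seen in /sys/class/net/<interface>/speed with the NIC unplugged
# (as reported above); it equals 2**32 - 1.
UNKNOWN_SPEED = 4294967295


def bandwidth_bps(speed_text: str) -> Optional[int]:
    """Convert the contents of a `speed` file (Mb/s) to bits/second.

    Returns None when the link speed looks unknown, so the caller can
    refuse to advertise the resource instead of passing on garbage.
    """
    mbps = int(speed_text.strip())
    # Assumption: non-positive readings are also treated as "unknown".
    if mbps <= 0 or mbps >= UNKNOWN_SPEED:
        return None
    return mbps * 10**6
```

Without the guard, the unplugged reading becomes 4294967295 * 10^6 = 4294967295000000, which is the capacity that triggered the crash.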
Here's my theoretical analysis. Let's say that we set the limit to X, which is
a power of 2. In double precision, numbers slightly less than X have X/2 as the
weight of the implicit leading 1, and X/2^53 as the ULP. Thus, rounding a
multiple of 0.001 to the nearest double introduces error up to X/2^54 (half a
ULP). Adding/subtracting two such values then has error up to X/2^53. This
needs to be less than 0.0005, so that we can round the result to the proper
multiple of 0.001. The largest power-of-2 X satisfying this is 2^42.
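The bound can be checked numerically. This is a sketch of the reasoning above, not Mesos code: the function names are made up for illustration, and `math.ulp` requires Python 3.9 or later.

```python
import math

GRANULARITY = 0.001          # resources are treated as multiples of 0.001
MAX_ERROR = GRANULARITY / 2  # error must stay below 0.0005 to round correctly


def add_sub_error_bound(limit: int) -> float:
    """Worst-case error after adding/subtracting two values just below `limit`.

    Each operand is off by up to half a ULP (limit / 2**54), so the two
    together contribute up to one ULP (limit / 2**53).
    """
    return limit / 2**53


def largest_safe_exponent() -> int:
    """Largest k such that the limit 2**k keeps the error below MAX_ERROR."""
    k = 0
    while add_sub_error_bound(2 ** (k + 1)) < MAX_ERROR:
        k += 1
    return k


print(largest_safe_exponent())        # exponent of the largest safe power of 2
print(math.ulp(4294967295000000.0))   # spacing of doubles near the bad value
```

For the capacity that caused the crash, the spacing between adjacent doubles is 0.5, which is consistent with the rounding error on the order of 0.2 seen in the debugger.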
I agree that a more conservative limit of 2^40 is probably a better idea, and
would still be large enough for my use case (we have 40Gb/s NICs, so we'd still
have more than 10x headroom).
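A check along these lines would reject the misconfigured value up front. It is only a sketch: `MAX_SCALAR` and `validate_scalar` are hypothetical names rather than anything in Mesos, and the 2^40 limit is the conservative value suggested above.

```python
import math

# Hypothetical upper bound for scalar resources (the conservative 2^40 limit).
MAX_SCALAR = 2**40


def validate_scalar(name: str, value: float) -> None:
    """Raise ValueError if a scalar resource value is unusable.

    Rejects NaN/infinity, negative values, and values above MAX_SCALAR.
    """
    if not math.isfinite(value) or value < 0 or value > MAX_SCALAR:
        raise ValueError(
            f"resource {name!r}: value {value} is outside [0, {MAX_SCALAR}]"
        )
```

With this check, a 40Gb/s interface expressed in bits/second (40 * 10^9) passes, while 4294967295000000 is rejected with a readable error instead of surfacing later as an assertion failure.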
> Bruce Merry would you be able to send a pull request with the added
> validation?
Eventually maybe, but the next several months are a crunch time for us so I
definitely won't be able to until April, and there is still another issue where
I want to contribute code first. There is also still the question of where the
validation should happen: only in agent startup, or also in processing client
requests?
> Very large resource value crashes master
> ----------------------------------------
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
> Issue Type: Bug
> Components: agent, master
> Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
> Reporter: Bruce Merry
> Assignee: Benjamin Mahler
> Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of
> units had led to an agent with a custom scalar resource of capacity
> 4294967295000000. I believe what is happening is the pseudo-fixed-point
> arithmetic isn't able to cope with such large numbers, because rounding
> errors after arithmetic are bigger than 0.001. Examining the values in the
> debugger showed that the CHECK failed due to a rounding error on the order
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point
> implementation and such large resource values are probably a bad idea, it
> would have helped if the agent had complained on startup, rather than having
> to debug an internal assertion failure. I'd suggest that values larger than,
> say, 10^12 should be rejected when the agent starts (which is why I've added
> the agent component), although someone familiar with the details of the
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on
> agent startup or if it should be baked into the Resource class to prevent
> accidents in requests from the user.
> To reproduce the issue, start a master and an agent with a custom scalar
> resource "thing:4294967295000000", then use mesos-execute to throw the
> following task at it (it'll probably also work with a smaller Docker image -
> that's just one I already had on the agent). When the sleep ends, the master
> crashes.
> {code:javascript}
> {
> "container": {
> "docker": {
> "image": "ubuntu:xenial-20161010"
> },
> "type": "DOCKER"
> },
> "name": "test-task",
> "task_id": {
> "value": "00000001"
> },
> "command": {
> "shell": false,
> "value": "sleep",
> "arguments": [
> "10"
> ]
> },
> "agent_id": {
> "value": ""
> },
> "resources": [
> {
> "scalar": {
> "value": 1
> },
> "type": "SCALAR",
> "name": "cpus"
> },
> {
> "scalar": {
> "value": 4106.0
> },
> "type": "SCALAR",
> "name": "mem"
> },
> {
> "scalar": {
> "value": 12465430.06012024
> },
> "type": "SCALAR",
> "name": "thing"
> }
> ]
> }
> {code}