[ https://issues.apache.org/jira/browse/MESOS-8129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16250507#comment-16250507 ]
Benjamin Mahler edited comment on MESOS-8129 at 11/14/17 12:00 AM:
-------------------------------------------------------------------

Did a binary search and found that the largest 0.001 precision value such that precision appears to be lost when incrementing by another 0.001 is: 8,796,093,022,208.000. This is exactly 2^43, where a double has only 9 fractional bits left, which seems to indicate that more than 9 bits are needed for enough precision to represent 0.001 without loss during increment.

Another 0.001 increment moves this to 8,796,093,022,208.002. This was tested with Apple's clang {{Apple LLVM version 9.0.0 (clang-900.0.38)}}.

We could perhaps validate a more conservative limit of 2^40 which is still over 1 trillion.

cc [~mcypark] [~bmerry] would you be able to send a pull request with the added validation?

was (Author: bmahler):
Did a binary search and found that the largest 0.001 precision value such that precision appears to be lost when incrementing by another 0.001 is: 8,796,093,022,208.000.

Another 0.001 increment moves this to 8,796,093,022,208.002. This was tested with Apple's clang {{Apple LLVM version 9.0.0 (clang-900.0.38)}}.

We could perhaps validate a limit of 2^32 which is a little less than half of this limit: 4,294,967,296.

cc [~mcypark] [~bmerry] would you be able to send a pull request with the added validation?

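The 2^43 threshold in the comment is what IEEE 754 double precision predicts: with a 53-bit significand, the gap between adjacent doubles at 2^43 is 2^-9 (about 0.00195), which is wider than 0.001. The following is a minimal sketch in Python rather than the Mesos C++ code; `spacing` is a hypothetical helper name, not a Mesos function, and it assumes positive, normal, finite inputs.

```python
import math

def spacing(x: float) -> float:
    """Gap between x and the next larger double (positive normal x only)."""
    # frexp returns (m, e) with x == m * 2**e and 0.5 <= m < 1,
    # and a double carries 53 significant bits.
    _, exponent = math.frexp(x)
    return 2.0 ** (exponent - 53)

threshold = 2.0 ** 43  # 8,796,093,022,208

# Just below 2^43 the gap is 2^-10 (about 0.00098), so a 0.001 step
# still lands within half a gap of the intended value.
below = spacing(threshold - 1.0)

# At 2^43 the gap doubles to 2^-9 (about 0.00195); adding 0.001 rounds
# up by one whole gap, which reads as .002 at three decimals.
at = spacing(threshold)
step = (threshold + 0.001) - threshold

# The proposed 2^40 limit leaves a gap of 2^-12 (about 0.00024).
at_limit = spacing(2.0 ** 40)

print(below, at, step, at_limit)
```

Under these assumptions, a 2^40 limit leaves about two extra bits of headroom below the 0.001 granularity.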
> Very large resource value crashes master
> ----------------------------------------
>
> Key: MESOS-8129
> URL: https://issues.apache.org/jira/browse/MESOS-8129
> Project: Mesos
> Issue Type: Bug
> Components: agent, master
> Affects Versions: 1.4.0
> Environment: Ubuntu 14.04
> Both apt packages from Mesosphere repo and Docker images
> Reporter: Bruce Merry
> Assignee: Benjamin Mahler
> Priority: Minor
>
> I ran into a master that kept failing on this CHECK when destroying a task:
> https://github.com/apache/mesos/blob/1.4.0/src/master/allocator/sorter/drf/sorter.hpp#L367
> I found that a combination of a misconfiguration and a suboptimal choice of
> units had led to an agent with a custom scalar resource of capacity
> 4294967295000000. I believe what is happening is the pseudo-fixed-point
> arithmetic isn't able to cope with such large numbers, because rounding
> errors after arithmetic are bigger than 0.001. Examining the values in the
> debugger showed that the CHECK failed due to a rounding error on the order
> of 0.2.
> While this is probably a fundamental limitation of the fixed-point
> implementation and such large resource values are probably a bad idea, it
> would have helped if the agent had complained on startup, rather than having
> to debug an internal assertion failure. I'd suggest that values larger than,
> say, 10^12 should be rejected when the agent starts (which is why I've added
> the agent component), although someone familiar with the details of the
> fixed-point implementation should probably verify that number.
> I'm not sure where this needs to be fixed e.g. if it can just be validated on
> agent startup or if it should be baked into the Resource class to prevent
> accidents in requests from the user.
>
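An error of this size fits the double spacing at the reported magnitude: 4294967295000000 lies between 2^51 and 2^52, where adjacent doubles are 0.5 apart. The sketch below assumes plain double arithmetic as a stand-in for the internal representation and does not reproduce the Mesos sorter logic. It measures the rounding error from subtracting the `thing` value of the reproduction task in this report from the agent capacity.

```python
from fractions import Fraction

capacity = 4294967295000000.0  # agent's custom scalar resource
task = 12465430.06012024       # "thing" requested by the task

# Fraction(float) converts exactly, so this is the true difference
# of the two stored doubles.
exact = Fraction(capacity) - Fraction(task)

# The floating-point result must round to a multiple of 0.5 here.
computed = capacity - task

# Absolute rounding error of the single subtraction.
error = float(abs(Fraction(computed) - exact))

print(computed, error)
```

The single subtraction is already off by about 0.06, well above the 0.001 granularity the report mentions.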
> To reproduce the issue, start a master and an agent with a custom scalar
> resource "thing:4294967295000000", then use mesos-execute to throw the
> following task at it (it'll probably also work with a smaller Docker image -
> that's just one I already had on the agent). When the sleep ends, the master
> crashes.
> {code:javascript}
> {
>   "container": {
>     "docker": {
>       "image": "ubuntu:xenial-20161010"
>     },
>     "type": "DOCKER"
>   },
>   "name": "test-task",
>   "task_id": {
>     "value": "00000001"
>   },
>   "command": {
>     "shell": false,
>     "value": "sleep",
>     "arguments": [
>       "10"
>     ]
>   },
>   "agent_id": {
>     "value": ""
>   },
>   "resources": [
>     {
>       "scalar": {
>         "value": 1
>       },
>       "type": "SCALAR",
>       "name": "cpus"
>     },
>     {
>       "scalar": {
>         "value": 4106.0
>       },
>       "type": "SCALAR",
>       "name": "mem"
>     },
>     {
>       "scalar": {
>         "value": 12465430.06012024
>       },
>       "type": "SCALAR",
>       "name": "thing"
>     }
>   ]
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)