[ 
https://issues.apache.org/jira/browse/MESOS-9227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612634#comment-16612634
 ] 

Benjamin Bannier edited comment on MESOS-9227 at 9/12/18 9:42 PM:
------------------------------------------------------------------

I believe that, to some degree, the way our fixed point math truncates small 
fractions has prevented exactness issues for smaller values.

Since scalar resource values are stored as {{double}} internally in the 
{{Resource}} message, they can only hold around 15 significant digits. We want 
to guarantee correct fixed point math with up to three decimal places, which 
leaves about 12 digits for the integral part, so we can represent values 
exactly only up to around 10¹² MB = 1 EB.
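
To make the limit concrete, here is a minimal sketch that measures the gap 
between adjacent {{double}} values at a given magnitude. The helper names 
{{spacingAbove}} and {{resolvesThreeDecimals}} are hypothetical and are not 
part of Mesos.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between `value` and the next larger representable double.
double spacingAbove(double value)
{
  assert(std::isfinite(value));
  return std::nextafter(value, std::numeric_limits<double>::infinity()) - value;
}

// Hypothetical check: can a scalar of this magnitude still resolve
// steps of 0.001, i.e., three decimal places?
bool resolvesThreeDecimals(double value)
{
  return spacingAbove(value) < 0.001;
}
```

With these helpers, {{resolvesThreeDecimals(1e12)}} holds while 
{{resolvesThreeDecimals(1e13)}} does not: adjacent doubles near 10¹³ are 
2⁻⁹ ≈ 0.00195 apart, which is coarser than one thousandth.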

Such an amount of {{disk}} is unfortunately not far from realistic even for a 
single agent, where we might already run into correctness issues, but it 
should be possible to, e.g., warn users that agent resources might not be 
representable. The issue is worse if the total {{disk}} capacity in the 
cluster reaches exabyte scale (either with a few agents with huge but 
representable disks, or with many agents with considerable disks). The sum of 
{{disk}} might not be representable in the master even though each agent stays 
below the obviously problematic threshold, making such issues harder to 
diagnose.
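
The aggregation problem can be sketched as follows. This is an illustration 
under assumptions, not the Mesos implementation: {{toFixed}}, {{toFloating}}, 
{{add}} and {{clusterDisk}} are hypothetical stand-ins for converting to three 
decimal places, adding in {{long long}}, and storing the result as a 
{{double}} again.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in: scale to thousandths and round to an integer.
long long toFixed(double value)
{
  assert(std::isfinite(value));
  return std::llround(value * 1000.0);
}

// Hypothetical stand-in: convert thousandths back to a double.
double toFloating(long long fixed)
{
  return static_cast<double>(fixed) / 1000.0;
}

// Adds two scalars in fixed point, then stores the result as a double again.
double add(double left, double right)
{
  return toFloating(toFixed(left) + toFixed(right));
}

// Sums the same per-agent disk size (in MB) over `agents` agents.
double clusterDisk(double perAgentDisk, int agents)
{
  double total = 0.0;
  for (int i = 0; i < agents; ++i) {
    total = add(total, perAgentDisk);
  }
  return total;
}
```

In this sketch a single agent with 10¹² + 0.001 MB converts to the exact fixed 
point value 1000000000000001, but the total over 11 such agents cannot equal 
11000000000000011: once scaled by 1000 the sum exceeds 2⁵³, where a 
{{double}} can only hold even integers.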

A possible short term mitigation might be to store disk resources in GB instead 
of MB, which would buy us three orders of magnitude at the cost of being unable 
to represent values less than around 1 MB, but it seems such a fix wouldn't go 
far enough.


was (Author: bbannier):
I believe to some degree the way our fixed point math truncates away small 
fractions has prevented exactness issues for smaller values.

Since scalar resource values are stored as {{double}} internally in the 
{{Resource}} message, they can only hold around 15 significant digits. We want 
to guarantee correct fixed point math with up to three decimal places, so we 
can represent values exactly up to around 10¹² kB = 1 PB.

Such an amount of {{disk}} is unfortunately not unrealistic even for a single 
agent where we might already run into correctness issues, but it should be 
possible to e.g., warn users that agent resources might not be representable. 
The issue is worse if the total capacity of {{disk}} in the cluster reaches 
petabyte scale (either with some agents with huge, but representable disks, or 
many agents with considerable disks). The sum of {{disk}} might not be 
representable in the master, but would be below the obviously problematic 
threshold for each agent, making it harder to diagnose such issues.

A possible short term mitigation might be to store disk resources in GB instead 
of kB, which would buy us a couple of orders of magnitude at the cost of being 
unable to represent values less than around 1 MB.

> `Value::Scalar` cannot handle large value due to double limitations.
> --------------------------------------------------------------------
>
>                 Key: MESOS-9227
>                 URL: https://issues.apache.org/jira/browse/MESOS-9227
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Meng Zhu
>            Priority: Blocker
>
> While `scalar` holds a `double`, internally we convert floating point to 
> fixed point to ensure only three decimal digits:
> https://github.com/apache/mesos/blob/851ec9c5dca672ed4efc77545c86121463695e4f/src/common/values.cpp#L48-L53
> And all internal arithmetic calculations are done using `long long`, e.g.:
> https://github.com/apache/mesos/blob/851ec9c5dca672ed4efc77545c86121463695e4f/src/common/values.cpp#L123-L128
> This has the unexpected consequence of the inability to handle large values. 
> One impacted use case we are seeing is with exabytes of disks. This will 
> overflow the fixed point representation.



