[ https://issues.apache.org/jira/browse/FLINK-14594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhu Zhu updated FLINK-14594:
----------------------------
Description:
Some resources have double-typed values, such as cpuCores in
ResourceSpec/ResourceProfile and all extended resources. These values can be
produced by merge or subtract operations, which may introduce small deltas.
Currently, resource matching compares these values without accounting for
the deltas, which may cause the following issues:
1. A shared slot cannot fulfill a slot request even if it should be able to
(because it is possible that {{(d1 + d2) - d1 < d2}} for double values)
2. If a shared slot is used up, an unexpected error may occur when calculating
its remaining resources in SlotSharingManager#listResolvedRootSlotInfo ->
ResourceProfile#subtract.
3. An unexpected error may occur when releasing a single task slot from a
shared slot (in ResourceProfile#subtract).
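To make issue 1 concrete, here is a minimal sketch in plain Java. The class and method names are hypothetical and not part of Flink, and the values 0.7 and 0.1 are just one pair for which the effect shows up:

```java
// Hypothetical demo class, not Flink code.
class DoubleDeltaDemo {

    // Adds d2 on top of d1 and then takes d1 away again, all in double arithmetic.
    static double remainingAfterRoundTrip(double d1, double d2) {
        return (d1 + d2) - d1;
    }

    public static void main(String[] args) {
        double d1 = 0.7; // e.g. cpuCores already merged into a shared slot
        double d2 = 0.1; // e.g. cpuCores of the next request
        double remaining = remainingAfterRoundTrip(d1, d2);
        // 0.7 + 0.1 rounds down to the nearest double, so the remainder
        // ends up slightly below 0.1.
        System.out.println(remaining < d2); // prints "true"
    }
}
```

With these values a plain `remaining >= d2` check would reject a request that should fit.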
To solve this issue, I'd propose to:
1. Change {{Resource}} to use {{BigDecimal}} to manage double values. This
allows the values to be compared strictly and to be merged/subtracted
with no precision loss. Extended resources can work correctly
with double values with this change.
2. Introduce {{CPUResource}} to represent CPU cores. It is based on {{Resource}}.
3. Change ResourceSpec/ResourceProfile to use CPUResource for CPU cores.
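The proposal above could look roughly like the following sketch. This is only an illustration under assumptions: `ResourceSketch`, `CpuResourceSketch`, and all of their methods are hypothetical names, not the actual Flink classes or the code in the pull request.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for the proposed Resource; names and methods are assumptions.
class ResourceSketch {
    private final String name;
    private final BigDecimal value;

    ResourceSketch(String name, BigDecimal value) {
        if (value.signum() < 0) {
            throw new IllegalArgumentException("Resource value must not be negative: " + value);
        }
        this.name = name;
        this.value = value;
    }

    BigDecimal getValue() {
        return value;
    }

    // Decimal addition is exact, so no delta is introduced by merging.
    ResourceSketch merge(ResourceSketch other) {
        checkSameName(other);
        return new ResourceSketch(name, value.add(other.value));
    }

    // Decimal subtraction is exact; a negative result is rejected by the constructor.
    ResourceSketch subtract(ResourceSketch other) {
        checkSameName(other);
        return new ResourceSketch(name, value.subtract(other.value));
    }

    // compareTo ignores scale differences (0.10 vs 0.1), unlike BigDecimal#equals.
    boolean canFulfill(ResourceSketch request) {
        checkSameName(request);
        return value.compareTo(request.value) >= 0;
    }

    private void checkSameName(ResourceSketch other) {
        if (!name.equals(other.name)) {
            throw new IllegalArgumentException("Cannot combine " + name + " with " + other.name);
        }
    }

    public static void main(String[] args) {
        ResourceSketch slot = new CpuResourceSketch(0.7).merge(new CpuResourceSketch(0.1));
        ResourceSketch remaining = slot.subtract(new CpuResourceSketch(0.7));
        System.out.println(remaining.canFulfill(new CpuResourceSketch(0.1))); // prints "true"
    }
}

// Hypothetical CPU-specific resource built on top of ResourceSketch.
class CpuResourceSketch extends ResourceSketch {
    CpuResourceSketch(double cores) {
        // BigDecimal.valueOf(double) converts via Double.toString, so 0.7 becomes exactly 0.7.
        super("CPU", BigDecimal.valueOf(cores));
    }
}
```

In this sketch, merging 0.7 and 0.1 cores and then subtracting 0.7 leaves exactly 0.1, so the remaining capacity can still fulfill a 0.1-core request.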
was:
Some resources have double-typed values, such as cpuCores in
ResourceSpec/ResourceProfile and all extended resources. These values can be
produced by merge or subtract operations, which may introduce small deltas.
Currently, resource matching compares these values without accounting for
the deltas, which may cause the following issues:
1. A shared slot cannot fulfill a slot request even if it should be able to
(because it is possible that {{(d1 + d2) - d1 < d2}} for double values)
2. If a shared slot is used up, an unexpected error may occur when calculating
its remaining resources in SlotSharingManager#listResolvedRootSlotInfo ->
ResourceProfile#subtract.
3. An unexpected error may occur when releasing a single task slot from a
shared slot (in ResourceProfile#subtract).
To solve this issue, I'd propose to:
1. Introduce a ResourceValue which stores a double value and its acceptable
precision (resources of the same kind should use the same precision). It
provides a {{compareTo}} method in which two ResourceValues are considered equal
if the absolute value of their difference does not exceed the precision. It
also provides merge/subtract/validation operations.
2. ResourceSpec/ResourceProfile use ResourceValue for cpuCores, and the related
logic (constructor/validation/subtract/matching) is fixed accordingly. Usages
of {{equals}} should be replaced with another method, {{hasSameResources}},
which considers the precision.
3. Resource uses ResourceValue to store its value, and the related logic is
fixed accordingly.
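For comparison, the earlier precision-based idea could be sketched as below. Everything here is an assumption made for illustration: the class name `ResourceValueSketch`, the method signatures, and in particular the choice to clamp tiny negative remainders to zero are not taken from the issue or from Flink.

```java
// Hypothetical sketch of a double value compared with a tolerance; not Flink code.
class ResourceValueSketch implements Comparable<ResourceValueSketch> {
    private final double value;
    private final double precision;

    ResourceValueSketch(double value, double precision) {
        if (value < 0 || precision < 0) {
            throw new IllegalArgumentException("value and precision must not be negative");
        }
        this.value = value;
        this.precision = precision;
    }

    // Two values are treated as equal when their absolute difference is within the precision.
    @Override
    public int compareTo(ResourceValueSketch other) {
        double diff = value - other.value;
        if (Math.abs(diff) <= precision) {
            return 0;
        }
        return diff < 0 ? -1 : 1;
    }

    ResourceValueSketch merge(ResourceValueSketch other) {
        return new ResourceValueSketch(value + other.value, precision);
    }

    // Assumption: remainders that are negative only within the precision are clamped to zero.
    ResourceValueSketch subtract(ResourceValueSketch other) {
        double diff = value - other.value;
        if (diff < -precision) {
            throw new IllegalArgumentException("Cannot subtract a larger value");
        }
        return new ResourceValueSketch(Math.max(0.0, diff), precision);
    }

    public static void main(String[] args) {
        double eps = 1e-6; // assumed precision for this kind of resource
        ResourceValueSketch slot = new ResourceValueSketch(0.7, eps).merge(new ResourceValueSketch(0.1, eps));
        ResourceValueSketch remaining = slot.subtract(new ResourceValueSketch(0.7, eps));
        System.out.println(remaining.compareTo(new ResourceValueSketch(0.1, eps)) == 0); // prints "true"
    }
}
```

One general caveat of tolerance-based comparison is that the resulting equality is not transitive: a can equal b and b can equal c within the precision while a and c differ by more than it.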
cc [~trohrmann] [~azagrebin] [~xintongsong]
> Fix matching logics of ResourceSpec/ResourceProfile/Resource considering
> double values
> --------------------------------------------------------------------------------------
>
> Key: FLINK-14594
> URL: https://issues.apache.org/jira/browse/FLINK-14594
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / Coordination
> Affects Versions: 1.10.0
> Reporter: Zhu Zhu
> Assignee: Zhu Zhu
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.10.0
>
> Time Spent: 20m
> Remaining Estimate: 0h