[
https://issues.apache.org/jira/browse/FLINK-15300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrey Zagrebin updated FLINK-15300:
------------------------------------
Description:
If a configuration causes the shuffle memory size to be set to its min or max
rather than to the configured fraction, the TM fails during startup: it parses
the generated dynamic properties, and the sanity check
(TaskExecutorResourceUtils#sanityCheckShuffleMemory) fails because it compares
the min/max value against the exact fraction.
For example, start the TM with the following Flink config:
{code:java}
taskmanager.memory.total-flink.size: 350m
taskmanager.memory.framework.heap.size: 16m
taskmanager.memory.shuffle.fraction: 0.1{code}
The calculation is based on the total Flink memory and results in the
following extra program args:
{code:java}
taskmanager.memory.shuffle.max: 67108864b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.managed.size: 146800642b
taskmanager.cpu.cores: 1.0
taskmanager.memory.task.heap.size: 2097150b
taskmanager.memory.task.off-heap.size: 0b
taskmanager.memory.shuffle.min: 67108864b{code}
Here the calculation now happens for the task heap and managed memory, and the
size derived from the fraction is less than the shuffle memory min size (64mb),
so it was set to the min value: 64mb.
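The clamping step described above can be illustrated with a small, self-contained sketch. It is not the Flink implementation: the class and method names are hypothetical, and the 1 GB upper bound is an assumption (only the 64mb min is stated in this issue).

```java
public class ShuffleClampExample {
    static final long MB = 1024L * 1024L;

    /**
     * Hypothetical helper: applies the fraction to the base size and
     * clamps the result into the range [minBytes, maxBytes].
     */
    static long deriveWithFraction(long baseBytes, double fraction, long minBytes, long maxBytes) {
        long relative = (long) (baseBytes * fraction);
        return Math.max(minBytes, Math.min(maxBytes, relative));
    }

    public static void main(String[] args) {
        long totalFlink = 350 * MB;   // taskmanager.memory.total-flink.size: 350m
        long min = 64 * MB;           // shuffle memory min size from the issue
        long max = 1024 * MB;         // assumed upper bound, not taken from the issue

        // 0.1 * 350m is roughly 35m, which is below the min, so the min is used.
        long shuffle = deriveWithFraction(totalFlink, 0.1, min, max);
        System.out.println(shuffle + "b");
    }
}
```

With the values from the example config, the fraction alone would give about 35mb, so the result is the min size of 67108864 bytes, matching the generated taskmanager.memory.shuffle.min/max args.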
While the TM starts, TaskExecutorResourceUtils#sanityCheckShuffleMemory throws
the following exception:
{code:java}
org.apache.flink.configuration.IllegalConfigurationException:
Derived Shuffle Memory size(64 Mb (67108864 bytes)) does not match configured
Shuffle Memory fraction (0.10000000149011612).
at
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.sanityCheckShuffleMemory(TaskExecutorResourceUtils.java:552)
at
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveResourceSpecWithExplicitTaskAndManagedMemory(TaskExecutorResourceUtils.java:183)
at
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:135)
{code}
This can be fixed by checking whether the fraction to assert is within the
min/max range.
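A minimal sketch of such a range-aware check is shown below. It is only an illustration under stated assumptions: the class and method names are hypothetical and are not the actual TaskExecutorResourceUtils code, and the exact-equality comparison ignores any rounding tolerance the real check may need.

```java
public class ShuffleSanityCheckSketch {
    static final long MB = 1024L * 1024L;

    /** Strict check (the failing behavior): the derived size must equal base * fraction. */
    static boolean matchesExactFraction(long derivedBytes, long baseBytes, double fraction) {
        return derivedBytes == (long) (baseBytes * fraction);
    }

    /**
     * Range-aware check (hypothetical): the fraction is asserted exactly only
     * when base * fraction lies within [minBytes, maxBytes]; otherwise the
     * derived size is expected to equal the bound it was clamped to.
     */
    static boolean matchesFractionWithinRange(
            long derivedBytes, long baseBytes, double fraction, long minBytes, long maxBytes) {
        long relative = (long) (baseBytes * fraction);
        long expected = Math.max(minBytes, Math.min(maxBytes, relative));
        return derivedBytes == expected;
    }

    public static void main(String[] args) {
        long totalFlink = 350 * MB; // total Flink memory from the example config
        long derived = 64 * MB;     // shuffle memory passed via dynamic properties
        long min = 64 * MB;         // taskmanager.memory.shuffle.min: 67108864b
        long max = 64 * MB;         // taskmanager.memory.shuffle.max: 67108864b

        System.out.println(matchesExactFraction(derived, totalFlink, 0.1));
        System.out.println(matchesFractionWithinRange(derived, totalFlink, 0.1, min, max));
    }
}
```

For the values in this issue, the strict check rejects the 64mb size because it differs from the fraction of the total, while the range-aware check accepts it because the fraction result falls below the min.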
was:
If a configuration causes the shuffle memory size to be set to its min or max
rather than to the configured fraction, the TM fails during startup: it parses
the generated dynamic properties, and the sanity check
(TaskExecutorResourceUtils#sanityCheckShuffleMemory) fails because it compares
the min/max value against the exact fraction.
For example, start the TM with the following Flink config:
{code:java}
taskmanager.memory.total-flink.size: 350m
taskmanager.memory.framework.heap.size: 16m
taskmanager.memory.shuffle.fraction: 0.1{code}
It will result in the following extra program args:
{code:java}
taskmanager.memory.shuffle.max: 67108864b
taskmanager.memory.framework.off-heap.size: 134217728b
taskmanager.memory.managed.size: 146800642b
taskmanager.cpu.cores: 1.0
taskmanager.memory.task.heap.size: 2097150b
taskmanager.memory.task.off-heap.size: 0b
taskmanager.memory.shuffle.min: 67108864b{code}
where the size derived from the fraction was less than the shuffle memory min
size (64mb), so it was set to the min value: 64mb.
While the TM starts, TaskExecutorResourceUtils#sanityCheckShuffleMemory throws
the following exception:
{code:java}
org.apache.flink.configuration.IllegalConfigurationException:
Derived Shuffle Memory size(64 Mb (67108864 bytes)) does not match configured
Shuffle Memory fraction (0.10000000149011612).
at
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.sanityCheckShuffleMemory(TaskExecutorResourceUtils.java:552)
at
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.deriveResourceSpecWithExplicitTaskAndManagedMemory(TaskExecutorResourceUtils.java:183)
at
org.apache.flink.runtime.clusterframework.TaskExecutorResourceUtils.resourceSpecFromConfig(TaskExecutorResourceUtils.java:135)
{code}
This can be fixed by checking whether the fraction to assert is within the
min/max range.
> Shuffle memory fraction sanity check does not account for its min/max limit
> ---------------------------------------------------------------------------
>
> Key: FLINK-15300
> URL: https://issues.apache.org/jira/browse/FLINK-15300
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Configuration
> Reporter: Andrey Zagrebin
> Assignee: Andrey Zagrebin
> Priority: Critical
> Fix For: 1.10.0
>