[
https://issues.apache.org/jira/browse/FLINK-15178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhu Zhu updated FLINK-15178:
----------------------------
Description:
I hit this issue when running a test batch (DataSet) job with a parallelism of 1000.
Some TMs crash due to the error below:
{code:java}
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing
reserved memory.
[thread 139864559318784 also had an error]
[thread 139867407243008 also had an error]
{code}
With either of the following actions, the problem can be avoided (see the sketch after this list):
1. changing the ExecutionMode from BATCH_FORCED to PIPELINED
2. changing the config option "taskmanager.network.bounded-blocking-subpartition-type"
from the default "auto" to "file"
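For reference, a minimal sketch of applying the two workarounds in code, assuming the Flink 1.10 DataSet API (the class name and job body are hypothetical; only the execution-mode change and the quoted config key come from this report):
{code:java}
import org.apache.flink.api.common.ExecutionMode;
import org.apache.flink.api.java.ExecutionEnvironment;

// Hypothetical sketch of the workarounds, not the attached
// MultiRegionBatchNumberCount job.
public class WorkaroundsSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Workaround 1: run PIPELINED instead of BATCH_FORCED, so data is
        // exchanged in a pipelined fashion rather than via BLOCKING partitions.
        env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);

        // Workaround 2 (cluster-side alternative, set in flink-conf.yaml):
        //   taskmanager.network.bounded-blocking-subpartition-type: file
        // This forces the file-based blocking subpartition instead of the
        // default "auto", which appears to pick an mmap-based one here.

        // ... job definition and env.execute() as in the attached program ...
    }
}
{code}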
So it looks related to the mmap usage of the BLOCKING shuffle.
The issue is a bit odd in that it always happens at the beginning of a job and
disappears after several rounds of failovers, so the job finally succeeds.
The job code and config file are attached.
The command to run it (on a YARN cluster) is:
{code:java}
bin/flink run -d -m yarn-cluster -c
com.alibaba.blink.tests.MultiRegionBatchNumberCount
../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
{code}
[~sewen] [~pnowojski] [~kevin.cyj] Do you have any ideas about why this issue
could happen?
was:
I hit this issue when running a test batch (DataSet) job with a parallelism of 1000.
Some TMs crash due to the error below:
{code:java}
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing
reserved memory.
[thread 139864559318784 also had an error]
[thread 139867407243008 also had an error]
{code}
With either of the following actions, the problem does not happen:
1. changing the ExecutionMode from BATCH_FORCED to PIPELINED
2. changing the config option "taskmanager.network.bounded-blocking-subpartition-type"
from the default "auto" to "file"
So it looks related to the mmap usage of the BLOCKING shuffle.
The problem always happens at the beginning of a job and disappears after
several rounds of failovers, so the job finally succeeds.
The job code and config are attached.
The command to run it (on a YARN cluster) is:
{code:java}
bin/flink run -d -m yarn-cluster -c
com.alibaba.blink.tests.MultiRegionBatchNumberCount
../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
{code}
[~sewen] [~pnowojski] [~kevin.cyj] Do you know why this issue could happen?
> TaskExecutor crashes due to mmap allocation failure for BLOCKING shuffle
> ------------------------------------------------------------------------
>
> Key: FLINK-15178
> URL: https://issues.apache.org/jira/browse/FLINK-15178
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.10.0
> Reporter: Zhu Zhu
> Priority: Major
> Fix For: 1.10.0
>
> Attachments: MultiRegionBatchNumberCount.java, flink-conf.yaml
>
>
> I hit this issue when running a test batch (DataSet) job with a parallelism
> of 1000.
> Some TMs crash due to the error below:
> {code:java}
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 12288 bytes for committing
> reserved memory.
> [thread 139864559318784 also had an error]
> [thread 139867407243008 also had an error]
> {code}
> With either of the following actions, the problem can be avoided:
> 1. changing the ExecutionMode from BATCH_FORCED to PIPELINED
> 2. changing the config option "taskmanager.network.bounded-blocking-subpartition-type"
> from the default "auto" to "file"
> So it looks related to the mmap usage of the BLOCKING shuffle.
> The issue is a bit odd in that it always happens at the beginning of a job
> and disappears after several rounds of failovers, so the job finally
> succeeds.
> The job code and config file are attached.
> The command to run it (on a YARN cluster) is:
> {code:java}
> bin/flink run -d -m yarn-cluster -c
> com.alibaba.blink.tests.MultiRegionBatchNumberCount
> ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
> {code}
> [~sewen] [~pnowojski] [~kevin.cyj] Do you have any ideas about why this issue
> could happen?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)