[
https://issues.apache.org/jira/browse/FLINK-15178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16993300#comment-16993300
]
Zhu Zhu edited comment on FLINK-15178 at 12/11/19 8:31 AM:
-----------------------------------------------------------
[~pnowojski] I tried -XX:-UseCompressedOops but the issue still happens. And
the JDK issue seems to be different from this one, since the error message is
different and that reporter hit the issue with compressed oops enabled.
It's not very similar to FLINK-14952, since the TM is not killed by YARN but
crashes by itself. I'm not sure whether it's related to different configs of the
YARN clusters. However, IIRC, even if the machine is short of resources, memory
eviction should kick in, so an allocation failure is still not expected.
I'm trying to snapshot the memory usage of the machine when this issue happens.
I haven't managed to do it yet, since the issue appears and disappears quickly.
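Since the failure window is so short, a small polling helper might catch it. A minimal sketch, assuming a Linux TM host; the script name, the output format, and the idea of watching the mapping count are my own, not from the attached files:

```shell
#!/bin/sh
# snapshot_mem.sh (hypothetical helper): print one line with the host's
# available memory and the number of memory mappings held by a process.
# Run it in a loop (e.g. via `watch -n 1`) around job start-up so the
# state right before the crash is captured.
PID=${1:-$$}   # target pid, e.g. the TaskManager's; defaults to this shell
avail=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
maps=$(wc -l < "/proc/$PID/maps")
echo "$(date '+%H:%M:%S') avail_kb=$avail mappings=$maps"
```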
You can also try to reproduce it with the attached job code, Flink config and
launch command.
I should mention that the job source produces very little data (3,000 integers
in total), so it should not actually require much memory.
> TaskExecutor crashes due to mmap allocation failure for BLOCKING shuffle
> ------------------------------------------------------------------------
>
> Key: FLINK-15178
> URL: https://issues.apache.org/jira/browse/FLINK-15178
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.10.0
> Reporter: Zhu Zhu
> Priority: Major
> Fix For: 1.10.0
>
> Attachments: MultiRegionBatchNumberCount.java, flink-conf.yaml
>
>
> I met this issue when running a test batch (DataSet) job with a parallelism
> of 1000.
> Some TMs crash due to the error below:
> {code:java}
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 12288 bytes for committing
> reserved memory.
> [thread 139864559318784 also had an error]
> [thread 139867407243008 also had an error]
> {code}
> The problem can be avoided with either of the following actions:
> 1. changing the ExecutionMode from BATCH_FORCED to PIPELINED
> 2. changing the config "taskmanager.network.bounded-blocking-subpartition-type"
> from its default "auto" to "file"
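> For reference, workaround 1 can be applied in the job code (a sketch using the
> standard DataSet API; the attached MultiRegionBatchNumberCount is authoritative):
> {code:java}
> env.getConfig().setExecutionMode(ExecutionMode.PIPELINED);
> {code}
> and workaround 2 goes into flink-conf.yaml:
> {code}
> taskmanager.network.bounded-blocking-subpartition-type: file
> {code}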
> So it looks like it is related to the mmap usage of the BLOCKING shuffle.
> The issue is also a bit weird in that it always happens at the beginning of
> a job and disappears after several rounds of failovers, so the job would
> finally succeed.
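> One thing that might be worth checking (an assumption on my part, not a
> confirmed cause): each mmap-backed subpartition adds memory mappings to the
> TM process, so at parallelism 1000 a TM could hit the kernel's per-process
> mapping limit, which also surfaces as a native mmap allocation failure:
> {code}
> sysctl vm.max_map_count       # kernel limit, commonly 65530 by default
> wc -l /proc/<tm-pid>/maps     # mappings currently held by the TaskManager
> {code}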
> The job code and config file are attached.
> The command to run it (on a YARN cluster) is
> {code:java}
> bin/flink run -d -m yarn-cluster -c
> com.alibaba.blink.tests.MultiRegionBatchNumberCount
> ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
> {code}
> [~sewen] [~pnowojski] [~kevin.cyj] Do you have any ideas why this issue could
> happen?
--
This message was sent by Atlassian Jira
(v8.3.4#803005)