[ 
https://issues.apache.org/jira/browse/FLINK-15178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992664#comment-16992664
 ] 

Piotr Nowojski edited comment on FLINK-15178 at 12/10/19 3:42 PM:
------------------------------------------------------------------

Hmm, I'm wondering if this is related to FLINK-14952. The solution for that 
ticket will be to choose file mode by default, either for YARN only or for all 
deployments.
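
For reference, the file mode can already be forced explicitly today via 
flink-conf.yaml (this is the same workaround noted in the issue description 
below):

{code}
taskmanager.network.bounded-blocking-subpartition-type: file
{code}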

Could you post more details? What was the memory/swap usage on the machine when 
this happened?

Also, a quick Google search revealed 
[this|https://bugs.openjdk.java.net/browse/JDK-8187709]. Could you try running 
the same job but with the {{-XX:-UseCompressedOops}} flag?
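
For reference, one way to pass that flag cluster-wide (assuming the attached 
flink-conf.yaml is used for submission) is via the {{env.java.opts}} option:

{code}
# flink-conf.yaml: pass the JVM flag to all Flink processes
env.java.opts: -XX:-UseCompressedOops
{code}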


was (Author: pnowojski):
Hmm, I'm wondering if this is related to FLINK-14952 . The solution for that 
ticket will be either to chose file mode by default either for yarn or for all 
deployments.

> TaskExecutor crashes due to mmap allocation failure for BLOCKING shuffle
> ------------------------------------------------------------------------
>
>                 Key: FLINK-15178
>                 URL: https://issues.apache.org/jira/browse/FLINK-15178
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Priority: Major
>             Fix For: 1.10.0
>
>         Attachments: MultiRegionBatchNumberCount.java, flink-conf.yaml
>
>
> I met this issue when running a test batch (DataSet) job with a parallelism 
> of 1000.
> Some TMs crashed due to the error below: 
> {code:java}
> # There is insufficient memory for the Java Runtime Environment to continue.
> # Native memory allocation (mmap) failed to map 12288 bytes for committing 
> reserved memory.
> [thread 139864559318784 also had an error]
> [thread 139867407243008 also had an error]
> {code}
> With either of the following actions, this problem could be avoided:
> 1. changing ExecutionMode from BATCH_FORCED to PIPELINED
> 2. changing config "taskmanager.network.bounded-blocking-subpartition-type" 
> from default "auto" to "file"
> So it looks like it is related to the mmap allocation of the BLOCKING shuffle.
> The issue is a bit weird in that it always happens at the beginning of a job 
> and disappears after several rounds of failovers, so the job eventually 
> succeeds.
> The job code and config file are attached.
> The command to run it (on a yarn cluster) is 
> {code:java}
> bin/flink run -d -m yarn-cluster -c 
> com.alibaba.blink.tests.MultiRegionBatchNumberCount 
> ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
> {code}
> [~sewen]  [~pnowojski]  [~kevin.cyj]  Do you have any idea why this issue 
> could happen?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
