Zhu Zhu created FLINK-15178:
-------------------------------

             Summary: Task crash due to mmap allocation failure for BLOCKING shuffle
                 Key: FLINK-15178
                 URL: https://issues.apache.org/jira/browse/FLINK-15178
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Network
    Affects Versions: 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0
         Attachments: MultiRegionBatchNumberCount.java, flink-conf.yaml
I met this issue when running a test batch (DataSet) job with parallelism 1000. Some TMs crash with the error below:

{code:java}
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
[thread 139864559318784 also had an error]
[thread 139867407243008 also had an error]
{code}

With either of the following changes, the problem does not happen:
1. changing ExecutionMode from BATCH_FORCED to PIPELINED
2. changing the config "taskmanager.network.bounded-blocking-subpartition-type" from the default "auto" to "file"

So it looks related to the mmap-based BLOCKING shuffle. The problem always happens at the beginning of the job and disappears after several rounds of failovers, so the job eventually succeeds.

The job code and conf are attached. The command to run it (on a YARN cluster) is:

{code:java}
bin/flink run -d -m yarn-cluster -c com.alibaba.blink.tests.MultiRegionBatchNumberCount ../flink-tests-1.0-SNAPSHOT-1.10.jar --parallelism 1000
{code}

[~sewen] [~pnowojski] [~kevin.cyj] Do you know why this issue could happen?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
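For reference, a minimal sketch of the second workaround as a flink-conf.yaml fragment (assuming the config key named above behaves as described, i.e. "file" forces file-based bounded blocking subpartitions instead of mmap-backed ones):

{code}
# Workaround sketch (assumption based on the description above):
# force BLOCKING shuffle subpartitions to use plain files rather than
# mmap, avoiding the native mmap allocation that fails at parallelism 1000.
taskmanager.network.bounded-blocking-subpartition-type: file
{code}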