[ 
https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741961#comment-17741961
 ] 

Yunhong Zheng edited comment on FLINK-18356 at 7/11/23 11:55 AM:
-----------------------------------------------------------------

Hi, all. I think I found the root cause of the table-planner exit 137 error under 
the guidance of [~lincoln.86xy]. This error is similar to 
[FLINK-19125|https://issues.apache.org/jira/browse/FLINK-19125]: both are caused 
by {*}glibc{*}'s poor handling of memory fragmentation, which keeps it from 
returning freed memory to the kernel gracefully (see the [glibc 
bugzilla|https://sourceware.org/bugzilla/show_bug.cgi?id=15321] and the [glibc 
manual|https://www.gnu.org/software/libc/manual/html_mono/libc.html#Freeing-after-Malloc]).

When I ran mvn verify for flink-table-planner on Azure CI and on my own machine, 
I found that the JVM heap and non-heap memory were stable and within the normal 
range. However, the total memory usage ({*}RES{*}) of the forked processes was 
very high, as shown in the following figure (PIDs 2958793 and 2958794):

!image-2023-07-11-19-28-52-851.png|width=537,height=245!
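
(For anyone trying to reproduce this: a minimal, hedged sketch of how to check the RES numbers yourself is below; these are my own commands, not part of the original measurement, and the PIDs are the ones from my run.)

{code:bash}
# Show resident (RSS) and virtual (VSZ) memory of the two surefire fork processes
ps -o pid,rss,vsz,cmd -p 2958793,2958794

# Or watch them live
top -p 2958793 -p 2958794{code}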

I then dug deeper into the specific memory mappings of these two 
processes:

 
{code:bash}
pmap -p 2958793 {code}
The output shows a large number of memory regions with a size close to 
*64MB* (more than 200 of them):

 

!image-2023-07-11-19-35-54-530.png|width=237,height=413!
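
(To quantify the fragmentation without counting regions in the screenshot by hand, here is a small, hedged one-liner; the 60-66MB size window is a heuristic for glibc's per-thread arena heaps, which are roughly 64MB each.)

{code:bash}
# Count mappings whose size (Kbytes column of pmap -x) is roughly 64MB;
# glibc arena heaps show up as ~65536KB anonymous regions
pmap -x 2958793 | awk '$2 >= 60000 && $2 <= 66000' | wc -l{code}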

Based on past experience, this looks like the classic memory fragmentation 
problem of *glibc* malloc on JDK 8 (the ~64MB regions match glibc's per-thread 
arenas). So we downloaded *libjemalloc* and set the environment variable:

 
{code:bash}
export LD_PRELOAD=${JAVA_HOME}/lib/amd64/libjemalloc.so.2{code}
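
(Before trusting the new numbers, it is worth confirming that the preload actually took effect; this is my own sanity check, not part of the original report.)

{code:bash}
# The forked test process should now have libjemalloc mapped in
grep jemalloc /proc/2958793/maps{code}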
After that, the overall memory of the forked processes became stable and matched 
expectations (about 5GB):

 

!image-2023-07-11-19-41-18-626.png|width=488,height=208!

!image-2023-07-11-19-41-37-105.png|width=228,height=287!

Solving this properly requires modifying the CI Docker image 
([flink-ci-docker|https://github.com/flink-ci/flink-ci-docker]) to install 
*libjemalloc* and preload it in place of the default *glibc* allocator, as was 
done for FLINK-19125. cc [~chesnay].
{code:bash}
RUN apt-get update && apt-get -y install libjemalloc-dev

ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so {code}
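
(A quick, hedged way to sanity-check such an image change locally, assuming the image is built straight from that repo's Dockerfile, would be:)

{code:bash}
# Build the CI image locally and confirm jemalloc is installed and preloaded
docker build -t flink-ci-jemalloc .
docker run --rm flink-ci-jemalloc sh -c 'echo $LD_PRELOAD && ls -l /usr/lib/x86_64-linux-gnu/libjemalloc.so'{code}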
I have opened a new Jira (FLINK-32577) to track and fix this issue. cc 
[~mapohl]  [~jark].

 


> flink-table-planner Exit code 137 returned from process
> -------------------------------------------------------
>
>                 Key: FLINK-18356
>                 URL: https://issues.apache.org/jira/browse/FLINK-18356
>             Project: Flink
>          Issue Type: Bug
>          Components: Build System / Azure Pipelines, Tests
>    Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0, 1.16.0, 1.17.0, 1.18.0
>            Reporter: Piotr Nowojski
>            Assignee: Yunhong Zheng
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>         Attachments: 1234.jpg, app-profiling_4.gif, 
> image-2023-01-11-22-21-57-784.png, image-2023-01-11-22-22-32-124.png, 
> image-2023-02-16-20-18-09-431.png, image-2023-07-11-19-28-52-851.png, 
> image-2023-07-11-19-35-54-530.png, image-2023-07-11-19-41-18-626.png, 
> image-2023-07-11-19-41-37-105.png
>
>
> {noformat}
> ============================= test session starts 
> ==============================
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..........                    [  
> 1%]
> pyflink/common/tests/test_execution_config.py .......................    [  
> 5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', 
> arguments 'exec -i -u 1002 
> 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb 
> /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)