[ https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485077#comment-17485077 ]
Yun Gao edited comment on FLINK-18356 at 2/1/22, 7:37 AM:
----------------------------------------------------------
Hi~ with some more tests, the current findings are:
# To avoid the impact of tests being randomly split between two processes, I
changed the process count to 1. For each test, I added a new empty test (which
sleeps for 10 minutes) and kept snapshotting the process's state while that
test was running. To acquire a stable snapshot, the state is snapshotted after
multiple runs of jmap -dump:live, which triggers a full GC.
# By comparing the results from Java Native Memory Tracking, pmap -x and
profiling with jemalloc, it seems the gap between the JVM-reported usage and
the actual memory consumption comes from direct buffers, namely the network
buffers and the managed memory. The managed memory is allocated directly with
Unsafe and is thus freed actively. The network buffers, however, are plain Java
DirectByteBuffers, and releasing them depends on GC. Thus if we have enough
heap memory (which is indeed the case), the network buffers are not released in
time, which contributes to the high memory usage (see the first sketch after
this list). A demonstrating example is TableEnvironmentITCase: the other cases
share the same mini-cluster while this case creates a new mini-cluster for each
test, thus when running this case the memory increases much more rapidly.
# The malloc() implementation also caches part of the freed memory, and
different implementations vary in how much they cache. This explains the
difference between using libc malloc (caches about 2G) and jemalloc (caches
about 500M). It is demonstrated by comparing the addresses of the allocated
network buffers and managed memory buffers with the pmap -x output of the
process.
# The second part is about the class metadata. With jemalloc and no other
changes, by the end there would be about 170000 classes, consuming about
1.7GB. Most of these classes come from
org.apache.flink.core.classloading.SubmoduleClassLoader: there are dozens of
these classloaders and each loads about 1700 classes.
# By applying the shared Akka patch provided by [~chesnay]
([https://github.com/zentol/flink/commits/rpc_share]), the number of classes
decreased to about 110000 (1GB) and the number of SubmoduleClassLoaders
decreased to about 40.
# We suspect there are class leaks because some classes loaded by the
AppClassLoader refer to other classloaders in static fields (see the second
sketch after this list). We tried some such cases: if we disable
org.apache.flink.table.runtime.generated.CompileUtils#COMPILED_CACHE by setting
the maximum allowed size to 1, the number of SubmoduleClassLoaders decreases to
2 and the number of classes decreases to about 40000 (~500MB). Disabling
org.apache.flink.table.runtime.generated.CompileUtils#COMPILED_EXPRESSION_CACHE
and JaninoRelMetadataProvider#HANDLES has no impact on the number of classes.
But the root cause of why COMPILED_CACHE affects the number of
SubmoduleClassLoaders is not confirmed yet.
# The classes contained in ObjectStreamClass#Cache seem to be mostly loaded by
the AppClassLoader. Other issues include JsonSerdeUtil#OBJECT_MAPPER_INSTANCE
and org.apache.calcite.rel.type.RelDataTypeFactoryImpl#DATATYPE_CACHE. These
issues might not be related to the current case, but we might also investigate
them since they might hold user classloaders in real cases.
# The above results are stable across multiple runs.
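To make the second point concrete, here is a minimal standalone sketch (not
Flink's actual allocation code; the class name and buffer sizes are invented
for illustration) of the two allocation paths and why only one of them depends
on GC:
{code:java}
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

import sun.misc.Unsafe;

public class DirectMemoryDemo {
    public static void main(String[] args) throws Exception {
        // GC-dependent path (what the network buffers use): the native memory
        // behind a plain DirectByteBuffer is only returned once the buffer
        // object becomes unreachable AND a GC run processes its Cleaner. With
        // a large, mostly-idle heap, full GCs are rare, so the native memory
        // can stay mapped long after the buffer was logically released.
        ByteBuffer networkLike = ByteBuffer.allocateDirect(64 << 20); // 64 MB
        networkLike = null; // unreachable, but nothing is freed until a GC runs

        // Actively-freed path (what the managed memory uses): memory obtained
        // from Unsafe is released by an explicit freeMemory() call,
        // independent of any GC activity.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);
        long managedLike = unsafe.allocateMemory(64 << 20);
        unsafe.freeMemory(managedLike); // returned to the allocator right away

        // Forcing a full GC (which repeated jmap -dump:live also does) is what
        // finally lets the DirectByteBuffer's native memory go.
        System.gc();
    }
}
{code}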
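For the sixth point, a hypothetical sketch of the suspected pattern (the class
and field names below are invented; only the shape matches caches like
CompileUtils#COMPILED_CACHE): a static cache in a class loaded by the
AppClassLoader keeps per-test classloaders, and thus their class metadata,
reachable:
{code:java}
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class StaticCacheLeakDemo {
    // Static field in an AppClassLoader-loaded class: it lives for the whole
    // JVM, i.e. across all tests in the forked test process.
    static final Map<String, ClassLoader> CACHE = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            // Stand-in for a per-test / per-job classloader such as
            // SubmoduleClassLoader or a user-code classloader.
            ClassLoader perTestLoader = new URLClassLoader(new URL[0]);

            // In the real case the cached value is a generated Class<?> whose
            // defining loader is perTestLoader; caching either the class or
            // the loader has the same effect on reachability.
            CACHE.put("entry-" + i, perTestLoader);
        }
        // All perTestLoader instances stay strongly reachable through the
        // static map, so none of the classes they defined can be unloaded and
        // the class metadata accumulates. Bounding or clearing the cache (as
        // tried with COMPILED_CACHE's maximum size) removes the pin.
    }
}
{code}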
> Exit code 137 returned from process
> -----------------------------------
>
> Key: FLINK-18356
> URL: https://issues.apache.org/jira/browse/FLINK-18356
> Project: Flink
> Issue Type: Bug
> Components: Build System / Azure Pipelines, Tests
> Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0
> Reporter: Piotr Nowojski
> Assignee: Chesnay Schepler
> Priority: Blocker
> Labels: pull-request-available, test-stability
> Fix For: 1.15.0
>
> Attachments: 1234.jpg, app-profiling_4.gif
>
>
> {noformat}
> ============================= test session starts ==============================
> platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1
> cachedir: .tox/py37-cython/.pytest_cache
> rootdir: /__w/3/s/flink-python
> collected 568 items
> pyflink/common/tests/test_configuration.py ..........                   [  1%]
> pyflink/common/tests/test_execution_config.py .......................   [  5%]
> pyflink/dataset/tests/test_execution_environment.py .
> ##[error]Exit code 137 returned from process: file name '/bin/docker', arguments 'exec -i -u 1002 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'.
> Finishing: Test - python
> {noformat}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3