[ 
https://issues.apache.org/jira/browse/FLINK-39924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Lee updated FLINK-39924:
------------------------------
    Description: 
We observed excessive memory fragmentation in production. Using malloc_stats, we 
identified the most extreme case of fragmentation at 3.91 GB (10.01 GB Resident 
minus 6.1 GB Active), which is significant because the pod has a limit of 16 GB. 
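
As a quick check of the figure above, the arithmetic can be sketched as follows. This assumes fragmentation is taken as Resident minus Active, as in the description; the variable names are illustrative only and are not jemalloc API.

```python
# Values reported in this issue, in GB. Names are illustrative only.
resident_gb = 10.01
active_gb = 6.1
pod_limit_gb = 16.0

# Fragmentation as defined in the description: Resident minus Active.
fragmentation_gb = resident_gb - active_gb

# Fraction of the pod memory limit taken up by that fragmentation.
share_of_limit = fragmentation_gb / pod_limit_gb

print(f"fragmentation: {fragmentation_gb:.2f} GB")
print(f"share of pod limit: {share_of_limit:.1%}")
```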

We also observed that the jemalloc arena count was higher than the default of 
4 x number_of_cpu_cores that we expected from the pod's CPU allocation. 
h2. Why is a high jemalloc arena count bad?

A large number of arenas leaves many arenas infrequently used. Infrequently 
used arenas hold dirty pages for dirty_decay_ms before releasing the memory to 
the OS. This leaves less memory for the Flink process and the OS page cache, 
which degrades performance and increases the likelihood of an OOMKill.
h2. Root cause

By default, jemalloc sets narenas to 4 * number_of_cpu_cores. However, 
*jemalloc is not container aware and obtains number_of_cpu_cores from the host 
machine* instead of from the pod's CPU resource configuration. 

See jemalloc default: 
[https://github.com/jemalloc/jemalloc/blob/4de3a4c3d1bb4520acdc856ddab3e57a28eb7795/src/jemalloc_init.c#L379-L391]
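
The mismatch described above can be illustrated with a minimal sketch. It assumes the 4 * number_of_cpu_cores rule as stated in this issue, a cgroup v2 cpu.max value in the "<quota> <period>" form, and jemalloc's MALLOC_CONF option string "narenas:<n>". The helper names and the host and pod sizes (64 cores, 4 CPUs) are hypothetical and are not taken from the Flink pull request.

```python
import math


def default_narenas(visible_cpus: int) -> int:
    """Arena count under the rule stated in this issue: 4 per visible CPU."""
    return 4 * visible_cpus


def cpus_from_cgroup_v2(cpu_max: str, fallback_cpus: int) -> int:
    """Derive a CPU count from the contents of a cgroup v2 cpu.max file.

    The content is "<quota> <period>", or "max <period>" when no limit is set.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return fallback_cpus  # no limit configured, fall back to the host count
    return max(1, math.ceil(int(quota) / int(period)))


def malloc_conf_for(pod_cpus: int) -> str:
    """Build a MALLOC_CONF value that pins narenas to the pod's CPU count."""
    return f"narenas:{default_narenas(pod_cpus)}"


# Hypothetical sizes: a 64-core host running a pod limited to 4 CPUs.
host_cpus = 64
pod_cpus = cpus_from_cgroup_v2("400000 100000", fallback_cpus=host_cpus)

print("arenas from host core count:", default_narenas(host_cpus))
print("MALLOC_CONF for pod:", malloc_conf_for(pod_cpus))
```

With these assumed sizes, the host-derived default is 16 times larger than the value derived from the pod limit, which is the gap the issue attributes to jemalloc not being container aware.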
h2. Reproduction and confirmation

Steps to reproduce can be found here: 
[https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]

The reproduction was run on a 14-core MacBook Pro. We found a reduction of 
10.7% in resident set size (highest anon, 1679.3 MiB to 1499.7 MiB) and a 
slight performance improvement when narenas is configured to 4 * pod CPU count.

{{============================================================}}
{{[+] Per-image summary:}}
{{============================================================}}
{{  image                            highest anon     avg anon   lowest write-recs   avg write-recs}}
{{  flink:2.2.1-scala_2.12-java17      1679.3 MiB   1522.6 MiB              186901           207614}}
{{  flink-2.2.1-narenas4               1499.7 MiB   1301.9 MiB              200945           213198}}
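
The relative changes behind the summary above can be recomputed with a short sketch. The numbers are copied from the per-image summary; the dictionary layout and the pct_change helper are illustrative only and are not part of the reproduction scripts.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Figures from the per-image summary (MiB for anon, records for write-recs).
baseline = {"highest_anon": 1679.3, "avg_anon": 1522.6,
            "lowest_write_recs": 186901, "avg_write_recs": 207614}
narenas4 = {"highest_anon": 1499.7, "avg_anon": 1301.9,
            "lowest_write_recs": 200945, "avg_write_recs": 213198}

changes = {name: pct_change(baseline[name], narenas4[name]) for name in baseline}

for name, value in changes.items():
    print(f"{name}: {value:+.1f}%")
```

The highest_anon entry corresponds to the 10.7% reduction quoted in the description.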

  was:
We observed excessive memory fragmentation in production. Using malloc_stats, we 
identified the most extreme case of fragmentation at 3.91 GB (10.01 GB Resident 
minus 6.1 GB Active), which is significant because the pod has a limit of 16 GB. 

We also observed that the jemalloc arena count was higher than the default of 
4 x number_of_cpu_cores that we expected from the pod's CPU allocation. 
h2. Why is a high jemalloc arena count bad?

A large number of arenas leaves many arenas infrequently used. Infrequently 
used arenas hold dirty pages for dirty_decay_ms before releasing the memory to 
the OS. This leaves less memory for the page cache, which degrades performance 
and increases the likelihood of an OOMKill.
h2. Root cause

By default, jemalloc sets narenas to 4 * number_of_cpu_cores. However, 
*jemalloc is not container aware and obtains number_of_cpu_cores from the host 
machine* instead of from the pod's CPU resource configuration. 

See jemalloc default: 
[https://github.com/jemalloc/jemalloc/blob/4de3a4c3d1bb4520acdc856ddab3e57a28eb7795/src/jemalloc_init.c#L379-L391]
h2. Reproduction and confirmation

Steps to reproduce can be found here: 
[https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]

The reproduction was run on a 14-core MacBook Pro. We found a reduction of 
10.7% in resident set size (highest anon, 1679.3 MiB to 1499.7 MiB) and a 
slight performance improvement when narenas is configured to 4 * pod CPU count.

{{============================================================}}
{{[+] Per-image summary:}}
{{============================================================}}
{{  image                            highest anon     avg anon   lowest write-recs   avg write-recs}}
{{  flink:2.2.1-scala_2.12-java17      1679.3 MiB   1522.6 MiB              186901           207614}}
{{  flink-2.2.1-narenas4               1499.7 MiB   1301.9 MiB              200945           213198}}


> Memory fragmentation from jemalloc misconfiguration
> ---------------------------------------------------
>
>                 Key: FLINK-39924
>                 URL: https://issues.apache.org/jira/browse/FLINK-39924
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Configuration
>    Affects Versions: 2.0.2, 2.2.1, 1.20.5, 2.1.3
>            Reporter: Keith Lee
>            Priority: Critical
>              Labels: pull-request-available



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
