Keith Lee created FLINK-39924:
---------------------------------

             Summary: Memory fragmentation from jemalloc misconfiguration
                 Key: FLINK-39924
                 URL: https://issues.apache.org/jira/browse/FLINK-39924
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Configuration
    Affects Versions: 2.1.3, 1.20.5, 2.2.1, 2.0.2
            Reporter: Keith Lee


We observed excessive memory fragmentation in production. Using malloc_stats, we 
identified the most extreme case of fragmentation at 3.91 GB (10.01 GB Resident 
- 6.1 GB Active), which is significant given that the pod has a limit of 16 GB. 
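
As a quick check of the figures above, fragmentation here is the gap between 
resident and active memory as reported by malloc_stats. The snippet below is 
only an illustrative sketch: the function name is hypothetical and the inputs 
are the numbers quoted in this report.

```python
def fragmentation_gb(resident_gb: float, active_gb: float) -> float:
    """Memory held by the allocator (resident) but not in active use."""
    return resident_gb - active_gb


resident_gb = 10.01   # "Resident" from malloc_stats, as reported above
active_gb = 6.1       # "Active" from malloc_stats, as reported above
pod_limit_gb = 16.0   # pod memory limit, as reported above

frag = fragmentation_gb(resident_gb, active_gb)
share = frag / pod_limit_gb  # fraction of the pod limit lost to fragmentation

print(f"fragmentation: {frag:.2f} GB ({share:.1%} of the pod limit)")
```

With these inputs, roughly a quarter of the pod's memory limit is taken up by 
fragmentation.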

This was caused by {*}the jemalloc arena count being configured higher than the 
expected default of 4 x number_of_cpu_cores{*}. 
h2. Why is high jemalloc arena count bad?

A higher number of arenas reduces thread contention during malloc at the cost of 
higher memory fragmentation and overall memory usage. Memory freed by the 
process to jemalloc is less likely to be re-used because it is spread across a 
higher number of arenas, and it has to go through a decay of 10 seconds before 
being freed back to the operating system.

The fragmentation leaves less memory for page cache, which impacts performance 
and increases the likelihood of an OOMKill.
h2. Root cause

By default, jemalloc configures narenas as 4 * number_of_cpu_cores. However, 
the value of number_of_cpu_cores is obtained from the host machine and not from 
the CPU resource configured for the pod. The misconfiguration happens when the 
host machine's CPU core count and the pod's CPU resource configuration differ.
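
The mismatch described above can be sketched as follows. This is an 
illustrative sketch only, not Flink or jemalloc code: the helper names, the 
64-core host, and the 4-CPU pod are hypothetical, and it assumes the pod's CPU 
limit is exposed as a cgroup v2 cpu.max value in the "<quota> <period>" form. 
The last line prints a MALLOC_CONF setting using jemalloc's narenas option.

```python
import math


def default_narenas(ncpus: int) -> int:
    """Default arena count as described in this report: 4 x CPU cores."""
    return 4 * ncpus


def cpus_from_cpu_max(cpu_max: str, host_cpus: int) -> int:
    """Derive a CPU count from the contents of a cgroup v2 cpu.max file.

    The file holds "<quota> <period>", or "max <period>" when unlimited.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return host_cpus  # no limit set, so the host core count applies
    return max(1, math.ceil(int(quota) / int(period)))


host_cpus = 64                 # hypothetical host core count
pod_cpu_max = "400000 100000"  # hypothetical pod limit of 4 CPUs

seen_by_jemalloc = default_narenas(host_cpus)          # based on host cores
pod_cpus = cpus_from_cpu_max(pod_cpu_max, host_cpus)
expected = default_narenas(pod_cpus)                   # based on pod CPUs

print(f"narenas from host cores: {seen_by_jemalloc}")
print(f"narenas from pod CPUs:   {expected}")
print(f"MALLOC_CONF=narenas:{expected}")
```

In this hypothetical case jemalloc would create 256 arenas where 16 were 
expected. Whether 4 x pod CPUs or a smaller fixed value is the better setting 
is a judgment call that this sketch does not settle.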
h2. Reproduction and confirmation

Steps to reproduce can be found here: 
[https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]

The reproduction was run on a 16-core Mac Studio. We found a reduction of 
10.7 % in resident set size (highest anon) and a slight performance improvement 
when narenas is configured correctly.



{{============================================================}}
{{[+] Per-image summary:}}
{{============================================================}}
{{  image                           highest anon      avg anon   lowest write-recs    avg write-recs}}
{{  flink:2.2.1-scala_2.12-java17     1679.3 MiB    1522.6 MiB              186901            207614}}
{{  flink-2.2.1-narenas4              1499.7 MiB    1301.9 MiB              200945            213198}}
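
The relative changes can be recomputed from the summary table. The sketch 
below only reuses the four figures per image shown in the table; the variable 
and function names are illustrative.

```python
# Values copied from the per-image summary table.
baseline = {"highest_anon_mib": 1679.3, "avg_anon_mib": 1522.6,
            "lowest_write_recs": 186901, "avg_write_recs": 207614}
narenas4 = {"highest_anon_mib": 1499.7, "avg_anon_mib": 1301.9,
            "lowest_write_recs": 200945, "avg_write_recs": 213198}


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


changes = {key: pct_change(baseline[key], narenas4[key]) for key in baseline}
for key, value in changes.items():
    print(f"{key:>18}: {value:+.1f} %")
```

The 10.7 % reduction quoted above corresponds to the highest anon column. The 
write-recs columns increase for the narenas4 image.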


