[ 
https://issues.apache.org/jira/browse/FLINK-39924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39924:
-----------------------------------
    Labels: pull-request-available  (was: )

> Memory fragmentation from jemalloc misconfiguration
> ---------------------------------------------------
>
>                 Key: FLINK-39924
>                 URL: https://issues.apache.org/jira/browse/FLINK-39924
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Configuration
>    Affects Versions: 2.0.2, 2.2.1, 1.20.5, 2.1.3
>            Reporter: Keith Lee
>            Priority: Critical
>              Labels: pull-request-available
>
> We observed excessive memory fragmentation in production. Using malloc_stats, 
> we measured 3.91 GB of fragmentation in the most extreme case (10.01 GB 
> Resident - 6.1 GB Active), which is significant because the pod has a memory 
> limit of 16 GB. 
> This was caused by {*}the jemalloc arena count being configured higher than 
> the expected default of 4 x number_of_cpu_cores{*}. 
> h2. Why is high jemalloc arena count bad?
> A higher number of arenas reduces thread contention during malloc, at the 
> cost of higher memory fragmentation and overall memory usage. Memory freed 
> by the process back to jemalloc is less likely to be reused because it is 
> spread across more arenas, and it has to go through a decay period of 10 
> seconds before being returned to the operating system.
> The fragmentation leaves less memory for the page cache, which hurts 
> performance and increases the likelihood of an OOMKill.
> h2. Root cause
> By default, jemalloc sets narenas to 4 * number_of_cpu_cores. However, the 
> value of number_of_cpu_cores is obtained from the host machine and not from 
> the CPU resource configured for the pod. The misconfiguration occurs when 
> the host machine's CPU core count and the pod's CPU resource configuration 
> do not match.
> h2. Reproduction and confirmation
> Steps to reproduce can be found here: 
> [https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]
> The reproduction was run on a 16-core Mac Studio. We found a reduction of 
> 10.7% in resident set size and a slight performance improvement when narenas 
> is configured correctly:
> {{============================================================}}
> {{[+] Per-image summary:}}
> {{============================================================}}
> {{  image                          highest anon    avg anon  lowest write-recs  avg write-recs}}
> {{  flink:2.2.1-scala_2.12-java17    1679.3 MiB  1522.6 MiB             186901          207614}}
> {{  flink-2.2.1-narenas4             1499.7 MiB  1301.9 MiB             200945          213198}}
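
The root cause described above can be illustrated with a minimal Python sketch. It is not taken from the issue or the linked reproduction. It assumes the pod's CPU limit is exposed through a cgroup v2 cpu.max file, applies only the 4 x cores rule quoted in the report (jemalloc's handling of edge cases is not modelled), and uses hypothetical helper names and a hypothetical 4-CPU pod limit.

```python
import math
from typing import Optional


def parse_cpu_max(cpu_max: str) -> Optional[float]:
    """Parse the contents of a cgroup v2 cpu.max file into a CPU count.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns None when the cgroup is unlimited.
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def default_narenas(cpu_cores: int) -> int:
    """Arena count under the 4 x cores rule quoted in the report."""
    return 4 * cpu_cores


def pod_narenas(host_cores: int, cpu_max: str) -> int:
    """Arena count sized to the pod's CPU limit rather than to the host."""
    limit = parse_cpu_max(cpu_max)
    if limit is None:
        cores = host_cores  # no limit: the host core count is the right input
    else:
        # Round fractional limits up, and never exceed the host core count.
        cores = max(1, min(host_cores, math.ceil(limit)))
    return default_narenas(cores)


if __name__ == "__main__":
    host_cores = 16            # assumption: a 16-core host
    cpu_max = "400000 100000"  # assumption: pod limited to 4 CPUs

    print("host-derived narenas:", default_narenas(host_cores))      # 64
    print("pod-derived narenas:", pod_narenas(host_cores, cpu_max))  # 16
    # jemalloc reads the arena count from the narenas option in MALLOC_CONF.
    print("MALLOC_CONF=narenas:%d" % pod_narenas(host_cores, cpu_max))
```

With these assumed inputs, the host-derived value is four times larger than the value derived from the pod's limit, which is the kind of mismatch the report attributes the fragmentation to. The last line prints a setting that could be exported in the container environment to pin the arena count explicitly.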



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
