[ https://issues.apache.org/jira/browse/FLINK-39924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Lee updated FLINK-39924:
------------------------------
Description:
We observed excessive memory fragmentation in production. Using malloc_stats, we
identified the most extreme case of fragmentation at 3.91 GB (10.01 GB Resident
- 6.1 GB Active), which is significant as the pod has a limit of 16 GB.
We also observed that the jemalloc arena count was higher than the expected
default of 4 x number_of_cpu_cores.
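The fragmentation figure above is simple arithmetic on the malloc_stats output. A minimal sketch, assuming "fragmentation" means resident minus active as used in this description (the function names are hypothetical and not part of jemalloc or Flink):

```python
def fragmentation_gb(resident_gb: float, active_gb: float) -> float:
    """Resident memory that is not backing active allocations (this issue's measure)."""
    return resident_gb - active_gb


def share_of_limit(fragmentation: float, limit_gb: float) -> float:
    """Fraction of the pod memory limit taken up by fragmentation."""
    return fragmentation / limit_gb


# Figures taken from the description: 10.01 GB resident, 6.1 GB active, 16 GB limit.
frag = fragmentation_gb(10.01, 6.1)
print(f"fragmentation: {frag:.2f} GB")
print(f"share of pod limit: {share_of_limit(frag, 16.0):.1%}")
```

With the numbers reported here, roughly a quarter of the pod limit is held as fragmentation.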
h2. Why is high jemalloc arena count bad?
A large number of arenas leads to infrequently used arenas. Infrequently used
arenas hold dirty pages for dirty_decay_ms before releasing memory to the OS.
This leaves less memory for the page cache, impacting performance and increasing
the likelihood of an OOMKill.
h2. Root cause
Jemalloc by default configures narenas as 4 * number_of_cpu_core; however,
*jemalloc is not container aware and the value for number_of_cpu_core is
obtained from the host machine* instead of the pod CPU resource configuration.
See jemalloc default:
[https://github.com/jemalloc/jemalloc/blob/4de3a4c3d1bb4520acdc856ddab3e57a28eb7795/src/jemalloc_init.c#L379-L391]
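To illustrate the mismatch, here is a minimal sketch that compares the arena count jemalloc would derive from the host core count with the count derived from a container CPU limit. It assumes cgroup v2, where the limit appears in cpu.max as "<quota> <period>" or "max <period>". The helper names, the 64-core host, and the 2-CPU pod are hypothetical. Whether the fix in this issue applies the value through MALLOC_CONF is also an assumption.

```python
import math
from typing import Optional


def default_narenas(ncpus: int) -> int:
    """jemalloc's documented default: 4 * CPUs, or 1 on a single-CPU system."""
    return 1 if ncpus <= 1 else 4 * ncpus


def cpu_limit_from_cpu_max(cpu_max: str) -> Optional[float]:
    """Parse the contents of a cgroup v2 cpu.max file.

    Returns the CPU limit in cores, or None when no limit is set ("max").
    """
    quota, period = cpu_max.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def malloc_conf_for(cpu_limit: float) -> str:
    """Build a 'narenas:N' option string sized to the container CPU limit."""
    ncpus = max(1, math.ceil(cpu_limit))
    return f"narenas:{default_narenas(ncpus)}"


# Hypothetical example: a 64-core host running a pod limited to 2 CPUs.
host_cores = 64
pod_limit = cpu_limit_from_cpu_max("200000 100000")
print("arenas derived from host cores:", default_narenas(host_cores))
if pod_limit is not None:
    print("option sized to pod limit:", malloc_conf_for(pod_limit))
```

Under these assumptions the host-derived count is far larger than the pod-sized one, which is the gap described above.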
h2. Reproduction and confirmation
Steps to reproduce can be found here:
[https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]
The reproduction was run on a 14-core MacBook Pro. We found a reduction of
10.7 % in resident set size (highest anon) and a slight performance improvement
when narenas is configured correctly:
{{============================================================}}
{{[+] Per-image summary:}}
{{============================================================}}
{{ image                          highest anon  avg anon    lowest write-recs  avg write-recs}}
{{ flink:2.2.1-scala_2.12-java17  1679.3 MiB    1522.6 MiB  186901             207614}}
{{ flink-2.2.1-narenas4           1499.7 MiB    1301.9 MiB  200945             213198}}
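The 10.7 % figure can be checked against the summary above. A minimal sketch, assuming the reduction is computed on the "highest anon" column (the helper names are hypothetical; the numbers are copied from the table):

```python
def pct_reduction(baseline: float, candidate: float) -> float:
    """Percentage by which candidate is lower than baseline."""
    return (baseline - candidate) / baseline * 100


def pct_increase(baseline: float, candidate: float) -> float:
    """Percentage by which candidate is higher than baseline."""
    return (candidate - baseline) / baseline * 100


# Values from the per-image summary: stock image vs. the narenas4 image.
highest_anon = pct_reduction(1679.3, 1499.7)
avg_anon = pct_reduction(1522.6, 1301.9)
avg_write_recs = pct_increase(207614, 213198)

print(f"highest anon reduction: {highest_anon:.1f} %")
print(f"avg anon reduction: {avg_anon:.1f} %")
print(f"avg write-recs increase: {avg_write_recs:.1f} %")
```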
was:
We observed excessive memory fragmentation in production. Using malloc_stats, we
identified the most extreme case of fragmentation at 3.91 GB (10.01 GB Resident
- 6.1 GB Active), which was significant as the pod has a limit of 16 GB.
This was caused by {*}the jemalloc arena count being misconfigured to a value
higher than the expected default of 4 x number_of_cpu_cores{*}.
h2. Why is high jemalloc arena count bad?
A higher number of arenas reduces thread contention during malloc at the cost of
higher memory fragmentation and overall memory usage: memory freed by the
process to jemalloc is less likely to be reused because it is spread across a
higher number of arenas, and it has to go through a decay of 10 seconds before
being released back to the operating system.
The fragmentation leaves less memory for the page cache, impacting performance
and increasing the likelihood of an OOMKill.
h2. Root cause
Jemalloc by default configures narenas as 4 * number_of_cpu_core; however,
the value for number_of_cpu_core is obtained from the host machine and not from
the CPU resource configured for the pod. The misconfiguration happens when the
host machine CPU core count and the pod CPU resource configuration do not match.
h2. Reproduction and confirmation
Steps to reproduce can be found here:
[https://github.com/leekeiabstraction/flink-docker/tree/reproduce-jemalloc-fragmentation/reproduce-jemalloc-fragmentation]
The reproduction was run on a 16-core Mac Studio. We found a reduction of
10.7 % in resident set size and a slight performance improvement when narenas is
configured correctly:
{{============================================================}}
{{[+] Per-image summary:}}
{{============================================================}}
{{ image                          highest anon  avg anon    lowest write-recs  avg write-recs}}
{{ flink:2.2.1-scala_2.12-java17  1679.3 MiB    1522.6 MiB  186901             207614}}
{{ flink-2.2.1-narenas4           1499.7 MiB    1301.9 MiB  200945             213198}}
> Memory fragmentation from jemalloc misconfiguration
> ---------------------------------------------------
>
> Key: FLINK-39924
> URL: https://issues.apache.org/jira/browse/FLINK-39924
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Configuration
> Affects Versions: 2.0.2, 2.2.1, 1.20.5, 2.1.3
> Reporter: Keith Lee
> Priority: Critical
> Labels: pull-request-available