[ 
https://issues.apache.org/jira/browse/FLINK-39923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18088543#comment-18088543
 ] 

Keith Lee commented on FLINK-39923:
-----------------------------------

I am currently looking into how to fix this leak. Initial investigation seems 
to rule out a leak caused by unclosed resources on the Java side. 
Additionally, I observed that this leak does not affect Flink 1.20.

> RocksDB Statistics native memory leaks on state backend rebuild when ticker 
> metrics are enabled
> -----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39923
>                 URL: https://issues.apache.org/jira/browse/FLINK-39923
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 2.0.2, 2.2.1, 2.1.3
>            Reporter: Keith Lee
>            Priority: Major
>
> When any of the 11 RocksDB ticker-type metric options is enabled, the 
> TaskManager leaks native memory in proportion to the number of keyed state 
> backend rebuilds (job restarts, rescaling, recovery cascades).
> The ticker-type metric options are:
> {quote}state.backend.rocksdb.metrics.block-cache-hit
> state.backend.rocksdb.metrics.block-cache-miss
> state.backend.rocksdb.metrics.bloom-filter-useful
> state.backend.rocksdb.metrics.bloom-filter-full-positive
> state.backend.rocksdb.metrics.bloom-filter-full-true-positive
> state.backend.rocksdb.metrics.bytes-read
> state.backend.rocksdb.metrics.iter-bytes-read
> state.backend.rocksdb.metrics.bytes-written
> state.backend.rocksdb.metrics.compaction-read-bytes
> state.backend.rocksdb.metrics.compaction-write-bytes
> state.backend.rocksdb.metrics.stall-micros
> {quote}
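To check whether a given configuration falls into the affected set, the ticker-type options listed above can be matched mechanically. The following is a minimal sketch, not part of Flink: the class `TickerMetricCheck` and the method `enablesTickerMetrics` are hypothetical, and the configuration is modeled as a plain `Map<String, String>` rather than Flink's own configuration class.

```java
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper (not a Flink API): reports whether a configuration,
 * modeled as a plain key/value map, enables any of the 11 RocksDB
 * ticker-type metric options listed in FLINK-39923.
 */
public class TickerMetricCheck {

    // The 11 ticker-type metric options named in the issue description.
    static final List<String> TICKER_METRIC_KEYS = List.of(
            "state.backend.rocksdb.metrics.block-cache-hit",
            "state.backend.rocksdb.metrics.block-cache-miss",
            "state.backend.rocksdb.metrics.bloom-filter-useful",
            "state.backend.rocksdb.metrics.bloom-filter-full-positive",
            "state.backend.rocksdb.metrics.bloom-filter-full-true-positive",
            "state.backend.rocksdb.metrics.bytes-read",
            "state.backend.rocksdb.metrics.iter-bytes-read",
            "state.backend.rocksdb.metrics.bytes-written",
            "state.backend.rocksdb.metrics.compaction-read-bytes",
            "state.backend.rocksdb.metrics.compaction-write-bytes",
            "state.backend.rocksdb.metrics.stall-micros");

    /** Returns true if any ticker-type option is set to "true" (case-insensitive). */
    static boolean enablesTickerMetrics(Map<String, String> conf) {
        for (String key : TICKER_METRIC_KEYS) {
            // Missing keys default to "false"; Boolean.parseBoolean ignores case.
            if (Boolean.parseBoolean(conf.getOrDefault(key, "false").trim())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
                "state.backend.rocksdb.metrics.block-cache-hit", "true");
        System.out.println(enablesTickerMetrics(conf)); // prints true
    }
}
```

Because the leak is reported to scale with the number of keyed state backend rebuilds, a check like this only flags exposure; it says nothing about how much memory a given job will leak.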
> This issue was reproduced and confirmed: an OOMKill was observed within 80 
> seconds of submitting a continuously failing job to a Flink cluster 
> configured with a low restart delay and ticker-type metrics enabled. See here 
> for reproduction instructions and scripts: 
> [https://github.com/leekeiabstraction/flink/tree/reproduce-rocksdb-statistics-leak/reproduce-rocksdb-statistics-leak]
> The jeprof dotfile output below (jemalloc profiling must be enabled) points 
> to ~770 MiB (807,403,520 bytes) of memory allocated in RocksDB StatisticsJni.
>
> {quote}Legend [shape=box,fontsize=24,shape=plaintext,label="/proc/307/exe\lTotal B: 2855914662\lFocusing on: 2855914662\lDropped nodes with <= 14279573 abs(B)\lDropped edges with <= 2855914 B\l"];
> N1 [label="je_prof_backtrace\n0 (0.0%)\rof 2040910591 (71.5%)\r",shape=box,fontsize=8.0];
> N2 [label="je_prof_tctx_create\n0 (0.0%)\rof 2040910591 (71.5%)\r",shape=box,fontsize=8.0];
> N3 [label="prof_backtrace_impl\n2040910591 (71.5%)\r",shape=box,fontsize=50.3];
> N4 [label="je_malloc_default\n0 (0.0%)\rof 2032208910 (71.2%)\r",shape=box,fontsize=8.0];
> N5 [label="Unsafe_AllocateMemory0\n0 (0.0%)\rof 1874666648 (65.6%)\r",shape=box,fontsize=8.0];
> N6 [label="os\nmalloc@d01a60\n0 (0.0%)\rof 1874666648 (65.6%)\r",shape=box,fontsize=8.0];
> N7 [label="0x00007fb705ffd460\n0 (0.0%)\rof 1874578289 (65.6%)\r",shape=box,fontsize=8.0];
> N8 [label="Java_org_rocksdb_Statistics_newStatistics___3BJ\n0 (0.0%)\rof 807469136 (28.3%)\r",shape=box,fontsize=8.0];
> N9 [label="rocksdb\nCoreLocalArray\nCoreLocalArray\n0 (0.0%)\rof 807403520 (28.3%)\r",shape=box,fontsize=8.0];
> N10 [label="rocksdb\nStatisticsImpl\nStatisticsImpl\n0 (0.0%)\rof 807403520 (28.3%)\r",shape=box,fontsize=8.0];
> N11 [label="rocksdb\nStatisticsJni\nStatisticsJni\n0 (0.0%)\rof 807403520 (28.3%)\r",shape=box,fontsize=8.0];
> N12 [label="rocksdb\nport\ncacheline_aligned_alloc\n807403520 (28.3%)\r",shape=box,fontsize=34.6];
> N13 [label="0x00007fb7068b4ceb\n0 (0.0%)\rof 578879568 (20.3%)\r",shape=box,fontsize=8.0];
> {quote}
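As a quick cross-check of the "770MB" figure, the byte counts in the jeprof dot output can be converted directly. This is a hedged arithmetic sketch: the class `JeprofFigures` is hypothetical, and its constants are copied from the Legend (`Total B`) and from node N11 (`rocksdb StatisticsJni`).

```java
import java.util.Locale;

/**
 * Hypothetical cross-check of the jeprof figures: converts the bytes
 * attributed to rocksdb::StatisticsJni into MiB and recomputes that
 * node's share of the profiled total.
 */
public class JeprofFigures {

    static final long TOTAL_BYTES = 2_855_914_662L;        // "Total B" in the Legend
    static final long STATISTICS_JNI_BYTES = 807_403_520L; // node N11

    /** Bytes to mebibytes (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    /** Share of part in total, as a percentage. */
    static double percentOf(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        // Locale.ROOT keeps '.' as the decimal separator regardless of JVM locale.
        System.out.println(String.format(Locale.ROOT,
                "StatisticsJni: %.1f MiB (%.1f%% of total)",
                toMiB(STATISTICS_JNI_BYTES),
                percentOf(STATISTICS_JNI_BYTES, TOTAL_BYTES)));
        // prints "StatisticsJni: 770.0 MiB (28.3% of total)"
    }
}
```

The 807,403,520 bytes come out to exactly 770 MiB, so the "770MB" in the description is a binary (MiB) quantity, and the recomputed share matches the 28.3% shown on nodes N9 to N12.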



--
This message was sent by Atlassian Jira
(v8.20.10#820010)