[
https://issues.apache.org/jira/browse/FLINK-25373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504686#comment-17504686
]
ren shangtao commented on FLINK-25373:
--------------------------------------
Hi Spongebob, I have the same problem as you.
I ran a batch job several times on Flink 1.14.3 in standalone mode. The
Linux system then ran out of memory, and the kernel OOM killer ultimately
killed the TaskManager.
Mar 11 00:14:18 node10 kernel: dockerd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=-500
Mar 11 00:14:18 node10 kernel: dockerd cpuset=/ mems_allowed=0
Mar 11 00:14:18 node10 kernel: CPU: 28 PID: 4307 Comm: dockerd Kdump: loaded Tainted: G OE ------------ T 3.10.0-1160.31.1.el7.x86_64 #1
Mar 11 00:14:18 node10 kernel: Hardware name: Dell Inc. PowerEdge T440/021KCD, BIOS 2.8.2 08/31/2020
Mar 11 00:14:18 node10 kernel: Call Trace:
Mar 11 00:14:18 node10 kernel: [<ffffffff9dd835a9>] dump_stack+0x19/0x1b
Mar 11 00:14:18 node10 kernel: [<ffffffff9dd7e648>] dump_header+0x90/0x229
Mar 11 00:14:18 node10 kernel: [<ffffffff9d706492>] ? ktime_get_ts64+0x52/0xf0
Mar 11 00:14:18 node10 kernel: [<ffffffff9d75db1f>] ? delayacct_end+0x8f/0xb0
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7c204d>] oom_kill_process+0x2cd/0x490
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7c1a3d>] ? oom_unkillable_task+0xcd/0x120
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7c273a>] out_of_memory+0x31a/0x500
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7c9354>] __alloc_pages_nodemask+0xad4/0xbe0
Mar 11 00:14:18 node10 kernel: [<ffffffff9d818ea8>] alloc_pages_current+0x98/0x110
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7bdb07>] __page_cache_alloc+0x97/0xb0
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7c0aa0>] filemap_fault+0x270/0x420
Mar 11 00:14:18 node10 kernel: [<ffffffffc059691e>] __xfs_filemap_fault+0x7e/0x1d0 [xfs]
Mar 11 00:14:18 node10 kernel: [<ffffffffc0596b1c>] xfs_filemap_fault+0x2c/0x30 [xfs]
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7ee2aa>] __do_fault.isra.61+0x8a/0x100
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7ee85c>] do_read_fault.isra.63+0x4c/0x1b0
Mar 11 00:14:18 node10 kernel: [<ffffffff9d7f60a0>] handle_mm_fault+0xa20/0xfb0
Mar 11 00:14:18 node10 kernel: [<ffffffff9dd90653>] __do_page_fault+0x213/0x500
Mar 11 00:14:18 node10 kernel: [<ffffffff9dd90975>] do_page_fault+0x35/0x90
Mar 11 00:14:18 node10 kernel: [<ffffffff9dd8c778>] page_fault+0x28/0x30
Mar 11 00:14:18 node10 kernel: Mem-Info:
Mar 11 00:14:18 node10 kernel: active_anon:15929291 inactive_anon:2464 isolated_anon:0
 active_file:775 inactive_file:1147 isolated_file:99
 unevictable:0 dirty:38 writeback:859 unstable:0
 slab_reclaimable:21310 slab_unreclaimable:42445
 mapped:2114 shmem:2586 pagetables:45585 bounce:0
 free:82929 free_pcp:89 free_cma:0
Mar 11 00:14:18 node10 kernel: Node 0 DMA free:14840kB min:16kB low:20kB
high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB
managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB
unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 11 00:14:18 node10 kernel: lowmem_reserve[]: 0 1301 63712 63712
Mar 11 00:14:18 node10 kernel: Node 0 DMA32 free:250568kB min:1380kB low:1724kB
high:2068kB active_anon:1063892kB inactive_anon:488kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:1566272kB managed:1332768kB mlocked:0kB dirty:0kB writeback:0kB
mapped:0kB shmem:516kB slab_reclaimable:1776kB slab_unreclaimable:3808kB
kernel_stack:704kB pagetables:2924kB unstable:0kB bounce:0kB free_pcp:0kB
local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable?
yes
Mar 11 00:14:18 node10 kernel: lowmem_reserve[]: 0 0 62410 62410
Mar 11 00:14:18 node10 kernel: Node 0 Normal free:66308kB min:66184kB
low:82728kB high:99276kB active_anon:62653272kB inactive_anon:9368kB
active_file:3100kB inactive_file:4588kB unevictable:0kB isolated(anon):0kB
isolated(file):396kB present:65011712kB managed:63911352kB mlocked:0kB
dirty:152kB writeback:3436kB mapped:8456kB shmem:9828kB
slab_reclaimable:83464kB slab_unreclaimable:165940kB kernel_stack:23776kB
pagetables:179416kB unstable:0kB bounce:0kB free_pcp:352kB local_pcp:0kB
free_cma:0kB writeback_tmp:0kB pages_scanned:816 all_unreclaimable? no
Mar 11 00:14:18 node10 kernel: lowmem_reserve[]: 0 0 0 0
Mar 11 00:14:18 node10 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB
(U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB
(M) = 14840kB
Mar 11 00:14:18 node10 kernel: Node 0 DMA32: 507*4kB (UM) 360*8kB (UEM)
554*16kB (UEM) 508*32kB (UEM) 160*64kB (UEM) 91*128kB (UEM) 46*256kB (UEM)
13*512kB (UEM) 64*1024kB (UM) 4*2048kB (UEM) 26*4096kB (UM) = 250572kB
Mar 11 00:14:18 node10 kernel: Node 0 Normal: 1637*4kB (UEM) 5641*8kB (UEM)
1117*16kB (UEM) 1*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 69580kB
Mar 11 00:14:18 node10 kernel: Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
Mar 11 00:14:18 node10 kernel: Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
Mar 11 00:14:18 node10 kernel: 4889 total pagecache pages
Mar 11 00:14:18 node10 kernel: 0 pages in swap cache
Mar 11 00:14:18 node10 kernel: Swap cache stats: add 0, delete 0, find 0/0
Mar 11 00:14:18 node10 kernel: Free swap = 0kB
Mar 11 00:14:18 node10 kernel: Total swap = 0kB
Mar 11 00:14:18 node10 kernel: 16648491 pages RAM
Mar 11 00:14:18 node10 kernel: 0 pages HighMem/MovableOnly
Mar 11 00:14:18 node10 kernel: 333487 pages reserved
Mar 11 00:14:18 node10 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Mar 11 00:14:18 node10 kernel: [  782]     0   782     9764     1824      25        0             0 systemd-journal
Mar 11 00:14:18 node10 kernel: [  804]     0   804    68076      118      32        0             0 lvmetad
Mar 11 00:14:18 node10 kernel: [  819]     0   819    11383      161      24        0         -1000 systemd-udevd
Mar 11 00:14:18 node10 kernel: [ 1106]     0  1106     6596       80      18        0             0 systemd-logind
Mar 11 00:14:18 node10 kernel: [ 1108]     0  1108     5419       99      14        0             0 irqbalance
Mar 11 00:14:18 node10 kernel: [ 1111]    81  1111    14554      167      34        0          -900 dbus-daemon
Mar 11 00:14:18 node10 kernel: [ 1117]     0  1117   119115      530      83        0             0 NetworkManager
Mar 11 00:14:18 node10 kernel: [ 1122]   999  1122   153058     1605      61        0             0 polkitd
Mar 11 00:14:18 node10 kernel: [ 1456]     0  1456   852618     9016     133        0          -500 dockerd
Mar 11 00:14:18 node10 kernel: [ 1484]     0  1484   143571     3370      98        0             0 tuned
Mar 11 00:14:18 node10 kernel: [ 1487]     0  1487    28235      259      58        0         -1000 sshd
Mar 11 00:14:18 node10 kernel: [ 1490]     0  1490   179067     3494      23        0             0 nats-server
Mar 11 00:14:18 node10 kernel: [ 1493]     0  1493    54632     1332      43        0             0 rsyslogd
Mar 11 00:14:18 node10 kernel: [ 1753]     0  1753    31598      156      19        0             0 crond
Mar 11 00:14:18 node10 kernel: [ 1813]     0  1813    27552       34       9        0             0 agetty
Mar 11 00:14:18 node10 kernel: [ 1875]     0  1875   827410     6912     116        0          -500 containerd
Mar 11 00:14:18 node10 kernel: [18022]     0 18022    40514      430      81        0             0 sshd
Mar 11 00:14:18 node10 kernel: [18039]     0 18039    40429      347      81        0             0 sshd
Mar 11 00:14:18 node10 kernel: [18041]     0 18041    28920      132      14        0             0 bash
Mar 11 00:14:18 node10 kernel: [18216]     0 18216    18073      184      38        0             0 sftp-server
Mar 11 00:14:18 node10 kernel: [18811]     0 18811  4716599   121763     382        0             0 java
Mar 11 00:14:18 node10 kernel: [19181]     0 19181  4701076  1900016    3824        0             0 java
Mar 11 00:14:18 node10 kernel: [20204]     0 20204  4743884   111675     466        0             0 java
Mar 11 00:14:18 node10 kernel: [20779]     0 20779    40624      268      35        0             0 top
Mar 11 00:14:18 node10 kernel: [74904]     0 74904 23838477 13752532   39743        0             0 java
Mar 11 00:14:18 node10 kernel: [98890]    26 98890    28322       50      11        0             0 .systemd-privat
Mar 11 00:14:18 node10 kernel: [98894]    26 98894    28322       55      11        0             0 bash
Mar 11 00:14:18 node10 kernel: [99125]    26 99125    28322       54      11        0             0 bash
Mar 11 00:14:18 node10 kernel: [99243]    26 99243    63171      221      42        0             0 curl
Mar 11 00:14:18 node10 kernel: [99245]     0 99245    34790      169      25        0             0 crond
Mar 11 00:14:18 node10 kernel: Out of memory: Kill process 74904 (java) score 820 or sacrifice child
Mar 11 00:14:18 node10 kernel: Killed process 74904 (java), UID 0, total-vm:95353908kB, anon-rss:55010692kB, file-rss:448kB, shmem-rss:0kB
Mar 11 00:14:19 node10 systemd: Started Session 787 of user root.
Mar 11 00:14:21 node10 systemd-logind: Removed session 699.
Mar 11 00:15:01 node10 systemd: Started Session 788 of user root.
Mar 11 00:16:01 node10 systemd: Started Session 789 of user root.
Mar 11 00:17:01 node10 systemd: Started Session 790 of user root.
Mar 11 00:18:01 node10 systemd: Started Session 791 of user root.
Mar 11 00:19:01 node10 systemd: Started Session 792 of user root.
Mar 11 00:19:15 node10 systemd: Removed slice User Slice of postgres.
Mar 11 00:20:01 node10 systemd: Started Session 793 of user root.
Mar 11 00:20:01 node10 systemd: Started Session 794 of user root.
Mar 11 00:20:03 node10 crond: sendmail: fatal: parameter inet_interfaces: no local interface found for ::1
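
As a rough check on the process table in the log above, the rss column can be
converted from pages to GiB. This is only a sketch: it assumes a 4 KiB page
size (the usual x86_64 default, not stated in the log), and the helper name
rss_gib is made up for illustration. The log identifies pid 74904 only as
"java"; treating it as the TaskManager JVM is my reading, since that is the
process that was killed.

```python
PAGE_KB = 4  # assumption: 4 KiB pages, the x86_64 default


def rss_gib(rss_pages, page_kb=PAGE_KB):
    """Convert an rss value in pages (as printed in the OOM table) to GiB."""
    return rss_pages * page_kb / (1024 * 1024)


# (pid, rss in pages) for the java processes listed in the OOM table
java_procs = [
    (18811, 121763),
    (19181, 1900016),
    (20204, 111675),
    (74904, 13752532),  # the process the OOM killer selected
]

TOTAL_RAM_PAGES = 16648491  # "16648491 pages RAM" from the log

for pid, pages in java_procs:
    share = pages / TOTAL_RAM_PAGES
    print(f"pid {pid}: {rss_gib(pages):.2f} GiB ({share:.0%} of RAM)")
```

Under that assumption, pid 74904 alone holds about 52 GiB of resident memory
on a machine with about 63.5 GiB of RAM and no swap, which is consistent with
the anon-rss value in the "Killed process" line.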
> task manager can not free memory when jobs are finished
> -------------------------------------------------------
>
> Key: FLINK-25373
> URL: https://issues.apache.org/jira/browse/FLINK-25373
> Project: Flink
> Issue Type: Bug
> Components: API / Core
> Affects Versions: 1.14.0
> Environment: flink 1.14.0
> Reporter: Spongebob
> Priority: Major
> Attachments: image-2021-12-19-11-48-33-622.png,
> image-2022-03-11-10-06-19-499.png
>
>
> I submit my Flink SQL jobs to a Flink standalone cluster. Contrary to my
> expectation, the TaskManagers do not free memory after all jobs have
> finished, whether they finished normally or not.
> I also found many threads named like `flink-taskexecutor-io-thread-x`,
> all in the WAITING state on a condition.
> Here are the details of one of these threads:
>
> "flink-taskexecutor-io-thread-31" Id=5386 WAITING on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2da8b14c
> at sun.misc.Unsafe.park(Native Method)
> - waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2da8b14c
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
> at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> !image-2021-12-19-11-48-33-622.png!
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)