[
https://issues.apache.org/jira/browse/FLINK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
liuchenhong updated FLINK-26602:
--------------------------------
Attachment: [email protected]
[email protected]
> The Rocksdb task failed savepoint, and then checkpoint failed several times
> ---------------------------------------------------------------------------
>
> Key: FLINK-26602
> URL: https://issues.apache.org/jira/browse/FLINK-26602
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.11.2
> Reporter: liuchenhong
> Priority: Minor
> Attachments: [email protected], [email protected]
>
>
> The RocksDB task failed a savepoint (2022-03-10 19:55:**), and then checkpoints
> failed several times (2022-03-11). The savepoint failed because of an
> out-of-memory error. But I would like to know why the checkpoints failed and why
> the container went "beyond physical memory limits". I checked the amount of data
> from the sources and found nothing abnormal. Could it be that the savepoint
> failed but its memory was never freed?
> {code:java}
> // code placeholder
> job manager log
> 2022-03-11 00:58:24,891 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 108412 (type=CHECKPOINT) @ 1646931504738 for job
> d90b4aca73c5802e0dbbd50ca8af97e0.
> 2022-03-11 00:58:27,605 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 108412 for job d90b4aca73c5802e0dbbd50ca8af97e0 (9815989304 bytes
> in 2801 ms).
> 2022-03-11 01:00:06,603 INFO org.apache.flink.yarn.YarnResourceManager
> [] - Closing TaskExecutor connection
> container_e06_1603181034156_0493_01_000023 because: Container
> [pid=177263,containerID=container_e06_1603181034156_0493_01_000023] is
> running beyond physical memory limits. Current usage: 12.0 GB of 12 GB
> physical memory used; 14.3 GB of 25.2 GB virtual memory used. Killing
> container.
> Dump of the process-tree for container_e06_1603181034156_0493_01_000023 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 177263 177261 177263 177263 (bash) 2 2 116015104 357 /bin/bash -c
> /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756
> -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456
> -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -Xloggc:/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/jobmanager-gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M
> -Dlog4j2.formatMsgNoLookups=true
> -Dlog.file=/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> -Dlog4j.configurationFile=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner -D
> taskmanager.memory.framework.off-heap.size=134217728b -D
> taskmanager.memory.network.max=1073741824b -D
> taskmanager.memory.network.min=1073741824b -D
> taskmanager.memory.framework.heap.size=134217728b -D
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D
> taskmanager.memory.task.heap.size=2652142028b -D
> taskmanager.memory.task.off-heap.size=536870912b --configDir .
> -Djobmanager.rpc.address=''
> -Dweb.tmpdir='/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc'
> -Dsecurity.kerberos.login.keytab='/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab'
> -Dweb.port='0' -Djobmanager.rpc.port='41239' -Drest.address='' 1>
> /mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.out
> 2>
> /mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.err
>
> |- 177416 177263 177263 177263 (java) 484303004 122930506 15252447232
> 3145560 /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756
> -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456
> -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -Xloggc:/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/jobmanager-gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M
> -Dlog4j2.formatMsgNoLookups=true
> -Dlog.file=/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> -Dlog4j.configurationFile=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner -D
> taskmanager.memory.framework.off-heap.size=134217728b -D
> taskmanager.memory.network.max=1073741824b -D
> taskmanager.memory.network.min=1073741824b -D
> taskmanager.memory.framework.heap.size=134217728b -D
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D
> taskmanager.memory.task.heap.size=2652142028b -D
> taskmanager.memory.task.off-heap.size=536870912b --configDir .
> -Djobmanager.rpc.address=
> -Dweb.tmpdir=/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc
> -Dsecurity.kerberos.login.keytab=/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab
> -Dweb.port=0 -Djobmanager.rpc.port=41239 -Drest.address{code}
> {code:java}
> // job manager log
> 2022-03-11 07:04:54,253 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 108594 (type=CHECKPOINT) @ 1646953494183 for job
> d90b4aca73c5802e0dbbd50ca8af97e0.
> 2022-03-11 07:04:55,334 INFO
> org.apache.flink.yarn.YarnResourceManager [] - Closing
> TaskExecutor connection container_e06_1603181034156_0493_01_000021 because:
> Container [pid=17068,containerID=container_e06_1603181034156_0493_01_000021]
> is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB
> physical memory used; 14.2 GB of 25.2 GB virtual memory used. Killing
> container.
> Dump of the process-tree for container_e06_1603181034156_0493_01_000021 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 17068 17061 17068 17068 (bash) 1 2
> 116015104 356 /bin/bash -c /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756
> -Xms2786359756 -XX:MaxDirectMemorySize=1744830464
> -XX:MaxMetaspaceSize=268435456 -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps
> -Xloggc:/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/jobmanager-gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M
> -Dlog4j2.formatMsgNoLookups=true
> -Dlog.file=/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> -Dlog4j.configurationFile=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner -D
> taskmanager.memory.framework.off-heap.size=134217728b -D
> taskmanager.memory.network.max=1073741824b -D
> taskmanager.memory.network.min=1073741824b -D
> taskmanager.memory.framework.heap.size=134217728b -D
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D
> taskmanager.memory.task.heap.size=2652142028b -D
> taskmanager.memory.task.off-heap.size=536870912b --configDir .
> -Djobmanager.rpc.address=''
> -Dweb.tmpdir='/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc'
> -Dsecurity.kerberos.login.keytab='/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab'
> -Dweb.port='0' -Djobmanager.rpc.port='41239' -Drest.address='' 1>
> /mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.out
> 2>
> /mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.err
> |- 17442 17068 17068 17068 (java) 476051309 120830693 15178711040
> 3145582 /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756
> -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456
> -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps
> -Xloggc:/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/jobmanager-gc.log
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M
> -Dlog4j2.formatMsgNoLookups=true
> -Dlog.file=/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> -Dlog4j.configurationFile=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner -D
> taskmanager.memory.framework.off-heap.size=134217728b -D
> taskmanager.memory.network.max=1073741824b -D
> taskmanager.memory.network.min=1073741824b -D
> taskmanager.memory.framework.heap.size=134217728b -D
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D
> taskmanager.memory.task.heap.size=2652142028b -D
> taskmanager.memory.task.off-heap.size=536870912b --configDir .
> -Djobmanager.rpc.address=
> -Dweb.tmpdir=/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc
> -Dsecurity.kerberos.login.keytab=/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab
> -Dweb.port=0 -Djobmanager.rpc.port=41239 -Drest.address=
>
> {code}
>
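As a rough cross-check of the numbers in the first process-tree dump above, the sketch below adds up the memory limits on the java command line and compares them with the 12 GB container limit. It is only an illustration: the 4 KiB page size and the reading of "12 GB" as 12 GiB are assumptions, and all variable names are made up for this example. It does not show where the extra memory went; it only shows how little room there is between the configured budget and the limit that YARN enforces.

```python
# Values copied from the process-tree dump for container ..._01_000023.
GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual Linux x86-64 default

# JVM flags and Flink options as they appear on the java command line.
xmx = 2786359756            # -Xmx
max_direct = 1744830464     # -XX:MaxDirectMemorySize
max_metaspace = 268435456   # -XX:MaxMetaspaceSize
managed = 6796786004        # taskmanager.memory.managed.size
framework_heap = 134217728  # taskmanager.memory.framework.heap.size
task_heap = 2652142028      # taskmanager.memory.task.heap.size
framework_off_heap = 134217728  # taskmanager.memory.framework.off-heap.size
task_off_heap = 536870912       # taskmanager.memory.task.off-heap.size
network = 1073741824            # taskmanager.memory.network.min/max

container_limit = 12 * GIB  # assumption: "12 GB" in the log means 12 GiB

# The JVM flags are consistent with the Flink options they are derived from.
heap_matches = framework_heap + task_heap == xmx
direct_matches = framework_off_heap + task_off_heap + network == max_direct

# Memory that is explicitly budgeted, and what is left for everything else
# (thread stacks, code cache, native allocations outside the managed budget).
budgeted = xmx + max_direct + max_metaspace + managed
headroom = container_limit - budgeted

# Resident set size of the whole process tree, from RSSMEM_USAGE(PAGES).
rss_pages = {"bash": 357, "java": 3145560}
tree_rss = sum(rss_pages.values()) * PAGE_SIZE
over_limit = tree_rss - container_limit

print(f"budgeted:  {budgeted / GIB:.2f} GiB")
print(f"headroom:  {headroom / GIB:.2f} GiB")
print(f"tree RSS:  {tree_rss / GIB:.4f} GiB")
print(f"over limit by {over_limit} bytes")
```

Under these assumptions the explicitly budgeted pools leave about 1.2 GiB of headroom, and the resident memory of the process tree ends up just above the container limit, which matches the "12.0 GB of 12 GB physical memory used" message. Any native allocation that is not covered by the managed memory budget has to fit into that headroom.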
--
This message was sent by Atlassian Jira
(v8.20.1#820001)