[ 
https://issues.apache.org/jira/browse/FLINK-26602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liuchenhong updated FLINK-26602:
--------------------------------
    Attachment: [email protected]
                [email protected]

> The Rocksdb task failed savepoint, and then checkpoint failed several times
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-26602
>                 URL: https://issues.apache.org/jira/browse/FLINK-26602
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.11.2
>            Reporter: liuchenhong
>            Priority: Minor
>         Attachments: [email protected], [email protected]
>
>
> The RocksDB task failed a savepoint (2022-03-10 19:55:**), and then checkpoints 
> failed several times (2022-03-11). The savepoint failed because of an 
> out-of-memory error. But I'd like to know why the checkpoints fail and why the 
> container goes "beyond physical memory limits". I checked the number of data 
> sources and there was no exception. Could it be that the savepoint failed, but 
> its memory was never freed?
> {code:java}
> // job manager log
> 2022-03-11 00:58:24,891 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 108412 (type=CHECKPOINT) @ 1646931504738 for job 
> d90b4aca73c5802e0dbbd50ca8af97e0.
> 2022-03-11 00:58:27,605 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed 
> checkpoint 108412 for job d90b4aca73c5802e0dbbd50ca8af97e0 (9815989304 bytes 
> in 2801 ms).
> 2022-03-11 01:00:06,603 INFO  org.apache.flink.yarn.YarnResourceManager       
>              [] - Closing TaskExecutor connection 
> container_e06_1603181034156_0493_01_000023 because: Container 
> [pid=177263,containerID=container_e06_1603181034156_0493_01_000023] is 
> running beyond physical memory limits. Current usage: 12.0 GB of 12 GB 
> physical memory used; 14.3 GB of 25.2 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_e06_1603181034156_0493_01_000023 :
>     |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>     |- 177263 177261 177263 177263 (bash) 2 2 116015104 357 /bin/bash -c 
> /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 
> -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 
> -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -Xloggc:/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/jobmanager-gc.log
>  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M 
> -Dlog4j2.formatMsgNoLookups=true 
> -Dlog.file=/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> -Dlog4j.configurationFile=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.network.max=1073741824b -D 
> taskmanager.memory.network.min=1073741824b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=2652142028b -D 
> taskmanager.memory.task.off-heap.size=536870912b --configDir . 
> -Djobmanager.rpc.address='' 
> -Dweb.tmpdir='/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc' 
> -Dsecurity.kerberos.login.keytab='/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab'
>  -Dweb.port='0' -Djobmanager.rpc.port='41239' -Drest.address='' 1> 
> /mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.out
>  2> 
> /mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.err
>  
>     |- 177416 177263 177263 177263 (java) 484303004 122930506 15252447232 
> 3145560 /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 
> -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 
> -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -Xloggc:/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/jobmanager-gc.log
>  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M 
> -Dlog4j2.formatMsgNoLookups=true 
> -Dlog.file=/mnt/ssd/3/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000023/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> -Dlog4j.configurationFile=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.network.max=1073741824b -D 
> taskmanager.memory.network.min=1073741824b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=2652142028b -D 
> taskmanager.memory.task.off-heap.size=536870912b --configDir . 
> -Djobmanager.rpc.address= 
> -Dweb.tmpdir=/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc 
> -Dsecurity.kerberos.login.keytab=/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab
>  -Dweb.port=0 -Djobmanager.rpc.port=41239 -Drest.address
> {code}
> {code:java}
> // job manager log
> 2022-03-11 07:04:54,253 INFO  
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering 
> checkpoint 108594 (type=CHECKPOINT) @ 1646953494183 for job 
> d90b4aca73c5802e0dbbd50ca8af97e0.
> 2022-03-11 07:04:55,334 INFO  
> org.apache.flink.yarn.YarnResourceManager                    [] - Closing 
> TaskExecutor connection container_e06_1603181034156_0493_01_000021 because: 
> Container [pid=17068,containerID=container_e06_1603181034156_0493_01_000021] 
> is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB 
> physical memory used; 14.2 GB of 25.2 GB virtual memory used. Killing 
> container.
> Dump of the process-tree for container_e06_1603181034156_0493_01_000021 :
>     |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>     |- 17068 17061 17068 17068 (bash) 1 2 
> 116015104 356 /bin/bash -c /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 
> -Xms2786359756 -XX:MaxDirectMemorySize=1744830464 
> -XX:MaxMetaspaceSize=268435456 -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails 
> -XX:+PrintGCDateStamps 
> -Xloggc:/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/jobmanager-gc.log
>  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M 
> -Dlog4j2.formatMsgNoLookups=true 
> -Dlog.file=/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> -Dlog4j.configurationFile=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.network.max=1073741824b -D 
> taskmanager.memory.network.min=1073741824b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=2652142028b -D 
> taskmanager.memory.task.off-heap.size=536870912b --configDir . 
> -Djobmanager.rpc.address='' 
> -Dweb.tmpdir='/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc' 
> -Dsecurity.kerberos.login.keytab='/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab'
>  -Dweb.port='0' -Djobmanager.rpc.port='41239' -Drest.address='' 1> 
> /mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.out
>  2> 
> /mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.err
>       |- 17442 17068 17068 17068 (java) 476051309 120830693 15178711040 
> 3145582 /usr/jdk64/jdk1.8.0_152/bin/java -Xmx2786359756 -Xms2786359756 
> -XX:MaxDirectMemorySize=1744830464 -XX:MaxMetaspaceSize=268435456 
> -XX:+UseG1GC -XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
> -Xloggc:/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/jobmanager-gc.log
>  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=1 -XX:GCLogFileSize=128M 
> -Dlog4j2.formatMsgNoLookups=true 
> -Dlog.file=/mnt/ssd/1/yarn/log/application_1603181034156_0493/container_e06_1603181034156_0493_01_000021/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> -Dlog4j.configurationFile=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.network.max=1073741824b -D 
> taskmanager.memory.network.min=1073741824b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=6796786004b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=2652142028b -D 
> taskmanager.memory.task.off-heap.size=536870912b --configDir . 
> -Djobmanager.rpc.address= 
> -Dweb.tmpdir=/tmp/flink-web-cd3b923f-86f9-463c-9524-40f357bd9afc 
> -Dsecurity.kerberos.login.keytab=/mnt/ssd/8/yarn/local/usercache/portal/appcache/application_1603181034156_0493/container_e06_1603181034156_0493_01_000001/krb5.keytab
>  -Dweb.port=0 -Djobmanager.rpc.port=41239 -Drest.address= 
>  
> {code}
>  
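
The figures in the first container dump (container_e06_1603181034156_0493_01_000023) can be cross-checked with a little arithmetic. The sketch below is only a reading aid, not a diagnosis. It assumes that "12 GB" in the YARN message means 12 GiB and that the page size is 4 KiB; neither is stated in the log. All other numbers are copied from the -D flags, the JVM options, and the RSSMEM_USAGE(PAGES) column above.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumption: 4 KiB pages (typical on x86-64 Linux)

# Assumption: "12 GB" in the YARN message means 12 GiB.
CONTAINER_LIMIT = 12 * GIB

# taskmanager.memory.* values from the -D flags in the dump, in bytes.
FLINK_COMPONENTS = {
    "framework.heap": 134217728,
    "task.heap": 2652142028,
    "framework.off-heap": 134217728,
    "task.off-heap": 536870912,
    "network": 1073741824,
    "managed": 6796786004,
}
MAX_METASPACE = 268435456  # -XX:MaxMetaspaceSize

# RSSMEM_USAGE(PAGES) column of the process tree (bash wrapper + java).
RSS_PAGES = {"bash": 357, "java": 3145560}

# Consistency checks against the JVM options in the same command line.
heap = FLINK_COMPONENTS["framework.heap"] + FLINK_COMPONENTS["task.heap"]
direct = (
    FLINK_COMPONENTS["framework.off-heap"]
    + FLINK_COMPONENTS["task.off-heap"]
    + FLINK_COMPONENTS["network"]
)
assert heap == 2786359756    # equals -Xmx / -Xms
assert direct == 1744830464  # equals -XX:MaxDirectMemorySize

# Memory that is explicitly configured, and what is left of the container.
configured = sum(FLINK_COMPONENTS.values()) + MAX_METASPACE
headroom = CONTAINER_LIMIT - configured

# Resident memory of the whole process tree, which is what YARN compares
# against the container limit.
tree_rss = sum(RSS_PAGES.values()) * PAGE_SIZE
excess = tree_rss - CONTAINER_LIMIT

print(f"configured budget : {configured / GIB:.2f} GiB")
print(f"remaining headroom: {headroom / GIB:.2f} GiB")
print(f"process-tree RSS  : {tree_rss / GIB:.4f} GiB")
print(f"over the limit by : {excess} bytes")
```

Under these assumptions the configured components fit inside the container with a little over 1 GiB to spare, yet the resident set of the process tree ends up slightly above the limit. That would mean something outside the explicitly sized components (for example native allocations, which may include RocksDB usage beyond the managed-memory budget) consumed the remaining headroom. The dump alone does not show which part grew.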



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
