[ https://issues.apache.org/jira/browse/FLINK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372919#comment-16372919 ]
Piotr Nowojski edited comment on FLINK-8707 at 2/22/18 3:14 PM:
----------------------------------------------------------------

If you take a look at the attached lsof results, roughly 57% of the entries (8294 of 14674) are regular files:
{noformat}
cat box2-taskmgr-lsof | cut -c40-55 | sort -n | uniq -c
    406 CHR
    116 DIR
   8294 REG
   3596 FIFO
    348 IPv6
    116 unix 0xffff
   1798 a_inode
{noformat}
and those files each appear 116 times:
{noformat}
    116 /opt/app/wily/agent/Agent.jar
    116 /opt/app/wily/agent/core/ext/AppMap.jar
    116 /opt/app/wily/agent/core/ext/BasicDirectiveLoader.jar
    116 /opt/app/wily/agent/core/ext/BizDef.jar
    116 /opt/app/wily/agent/core/ext/BizTrxHttp.jar
    116 /opt/app/wily/agent/core/ext/ChangeDetector-Agent_Server.jar
    116 /opt/app/wily/agent/core/ext/ChangeDetector-CommonAll.jar
    116 /opt/app/wily/agent/core/ext/ChangeDetectorAgent.jar
    116 /opt/app/wily/agent/core/ext/DynInstrBootstrap.jar
    116 /opt/app/wily/agent/core/ext/DynInstrSupport15.jar
    116 /opt/app/wily/agent/core/ext/GCMonitor.jar
    116 /opt/app/wily/agent/core/ext/HPC-GcMonitorAgent.jar
    116 /opt/app/wily/agent/core/ext/Inheritance.jar
    116 /opt/app/wily/agent/core/ext/Java15DynamicInstrumentation.jar
    116 /opt/app/wily/agent/core/ext/LeakHunter.jar
    116 /opt/app/wily/agent/core/ext/ProbeBuilder.jar
    116 /opt/app/wily/agent/core/ext/RegexNormalizerExtension.jar
    116 /opt/app/wily/agent/core/ext/SQLAgent.jar
    116 /opt/app/wily/agent/core/ext/ServletHeaderDecorator.jar
    116 /opt/app/wily/agent/core/ext/ServletHelper.jar
    116 /opt/app/wily/agent/core/ext/Supportability-Agent.jar
    116 /opt/app/wily/agent/core/ext/ThreadDumpGen.jar
    116 /opt/app/wily/agent/core/ext/TomcatMonitoring.jar
    116 /opt/app/wily/agent/core/ext/WebAppSupport.jar
    116 /opt/app/wily/agent/core/ext/introscopeAIXPSeries32Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeAIXPSeries64Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeHpuxItanium32Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeHpuxItanium64Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeHpuxParisc32Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeHpuxParisc64Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeLinuxIntelAmd32Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeLinuxIntelAmd64Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeSolarisAmd32Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeSolarisAmd64Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeSolarisSparc32Stats.jar
    116 /opt/app/wily/agent/core/ext/introscopeSolarisSparc64Stats.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-dist_2.11-1.3.2.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-python_2.11-1.3.2.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/flink-shaded-hadoop2-uber-1.3.2.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/log4j-over-slf4j-1.7.25.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/logback-classic-1.2.3.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/logback-core-1.2.3.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/slf4j-api-1.7.25.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.21-20171130.111758-2.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.27-20171205.110224-2.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.28.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.30.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.32.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.33.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/flink-1.3.2/lib/xxxxx-flink-job-monitor-1.35.jar
    116 /opt/app/xxxxx/dev/pkgs/flink/var/log/flink-flinkuser-taskmanager-0-box2.out
    116 /usr/java/jdk1.8.0_131/jre/lib/ext/sunec.jar
    116 /usr/java/jdk1.8.0_131/jre/lib/ext/sunpkcs11.jar
    116 /usr/java/jdk1.8.0_131/jre/lib/jce.jar
    116 /usr/java/jdk1.8.0_131/jre/lib/jsse.jar
    116 /usr/java/jdk1.8.0_131/jre/lib/resources.jar
    116 /usr/java/jdk1.8.0_131/jre/lib/rt.jar
{noformat}

> Excessive amount of files opened by flink task manager
> ------------------------------------------------------
>
>                 Key: FLINK-8707
>                 URL: https://issues.apache.org/jira/browse/FLINK-8707
>             Project: Flink
>          Issue Type: Bug
>          Components: TaskManager
>    Affects Versions: 1.3.2
>         Environment: NAME="Red Hat Enterprise Linux Server"
> VERSION="7.3 (Maipo)"
> Two boxes, each with a Job Manager & Task Manager, using Zookeeper for HA.
> flink.yaml below with some settings (removed exact box names) etc:
> env.log.dir: ...some dir...residing on the same box
> env.pid.dir: some dir...residing on the same box
> metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
> metrics.reporters: jmx
> state.backend: filesystem
> state.backend.fs.checkpointdir: file:///some_nfs_mount
> state.checkpoints.dir: file:///some_nfs_mount
> state.checkpoints.num-retained: 3
> high-availability.cluster-id: /tst
> high-availability.storageDir: file:///some_nfs_mount/ha
> high-availability: zookeeper
> high-availability.zookeeper.path.root: /flink
> high-availability.zookeeper.quorum: ...list of zookeeper boxes
> env.java.opts.jobmanager: ...some extra jar args
> jobmanager.archive.fs.dir: some dir...residing on the same box
> jobmanager.web.submit.enable: true
> jobmanager.web.tmpdir: some dir...residing on the same box
> env.java.opts.taskmanager: some extra jar args
> taskmanager.tmp.dirs: some dir...residing on the same box/var/tmp
> taskmanager.network.memory.min: 1024MB
> taskmanager.network.memory.max: 2048MB
> blob.storage.directory: some dir...residing on the same box
>            Reporter: Alexander Gardner
>            Priority: Blocker
>             Fix For: 1.5.0
>
>         Attachments: box1-jobmgr-lsof, box1-taskmgr-lsof, box2-jobmgr-lsof,
> box2-taskmgr-lsof
>
>
> The job manager has fewer FDs than the task manager.
>
> Hi
> A support alert indicated that there were a lot of open files for the boxes
> running Flink.
> There were 4 flink jobs that were dormant but had consumed a number of msgs
> from Kafka using the FlinkKafkaConsumer010.
> A simple general lsof:
> $ lsof | wc -l -> returned 153114 open file descriptors.
> Focusing on the TaskManager process (process ID = 12154):
> $ lsof | grep 12154 | wc -l -> returned 129322 open FDs
> $ lsof -p 12154 | wc -l -> returned 531 FDs
> There were 228 threads running for the task manager.
>
> Drilling down a bit further, looking at a_inode and FIFO entries:
> $ lsof -p 12154 | grep a_inode | wc -l = 100 FDs
> $ lsof -p 12154 | grep FIFO | wc -l = 200 FDs
> $ /proc/12154/maps = 920 entries.
> Apart from lsof identifying lots of JARs and SOs being referenced, there were
> also 244 child processes for the task manager process.
> Noticed a creep of file descriptors in each environment. Are the above
> figures deemed excessive for the no of FDs in use? I know Flink uses Netty -
> is it using a separate Selector for reads & writes?
> Additionally, does Flink use memory-mapped files or direct ByteBuffers, and
> are these skewing the number of FDs shown?
> Example of one child process ID 6633:
> java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]
> java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe
> java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe
> Lastly, I cannot yet identify the reason for the creep in FDs even if Flink is
> pretty dormant or has dormant jobs. Production nodes are not experiencing
> excessive amounts of throughput yet either.
>
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
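
For anyone re-checking the attached listings: a rough way to test whether the repeated entries are the same descriptors reported once per thread is to count raw rows and distinct descriptors separately. The sketch below is illustrative only and is not something the thread confirms as the cause. It assumes the 10-column layout of the sample rows quoted in the issue (COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME) and skips rows in any other layout; the function name `summarize_lsof` and the second thread ID 6634 in the sample are made up for the illustration.

```python
from collections import Counter


def summarize_lsof(lines):
    """Tally lsof rows by TYPE, before and after collapsing per-thread repeats.

    Assumes 10 whitespace-separated columns:
    COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    Header rows and rows without a numeric PID and TID are skipped.
    """
    rows_by_type = Counter()
    distinct = set()
    for line in lines:
        fields = line.split(None, 9)  # keep NAME intact even if it has spaces
        if len(fields) < 10 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue  # layout not handled by this sketch
        _cmd, pid, _tid, _user, fd, ftype, _dev, _size, node, name = fields
        rows_by_type[ftype] += 1
        # The same (pid, fd, node, name) seen under another TID is one descriptor.
        distinct.add((pid, fd, ftype, node, name.strip()))
    distinct_by_type = Counter(key[2] for key in distinct)
    return rows_by_type, distinct_by_type


if __name__ == "__main__":
    sample = [
        "COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME",
        "java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]",
        "java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe",
        "java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe",
        # Hypothetical second thread reporting the same three descriptors.
        "java 12154 6634 dfdev 387u a_inode 0,9 0 5869 [eventpoll]",
        "java 12154 6634 dfdev 388r FIFO 0,8 0t0 459758080 pipe",
        "java 12154 6634 dfdev 389w FIFO 0,8 0t0 459758080 pipe",
    ]
    rows, unique = summarize_lsof(sample)
    print(rows["FIFO"], unique["FIFO"])  # 4 2
```

If the distinct counts for a real listing come out far below the raw row counts, the raw totals overstate the number of descriptors the process actually holds.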