Alexander Gardner created FLINK-8707:

             Summary: Excessive amount of files opened by flink task manager
                 Key: FLINK-8707
             Project: Flink
          Issue Type: Bug
          Components: JobManager
    Affects Versions: 1.3.2
         Environment: NAME="Red Hat Enterprise Linux Server"
VERSION="7.3 (Maipo)"

Two boxes, each with a Job Manager & Task Manager, using Zookeeper for HA.

flink.yaml below with some settings (removed exact box names) etc:

env.log.dir: ...some dir...residing on the same box some dir...residing on the same box
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporters: jmx
state.backend: filesystem
state.backend.fs.checkpointdir: file:///some_nfs_mount
state.checkpoints.dir: file:///some_nfs_mount
state.checkpoints.num-retained: 3
high-availability.cluster-id: /tst
high-availability.storageDir: file:///some_nfs_mount/ha
high-availability: zookeeper
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.quorum: ...list of zookeeper boxes ...some extra jar args
jobmanager.archive.fs.dir: some dir...residing on the same box
jobmanager.web.submit.enable: true
jobmanager.web.tmpdir:  some dir...residing on the same box some extra jar args
taskmanager.tmp.dirs:  some dir...residing on the same box/var/tmp 1024MB 2048MB  some dir...residing on the same box
            Reporter: Alexander Gardner


The job manager has fewer FDs than the task manager.



A support alert indicated that there were a lot of open files for the boxes 
running Flink.

There were 4 flink jobs that were dormant but had consumed a number of msgs 
from Kafka using the FlinkKafkaConsumer010.

A simple general lsof:

$ lsof | wc -l       ->  returned 153114 open file descriptors.

Focusing on the TaskManager process (process ID = 12154):

$ lsof | grep 12154 | wc -l    -> returned 129322 open FDs

$ lsof -p 12154 | wc -l   -> returned 531 FDs

There were 228 threads running for the task manager.


Drilling down a bit further, looking at a_inode and FIFO entries: 

$ lsof -p 12154 | grep a_inode | wc -l = 100 FDs

$ lsof -p 12154 | grep FIFO | wc -l  = 200 FDs

/proc/12154/maps contained 920 entries.

Apart from lsof identifying lots of JARs and SOs being referenced there were 
also 244 child processes for the task manager process.
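
One possible reading of the gap between the two lsof counts, offered as an assumption rather than a confirmed diagnosis: the system-wide listing appears to print one row per thread (the sample rows in this report carry both a process ID and a second ID column), whereas `lsof -p` prints one row per process. Under that assumption the 531 rows would be repeated once per thread, and 531 x 244 = 129564 is close to the 129322 reported. The 531 rows also include memory-mapped JARs and SOs, not only descriptors. A minimal sketch that reads both counts directly from /proc (the function names are illustrative only):

```shell
#!/bin/sh
# Count the file descriptors a process actually holds.
# /proc/<pid>/fd contains exactly one entry per open descriptor.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Count the threads (tasks) belonging to a process.
# /proc/<pid>/task contains one directory per thread.
count_threads() {
    ls "/proc/$1/task" 2>/dev/null | wc -l
}

# Default to the current shell so the script can be tried without arguments.
pid="${1:-$$}"
echo "pid=$pid fds=$(count_fds "$pid") threads=$(count_threads "$pid")"
```

Run with the TaskManager process ID as the only argument (12154 in this report). If the fds figure stays in the hundreds, the six-digit count is more likely a listing artifact than a leak.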

We have noticed a creep of file descriptors in each environment. Are the above 
figures deemed excessive for the number of FDs in use? I know Flink uses Netty; 
is it using a separate Selector for reads and writes? 

Additionally, does Flink use memory-mapped files or direct ByteBuffers, and are 
these skewing the number of FDs shown?

Example of one child process ID 6633:

java 12154 6633 dfdev 387u a_inode 0,9 0 5869 [eventpoll]
java 12154 6633 dfdev 388r FIFO 0,8 0t0 459758080 pipe
java 12154 6633 dfdev 389w FIFO 0,8 0t0 459758080 pipe
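
The three rows above (one eventpoll entry followed by the read and write ends of the same pipe) look like the footprint of a single epoll-based Java NIO selector, which would also fit the 1:2 ratio of 100 a_inode to 200 FIFO entries. That is an assumption drawn from the numbers, not something confirmed against the Flink or Netty code. To see the breakdown per descriptor type without lsof, a sketch along these lines could be used (the function name and the category labels are illustrative):

```shell
#!/bin/sh
# Summarise the open descriptors of a process by kind, using the
# symlink targets that Linux exposes under /proc/<pid>/fd.
fd_type_summary() {
    for fd in /proc/"$1"/fd/*; do
        # A descriptor may close between the glob and the readlink; skip it.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            'anon_inode:[eventpoll]') echo eventpoll ;;
            anon_inode:*)             echo other_anon_inode ;;
            pipe:*)                   echo pipe ;;
            socket:*)                 echo socket ;;
            *.jar)                    echo jar ;;
            *)                        echo other ;;
        esac
    done | sort | uniq -c | sort -rn
}

# Default to the current shell so the script can be tried without arguments.
fd_type_summary "${1:-$$}"
```

Comparing two summaries taken some hours apart would show which category is responsible for the growth.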

Lastly, we cannot yet identify the reason for the creep in FDs, even though 
Flink is pretty dormant or has dormant jobs. Production nodes are not yet 
experiencing excessive amounts of throughput either.
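
To put numbers on the creep, the descriptor count could be sampled at a fixed interval and logged. A rough sketch, assuming a Linux /proc filesystem (the function name, sample count and interval are illustrative choices):

```shell
#!/bin/sh
# Print "<timestamp> <fd count>" for a process a fixed number of times.
# Usage: sample_fds <pid> [samples] [interval_seconds]
sample_fds() {
    pid="$1"
    samples="${2:-5}"
    interval="${3:-60}"
    i=0
    while [ "$i" -lt "$samples" ]; do
        if [ ! -d "/proc/$pid/fd" ]; then
            echo "process $pid not found" >&2
            return 1
        fi
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" \
            "$(ls "/proc/$pid/fd" | wc -l)"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `sample_fds 12154 24 3600 >> fd_creep.log` would record one reading per hour for a day, which is enough to tell steady growth from a stable plateau.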




This message was sent by Atlassian JIRA
