Hi,
I have two questions about flink client in interactive mode. One is for yarn-session.sh, once the session CLI can't get cluster stauts (jobmanager is down), it will try to shutdown the cluster and cleanup related files even if new jobmanager will be created soon. As result, yarn will fail to start a new jobmanager due to missing files on HDFS. As a workround, I can config `akka.lookup.timeout` to wait a bit longer, say 60 seconds. But I'm wondering if it will affect other components. Second is about flink cli. If cluster is down after submiting job using 'flink run xx.jar', cli hangs there only showing "New JobManager elected. Connecting to null " instead of cleanup and close itself. After some digging, I found the main logic is in JobClientActor. It receives jobmanager status changes from two sources: zookeeper and akka deathwatch. It would terminate itself once receiving message `ConnectionTimeout`. Client sets current leaderSessionId and unwatch previous jobmanager from ZK; it receives `Teminated` of previous jobmanager from akka deathwatch and send `ConnectionTimeout` to itself after 60s. In a great chance, they would interfere with each other. Situation1: 1. client get notified from zk, set leaderSessionId to null 2. client unwatch previous jobmanager 3. msg `Teminated` of previous jobmanager never got received Situation 2: 1. msg `Teminated` of current jobmanager is received 2. schedule msg ConnectionTimeout after 60s 3. client get notified from zk, set `leaderSessionId` to null in less than 60s 4. `ConnectionTimeout` will be filtered out due to different `leaderSessionId` Both of the two problems only happen in interactive mode, not in detached mode. I wonder if it's issues for interactive mode, or we should only use detached mode in production environment? Any insight is appreciated. Thanks, Yelei