[
https://issues.apache.org/jira/browse/FLINK-12437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836032#comment-16836032
]
Abdul Qadeer commented on FLINK-12437:
--------------------------------------
[~rmetzger] I don't expect a fix to be made available for this in 1.4.0. I
would like to know if this is a known issue fixed in newer versions.
> Taskmanager doesn't initiate registration after jobmanager marks it terminated
> ------------------------------------------------------------------------------
>
> Key: FLINK-12437
> URL: https://issues.apache.org/jira/browse/FLINK-12437
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Reporter: Abdul Qadeer
> Priority: Major
>
> This issue is observed in Standalone cluster deployment mode with ZooKeeper
> HA enabled in Flink 1.4.0. A few taskmanagers restarted after running out of
> Metaspace.
> The offending taskmanager `pipelineruntime-taskmgr-6789dd578b-dcp4r` first
> successfully registers with the jobmanager, and the remote watcher marks it
> terminated soon after, as seen in the logs. Other taskmanagers were also
> terminated around the same time, but they had been quarantined by the
> jobmanager with a message similar to:
> {noformat}
> Association to [akka.tcp://[email protected]:8070] having UID [864976677] is
> irrecoverably failed. UID is now quarantined and all messages to this UID
> will be delivered to dead letters. Remote actorsystem must be restarted to
> recover from this situation.
> {noformat}
> They came back up and successfully registered with the jobmanager. This
> didn't happen for the offending taskmanager:
>
> At JobManager:
> {noformat}
> {"timeMillis":1557073368155,"thread":"flink-akka.actor.default-dispatcher-49","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Registered
> TaskManager at pipelineruntime-taskmgr-6789dd578b-dcp4r
> (akka.tcp://flink@10.60.5.85:8070/user/taskmanager) as
> ae61ac607f0ab35ab5066f7dc221e654. Current number of registered hosts is 8.
> Current number of alive task slots is
> 51.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":125,"threadPriority":5}
> ...
> ...
> {"timeMillis":1557073391386,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered
> task manager /10.60.5.85. Number of registered task managers 7. Number of
> available slots
> 45.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
> ...
> ...
> {"timeMillis":1557073391483,"thread":"flink-akka.actor.default-dispatcher-82","level":"INFO","loggerName":"org.apache.flink.runtime.instance.InstanceManager","message":"Unregistered
> task manager /10.60.5.85. Number of registered task managers 6. Number of
> available slots
> 39.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":159,"threadPriority":5}
> ...
> ...
> {"timeMillis":1557073370389,"thread":"flink-akka.actor.default-dispatcher-35","level":"INFO","loggerName":"akka.actor.LocalActorRef","message":"Message
> [akka.remote.ReliableDeliverySupervisor$Ungate$] from
> Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260]
> to
> Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.60.5.85%3A8070-3#1863607260]
> was not delivered. [22] dead letters encountered. This logging can be turned
> off or adjusted with configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":98,"threadPriority":5}
> {noformat}
> At TaskManager:
> {noformat}
> {"timeMillis":1557073366068,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
>
> TaskManager","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073366073,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
> TaskManager actor system at
> 10.60.5.85:8070.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073366077,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying
> to start actor system at
> 10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073366510,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.event.slf4j.Slf4jLogger","message":"Slf4jLogger
>
> started","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073366694,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Starting
>
> remoting","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073367049,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting
> started; listening on addresses
> :[akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073367051,"thread":"flink-akka.actor.default-dispatcher-4","level":"INFO","loggerName":"akka.remote.Remoting","message":"Remoting
> now listens on addresses:
> [akka.tcp://flink@10.60.5.85:8070]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":51,"threadPriority":5}
> {"timeMillis":1557073367089,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Actor
> system started at
> akka.tcp://flink@10.60.5.85:8070","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367138,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Configuring
> FlinkMetricsReporter with
> {class=com.pipeline.processor.flink.metrics.FlinkMetricsReporter}.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"com.pipeline.processor.flink.metrics.FlinkMetricsReporter","message":"Metrics
> Reporter
> Open","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367139,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.metrics.MetricRegistryImpl","message":"Reporting
> metrics of type
> com.pipeline.processor.flink.metrics.FlinkMetricsReporter.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367142,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
> TaskManager
> actor","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367176,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyConfig","message":"NettyConfig
> [server address: /10.60.5.85, server port: 0, ssl enabled: false, memory
> segment size (bytes): 32768, transport type: NIO, number of server threads: 3
> (manual), number of client threads: 3 (manual), server connect backlog: 0
> (use Netty's default), client connect timeout (sec): 120, send/receive buffer
> size (bytes): 0 (use Netty's
> default)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367187,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration","message":"Messages
> have a max timeout of 100000
> ms","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367198,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Temporary
> file directory '/tmp': total 373 GB, usable 295 GB (79.09%
> usable)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367608,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.buffer.NetworkBufferPool","message":"Allocated
> 639 MB for network buffer pool (number of memory segments: 20467, bytes per
> segment:
> 32768).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367710,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could
> not load Queryable State Client Proxy. Probable reason:
> flink-queryable-state-runtime is not in the classpath. Please put the
> corresponding jar from the opt to the lib
> folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367711,"thread":"pool-2-thread-1","level":"WARN","loggerName":"org.apache.flink.runtime.query.QueryableStateUtils","message":"Could
> not load Queryable State Server. Probable reason:
> flink-queryable-state-runtime is not in the classpath. Please put the
> corresponding jar from the opt to the lib
> folder.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367712,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.NetworkEnvironment","message":"Starting
> the network environment and its
> components.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367753,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyClient","message":"Successful
> initialization (took 34
> ms).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367805,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.network.netty.NettyServer","message":"Successful
> initialization (took 51 ms). Listening on SocketAddress
> /10.60.5.85:38873.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367808,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.taskexecutor.TaskManagerServices","message":"Limiting
> managed memory to 0.7 of the currently free heap space (4005 MB), memory
> will be allocated
> lazily.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367819,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.io.disk.iomanager.IOManager","message":"I/O
> manager uses directory /tmp/flink-io-5f657721-13dd-40aa-9c00-2a15d5666280
> for spill
> files.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367826,"thread":"pool-2-thread-1","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User
> file cache uses directory
> /tmp/flink-dist-cache-30b1f2fd-9457-435b-a601-ae0b4e37dc6d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":40,"threadPriority":5}
> {"timeMillis":1557073367862,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.filecache.FileCache","message":"User
> file cache uses directory
> /tmp/flink-dist-cache-3dfb3cd5-b261-4df3-a662-a1cd91047c72","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367888,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Starting
> TaskManager actor at
> akka://flink/user/taskmanager#1157564383.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367889,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager
> data connection information:
> pipelineruntime-taskmgr-6789dd578b-dcp4r-57b5f60d8144eb16425ec5bd9666768f @
> pipelineruntime-taskmgr-6789dd578b-dcp4r
> (dataPort=38873)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367890,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"TaskManager
> has 6 task
> slot(s).","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Memory
> usage stats: [HEAP: 842/6554/6554 MB, NON HEAP: 62/64/1776 MB
> (used/committed/max)]","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367892,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Starting
>
> ZooKeeperLeaderRetrievalService.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073367965,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"Leader
> node has changed with
> Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager, session
> ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
> {"timeMillis":1557073367966,"thread":"pool-2-thread-1-EventThread","level":"INFO","loggerName":"org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService","message":"New
> leader information: Leader=akka.tcp://flink@10.60.5.53:6123/user/jobmanager,
> session
> ID=270a3383-8f1e-4f2d-b1d6-f7af727e9ea0.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":46,"threadPriority":5}
> {"timeMillis":1557073367975,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Trying
> to register at JobManager akka.tcp://flink@10.60.5.53:6123/user/jobmanager
> (attempt 1, timeout: 500
> milliseconds)","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368168,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Successful
> registration at JobManager
> (akka.tcp://flink@10.60.5.53:6123/user/jobmanager), starting network stack
> and library
> cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368177,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.taskmanager.TaskManager","message":"Determined
> BLOB server address to be /10.60.5.53:43987. Starting BLOB
> cache.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368184,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.PermanentBlobCache","message":"Created
> BLOB cache storage directory
> /tmp/blobStore-ffdc49ba-e86f-4240-93ad-7566c43e9b0d","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073368189,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"org.apache.flink.runtime.blob.TransientBlobCache","message":"Created
> BLOB cache storage directory
> /tmp/blobStore-764277b6-6e46-4c8f-b7ee-80f746edefab","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391398,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$R4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [1] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$S4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [2] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391399,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$T4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [3] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$U4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [4] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391400,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$V4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [5] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$W4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [6] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391401,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$X4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [7] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391474,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Y4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [8] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391475,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$Z4] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [9] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> {"timeMillis":1557073391477,"thread":"flink-akka.actor.default-dispatcher-3","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$04] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [10] dead
> letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":50,"threadPriority":5}
> ...
> ...
> ...
> {"timeMillis":1557073691534,"thread":"flink-akka.actor.default-dispatcher-2","level":"INFO","loggerName":"akka.actor.EmptyLocalActorRef","message":"Message
> [org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage]
> from Actor[akka.tcp://flink@10.60.5.53:6123/temp/$sab] to
> Actor[akka://flink/user/taskmanager#-1883282689] was not delivered. [316]
> dead letters encountered. This logging can be turned off or adjusted with
> configuration settings 'akka.log-dead-letters' and
> 'akka.log-dead-letters-during-shutdown'.","endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","threadId":49,"threadPriority":5}
> {noformat}
> TCP dump at taskmanager:
> {noformat}
> 19:55:58.214944 IP 10.60.5.85.45008 > 10.60.5.53.6123: tcp 715
> 0x0000: 4500 02ff 2809 4000 4006 f0ee 0a3c 0555 E...(.@.@....<.U
> 0x0010: 0a3c 0535 afd0 17eb a107 10ac 0270 79da .<.5.........py.
> 0x0020: 8018 ce96 21f3 0000 0101 080a f2c0 c93f ....!..........?
> 0x0030: b74c ec05 0000 02c7 0ac4 0512 c105 0a3d .L.............=
> 0x0040: 0a3b 616b 6b61 2e74 6370 3a2f 2f66 6c69 .;akka.tcp://fli
> 0x0050: 6e6b 4031 302e 3630 2e35 2e35 333a 3631 nk@10.60.5.53:61
> 0x0060: 3233 2f75 7365 722f 6a6f 626d 616e 6167 23/user/jobmanag
> 0x0070: 6572 2331 3231 3433 3237 3831 3312 bf04 er#1214327813...
> 0x0080: 0aba 04ac ed00 0573 7200 3f6f 7267 2e61 .......sr.?org.a
> 0x0090: 7061 6368 652e 666c 696e 6b2e 7275 6e74 pache.flink.runt
> 0x00a0: 696d 652e 6d65 7373 6167 6573 2e54 6173 ime.messages.Tas
> 0x00b0: 6b4d 616e 6167 6572 4d65 7373 6167 6573 kManagerMessages
> 0x00c0: 2448 6561 7274 6265 6174 1fb7 fffd 259b $Heartbeat....%.
> 0x00d0: c539 0200 024c 000c 6163 6375 6d75 6c61 .9...L..accumula
> 0x00e0: 746f 7273 7400 164c 7363 616c 612f 636f torst..Lscala/co
> 0x00f0: 6c6c 6563 7469 6f6e 2f53 6571 3b4c 000a llection/Seq;L..
> 0x0100: 696e 7374 616e 6365 4944 7400 2e4c 6f72 instanceIDt..Lor
> 0x0110: 672f 6170 6163 6865 2f66 6c69 6e6b 2f72 g/apache/flink/r
> 0x0120: 756e 7469 6d65 2f69 6e73 7461 6e63 652f untime/instance/
> 0x0130: 496e 7374 616e 6365 4944 3b78 7073 7200 InstanceID;xpsr.
> 0x0140: 2473 6361 6c61 2e63 6f6c 6c65 6374 696f $scala.collectio
> 0x0150: 6e2e 6d75 7461 626c 652e 4172 7261 7942 n.mutable.ArrayB
> 0x0160: 7566 6665 7215 38b0 5383 828e 7302 0003 uffer.8.S...s...
> 0x0170: 4900 0b69 6e69 7469 616c 5369 7a65 4900 I..initialSizeI.
> 0x0180: 0573 697a 6530 5b00 0561 7272 6179 7400 .size0[..arrayt.
> 0x0190: 135b 4c6a 6176 612f 6c61 6e67 2f4f 626a .[Ljava/lang/Obj
> 0x01a0: 6563 743b 7870 0000 0010 0000 0000 7572 ect;xp........ur
> 0x01b0: 0013 5b4c 6a61 7661 2e6c 616e 672e 4f62 ..[Ljava.lang.Ob
> 0x01c0: 6a65 6374 3b90 ce58 9f10 7329 6c02 0000 ject;..X..s)l...
> 0x01d0: 7870 0000 0010 7070 7070 7070 7070 7070 xp....pppppppppp
> 0x01e0: 7070 7070 7070 7372 002c 6f72 672e 6170 ppppppsr.,org.ap
> 0x01f0: 6163 6865 2e66 6c69 6e6b 2e72 756e 7469 ache.flink.runti
> 0x0200: 6d65 2e69 6e73 7461 6e63 652e 496e 7374 me.instance.Inst
> 0x0210: 616e 6365 4944 0000 0000 0000 0001 0200 anceID..........
> 0x0220: 0078 7200 206f 7267 2e61 7061 6368 652e .xr..org.apache.
> 0x0230: 666c 696e 6b2e 7574 696c 2e41 6273 7472 flink.util.Abstr
> 0x0240: 6163 7449 4400 0000 0000 0000 0102 0003 actID...........
> 0x0250: 4a00 096c 6f77 6572 5061 7274 4a00 0975 J..lowerPartJ..u
> 0x0260: 7070 6572 5061 7274 4c00 0874 6f53 7472 pperPartL..toStr
> 0x0270: 696e 6774 0012 4c6a 6176 612f 6c61 6e67 ingt..Ljava/lang
> 0x0280: 2f53 7472 696e 673b 7870 ae61 ac60 7f0a /String;xp.a.`..
> 0x0290: b35a b506 6f7d c221 e654 7400 2061 6536 .Z..o}.!.Tt..ae6
> 0x02a0: 3161 6336 3037 6630 6162 3335 6162 3530 1ac607f0ab35ab50
> 0x02b0: 3636 6637 6463 3232 3165 3635 3410 0122 66f7dc221e654.."
> 0x02c0: 3e0a 3c61 6b6b 612e 7463 703a 2f2f 666c >.<akka.tcp://fl
> 0x02d0: 696e 6b40 3130 2e36 302e 352e 3835 3a38 ink@10.60.5.85:8
> 0x02e0: 3037 302f 7573 6572 2f74 6173 6b6d 616e 070/user/taskman
> 0x02f0: 6167 6572 2331 3135 3735 3634 3338 33 ager#1157564383
> 19:55:58.214996 IP 10.60.5.53.6123 > 10.60.5.85.45008: tcp 0
> 0x0000: 4500 0034 c1fe 4000 3f06 5ac4 0a3c 0535 E..4..@.?.Z..<.5
> 0x0010: 0a3c 0555 17eb afd0 0270 79da a107 1377 .<.U.....py....w
> 0x0020: 8010 ce93 1f28 0000 0101 080a b74c ff8d .....(.......L..
> 0x0030: f2c0 c93f ...?
> {noformat}
> After this, the taskmanager never registers with the jobmanager again.
> This run had the following akka configuration:
> akka.watch.heartbeat.pause: 60 s
> akka.ask.timeout: 100 s
> I noticed that akka.watch.heartbeat.interval defaults to akka.ask.timeout if
> not specified in the configuration. Is it possible for this kind of failure
> to happen because the heartbeat interval is larger than the heartbeat pause?
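The comparison asked about in the description can be made concrete with a small sketch. It is only an illustration: the helper names (`parse_duration`, `effective_heartbeat`, `interval_exceeds_pause`) are hypothetical, the set of supported units is a simplified assumption, and the fallback of akka.watch.heartbeat.interval to akka.ask.timeout is taken from the reporter's observation rather than verified against the Flink 1.4.0 sources.

```python
import re

# Unit multipliers in seconds; a small, assumed subset of duration units.
_UNITS = {"ms": 0.001, "s": 1.0, "min": 60.0}


def parse_duration(text):
    """Parse strings such as '60 s' or '500 ms' into seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|min)\s*", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit]


def effective_heartbeat(config):
    """Return (interval, pause) in seconds.

    Assumption from the report: when akka.watch.heartbeat.interval is not
    set, it falls back to the value of akka.ask.timeout.
    """
    ask_timeout = config["akka.ask.timeout"]
    interval = config.get("akka.watch.heartbeat.interval", ask_timeout)
    pause = config["akka.watch.heartbeat.pause"]
    return parse_duration(interval), parse_duration(pause)


def interval_exceeds_pause(config):
    """True when heartbeats would be sent less often than the acceptable pause."""
    interval, pause = effective_heartbeat(config)
    return interval > pause


# The two settings quoted in the report; the interval is left unset.
reported = {
    "akka.watch.heartbeat.pause": "60 s",
    "akka.ask.timeout": "100 s",
}
print(interval_exceeds_pause(reported))
```

Under that assumed fallback, the reported settings give an effective interval of 100 s against a pause of 60 s. Setting akka.watch.heartbeat.interval explicitly to a value below the pause would rule this out as a factor when trying to reproduce the problem.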