Re: Flink HA on AWS: Network related issue

Deepak Jha Thu, 15 Sep 2016 09:41:53 -0700

Hi Till,
There is a way to shut down the actor system: set
taskmanager.maxRegistrationDuration to a reasonable duration
(e.g. 900 seconds). The default value is Inf. With this setting I noticed
that the TaskManager goes down, runit restarts the service, and it
reconnects to the JobManager.
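
For reference, a minimal flink-conf.yaml sketch of this setting. The 900 s
value is only the example figure mentioned above, not a tuned recommendation,
and the sketch assumes the value is written with a time unit specifier, as
the Flink configuration documentation describes:

```yaml
# flink-conf.yaml (sketch; the value is an example, adjust for your cluster)
# Maximum time the TaskManager keeps trying to register with the JobManager.
# If it is exceeded without a successful registration, the TaskManager
# terminates, so a supervisor such as runit can restart the process.
# Default: Inf (keep retrying, never terminate).
taskmanager.maxRegistrationDuration: 900 s
```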


As I said earlier, the retries to connect to the JobManager do not work
even though telnet to the same JobManager on port 50050 succeeds at the
same time. So the retry logic seems to cache something that prevents it
from reconnecting. My Flink cluster is on AWS (m4.large instances); I'm not
sure if anyone else has observed this behavior.



On Thursday, September 15, 2016, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Deepak,
>
> it seems that the JobManager's deathwatch declares the TaskManager to be
> unreachable which will automatically quarantine it. You're right that in
> such a case, the TaskManager should shut down and be restarted so that it
> can again reconnect to the JobManager. This is, however, not yet supported
> automatically.
>
> For the moment, I'd recommend you to make the deathwatch a little bit less
> aggressive via the following settings:
>
> - akka.watch.heartbeat.interval: Heartbeat interval for Akka’s DeathWatch
> mechanism to detect dead TaskManagers. If TaskManagers are wrongly marked
> dead because of lost or delayed heartbeat messages, then you should
> increase this value. A thorough description of Akka’s DeathWatch can be
> found here (DEFAULT: akka.ask.timeout/10).
> - akka.watch.heartbeat.pause: Acceptable heartbeat pause for Akka’s
> DeathWatch mechanism. A low value does not allow an irregular heartbeat. A
> thorough description of Akka’s DeathWatch can be found here (DEFAULT:
> akka.ask.timeout).
> - akka.watch.threshold: Threshold for the DeathWatch failure detector. A
> low value is prone to false positives whereas a high value increases the
> time to detect a dead TaskManager. A thorough description of Akka’s
> DeathWatch can be found here (DEFAULT: 12).
>
> I hope this helps you to work around the problem for the moment until we've
> added the automatic shut down and restart.
>
> Cheers,
> Till
>
> On Mon, Sep 12, 2016 at 5:55 AM, Deepak Jha <dkjhan...@gmail.com> wrote:
>
> > Hi Till,
> > One more thing I noticed after looking at the following message in the
> > TaskManager log:
> >
> > 2016-09-11 17:57:25,310 PDT [WARN]  ip-10-6-0-15
> > [flink-akka.actor.default-dispatcher-31] Remoting - Tried to associate
> > with unreachable remote address [akka.tcp://flink@10.6.22.22:50050].
> > Address is now gated for 5000 ms, all messages to this address will be
> > delivered to dead letters. Reason: *The remote system has quarantined
> > this system. No further associations to the remote system are possible
> > until this system is restarted*.
> >
> > So in this case, ideally the local ActorSystem should go down so that the
> > service supervisor (runit) restarts it and the TaskManager can connect to
> > the remote system again. If that does not happen automatically, we have
> > to monitor the logs in some way and trigger the restart ourselves.
> > Ideally the Flink TaskManager ActorSystem should go down. Please let me
> > know if my understanding is wrong.
> >
> >
> >
> >
> >
> > On Fri, Sep 9, 2016 at 8:01 AM, Deepak Jha <dkjhan...@gmail.com> wrote:
> >
> > > Hi Till,
> > > I'm getting the following messages in the JobManager log:
> > >
> > > 2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher -
> > > *Detected unreachable: [akka.tcp://flink@10.8.4.57:6121]*
> > > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985]
> > > o.a.f.runtime.jobmanager.JobManager - Task manager
> > > akka.tcp://flink@10.8.4.57:6121/user/taskmanager terminated.
> > > 2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985]
> > > o.a.f.r.instance.InstanceManager - Unregistered task manager
> > > akka.tcp://flink@10.8.4.57:6121/user/taskmanager. Number of registered
> > > task managers 2. Number of available slots 4.
> > > 2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-982] Remoting - Association to
> > > [akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is
> > > irrecoverably failed. *UID is now quarantined and all messages to this
> > > UID will be delivered to dead letters. Remote actorsystem must be
> > > restarted to recover from this situation.*
> > > 2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
> > > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > > Actor[akka://flink/deadLetters] to
> > > Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0#393939009]
> > > was not delivered. [54] dead letters encountered. This logging can be
> > > turned off or adjusted with configuration settings
> > > 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> > > 2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
> > > [flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
> > > Message [akka.remote.transport.AssociationHandle$Disassociated] from
> > > Actor[akka://flink/deadLetters] to
> > > Actor[akka://flink/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fflink%4010.8.4.57%3A51291-2#1151730456]
> > > was not delivered. [55] dead letters encountered. This logging can be
> > > turned off or adjusted with configuration settings
> > > 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
> > > 2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
> > > [ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
> > > Recovering checkpoints from ZooKeeper.
> > >
> > > Hope it helps. I'm using Flink 1.0.2
> > >
> > > On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <trohrm...@apache.org>
> > > wrote:
> > >
> > >> Hi Deepak,
> > >>
> > >> could you check the logs whether the JobManager has been quarantined
> > >> and thus, cannot be connected to anymore? The logs should at least
> > >> contain a hint why the TaskManager lost the connection initially.
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <dkjhan...@gmail.com> wrote:
> > >>
> > >> > Hi,
> > >> > I've set up Flink HA on AWS (3 TaskManagers and 2 JobManagers, each
> > >> > on an EC2 m4.large instance, with checkpointing enabled on S3). My
> > >> > topology works fine, but after a few hours I see that the
> > >> > TaskManagers get detached from the JobManager. I tried to reach the
> > >> > JobManager using telnet at the same time and it worked, but the
> > >> > TaskManager does not succeed in connecting again. It attaches only
> > >> > after I restart it. I tried the following settings, but the problem
> > >> > still persists.
> > >> >
> > >> > akka.ask.timeout: 20 s
> > >> > akka.lookup.timeout: 20 s
> > >> > akka.watch.heartbeat.interval: 20 s
> > >> >
> > >> > Please find attached a snapshot from one of the TaskManagers. Is
> > >> > there any setting that I need to change?
> > >> >
> > >> > --
> > >> > Thanks,
> > >> > Deepak Jha
> > >> >
> > >> >
> > >>
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Deepak Jha
> > >
> > >
> >
> >
> > --
> > Thanks,
> > Deepak Jha
> >
>


-- 
Sent from Gmail Mobile
