Re: Zookeeper failure handling

Gyula Fóra Mon, 25 Sep 2017 06:41:26 -0700

I will try to check what Stephan suggested and get back to you!

Thanks for the feedback
Gyula


On Mon, Sep 25, 2017, 15:33 Stephan Ewen <[email protected]> wrote:

> I think the question is whether the connection should be lost in the case
> of a rolling ZK update.
>
> There should always be a quorum online, so Curator should always be able to
> connect. So there is no need to revoke leadership.
>
> @gyula - can you check whether there is an option in Curator to reconnect
> to another quorum peer if one goes down?
>
> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <[email protected]>
> wrote:
>
> > Hi Gyula,
> >
> > Flink internally uses the Curator LeaderLatch recipe for leader election.
> > The LeaderLatch will revoke the leadership of a contender in case of a
> > SUSPENDED or LOST connection to the ZooKeeper quorum. The assumption here
> > is that if you cannot talk to ZooKeeper, then we can no longer be sure
> > that you are the leader.
> >
> > Consequently, if you do a rolling update of your ZooKeeper cluster that
> > causes client connections to be lost or suspended, the JobManager loses
> > its leadership, and the Flink job is restarted once leadership is
> > reacquired.
> >
> > Cheers,
> > Till
> >
> > > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <[email protected]> wrote:
> >
> > > We are using 1.3.2
> > >
> > > Gyula
> > >
> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <[email protected]> wrote:
> > >
> > > > Which release are you using?
> > > >
> > > > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election
> > > > issues.
> > > >
> > > > Mind giving 1.3.2 a try?
> > > >
> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We have observed that when some nodes of the ZK cluster are
> > > > > restarted (for a rolling restart), the Flink streaming jobs fail
> > > > > (and restart).
> > > > >
> > > > > Log excerpt:
> > > > >
> > > > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
> > > > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://[email protected]:42118/user/jobmanager no longer participates in the leader election.
> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
> > > > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
> > > > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
> > > > > java.lang.Exception: JobManager is no longer the leader.
> > > > >
> > > > >
> > > > > Is this the expected behaviour?
> > > > >
> > > > > Thanks,
> > > > > Gyula
> > > > >
> > > >
> > >
> >
>
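
The revocation behaviour Till describes in the thread can be illustrated with a small toy model. The sketch below is plain Python and does not use Curator. ToyLeaderLatch, replay and the revoke_on parameter are hypothetical names invented for this illustration, and the model assumes that the contender regains leadership uncontended as soon as the connection reports RECONNECTED. It replays the SUSPENDED event from the log excerpt under two policies: revoking on SUSPENDED or LOST (the behaviour described in the thread) and revoking only on LOST (a hypothetical alternative, shown only for comparison).

```python
from enum import Enum


class ConnectionState(Enum):
    """Connection states named in the thread and the log excerpt."""
    CONNECTED = "CONNECTED"
    SUSPENDED = "SUSPENDED"
    RECONNECTED = "RECONNECTED"
    LOST = "LOST"


class ToyLeaderLatch:
    """Hypothetical toy model; this is NOT Curator's LeaderLatch."""

    def __init__(self, revoke_on):
        # States that cause the contender to give up leadership.
        self.revoke_on = frozenset(revoke_on)
        self.is_leader = True   # assumption: we start out as leader
        self.revocations = 0    # how often leadership was revoked
        self.job_restarts = 0   # restarts triggered on regaining leadership

    def on_state_change(self, state):
        if state in self.revoke_on and self.is_leader:
            # "if you cannot talk to ZooKeeper, we can no longer be sure
            # that you are the leader"
            self.is_leader = False
            self.revocations += 1
        elif state is ConnectionState.RECONNECTED and not self.is_leader:
            # Assumption: leadership is regained uncontended, and the job
            # is restarted at that point, as described in the thread.
            self.is_leader = True
            self.job_restarts += 1


def replay(events, revoke_on):
    """Feed a sequence of connection states to a fresh toy latch."""
    latch = ToyLeaderLatch(revoke_on)
    for event in events:
        latch.on_state_change(event)
    return latch


# One ZooKeeper node restarts: the client is briefly suspended, then
# reconnects to the quorum.
rolling_restart = [ConnectionState.SUSPENDED, ConnectionState.RECONNECTED]

strict = replay(rolling_restart,
                {ConnectionState.SUSPENDED, ConnectionState.LOST})
lenient = replay(rolling_restart, {ConnectionState.LOST})

print("revoke on SUSPENDED or LOST:", strict.job_restarts, "restart(s)")
print("revoke on LOST only:", lenient.job_restarts, "restart(s)")
```

Under the first policy a single suspended connection is enough to cost the contender its leadership, which matches the job failure in the log excerpt. Under the second, the same event sequence leaves leadership untouched, at the price of a contender that may briefly believe it is leader while it cannot reach ZooKeeper.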
