Hi, Colin,

That's a good point. I agree that Alter on the Cluster resource is more appropriate.
Thanks,

Jun

On Tue, Aug 22, 2017 at 1:42 PM, Colin McCabe <cmcc...@apache.org> wrote:

> Hi,
>
> Thanks for the KIP. It looks good overall.
>
> On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote:
> > Hi Jun,
> >
> > Thanks for the feedback.
> >
> > I've updated the KIP to mention this new algorithm, which I agree will be much better from the AdminClient PoV.
> >
> > I've also reverted the authorization to be against the cluster resource, rather than the topic, since, on re-reading, Ewen wasn't particularly advocating that it should be topic-based, merely wondering if that made more sense in addition to the cluster-based check.
>
> Hmm. In general I thought we were using Cluster:ClusterAction for authorizing internal requests made by Kafka itself, and Cluster:Alter for administrative requests initiated by the admin. For example, the RPCs added in KIP-140 use Cluster:Alter.
>
> Nitpick: I think you meant to remove this line from the KIP?
>
> > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to the topic.
>
> Also... this command simply starts the leader election, but does not wait for it to complete, right? What is the preferred method for an admin to check the final result or identify when the final result has been achieved?
>
> best,
> Colin
>
> > Cheers,
> >
> > Tom
> >
> > On 11 August 2017 at 18:13, Jun Rao <j...@confluent.io> wrote:
> > > Hi, Tom,
> > >
> > > 2. Yes, that's how manual preferred leader election currently works. I think we could use this opportunity to simplify the process a bit, though. Doing preferred leader election is a lot cheaper than partition reassignment since it only involves writing the new leader to ZK. So, we probably don't need to persist the request in /admin/preferred_replica_election in ZK. I was thinking of the following steps.
> > > (a) The controller receives PreferredLeaderElectionRequest.
> > > (b) The controller adds a PreferredReplicaLeaderElection event to the queue.
> > > (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be processed by the controller and then send a response.
> > >
> > > If the controller changes or dies in the middle of this process, the client will get an error and will have to reissue the request.
> > >
> > > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than one in-flight preferred leader election request, in step (a) above we can set a flag in the controller to indicate that a preferred leader election is in progress. The flag will be reset once the controller finishes processing the request. Until the flag is reset, any new PreferredLeaderElectionRequest will be answered immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS error.
> > >
> > > Does this sound good to you?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:
> > > > Hi Jun,
> > > >
> > > > Thanks for your reply, I've got a few comments inline...
> > > >
> > > > On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> > > > > Hi, Tom,
> > > > >
> > > > > Thanks for the KIP. Looks good overall. A few minor comments.
> > > > >
> > > > > 1. In most requests with topic partitions, we nest partitions inside topic. So, instead of [topic, partition_id], we do [topic, [partition_id]] to save some space.
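For concreteness, the nested layout Jun describes can be produced by grouping the requested partitions by topic; a minimal sketch, assuming only the standard org.apache.kafka.common.TopicPartition (the surrounding class and names are illustrative, not the KIP's actual request schema):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.kafka.common.TopicPartition;

    public class NestedPartitionsSketch {
        public static void main(String[] args) {
            List<TopicPartition> requested = Arrays.asList(
                    new TopicPartition("payments", 0),
                    new TopicPartition("payments", 1),
                    new TopicPartition("orders", 3));

            // Group partitions by topic so each topic name is sent once:
            // [topic, [partition_id]] rather than repeating the topic per partition.
            Map<String, List<Integer>> byTopic = new TreeMap<>();
            for (TopicPartition tp : requested) {
                byTopic.computeIfAbsent(tp.topic(), t -> new ArrayList<>()).add(tp.partition());
            }

            System.out.println(byTopic); // {orders=[3], payments=[0, 1]}
        }
    }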
> > > > Of course, now sorted.
> > > >
> > > > > 2. The preferred leader election just tries to move the leader to the preferred replica. This may not succeed if the preferred replica is not in-sync yet. It would be useful to return an error code to reflect that.
> > > >
> > > > The algorithm the existing command uses is:
> > > >
> > > > 1. The command writes some JSON to the /admin/preferred_replica_election znode. The command then returns asynchronously. It only signals an error if the znode write failed.
> > > > 2. The controller is subscribed to changes to this znode and delegates to the PartitionStateMachine to do the actual election.
> > > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a StateChangeFailedException if the preferred leader is not in the ISR, or is not alive.
> > > > 4. This exception will propagate to the ControllerEventManager, which will log it.
> > > >
> > > > In the event of controller failover, the JSON is read from the znode and the election proceeds in a similar manner. So the central problem is the asynchronicity.
> > > >
> > > > I agree it would be better if the return from the AdminClient were synchronous. I'm still learning about how the internals of Kafka work. It's not enough for the AdminManager to wait (with a timeout) for the metadata cache to show that the leader has been elected, since we want to know the reason if the election is not possible. So would an appropriate mechanism to pass this information back from the KafkaController to the AdminManager be to use another znode, or is there another way to do it? I've not found any precedents for this, so please point them out if they exist.
> > > >
> > > > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single threaded, is that just for timeout? If so, perhaps we can just return the Timeout error code.
> > > >
> > > > Again, this pertains to the current algorithm for the election: when the elections have been done the znode is deleted. Thus the presence of the znode indicates an election in progress, which causes the command to just give up (the user can try again later).
> > > >
> > > > We still need the znode in order to honour the existing elections in the event of controller failover part-way through, thus we need to do something in the event that the znode exists at the time we want to write to it.
> > > >
> > > > I suppose it is possible to append to the JSON in the znode while elections are already taking place, with the controller reading from the head of the JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a recipe for this. But I don't see that pattern being used elsewhere in Kafka, so I'm not sure if this is what you want.
> > > >
> > > > It is not clear (to me at least) that the current restriction preventing calling new elections while elections are on-going is a problem in practice. But since both you and Ewen have raised this issue I assume it does need to be solved: I just need a bit more of a hint about whether there's a preferred way to solve it.
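Jun's single-flag suggestion from earlier in the thread could be read roughly as the sketch below; the class, queue, and method names are hypothetical stand-ins, not the actual KafkaController code:

    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical sketch: at most one preferred leader election in flight;
    // later requests are rejected straight away with
    // REPLICA_LEADER_ELECTION_IN_PROGRESS until the flag is cleared.
    public class PreferredLeaderElectionGuard {
        private final AtomicBoolean inProgress = new AtomicBoolean(false);
        private final Executor controllerEventQueue;

        public PreferredLeaderElectionGuard(Executor controllerEventQueue) {
            this.controllerEventQueue = controllerEventQueue;
        }

        /** Returns false if an election is already in progress (caller sends the error). */
        public boolean tryEnqueue(Runnable electionEvent) {
            if (!inProgress.compareAndSet(false, true)) {
                return false;
            }
            controllerEventQueue.execute(() -> {
                try {
                    electionEvent.run();      // do the actual leader election
                } finally {
                    inProgress.set(false);    // reset once processing finishes
                }
            });
            return true;
        }
    }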
> > > > > 4. Permission wise, it seems this should require ClusterAction on the Cluster resource since it's changing things at the cluster level.
> > > >
> > > > As I said in my reply to Ewen, I can see that point of view (and it's what I originally specified), so I don't mind changing it back, but it would be good to hear more opinions on this.
> > > >
> > > > Cheers,
> > > >
> > > > Tom
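For reference, the rule being converged on above (Alter on the Cluster resource for client-initiated elections, with ClusterAction kept for Kafka's own inter-broker requests) could be expressed roughly as follows; this is a simplified stand-in, not Kafka's real Authorizer API, and all the names are illustrative:

    // Simplified stand-in for the authorization rule discussed above; the real
    // broker delegates to its Authorizer plugin, and these types are illustrative only.
    enum Operation { ALTER, CLUSTER_ACTION }
    enum ResourceType { CLUSTER, TOPIC }

    interface SimpleAuthorizer {
        boolean authorize(String principal, Operation op, ResourceType resource);
    }

    class ElectPreferredLeadersHandler {
        private final SimpleAuthorizer authorizer;

        ElectPreferredLeadersHandler(SimpleAuthorizer authorizer) {
            this.authorizer = authorizer;
        }

        void handle(String principal, Runnable startElection) {
            // Client-initiated preferred leader election: require Alter on Cluster,
            // not ClusterAction (reserved for inter-broker requests).
            if (!authorizer.authorize(principal, Operation.ALTER, ResourceType.CLUSTER)) {
                throw new SecurityException("CLUSTER_AUTHORIZATION_FAILED: " + principal
                        + " lacks Alter on the Cluster resource");
            }
            startElection.run();
        }
    }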