Hi,

Thanks for the KIP.  It looks good overall.

On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote:
> Hi Jun,
> 
> Thanks for the feedback.
> 
> I've updated the KIP to mention this new algorithm, which I agree will be
> much better from the AdminClient PoV.
> 
> I've also reverted the authorization to be against the cluster resource,
> rather than the topic, since, on re-reading, Ewen wasn't particularly
> advocating that it should be topic-based, merely wondering if that made
> more sense in addition to the cluster-based check.
> 

Hmm.  In general I thought we were using Cluster:ClusterAction for
authorizing internal requests made by Kafka itself, and Cluster:Alter
for administrative requests initiated by the admin.  For example, the
RPCs added in KIP-140 use Cluster:Alter.
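To make that concrete: if this KIP goes with Cluster:Alter, I'd expect the
operator to grant something like the following (the principal and ZooKeeper
address are just examples):

  bin/kafka-acls.sh --authorizer-properties zookeeper.connect=localhost:2181 \
    --add --allow-principal User:admin --operation Alter --cluster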

Nitpick: I think you meant to remove this line from the KIP?

> TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to the 
> topic.

Also... this command simply starts the leader election, but does not
wait for it to complete, right?  What is the preferred way for an admin to
check the final result, or to tell when the election has completed?
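For example, is the expectation that the admin polls DescribeTopics until the
preferred (first-listed) replica shows up as the leader?  Something like the
rough sketch below, just to illustrate the question (the topic name and
bootstrap address are made up):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class CheckPreferredLeader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription topic =
                admin.describeTopics(Collections.singleton("my-topic"))
                     .all().get().get("my-topic");
            for (TopicPartitionInfo p : topic.partitions()) {
                // By convention the first replica in the assignment is the preferred one.
                int preferred = p.replicas().get(0).id();
                boolean elected = p.leader() != null && p.leader().id() == preferred;
                System.out.printf("my-topic-%d preferred leader elected: %b%n",
                        p.partition(), elected);
            }
        }
    }
}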

best,
Colin


> Cheers,
> 
> Tom
> 
> On 11 August 2017 at 18:13, Jun Rao <j...@confluent.io> wrote:
> 
> > Hi, Tom,
> >
> > 2. Yes, that's how manual preferred leader election currently works. I
> > think we could use this opportunity to simplify the process a bit though.
> > Doing preferred leader election is a lot cheaper than partition
> > reassignment since it only involves writing the new leader to ZK. So, we
> > probably don't need to persist the request in
> > /admin/preferred_replica_election in ZK. I was thinking of the following
> > steps.
> > (a) The controller receives PreferredLeaderElectionRequest.
> > (b) The controller adds a PreferredReplicaLeaderElection event to the
> > queue.
> > (c) Wait up to the timeout for the PreferredReplicaLeaderElection event to
> > be processed by the controller and then send a response.
> >
> > If the controller changes or dies in the middle of this process, the
> > client will get an error and has to reissue the request.
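> >
> > In code, steps (a)-(c) would look roughly like the sketch below (the class
> > and method names are made up, not the actual controller classes):
> >
> > import java.util.concurrent.*;
> >
> > // Rough sketch of steps (a)-(c); class/field names are invented.
> > class PreferredLeaderElectionSketch {
> >     private final BlockingQueue<Runnable> controllerEventQueue =
> >             new LinkedBlockingQueue<>();
> >
> >     // (a) + (b): the request handler wraps the election in an event and
> >     // puts it on the single-threaded controller's queue.
> >     CompletableFuture<Void> submitElection(Runnable electPreferredLeaders) {
> >         CompletableFuture<Void> done = new CompletableFuture<>();
> >         controllerEventQueue.add(() -> {
> >             try {
> >                 electPreferredLeaders.run();   // just writes the new leader to ZK
> >                 done.complete(null);
> >             } catch (Exception e) {
> >                 done.completeExceptionally(e); // e.g. preferred replica not in ISR
> >             }
> >         });
> >         return done;
> >     }
> >
> >     // (c): wait up to the request timeout for the controller thread to
> >     // process the event, then send the response.
> >     void awaitAndRespond(CompletableFuture<Void> done, long timeoutMs) {
> >         try {
> >             done.get(timeoutMs, TimeUnit.MILLISECONDS);
> >             // send a success response
> >         } catch (TimeoutException | InterruptedException | ExecutionException e) {
> >             // send an error response; the client can reissue the request
> >         }
> >     }
> > }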
> >
> > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1
> > inflight preferred leader election request, in step (a) above, we can set a
> > flag in the controller to indicate that a preferred leader election is in
> > progress. The flag will be reset once the controller finishes processing
> > the request. Until the flag is reset, any new PreferredLeaderElectionRequest
> > will immediately be answered with a REPLICA_LEADER_ELECTION_IN_PROGRESS
> > error.
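> >
> > The flag could be as simple as an atomic boolean in the controller; again
> > just a sketch, not the real controller code:
> >
> > import java.util.concurrent.atomic.AtomicBoolean;
> >
> > // Sketch of the "at most one inflight election" guard.
> > class ElectionInProgressGuard {
> >     private final AtomicBoolean electionInProgress = new AtomicBoolean(false);
> >
> >     // Called in step (a). Returns false if another election is already in
> >     // flight, in which case the broker immediately responds with
> >     // REPLICA_LEADER_ELECTION_IN_PROGRESS.
> >     boolean tryStart() {
> >         return electionInProgress.compareAndSet(false, true);
> >     }
> >
> >     // Called once the controller has finished processing the request.
> >     void finish() {
> >         electionInProgress.set(false);
> >     }
> > }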
> >
> > Does this sound good to you?
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley <t.j.bent...@gmail.com>
> > wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for your reply, I've got a few comments inline...
> > >
> > > On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, Tom,
> > > >
> > > > Thanks for the KIP. Looks good overall. A few minor comments.
> > > >
> > > > 1. In most requests with topic partitions, we nest partitions inside
> > > > topic. So, instead of [topic partition_id], we do [topic, [partition_id]]
> > > > to save some space.
> > > >
> > >
> > > Of course, now sorted.
> > >
> > >
> > > > 2. The preferred leader election just tries to move the leader to the
> > > > preferred replica. This may not succeed if the preferred replica is not
> > > > in-sync yet. It would be useful to return an error code to reflect that.
> > > >
> > >
> > > The algorithm the existing command uses is:
> > >
> > > 1. The command writes some JSON to the /admin/preferred_replica_election
> > > znode. The command then returns without waiting for the election; it
> > > only signals an error if the znode write failed.
> > > 2. The controller is subscribed to changes to this znode and delegates to
> > > the PartitionStateMachine to do the actual election.
> > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> > > StateChangeFailedException if the preferred leader is not in the ISR, or
> > > is not alive.
> > > 4. This exception will propagate to the ControllerEventManager which will
> > > log it.
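> > >
> > > (For reference, the JSON written to the znode in step 1 looks something
> > > like this, with the topic and partition values here just being examples:
> > > {"version": 1, "partitions": [{"topic": "foo", "partition": 0}]}.)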
> > >
> > > In the event of controller failover, the JSON is read from the znode and
> > > the election proceeds in a similar manner. So the central problem is the
> > > asynchronicity.
> > >
> > > I agree it would be better if the return from the AdminClient were
> > > synchronous. I'm still learning about how the internals of Kafka work.
> > > It's not enough for the AdminManager to wait (with a timeout) for the
> > > metadata cache to show that the leader has been elected, since we want to
> > > know the reason if the election is not possible. So would an appropriate
> > > mechanism to pass this information back from the KafkaController to the
> > > AdminManager be to use another znode, or is there another way to do it?
> > > I've not found any precedents for this, so please point them out if they
> > > exist.
> > >
> > >
> > > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is
> > > > single-threaded, is that just for timeout? If so, perhaps we can just
> > > > return the Timeout error code.
> > > >
> > >
> > > Again, this pertains to the current algorithm for the election: once the
> > > elections have completed, the znode is deleted. Thus the presence of the
> > > znode indicates an election in progress, which causes the command to just
> > > give up (the user can try again later).
> > >
> > > We still need the znode in order to honour the existing elections in the
> > > event of a controller failover part-way through, so we need to handle the
> > > case where the znode already exists at the time we want to write to it.
> > >
> > > I suppose it is possible to append to the JSON in the znode while
> > > elections are already taking place, with the controller reading from the
> > > head of the JSON (thus treating the JSON like a queue), since the
> > > ZooKeeper docs give a recipe for this. But I don't see that pattern being
> > > used elsewhere in Kafka, so I'm not sure if this is what you want.
> > >
> > > It is not clear (to me at least) that the current restriction, which
> > > prevents new elections from being requested while elections are ongoing,
> > > is a problem in practice. But since both you and Ewen have raised this
> > > issue I assume it does need to be solved: I just need a bit more of a hint
> > > about whether there's a preferred way to solve it.
> > >
> > >
> > > > 4. Permission-wise, it seems this should require ClusterAction on the
> > > > Cluster resource since it's changing things at the cluster level.
> > > >
> > >
> > > As I said in my reply to Ewen, I can see that point of view (and it's
> > > what I originally specified), so I don't mind changing it back, but it
> > > would be good to hear more opinions on this.
> > >
> > > Cheers,
> > >
> > > Tom
> > >
> >
