Hi, Colin,

That's a good point. I agree that Alter on the Cluster resource is more appropriate.
Thanks,

Jun

On Tue, Aug 22, 2017 at 1:42 PM, Colin McCabe <cmcc...@apache.org> wrote:

> Hi,
>
> Thanks for the KIP. It looks good overall.
>
> On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote:
> > Hi Jun,
> >
> > Thanks for the feedback.
> >
> > I've updated the KIP to mention this new algorithm, which I agree will be much better from the AdminClient PoV.
> >
> > I've also reverted the authorization to be against the cluster resource, rather than the topic, since, on re-reading, Ewen wasn't particularly advocating that it should be topic-based, merely wondering if that made more sense in addition to the cluster-based check.
>
> Hmm. In general I thought we were using Cluster:ClusterAction for authorizing internal requests made by Kafka itself, and Cluster:Alter for administrative requests initiated by the admin. For example, the RPCs added in KIP-140 use Cluster:Alter.
>
> Nitpick: I think you meant to remove this line from the KIP?
>
> > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to the topic.
>
> Also... this command simply starts the leader election, but does not wait for it to complete, right? What is the preferred method for an admin to check the final result or identify when the final result has been achieved?
>
> best,
> Colin
>
> > Cheers,
> >
> > Tom
> >
> > On 11 August 2017 at 18:13, Jun Rao <j...@confluent.io> wrote:
> > > Hi, Tom,
> > >
> > > 2. Yes, that's how manual preferred leader election currently works. I think we could use this opportunity to simplify the process a bit, though. Doing preferred leader election is a lot cheaper than partition reassignment since it only involves writing the new leader to ZK. So, we probably don't need to persist the request in /admin/preferred_replica_election in ZK. I was thinking of the following steps.
> > > (a) The controller receives PreferredLeaderElectionRequest.
> > > (b) The controller adds a PreferredReplicaLeaderElection event to the queue.
> > > (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be processed by the controller and then send a response.
> > >
> > > If the controller changes or dies in the middle of this process, the client will get an error and will have to reissue the request.
> > >
> > > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than one in-flight preferred leader election request, in step (a) above we can set a flag in the controller to indicate that a preferred leader election is in progress. The flag will be reset once the controller finishes processing the request. Until the flag is reset, any new PreferredLeaderElectionRequest will be answered immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS error.
> > >
> > > Does this sound good to you?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley <t.j.bent...@gmail.com> wrote:
> > > > Hi Jun,
> > > >
> > > > Thanks for your reply, I've got a few comments inline...
> > > >
> > > > On 11 August 2017 at 01:51, Jun Rao <j...@confluent.io> wrote:
> > > > > Hi, Tom,
> > > > >
> > > > > Thanks for the KIP. Looks good overall. A few minor comments.
> > > > >
> > > > > 1. In most requests with topic partitions, we nest partitions inside topic. So, instead of [topic, partition_id], we do [topic, [partition_id]] to save some space.
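For concreteness, the nested layout Jun describes can be produced by grouping the requested partitions by topic; a minimal sketch, assuming only the standard org.apache.kafka.common.TopicPartition (the surrounding class and names are illustrative, not the KIP's actual request schema):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    import org.apache.kafka.common.TopicPartition;

    public class NestedPartitionsSketch {
        public static void main(String[] args) {
            List<TopicPartition> requested = Arrays.asList(
                    new TopicPartition("payments", 0),
                    new TopicPartition("payments", 1),
                    new TopicPartition("orders", 3));

            // Group partitions by topic so each topic name is sent once:
            // [topic, [partition_id]] rather than repeating the topic per partition.
            Map<String, List<Integer>> byTopic = new TreeMap<>();
            for (TopicPartition tp : requested) {
                byTopic.computeIfAbsent(tp.topic(), t -> new ArrayList<>()).add(tp.partition());
            }

            System.out.println(byTopic); // {orders=[3], payments=[0, 1]}
        }
    }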
> > > > Of course, now sorted.
> > > >
> > > > > 2. The preferred leader election just tries to move the leader to the preferred replica. This may not succeed if the preferred replica is not in-sync yet. It would be useful to return an error code to reflect that.
> > > >
> > > > The algorithm the existing command uses is:
> > > >
> > > > 1. The command writes some JSON to the /admin/preferred_replica_election znode. The command then returns asynchronously. It only signals an error if the znode write failed.
> > > > 2. The controller is subscribed to changes to this znode and delegates to the PartitionStateMachine to do the actual election.
> > > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a StateChangeFailedException if the preferred leader is not in the ISR, or is not alive.
> > > > 4. This exception will propagate to the ControllerEventManager, which will log it.
> > > >
> > > > In the event of controller failover, the JSON is read from the znode and the election proceeds in a similar manner. So the central problem is the asynchronicity.
> > > >
> > > > I agree it would be better if the return from the AdminClient were synchronous. I'm still learning about how the internals of Kafka work. It's not enough for the AdminManager to wait (with a timeout) for the metadata cache to show that the leader has been elected, since we want to know the reason if the election is not possible. So would an appropriate mechanism to pass this information back from the KafkaController to the AdminManager be to use another znode, or is there another way to do it? I've not found any precedents for this, so please point them out if they exist.
> > > >
> > > > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single threaded, is that just for timeout? If so, perhaps we can just return the Timeout error code.
> > > >
> > > > Again, this pertains to the current algorithm for the election: when the elections have been done the znode is deleted. Thus the presence of the znode indicates an election in progress, which causes the command to just give up (the user can try again later).
> > > >
> > > > We still need the znode in order to honour the existing elections in the event of controller failover part-way through, thus we need to do something in the event that the znode exists at the time we want to write to it.
> > > >
> > > > I suppose it is possible to append to the JSON in the znode while elections are already taking place, with the controller reading from the head of the JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a recipe for this. But I don't see that pattern being used elsewhere in Kafka, so I'm not sure if this is what you want.
> > > >
> > > > It is not clear (to me at least) that the current restriction preventing calling new elections while elections are on-going is a problem in practice. But since both you and Ewen have raised this issue I assume it does need to be solved: I just need a bit more of a hint about whether there's a preferred way to solve it.
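Jun's single-flag suggestion from earlier in the thread could be read roughly as the sketch below; the class, queue, and method names are hypothetical stand-ins, not the actual KafkaController code:

    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Hypothetical sketch: at most one preferred leader election in flight;
    // later requests are rejected straight away with
    // REPLICA_LEADER_ELECTION_IN_PROGRESS until the flag is cleared.
    public class PreferredLeaderElectionGuard {
        private final AtomicBoolean inProgress = new AtomicBoolean(false);
        private final Executor controllerEventQueue;

        public PreferredLeaderElectionGuard(Executor controllerEventQueue) {
            this.controllerEventQueue = controllerEventQueue;
        }

        /** Returns false if an election is already in progress (caller sends the error). */
        public boolean tryEnqueue(Runnable electionEvent) {
            if (!inProgress.compareAndSet(false, true)) {
                return false;
            }
            controllerEventQueue.execute(() -> {
                try {
                    electionEvent.run();      // do the actual leader election
                } finally {
                    inProgress.set(false);    // reset once processing finishes
                }
            });
            return true;
        }
    }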
> > > > > 4. Permission wise, it seems this should require ClusterAction on the Cluster resource since it's changing things at the cluster level.
> > > >
> > > > As I said in my reply to Ewen, I can see that point of view (and it's what I originally specified), so I don't mind changing it back, but it would be good to hear more opinions on this.
> > > >
> > > > Cheers,
> > > >
> > > > Tom
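For reference, the rule being converged on above (Alter on the Cluster resource for client-initiated elections, with ClusterAction kept for Kafka's own inter-broker requests) could be expressed roughly as follows; this is a simplified stand-in, not Kafka's real Authorizer API, and all the names are illustrative:

    // Simplified stand-in for the authorization rule discussed above; the real
    // broker delegates to its Authorizer plugin, and these types are illustrative only.
    enum Operation { ALTER, CLUSTER_ACTION }
    enum ResourceType { CLUSTER, TOPIC }

    interface SimpleAuthorizer {
        boolean authorize(String principal, Operation op, ResourceType resource);
    }

    class ElectPreferredLeadersHandler {
        private final SimpleAuthorizer authorizer;

        ElectPreferredLeadersHandler(SimpleAuthorizer authorizer) {
            this.authorizer = authorizer;
        }

        void handle(String principal, Runnable startElection) {
            // Client-initiated preferred leader election: require Alter on Cluster,
            // not ClusterAction (reserved for inter-broker requests).
            if (!authorizer.authorize(principal, Operation.ALTER, ResourceType.CLUSTER)) {
                throw new SecurityException("CLUSTER_AUTHORIZATION_FAILED: " + principal
                        + " lacks Alter on the Cluster resource");
            }
            startElection.run();
        }
    }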