Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Jun, The patches that I've got currently wait for the elections to complete before returning the response. Is that the semantic you wanted? Cheers, Tom On 7 September 2017 at 22:14, Jun Rao wrote: > Hi, Tom, > > It seems that it's useful to know whether the leader is balanced for each > partition in ElectPreferredLeadersResult, instead of just being attempted? > > Thanks, > > Jun > > On Wed, Sep 6, 2017 at 4:03 PM, Colin McCabe wrote: > > > On Wed, Sep 6, 2017, at 01:18, Tom Bentley wrote: > > > Hi Colin, > > > > > > Thanks for taking the time to respond. > > > > > > On 5 September 2017 at 22:22, Colin McCabe wrote: > > > > > > > ... > > > > Why does there need to be a map at all in the API? > > > > > > > > > From a purely technical PoV there doesn't, but doing something else > would > > > make the API inconsistent with other similar AdminClient *Results > > > classes, > > > which all expose a Map directly. > > > > > > > > > > Why not just have > > > > something like this: > > > > > > > > > > I agree this would be a better solution. I will update the KIP and ask > > > people to vote again. (Is that the right process?) > > > > > > It might be worth bearing this in mind for future AdminClient APIs: > > > Exposing a Map directly means you can't retrofit handling a null > argument > > > to mean "all the things", whereas wrapping the map would allow that. > > > > That's a good point. > > > > I guess the important thing to keep in mind is that if you return a map > > from a results class, it has to be instantiated eagerly. It has to be > > something you know before any RPCs are made, async actions are > > performed, etc. > > > > best, > > Colin > > > > > > > > Thanks again, > > > > > > Tom > > >
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi, Tom, It seems that it's useful to know whether the leader is balanced for each partition in ElectPreferredLeadersResult, instead of just being attempted? Thanks, Jun On Wed, Sep 6, 2017 at 4:03 PM, Colin McCabe wrote: > On Wed, Sep 6, 2017, at 01:18, Tom Bentley wrote: > > Hi Colin, > > > > Thanks for taking the time to respond. > > > > On 5 September 2017 at 22:22, Colin McCabe wrote: > > > > > ... > > > Why does there need to be a map at all in the API? > > > > > > From a purely technical PoV there doesn't, but doing something else would > > make the API inconsistent with other similar AdminClient *Results > > classes, > > which all expose a Map directly. > > > > > > > Why not just have > > > something like this: > > > > > > > I agree this would be a better solution. I will update the KIP and ask > > people to vote again. (Is that the right process?) > > > > It might be worth bearing this in mind for future AdminClient APIs: > > Exposing a Map directly means you can't retrofit handling a null argument > > to mean "all the things", whereas wrapping the map would allow that. > > That's a good point. > > I guess the important thing to keep in mind is that if you return a map > from a results class, it has to be instantiated eagerly. It has to be > something you know before any RPCs are made, async actions are > performed, etc. > > best, > Colin > > > > > Thanks again, > > > > Tom >
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
On Wed, Sep 6, 2017, at 01:18, Tom Bentley wrote: > Hi Colin, > > Thanks for taking the time to respond. > > On 5 September 2017 at 22:22, Colin McCabe wrote: > > > ... > > Why does there need to be a map at all in the API? > > > From a purely technical PoV there doesn't, but doing something else would > make the API inconsistent with other similar AdminClient *Results > classes, > which all expose a Map directly. > > > > Why not just have > > something like this: > > > > I agree this would be a better solution. I will update the KIP and ask > people to vote again. (Is that the right process?) > > It might be worth bearing this in mind for future AdminClient APIs: > Exposing a Map directly means you can't retrofit handling a null argument > to mean "all the things", whereas wrapping the map would allow that. That's a good point. I guess the important thing to keep in mind is that if you return a map from a results class, it has to be instantiated eagerly. It has to be something you know before any RPCs are made, async actions are performed, etc. best, Colin > > Thanks again, > > Tom
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Colin, Thanks for taking the time to respond. On 5 September 2017 at 22:22, Colin McCabe wrote: > ... > Why does there need to be a map at all in the API? From a purely technical PoV there doesn't, but doing something else would make the API inconsistent with other similar AdminClient *Results classes, which all expose a Map directly. > Why not just have > something like this: > I agree this would be a better solution. I will update the KIP and ask people to vote again. (Is that the right process?) It might be worth bearing this in mind for future AdminClient APIs: Exposing a Map directly means you can't retrofit handling a null argument to mean "all the things", whereas wrapping the map would allow that. Thanks again, Tom
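The trade-off Tom describes (a directly exposed Map must be built eagerly, while a wrapper could later support null meaning "all partitions") can be sketched like this. The class and method names below are hypothetical, and `java.util.concurrent.CompletableFuture` stands in for `KafkaFuture`:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Style A: expose the Map directly, as the existing *Results classes do.
// The key set must be known before any RPC is made, so a null "all
// partitions" argument cannot be retrofitted later.
class DirectMapResult {
    private final Map<String, CompletableFuture<Void>> futures;
    DirectMapResult(Map<String, CompletableFuture<Void>> futures) {
        this.futures = futures;
    }
    Map<String, CompletableFuture<Void>> values() {
        return futures;
    }
}

// Style B: hide the Map behind an accessor. The set of partitions can be
// resolved lazily, once the broker's response says which were elected.
class WrappedResult {
    private final CompletableFuture<Map<String, Void>> outcome;
    WrappedResult(CompletableFuture<Map<String, Void>> outcome) {
        this.outcome = outcome;
    }
    // Completes when the response arrives; keys need not be known up front.
    CompletableFuture<Set<String>> partitions() {
        return outcome.thenApply(Map::keySet);
    }
}
```

With Style B, a null argument could simply produce an `outcome` future whose map is filled in from the response, which is exactly what Style A cannot do.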
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
On Mon, Sep 4, 2017, at 04:54, Tom Bentley wrote: > The KIP has been adopted after a successful vote. Thanks for working on this, Tom. It's a nice improvement. > > Unfortunately I've discovered that there's an annoying detail in the > handling of the case that electPreferredLeaders() is called with a null > partitions argument. As discussed with Ewen, this is supposed to mean > "all > partitions", but we don't know all the partitions in the AdminClient, yet > we have to return an ElectPreferredLeadersResults instance, supposedly > with > the partitions as keys. > > We could handle this by passing a KafkaFuture
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
The KIP has been adopted after a successful vote. Unfortunately I've discovered that there's an annoying detail in the handling of the case that electPreferredLeaders() is called with a null partitions argument. As discussed with Ewen, this is supposed to mean "all partitions", but we don't know all the partitions in the AdminClient, yet we have to return an ElectPreferredLeadersResults instance, supposedly with the partitions as keys. We could handle this by passing a KafkaFuture
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
I've updated the KIP. Thanks, Tom On 30 August 2017 at 16:42, Ismael Juma wrote: > If you agree with the change, yes, please rename. It's OK to make changes > after the VOTE thread starts. In cases where some people have already > voted, it's recommended to mention the changes in the VOTE thread as a > heads up. Generally, we don't restart the vote unless the changes are > significant. > > Ismael > > On Wed, Aug 30, 2017 at 4:26 PM, Tom Bentley > wrote: > > > Hi Ismael, > > > > I agree that `electPreferredReplicaLeader` is a mouthful and am happy to > > change it to `electPreferredLeaders`. I'd rename the corresponding request > and > > response similarly. > > > > Should I rename it in the KIP now, even though I initiated a VOTE thread > > yesterday? > > > > Cheers, > > > > Tom > > > > On 30 August 2017 at 16:01, Ismael Juma wrote: > > > > > Hi Tom, > > > > > > Thanks for the KIP, it's a useful one. I find the proposed method name > > > `electPreferredReplicaLeader` a little hard to read. It seems that a > > small > > > change would make it clearer: `electPreferredReplicaAsLeader`. The > next > > > point is that this is a batch API, so it should ideally be plural like > > the > > > other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, > but > > > that's quite a mouthful. Maybe we should shorten it to > > > `electPreferredLeaders`. Thoughts? > > > > > > Ismael > > > > > > On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley > > wrote: > > > > > > > In a similar vein to KIP-179 I've created KIP-183 ( > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+ > > > > PreferredReplicaLeaderElectionCommand+to+use+AdminClient) > > > > which is about deprecating the --zookeeper option to > > > > kafka-preferred-replica-election.sh and replacing it with an option > > > which > > > > would use a new AdminClient-based API. > > > > > > > > As it stands the KIP is focussed on simply moving the existing > > > > functionality behind the AdminClient. 
> > > > > > > > I'd be grateful for any feedback people may have on this. > > > > > > > > Thanks, > > > > > > > > Tom > > > > > > > > > >
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
If you agree with the change, yes, please rename. It's OK to make changes after the VOTE thread starts. In cases where some people have already voted, it's recommended to mention the changes in the VOTE thread as a heads up. Generally, we don't restart the vote unless the changes are significant. Ismael On Wed, Aug 30, 2017 at 4:26 PM, Tom Bentley wrote: > Hi Ismael, > > I agree that `electPreferredReplicaLeader` is a mouthful and am happy to > change it to `electPreferredLeaders`. I'd rename the corresponding request and > response similarly. > > Should I rename it in the KIP now, even though I initiated a VOTE thread > yesterday? > > Cheers, > > Tom > > On 30 August 2017 at 16:01, Ismael Juma wrote: > > > Hi Tom, > > > > Thanks for the KIP, it's a useful one. I find the proposed method name > > `electPreferredReplicaLeader` a little hard to read. It seems that a > small > > change would make it clearer: `electPreferredReplicaAsLeader`. The next > > point is that this is a batch API, so it should ideally be plural like > the > > other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, but > > that's quite a mouthful. Maybe we should shorten it to > > `electPreferredLeaders`. Thoughts? > > > > Ismael > > > > On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley > wrote: > > > > > In a similar vein to KIP-179 I've created KIP-183 ( > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+ > > > PreferredReplicaLeaderElectionCommand+to+use+AdminClient) > > > which is about deprecating the --zookeeper option to > > > kafka-preferred-replica-election.sh and replacing it with an option > > which > > > would use a new AdminClient-based API. > > > > > > As it stands the KIP is focussed on simply moving the existing > > > functionality behind the AdminClient. > > > > > > I'd be grateful for any feedback people may have on this. > > > > > > Thanks, > > > > > > Tom > > > > > >
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Ismael, I agree that `electPreferredReplicaLeader` is a mouthful and am happy to change it to `electPreferredLeaders`. I'd rename the corresponding request and response similarly. Should I rename it in the KIP now, even though I initiated a VOTE thread yesterday? Cheers, Tom On 30 August 2017 at 16:01, Ismael Juma wrote: > Hi Tom, > > Thanks for the KIP, it's a useful one. I find the proposed method name > `electPreferredReplicaLeader` a little hard to read. It seems that a small > change would make it clearer: `electPreferredReplicaAsLeader`. The next > point is that this is a batch API, so it should ideally be plural like the > other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, but > that's quite a mouthful. Maybe we should shorten it to > `electPreferredLeaders`. Thoughts? > > Ismael > > On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley wrote: > > > In a similar vein to KIP-179 I've created KIP-183 ( > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+ > > PreferredReplicaLeaderElectionCommand+to+use+AdminClient) > > which is about deprecating the --zookeeper option to > > kafka-preferred-replica-election.sh and replacing it with an option > which > > would use a new AdminClient-based API. > > > > As it stands the KIP is focussed on simply moving the existing > > functionality behind the AdminClient. > > > > I'd be grateful for any feedback people may have on this. > > > > Thanks, > > > > Tom > > >
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Tom, Thanks for the KIP, it's a useful one. I find the proposed method name `electPreferredReplicaLeader` a little hard to read. It seems that a small change would make it clearer: `electPreferredReplicaAsLeader`. The next point is that this is a batch API, so it should ideally be plural like the other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, but that's quite a mouthful. Maybe we should shorten it to `electPreferredLeaders`. Thoughts? Ismael On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley wrote: > In a similar vein to KIP-179 I've created KIP-183 ( > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+ > PreferredReplicaLeaderElectionCommand+to+use+AdminClient) > which is about deprecating the --zookeeper option to > kafka-preferred-replica-election.sh and replacing it with an option which > would use a new AdminClient-based API. > > As it stands the KIP is focussed on simply moving the existing > functionality behind the AdminClient. > > I'd be grateful for any feedback people may have on this. > > Thanks, > > Tom >
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi, Colin, That's a good point. I agree that Alter on Cluster resource is more appropriate. Thanks, Jun On Tue, Aug 22, 2017 at 1:42 PM, Colin McCabe wrote: > Hi, > > Thanks for the KIP. It looks good overall. > > On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote: > > Hi Jun, > > > > Thanks for the feedback. > > > > I've updated the KIP to mention this new algorithm, which I agree will be > > much better from the AdminClient PoV. > > > > I've also reverted the authorization to be against the cluster resource, > > rather than the topic, since, on re-reading, Ewen wasn't particularly > > advocating that it should be topic-based, merely wondering if that made > > more sense in addition to the cluster-based check. > > > > Hmm. In general I thought we were using Cluster:ClusterAction for > authorizing internal requests made by Kafka itself, and Cluster:Alter > for administrative requests initiated by the admin. For example, the > RPCs added in KIP-140 use Cluster:Alter. > > Nitpick: I think you meant to remove this line from the KIP? > > > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to > the topic. > > Also... this command simply starts the leader election, but does not > wait for it to complete, right? What is the preferred method for an > admin to check the final result or identify when the final result has > been achieved? > > best, > Colin > > > > Cheers, > > > > Tom > > > > On 11 August 2017 at 18:13, Jun Rao wrote: > > > > > Hi, Tom, > > > > > > 2. Yes, that's how manual preferred leader election currently works. I > > > think we could use this opportunity to simplify the process a bit > though. > > > Doing preferred leader election is a lot cheaper than partition > > > reassignment since it only involves writing the new leader to ZK. So, > we > > > probably don't need to persist the request in /admin/preferred_replica_ > > > election > > > in ZK. I was thinking of the following steps. 
> > > (a) The controller receives PreferredLeaderElectionRequest. > > > (b) The controller adds a PreferredReplicaLeaderElection event in the > > > queue. > > > (c) Wait up to the timeout for the PreferredReplicaLeaderElection to > be > > > processed by the controller and then send a response. > > > > > > If the controller changes and dies in the middle of this process, the > > > client will get an error and the client has to reissue the request. > > > > > > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more > than 1 > > > inflight preferred leader election request, in step (a) above, we can > set a > > > flag in the controller to indicate that a preferred leader election is > in > > > progress. The flag will be reset once the controller finishes > processing > > > the request. Until the flag is reset, any new > > > PreferredLeaderElectionRequest > > > will be responded immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS > > > error. > > > > > > Does this sound good to you? > > > > > > Thanks, > > > > > > Jun > > > > > > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley > > > wrote: > > > > > > > Hi Jun, > > > > > > > > Thanks for your reply, I've got a few comments inline... > > > > > > > > On 11 August 2017 at 01:51, Jun Rao wrote: > > > > > > > > > Hi, Tom, > > > > > > > > > > Thanks for the KIP. Looks good overall. A few minor comments. > > > > > > > > > > 1. In most requests with topic partitions, we nest partitions > inside > > > > topic. > > > > > So, instead of [topic partition_id], we do [topic, [partition_id]] > to > > > > save > > > > > some space. > > > > > > > > > > > > > Of course, now sorted. > > > > > > > > > > > > > 2. The preferred leader election just tries to move the leader to > the > > > > > preferred replica. This may not succeed if the preferred replica > is not > > > > > in-sync yet. It would be useful to return an error code to reflect > > > that. 
> > > > > > > > > > The algorithm the existing command uses is: > > > > > > > > 1. The command writes some JSON to the /admin/preferred_replica_election > > > > znode. The command then returns asynchronously. It only signals an > error > > > if > > > > the znode write failed. > > > > 2. The controller is subscribed to changes to this znode and > delegates to > > > > the PartitionStateMachine to do the actual election. > > > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a > > > > StateChangeFailedException if the preferred leader is not in the > ISR, or > > > is > > > > not alive. > > > > 4. This exception will propagate to the ControllerEventManager which > will > > > > log it. > > > > > > > > In the event of controller failover, the JSON is read from the znode > and > > > > the election proceeds in a similar manner. So the central problem is > the > > > > asynchronicity. > > > > > > > > I agree it would be better if the return from the AdminClient were > > > > synchronous. I'm still
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Colin, Thanks for your input. A couple of comments inline. On 22 August 2017 at 21:42, Colin McCabe wrote: > Hi, > > Thanks for the KIP. It looks good overall. > > On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote: > > Hi Jun, > > > > Thanks for the feedback. > > > > I've updated the KIP to mention this new algorithm, which I agree will be > > much better from the AdminClient PoV. > > > > I've also reverted the authorization to be against the cluster resource, > > rather than the topic, since, on re-reading, Ewen wasn't particularly > > advocating that it should be topic-based, merely wondering if that made > > more sense in addition to the cluster-based check. > > > > Hmm. In general I thought we were using Cluster:ClusterAction for > authorizing internal requests made by Kafka itself, and Cluster:Alter > for administrative requests initiated by the admin. For example, the > RPCs added in KIP-140 use Cluster:Alter. > > Nitpick: I think you meant to remove this line from the KIP? > > > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to > the topic. > > You're right, both now fixed. > Also... this command simply starts the leader election, but does not > wait for it to complete, right? What is the preferred method for an > admin to check the final result or identify when the final result has > been achieved? > > As suggested by Jun and mentioned in the section "Broker-side election algorithm" the KIP proposes (as of yesterday) to change how the election is performed. Rather than the current process of writing to the znode and having the controller react to that to perform the election, I am now proposing that the request from the AdminClient to the controller directly starts the election, and the response is not sent to the AdminClient until the election is complete. Thus, the API presented by the AdminClient is synchronous and the tool itself only returns when the elections have finished (or timed-out). Thanks again, Tom
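The synchronous contract Tom describes can be sketched from the caller's side. The names below are hypothetical (the broker-side RPC is stubbed so the sketch is self-contained), and `CompletableFuture` stands in for `KafkaFuture`:

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: the future completes only once the controller has
// finished the elections, so the command-line tool can simply block on it.
class ElectionSketch {
    static CompletableFuture<Void> electPreferredLeaders(Set<String> partitions) {
        // The real AdminClient would send the request to the controller here
        // and complete the future when the response arrives; stubbed for the
        // sketch so it completes immediately.
        return CompletableFuture.completedFuture(null);
    }

    public static void main(String[] args) throws Exception {
        // Blocks until the elections complete, or fails with TimeoutException
        // if the controller has not answered within the deadline.
        electPreferredLeaders(Set.of("my-topic-0")).get(30, TimeUnit.SECONDS);
        System.out.println("elections complete");
    }
}
```

The point of the design change is visible in the caller: no polling of metadata is needed, because completion of the future is itself the signal that the election finished (or timed out).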
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi, Thanks for the KIP. It looks good overall. On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote: > Hi Jun, > > Thanks for the feedback. > > I've updated the KIP to mention this new algorithm, which I agree will be > much better from the AdminClient PoV. > > I've also reverted the authorization to be against the cluster resource, > rather than the topic, since, on re-reading, Ewen wasn't particularly > advocating that it should be topic-based, merely wondering if that made > more sense in addition to the cluster-based check. > Hmm. In general I thought we were using Cluster:ClusterAction for authorizing internal requests made by Kafka itself, and Cluster:Alter for administrative requests initiated by the admin. For example, the RPCs added in KIP-140 use Cluster:Alter. Nitpick: I think you meant to remove this line from the KIP? > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to the > topic. Also... this command simply starts the leader election, but does not wait for it to complete, right? What is the preferred method for an admin to check the final result or identify when the final result has been achieved? best, Colin > Cheers, > > Tom > > On 11 August 2017 at 18:13, Jun Rao wrote: > > > Hi, Tom, > > > > 2. Yes, that's how manual preferred leader election currently works. I > > think we could use this opportunity to simplify the process a bit though. > > Doing preferred leader election is a lot cheaper than partition > > reassignment since it only involves writing the new leader to ZK. So, we > > probably don't need to persist the request in /admin/preferred_replica_ > > election > > in ZK. I was thinking of the following steps. > > (a) The controller receives PreferredLeaderElectionRequest. > > (b) The controller adds a PreferredReplicaLeaderElection event in the > > queue. > > (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be > > processed by the controller and then send a response. 
> > > > If the controller changes and dies in the middle of this process, the > > client will get an error and the client has to reissue the request. > > > > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1 > > inflight preferred leader election request, in step (a) above, we can set a > > flag in the controller to indicate that a preferred leader election is in > > progress. The flag will be reset once the controller finishes processing > > the request. Until the flag is reset, any new > > PreferredLeaderElectionRequest > > will be responded immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS > > error. > > > > Does this sound good to you? > > > > Thanks, > > > > Jun > > > > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley > > wrote: > > > > > Hi Jun, > > > > > > Thanks for your reply, I've got a few comments inline... > > > > > > On 11 August 2017 at 01:51, Jun Rao wrote: > > > > > > > Hi, Tom, > > > > > > > > Thanks for the KIP. Looks good overall. A few minor comments. > > > > > > > > 1. In most requests with topic partitions, we nest partitions inside > > > topic. > > > > So, instead of [topic partition_id], we do [topic, [partition_id]] to > > > save > > > > some space. > > > > > > > > > > Of course, now sorted. > > > > > > > > > > 2. The preferred leader election just tries to move the leader to the > > > > preferred replica. This may not succeed if the preferred replica is not > > > > in-sync yet. It would be useful to return an error code to reflect > > that. > > > > > > > > > > The algorithm the existing command uses is: > > > > > > 1. The command writes some JSON to the /admin/preferred_replica_election > > > znode. The command then returns asynchronously. It only signals an error > > if > > > the znode write failed. > > > 2. The controller is subscribed to changes to this znode and delegates to > > > the PartitionStateMachine to do the actual election. > > > 3. 
The PreferredReplicaPartitionLeaderSelector is used and throws a > > > StateChangeFailedException if the preferred leader is not in the ISR, or > > is > > > not alive. > > > 4. This exception will propagate to the ControllerEventManager which will > > > log it. > > > > > > In the event of controller failover, the JSON is read from the znode and > > > the election proceeds in a similar manner. So the central problem is the > > > asynchronicity. > > > > > > I agree it would be better if the return from the AdminClient were > > > synchronous. I'm still learning about how the internals of Kafka work. > > It's > > > not enough for the AdminManager to wait (with a timeout) for the metadata > > > cache to show that the leader has been elected, since we want to > > know > > > the reason if the election is not possible. So would an appropriate > > > mechanism to pass this information back from the KafkaController to the > > > AdminManager be to use another znode, or is there another way to do it? > > >
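The single-inflight guard Jun proposed in the quoted steps (reject new requests while an election is running, reset the flag when the controller finishes) could be implemented with an atomic flag. This is a minimal sketch with hypothetical names, not the actual controller code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the controller-side guard: at most one preferred
// leader election may be in flight; concurrent requests are rejected
// immediately rather than queued.
class ElectionGuard {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);

    // Returns true if this request may start the election; false means the
    // broker should answer with a REPLICA_LEADER_ELECTION_IN_PROGRESS error.
    boolean tryStart() {
        return inProgress.compareAndSet(false, true);
    }

    // Reset once the controller finishes processing the request.
    void finish() {
        inProgress.set(false);
    }
}
```

Using `compareAndSet` makes the check-and-set atomic, so two requests racing in step (a) cannot both believe they started the election.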
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Jun, Thanks for the feedback. I've updated the KIP to mention this new algorithm, which I agree will be much better from the AdminClient PoV. I've also reverted the authorization to be against the cluster resource, rather than the topic, since, on re-reading, Ewen wasn't particularly advocating that it should be topic-based, merely wondering if that made more sense in addition to the cluster-based check. Cheers, Tom On 11 August 2017 at 18:13, Jun Rao wrote: > Hi, Tom, > > 2. Yes, that's how manual preferred leader election currently works. I > think we could use this opportunity to simplify the process a bit though. > Doing preferred leader election is a lot cheaper than partition > reassignment since it only involves writing the new leader to ZK. So, we > probably don't need to persist the request in /admin/preferred_replica_ > election > in ZK. I was thinking of the following steps. > (a) The controller receives PreferredLeaderElectionRequest. > (b) The controller adds a PreferredReplicaLeaderElection event in the > queue. > (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be > processed by the controller and then send a response. > > If the controller changes and dies in the middle of this process, the > client will get an error and the client has to reissue the request. > > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1 > inflight preferred leader election request, in step (a) above, we can set a > flag in the controller to indicate that a preferred leader election is in > progress. The flag will be reset once the controller finishes processing > the request. Until the flag is reset, any new > PreferredLeaderElectionRequest > will be responded immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS > error. > > Does this sound good to you? > > Thanks, > > Jun > > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley > wrote: > > > Hi Jun, > > > > Thanks for your reply, I've got a few comments inline... 
> > > > On 11 August 2017 at 01:51, Jun Rao wrote: > > > > > Hi, Tom, > > > > > > Thanks for the KIP. Looks good overall. A few minor comments. > > > > > > 1. In most requests with topic partitions, we nest partitions inside > > topic. > > > So, instead of [topic partition_id], we do [topic, [partition_id]] to > > save > > > some space. > > > > > > > Of course, now sorted. > > > > > > > 2. The preferred leader election just tries to move the leader to the > > > preferred replica. This may not succeed if the preferred replica is not > > > in-sync yet. It would be useful to return an error code to reflect > that. > > > > > > > The algorithm the existing command uses is: > > > > 1. The command writes some JSON to the /admin/preferred_replica_election > > znode. The command then returns asynchronously. It only signals an error > if > > the znode write failed. > > 2. The controller is subscribed to changes to this znode and delegates to > > the PartitionStateMachine to do the actual election. > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a > > StateChangeFailedException if the preferred leader is not in the ISR, or > is > > not alive. > > 4. This exception will propagate to the ControllerEventManager which will > > log it. > > > > In the event of controller failover, the JSON is read from the znode and > > the election proceeds in a similar manner. So the central problem is the > > asynchronicity. > > > > I agree it would be better if the return from the AdminClient were > > synchronous. I'm still learning about how the internals of Kafka work. > It's > > not enough for the AdminManager to wait (with a timeout) for the metadata > > cache to show that the leader has been elected, since we want to > know > > the reason if the election is not possible. So would an appropriate > > mechanism to pass this information back from the KafkaController to the > > AdminManager be to use another znode, or is there another way to do it? 
> > I've not found any precedents for this, so please point them out if they > > exist. > > > > > > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is > > single > > > threaded, is that just for timeout? If so, perhaps we can just return > the > > > Timeout error code. > > > > > > > Again, this pertains to the current algorithm for the election: When the > > elections have been done the znode is deleted. Thus the presence of the > > znode indicates an election in progress, which causes the command to just > > give up (the user can try again later). > > > > We still need the znode in order to honour the existing elections in the > > event of controller failover part-way through, thus we need to do > something > > in the event that the znode exists at the time we want to write to it. > > > > I suppose it is possible to append to the JSON in the znode while > elections > > are already taking place, with the controller reading from the
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi, Tom, 2. Yes, that's how manual preferred leader election currently works. I think we could use this opportunity to simplify the process a bit though. Doing preferred leader election is a lot cheaper than partition reassignment since it only involves writing the new leader to ZK. So, we probably don't need to persist the request in /admin/preferred_replica_election in ZK. I was thinking of the following steps. (a) The controller receives PreferredLeaderElectionRequest. (b) The controller adds a PreferredReplicaLeaderElection event in the queue. (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be processed by the controller and then send a response. If the controller changes and dies in the middle of this process, the client will get an error and the client has to reissue the request. Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1 inflight preferred leader election request, in step (a) above, we can set a flag in the controller to indicate that a preferred leader election is in progress. The flag will be reset once the controller finishes processing the request. Until the flag is reset, any new PreferredLeaderElectionRequest will be responded immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS error. Does this sound good to you? Thanks, Jun On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley wrote: > Hi Jun, > > Thanks for your reply, I've got a few comments inline... > > On 11 August 2017 at 01:51, Jun Rao wrote: > > > Hi, Tom, > > > > Thanks for the KIP. Looks good overall. A few minor comments. > > > > 1. In most requests with topic partitions, we nest partitions inside > topic. > > So, instead of [topic partition_id], we do [topic, [partition_id]] to > save > > some space. > > > > Of course, now sorted. > > > > 2. The preferred leader election just tries to move the leader to the > > preferred replica. This may not succeed if the preferred replica is not > > in-sync yet. 
It would be useful to return an error code to reflect that. > > > > The algorithm the existing command uses is: > > 1. The command writes some JSON to the /admin/preferred_replica_election > znode. The command then returns asynchronously. It only signals an error if > the znode write failed. > 2. The controller is subscribed to changes to this znode and delegates to > the PartitionStateMachine to do the actual election. > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a > StateChangeFailedException if the preferred leader in not in the ISR, or is > not alive. > 4. This exception will propagate to the ControllerEventManager which will > log it. > > In the event of controller failover, the JSON is read from the znode and > the election proceeds in a similar manner. So the central problem is the > asynchronicity. > > I agree it would be better if the return from the AdminClient were > synchronous. I'm still learning about how the internals of Kafka work. It's > not enough for the AdminManager to wait (with a timeout) for the metadata > cache to show that the leader has been elected, since we need want to know > the reason if the election is not possible. So would an appropriate > mechanism to pass this information back from the KafkaController to the > AdminManager be to use another znode, or is there another way to do it? > I've not found any precedents for this, so please point them out if they > exist. > > > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is > single > > threaded, is that just for timeout? If so, perhaps we can just return the > > Timeout error code. > > > > Again, this pertains to the current algorithm for the election: When the > elections have been done the znode is deleted. Thus the presence of the > znode indicates an election in progress, which causes the command to just > give up (the user can try again later). 
> > We still need the znode in order to honour the existing elections in the > event of controller failover part-way through, thus we need to do something > in the event that the znode exists at the time we want to write to it. > > I suppose it is possible to append to the JSON in the znode while elections > are already taking place, with the controller reading from the head of the > JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a > recipe for this. But I don't see that pattern being used elsewhere in > Kafka, so I'm not sure if this is what you want. > > It is not clear (to me at least) that the current restriction preventing > calling new elections while elections are on-going is a problem in > practice. But since both you and Ewen have raised this issue I assume it > does need to be solved: I just need a bit more of a hint about whether > there's a preferred way to solve it. > > > > 4. Permission wise, it seems this should require ClusterAction on Cluster > > resource since it's changing things at the cluster level. > > > > As I said in my
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Jun,

Thanks for your reply, I've got a few comments inline...

On 11 August 2017 at 01:51, Jun Rao wrote:

> Hi, Tom,
>
> Thanks for the KIP. Looks good overall. A few minor comments.
>
> 1. In most requests with topic partitions, we nest partitions inside topic. So, instead of [topic partition_id], we do [topic, [partition_id]] to save some space.

Of course, now sorted.

> 2. The preferred leader election just tries to move the leader to the preferred replica. This may not succeed if the preferred replica is not in-sync yet. It would be useful to return an error code to reflect that.

The algorithm the existing command uses is:

1. The command writes some JSON to the /admin/preferred_replica_election znode. The command then returns asynchronously. It only signals an error if the znode write failed.
2. The controller is subscribed to changes to this znode and delegates to the PartitionStateMachine to do the actual election.
3. The PreferredReplicaPartitionLeaderSelector is used and throws a StateChangeFailedException if the preferred leader is not in the ISR, or is not alive.
4. This exception will propagate to the ControllerEventManager, which will log it.

In the event of controller failover, the JSON is read from the znode and the election proceeds in a similar manner. So the central problem is the asynchronicity.

I agree it would be better if the return from the AdminClient were synchronous. I'm still learning about how the internals of Kafka work. It's not enough for the AdminManager to wait (with a timeout) for the metadata cache to show that the leader has been elected, since we want to know the reason if the election is not possible. So would an appropriate mechanism to pass this information back from the KafkaController to the AdminManager be to use another znode, or is there another way to do it? I've not found any precedents for this, so please point them out if they exist.

> 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single threaded, is that just for timeout? If so, perhaps we can just return the Timeout error code.

Again, this pertains to the current algorithm for the election: when the elections have been done the znode is deleted. Thus the presence of the znode indicates an election in progress, which causes the command to just give up (the user can try again later).

We still need the znode in order to honour the existing elections in the event of controller failover part-way through, thus we need to do something in the event that the znode exists at the time we want to write to it.

I suppose it is possible to append to the JSON in the znode while elections are already taking place, with the controller reading from the head of the JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a recipe for this. But I don't see that pattern being used elsewhere in Kafka, so I'm not sure if this is what you want.

It is not clear (to me at least) that the current restriction preventing calling new elections while elections are on-going is a problem in practice. But since both you and Ewen have raised this issue I assume it does need to be solved: I just need a bit more of a hint about whether there's a preferred way to solve it.

> 4. Permission wise, it seems this should require ClusterAction on Cluster resource since it's changing things at the cluster level.

As I said in my reply to Ewen, I can see that point of view (and it's what I originally specified), so I don't mind changing it back, but it would be good to hear more opinions on this.

Cheers,

Tom
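The znode's presence is effectively acting as a lock here. As a rough illustration of the "one election in flight" guard being discussed in this thread, the controller-side flag could be as simple as a compare-and-set. The class and method names below are hypothetical, not from the KIP or the Kafka codebase:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a single-in-flight election guard; a false
// return from tryBegin() would map to the
// REPLICA_LEADER_ELECTION_IN_PROGRESS error.
public class ElectionGuard {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);

    // Returns true if the caller acquired the right to run an election.
    public boolean tryBegin() {
        return inProgress.compareAndSet(false, true);
    }

    // Reset once the controller finishes processing the request.
    public void finish() {
        inProgress.set(false);
    }
}
```

Because the controller event loop is single threaded, a plain boolean would also do; the atomic just makes the sketch safe to call from a request handler thread.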
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi, Tom,

Thanks for the KIP. Looks good overall. A few minor comments.

1. In most requests with topic partitions, we nest partitions inside topic. So, instead of [topic partition_id], we do [topic, [partition_id]] to save some space.

2. The preferred leader election just tries to move the leader to the preferred replica. This may not succeed if the preferred replica is not in-sync yet. It would be useful to return an error code to reflect that.

3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single threaded, is that just for timeout? If so, perhaps we can just return the Timeout error code.

4. Permission wise, it seems this should require ClusterAction on Cluster resource since it's changing things at the cluster level.

Thanks,

Jun

On Wed, Aug 9, 2017 at 6:16 AM, Ismael Juma wrote:

> On Wed, Aug 9, 2017 at 11:40 AM, Tom Bentley wrote:
>
> > There are responses with detailed error messages as well as the codes: CreateTopicsResponse, {Describe|Alter}ConfigsResponse, and the responses for managing ACLs, for instance. To be honest, I assumed including a message was the norm. In this case, however, I don't think there is any extra detail to include beyond the error itself, so I've removed it from the KIP.
>
> We started sending back actual error messages when we introduced the Create Topic Policy, as the errors are custom and much more helpful if a string can be sent back.
>
> In general, I think it would be better if we had an optional error message for all request types, as error codes alone sometimes result in people having to check the broker logs. Examples that come to mind are Group Coordinator Not Available and Invalid Request. For the latter, we want to include what was invalid, and it's not practical to have an error code for every possible validation error (this becomes more obvious once we start adding admin protocol APIs).
>
> Ismael
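Jun's point 1 is about request layout: naming each topic once and nesting its partition ids underneath, rather than repeating the topic for every partition. A small sketch of that grouping (the class and helper names are illustrative only):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of nesting partitions inside their topic,
// i.e. turning flat [topic, partition_id] pairs into [topic, [partition_id]].
public class TopicGrouping {
    public static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> flat) {
        Map<String, List<Integer>> nested = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : flat) {
            // Each topic name is stored once; all its partitions share the entry.
            nested.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return nested;
    }
}
```

The space saving is simply that a topic name appearing in N partitions is serialized once instead of N times.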
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
On Wed, Aug 9, 2017 at 11:40 AM, Tom Bentley wrote:

> There are responses with detailed error messages as well as the codes: CreateTopicsResponse, {Describe|Alter}ConfigsResponse, and the responses for managing ACLs, for instance. To be honest, I assumed including a message was the norm. In this case, however, I don't think there is any extra detail to include beyond the error itself, so I've removed it from the KIP.

We started sending back actual error messages when we introduced the Create Topic Policy, as the errors are custom and much more helpful if a string can be sent back.

In general, I think it would be better if we had an optional error message for all request types, as error codes alone sometimes result in people having to check the broker logs. Examples that come to mind are Group Coordinator Not Available and Invalid Request. For the latter, we want to include what was invalid, and it's not practical to have an error code for every possible validation error (this becomes more obvious once we start adding admin protocol APIs).

Ismael
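The shape Ismael is suggesting is an error code paired with an optional free-form message, with the message allowed to be absent. A minimal sketch, where the class name and fields are hypothetical rather than taken from the KIP:

```java
// Hypothetical sketch of a response error carrying a code plus an
// optional human-readable message; names are illustrative only.
public class ResponseError {
    private final short code;
    private final String message; // null when the code alone is enough

    public ResponseError(short code, String message) {
        this.code = code;
        this.message = message;
    }

    public short code() {
        return code;
    }

    // Falls back to a generic description when no custom message was sent,
    // which is what clients do today from the code alone.
    public String messageOrDefault(String fallback) {
        return message != null ? message : fallback;
    }
}
```

This keeps the wire format backward friendly: brokers that have nothing extra to say send no message, and clients fall back to the generic per-code description.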
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi Ewen,

Thanks for looking at the KIP. I've updated it for some of your comments, but see also a few replies inline.

On 9 August 2017 at 06:02, Ewen Cheslack-Postava wrote:

> Thanks for the KIP. Generally the move away from ZK and to native Kafka requests is good, so I'm generally +1 on this. A couple of comments/questions.
>
> * You gave the signature electPreferredReplicaLeader(Collection<TopicPartition> partitions) for the admin client. The old command allows not specifying the topic partitions, which results in election for all topic partitions. How would this be expressed in the new API? Does the tool need to do a metadata request and then issue the request for all topic partitions, or would there be a shortcut via the protocol (e.g. another electPreferredReplicaLeader() with no args which translates to, e.g., an empty list or null in the request)?

I was trying to figure out the best way to do this. It seems there are no methods in the AdminClient setting a precedent on this. I would prefer to avoid having to do a metadata request to discover all the partitions. So the two ways I can think of are:

1. As you suggest, an empty or null list in the request would signify "all partitions". This is not intuitive, and using an empty list is downright surprising (an empty list should be a no-op, after all).
2. We could have an allPartitions flag in the Options class. This is also unintuitive because it would ignore whatever partitions were in the list, though I suppose the AdminClient itself could check for this and signal an error.

Of these options, I think using a null list in the request is the least surprising.

> * All the existing *Options classes for the AdminClient have at least a timeout. Should we add that to this API's options?
>
> * Other AdminClient *Result classes provide some convenience methods when you just want all the results. Should we have that as well?
>
> * I don't think any other requests include error strings in the response, do they? I'm pretty sure we just generate the error string on the client side based on the error code. If an error code was ambiguous, we should probably just split it across 2 or more codes.

There are responses with detailed error messages as well as the codes: CreateTopicsResponse, {Describe|Alter}ConfigsResponse, and the responses for managing ACLs, for instance. To be honest, I assumed including a message was the norm. In this case, however, I don't think there is any extra detail to include beyond the error itself, so I've removed it from the KIP.

> * re: permissions, would there also be a per-topic permission required? (I haven't kept track of all the ACLs and required permissions, so I'm not sure, but a bunch of other operations check per-resource they are modifying -- this operation could be considered to be against either Cluster or Topics.)

I'm also unsure, and would welcome input from others about this. Note that there is some similarity here with some of the discussion on https://issues.apache.org/jira/browse/KAFKA-5638: to what extent should some permission on the Cluster imply some permission across all Topics in the cluster? My feeling is that there should be no such implication, because it will just make ACLs harder to reason about, and require more documentation about what implies what else.

Getting back to this specific case, it would make some sense for Alter on the Topic to be necessary, though it could be argued that the leader of a partition is an operational concern, and not really part of the topic configuration. Changing the leader of a bunch of partitions could affect partitions of other topics. On the other hand, anyone with Alter on the topic would presumably have the freedom to set the preferred leader for a partition, so it would be a little strange if they couldn't actually give effect to that choice. On balance I think I prefer needing Alter on the topic (and have adjusted the KIP accordingly), but I'm very happy to go with the consensus view.

> * Should we consider making the request only valid against the broker and having the command do a metadata request to discover the controller and then send the request directly to it? I think this could a) simplify the implementation a bit by taking the ZK node out of the equation, b) avoid the REPLICA_LEADER_ELECTION_IN_PROGRESS error, since this is only required due to using the ZK node to communicate with the controller afaict, and c) potentially open the door for a synchronous request/response in the future.

That's a really good point. I agree that the request should be made to the controller, because it allows us to make improvements in future. I'm not totally sure about taking ZK out of the equation completely, though: right now during controller failover the new controller checks the znode and ensures those elections get honoured. So it seems we need to keep ZK in the picture, and without rewriting how we read and write
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Thanks for the KIP. Generally the move away from ZK and to native Kafka requests is good, so I'm generally +1 on this. A couple of comments/questions.

* You gave the signature electPreferredReplicaLeader(Collection<TopicPartition> partitions) for the admin client. The old command allows not specifying the topic partitions, which results in election for all topic partitions. How would this be expressed in the new API? Does the tool need to do a metadata request and then issue the request for all topic partitions, or would there be a shortcut via the protocol (e.g. another electPreferredReplicaLeader() with no args which translates to, e.g., an empty list or null in the request)?

* All the existing *Options classes for the AdminClient have at least a timeout. Should we add that to this API's options?

* Other AdminClient *Result classes provide some convenience methods when you just want all the results. Should we have that as well?

* I don't think any other requests include error strings in the response, do they? I'm pretty sure we just generate the error string on the client side based on the error code. If an error code was ambiguous, we should probably just split it across 2 or more codes.

* re: permissions, would there also be a per-topic permission required? (I haven't kept track of all the ACLs and required permissions, so I'm not sure, but a bunch of other operations check per-resource they are modifying -- this operation could be considered to be against either Cluster or Topics.)

* Should we consider making the request only valid against the broker and having the command do a metadata request to discover the controller and then send the request directly to it? I think this could a) simplify the implementation a bit by taking the ZK node out of the equation, b) avoid the REPLICA_LEADER_ELECTION_IN_PROGRESS error, since this is only required due to using the ZK node to communicate with the controller afaict, and c) potentially open the door for a synchronous request/response in the future.

-Ewen

On Mon, Aug 7, 2017 at 8:21 AM, Tom Bentley wrote:

> Hi,
>
> I've updated this KIP slightly, to clarify a couple of points.
>
> One thing in particular that I would like feedback on is what authorization should be required for triggering the election of the preferred leader?
>
> Another thing would be whether the request and responses should be grouped by topic name. This would make for smaller messages when triggering elections for multiple partitions of the same topic.
>
> I'd be grateful for any feedback you may have.
>
> Cheers,
>
> Tom
>
> On 2 August 2017 at 18:34, Tom Bentley wrote:
>
> > In a similar vein to KIP-179 I've created KIP-183 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+PreferredReplicaLeaderElectionCommand+to+use+AdminClient), which is about deprecating the --zookeeper option to kafka-preferred-replica-election.sh and replacing it with an option which would use a new AdminClient-based API.
> >
> > As it stands the KIP is focussed on simply moving the existing functionality behind the AdminClient.
> >
> > I'd be grateful for any feedback people may have on this.
> >
> > Thanks,
> >
> > Tom
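Ewen's point about *Options classes refers to the AdminClient convention that every operation takes an Options object carrying at least a timeout. A minimal sketch following that convention (the class name below is hypothetical; the real name would be fixed by the KIP):

```java
// Hypothetical sketch of an Options class in the AdminClient style:
// a fluent setter for a timeout, with null meaning "use the client default".
public class ElectPreferredLeadersOptions {
    private Integer timeoutMs = null;

    public ElectPreferredLeadersOptions timeoutMs(Integer timeoutMs) {
        this.timeoutMs = timeoutMs;
        return this; // fluent, matching the other *Options classes
    }

    public Integer timeoutMs() {
        return timeoutMs;
    }
}
```

The fluent return lets callers write, e.g., `new ElectPreferredLeadersOptions().timeoutMs(30000)` inline at the call site, as with the other AdminClient operations.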
Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
Hi,

I've updated this KIP slightly, to clarify a couple of points.

One thing in particular that I would like feedback on is what authorization should be required for triggering the election of the preferred leader?

Another thing would be whether the request and responses should be grouped by topic name. This would make for smaller messages when triggering elections for multiple partitions of the same topic.

I'd be grateful for any feedback you may have.

Cheers,

Tom

On 2 August 2017 at 18:34, Tom Bentley wrote:

> In a similar vein to KIP-179 I've created KIP-183 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+PreferredReplicaLeaderElectionCommand+to+use+AdminClient), which is about deprecating the --zookeeper option to kafka-preferred-replica-election.sh and replacing it with an option which would use a new AdminClient-based API.
>
> As it stands the KIP is focussed on simply moving the existing functionality behind the AdminClient.
>
> I'd be grateful for any feedback people may have on this.
>
> Thanks,
>
> Tom
[DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient
In a similar vein to KIP-179 I've created KIP-183 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+PreferredReplicaLeaderElectionCommand+to+use+AdminClient), which is about deprecating the --zookeeper option to kafka-preferred-replica-election.sh and replacing it with an option which would use a new AdminClient-based API.

As it stands the KIP is focussed on simply moving the existing functionality behind the AdminClient.

I'd be grateful for any feedback people may have on this.

Thanks,

Tom