Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-09-08 Thread Tom Bentley
Hi Jun,

The patches that I've got currently wait for the elections to complete
before returning the response. Is that the semantic you wanted?

Cheers,

Tom

On 7 September 2017 at 22:14, Jun Rao  wrote:

> Hi, Tom,
>
> It seems that it's useful to know whether the leader is balanced for each
> partition in ElectPreferredLeadersResult, instead of just whether the
> election was attempted?
>
> Thanks,
>
> Jun
>
> On Wed, Sep 6, 2017 at 4:03 PM, Colin McCabe  wrote:
>
> > On Wed, Sep 6, 2017, at 01:18, Tom Bentley wrote:
> > > Hi Colin,
> > >
> > > Thanks for taking the time to respond.
> > >
> > > On 5 September 2017 at 22:22, Colin McCabe  wrote:
> > >
> > > > ...
> > > > Why does there need to be a map at all in the API?
> > >
> > >
> > > From a purely technical PoV there doesn't, but doing something else
> would
> > > make the API inconsistent with other similar AdminClient *Results
> > > classes,
> > > which all expose a Map directly.
> > >
> > >
> > > > Why not just have
> > > > something like this:
> > > >
> > >
> > > I agree this would be a better solution. I will update the KIP and ask
> > > people to vote again. (Is that the right process?)
> > >
> > > It might be worth bearing this in mind for future AdminClient APIs:
> > > Exposing a Map directly means you can't retrofit handling a null
> argument
> > > to mean "all the things", whereas wrapping the map would allow that.
> >
> > That's a good point.
> >
> > I guess the important thing to keep in mind is that if you return a map
> > from a results class, it has to be instantiated eagerly.  It has to be
> > something you know before any RPCs are made, async actions are
> > performed, etc.
> >
> > best,
> > Colin
> >
> > >
> > > Thanks again,
> > >
> > > Tom
> >
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-09-07 Thread Jun Rao
Hi, Tom,

It seems that it's useful to know whether the leader is balanced for each
partition in ElectPreferredLeadersResult, instead of just whether the
election was attempted?

Thanks,

Jun

On Wed, Sep 6, 2017 at 4:03 PM, Colin McCabe  wrote:

> On Wed, Sep 6, 2017, at 01:18, Tom Bentley wrote:
> > Hi Colin,
> >
> > Thanks for taking the time to respond.
> >
> > On 5 September 2017 at 22:22, Colin McCabe  wrote:
> >
> > > ...
> > > Why does there need to be a map at all in the API?
> >
> >
> > From a purely technical PoV there doesn't, but doing something else would
> > make the API inconsistent with other similar AdminClient *Results
> > classes,
> > which all expose a Map directly.
> >
> >
> > > Why not just have
> > > something like this:
> > >
> >
> > I agree this would be a better solution. I will update the KIP and ask
> > people to vote again. (Is that the right process?)
> >
> > It might be worth bearing this in mind for future AdminClient APIs:
> > Exposing a Map directly means you can't retrofit handling a null argument
> > to mean "all the things", whereas wrapping the map would allow that.
>
> That's a good point.
>
> I guess the important thing to keep in mind is that if you return a map
> from a results class, it has to be instantiated eagerly.  It has to be
> something you know before any RPCs are made, async actions are
> performed, etc.
>
> best,
> Colin
>
> >
> > Thanks again,
> >
> > Tom
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-09-06 Thread Colin McCabe
On Wed, Sep 6, 2017, at 01:18, Tom Bentley wrote:
> Hi Colin,
> 
> Thanks for taking the time to respond.
> 
> On 5 September 2017 at 22:22, Colin McCabe  wrote:
> 
> > ...
> > Why does there need to be a map at all in the API?
> 
> 
> From a purely technical PoV there doesn't, but doing something else would
> make the API inconsistent with other similar AdminClient *Results
> classes,
> which all expose a Map directly.
> 
> 
> > Why not just have
> > something like this:
> >
> 
> I agree this would be a better solution. I will update the KIP and ask
> people to vote again. (Is that the right process?)
> 
> It might be worth bearing this in mind for future AdminClient APIs:
> Exposing a Map directly means you can't retrofit handling a null argument
> to mean "all the things", whereas wrapping the map would allow that.

That's a good point.

I guess the important thing to keep in mind is that if you return a map
from a results class, it has to be instantiated eagerly.  It has to be
something you know before any RPCs are made, async actions are
performed, etc.
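To illustrate the constraint Colin describes, here is a hypothetical sketch (using `java.util.concurrent.CompletableFuture` and a stand-in `TopicPartition` record rather than Kafka's real classes): a results class that returns a Map directly must fix its key set at construction time, before any RPC completes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Stand-in for org.apache.kafka.common.TopicPartition (an assumption, not the real class).
record TopicPartition(String topic, int partition) {}

// A results class exposing a Map directly: the keys must be known eagerly,
// before any RPC is made; only the per-key futures complete later.
class EagerResults {
    private final Map<TopicPartition, CompletableFuture<Void>> futures = new HashMap<>();

    EagerResults(List<TopicPartition> partitions) {
        for (TopicPartition tp : partitions) {
            futures.put(tp, new CompletableFuture<>()); // values complete later
        }
    }

    Map<TopicPartition, CompletableFuture<Void>> values() {
        return futures; // usable immediately, but only for keys fixed up front
    }
}

public class EagerDemo {
    public static void main(String[] args) {
        TopicPartition tp = new TopicPartition("my-topic", 0);
        EagerResults results = new EagerResults(List.of(tp));
        // The map is available before any response arrives...
        System.out.println(results.values().keySet());
        // ...which is exactly why a null "all partitions" argument is awkward:
        // there are no keys to put in the map yet.
    }
}
```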

best,
Colin

> 
> Thanks again,
> 
> Tom


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-09-06 Thread Tom Bentley
Hi Colin,

Thanks for taking the time to respond.

On 5 September 2017 at 22:22, Colin McCabe  wrote:

> ...
> Why does there need to be a map at all in the API?


From a purely technical PoV there doesn't, but doing something else would
make the API inconsistent with other similar AdminClient *Results classes,
which all expose a Map directly.


> Why not just have
> something like this:
>

I agree this would be a better solution. I will update the KIP and ask
people to vote again. (Is that the right process?)

It might be worth bearing this in mind for future AdminClient APIs:
Exposing a Map directly means you can't retrofit handling a null argument
to mean "all the things", whereas wrapping the map would allow that.
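A hypothetical sketch of the wrapped alternative (`CompletableFuture` standing in for `KafkaFuture`, with a stand-in `TopicPartition` record): because the map itself sits behind a future, the key set can be discovered from the response, which is what makes a null "all partitions" argument retrofittable.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Stand-in for Kafka's TopicPartition (an assumption, not the real class).
record TopicPartition(String topic, int partition) {}

// Wrapping the map in a future: keys need not be known until a response arrives.
class WrappedResults {
    private final CompletableFuture<Map<TopicPartition, CompletableFuture<Void>>> futureMap;

    WrappedResults(CompletableFuture<Map<TopicPartition, CompletableFuture<Void>>> futureMap) {
        this.futureMap = futureMap;
    }

    CompletableFuture<Map<TopicPartition, CompletableFuture<Void>>> values() {
        return futureMap; // completes once the partitions are known
    }
}

public class WrappedDemo {
    public static void main(String[] args) throws Exception {
        var futureMap = new CompletableFuture<Map<TopicPartition, CompletableFuture<Void>>>();
        WrappedResults results = new WrappedResults(futureMap);

        // Simulate the response revealing which partitions were involved:
        TopicPartition tp = new TopicPartition("my-topic", 0);
        futureMap.complete(Map.of(tp, CompletableFuture.<Void>completedFuture(null)));

        System.out.println(results.values().get().keySet());
    }
}
```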

Thanks again,

Tom


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-09-05 Thread Colin McCabe
On Mon, Sep 4, 2017, at 04:54, Tom Bentley wrote:
> The KIP has been adopted after a successful vote.

Thanks for working on this, Tom.  It's a nice improvement.

> 
> Unfortunately I've discovered that there's an annoying detail in the
> handling of the case that electPreferredLeaders() is called with a null
> partitions argument. As discussed with Ewen, this is supposed to mean
> "all
> partitions", but we don't know all the partitions in the AdminClient, yet
> we have to return a ElectPreferredLeadersResults instance, supposedly
> with
> the partitions as keys.
> 
> We could handle this by passing a
> KafkaFuture<Map<TopicPartition, KafkaFuture<Void>>> to the ctor of
> ElectPreferredLeadersResults, instead of
> an extant Map<TopicPartition, KafkaFuture<Void>> (the API of
> ElectPreferredLeadersResults would not change). In the case that the
> partitions argument was not null this future will already be completed.
> In
> the case where partitions argument was null this future will be completed
> when we have a response from which we discover the partitions; in the
> meantime the AdminClient can carry on being used for other calls. So in
> the
> normal case there's not really a problem.
> 
> The problem comes where there's an exception *before we get the
> response*,
> that means we still don't know the partitions to populate the map with.
> In
> practice this would mean that an exception could propagate out of
> ElectPreferredLeadersResults.values() rather than when the map was
> accessed
> element-wise. Since we control the API of ElectPreferredLeadersResults we
> > > could document that values() (and consequently all()) could throw. We
> could even use checked exceptions, though since the exception would only
> happen in the case that the partitions argument was null that would feel
> rather heavy-handed to me.
> 
> Another alternative would be to block in
> AdminClient.electPreferredLeaders()
> in the case that the partitions argument was null, and if there was an
> error let the exception propagate out of electPreferredLeaders()
> directly.
> 
> Sorry about having to ask about this once people have already voted, but
> what do people think about these options?

I think we need to be very careful to keep the APIs asynchronous all the
time.  Having electPreferredLeaders() or values() sometimes block might
superficially seem reasonable, but it destroys the usefulness of the API
for true async programming.  Basically the async user is forced to put
the call into a thread pool in case it decides to block.  And they
cannot do nice things like chain Futures.

Why does there need to be a map at all in the API?  Why not just have
something like this:

  ElectPreferredLeadersResults {
    /**
     * Get the result of the election for the given TopicPartition.
     * If there was not an election triggered for the given TopicPartition,
     * the future will complete with an error.
     */
    public KafkaFuture<Void> partitionResult(TopicPartition topicPartition);

    /**
     * Get the topic partitions on which we attempted to trigger an election.
     * This tracks attempts, not successes.  A partition will appear in this
     * result even if the election was not successfully triggered.
     */
    public KafkaFuture<Set<TopicPartition>> partitions();

    /**
     * Return a future which gives an error result if we fail for any
     * partition.
     */
    public KafkaFuture<Void> all();
  }

We can fill in all this information when we actually know it.  In some
cases that will be later than others.  But all the calls can immediately
return a KafkaFuture, not block or throw an exception.
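A hypothetical async caller of such a result class (`CompletableFuture` and a stand-in `TopicPartition` record used in place of Kafka's real types) can chain work onto the future instead of blocking or resorting to a thread pool:

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Stand-ins for Kafka's classes (assumptions, not the real API).
record TopicPartition(String topic, int partition) {}

class ElectPreferredLeadersResults {
    private final CompletableFuture<Set<TopicPartition>> partitions;

    ElectPreferredLeadersResults(CompletableFuture<Set<TopicPartition>> partitions) {
        this.partitions = partitions;
    }

    // Completes when we know which partitions an election was attempted on.
    CompletableFuture<Set<TopicPartition>> partitions() { return partitions; }
}

public class AsyncDemo {
    public static void main(String[] args) {
        var partitions = new CompletableFuture<Set<TopicPartition>>();
        var results = new ElectPreferredLeadersResults(partitions);

        // Chain a continuation without blocking the calling thread:
        results.partitions()
               .thenAccept(ps -> System.out.println("attempted on " + ps.size() + " partition(s)"));

        // Later, the RPC response completes the future and the chain fires.
        partitions.complete(Set.of(new TopicPartition("my-topic", 0)));
    }
}
```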

best,
Colin


> 
> Thanks,
> 
> Tom
> 
> On 30 August 2017 at 16:55, Tom Bentley  wrote:
> 
> > I've updated in the KIP.
> >
> > Thanks,
> >
> > Tom
> >
> > On 30 August 2017 at 16:42, Ismael Juma  wrote:
> >
> >> If you agree with the change, yes, please rename. It's OK to make changes
> >> after the VOTE thread starts. In cases where some people have already
> >> voted, it's recommended to mention the changes in the VOTE thread as a
> >> heads up. Generally, we don't restart the vote unless the changes are
> >> significant.
> >>
> >> Ismael
> >>
> >> On Wed, Aug 30, 2017 at 4:26 PM, Tom Bentley 
> >> wrote:
> >>
> >> > Hi Ismael,
> >> >
> >> > I agree that `electPreferredReplicaLeader` is a mouthful and am happy to
> >> > change it to `electPreferredLeaders`. I'd rename the corresponding request
> >> and
> >> > response similarly.
> >> >
> >> > Should I rename it in the KIP now, even though I initiated a VOTE thread
> >> > yesterday?
> >> >
> >> > Cheers,
> >> >
> >> > Tom
> >> >
> >> > On 30 August 2017 at 16:01, Ismael Juma  wrote:
> >> >
> >> > > Hi Tom,
> >> > >
> >> > > Thanks for the KIP, it's a useful one. I find the proposed method name
> >> > > `electPreferredReplicaLeader` a little hard to read. It seems that a
> >> > small
> >> > > change would 

Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-09-04 Thread Tom Bentley
The KIP has been adopted after a successful vote.

Unfortunately I've discovered that there's an annoying detail in the
handling of the case that electPreferredLeaders() is called with a null
partitions argument. As discussed with Ewen, this is supposed to mean "all
partitions", but we don't know all the partitions in the AdminClient, yet
we have to return an ElectPreferredLeadersResults instance, supposedly with
the partitions as keys.

We could handle this by passing a KafkaFuture<Map<TopicPartition,
KafkaFuture<Void>>> to the ctor of ElectPreferredLeadersResults, instead of
an extant Map<TopicPartition, KafkaFuture<Void>> (the API of
ElectPreferredLeadersResults would not change). In the case that the
partitions argument was not null this future will already be completed. In
the case where partitions argument was null this future will be completed
when we have a response from which we discover the partitions; in the
meantime the AdminClient can carry on being used for other calls. So in the
normal case there's not really a problem.

The problem comes where there's an exception *before we get the response*,
that means we still don't know the partitions to populate the map with. In
practice this would mean that an exception could propagate out of
ElectPreferredLeadersResults.values() rather than when the map was accessed
element-wise. Since we control the API of ElectPreferredLeadersResults we
could document that values() (and consequently all()) could throw. We
could even use checked exceptions, though since the exception would only
happen in the case that the partitions argument was null that would feel
rather heavy-handed to me.

Another alternative would be to block in AdminClient.electPreferredLeaders()
in the case that the partitions argument was null, and if there was an
error let the exception propagate out of electPreferredLeaders() directly.
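To make the failure mode of the first option concrete, here is a hypothetical sketch (`CompletableFuture` standing in for `KafkaFuture`, stand-in `TopicPartition`): if the request fails before any response arrives, the partitions are never known, so the error surfaces from the outer future that values()/all() would expose, rather than per-partition.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Stand-in for Kafka's TopicPartition (an assumption, not the real class).
record TopicPartition(String topic, int partition) {}

public class FailureDemo {
    public static void main(String[] args) {
        // The ctor would receive this future instead of an extant map.
        var futureMap = new CompletableFuture<Map<TopicPartition, CompletableFuture<Void>>>();

        // Simulate a failure before any response: the partitions are never known.
        futureMap.completeExceptionally(
                new RuntimeException("request failed before response (simulated)"));

        try {
            futureMap.join(); // values()/all() would surface the error here
        } catch (CompletionException e) {
            System.out.println("values() failed: " + e.getCause().getMessage());
        }
    }
}
```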

Sorry about having to ask about this once people have already voted, but
what do people think about these options?

Thanks,

Tom

On 30 August 2017 at 16:55, Tom Bentley  wrote:

> I've updated in the KIP.
>
> Thanks,
>
> Tom
>
> On 30 August 2017 at 16:42, Ismael Juma  wrote:
>
>> If you agree with the change, yes, please rename. It's OK to make changes
>> after the VOTE thread starts. In cases where some people have already
>> voted, it's recommended to mention the changes in the VOTE thread as a
>> heads up. Generally, we don't restart the vote unless the changes are
>> significant.
>>
>> Ismael
>>
>> On Wed, Aug 30, 2017 at 4:26 PM, Tom Bentley 
>> wrote:
>>
>> > Hi Ismael,
>> >
>> > I agree that `electPreferredReplicaLeader` is a mouthful and am happy to
>> > change it to `electPreferredLeaders`. I'd rename the corresponding request
>> and
>> > response similarly.
>> >
>> > Should I rename it in the KIP now, even though I initiated a VOTE thread
>> > yesterday?
>> >
>> > Cheers,
>> >
>> > Tom
>> >
>> > On 30 August 2017 at 16:01, Ismael Juma  wrote:
>> >
>> > > Hi Tom,
>> > >
>> > > Thanks for the KIP, it's a useful one. I find the proposed method name
>> > > `electPreferredReplicaLeader` a little hard to read. It seems that a
>> > small
>> > > change would make it clearer: `electPreferredReplicaAsLeader`. The
>> next
>> > > point is that this is a batch API, so it should ideally be plural like
>> > the
>> > > other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`,
>> but
>> > > that's quite a mouthful. Maybe we should shorten it to
>> > > `electPreferredLeaders`. Thoughts?
>> > >
>> > > Ismael
>> > >
>> > > On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley 
>> > wrote:
>> > >
>> > > > In a similar vein to KIP-179 I've created KIP-183 (
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
>> > > > PreferredReplicaLeaderElectionCommand+to+use+AdminClient)
>> > > > which is about deprecating the --zookeeper option to
>> > > > kafka-preferred-replica-election.sh and replacing it with an option
>> > > which
>> > > > would use a new AdminClient-based API.
>> > > >
>> > > > As it stands the KIP is focussed on simply moving the existing
>> > > > functionality behind the AdminClient.
>> > > >
>> > > > I'd be grateful for any feedback people may have on this.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Tom
>> > > >
>> > >
>> >
>>
>
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-30 Thread Tom Bentley
I've updated in the KIP.

Thanks,

Tom

On 30 August 2017 at 16:42, Ismael Juma  wrote:

> If you agree with the change, yes, please rename. It's OK to make changes
> after the VOTE thread starts. In cases where some people have already
> voted, it's recommended to mention the changes in the VOTE thread as a
> heads up. Generally, we don't restart the vote unless the changes are
> significant.
>
> Ismael
>
> On Wed, Aug 30, 2017 at 4:26 PM, Tom Bentley 
> wrote:
>
> > Hi Ismael,
> >
> > I agree that `electPreferredReplicaLeader` is a mouthful and am happy to
> > change it to `electPreferredLeaders`. I'd rename the corresponding request
> and
> > response similarly.
> >
> > Should I rename it in the KIP now, even though I initiated a VOTE thread
> > yesterday?
> >
> > Cheers,
> >
> > Tom
> >
> > On 30 August 2017 at 16:01, Ismael Juma  wrote:
> >
> > > Hi Tom,
> > >
> > > Thanks for the KIP, it's a useful one. I find the proposed method name
> > > `electPreferredReplicaLeader` a little hard to read. It seems that a
> > small
> > > change would make it clearer: `electPreferredReplicaAsLeader`. The
> next
> > > point is that this is a batch API, so it should ideally be plural like
> > the
> > > other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`,
> but
> > > that's quite a mouthful. Maybe we should shorten it to
> > > `electPreferredLeaders`. Thoughts?
> > >
> > > Ismael
> > >
> > > On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley 
> > wrote:
> > >
> > > > In a similar vein to KIP-179 I've created KIP-183 (
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
> > > > PreferredReplicaLeaderElectionCommand+to+use+AdminClient)
> > > > which is about deprecating the --zookeeper option to
> > > > kafka-preferred-replica-election.sh and replacing it with an option
> > > which
> > > > would use a new AdminClient-based API.
> > > >
> > > > As it stands the KIP is focussed on simply moving the existing
> > > > functionality behind the AdminClient.
> > > >
> > > > I'd be grateful for any feedback people may have on this.
> > > >
> > > > Thanks,
> > > >
> > > > Tom
> > > >
> > >
> >
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-30 Thread Ismael Juma
If you agree with the change, yes, please rename. It's OK to make changes
after the VOTE thread starts. In cases where some people have already
voted, it's recommended to mention the changes in the VOTE thread as a
heads up. Generally, we don't restart the vote unless the changes are
significant.

Ismael

On Wed, Aug 30, 2017 at 4:26 PM, Tom Bentley  wrote:

> Hi Ismael,
>
> I agree that `electPreferredReplicaLeader` is a mouthful and am happy to
> change it to `electPreferredLeaders`. I'd rename the corresponding request and
> response similarly.
>
> Should I rename it in the KIP now, even though I initiated a VOTE thread
> yesterday?
>
> Cheers,
>
> Tom
>
> On 30 August 2017 at 16:01, Ismael Juma  wrote:
>
> > Hi Tom,
> >
> > Thanks for the KIP, it's a useful one. I find the proposed method name
> > `electPreferredReplicaLeader` a little hard to read. It seems that a
> small
> > change would make it clearer: `electPreferredReplicaAsLeader`. The next
> > point is that this is a batch API, so it should ideally be plural like
> the
> > other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, but
> > that's quite a mouthful. Maybe we should shorten it to
> > `electPreferredLeaders`. Thoughts?
> >
> > Ismael
> >
> > On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley 
> wrote:
> >
> > > In a similar vein to KIP-179 I've created KIP-183 (
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
> > > PreferredReplicaLeaderElectionCommand+to+use+AdminClient)
> > > which is about deprecating the --zookeeper option to
> > > kafka-preferred-replica-election.sh and replacing it with an option
> > which
> > > would use a new AdminClient-based API.
> > >
> > > As it stands the KIP is focussed on simply moving the existing
> > > functionality behind the AdminClient.
> > >
> > > I'd be grateful for any feedback people may have on this.
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> >
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-30 Thread Tom Bentley
Hi Ismael,

I agree that `electPreferredReplicaLeader` is a mouthful and am happy to
change it to `electPreferredLeaders`. I'd rename the corresponding request and
response similarly.

Should I rename it in the KIP now, even though I initiated a VOTE thread
yesterday?

Cheers,

Tom

On 30 August 2017 at 16:01, Ismael Juma  wrote:

> Hi Tom,
>
> Thanks for the KIP, it's a useful one. I find the proposed method name
> `electPreferredReplicaLeader` a little hard to read. It seems that a small
> change would make it clearer: `electPreferredReplicaAsLeader`. The next
> point is that this is a batch API, so it should ideally be plural like the
> other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, but
> that's quite a mouthful. Maybe we should shorten it to
> `electPreferredLeaders`. Thoughts?
>
> Ismael
>
> On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley  wrote:
>
> > In a similar vein to KIP-179 I've created KIP-183 (
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
> > PreferredReplicaLeaderElectionCommand+to+use+AdminClient)
> > which is about deprecating the --zookeeper option to
> > kafka-preferred-replica-election.sh and replacing it with an option
> which
> > would use a new AdminClient-based API.
> >
> > As it stands the KIP is focussed on simply moving the existing
> > functionality behind the AdminClient.
> >
> > I'd be grateful for any feedback people may have on this.
> >
> > Thanks,
> >
> > Tom
> >
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-30 Thread Ismael Juma
Hi Tom,

Thanks for the KIP, it's a useful one. I find the proposed method name
`electPreferredReplicaLeader` a little hard to read. It seems that a small
change would make it clearer: `electPreferredReplicaAsLeader`. The next
point is that this is a batch API, so it should ideally be plural like the
other AdminClient methods. Maybe `electPreferredReplicasAsLeaders`, but
that's quite a mouthful. Maybe we should shorten it to
`electPreferredLeaders`. Thoughts?

Ismael

On Wed, Aug 2, 2017 at 6:34 PM, Tom Bentley  wrote:

> In a similar vein to KIP-179 I've created KIP-183 (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
> PreferredReplicaLeaderElectionCommand+to+use+AdminClient)
> which is about deprecating the --zookeeper option to
> kafka-preferred-replica-election.sh and replacing it with an option which
> would use a new AdminClient-based API.
>
> As it stands the KIP is focussed on simply moving the existing
> functionality behind the AdminClient.
>
> I'd be grateful for any feedback people may have on this.
>
> Thanks,
>
> Tom
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-23 Thread Jun Rao
Hi, Colin,

That's a good point. I agree that Alter on Cluster resource is more
appropriate.

Thanks,

Jun


On Tue, Aug 22, 2017 at 1:42 PM, Colin McCabe  wrote:

> Hi,
>
> Thanks for the KIP.  It looks good overall.
>
> On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote:
> > Hi Jun,
> >
> > Thanks for the feedback.
> >
> > I've updated the KIP to mention this new algorithm, which I agree will be
> > much better from the AdminClient PoV.
> >
> > I've also reverted the authorization to be against the cluster resource,
> > rather than the topic, since, on re-reading, Ewen wasn't particularly
> > advocating that it should be topic-based, merely wondering if that made
> > more sense in addition to the cluster-based check.
> >
>
> Hmm.  In general I thought we were using Cluster:ClusterAction for
> authorizing internal requests made by Kafka itself, and Cluster:Alter
> for administrative requests initiated by the admin.  For example, the
> RPCs added in KIP-140 use Cluster:Alter.
>
> Nitpick: I think you meant to remove this line from the KIP?
>
> > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to
> the topic.
>
> Also... this command simply starts the leader election, but does not
> wait for it to complete, right?  What is the preferred method for an
> admin to check the final result or identify when the final result has
> been achieved?
>
> best,
> Colin
>
>
> > Cheers,
> >
> > Tom
> >
> > On 11 August 2017 at 18:13, Jun Rao  wrote:
> >
> > > Hi, Tom,
> > >
> > > 2. Yes, that's how manual preferred leader election currently works. I
> > > think we could use this opportunity to simplify the process a bit
> though.
> > > Doing preferred leader election is a lot cheaper than partition
> > > reassignment since it only involves writing the new leader to ZK. So,
> we
> > > probably don't need to persist the request in
> > > /admin/preferred_replica_election in ZK. I was thinking of the
> > > following steps.
> > > (a) The controller receives PreferredLeaderElectionRequest.
> > > (b) The controller adds a PreferredReplicaLeaderEelection event in the
> > > queue.
> > > (c) Wait up to the timeout for the PreferredReplicaLeaderEelection to
> be
> > > processed by the controller and then send a response.
> > >
> > > If the controller changes and dies in the middle of this process, the
> > > client will get an error and the client has to reissue the request.
> > >
> > > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more
> than 1
> > > inflight preferred leader election request, in step (a) above, we can
> set a
> > > flag in the controller to indicate that a preferred leader election is
> in
> > > progress. The flag will be reset once the controller finishes
> processing
> > > the request. Until the flag is reset, any new
> > > PreferredLeaderElectionRequest
> > > will be responded immediately with a REPLICA_LEADER_ELECTION_IN_
> PROGRESS
> > >  error.
> > >
> > > Does this sound good to you?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley 
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thanks for your reply, I've got a few comment inline...
> > > >
> > > > On 11 August 2017 at 01:51, Jun Rao  wrote:
> > > >
> > > > > Hi, Tom,
> > > > >
> > > > > Thanks for the KIP. Looks good overall. A few minor comments.
> > > > >
> > > > > 1. In most requests with topic partitions, we nest partitions
> inside
> > > > topic.
> > > > > So, instead of [topic partition_id], we do [topic, [partition_id]]
> to
> > > > save
> > > > > some space.
> > > > >
> > > >
> > > > Of course, now sorted.
> > > >
> > > >
> > > > > 2. The preferred leader election just tries to move the leader to
> the
> > > > > preferred replica. This may not succeed if the preferred replica
> is not
> > > > > in-sync yet. It would be useful to return an error code to reflect
> > > that.
> > > > >
> > > >
> > > > The algorithm the existing command uses is:
> > > >
> > > > 1. The command writes some JSON to the /admin/preferred_replica_election
> > > > znode. The command then returns asynchronously. It only signals an error
> > > > if the znode write failed.
> > > > 2. The controller is subscribed to changes to this znode and
> delegates to
> > > > the PartitionStateMachine to do the actual election.
> > > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> > > > StateChangeFailedException if the preferred leader is not in the ISR,
> > > > or is not alive.
> > > > 4. This exception will propagate to the ControllerEventManager which
> will
> > > > log it.
> > > >
> > > > In the event of controller failover, the JSON is read from the znode
> and
> > > > the election proceeds in a similar manner. So the central problem is
> the
> > > > asynchronicity.
> > > >
> > > > I agree it would be better if the return from the AdminClient were
> > > > synchronous. I'm still 
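The single-inflight guard described in the quoted steps could be sketched roughly as follows (hypothetical names; the real controller is event-loop based, this only shows the flag semantics):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the "one election at a time" flag (hypothetical names,
// not actual controller code).
class ElectionGuard {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);

    // Returns an error-code string instead of building a real RPC response.
    String handleRequest(Runnable election) {
        // (a) atomically set the flag; reject if an election is already running
        if (!inProgress.compareAndSet(false, true)) {
            return "REPLICA_LEADER_ELECTION_IN_PROGRESS";
        }
        try {
            // (b)+(c) enqueue the event and wait for it, simplified to a direct call
            election.run();
            return "NONE";
        } finally {
            inProgress.set(false); // reset once the controller finishes processing
        }
    }
}

public class GuardDemo {
    public static void main(String[] args) {
        ElectionGuard guard = new ElectionGuard();
        String[] second = new String[1];
        // A request arriving while one is in flight is rejected immediately:
        String first = guard.handleRequest(() -> second[0] = guard.handleRequest(() -> {}));
        System.out.println(first + " / " + second[0]);
    }
}
```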

Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-23 Thread Tom Bentley
Hi Colin,

Thanks for your input. A couple of comments inline.

On 22 August 2017 at 21:42, Colin McCabe  wrote:

> Hi,
>
> Thanks for the KIP.  It looks good overall.
>
> On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote:
> > Hi Jun,
> >
> > Thanks for the feedback.
> >
> > I've updated the KIP to mention this new algorithm, which I agree will be
> > much better from the AdminClient PoV.
> >
> > I've also reverted the authorization to be against the cluster resource,
> > rather than the topic, since, on re-reading, Ewen wasn't particularly
> > advocating that it should be topic-based, merely wondering if that made
> > more sense in addition to the cluster-based check.
> >
>
> Hmm.  In general I thought we were using Cluster:ClusterAction for
> authorizing internal requests made by Kafka itself, and Cluster:Alter
> for administrative requests initiated by the admin.  For example, the
> RPCs added in KIP-140 use Cluster:Alter.
>
> Nitpick: I think you meant to remove this line from the KIP?
>
> > TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to
> the topic.
>
>
You're right, both now fixed.


> Also... this command simply starts the leader election, but does not
> wait for it to complete, right?  What is the preferred method for an
> admin to check the final result or identify when the final result has
> been achieved?
>
>
As suggested by Jun and mentioned in the section "Broker-side election
algorithm" the KIP proposes (as of yesterday) to change how the election is
performed. Rather than the current process of writing to the znode and
having the controller react to that to perform the election, I am now
proposing that the request from the AdminClient to the controller directly
starts the election, and the response is not sent to the AdminClient until
the election is complete. Thus, the API presented by the AdminClient is
synchronous and the tool itself only returns when the elections have
finished (or timed-out).
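From the command-line tool's side, this synchronous shape amounts to blocking on the returned future with a timeout, e.g. (a hypothetical sketch with `CompletableFuture` standing in for `KafkaFuture`):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ToolDemo {
    public static void main(String[] args) throws Exception {
        // In the real client this future would complete only when the
        // controller's response reports the elections finished.
        CompletableFuture<Void> electionsDone = new CompletableFuture<>();
        electionsDone.complete(null); // simulate the elections completing

        try {
            // The tool blocks until the elections finish or the timeout expires.
            electionsDone.get(30, TimeUnit.SECONDS);
            System.out.println("preferred leader elections finished");
        } catch (TimeoutException e) {
            System.out.println("elections timed out");
        }
    }
}
```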


Thanks again,

Tom


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-22 Thread Colin McCabe
Hi,

Thanks for the KIP.  It looks good overall.

On Tue, Aug 22, 2017, at 08:54, Tom Bentley wrote:
> Hi Jun,
> 
> Thanks for the feedback.
> 
> I've updated the KIP to mention this new algorithm, which I agree will be
> much better from the AdminClient PoV.
> 
> I've also reverted the authorization to be against the cluster resource,
> rather than the topic, since, on re-reading, Ewen wasn't particularly
> advocating that it should be topic-based, merely wondering if that made
> more sense in addition to the cluster-based check.
> 

Hmm.  In general I thought we were using Cluster:ClusterAction for
authorizing internal requests made by Kafka itself, and Cluster:Alter
for administrative requests initiated by the admin.  For example, the
RPCs added in KIP-140 use Cluster:Alter.

Nitpick: I think you meant to remove this line from the KIP?

> TOPIC_AUTHORIZATION_FAILED (29) If the user didn't have Alter access to the 
> topic.

Also... this command simply starts the leader election, but does not
wait for it to complete, right?  What is the preferred method for an
admin to check the final result or identify when the final result has
been achieved?

best,
Colin


> Cheers,
> 
> Tom
> 
> On 11 August 2017 at 18:13, Jun Rao  wrote:
> 
> > Hi, Tom,
> >
> > 2. Yes, that's how manual preferred leader election currently works. I
> > think we could use this opportunity to simplify the process a bit though.
> > Doing preferred leader election is a lot cheaper than partition
> > reassignment since it only involves writing the new leader to ZK. So, we
> > probably don't need to persist the request in
> > /admin/preferred_replica_election in ZK. I was thinking of the following
> > steps.
> > (a) The controller receives PreferredLeaderElectionRequest.
> > (b) The controller adds a PreferredReplicaLeaderEelection event in the
> > queue.
> > (c) Wait up to the timeout for the PreferredReplicaLeaderEelection to be
> > processed by the controller and then send a response.
> >
> > If the controller changes and dies in the middle of this process, the
> > client will get an error and the client has to reissue the request.
> >
> > Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1
> > inflight preferred leader election request, in step (a) above, we can set a
> > flag in the controller to indicate that a preferred leader election is in
> > progress. The flag will be reset once the controller finishes processing
> > the request. Until the flag is reset, any new
> > PreferredLeaderElectionRequest
> > will be responded immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS
> >  error.
> >
> > Does this sound good to you?
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley 
> > wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for your reply, I've got a few comment inline...
> > >
> > > On 11 August 2017 at 01:51, Jun Rao  wrote:
> > >
> > > > Hi, Tom,
> > > >
> > > > Thanks for the KIP. Looks good overall. A few minor comments.
> > > >
> > > > 1. In most requests with topic partitions, we nest partitions inside
> > > topic.
> > > > So, instead of [topic partition_id], we do [topic, [partition_id]] to
> > > save
> > > > some space.
> > > >
> > >
> > > Of course, now sorted.
> > >
> > >
> > > > 2. The preferred leader election just tries to move the leader to the
> > > > preferred replica. This may not succeed if the preferred replica is not
> > > > in-sync yet. It would be useful to return an error code to reflect
> > that.
> > > >
> > >
> > > The algorithm the existing command uses is:
> > >
> > > 1. The command writes some JSON to the /admin/preferred_replica_election
> > > znode. The command then returns asynchronously. It only signals an error
> > if
> > > the znode write failed.
> > > 2. The controller is subscribed to changes to this znode and delegates to
> > > the PartitionStateMachine to do the actual election.
> > > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> > > StateChangeFailedException if the preferred leader is not in the ISR, or
> > is
> > > not alive.
> > > 4. This exception will propagate to the ControllerEventManager which will
> > > log it.
> > >
> > > In the event of controller failover, the JSON is read from the znode and
> > > the election proceeds in a similar manner. So the central problem is the
> > > asynchronicity.
> > >
> > > I agree it would be better if the return from the AdminClient were
> > > synchronous. I'm still learning about how the internals of Kafka work.
> > It's
> > > not enough for the AdminManager to wait (with a timeout) for the metadata
> > > cache to show that the leader has been elected, since we want to
> > know
> > > the reason if the election is not possible. So would an appropriate
> > > mechanism to pass this information back from the KafkaController to the
> > > AdminManager be to use another znode, or is there another way to do it?
> > > 

Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-22 Thread Tom Bentley
Hi Jun,

Thanks for the feedback.

I've updated the KIP to mention this new algorithm, which I agree will be
much better from the AdminClient PoV.

I've also reverted the authorization to be against the cluster resource,
rather than the topic, since, on re-reading, Ewen wasn't particularly
advocating that it should be topic-based, merely wondering if that made
more sense in addition to the cluster-based check.

Cheers,

Tom

On 11 August 2017 at 18:13, Jun Rao  wrote:

> Hi, Tom,
>
> 2. Yes, that's how manual preferred leader election currently works. I
> think we could use this opportunity to simplify the process a bit though.
> Doing preferred leader election is a lot cheaper than partition
> reassignment since it only involves writing the new leader to ZK. So, we
> probably don't need to persist the request in /admin/preferred_replica_
> election
> in ZK. I was thinking of the following steps.
> (a) The controller receives PreferredLeaderElectionRequest.
> (b) The controller adds a PreferredReplicaLeaderElection event in the
> queue.
> (c) Wait up to the timeout for the PreferredReplicaLeaderElection to be
> processed by the controller and then send a response.
>
> If the controller changes and dies in the middle of this process, the
> client will get an error and the client has to reissue the request.
>
> Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1
> inflight preferred leader election request, in step (a) above, we can set a
> flag in the controller to indicate that a preferred leader election is in
> progress. The flag will be reset once the controller finishes processing
> the request. Until the flag is reset, any new
> PreferredLeaderElectionRequest
> will be responded to immediately with a
> REPLICA_LEADER_ELECTION_IN_PROGRESS error.
>
> Does this sound good to you?
>
> Thanks,
>
> Jun
>
> On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley 
> wrote:
>
> > Hi Jun,
> >
> > Thanks for your reply, I've got a few comments inline...
> >
> > On 11 August 2017 at 01:51, Jun Rao  wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the KIP. Looks good overall. A few minor comments.
> > >
> > > 1. In most requests with topic partitions, we nest partitions inside
> > topic.
> > > So, instead of [topic partition_id], we do [topic, [partition_id]] to
> > save
> > > some space.
> > >
> >
> > Of course, now sorted.
> >
> >
> > > 2. The preferred leader election just tries to move the leader to the
> > > preferred replica. This may not succeed if the preferred replica is not
> > > in-sync yet. It would be useful to return an error code to reflect
> that.
> > >
> >
> > The algorithm the existing command uses is:
> >
> > 1. The command writes some JSON to the /admin/preferred_replica_election
> > znode. The command then returns asynchronously. It only signals an error
> if
> > the znode write failed.
> > 2. The controller is subscribed to changes to this znode and delegates to
> > the PartitionStateMachine to do the actual election.
> > 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> > StateChangeFailedException if the preferred leader is not in the ISR, or
> is
> > not alive.
> > 4. This exception will propagate to the ControllerEventManager which will
> > log it.
> >
> > In the event of controller failover, the JSON is read from the znode and
> > the election proceeds in a similar manner. So the central problem is the
> > asynchronicity.
> >
> > I agree it would be better if the return from the AdminClient were
> > synchronous. I'm still learning about how the internals of Kafka work.
> It's
> > not enough for the AdminManager to wait (with a timeout) for the metadata
> > cache to show that the leader has been elected, since we want to
> know
> > the reason if the election is not possible. So would an appropriate
> > mechanism to pass this information back from the KafkaController to the
> > AdminManager be to use another znode, or is there another way to do it?
> > I've not found any precedents for this, so please point them out if they
> > exist.
> >
> >
> > > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is
> > single
> > > threaded, is that just for timeout? If so, perhaps we can just return
> the
> > > Timeout error code.
> > >
> >
> > Again, this pertains to the current algorithm for the election: When the
> > elections have been done the znode is deleted. Thus the presence of the
> > znode indicates an election in progress, which causes the command to just
> > give up (the user can try again later).
> >
> > We still need the znode in order to honour the existing elections in the
> > event of controller failover part-way through, thus we need to do
> something
> > in the event that the znode exists at the time we want to write to it.
> >
> > I suppose it is possible to append to the JSON in the znode while
> > elections are already taking place, with the controller reading from the
> > head of the JSON (thus treating the JSON like a queue).

Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-11 Thread Jun Rao
Hi, Tom,

2. Yes, that's how manual preferred leader election currently works. I
think we could use this opportunity to simplify the process a bit though.
Doing preferred leader election is a lot cheaper than partition
reassignment since it only involves writing the new leader to ZK. So, we
probably don't need to persist the request in /admin/preferred_replica_election
in ZK. I was thinking of the following steps.
(a) The controller receives PreferredLeaderElectionRequest.
(b) The controller adds a PreferredReplicaLeaderElection event in the
queue.
(c) Wait up to the timeout for the PreferredReplicaLeaderElection to be
processed by the controller and then send a response.

If the controller changes and dies in the middle of this process, the
client will get an error and the client has to reissue the request.

Good point on REPLICA_LEADER_ELECTION_IN_PROGRESS. To prevent more than 1
inflight preferred leader election request, in step (a) above, we can set a
flag in the controller to indicate that a preferred leader election is in
progress. The flag will be reset once the controller finishes processing
the request. Until the flag is reset, any new PreferredLeaderElectionRequest
will be responded to immediately with a REPLICA_LEADER_ELECTION_IN_PROGRESS
error.
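Steps (a)-(c) plus the in-progress flag could be sketched roughly as below. All class, method, and error names here are illustrative stand-ins, not the actual controller code; the single-threaded controller is modeled as a one-thread executor draining an event queue.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed flow: (a) receive the request, rejecting it when
// an election is already in flight; (b) enqueue an election event for the
// single controller thread; (c) wait up to the timeout for the result.
public class ElectionSketch {
    private final BlockingQueue<Runnable> controllerQueue = new LinkedBlockingQueue<>();
    private final AtomicBoolean electionInProgress = new AtomicBoolean(false);

    public String handleRequest(long timeoutMs) throws InterruptedException {
        // (a) Only one preferred leader election may be in flight at a time.
        if (!electionInProgress.compareAndSet(false, true)) {
            return "REPLICA_LEADER_ELECTION_IN_PROGRESS";
        }
        CompletableFuture<String> result = new CompletableFuture<>();
        // (b) Enqueue the election event; the controller thread runs it.
        controllerQueue.add(() -> {
            result.complete("NONE");       // the actual election would run here
            electionInProgress.set(false); // reset the flag once processed
        });
        try {
            // (c) Wait up to the timeout before sending the response.
            return result.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "REQUEST_TIMED_OUT";
        } catch (ExecutionException e) {
            return "UNKNOWN_SERVER_ERROR";
        }
    }

    // Drive one request through a single-threaded "controller".
    public static String demo() {
        ElectionSketch s = new ElectionSketch();
        ExecutorService controller = Executors.newSingleThreadExecutor();
        controller.submit(() -> {
            try {
                while (true) s.controllerQueue.take().run();
            } catch (InterruptedException e) { /* controller shutdown */ }
        });
        String r;
        try {
            r = s.handleRequest(1000);
        } catch (InterruptedException e) {
            r = "INTERRUPTED";
        }
        controller.shutdownNow();
        return r;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // NONE
    }
}
```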

Does this sound good to you?

Thanks,

Jun

On Fri, Aug 11, 2017 at 4:55 AM, Tom Bentley  wrote:

> Hi Jun,
>
> Thanks for your reply, I've got a few comments inline...
>
> On 11 August 2017 at 01:51, Jun Rao  wrote:
>
> > Hi, Tom,
> >
> > Thanks for the KIP. Looks good overall. A few minor comments.
> >
> > 1. In most requests with topic partitions, we nest partitions inside
> topic.
> > So, instead of [topic partition_id], we do [topic, [partition_id]] to
> save
> > some space.
> >
>
> Of course, now sorted.
>
>
> > 2. The preferred leader election just tries to move the leader to the
> > preferred replica. This may not succeed if the preferred replica is not
> > in-sync yet. It would be useful to return an error code to reflect that.
> >
>
> The algorithm the existing command uses is:
>
> 1. The command writes some JSON to the /admin/preferred_replica_election
> znode. The command then returns asynchronously. It only signals an error if
> the znode write failed.
> 2. The controller is subscribed to changes to this znode and delegates to
> the PartitionStateMachine to do the actual election.
> 3. The PreferredReplicaPartitionLeaderSelector is used and throws a
> StateChangeFailedException if the preferred leader is not in the ISR, or is
> not alive.
> 4. This exception will propagate to the ControllerEventManager which will
> log it.
>
> In the event of controller failover, the JSON is read from the znode and
> the election proceeds in a similar manner. So the central problem is the
> asynchronicity.
>
> I agree it would be better if the return from the AdminClient were
> synchronous. I'm still learning about how the internals of Kafka work. It's
> not enough for the AdminManager to wait (with a timeout) for the metadata
> cache to show that the leader has been elected, since we want to know
> the reason if the election is not possible. So would an appropriate
> mechanism to pass this information back from the KafkaController to the
> AdminManager be to use another znode, or is there another way to do it?
> I've not found any precedents for this, so please point them out if they
> exist.
>
>
> > 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is
> single
> > threaded, is that just for timeout? If so, perhaps we can just return the
> > Timeout error code.
> >
>
> Again, this pertains to the current algorithm for the election: When the
> elections have been done the znode is deleted. Thus the presence of the
> znode indicates an election in progress, which causes the command to just
> give up (the user can try again later).
>
> We still need the znode in order to honour the existing elections in the
> event of controller failover part-way through, thus we need to do something
> in the event that the znode exists at the time we want to write to it.
>
> I suppose it is possible to append to the JSON in the znode while elections
> are already taking place, with the controller reading from the head of the
> JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a
> recipe for this. But I don't see that pattern being used elsewhere in
> Kafka, so I'm not sure if this is what you want.
>
> It is not clear (to me at least) that the current restriction preventing
> calling new elections while elections are on-going is a problem in
> practice. But since both you and Ewen have raised this issue I assume it
> does need to be solved: I just need a bit more of a hint about whether
> there's a preferred way to solve it.
>
>
> > 4. Permission wise, it seems this should require ClusterAction on Cluster
> > resource since it's changing things at the cluster level.
> >
>
> As I said in my reply to Ewen, I can see that point of view (and it's what
> I originally specified), so I don't mind changing it back, but it would be
> good to hear more opinions on this.

Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-11 Thread Tom Bentley
Hi Jun,

Thanks for your reply, I've got a few comments inline...

On 11 August 2017 at 01:51, Jun Rao  wrote:

> Hi, Tom,
>
> Thanks for the KIP. Looks good overall. A few minor comments.
>
> 1. In most requests with topic partitions, we nest partitions inside topic.
> So, instead of [topic partition_id], we do [topic, [partition_id]] to save
> some space.
>

Of course, now sorted.


> 2. The preferred leader election just tries to move the leader to the
> preferred replica. This may not succeed if the preferred replica is not
> in-sync yet. It would be useful to return an error code to reflect that.
>

The algorithm the existing command uses is:

1. The command writes some JSON to the /admin/preferred_replica_election
znode. The command then returns asynchronously. It only signals an error if
the znode write failed.
2. The controller is subscribed to changes to this znode and delegates to
the PartitionStateMachine to do the actual election.
3. The PreferredReplicaPartitionLeaderSelector is used and throws a
StateChangeFailedException if the preferred leader is not in the ISR, or is
not alive.
4. This exception will propagate to the ControllerEventManager which will
log it.

In the event of controller failover, the JSON is read from the znode and
the election proceeds in a similar manner. So the central problem is the
asynchronicity.

I agree it would be better if the return from the AdminClient were
synchronous. I'm still learning about how the internals of Kafka work. It's
not enough for the AdminManager to wait (with a timeout) for the metadata
cache to show that the leader has been elected, since we want to know
the reason if the election is not possible. So would an appropriate
mechanism to pass this information back from the KafkaController to the
AdminManager be to use another znode, or is there another way to do it?
I've not found any precedents for this, so please point them out if they
exist.


> 3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single
> threaded, is that just for timeout? If so, perhaps we can just return the
> Timeout error code.
>

Again, this pertains to the current algorithm for the election: When the
elections have been done the znode is deleted. Thus the presence of the
znode indicates an election in progress, which causes the command to just
give up (the user can try again later).

We still need the znode in order to honour the existing elections in the
event of controller failover part-way through, thus we need to do something
in the event that the znode exists at the time we want to write to it.

I suppose it is possible to append to the JSON in the znode while elections
are already taking place, with the controller reading from the head of the
JSON (thus treating the JSON like a queue), since the ZooKeeper docs give a
recipe for this. But I don't see that pattern being used elsewhere in
Kafka, so I'm not sure if this is what you want.

It is not clear (to me at least) that the current restriction preventing
calling new elections while elections are on-going is a problem in
practice. But since both you and Ewen have raised this issue I assume it
does need to be solved: I just need a bit more of a hint about whether
there's a preferred way to solve it.


> 4. Permission wise, it seems this should require ClusterAction on Cluster
> resource since it's changing things at the cluster level.
>

As I said in my reply to Ewen, I can see that point of view (and it's what
I originally specified), so I don't mind changing it back, but it would be
good to hear more opinions on this.

Cheers,

Tom


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-10 Thread Jun Rao
Hi, Tom,

Thanks for the KIP. Looks good overall. A few minor comments.

1. In most requests with topic partitions, we nest partitions inside topic.
So, instead of [topic partition_id], we do [topic, [partition_id]] to save
some space.

2. The preferred leader election just tries to move the leader to the
preferred replica. This may not succeed if the preferred replica is not
in-sync yet. It would be useful to return an error code to reflect that.

3. REPLICA_LEADER_ELECTION_IN_PROGRESS: Now that the controller is single
threaded, is that just for timeout? If so, perhaps we can just return the
Timeout error code.

4. Permission wise, it seems this should require ClusterAction on Cluster
resource since it's changing things at the cluster level.
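The nesting in point 1 amounts to grouping a flat list of (topic, partition) pairs by topic, so each topic name is serialized once. A hedged sketch of that grouping; TopicPartition here is a local stand-in for org.apache.kafka.common.TopicPartition, and the map is only an in-memory model of the wire layout.

```java
import java.util.*;
import java.util.stream.Collectors;

// Sketch: turn [topic partition_id] pairs into [topic, [partition_id]] so
// each topic name appears once in the request/response.
public class NestedRequest {
    record TopicPartition(String topic, int partition) {}

    static Map<String, List<Integer>> nest(Collection<TopicPartition> tps) {
        return tps.stream().collect(Collectors.groupingBy(
                TopicPartition::topic,
                TreeMap::new, // deterministic topic order for the example
                Collectors.mapping(TopicPartition::partition, Collectors.toList())));
    }

    public static void main(String[] args) {
        var flat = List.of(new TopicPartition("foo", 0),
                           new TopicPartition("foo", 1),
                           new TopicPartition("bar", 0));
        System.out.println(nest(flat)); // {bar=[0], foo=[0, 1]}
    }
}
```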

Thanks,

Jun



On Wed, Aug 9, 2017 at 6:16 AM, Ismael Juma  wrote:

> On Wed, Aug 9, 2017 at 11:40 AM, Tom Bentley 
> wrote:
> >
> > There are responses with detailed error messages as well as the codes,
> > CreateTopicsResponse, {Describe|Alter}ConfigsResponse, and the responses
> > for managing ACLs for instance. To be honest, I assumed including a
> message
> > was the norm. In this case, however, I don't think there is any extra
> > detail to include beyond the error itself, so I've removed it from the
> KIP.
> >
>
> We started sending back actual error messages when we introduced the Create
> Topic Policy as the errors are custom and much more helpful if a string can
> be sent back.
>
> In general, I think it would be better if we had an optional error message
> for all request types as error codes alone sometimes result in people
> having to check the broker logs. Examples that come to mind are Group
> Coordinator Not Available and Invalid Request. For the latter, we want to
> include what was invalid and it's not practical to have an error code for
> every possible validation error (this becomes more obvious once we start
> adding admin protocol APIs).
>
> Ismael
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-09 Thread Ismael Juma
On Wed, Aug 9, 2017 at 11:40 AM, Tom Bentley  wrote:
>
> There are responses with detailed error messages as well as the codes,
> CreateTopicsResponse, {Describe|Alter}ConfigsResponse, and the responses
> for managing ACLs for instance. To be honest, I assumed including a message
> was the norm. In this case, however, I don't think there is any extra
> detail to include beyond the error itself, so I've removed it from the KIP.
>

We started sending back actual error messages when we introduced the Create
Topic Policy as the errors are custom and much more helpful if a string can
be sent back.

In general, I think it would be better if we had an optional error message
for all request types as error codes alone sometimes result in people
having to check the broker logs. Examples that come to mind are Group
Coordinator Not Available and Invalid Request. For the latter, we want to
include what was invalid and it's not practical to have an error code for
every possible validation error (this becomes more obvious once we start
adding admin protocol APIs).

Ismael
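The optional error message Ismael describes could be modeled as an error code plus a nullable string, with the client falling back to the code's generic text when the broker sent no detail. A sketch with illustrative names only, not Kafka's actual response classes:

```java
import java.util.Optional;

// Sketch: responses carry a short error code plus an optional
// broker-supplied message; the client prefers the specific message.
public class ApiError {
    final short code;
    final String message; // nullable on the wire

    ApiError(short code, String message) {
        this.code = code;
        this.message = message;
    }

    // Use the broker's detail if present, else the code's generic text.
    String messageOrDefault(String genericText) {
        return Optional.ofNullable(message).orElse(genericText);
    }

    public static void main(String[] args) {
        ApiError bare = new ApiError((short) 42, null);
        ApiError detailed = new ApiError((short) 42, "field 'partitions' must be non-empty");
        System.out.println(bare.messageOrDefault("INVALID_REQUEST"));     // INVALID_REQUEST
        System.out.println(detailed.messageOrDefault("INVALID_REQUEST")); // field 'partitions' must be non-empty
    }
}
```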


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-09 Thread Tom Bentley
Hi Ewen,

Thanks for looking at the KIP. I've updated it for some of your comments,
but see also a few replies inline.

On 9 August 2017 at 06:02, Ewen Cheslack-Postava  wrote:

> Thanks for the KIP. Generally the move away from ZK and to native Kafka
> requests is good, so I'm generally +1 on this. A couple of
> comments/questions.
>
> * You gave the signature
> electPreferredReplicaLeader(Collection&lt;TopicPartition&gt; partitions) for the
> admin client. The old command allows not specifying the topic partitions
> which results in election for all topic partitions. How would this be
> expressed in the new API? Does the tool need to do a metadata request and
> then issue the request for all topic partitions or would there be a
> shortcut via the protocol (e.g. another electPreferredReplicaLeader() with
> no args which translates to, e.g., an empty list or null in the request)?
>

I was trying to figure out the best way to do this. It seems there are no
methods in the AdminClient setting a precedent on this. I would prefer to
avoid having to do a metadata request to discover all the partitions. So
the two ways I can think of are:

1. As you suggest, an empty or null list in the request would signify "all
partitions". This is not intuitive, and using an empty list is downright
surprising (an empty list should be a noop after all).
2. We could have an allPartitions flag in the Options class. This is also
unintuitive because it would ignore whatever partitions were in the list,
though I suppose the AdminClient itself could check for this and signal an
error.

Of these options, I think using a null list in the request is the least
surprising.
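Option 1 above (a null collection means "all partitions", an empty one stays a no-op) might resolve like this on the broker side. Names and types here are simplified stand-ins; partitions are plain strings rather than TopicPartition objects.

```java
import java.util.*;

// Sketch: resolve the caller's partition argument against the cluster's
// full partition set; null requests an election everywhere.
public class ElectAllSketch {

    static Set<String> resolve(Collection<String> requested, Set<String> allPartitions) {
        if (requested == null) {
            return allPartitions;        // null => elect for every partition
        }
        return new TreeSet<>(requested); // empty stays empty: a no-op
    }

    public static void main(String[] args) {
        Set<String> all = new TreeSet<>(Set.of("foo-0", "foo-1", "bar-0"));
        System.out.println(resolve(null, all));             // [bar-0, foo-0, foo-1]
        System.out.println(resolve(List.of("foo-0"), all)); // [foo-0]
        System.out.println(resolve(List.of(), all));        // []
    }
}
```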


> * All the existing *Options classes for the AdminClient have at least a
> timeout. Should we add that to this APIs options?
> * Other AdminClient *Result classes provide some convenience methods when
> you just want all the results. Should we have that as well?
> * I don't think any other requests include error strings in the response,
> do they? I'm pretty sure we just generate the error string on the client
> side based on the error code. If an error code was ambiguous, we should
> probably just split it across 2 or more codes.
>

There are responses with detailed error messages as well as the codes,
CreateTopicsResponse, {Describe|Alter}ConfigsResponse, and the responses
for managing ACLs for instance. To be honest, I assumed including a message
was the norm. In this case, however, I don't think there is any extra
detail to include beyond the error itself, so I've removed it from the KIP.


> * re: permissions, would there also be a per-topic permission required? (I
> haven't kept track of all the ACLs and required permissions so I'm not
> sure, but a bunch of other operations check per-resource they are modifying
> -- this operation could be considered to be against either Cluster or
> Topics.)
>

I'm also unsure, and would welcome input from others about this. Note that
there is some similarity here with some of the discussion on
https://issues.apache.org/jira/browse/KAFKA-5638: To what extent should
some permission on the Cluster imply some permission across all Topics in
the cluster? My feeling is that there should be no such implication,
because it will just make ACLs harder to reason about, and require more
documentation about what implies what else.

Getting back to this specific case, it would make some sense for Alter on
the Topic to be necessary, though it could be argued that the leader of a
partition is an operational concern, and not really part of the topic
configuration. Changing the leader of a bunch of partitions could affect
partitions of other topics. On the other hand, anyone with Alter on the
topic would presumably have the freedom to set the preferred leader for a
partition, so it would be a little strange if they couldn't actually give
effect to that choice.

On balance I think I prefer needing alter on the topic (and have adjusted
the KIP accordingly), but I'm very happy to go with the consensus view.



> * Should we consider making the request only valid against the broker and
> having the command do a metadata request to discover the controller and
> then send the request directly to it? I think this could a) simplify the
> implementation a bit by taking the ZK node out of the equation b) avoid the
> REPLICA_LEADER_ELECTION_IN_PROGRESS error since this is only required due
> to using the ZK node to communicate with the controller afaict and c)
> potentially open the door for a synchronous request/response in the future.
>
>
That's a really good point. I agree that the request should be made to the
controller, because it allows us to make improvements in future. I'm not
totally sure about taking ZK out of the equation completely though: Right
now during controller failover the new controller checks the znode and
ensures those elections get honoured. So it seems we need to keep ZK in the
picture, and without rewriting how we read and write 

Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-08 Thread Ewen Cheslack-Postava
Thanks for the KIP. Generally the move away from ZK and to native Kafka
requests is good, so I'm generally +1 on this. A couple of
comments/questions.

* You gave the signature
electPreferredReplicaLeader(Collection&lt;TopicPartition&gt; partitions) for the
admin client. The old command allows not specifying the topic partitions
which results in election for all topic partitions. How would this be
expressed in the new API? Does the tool need to do a metadata request and
then issue the request for all topic partitions or would there be a
shortcut via the protocol (e.g. another electPreferredReplicaLeader() with
no args which translates to, e.g., an empty list or null in the request)?
* All the existing *Options classes for the AdminClient have at least a
timeout. Should we add that to this APIs options?
* Other AdminClient *Result classes provide some convenience methods when
you just want all the results. Should we have that as well?
* I don't think any other requests include error strings in the response,
do they? I'm pretty sure we just generate the error string on the client
side based on the error code. If an error code was ambiguous, we should
probably just split it across 2 or more codes.
* re: permissions, would there also be a per-topic permission required? (I
haven't kept track of all the ACLs and required permissions so I'm not
sure, but a bunch of other operations check per-resource they are modifying
-- this operation could be considered to be against either Cluster or
Topics.)
* Should we consider making the request only valid against the broker and
having the command do a metadata request to discover the controller and
then send the request directly to it? I think this could a) simplify the
implementation a bit by taking the ZK node out of the equation b) avoid the
REPLICA_LEADER_ELECTION_IN_PROGRESS error since this is only required due
to using the ZK node to communicate with the controller afaict and c)
potentially open the door for a synchronous request/response in the future.

-Ewen

On Mon, Aug 7, 2017 at 8:21 AM, Tom Bentley  wrote:

> Hi,
>
> I've updated this KIP slightly, to clarify a couple of points.
>
> One thing in particular that I would like feedback on is what authorization
> should be required for triggering the election of the preferred leader?
>
> Another thing would be whether the request and responses should be grouped
> by topic name. This would make for smaller messages when triggering
> elections for multiple partitions of the same topic.
>
> I'd be grateful for any feedback you may have.
>
> Cheers,
>
> Tom
>
> On 2 August 2017 at 18:34, Tom Bentley  wrote:
>
> > In a similar vein to KIP-179 I've created KIP-183 (
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
> > PreferredReplicaLeaderElectionCommand+to+use+AdminClient) which is about
> > deprecating the --zookeeper option to kafka-preferred-replica-
> election.sh
> > and replacing it with an option which would use a new AdminClient-based
> API.
> >
> > As it stands the KIP is focussed on simply moving the existing
> > functionality behind the AdminClient.
> >
> > I'd be grateful for any feedback people may have on this.
> >
> > Thanks,
> >
> > Tom
> >
>


Re: [DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-07 Thread Tom Bentley
Hi,

I've updated this KIP slightly, to clarify a couple of points.

One thing in particular that I would like feedback on is what authorization
should be required for triggering the election of the preferred leader?

Another thing would be whether the request and responses should be grouped
by topic name. This would make for smaller messages when triggering
elections for multiple partitions of the same topic.

I'd be grateful for any feedback you may have.

Cheers,

Tom

On 2 August 2017 at 18:34, Tom Bentley  wrote:

> In a similar vein to KIP-179 I've created KIP-183 (
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+
> PreferredReplicaLeaderElectionCommand+to+use+AdminClient) which is about
> deprecating the --zookeeper option to kafka-preferred-replica-election.sh
> and replacing it with an option which would use a new AdminClient-based API.
>
> As it stands the KIP is focussed on simply moving the existing
> functionality behind the AdminClient.
>
> I'd be grateful for any feedback people may have on this.
>
> Thanks,
>
> Tom
>


[DISCUSS] KIP-183 - Change PreferredReplicaLeaderElectionCommand to use AdminClient

2017-08-02 Thread Tom Bentley
In a similar vein to KIP-179 I've created KIP-183 (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-183+-+Change+PreferredReplicaLeaderElectionCommand+to+use+AdminClient)
which is about deprecating the --zookeeper option to
kafka-preferred-replica-election.sh and replacing it with an option which
would use a new AdminClient-based API.

As it stands the KIP is focussed on simply moving the existing
functionality behind the AdminClient.

I'd be grateful for any feedback people may have on this.

Thanks,

Tom