Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-08-06 Thread David Jacot
Hi all,

I would like to inform you that we have slightly changed our thoughts about
the implementation
of the Token Bucket algorithm. Our initial idea was to change our existing
Rate to behave like
a Token Bucket. That works as we expected but we have realized that the
value of the Rate is
not really a sampled Rate anymore. From an observability perspective, it is
not good to change
the behavior of our Rate. Instead, we propose to implement a new
MeasurableStat that implements
the Token Bucket algorithm and to use it alongside the Rate in the Sensor.
Which this, the Rate
remains there for observability purposes and the Token Bucket is used for
enforcing the quota.
The Token Bucket metrics is exposed via a new metric named "credits" that
represents the
remaining amount of credits in the bucket. As a first step, this will be
only used for the controller
quota.

I have updated the KIP to reflect this change.

We hope this change is fine for everyone. Please, raise your concerns if
not.

Best,
David

On Tue, Jul 28, 2020 at 1:48 PM David Jacot  wrote:

> Hi all,
>
> Just a quick update. We have made good progress regarding the
> implementation
> of this KIP. The major parts are already in trunk modulo the new "rate"
> implementation
> which is still under development.
>
> I would like to change the type of the `controller_mutations_rate` from a
> Long to
> a Double. I have made various experiments and the rate can be quite low,
> especially
> in clusters with a lot of tenants. Using a Long is quite limiting here to
> fine tune small
> rates. I hope that this change is fine for everyone.
>
> Best,
> David
>
> On Mon, Jun 15, 2020 at 9:26 AM David Jacot  wrote:
>
>> Hi all,
>>
>> The vote has passed with 5 binding votes (Gwen, Rajini, Mickael, Jun and
>> Colin)
>> and 2 non-binding votes (Tom, Anna).
>>
>> Thank you all for the fruitful discussion! I'd like to particularly thank
>> Anna who has
>> heavily contributed to the design of this KIP.
>>
>> Regards,
>> David
>>
>> On Fri, Jun 12, 2020 at 10:08 PM Colin McCabe  wrote:
>>
>>> +1.  Thanks, David!
>>>
>>> best,
>>> Colin
>>>
>>> On Thu, Jun 11, 2020, at 23:51, David Jacot wrote:
>>> > Colin, Jun,
>>> >
>>> > Do the proposed error code and the updated KIP look good to you guys?
>>> I’d
>>> > like to wrap up and close the vote.
>>> >
>>> > Thanks,
>>> > David
>>> >
>>> > Le mer. 10 juin 2020 à 14:50, David Jacot  a
>>> écrit :
>>> >
>>> > > Hi Colin and Jun,
>>> > >
>>> > > I have no problem if we have to rewrite part of it when the new
>>> controller
>>> > > comes
>>> > > out. I will be more than happy to help out.
>>> > >
>>> > > Regarding KIP-590, I think that we can cope with a principal as a
>>> string
>>> > > when the
>>> > > time comes. The user entity name is defined with a string already.
>>> > >
>>> > > Regarding the name of the error, you have made a good point. I do
>>> agree
>>> > > that it
>>> > > is important to differentiate the two cases. I propose the following
>>> two
>>> > > errors:
>>> > > - THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than
>>> rate as
>>> > > we have quotas which are not rate (e.g. request quota). This one is
>>> > > retryable
>>> > > once the throttle time is passed.
>>> > > - LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has
>>> been
>>> > > reached and is a final error.
>>> > > We only need the former in this KIP. What do you think?
>>> > >
>>> > > Jun, I have added a few examples in the KIP. The new name works
>>> exactly
>>> > > like
>>> > > the existing one once it is added to the accepted dynamic configs
>>> for the
>>> > > user
>>> > > and the client entities. I have added a "Kafka Config Command"
>>> chapter in
>>> > > the
>>> > > KIP. I will also open a Jira to not forget updating the AK
>>> documentation
>>> > > once
>>> > > the KIP gets merged.
>>> > >
>>> > > Thanks,
>>> > > David
>>> > >
>>> > > On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:
>>> > >
>>> > >> Hi, Colin,
>>> > >>
>>> > >> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this
>>> clear.
>>> > >>
>>> > >> Hi, David,
>>> > >>
>>> > >> We added a new quota name in the KIP. You chose not to bump up the
>>> version
>>> > >> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok
>>> since
>>> > >> the
>>> > >> quota name is represented as a string. However, the new quota name
>>> can be
>>> > >> used in client tools for setting and listing the quota (
>>> > >> https://kafka.apache.org/documentation/#quotas). Could you
>>> document how
>>> > >> the
>>> > >> new name will be used in those tools?
>>> > >>
>>> > >> Thanks,
>>> > >>
>>> > >> Jun
>>> > >>
>>> > >> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe 
>>> wrote:
>>> > >>
>>> > >> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
>>> > >> > > Hi Colin,
>>> > >> > >
>>> > >> > > Thank you for your feedback.
>>> > >> > >
>>> > >> > > Jun has summarized the situation pretty well. Thanks Jun! I
>>> would
>>> > >> like to
>>> > >> > > complement 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-07-28 Thread David Jacot
Hi all,

Just a quick update. We have made good progress regarding the implementation
of this KIP. The major parts are already in trunk modulo the new "rate"
implementation
which is still under development.

I would like to change the type of the `controller_mutations_rate` from a
Long to
a Double. I have made various experiments and the rate can be quite low,
especially
in clusters with a lot of tenants. Using a Long is quite limiting here to
fine tune small
rates. I hope that this change is fine for everyone.

Best,
David

On Mon, Jun 15, 2020 at 9:26 AM David Jacot  wrote:

> Hi all,
>
> The vote has passed with 5 binding votes (Gwen, Rajini, Mickael, Jun and
> Colin)
> and 2 non-binding votes (Tom, Anna).
>
> Thank you all for the fruitful discussion! I'd like to particularly thank
> Anna who has
> heavily contributed to the design of this KIP.
>
> Regards,
> David
>
> On Fri, Jun 12, 2020 at 10:08 PM Colin McCabe  wrote:
>
>> +1.  Thanks, David!
>>
>> best,
>> Colin
>>
>> On Thu, Jun 11, 2020, at 23:51, David Jacot wrote:
>> > Colin, Jun,
>> >
>> > Do the proposed error code and the updated KIP look good to you guys?
>> I’d
>> > like to wrap up and close the vote.
>> >
>> > Thanks,
>> > David
>> >
>> > Le mer. 10 juin 2020 à 14:50, David Jacot  a
>> écrit :
>> >
>> > > Hi Colin and Jun,
>> > >
>> > > I have no problem if we have to rewrite part of it when the new
>> controller
>> > > comes
>> > > out. I will be more than happy to help out.
>> > >
>> > > Regarding KIP-590, I think that we can cope with a principal as a
>> string
>> > > when the
>> > > time comes. The user entity name is defined with a string already.
>> > >
>> > > Regarding the name of the error, you have made a good point. I do
>> agree
>> > > that it
>> > > is important to differentiate the two cases. I propose the following
>> two
>> > > errors:
>> > > - THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than rate
>> as
>> > > we have quotas which are not rate (e.g. request quota). This one is
>> > > retryable
>> > > once the throttle time is passed.
>> > > - LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has
>> been
>> > > reached and is a final error.
>> > > We only need the former in this KIP. What do you think?
>> > >
>> > > Jun, I have added a few examples in the KIP. The new name works
>> exactly
>> > > like
>> > > the existing one once it is added to the accepted dynamic configs for
>> the
>> > > user
>> > > and the client entities. I have added a "Kafka Config Command"
>> chapter in
>> > > the
>> > > KIP. I will also open a Jira to not forget updating the AK
>> documentation
>> > > once
>> > > the KIP gets merged.
>> > >
>> > > Thanks,
>> > > David
>> > >
>> > > On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:
>> > >
>> > >> Hi, Colin,
>> > >>
>> > >> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this
>> clear.
>> > >>
>> > >> Hi, David,
>> > >>
>> > >> We added a new quota name in the KIP. You chose not to bump up the
>> version
>> > >> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok
>> since
>> > >> the
>> > >> quota name is represented as a string. However, the new quota name
>> can be
>> > >> used in client tools for setting and listing the quota (
>> > >> https://kafka.apache.org/documentation/#quotas). Could you document
>> how
>> > >> the
>> > >> new name will be used in those tools?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Jun
>> > >>
>> > >> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe 
>> wrote:
>> > >>
>> > >> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
>> > >> > > Hi Colin,
>> > >> > >
>> > >> > > Thank you for your feedback.
>> > >> > >
>> > >> > > Jun has summarized the situation pretty well. Thanks Jun! I would
>> > >> like to
>> > >> > > complement it with the following points:
>> > >> > >
>> > >> > > 1. Indeed, when the quota is exceeded, the broker will reject the
>> > >> topic
>> > >> > > creations, partition creations and topics deletions that are
>> exceeding
>> > >> > > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
>> > >> > > be populated accordingly to let the client know how long it must
>> wait.
>> > >> > >
>> > >> > > 2. I do agree that we actually want a mechanism to apply back
>> pressure
>> > >> > > to the clients. The KIP basically proposes a mechanism to
>> control and
>> > >> to
>> > >> > > limit the rate of operations before entering the controller. I
>> think
>> > >> that
>> > >> > > it is similar to your thinking but is enforced based on a defined
>> > >> > > instead of relying on the number of pending items in the
>> controller.
>> > >> > >
>> > >> > > 3. You mentioned an alternative idea in your comments that, if I
>> > >> > understood
>> > >> > > correctly, would bound the queue to limit the overload and reject
>> > >> work if
>> > >> > > the queue is full. I have been thinking about this as well but I
>> don't
>> > >> > think
>> > >> > > that it  works well in our case.
>> > >> > > - The first reason is 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-15 Thread David Jacot
Hi all,

The vote has passed with 5 binding votes (Gwen, Rajini, Mickael, Jun and
Colin)
and 2 non-binding votes (Tom, Anna).

Thank you all for the fruitful discussion! I'd like to particularly thank
Anna who has
heavily contributed to the design of this KIP.

Regards,
David

On Fri, Jun 12, 2020 at 10:08 PM Colin McCabe  wrote:

> +1.  Thanks, David!
>
> best,
> Colin
>
> On Thu, Jun 11, 2020, at 23:51, David Jacot wrote:
> > Colin, Jun,
> >
> > Do the proposed error code and the updated KIP look good to you guys? I’d
> > like to wrap up and close the vote.
> >
> > Thanks,
> > David
> >
> > Le mer. 10 juin 2020 à 14:50, David Jacot  a écrit
> :
> >
> > > Hi Colin and Jun,
> > >
> > > I have no problem if we have to rewrite part of it when the new
> controller
> > > comes
> > > out. I will be more than happy to help out.
> > >
> > > Regarding KIP-590, I think that we can cope with a principal as a
> string
> > > when the
> > > time comes. The user entity name is defined with a string already.
> > >
> > > Regarding the name of the error, you have made a good point. I do agree
> > > that it
> > > is important to differentiate the two cases. I propose the following
> two
> > > errors:
> > > - THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than rate
> as
> > > we have quotas which are not rate (e.g. request quota). This one is
> > > retryable
> > > once the throttle time is passed.
> > > - LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has
> been
> > > reached and is a final error.
> > > We only need the former in this KIP. What do you think?
> > >
> > > Jun, I have added a few examples in the KIP. The new name works exactly
> > > like
> > > the existing one once it is added to the accepted dynamic configs for
> the
> > > user
> > > and the client entities. I have added a "Kafka Config Command" chapter
> in
> > > the
> > > KIP. I will also open a Jira to not forget updating the AK
> documentation
> > > once
> > > the KIP gets merged.
> > >
> > > Thanks,
> > > David
> > >
> > > On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:
> > >
> > >> Hi, Colin,
> > >>
> > >> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this
> clear.
> > >>
> > >> Hi, David,
> > >>
> > >> We added a new quota name in the KIP. You chose not to bump up the
> version
> > >> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok
> since
> > >> the
> > >> quota name is represented as a string. However, the new quota name
> can be
> > >> used in client tools for setting and listing the quota (
> > >> https://kafka.apache.org/documentation/#quotas). Could you document
> how
> > >> the
> > >> new name will be used in those tools?
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe 
> wrote:
> > >>
> > >> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
> > >> > > Hi Colin,
> > >> > >
> > >> > > Thank you for your feedback.
> > >> > >
> > >> > > Jun has summarized the situation pretty well. Thanks Jun! I would
> > >> like to
> > >> > > complement it with the following points:
> > >> > >
> > >> > > 1. Indeed, when the quota is exceeded, the broker will reject the
> > >> topic
> > >> > > creations, partition creations and topics deletions that are
> exceeding
> > >> > > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> > >> > > be populated accordingly to let the client know how long it must
> wait.
> > >> > >
> > >> > > 2. I do agree that we actually want a mechanism to apply back
> pressure
> > >> > > to the clients. The KIP basically proposes a mechanism to control
> and
> > >> to
> > >> > > limit the rate of operations before entering the controller. I
> think
> > >> that
> > >> > > it is similar to your thinking but is enforced based on a defined
> > >> > > instead of relying on the number of pending items in the
> controller.
> > >> > >
> > >> > > 3. You mentioned an alternative idea in your comments that, if I
> > >> > understood
> > >> > > correctly, would bound the queue to limit the overload and reject
> > >> work if
> > >> > > the queue is full. I have been thinking about this as well but I
> don't
> > >> > think
> > >> > > that it  works well in our case.
> > >> > > - The first reason is the one mentioned by Jun. We actually want
> to be
> > >> > able
> > >> > > to limit specific clients (the misbehaving ones) in a multi-tenant
> > >> > > environment.
> > >> > > - The second reason is that, at least in our current
> implementation,
> > >> the
> > >> > > length of the queue is not really a good characteristic to
> estimate
> > >> the
> > >> > load.
> > >> > > Coming back to your example of the CreateTopicsRequest. They
> create
> > >> path
> > >> > >  in ZK for each newly created topics which trigger a ChangeTopic
> event
> > >> > in
> > >> > > the controller. That single event could be for a single topic in
> some
> > >> > cases or
> > >> > > for a thousand topics in others.
> > >> > > These two reasons aside, bounding 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-12 Thread Colin McCabe
+1.  Thanks, David!

best,
Colin

On Thu, Jun 11, 2020, at 23:51, David Jacot wrote:
> Colin, Jun,
> 
> Do the proposed error code and the updated KIP look good to you guys? I’d
> like to wrap up and close the vote.
> 
> Thanks,
> David
> 
> Le mer. 10 juin 2020 à 14:50, David Jacot  a écrit :
> 
> > Hi Colin and Jun,
> >
> > I have no problem if we have to rewrite part of it when the new controller
> > comes
> > out. I will be more than happy to help out.
> >
> > Regarding KIP-590, I think that we can cope with a principal as a string
> > when the
> > time comes. The user entity name is defined with a string already.
> >
> > Regarding the name of the error, you have made a good point. I do agree
> > that it
> > is important to differentiate the two cases. I propose the following two
> > errors:
> > - THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than rate as
> > we have quotas which are not rate (e.g. request quota). This one is
> > retryable
> > once the throttle time is passed.
> > - LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has been
> > reached and is a final error.
> > We only need the former in this KIP. What do you think?
> >
> > Jun, I have added a few examples in the KIP. The new name works exactly
> > like
> > the existing one once it is added to the accepted dynamic configs for the
> > user
> > and the client entities. I have added a "Kafka Config Command" chapter in
> > the
> > KIP. I will also open a Jira to not forget updating the AK documentation
> > once
> > the KIP gets merged.
> >
> > Thanks,
> > David
> >
> > On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:
> >
> >> Hi, Colin,
> >>
> >> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this clear.
> >>
> >> Hi, David,
> >>
> >> We added a new quota name in the KIP. You chose not to bump up the version
> >> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok since
> >> the
> >> quota name is represented as a string. However, the new quota name can be
> >> used in client tools for setting and listing the quota (
> >> https://kafka.apache.org/documentation/#quotas). Could you document how
> >> the
> >> new name will be used in those tools?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe  wrote:
> >>
> >> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
> >> > > Hi Colin,
> >> > >
> >> > > Thank you for your feedback.
> >> > >
> >> > > Jun has summarized the situation pretty well. Thanks Jun! I would
> >> like to
> >> > > complement it with the following points:
> >> > >
> >> > > 1. Indeed, when the quota is exceeded, the broker will reject the
> >> topic
> >> > > creations, partition creations and topics deletions that are exceeding
> >> > > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> >> > > be populated accordingly to let the client know how long it must wait.
> >> > >
> >> > > 2. I do agree that we actually want a mechanism to apply back pressure
> >> > > to the clients. The KIP basically proposes a mechanism to control and
> >> to
> >> > > limit the rate of operations before entering the controller. I think
> >> that
> >> > > it is similar to your thinking but is enforced based on a defined
> >> > > instead of relying on the number of pending items in the controller.
> >> > >
> >> > > 3. You mentioned an alternative idea in your comments that, if I
> >> > understood
> >> > > correctly, would bound the queue to limit the overload and reject
> >> work if
> >> > > the queue is full. I have been thinking about this as well but I don't
> >> > think
> >> > > that it  works well in our case.
> >> > > - The first reason is the one mentioned by Jun. We actually want to be
> >> > able
> >> > > to limit specific clients (the misbehaving ones) in a multi-tenant
> >> > > environment.
> >> > > - The second reason is that, at least in our current implementation,
> >> the
> >> > > length of the queue is not really a good characteristic to estimate
> >> the
> >> > load.
> >> > > Coming back to your example of the CreateTopicsRequest. They create
> >> path
> >> > >  in ZK for each newly created topics which trigger a ChangeTopic event
> >> > in
> >> > > the controller. That single event could be for a single topic in some
> >> > cases or
> >> > > for a thousand topics in others.
> >> > > These two reasons aside, bounding the queue also introduces a knob
> >> which
> >> > > requires some tuning and thus suffers from all the points you
> >> mentioned
> >> > as
> >> > > well, isn't it? The value may be true for one version but not for
> >> > another.
> >> > >
> >> > > 4. Regarding the integration with KIP-500. The implementation of this
> >> KIP
> >> > > will span across the ApiLayer and the AdminManager. To be honest, we
> >> > > can influence the implementation to work best with what you have in
> >> mind
> >> > > for the future controller. If you could shed some more light on the
> >> > future
> >> > > internal architecture, I can 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-12 Thread Jun Rao
Hi, David,

Thanks for making those changes. They look fine to me. +1

Jun

On Thu, Jun 11, 2020 at 11:51 PM David Jacot  wrote:

> Colin, Jun,
>
> Do the proposed error code and the updated KIP look good to you guys? I’d
> like to wrap up and close the vote.
>
> Thanks,
> David
>
> Le mer. 10 juin 2020 à 14:50, David Jacot  a écrit :
>
> > Hi Colin and Jun,
> >
> > I have no problem if we have to rewrite part of it when the new
> controller
> > comes
> > out. I will be more than happy to help out.
> >
> > Regarding KIP-590, I think that we can cope with a principal as a string
> > when the
> > time comes. The user entity name is defined with a string already.
> >
> > Regarding the name of the error, you have made a good point. I do agree
> > that it
> > is important to differentiate the two cases. I propose the following two
> > errors:
> > - THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than rate as
> > we have quotas which are not rate (e.g. request quota). This one is
> > retryable
> > once the throttle time is passed.
> > - LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has been
> > reached and is a final error.
> > We only need the former in this KIP. What do you think?
> >
> > Jun, I have added a few examples in the KIP. The new name works exactly
> > like
> > the existing one once it is added to the accepted dynamic configs for the
> > user
> > and the client entities. I have added a "Kafka Config Command" chapter in
> > the
> > KIP. I will also open a Jira to not forget updating the AK documentation
> > once
> > the KIP gets merged.
> >
> > Thanks,
> > David
> >
> > On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:
> >
> >> Hi, Colin,
> >>
> >> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this
> clear.
> >>
> >> Hi, David,
> >>
> >> We added a new quota name in the KIP. You chose not to bump up the
> version
> >> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok since
> >> the
> >> quota name is represented as a string. However, the new quota name can
> be
> >> used in client tools for setting and listing the quota (
> >> https://kafka.apache.org/documentation/#quotas). Could you document how
> >> the
> >> new name will be used in those tools?
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe  wrote:
> >>
> >> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
> >> > > Hi Colin,
> >> > >
> >> > > Thank you for your feedback.
> >> > >
> >> > > Jun has summarized the situation pretty well. Thanks Jun! I would
> >> like to
> >> > > complement it with the following points:
> >> > >
> >> > > 1. Indeed, when the quota is exceeded, the broker will reject the
> >> topic
> >> > > creations, partition creations and topics deletions that are
> exceeding
> >> > > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> >> > > be populated accordingly to let the client know how long it must
> wait.
> >> > >
> >> > > 2. I do agree that we actually want a mechanism to apply back
> pressure
> >> > > to the clients. The KIP basically proposes a mechanism to control
> and
> >> to
> >> > > limit the rate of operations before entering the controller. I think
> >> that
> >> > > it is similar to your thinking but is enforced based on a defined
> >> > > instead of relying on the number of pending items in the controller.
> >> > >
> >> > > 3. You mentioned an alternative idea in your comments that, if I
> >> > understood
> >> > > correctly, would bound the queue to limit the overload and reject
> >> work if
> >> > > the queue is full. I have been thinking about this as well but I
> don't
> >> > think
> >> > > that it  works well in our case.
> >> > > - The first reason is the one mentioned by Jun. We actually want to
> be
> >> > able
> >> > > to limit specific clients (the misbehaving ones) in a multi-tenant
> >> > > environment.
> >> > > - The second reason is that, at least in our current implementation,
> >> the
> >> > > length of the queue is not really a good characteristic to estimate
> >> the
> >> > load.
> >> > > Coming back to your example of the CreateTopicsRequest. They create
> >> path
> >> > >  in ZK for each newly created topics which trigger a ChangeTopic
> event
> >> > in
> >> > > the controller. That single event could be for a single topic in
> some
> >> > cases or
> >> > > for a thousand topics in others.
> >> > > These two reasons aside, bounding the queue also introduces a knob
> >> which
> >> > > requires some tuning and thus suffers from all the points you
> >> mentioned
> >> > as
> >> > > well, isn't it? The value may be true for one version but not for
> >> > another.
> >> > >
> >> > > 4. Regarding the integration with KIP-500. The implementation of
> this
> >> KIP
> >> > > will span across the ApiLayer and the AdminManager. To be honest, we
> >> > > can influence the implementation to work best with what you have in
> >> mind
> >> > > for the future controller. If you could shed some more 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-12 Thread David Jacot
Colin, Jun,

Do the proposed error code and the updated KIP look good to you guys? I’d
like to wrap up and close the vote.

Thanks,
David

Le mer. 10 juin 2020 à 14:50, David Jacot  a écrit :

> Hi Colin and Jun,
>
> I have no problem if we have to rewrite part of it when the new controller
> comes
> out. I will be more than happy to help out.
>
> Regarding KIP-590, I think that we can cope with a principal as a string
> when the
> time comes. The user entity name is defined with a string already.
>
> Regarding the name of the error, you have made a good point. I do agree
> that it
> is important to differentiate the two cases. I propose the following two
> errors:
> - THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than rate as
> we have quotas which are not rate (e.g. request quota). This one is
> retryable
> once the throttle time is passed.
> - LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has been
> reached and is a final error.
> We only need the former in this KIP. What do you think?
>
> Jun, I have added a few examples in the KIP. The new name works exactly
> like
> the existing one once it is added to the accepted dynamic configs for the
> user
> and the client entities. I have added a "Kafka Config Command" chapter in
> the
> KIP. I will also open a Jira to not forget updating the AK documentation
> once
> the KIP gets merged.
>
> Thanks,
> David
>
> On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:
>
>> Hi, Colin,
>>
>> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this clear.
>>
>> Hi, David,
>>
>> We added a new quota name in the KIP. You chose not to bump up the version
>> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok since
>> the
>> quota name is represented as a string. However, the new quota name can be
>> used in client tools for setting and listing the quota (
>> https://kafka.apache.org/documentation/#quotas). Could you document how
>> the
>> new name will be used in those tools?
>>
>> Thanks,
>>
>> Jun
>>
>> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe  wrote:
>>
>> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
>> > > Hi Colin,
>> > >
>> > > Thank you for your feedback.
>> > >
>> > > Jun has summarized the situation pretty well. Thanks Jun! I would
>> like to
>> > > complement it with the following points:
>> > >
>> > > 1. Indeed, when the quota is exceeded, the broker will reject the
>> topic
>> > > creations, partition creations and topics deletions that are exceeding
>> > > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
>> > > be populated accordingly to let the client know how long it must wait.
>> > >
>> > > 2. I do agree that we actually want a mechanism to apply back pressure
>> > > to the clients. The KIP basically proposes a mechanism to control and
>> to
>> > > limit the rate of operations before entering the controller. I think
>> that
>> > > it is similar to your thinking but is enforced based on a defined
>> > > instead of relying on the number of pending items in the controller.
>> > >
>> > > 3. You mentioned an alternative idea in your comments that, if I
>> > understood
>> > > correctly, would bound the queue to limit the overload and reject
>> work if
>> > > the queue is full. I have been thinking about this as well but I don't
>> > think
>> > > that it  works well in our case.
>> > > - The first reason is the one mentioned by Jun. We actually want to be
>> > able
>> > > to limit specific clients (the misbehaving ones) in a multi-tenant
>> > > environment.
>> > > - The second reason is that, at least in our current implementation,
>> the
>> > > length of the queue is not really a good characteristic to estimate
>> the
>> > load.
>> > > Coming back to your example of the CreateTopicsRequest. They create
>> path
>> > >  in ZK for each newly created topics which trigger a ChangeTopic event
>> > in
>> > > the controller. That single event could be for a single topic in some
>> > cases or
>> > > for a thousand topics in others.
>> > > These two reasons aside, bounding the queue also introduces a knob
>> which
>> > > requires some tuning and thus suffers from all the points you
>> mentioned
>> > as
>> > > well, isn't it? The value may be true for one version but not for
>> > another.
>> > >
>> > > 4. Regarding the integration with KIP-500. The implementation of this
>> KIP
>> > > will span across the ApiLayer and the AdminManager. To be honest, we
>> > > can influence the implementation to work best with what you have in
>> mind
>> > > for the future controller. If you could shed some more light on the
>> > future
>> > > internal architecture, I can take this into account during the
>> > > implementation.
>> > >
>> >
>> > Hi David,
>> >
>> > Good question.  The approach we are thinking of is that in the future,
>> > topic creation will be a controller RPC.  In other words, rather than
>> > changing ZK and waiting for the controller code to notice, we'll go
>> through
>> > the controller code 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-10 Thread David Jacot
Hi Colin and Jun,

I have no problem if we have to rewrite part of it when the new controller
comes
out. I will be more than happy to help out.

Regarding KIP-590, I think that we can cope with a principal as a string
when the
time comes. The user entity name is defined with a string already.

Regarding the name of the error, you have made a good point. I do agree
that it
is important to differentiate the two cases. I propose the following two
errors:
- THROTTLING_QUOTA_EXCEEDED - Throttling is slightly better than rate as
we have quotas which are not rate (e.g. request quota). This one is
retryable
once the throttle time is passed.
- LIMIT_QUOTA_EXCEEDED - This one would indicate that the limit has been
reached and is a final error.
We only need the former in this KIP. What do you think?

Jun, I have added a few examples in the KIP. The new name works exactly like
the existing one once it is added to the accepted dynamic configs for the
user
and the client entities. I have added a "Kafka Config Command" chapter in
the
KIP. I will also open a Jira to not forget updating the AK documentation
once
the KIP gets merged.

Thanks,
David

On Wed, Jun 10, 2020 at 3:03 AM Jun Rao  wrote:

> Hi, Colin,
>
> Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this clear.
>
> Hi, David,
>
> We added a new quota name in the KIP. You chose not to bump up the version
> of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok since the
> quota name is represented as a string. However, the new quota name can be
> used in client tools for setting and listing the quota (
> https://kafka.apache.org/documentation/#quotas). Could you document how
> the
> new name will be used in those tools?
>
> Thanks,
>
> Jun
>
> On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe  wrote:
>
> > On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
> > > Hi Colin,
> > >
> > > Thank you for your feedback.
> > >
> > > Jun has summarized the situation pretty well. Thanks Jun! I would like
> to
> > > complement it with the following points:
> > >
> > > 1. Indeed, when the quota is exceeded, the broker will reject the topic
> > > creations, partition creations and topics deletions that are exceeding
> > > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> > > be populated accordingly to let the client know how long it must wait.
> > >
> > > 2. I do agree that we actually want a mechanism to apply back pressure
> > > to the clients. The KIP basically proposes a mechanism to control and
> to
> > > limit the rate of operations before entering the controller. I think
> that
> > > it is similar to your thinking but is enforced based on a defined
> > > instead of relying on the number of pending items in the controller.
> > >
> > > 3. You mentioned an alternative idea in your comments that, if I
> > understood
> > > correctly, would bound the queue to limit the overload and reject work
> if
> > > the queue is full. I have been thinking about this as well but I don't
> > think
> > > that it  works well in our case.
> > > - The first reason is the one mentioned by Jun. We actually want to be
> > able
> > > to limit specific clients (the misbehaving ones) in a multi-tenant
> > > environment.
> > > - The second reason is that, at least in our current implementation,
> the
> > > length of the queue is not really a good characteristic to estimate the
> > load.
> > > Coming back to your example of the CreateTopicsRequest. They create
> path
> > >  in ZK for each newly created topics which trigger a ChangeTopic event
> > in
> > > the controller. That single event could be for a single topic in some
> > cases or
> > > for a thousand topics in others.
> > > These two reasons aside, bounding the queue also introduces a knob
> which
> > > requires some tuning and thus suffers from all the points you mentioned
> > as
> > > well, isn't it? The value may be true for one version but not for
> > another.
> > >
> > > 4. Regarding the integration with KIP-500. The implementation of this
> KIP
> > > will span across the ApiLayer and the AdminManager. To be honest, we
> > > can influence the implementation to work best with what you have in
> mind
> > > for the future controller. If you could shed some more light on the
> > future
> > > internal architecture, I can take this into account during the
> > > implementation.
> > >
> >
> > Hi David,
> >
> > Good question.  The approach we are thinking of is that in the future,
> > topic creation will be a controller RPC.  In other words, rather than
> > changing ZK and waiting for the controller code to notice, we'll go
> through
> > the controller code (which may change ZK, or may do something else in the
> > ZK-free environment).
> >
> > I don't think there is a good way to write this in the short term that
> > avoids having to rewrite in the long term.  That's probably OK though, as
> > long as we keep in mind that we'll need to.
> >
> > >
> > > 5. Regarding KIP-590. For the create topics request, create 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread Jun Rao
Hi, Colin,

Good point. Maybe sth like THROTTLING_QUOTA_VIOLATED will make this clear.

Hi, David,

We added a new quota name in the KIP. You chose not to bump up the version
of DESCRIBE_CLIENT_QUOTAS and ALTER_CLIENT_QUOTAS, which seems ok since the
quota name is represented as a string. However, the new quota name can be
used in client tools for setting and listing the quota (
https://kafka.apache.org/documentation/#quotas). Could you document how the
new name will be used in those tools?

Thanks,

Jun

On Tue, Jun 9, 2020 at 3:37 PM Colin McCabe  wrote:

> On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
> > Hi Colin,
> >
> > Thank you for your feedback.
> >
> > Jun has summarized the situation pretty well. Thanks Jun! I would like to
> > complement it with the following points:
> >
> > 1. Indeed, when the quota is exceeded, the broker will reject the topic
> > creations, partition creations and topics deletions that are exceeding
> > with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> > be populated accordingly to let the client know how long it must wait.
> >
> > 2. I do agree that we actually want a mechanism to apply back pressure
> > to the clients. The KIP basically proposes a mechanism to control and to
> > limit the rate of operations before entering the controller. I think that
> > it is similar to your thinking but is enforced based on a defined
> > instead of relying on the number of pending items in the controller.
> >
> > 3. You mentioned an alternative idea in your comments that, if I
> understood
> > correctly, would bound the queue to limit the overload and reject work if
> > the queue is full. I have been thinking about this as well but I don't
> think
> > that it  works well in our case.
> > - The first reason is the one mentioned by Jun. We actually want to be
> able
> > to limit specific clients (the misbehaving ones) in a multi-tenant
> > environment.
> > - The second reason is that, at least in our current implementation, the
> > length of the queue is not really a good characteristic to estimate the
> load.
> > Coming back to your example of the CreateTopicsRequest. They create path
> >  in ZK for each newly created topics which trigger a ChangeTopic event
> in
> > the controller. That single event could be for a single topic in some
> cases or
> > for a thousand topics in others.
> > These two reasons aside, bounding the queue also introduces a knob which
> > requires some tuning and thus suffers from all the points you mentioned
> as
> > well, isn't it? The value may be true for one version but not for
> another.
> >
> > 4. Regarding the integration with KIP-500. The implementation of this KIP
> > will span across the ApiLayer and the AdminManager. To be honest, we
> > can influence the implementation to work best with what you have in mind
> > for the future controller. If you could shed some more light on the
> future
> > internal architecture, I can take this into account during the
> > implementation.
> >
>
> Hi David,
>
> Good question.  The approach we are thinking of is that in the future,
> topic creation will be a controller RPC.  In other words, rather than
> changing ZK and waiting for the controller code to notice, we'll go through
> the controller code (which may change ZK, or may do something else in the
> ZK-free environment).
>
> I don't think there is a good way to write this in the short term that
> avoids having to rewrite in the long term.  That's probably OK though, as
> long as we keep in mind that we'll need to.
>
> >
> > 5. Regarding KIP-590. For the create topics request, create partitions
> > request, and delete topics request, we are lucky enough to have directed
> > all of them to the controller already so we are fine with doing the
> admission
> > in the broker which hosts the controller. If we were to throttle more
> operations
> > in the future,
> > I believe that we can continue to do it centrally while leveraging the
> > principal that is forwarded to account for the right tenant. The reason
> > why I would like to keep it central is that we are talking about sparse
> (or bursty)
> > workloads here so distributing the quota among all the brokers makes
> little sense.
> >
>
> Right.  The main requirement here is that the quota must operate based on
> principal names rather than KafkaPrincipal objects.  We had a long
> discussion about KIP-590 and eventually concluded that we didn't want to
> make KafkaPrincipal serializable (at least not yet) so what would be
> forwarded is just a string, not the principal object.
>
> >
> > 6. Regarding the naming of the new error code. BUSY sounds too generic to
> > me so I would rather prefer a specific error code. The main reason being
> > that we may be able to reuse it in the future for other quotas. That
> being said,
> > we can find another name. QUOTA_EXHAUSTED? I don't feel too strongly
> about
> > the naming. I wonder what the others think about this.
> >
>
> Think about if you're 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread Colin McCabe
On Tue, Jun 9, 2020, at 05:06, David Jacot wrote:
> Hi Colin,
> 
> Thank you for your feedback.
> 
> Jun has summarized the situation pretty well. Thanks Jun! I would like to
> complement it with the following points:
> 
> 1. Indeed, when the quota is exceeded, the broker will reject the topic
> creations, partition creations and topics deletions that are exceeding
> with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> be populated accordingly to let the client know how long it must wait.
> 
> 2. I do agree that we actually want a mechanism to apply back pressure
> to the clients. The KIP basically proposes a mechanism to control and to
> limit the rate of operations before entering the controller. I think that
> it is similar to your thinking but is enforced based on a defined
> instead of relying on the number of pending items in the controller.
> 
> 3. You mentioned an alternative idea in your comments that, if I understood
> correctly, would bound the queue to limit the overload and reject work if
> the queue is full. I have been thinking about this as well but I don't think
> that it  works well in our case.
> - The first reason is the one mentioned by Jun. We actually want to be able
> to limit specific clients (the misbehaving ones) in a multi-tenant
> environment.
> - The second reason is that, at least in our current implementation, the
> length of the queue is not really a good characteristic to estimate the load.
> Coming back to your example of the CreateTopicsRequest. They create path
>  in ZK for each newly created topics which trigger a ChangeTopic event in 
> the controller. That single event could be for a single topic in some cases or
> for a thousand topics in others.
> These two reasons aside, bounding the queue also introduces a knob which
> requires some tuning and thus suffers from all the points you mentioned as
> well, isn't it? The value may be true for one version but not for another.
> 
> 4. Regarding the integration with KIP-500. The implementation of this KIP
> will span across the ApiLayer and the AdminManager. To be honest, we
> can influence the implementation to work best with what you have in mind
> for the future controller. If you could shed some more light on the future
> internal architecture, I can take this into account during the
> implementation.
>

Hi David,

Good question.  The approach we are thinking of is that in the future, topic 
creation will be a controller RPC.  In other words, rather than changing ZK and 
waiting for the controller code to notice, we'll go through the controller code 
(which may change ZK, or may do something else in the ZK-free environment).

I don't think there is a good way to write this in the short term that avoids 
having to rewrite in the long term.  That's probably OK though, as long as we 
keep in mind that we'll need to.

> 
> 5. Regarding KIP-590. For the create topics request, create partitions
> request, and delete topics request, we are lucky enough to have directed
> all of them to the controller already so we are fine with doing the admission
> in the broker which hosts the controller. If we were to throttle more 
> operations
> in the future,
> I believe that we can continue to do it centrally while leveraging the
> principal that is forwarded to account for the right tenant. The reason 
> why I would like to keep it central is that we are talking about sparse (or 
> bursty)
> workloads here so distributing the quota among all the brokers makes little 
> sense.
> 

Right.  The main requirement here is that the quota must operate based on 
principal names rather than KafkaPrincipal objects.  We had a long discussion 
about KIP-590 and eventually concluded that we didn't want to make 
KafkaPrincipal serializable (at least not yet) so what would be forwarded is 
just a string, not the principal object.

>
> 6. Regarding the naming of the new error code. BUSY sounds too generic to
> me so I would rather prefer a specific error code. The main reason being
> that we may be able to reuse it in the future for other quotas. That being 
> said,
> we can find another name. QUOTA_EXHAUSTED? I don't feel too strongly about
> the naming. I wonder what the others think about this.
> 

Think about if you're someone writing software that uses the Kafka client.  
Let's say you try to create some topics and get back QUOTA_VIOLATED.  What does 
it mean?  Maybe it means that you tried to create too many topics in too short 
a time (violating the controller throttle).  Or maybe it means that you 
exceeded your quota specifying the number of partitions that they are allowed 
to create (let's assume that someone did a follow-on KIP for that that reuses 
this error code for that.)

But now you have a dilemma.  If the controller was just busy, then trying again 
is the right thing to do.  If there is some other quota you violated (number of 
partitions, number of topics, etc.) then retrying is wasteful and will not 
accomplish 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread Jun Rao
Hi, David,

Sounds good then.

Thanks,

Jun

On Tue, Jun 9, 2020 at 10:59 AM David Jacot  wrote:

> Hi Jun,
>
> Both are already in the KIP, see "New Broker Configurations" chapter. I
> think
> that we need them in order to be able to define different burst for the new
> quota.
>
> Best,
> David
>
> On Tue, Jun 9, 2020 at 7:48 PM Jun Rao  wrote:
>
> > Hi, David,
> >
> > Another thing. Should we add controller.quota.window.size.seconds and
> > controller.quota.window.num
> > or just reuse the existing quota.window.size.seconds and quota.window.num
> > that are used for other types of quotas?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Jun 9, 2020 at 10:30 AM Jun Rao  wrote:
> >
> > > Hi, David,
> > >
> > > Thanks for the KIP. The name QUOTA_VIOLATED sounds reasonable to me. +1
> > on
> > > the KIP.
> > >
> > > Jun
> > >
> > > On Tue, Jun 9, 2020 at 5:07 AM David Jacot 
> wrote:
> > >
> > >> Hi Colin,
> > >>
> > >> Thank you for your feedback.
> > >>
> > >> Jun has summarized the situation pretty well. Thanks Jun! I would like
> > to
> > >> complement it with the following points:
> > >>
> > >> 1. Indeed, when the quota is exceeded, the broker will reject the
> topic
> > >> creations, partition creations and topics deletions that are exceeding
> > >> with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> > >> be populated accordingly to let the client know how long it must wait.
> > >>
> > >> 2. I do agree that we actually want a mechanism to apply back pressure
> > >> to the clients. The KIP basically proposes a mechanism to control and
> to
> > >> limit the rate of operations before entering the controller. I think
> > that
> > >> it is
> > >> similar to your thinking but is enforced based on a defined quota
> > instead
> > >> of relying on the number of pending items in the controller.
> > >>
> > >> 3. You mentioned an alternative idea in your comments that, if I
> > >> understood
> > >> correctly, would bound the queue to limit the overload and reject work
> > if
> > >> the
> > >> queue is full. I have been thinking about this as well but I don't
> think
> > >> that it
> > >> works well in our case.
> > >> - The first reason is the one mentioned by Jun. We actually want to be
> > >> able
> > >> to limit specific clients (the misbehaving ones) in a multi-tenant
> > >> environment.
> > >> - The second reason is that, at least in our current implementation,
> the
> > >> length of
> > >> the queue is not really a good characteristic to estimate the load.
> > >> Coming back
> > >> to your example of the CreateTopicsRequest. They create path in ZK for
> > >> each
> > >> newly created topics which trigger a ChangeTopic event in the
> > controller.
> > >> That
> > >> single event could be for a single topic in some cases or for a
> thousand
> > >> topics
> > >> in others.
> > >> These two reasons aside, bounding the queue also introduces a knob
> which
> > >> requires some tuning and thus suffers from all the points you
> mentioned
> > as
> > >> well, isn't it? The value may be true for one version but not for
> > another.
> > >>
> > >> 4. Regarding the integration with KIP-500. The implementation of this
> > KIP
> > >> will span across the ApiLayer and the AdminManager. To be honest, we
> > >> can influence the implementation to work best with what you have in
> mind
> > >> for the future controller. If you could shed some more light on the
> > future
> > >> internal architecture, I can take this into account during the
> > >> implementation.
> > >>
> > >> 5. Regarding KIP-590. For the create topics request, create partitions
> > >> request,
> > >> and delete topics request, we are lucky enough to have directed all of
> > >> them
> > >> to
> > >> the controller already so we are fine with doing the admission in the
> > >> broker
> > >> which hosts the controller. If we were to throttle more operations in
> > the
> > >> future,
> > >> I believe that we can continue to do it centrally while leveraging the
> > >> principal
> > >> that is forwarded to account for the right tenant. The reason why I
> > would
> > >> like
> > >> to keep it central is that we are talking about sparse (or bursty)
> > >> workloads here
> > >> so distributing the quota among all the brokers makes little sense.
> > >>
> > >> 6. Regarding the naming of the new error code. BUSY sounds too generic
> > to
> > >> me so I would rather prefer a specific error code. The main reason
> being
> > >> that
> > >> we may be able to reuse it in the future for other quotas. That being
> > >> said,
> > >> we
> > >> can find another name. QUOTA_EXHAUSTED? I don't feel too strongly
> about
> > >> the naming. I wonder what the others think about this.
> > >>
> > >> Voilà. I hope that I have addressed all your questions/points and I am
> > >> sorry for
> > >> the lengthy email.
> > >>
> > >> Regards,
> > >> David
> > >>
> > >>
> > >> On Tue, Jun 9, 2020 at 2:13 AM Colin McCabe 
> wrote:
> > >>
> > >> > On Mon, Jun 8, 2020, at 14:41, Jun Rao wrote:

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread David Jacot
Hi Jun,

Both are already in the KIP, see "New Broker Configurations" chapter. I
think
that we need them in order to be able to define different burst for the new
quota.

Best,
David

On Tue, Jun 9, 2020 at 7:48 PM Jun Rao  wrote:

> Hi, David,
>
> Another thing. Should we add controller.quota.window.size.seconds and
> controller.quota.window.num
> or just reuse the existing quota.window.size.seconds and quota.window.num
> that are used for other types of quotas?
>
> Thanks,
>
> Jun
>
> On Tue, Jun 9, 2020 at 10:30 AM Jun Rao  wrote:
>
> > Hi, David,
> >
> > Thanks for the KIP. The name QUOTA_VIOLATED sounds reasonable to me. +1
> on
> > the KIP.
> >
> > Jun
> >
> > On Tue, Jun 9, 2020 at 5:07 AM David Jacot  wrote:
> >
> >> Hi Colin,
> >>
> >> Thank you for your feedback.
> >>
> >> Jun has summarized the situation pretty well. Thanks Jun! I would like
> to
> >> complement it with the following points:
> >>
> >> 1. Indeed, when the quota is exceeded, the broker will reject the topic
> >> creations, partition creations and topics deletions that are exceeding
> >> with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> >> be populated accordingly to let the client know how long it must wait.
> >>
> >> 2. I do agree that we actually want a mechanism to apply back pressure
> >> to the clients. The KIP basically proposes a mechanism to control and to
> >> limit the rate of operations before entering the controller. I think
> that
> >> it is
> >> similar to your thinking but is enforced based on a defined quota
> instead
> >> of relying on the number of pending items in the controller.
> >>
> >> 3. You mentioned an alternative idea in your comments that, if I
> >> understood
> >> correctly, would bound the queue to limit the overload and reject work
> if
> >> the
> >> queue is full. I have been thinking about this as well but I don't think
> >> that it
> >> works well in our case.
> >> - The first reason is the one mentioned by Jun. We actually want to be
> >> able
> >> to limit specific clients (the misbehaving ones) in a multi-tenant
> >> environment.
> >> - The second reason is that, at least in our current implementation, the
> >> length of
> >> the queue is not really a good characteristic to estimate the load.
> >> Coming back
> >> to your example of the CreateTopicsRequest. They create path in ZK for
> >> each
> >> newly created topics which trigger a ChangeTopic event in the
> controller.
> >> That
> >> single event could be for a single topic in some cases or for a thousand
> >> topics
> >> in others.
> >> These two reasons aside, bounding the queue also introduces a knob which
> >> requires some tuning and thus suffers from all the points you mentioned
> as
> >> well, isn't it? The value may be true for one version but not for
> another.
> >>
> >> 4. Regarding the integration with KIP-500. The implementation of this
> KIP
> >> will span across the ApiLayer and the AdminManager. To be honest, we
> >> can influence the implementation to work best with what you have in mind
> >> for the future controller. If you could shed some more light on the
> future
> >> internal architecture, I can take this into account during the
> >> implementation.
> >>
> >> 5. Regarding KIP-590. For the create topics request, create partitions
> >> request,
> >> and delete topics request, we are lucky enough to have directed all of
> >> them
> >> to
> >> the controller already so we are fine with doing the admission in the
> >> broker
> >> which hosts the controller. If we were to throttle more operations in
> the
> >> future,
> >> I believe that we can continue to do it centrally while leveraging the
> >> principal
> >> that is forwarded to account for the right tenant. The reason why I
> would
> >> like
> >> to keep it central is that we are talking about sparse (or bursty)
> >> workloads here
> >> so distributing the quota among all the brokers makes little sense.
> >>
> >> 6. Regarding the naming of the new error code. BUSY sounds too generic
> to
> >> me so I would rather prefer a specific error code. The main reason being
> >> that
> >> we may be able to reuse it in the future for other quotas. That being
> >> said,
> >> we
> >> can find another name. QUOTA_EXHAUSTED? I don't feel too strongly about
> >> the naming. I wonder what the others think about this.
> >>
> >> Voilà. I hope that I have addressed all your questions/points and I am
> >> sorry for
> >> the lengthy email.
> >>
> >> Regards,
> >> David
> >>
> >>
> >> On Tue, Jun 9, 2020 at 2:13 AM Colin McCabe  wrote:
> >>
> >> > On Mon, Jun 8, 2020, at 14:41, Jun Rao wrote:
> >> > > Hi, Colin,
> >> > >
> >> > > Thanks for the comment. You brought up several points.
> >> > >
> >> > > 1. Should we set up a per user quota? To me, it does seem we need
> some
> >> > sort
> >> > > of a quota. When the controller runs out of resources, ideally, we
> >> only
> >> > > want to penalize the bad behaving applications, instead of every
> >> > > application. To do that, 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread Jun Rao
Hi, David,

Another thing. Should we add controller.quota.window.size.seconds and
controller.quota.window.num
or just reuse the existing quota.window.size.seconds and quota.window.num
that are used for other types of quotas?

Thanks,

Jun

On Tue, Jun 9, 2020 at 10:30 AM Jun Rao  wrote:

> Hi, David,
>
> Thanks for the KIP. The name QUOTA_VIOLATED sounds reasonable to me. +1 on
> the KIP.
>
> Jun
>
> On Tue, Jun 9, 2020 at 5:07 AM David Jacot  wrote:
>
>> Hi Colin,
>>
>> Thank you for your feedback.
>>
>> Jun has summarized the situation pretty well. Thanks Jun! I would like to
>> complement it with the following points:
>>
>> 1. Indeed, when the quota is exceeded, the broker will reject the topic
>> creations, partition creations and topics deletions that are exceeding
>> with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
>> be populated accordingly to let the client know how long it must wait.
>>
>> 2. I do agree that we actually want a mechanism to apply back pressure
>> to the clients. The KIP basically proposes a mechanism to control and to
>> limit the rate of operations before entering the controller. I think that
>> it is
>> similar to your thinking but is enforced based on a defined quota instead
>> of relying on the number of pending items in the controller.
>>
>> 3. You mentioned an alternative idea in your comments that, if I
>> understood
>> correctly, would bound the queue to limit the overload and reject work if
>> the
>> queue is full. I have been thinking about this as well but I don't think
>> that it
>> works well in our case.
>> - The first reason is the one mentioned by Jun. We actually want to be
>> able
>> to limit specific clients (the misbehaving ones) in a multi-tenant
>> environment.
>> - The second reason is that, at least in our current implementation, the
>> length of
>> the queue is not really a good characteristic to estimate the load.
>> Coming back
>> to your example of the CreateTopicsRequest. They create path in ZK for
>> each
>> newly created topics which trigger a ChangeTopic event in the controller.
>> That
>> single event could be for a single topic in some cases or for a thousand
>> topics
>> in others.
>> These two reasons aside, bounding the queue also introduces a knob which
>> requires some tuning and thus suffers from all the points you mentioned as
>> well, isn't it? The value may be true for one version but not for another.
>>
>> 4. Regarding the integration with KIP-500. The implementation of this KIP
>> will span across the ApiLayer and the AdminManager. To be honest, we
>> can influence the implementation to work best with what you have in mind
>> for the future controller. If you could shed some more light on the future
>> internal architecture, I can take this into account during the
>> implementation.
>>
>> 5. Regarding KIP-590. For the create topics request, create partitions
>> request,
>> and delete topics request, we are lucky enough to have directed all of
>> them
>> to
>> the controller already so we are fine with doing the admission in the
>> broker
>> which hosts the controller. If we were to throttle more operations in the
>> future,
>> I believe that we can continue to do it centrally while leveraging the
>> principal
>> that is forwarded to account for the right tenant. The reason why I would
>> like
>> to keep it central is that we are talking about sparse (or bursty)
>> workloads here
>> so distributing the quota among all the brokers makes little sense.
>>
>> 6. Regarding the naming of the new error code. BUSY sounds too generic to
>> me so I would rather prefer a specific error code. The main reason being
>> that
>> we may be able to reuse it in the future for other quotas. That being
>> said,
>> we
>> can find another name. QUOTA_EXHAUSTED? I don't feel too strongly about
>> the naming. I wonder what the others think about this.
>>
>> Voilà. I hope that I have addressed all your questions/points and I am
>> sorry for
>> the lengthy email.
>>
>> Regards,
>> David
>>
>>
>> On Tue, Jun 9, 2020 at 2:13 AM Colin McCabe  wrote:
>>
>> > On Mon, Jun 8, 2020, at 14:41, Jun Rao wrote:
>> > > Hi, Colin,
>> > >
>> > > Thanks for the comment. You brought up several points.
>> > >
>> > > 1. Should we set up a per user quota? To me, it does seem we need some
>> > sort
>> > > of a quota. When the controller runs out of resources, ideally, we
>> only
>> > > want to penalize the bad behaving applications, instead of every
>> > > application. To do that, we will need to know what each application is
>> > > entitled to and the per user quota is intended to capture that.
>> > >
>> > > 2. How easy is it to configure a quota? The following is how an admin
>> > > typically sets up a quota in our existing systems. Pick a generous
>> > default
>> > > per user quota works for most applications. For the few resource
>> > intensive
>> > > applications, customize a higher quota for them. Reserve enough
>> resources
>> > > in anticipation that a single 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread Jun Rao
Hi, David,

Thanks for the KIP. The name QUOTA_VIOLATED sounds reasonable to me. +1 on
the KIP.

Jun

On Tue, Jun 9, 2020 at 5:07 AM David Jacot  wrote:

> Hi Colin,
>
> Thank you for your feedback.
>
> Jun has summarized the situation pretty well. Thanks Jun! I would like to
> complement it with the following points:
>
> 1. Indeed, when the quota is exceeded, the broker will reject the topic
> creations, partition creations and topics deletions that are exceeding
> with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
> be populated accordingly to let the client know how long it must wait.
>
> 2. I do agree that we actually want a mechanism to apply back pressure
> to the clients. The KIP basically proposes a mechanism to control and to
> limit the rate of operations before entering the controller. I think that
> it is
> similar to your thinking but is enforced based on a defined quota instead
> of relying on the number of pending items in the controller.
>
> 3. You mentioned an alternative idea in your comments that, if I understood
> correctly, would bound the queue to limit the overload and reject work if
> the
> queue is full. I have been thinking about this as well but I don't think
> that it
> works well in our case.
> - The first reason is the one mentioned by Jun. We actually want to be able
> to limit specific clients (the misbehaving ones) in a multi-tenant
> environment.
> - The second reason is that, at least in our current implementation, the
> length of
> the queue is not really a good characteristic to estimate the load.
> Coming back
> to your example of the CreateTopicsRequest. They create path in ZK for each
> newly created topics which trigger a ChangeTopic event in the controller.
> That
> single event could be for a single topic in some cases or for a thousand
> topics
> in others.
> These two reasons aside, bounding the queue also introduces a knob which
> requires some tuning and thus suffers from all the points you mentioned as
> well, isn't it? The value may be true for one version but not for another.
>
> 4. Regarding the integration with KIP-500. The implementation of this KIP
> will span across the ApiLayer and the AdminManager. To be honest, we
> can influence the implementation to work best with what you have in mind
> for the future controller. If you could shed some more light on the future
> internal architecture, I can take this into account during the
> implementation.
>
> 5. Regarding KIP-590. For the create topics request, create partitions
> request,
> and delete topics request, we are lucky enough to have directed all of them
> to
> the controller already so we are fine with doing the admission in the
> broker
> which hosts the controller. If we were to throttle more operations in the
> future,
> I believe that we can continue to do it centrally while leveraging the
> principal
> that is forwarded to account for the right tenant. The reason why I would
> like
> to keep it central is that we are talking about sparse (or bursty)
> workloads here
> so distributing the quota among all the brokers makes little sense.
>
> 6. Regarding the naming of the new error code. BUSY sounds too generic to
> me so I would rather prefer a specific error code. The main reason being
> that
> we may be able to reuse it in the future for other quotas. That being said,
> we
> can find another name. QUOTA_EXHAUSTED? I don't feel too strongly about
> the naming. I wonder what the others think about this.
>
> Voilà. I hope that I have addressed all your questions/points and I am
> sorry for
> the lengthy email.
>
> Regards,
> David
>
>
> On Tue, Jun 9, 2020 at 2:13 AM Colin McCabe  wrote:
>
> > On Mon, Jun 8, 2020, at 14:41, Jun Rao wrote:
> > > Hi, Colin,
> > >
> > > Thanks for the comment. You brought up several points.
> > >
> > > 1. Should we set up a per user quota? To me, it does seem we need some
> > sort
> > > of a quota. When the controller runs out of resources, ideally, we only
> > > want to penalize the bad behaving applications, instead of every
> > > application. To do that, we will need to know what each application is
> > > entitled to and the per user quota is intended to capture that.
> > >
> > > 2. How easy is it to configure a quota? The following is how an admin
> > > typically sets up a quota in our existing systems. Pick a generous
> > default
> > > per user quota works for most applications. For the few resource
> > intensive
> > > applications, customize a higher quota for them. Reserve enough
> resources
> > > in anticipation that a single (or a few) application will exceed the
> > quota
> > > at a given time.
> > >
> >
> > Hi Jun,
> >
> > Thanks for the response.
> >
> > Maybe I was too pessimistic about the ability of admins to configure a
> > useful quota here.  I do agree that it would be nice to have the ability
> to
> > set different quotas for different users, as you mentioned.
> >
> > >
> > > 3. How should the quota be defined? In the discussion 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-09 Thread David Jacot
Hi Colin,

Thank you for your feedback.

Jun has summarized the situation pretty well. Thanks Jun! I would like to
complement it with the following points:

1. Indeed, when the quota is exceeded, the broker will reject the topic
creations, partition creations and topics deletions that are exceeding
with the new QUOTA_VIOLATED error. The ThrottleTimeMs field will
be populated accordingly to let the client know how long it must wait.

2. I do agree that we actually want a mechanism to apply back pressure
to the clients. The KIP basically proposes a mechanism to control and to
limit the rate of operations before entering the controller. I think that
it is
similar to your thinking but is enforced based on a defined quota instead
of relying on the number of pending items in the controller.

3. You mentioned an alternative idea in your comments that, if I understood
correctly, would bound the queue to limit the overload and reject work if
the
queue is full. I have been thinking about this as well but I don't think
that it
works well in our case.
- The first reason is the one mentioned by Jun. We actually want to be able
to limit specific clients (the misbehaving ones) in a multi-tenant
environment.
- The second reason is that, at least in our current implementation, the
length of
the queue is not really a good characteristic to estimate the load.
Coming back
to your example of the CreateTopicsRequest. They create path in ZK for each
newly created topics which trigger a ChangeTopic event in the controller.
That
single event could be for a single topic in some cases or for a thousand
topics
in others.
These two reasons aside, bounding the queue also introduces a knob which
requires some tuning and thus suffers from all the points you mentioned as
well, isn't it? The value may be true for one version but not for another.

4. Regarding the integration with KIP-500. The implementation of this KIP
will span across the ApiLayer and the AdminManager. To be honest, we
can influence the implementation to work best with what you have in mind
for the future controller. If you could shed some more light on the future
internal architecture, I can take this into account during the
implementation.

5. Regarding KIP-590. For the create topics request, create partitions
request,
and delete topics request, we are lucky enough to have directed all of them
to
the controller already so we are fine with doing the admission in the broker
which hosts the controller. If we were to throttle more operations in the
future,
I believe that we can continue to do it centrally while leveraging the
principal
that is forwarded to account for the right tenant. The reason why I would
like
to keep it central is that we are talking about sparse (or bursty)
workloads here
so distributing the quota among all the brokers makes little sense.

6. Regarding the naming of the new error code. BUSY sounds too generic to
me so I would rather prefer a specific error code. The main reason being
that
we may be able to reuse it in the future for other quotas. That being said,
we
can find another name. QUOTA_EXHAUSTED? I don't feel too strongly about
the naming. I wonder what the others think about this.

Voilà. I hope that I have addressed all your questions/points and I am
sorry for
the lengthy email.

Regards,
David


On Tue, Jun 9, 2020 at 2:13 AM Colin McCabe  wrote:

> On Mon, Jun 8, 2020, at 14:41, Jun Rao wrote:
> > Hi, Colin,
> >
> > Thanks for the comment. You brought up several points.
> >
> > 1. Should we set up a per user quota? To me, it does seem we need some
> sort
> > of a quota. When the controller runs out of resources, ideally, we only
> > want to penalize the bad behaving applications, instead of every
> > application. To do that, we will need to know what each application is
> > entitled to and the per user quota is intended to capture that.
> >
> > 2. How easy is it to configure a quota? The following is how an admin
> > typically sets up a quota in our existing systems. Pick a generous
> default
> > per user quota works for most applications. For the few resource
> intensive
> > applications, customize a higher quota for them. Reserve enough resources
> > in anticipation that a single (or a few) application will exceed the
> quota
> > at a given time.
> >
>
> Hi Jun,
>
> Thanks for the response.
>
> Maybe I was too pessimistic about the ability of admins to configure a
> useful quota here.  I do agree that it would be nice to have the ability to
> set different quotas for different users, as you mentioned.
>
> >
> > 3. How should the quota be defined? In the discussion thread, we debated
> > between a usage based model vs a rate based model. Dave and Anna argued
> for
> > the rate based model mostly because it's simpler to implement.
> >
>
> I'm trying to think more about how this integrates with our plans for
> KIP-500.  When we get rid of ZK, we will have to handle this in the
> controller itself, rather than in the AdminManager.  That 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-08 Thread Colin McCabe
On Mon, Jun 8, 2020, at 14:41, Jun Rao wrote:
> Hi, Colin,
> 
> Thanks for the comment. You brought up several points.
> 
> 1. Should we set up a per user quota? To me, it does seem we need some sort
> of a quota. When the controller runs out of resources, ideally, we only
> want to penalize the bad behaving applications, instead of every
> application. To do that, we will need to know what each application is
> entitled to and the per user quota is intended to capture that.
> 
> 2. How easy is it to configure a quota? The following is how an admin
> typically sets up a quota in our existing systems. Pick a generous default
> per user quota works for most applications. For the few resource intensive
> applications, customize a higher quota for them. Reserve enough resources
> in anticipation that a single (or a few) application will exceed the quota
> at a given time.
>

Hi Jun,

Thanks for the response.

Maybe I was too pessimistic about the ability of admins to configure a useful 
quota here.  I do agree that it would be nice to have the ability to set 
different quotas for different users, as you mentioned.

> 
> 3. How should the quota be defined? In the discussion thread, we debated
> between a usage based model vs a rate based model. Dave and Anna argued for
> the rate based model mostly because it's simpler to implement.
> 

I'm trying to think more about how this integrates with our plans for KIP-500.  
When we get rid of ZK, we will have to handle this in the controller itself, 
rather than in the AdminManager.  That implies we'll have to rewrite the code.  
Maybe this is worth it if we want this feature now, though.

Another wrinkle here is that as we discussed in KIP-590, controller operations 
will land on a random broker first, and only then be forwarded to the active 
controller.  This implies that either admissions control should happen on all 
brokers (needing some kind of distributed quota scheme), or be done on the 
controller after we've already done the work of forwarding the message.  The 
second approach might not be that bad, but it would be nice to figure this out.

>
> 4. If a quota is exceeded, how is that enforced? My understanding of the
> KIP is that, if a quota is exceeded, the broker immediately sends back
> a QUOTA_VIOLATED error and a throttle time back to the client, and the
> client will wait for the throttle time before issuing the next request.
> This seems to be the same as the BUSY error code you mentioned.
>

Yes, I agree, it sounds like we're thinking along the same lines.  However, 
rather than QUOTA_VIOLATED, how about naming the error code BUSY?  Then the 
error text could indicate the quota that we violated.  This would be more 
generally useful as an error code and also avoid being confusingly similar to 
POLICY_VIOLATION.

best,
Colin

> 
> I will let David chime in more on that.
> 
> Thanks,
> 
> Jun
> 
> 
> 
> On Sun, Jun 7, 2020 at 2:30 PM Colin McCabe  wrote:
> 
> > Hi David,
> >
> > Thanks for the KIP.
> >
> > I thought about this for a while and I actually think this approach is not
> > quite right.  The problem that I see here is that using an explicitly set
> > quota here requires careful tuning by the cluster operator.  Even worse,
> > this tuning might be invalidated by changes in overall conditions or even
> > more efficient controller software.
> >
> > For example, if we empirically find that the controller can do 1000 topics
> > in a minute (or whatever), this tuning might actually be wrong if the next
> > version of the software can do 2000 topics in a minute because of
> > efficiency upgrades.  Or, the broker that the controller is located on
> > might be experiencing heavy load from its non-controller operations, and so
> > it can only do 500 topics in a minute during this period.
> >
> > So the system administrator gets a very obscure tunable (it's not clear to
> > a non-Kafka-developer what "controller mutations" are or why they should
> > care).  And even worse, they will have to significantly "sandbag" the value
> > that they set it to, so that even under the heaviest load and oldest
> > deployed version of the software, the controller can still function.  Even
> > worse, this new quota adds a lot of complexity to the controller.
> >
> > What we really want is backpressure when the controller is overloaded.  I
> > believe this is the alternative you discuss in "Rejected Alternatives"
> > under "Throttle the Execution instead of the Admission"  Your reason for
> > rejecting it is that the client error handling does not work well in this
> > case.  But actually, this is an artifact of our current implementation,
> > rather than a fundamental issue with backpressure.
> >
> > Consider the example of a CreateTopicsRequest.  The controller could
> > return a special error code if the load was too high, and take the create
> > topics event off the controller queue.  Let's call that error code BUSY.
> >  Additionally, the controller could 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-08 Thread Jun Rao
Hi, Colin,

Thanks for the comment. You brought up several points.

1. Should we set up a per user quota? To me, it does seem we need some sort
of a quota. When the controller runs out of resources, ideally, we only
want to penalize the bad behaving applications, instead of every
application. To do that, we will need to know what each application is
entitled to and the per user quota is intended to capture that.

2. How easy is it to configure a quota? The following is how an admin
typically sets up a quota in our existing systems. Pick a generous default
per user quota works for most applications. For the few resource intensive
applications, customize a higher quota for them. Reserve enough resources
in anticipation that a single (or a few) application will exceed the quota
at a given time.

3. How should the quota be defined? In the discussion thread, we debated
between a usage based model vs a rate based model. Dave and Anna argued for
the rate based model mostly because it's simpler to implement.

4. If a quota is exceeded, how is that enforced? My understanding of the
KIP is that, if a quota is exceeded, the broker immediately sends back
a QUOTA_VIOLATED error and a throttle time back to the client, and the
client will wait for the throttle time before issuing the next request.
This seems to be the same as the BUSY error code you mentioned.

I will let David chime in more on that.

Thanks,

Jun



On Sun, Jun 7, 2020 at 2:30 PM Colin McCabe  wrote:

> Hi David,
>
> Thanks for the KIP.
>
> I thought about this for a while and I actually think this approach is not
> quite right.  The problem that I see here is that using an explicitly set
> quota here requires careful tuning by the cluster operator.  Even worse,
> this tuning might be invalidated by changes in overall conditions or even
> more efficient controller software.
>
> For example, if we empirically find that the controller can do 1000 topics
> in a minute (or whatever), this tuning might actually be wrong if the next
> version of the software can do 2000 topics in a minute because of
> efficiency upgrades.  Or, the broker that the controller is located on
> might be experiencing heavy load from its non-controller operations, and so
> it can only do 500 topics in a minute during this period.
>
> So the system administrator gets a very obscure tunable (it's not clear to
> a non-Kafka-developer what "controller mutations" are or why they should
> care).  And even worse, they will have to significantly "sandbag" the value
> that they set it to, so that even under the heaviest load and oldest
> deployed version of the software, the controller can still function.  Even
> worse, this new quota adds a lot of complexity to the controller.
>
> What we really want is backpressure when the controller is overloaded.  I
> believe this is the alternative you discuss in "Rejected Alternatives"
> under "Throttle the Execution instead of the Admission"  Your reason for
> rejecting it is that the client error handling does not work well in this
> case.  But actually, this is an artifact of our current implementation,
> rather than a fundamental issue with backpressure.
>
> Consider the example of a CreateTopicsRequest.  The controller could
> return a special error code if the load was too high, and take the create
> topics event off the controller queue.  Let's call that error code BUSY.
>  Additionally, the controller could immediately refuse new events if the
> queue had reached its maximum length, and simply return BUSY for that case
> as well.
>
> Basically, the way we handle RPC timeouts in the controller right now is
> not very good.  As you know, we time out the RPC, so the client gets
> TimeoutException, but then keep the event on the queue, so that it
> eventually gets executed!  There's no reason why we have to do that.  We
> could take the event off the queue if there is a timeout.  This would
> reduce load and mostly avoid the paradoxical situations you describe
> (getting TopicExistsException for a CreateTopicsRequest retry, etc.)
>
> I say "mostly" because there are still cases where retries could go astray
> (for example if we execute the topic creation but a network problem
> prevents the response from being sent to the client).  But this would still
> be a very big improvement over what we have now.
>
> Sorry for commenting so late on this but I got distracted by some other
> things.  I hope we can figure this one out-- I feel like there is a chance
> to significantly simplify this.
>
> best,
> Colin
>
>
> On Fri, May 29, 2020, at 07:57, David Jacot wrote:
> > Hi folks,
> >
> > I'd like to start the vote for KIP-599 which proposes a new quota to
> > throttle create topic, create partition, and delete topics operations to
> > protect the Kafka controller:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> >
> > Please, let me know what you think.
> >
> > 

Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-07 Thread Colin McCabe
Hi David,

Thanks for the KIP.

I thought about this for a while and I actually think this approach is not 
quite right.  The problem that I see here is that using an explicitly set quota 
here requires careful tuning by the cluster operator.  Even worse, this tuning 
might be invalidated by changes in overall conditions or even more efficient 
controller software.

For example, if we empirically find that the controller can do 1000 topics in a 
minute (or whatever), this tuning might actually be wrong if the next version 
of the software can do 2000 topics in a minute because of efficiency upgrades.  
Or, the broker that the controller is located on might be experiencing heavy 
load from its non-controller operations, and so it can only do 500 topics in a 
minute during this period.

So the system administrator gets a very obscure tunable (it's not clear to a 
non-Kafka-developer what "controller mutations" are or why they should care).  
And even worse, they will have to significantly "sandbag" the value that they 
set it to, so that even under the heaviest load and oldest deployed version of 
the software, the controller can still function.  Even worse, this new quota 
adds a lot of complexity to the controller.

What we really want is backpressure when the controller is overloaded.  I 
believe this is the alternative you discuss in "Rejected Alternatives" under 
"Throttle the Execution instead of the Admission"  Your reason for rejecting it 
is that the client error handling does not work well in this case.  But 
actually, this is an artifact of our current implementation, rather than a 
fundamental issue with backpressure.

Consider the example of a CreateTopicsRequest.  The controller could return a 
special error code if the load was too high, and take the create topics event 
off the controller queue.  Let's call that error code BUSY. 
 Additionally, the controller could immediately refuse new events if the queue 
had reached its maximum length, and simply return BUSY for that case as well.

Basically, the way we handle RPC timeouts in the controller right now is not 
very good.  As you know, we time out the RPC, so the client gets 
TimeoutException, but then keep the event on the queue, so that it eventually 
gets executed!  There's no reason why we have to do that.  We could take the 
event off the queue if there is a timeout.  This would reduce load and mostly 
avoid the paradoxical situations you describe (getting TopicExistsException for 
a CreateTopicsRequest retry, etc.)

I say "mostly" because there are still cases where retries could go astray (for 
example if we execute the topic creation but a network problem prevents the 
response from being sent to the client).  But this would still be a very big 
improvement over what we have now.

Sorry for commenting so late on this but I got distracted by some other things. 
 I hope we can figure this one out-- I feel like there is a chance to 
significantly simplify this.

best,
Colin


On Fri, May 29, 2020, at 07:57, David Jacot wrote:
> Hi folks,
> 
> I'd like to start the vote for KIP-599 which proposes a new quota to
> throttle create topic, create partition, and delete topics operations to
> protect the Kafka controller:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> 
> Please, let me know what you think.
> 
> Cheers,
> David
>


Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-05 Thread Anna Povzner
+1 (not binding)

Thanks for the KIP!

-Anna

On Thu, Jun 4, 2020 at 8:26 AM Mickael Maison 
wrote:

> +1 (binding)
> Thanks David for looking into this important issue
>
> On Thu, Jun 4, 2020 at 3:59 PM Tom Bentley  wrote:
> >
> > +1 (non binding).
> >
> > Thanks!
> >
> > On Wed, Jun 3, 2020 at 3:51 PM Rajini Sivaram 
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks for the KIP, David!
> > >
> > > Regards,
> > >
> > > Rajini
> > >
> > >
> > > On Sun, May 31, 2020 at 3:29 AM Gwen Shapira 
> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Looks great. Thank you for the in-depth design and discussion.
> > > >
> > > > On Fri, May 29, 2020 at 7:58 AM David Jacot 
> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > I'd like to start the vote for KIP-599 which proposes a new quota
> to
> > > > > throttle create topic, create partition, and delete topics
> operations
> > > to
> > > > > protect the Kafka controller:
> > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> > > > >
> > > > > Please, let me know what you think.
> > > > >
> > > > > Cheers,
> > > > > David
> > > > >
> > > >
> > > >
> > > > --
> > > > Gwen Shapira
> > > > Engineering Manager | Confluent
> > > > 650.450.2760 | @gwenshap
> > > > Follow us: Twitter | blog
> > > >
> > >
>


Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-04 Thread Mickael Maison
+1 (binding)
Thanks David for looking into this important issue

On Thu, Jun 4, 2020 at 3:59 PM Tom Bentley  wrote:
>
> +1 (non binding).
>
> Thanks!
>
> On Wed, Jun 3, 2020 at 3:51 PM Rajini Sivaram 
> wrote:
>
> > +1 (binding)
> >
> > Thanks for the KIP, David!
> >
> > Regards,
> >
> > Rajini
> >
> >
> > On Sun, May 31, 2020 at 3:29 AM Gwen Shapira  wrote:
> >
> > > +1 (binding)
> > >
> > > Looks great. Thank you for the in-depth design and discussion.
> > >
> > > On Fri, May 29, 2020 at 7:58 AM David Jacot  wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I'd like to start the vote for KIP-599 which proposes a new quota to
> > > > throttle create topic, create partition, and delete topics operations
> > to
> > > > protect the Kafka controller:
> > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> > > >
> > > > Please, let me know what you think.
> > > >
> > > > Cheers,
> > > > David
> > > >
> > >
> > >
> > > --
> > > Gwen Shapira
> > > Engineering Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter | blog
> > >
> >


Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-04 Thread Tom Bentley
+1 (non binding).

Thanks!

On Wed, Jun 3, 2020 at 3:51 PM Rajini Sivaram 
wrote:

> +1 (binding)
>
> Thanks for the KIP, David!
>
> Regards,
>
> Rajini
>
>
> On Sun, May 31, 2020 at 3:29 AM Gwen Shapira  wrote:
>
> > +1 (binding)
> >
> > Looks great. Thank you for the in-depth design and discussion.
> >
> > On Fri, May 29, 2020 at 7:58 AM David Jacot  wrote:
> >
> > > Hi folks,
> > >
> > > I'd like to start the vote for KIP-599 which proposes a new quota to
> > > throttle create topic, create partition, and delete topics operations
> to
> > > protect the Kafka controller:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> > >
> > > Please, let me know what you think.
> > >
> > > Cheers,
> > > David
> > >
> >
> >
> > --
> > Gwen Shapira
> > Engineering Manager | Confluent
> > 650.450.2760 | @gwenshap
> > Follow us: Twitter | blog
> >
>


Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-06-03 Thread Rajini Sivaram
+1 (binding)

Thanks for the KIP, David!

Regards,

Rajini


On Sun, May 31, 2020 at 3:29 AM Gwen Shapira  wrote:

> +1 (binding)
>
> Looks great. Thank you for the in-depth design and discussion.
>
> On Fri, May 29, 2020 at 7:58 AM David Jacot  wrote:
>
> > Hi folks,
> >
> > I'd like to start the vote for KIP-599 which proposes a new quota to
> > throttle create topic, create partition, and delete topics operations to
> > protect the Kafka controller:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
> >
> > Please, let me know what you think.
> >
> > Cheers,
> > David
> >
>
>
> --
> Gwen Shapira
> Engineering Manager | Confluent
> 650.450.2760 | @gwenshap
> Follow us: Twitter | blog
>


Re: [VOTE] KIP-599: Throttle Create Topic, Create Partition and Delete Topic Operations

2020-05-30 Thread Gwen Shapira
+1 (binding)

Looks great. Thank you for the in-depth design and discussion.

On Fri, May 29, 2020 at 7:58 AM David Jacot  wrote:

> Hi folks,
>
> I'd like to start the vote for KIP-599 which proposes a new quota to
> throttle create topic, create partition, and delete topics operations to
> protect the Kafka controller:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-599%3A+Throttle+Create+Topic%2C+Create+Partition+and+Delete+Topic+Operations
>
> Please, let me know what you think.
>
> Cheers,
> David
>


-- 
Gwen Shapira
Engineering Manager | Confluent
650.450.2760 | @gwenshap
Follow us: Twitter | blog