Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-04-20 Thread Dong Lin
Thanks for the explanation. I agree that it is not easy to have a
well-defined and accurate measurement of the split ratio.


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-04-19 Thread Becket Qin
Thanks for the comment, Dong. I think the batch-split-ratio makes sense, but
it is somewhat redundant given batch-split-rate.

Also, the batch-split-ratio may be a little more involved to get right:
1. An all-time batch split ratio is easy to compute but not that useful.
2. A time-windowed batch-split-ratio is more complicated to make accurate.
This is because it is a "stateful" metric that relies on the number of
batches sent in a time window and the number of batches split in the same
window, but the sending and the splitting do not necessarily fall in the
same window.

Besides, a rough estimate of the batch split ratio can be derived from the
existing metrics. And I think batch-split-rate is already a good indication
of whether batch splitting has caused a performance problem.

So I am not sure it is worth having an explicit batch-split-ratio metric in
this case.

Thanks,

Jiangjie (Becket) Qin
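
[Editor's note] The rough estimate Becket mentions — deriving a split ratio
from existing rate metrics — can be sketched as follows. This is illustrative
Python, not actual Kafka client code; the function and parameter names are
assumptions, not real producer metric APIs.

```python
def rough_split_ratio(batch_split_rate, batch_send_rate):
    """Derive an approximate batch-split ratio from two windowed rates.

    Both arguments are per-second rates measured over the same metrics
    window. The result is only approximate because a batch may be sent in
    one window but split in another, as Becket points out above.
    """
    if batch_send_rate <= 0:
        return 0.0
    # Clamp to 1.0 since window misalignment can make the quotient overshoot.
    return min(1.0, batch_split_rate / batch_send_rate)
```

For example, 2 splits/sec against 100 batches sent/sec suggests roughly 2% of
batches are being split.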


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-22 Thread Dong Lin
Never mind my second comment. I misunderstood the semantics of the
producer's batch.size.


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-22 Thread Dong Lin
Hey Becket,

In addition to the batch-split-rate, should we also add a batch-split-ratio
sensor to gauge the probability that we have to split a batch?

Also, in the case that the batch size configured for the producer is smaller
than the max message size configured for the broker, why can't we just split
the batch if its size exceeds the configured batch size? The benefit of this
approach is that the semantics of the producer are straightforward, because
we enforce the batch size that the user has configured. The implementation
would also be simpler because we don't have to rely on KIP-4 to fetch the
max message size from the broker. I guess you are worried about the overhead
of "unnecessary" splits if a batch's size falls between the user-configured
batch size and the broker's max message size. But is the overhead really a
concern? If the overhead is too large because the user has configured a very
low batch size for the producer, shouldn't the user adjust the producer
config?

Thanks,
Dong
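
[Editor's note] The splitting Dong proposes — break a batch whenever it
exceeds the configured batch size — amounts to a greedy re-packing of the
batch's records. A minimal sketch, in illustrative Python rather than the
actual producer implementation:

```python
def split_batch(records, max_batch_bytes):
    """Greedily split serialized records into sub-batches whose total size
    stays at or under max_batch_bytes.

    A single record larger than the limit still gets its own singleton
    batch, mirroring the fact that an individual oversized message cannot
    be split any further by the producer.
    """
    batches, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > max_batch_bytes:
            # Current sub-batch is full; start a new one.
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches
```

This ignores per-batch framing overhead (headers, compression), which the
real producer would have to account for.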


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-15 Thread Becket Qin
I see, then we are thinking about the same thing :)


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-15 Thread Ismael Juma
I meant finishing what's described in the following section and then
starting a discussion followed by a vote:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest

We have only voted on KIP-4 Metadata, KIP-4 Create Topics, KIP-4 Delete
Topics so far.

Ismael


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-15 Thread Becket Qin
Hi Ismael,

KIP-4 is also the one that I was thinking about. We have introduced a
DescribeConfigRequest there so the producer can easily get the
configurations. By "another KIP" do you mean a new (or maybe extended)
protocol or using that protocol in clients?

Thanks,

Jiangjie (Becket) Qin
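
[Editor's note] The decision logic Becket describes in the quoted discussion
below — fetch the broker's max message size (e.g. via a DescribeConfigs-style
request once KIP-4 lands) and use it to decide whether to split — can be
sketched as follows. The function name and the fallback default are editorial
assumptions; the default mirrors the broker's message.max.bytes default in
that era of Kafka.

```python
# Assumed broker default for message.max.bytes circa Kafka 0.10
# (1,000,000 bytes of payload plus record overhead).
BROKER_DEFAULT_MAX_MESSAGE_BYTES = 1_000_012

def should_split(batch_size_bytes, broker_max_message_bytes=None):
    """Split proactively only against the broker-side limit.

    If the broker's configured max message size is known to the client,
    use it; otherwise fall back to the assumed default (or, in the real
    producer, split only reactively after a MESSAGE_TOO_LARGE error).
    """
    limit = (broker_max_message_bytes
             if broker_max_message_bytes is not None
             else BROKER_DEFAULT_MAX_MESSAGE_BYTES)
    return batch_size_bytes > limit
```

Note this deliberately does not split at the user-configured batch.size, per
Becket's point downthread that users may set batch.size small purely to bound
memory footprint.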

On Wed, Mar 15, 2017 at 1:21 PM, Ismael Juma  wrote:

> Hi Becket,
>
> How were you thinking of retrieving the configuration items you mentioned?
> I am asking because I was planning to post a KIP for Describe Configs (one
> of the protocols in KIP-4), which would expose such information. But maybe
> you are thinking of extending Metadata request?
>
> Ismael
>
> On Wed, Mar 15, 2017 at 7:33 PM, Becket Qin  wrote:
>
> > Hi Jason,
> >
> > Good point. I was thinking about that, too. I was not sure if that is the
> > right thing to do by default.
> >
> > If we assume people always set the batch size to the max message size,
> > splitting the oversized batch makes a lot of sense. But it seems possible
> > that users want to control the memory footprint, so they would set the
> > batch size to smaller than the max message size so that the producer can
> > hold batches for more partitions. In this case, splitting the batch might
> > not be the desired behavior.
> >
> > I think the most intuitive approach is to allow the producer to get the
> > max message size configuration (as well as some other configurations,
> > such as the timestamp type) from the broker side and use that to decide
> > whether a batch should be split or not. I probably should add this to the
> > KIP wiki.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Mar 15, 2017 at 9:47 AM, Jason Gustafson wrote:
> >
> > > Hey Becket,
> > >
> > > Thanks for the KIP! The approach seems reasonable. One clarification: is
> > > the intent to do the splitting after the broker rejects the request with
> > > MESSAGE_TOO_LARGE, or prior to sending if the configured batch size is
> > > exceeded?
> > >
> > > -Jason
> > >
> > > On Mon, Mar 13, 2017 at 8:10 PM, Becket Qin wrote:
> > >
> > > > Bumping up the thread for further comments. If there are no more
> > > > comments on the KIP I will start the voting thread on Wed.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Mar 7, 2017 at 9:48 AM, Becket Qin 
> > wrote:
> > > >
> > > > > Hi Dong,
> > > > >
> > > > > Thanks for the comments.
> > > > >
> > > > > The patch is mostly for proof of concept in case there is any
> concern
> > > > > about the implementation which is indeed a little tricky.
> > > > >
> > > > > The new metric has already been mentioned in the Public Interface
> > > Change
> > > > > section.
> > > > >
> > > > > I added the reasoning about how the compression ratio
> > > > > improving/deteriorate steps are determined in the wiki.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > >
> > > > > On Mon, Mar 6, 2017 at 4:42 PM, Dong Lin 
> > wrote:
> > > > >
> > > > >> Hey Becket,
> > > > >>
> > > > >> I am wondering if we should first vote for the KIP before
> reviewing
> > > the
> > > > >> patch. I have two comments below:
> > > > >>
> > > > >> - Should we specify the new sensors as part of interface change in
> > the
> > > > >> KIP?
> > > > >> - The KIP proposes to increase estimated compression ratio by 0.05
> > for
> > > > >> each
> > > > >> underestimation and decrement the estimation by 0.005 for each
> > > > >> overestimation. Why are these two values chosen? I think there is
> > some
> > > > >> tradeoff in selecting the value. Can the KIP be more explicit
> about
> > > the
> > > > >> tradeoff and explain how these two values would impact producer's
> > > > >> performance?
> > > > >>
> > > > >> Thanks,
> > > > >> Dong
> > > > >>
> > > > >>
> > > > >> On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin  >
> > > > wrote:
> > > > >>
> > > > >> > I have updated the KIP based on the latest discussion. Please
> > check
> > > > and
> > > > >> let
> > > > >> > me know if there is any further concern.
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jiangjie (Becket) Qin
> > > > >> >
> > > > >> > On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin <
> becket@gmail.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > > Actually second thought on this, rate might be better for two
> > > > reasons:
> > > > >> > > 1. Most of the metrics in the producer we already have are
> using
> > > > rate
> > > > >> > > instead of count.
> > > > >> > > 2. If a service is bounced, the count will be reset to 0, but
> it
> > > > does
> > > > >> not
> > > > >> > > affect rate.
> > > > >> > >
> > > > >> > > I'll make the change.
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > >
> > > > >> > > Jiangjie (Becket) Qin
> > > > >> > >
> > > > >> > > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin <
> 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-15 Thread Ismael Juma
Hi Becket,

How were you thinking of retrieving the configuration items you mentioned?
I am asking because I was planning to post a KIP for Describe Configs (one
of the protocols in KIP-4), which would expose such information. But maybe
you are thinking of extending Metadata request?

Ismael

On Wed, Mar 15, 2017 at 7:33 PM, Becket Qin  wrote:

> Hi Jason,
>
> Good point. I was thinking about that, too. I was not sure if that is the
> right thing to do by default.
>
> If we assume people always set the batch size to max message size,
> splitting the oversized batch makes a lot of sense. But it seems possible
> that users want to control the memory footprint so they would set the batch
> size to smaller than the max message size so the producer can have hold
> batches for more partitions. In this case, splitting the batch might not be
> the desired behavior.
>
> I think the most intuitive approach to this is allow the producer to get
> the max message size configuration (as well as some other configurations
> such as timestamp type)  from the broker side and use that to decide
> whether a batch should be split or not. I probably should add this to the
> KIP wiki.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Wed, Mar 15, 2017 at 9:47 AM, Jason Gustafson 
> wrote:
>
> > Hey Becket,
> >
> > Thanks for the KIP! The approach seems reasonable. One clarification: is
> > the intent to do the splitting after the broker rejects the request with
> > MESSAGE_TOO_LARGE, or prior to sending if the configured batch size is
> > exceeded?
> >
> > -Jason
> >
> > On Mon, Mar 13, 2017 at 8:10 PM, Becket Qin 
> wrote:
> >
> > > Bump up the thread for further comments. If there is no more comments
> on
> > > the KIP I will start the voting thread on Wed.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Tue, Mar 7, 2017 at 9:48 AM, Becket Qin 
> wrote:
> > >
> > > > Hi Dong,
> > > >
> > > > Thanks for the comments.
> > > >
> > > > The patch is mostly for proof of concept in case there is any concern
> > > > about the implementation which is indeed a little tricky.
> > > >
> > > > The new metric has already been mentioned in the Public Interface
> > Change
> > > > section.
> > > >
> > > > I added the reasoning about how the compression ratio
> > > > improving/deteriorate steps are determined in the wiki.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > >
> > > > On Mon, Mar 6, 2017 at 4:42 PM, Dong Lin 
> wrote:
> > > >
> > > >> Hey Becket,
> > > >>
> > > >> I am wondering if we should first vote for the KIP before reviewing
> > the
> > > >> patch. I have two comments below:
> > > >>
> > > >> - Should we specify the new sensors as part of interface change in
> the
> > > >> KIP?
> > > >> - The KIP proposes to increase estimated compression ratio by 0.05
> for
> > > >> each
> > > >> underestimation and decrement the estimation by 0.005 for each
> > > >> overestimation. Why are these two values chosen? I think there is
> some
> > > >> tradeoff in selecting the value. Can the KIP be more explicit about
> > the
> > > >> tradeoff and explain how these two values would impact producer's
> > > >> performance?
> > > >>
> > > >> Thanks,
> > > >> Dong
> > > >>
> > > >>
> > > >> On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin 
> > > wrote:
> > > >>
> > > >> > I have updated the KIP based on the latest discussion. Please
> check
> > > and
> > > >> let
> > > >> > me know if there is any further concern.
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jiangjie (Becket) Qin
> > > >> >
> > > >> > On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin  >
> > > >> wrote:
> > > >> >
> > > >> > > Actually second thought on this, rate might be better for two
> > > reasons:
> > > >> > > 1. Most of the metrics in the producer we already have are using
> > > rate
> > > >> > > instead of count.
> > > >> > > 2. If a service is bounced, the count will be reset to 0, but it
> > > does
> > > >> not
> > > >> > > affect rate.
> > > >> > >
> > > >> > > I'll make the change.
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > Jiangjie (Becket) Qin
> > > >> > >
> > > >> > > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin <
> becket@gmail.com
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > >> Hi Dong,
> > > >> > >>
> > > >> > >> Yes, there is a sensor in the patch about the split occurrence.
> > > >> > >>
> > > >> > >> Currently it is a count instead of rate. In practice, it seems
> > > count
> > > >> is
> > > >> > >> easier to use in this case. But I am open to change.
> > > >> > >>
> > > >> > >> Thanks,
> > > >> > >>
> > > >> > >> Jiangjie (Becket) Qin
> > > >> > >>
> > > >> > >> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin 
> > > >> wrote:
> > > >> > >>
> > > >> > >>> Hey Becket,
> > > >> > >>>
> > > >> > >>> I haven't looked at the patch yet. But since 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-15 Thread Becket Qin
Hi Jason,

Good point. I was thinking about that, too. I was not sure if that is the
right thing to do by default.

If we assume people always set the batch size to max message size,
splitting the oversized batch makes a lot of sense. But it seems possible
that users want to control the memory footprint, so they would set the batch
size smaller than the max message size so that the producer can hold
batches for more partitions. In this case, splitting the batch might not be
the desired behavior.

I think the most intuitive approach to this is to allow the producer to get
the max message size configuration (as well as some other configurations
such as timestamp type) from the broker side and use that to decide
whether a batch should be split or not. I probably should add this to the
KIP wiki.

Thanks,

Jiangjie (Becket) Qin
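
As an aside for readers of the archive, the pre-send check described above
could look roughly like the sketch below. This is purely illustrative, not
the KIP's implementation: `broker_configs` stands in for however the
producer would learn the limit (DescribeConfigs vs. an extended Metadata
request was still an open question at this point in the thread), and the
default value assumes the broker's stock message.max.bytes.

```python
# Illustrative sketch only -- not Kafka's actual producer code. The broker's
# message.max.bytes (topic-level: max.message.bytes) caps the size of a
# produced batch, so a producer that knows the limit can split proactively
# instead of waiting for a MESSAGE_TOO_LARGE error.
DEFAULT_MAX_MESSAGE_BYTES = 1_000_012  # stock broker default for message.max.bytes

def should_split(estimated_batch_bytes, broker_configs):
    """Return True if the (estimated compressed) batch would exceed the
    broker-side limit and should be split before sending."""
    limit = int(broker_configs.get("message.max.bytes",
                                   DEFAULT_MAX_MESSAGE_BYTES))
    return estimated_batch_bytes > limit
```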

On Wed, Mar 15, 2017 at 9:47 AM, Jason Gustafson  wrote:

> Hey Becket,
>
> Thanks for the KIP! The approach seems reasonable. One clarification: is
> the intent to do the splitting after the broker rejects the request with
> MESSAGE_TOO_LARGE, or prior to sending if the configured batch size is
> exceeded?
>
> -Jason
>
> On Mon, Mar 13, 2017 at 8:10 PM, Becket Qin  wrote:
>
> > Bump up the thread for further comments. If there is no more comments on
> > the KIP I will start the voting thread on Wed.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Mar 7, 2017 at 9:48 AM, Becket Qin  wrote:
> >
> > > Hi Dong,
> > >
> > > Thanks for the comments.
> > >
> > > The patch is mostly for proof of concept in case there is any concern
> > > about the implementation which is indeed a little tricky.
> > >
> > > The new metric has already been mentioned in the Public Interface
> Change
> > > section.
> > >
> > > I added the reasoning about how the compression ratio
> > > improving/deteriorate steps are determined in the wiki.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > > On Mon, Mar 6, 2017 at 4:42 PM, Dong Lin  wrote:
> > >
> > >> Hey Becket,
> > >>
> > >> I am wondering if we should first vote for the KIP before reviewing
> the
> > >> patch. I have two comments below:
> > >>
> > >> - Should we specify the new sensors as part of interface change in the
> > >> KIP?
> > >> - The KIP proposes to increase estimated compression ratio by 0.05 for
> > >> each
> > >> underestimation and decrement the estimation by 0.005 for each
> > >> overestimation. Why are these two values chosen? I think there is some
> > >> tradeoff in selecting the value. Can the KIP be more explicit about
> the
> > >> tradeoff and explain how these two values would impact producer's
> > >> performance?
> > >>
> > >> Thanks,
> > >> Dong
> > >>
> > >>
> > >> On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin 
> > wrote:
> > >>
> > >> > I have updated the KIP based on the latest discussion. Please check
> > and
> > >> let
> > >> > me know if there is any further concern.
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jiangjie (Becket) Qin
> > >> >
> > >> > On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin 
> > >> wrote:
> > >> >
> > >> > > Actually second thought on this, rate might be better for two
> > reasons:
> > >> > > 1. Most of the metrics in the producer we already have are using
> > rate
> > >> > > instead of count.
> > >> > > 2. If a service is bounced, the count will be reset to 0, but it
> > does
> > >> not
> > >> > > affect rate.
> > >> > >
> > >> > > I'll make the change.
> > >> > >
> > >> > > Thanks,
> > >> > >
> > >> > > Jiangjie (Becket) Qin
> > >> > >
> > >> > > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin  >
> > >> > wrote:
> > >> > >
> > >> > >> Hi Dong,
> > >> > >>
> > >> > >> Yes, there is a sensor in the patch about the split occurrence.
> > >> > >>
> > >> > >> Currently it is a count instead of rate. In practice, it seems
> > count
> > >> is
> > >> > >> easier to use in this case. But I am open to change.
> > >> > >>
> > >> > >> Thanks,
> > >> > >>
> > >> > >> Jiangjie (Becket) Qin
> > >> > >>
> > >> > >> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin 
> > >> wrote:
> > >> > >>
> > >> > >>> Hey Becket,
> > >> > >>>
> > >> > >>> I haven't looked at the patch yet. But since we are going to try
> > the
> > >> > >>> split-on-oversize solution, should the KIP also add a sensor
> that
> > >> shows
> > >> > >>> the
> > >> > >>> rate of split per second and the probability of split?
> > >> > >>>
> > >> > >>> Thanks,
> > >> > >>> Dong
> > >> > >>>
> > >> > >>>
> > >> > >>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin <
> becket@gmail.com>
> > >> > wrote:
> > >> > >>>
> > >> > >>> > Just to clarify, the implementation is basically what I
> > mentioned
> > >> > above
> > >> > >>> > (split/resend + adjusted estimation evolving algorithm) and
> > >> changing
> > >> > >>> the
> > >> > >>> > compression ratio estimation to be per topic.

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-15 Thread Jason Gustafson
Hey Becket,

Thanks for the KIP! The approach seems reasonable. One clarification: is
the intent to do the splitting after the broker rejects the request with
MESSAGE_TOO_LARGE, or prior to sending if the configured batch size is
exceeded?

-Jason
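
For readers following along, the post-rejection path Jason asks about can be
sketched as a simple split-and-resend loop. This is a hedged illustration,
not the KIP's code: `send_batch` and `MessageTooLargeError` are stand-ins,
and splitting by record count (rather than by estimated compressed size) is
a simplification.

```python
# Illustrative sketch of splitting an oversized batch after the broker
# rejects it, then retrying each half recursively until every sub-batch
# fits (or a single record is itself too large).
class MessageTooLargeError(Exception):
    """Stand-in for the broker's MESSAGE_TOO_LARGE error."""

def send_with_split(batch, send_batch):
    """Try to send a batch; on rejection, split it in half by record
    count and retry each half. Returns the number of batches sent."""
    try:
        send_batch(batch)
        return 1
    except MessageTooLargeError:
        if len(batch) <= 1:
            raise  # a single record exceeds the limit; cannot split further
        mid = len(batch) // 2
        return (send_with_split(batch[:mid], send_batch)
                + send_with_split(batch[mid:], send_batch))
```

Note the cost Joel raises later in the thread: a real implementation must
decompress and recompress each half, which is why the estimator should keep
splits rare.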

On Mon, Mar 13, 2017 at 8:10 PM, Becket Qin  wrote:

> Bump up the thread for further comments. If there is no more comments on
> the KIP I will start the voting thread on Wed.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Mar 7, 2017 at 9:48 AM, Becket Qin  wrote:
>
> > Hi Dong,
> >
> > Thanks for the comments.
> >
> > The patch is mostly for proof of concept in case there is any concern
> > about the implementation which is indeed a little tricky.
> >
> > The new metric has already been mentioned in the Public Interface Change
> > section.
> >
> > I added the reasoning about how the compression ratio
> > improving/deteriorate steps are determined in the wiki.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Mon, Mar 6, 2017 at 4:42 PM, Dong Lin  wrote:
> >
> >> Hey Becket,
> >>
> >> I am wondering if we should first vote for the KIP before reviewing the
> >> patch. I have two comments below:
> >>
> >> - Should we specify the new sensors as part of interface change in the
> >> KIP?
> >> - The KIP proposes to increase estimated compression ratio by 0.05 for
> >> each
> >> underestimation and decrement the estimation by 0.005 for each
> >> overestimation. Why are these two values chosen? I think there is some
> >> tradeoff in selecting the value. Can the KIP be more explicit about the
> >> tradeoff and explain how these two values would impact producer's
> >> performance?
> >>
> >> Thanks,
> >> Dong
> >>
> >>
> >> On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin 
> wrote:
> >>
> >> > I have updated the KIP based on the latest discussion. Please check
> and
> >> let
> >> > me know if there is any further concern.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> > On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin 
> >> wrote:
> >> >
> >> > > Actually second thought on this, rate might be better for two
> reasons:
> >> > > 1. Most of the metrics in the producer we already have are using
> rate
> >> > > instead of count.
> >> > > 2. If a service is bounced, the count will be reset to 0, but it
> does
> >> not
> >> > > affect rate.
> >> > >
> >> > > I'll make the change.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jiangjie (Becket) Qin
> >> > >
> >> > > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin 
> >> > wrote:
> >> > >
> >> > >> Hi Dong,
> >> > >>
> >> > >> Yes, there is a sensor in the patch about the split occurrence.
> >> > >>
> >> > >> Currently it is a count instead of rate. In practice, it seems
> count
> >> is
> >> > >> easier to use in this case. But I am open to change.
> >> > >>
> >> > >> Thanks,
> >> > >>
> >> > >> Jiangjie (Becket) Qin
> >> > >>
> >> > >> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin 
> >> wrote:
> >> > >>
> >> > >>> Hey Becket,
> >> > >>>
> >> > >>> I haven't looked at the patch yet. But since we are going to try
> the
> >> > >>> split-on-oversize solution, should the KIP also add a sensor that
> >> shows
> >> > >>> the
> >> > >>> rate of split per second and the probability of split?
> >> > >>>
> >> > >>> Thanks,
> >> > >>> Dong
> >> > >>>
> >> > >>>
> >> > >>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin 
> >> > wrote:
> >> > >>>
> >> > >>> > Just to clarify, the implementation is basically what I
> mentioned
> >> > above
> >> > >>> > (split/resend + adjusted estimation evolving algorithm) and
> >> changing
> >> > >>> the
> >> > >>> > compression ratio estimation to be per topic.
> >> > >>> >
> >> > >>> > Thanks,
> >> > >>> >
> >> > >>> > Jiangjie (Becket) Qin
> >> > >>> >
> >> > >>> > On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin <
> becket@gmail.com>
> >> > >>> wrote:
> >> > >>> >
> >> > >>> > > I went ahead and have a patch submitted here:
> >> > >>> > > https://github.com/apache/kafka/pull/2638
> >> > >>> > >
> >> > >>> > > Per Joel's suggestion, I changed the compression ratio to be
> per
> >> > >>> topic as
> >> > >>> > > well. It seems working well. Since there is an important
> >> behavior
> >> > >>> change
> >> > >>> > > and a new sensor is added, I'll keep the KIP and update it
> >> > according.
> >> > >>> > >
> >> > >>> > > Thanks,
> >> > >>> > >
> >> > >>> > > Jiangjie (Becket) Qin
> >> > >>> > >
> >> > >>> > > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy <
> >> jjkosh...@gmail.com>
> >> > >>> wrote:
> >> > >>> > >
> >> > >>> > >> >
> >> > >>> > >> > Lets say we sent the batch over the wire and received a
> >> > >>> > >> > RecordTooLargeException, how do we split it as once we add
> >> the
> >> > >>> message
> >> > >>> > >> to
> >> > >>> > >> > the batch we loose the message level granularity. We will
> >> have
> >> > to
> >> > >>> > 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-13 Thread Becket Qin
Bump up the thread for further comments. If there are no more comments on
the KIP, I will start the voting thread on Wednesday.

Thanks,

Jiangjie (Becket) Qin

On Tue, Mar 7, 2017 at 9:48 AM, Becket Qin  wrote:

> Hi Dong,
>
> Thanks for the comments.
>
> The patch is mostly for proof of concept in case there is any concern
> about the implementation which is indeed a little tricky.
>
> The new metric has already been mentioned in the Public Interface Change
> section.
>
> I added the reasoning about how the compression ratio
> improving/deteriorate steps are determined in the wiki.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, Mar 6, 2017 at 4:42 PM, Dong Lin  wrote:
>
>> Hey Becket,
>>
>> I am wondering if we should first vote for the KIP before reviewing the
>> patch. I have two comments below:
>>
>> - Should we specify the new sensors as part of interface change in the
>> KIP?
>> - The KIP proposes to increase estimated compression ratio by 0.05 for
>> each
>> underestimation and decrement the estimation by 0.005 for each
>> overestimation. Why are these two values chosen? I think there is some
>> tradeoff in selecting the value. Can the KIP be more explicit about the
>> tradeoff and explain how these two values would impact producer's
>> performance?
>>
>> Thanks,
>> Dong
>>
>>
>> On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin  wrote:
>>
>> > I have updated the KIP based on the latest discussion. Please check and
>> let
>> > me know if there is any further concern.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin 
>> wrote:
>> >
>> > > Actually second thought on this, rate might be better for two reasons:
>> > > 1. Most of the metrics in the producer we already have are using rate
>> > > instead of count.
>> > > 2. If a service is bounced, the count will be reset to 0, but it does
>> not
>> > > affect rate.
>> > >
>> > > I'll make the change.
>> > >
>> > > Thanks,
>> > >
>> > > Jiangjie (Becket) Qin
>> > >
>> > > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin 
>> > wrote:
>> > >
>> > >> Hi Dong,
>> > >>
>> > >> Yes, there is a sensor in the patch about the split occurrence.
>> > >>
>> > >> Currently it is a count instead of rate. In practice, it seems count
>> is
>> > >> easier to use in this case. But I am open to change.
>> > >>
>> > >> Thanks,
>> > >>
>> > >> Jiangjie (Becket) Qin
>> > >>
>> > >> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin 
>> wrote:
>> > >>
>> > >>> Hey Becket,
>> > >>>
>> > >>> I haven't looked at the patch yet. But since we are going to try the
>> > >>> split-on-oversize solution, should the KIP also add a sensor that
>> shows
>> > >>> the
>> > >>> rate of split per second and the probability of split?
>> > >>>
>> > >>> Thanks,
>> > >>> Dong
>> > >>>
>> > >>>
>> > >>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin 
>> > wrote:
>> > >>>
>> > >>> > Just to clarify, the implementation is basically what I mentioned
>> > above
>> > >>> > (split/resend + adjusted estimation evolving algorithm) and
>> changing
>> > >>> the
>> > >>> > compression ratio estimation to be per topic.
>> > >>> >
>> > >>> > Thanks,
>> > >>> >
>> > >>> > Jiangjie (Becket) Qin
>> > >>> >
>> > >>> > On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin 
>> > >>> wrote:
>> > >>> >
>> > >>> > > I went ahead and have a patch submitted here:
>> > >>> > > https://github.com/apache/kafka/pull/2638
>> > >>> > >
>> > >>> > > Per Joel's suggestion, I changed the compression ratio to be per
>> > >>> topic as
>> > >>> > > well. It seems working well. Since there is an important
>> behavior
>> > >>> change
>> > >>> > > and a new sensor is added, I'll keep the KIP and update it
>> > according.
>> > >>> > >
>> > >>> > > Thanks,
>> > >>> > >
>> > >>> > > Jiangjie (Becket) Qin
>> > >>> > >
>> > >>> > > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy <
>> jjkosh...@gmail.com>
>> > >>> wrote:
>> > >>> > >
>> > >>> > >> >
>> > >>> > >> > Lets say we sent the batch over the wire and received a
>> > >>> > >> > RecordTooLargeException, how do we split it as once we add
>> the
>> > >>> message
>> > >>> > >> to
>> > >>> > >> > the batch we loose the message level granularity. We will
>> have
>> > to
>> > >>> > >> > decompress, do deep iteration and split and again compress.
>> > right?
>> > >>> > This
>> > >>> > >> > looks like a performance bottle neck in case of multi topic
>> > >>> producers
>> > >>> > >> like
>> > >>> > >> > mirror maker.
>> > >>> > >> >
>> > >>> > >>
>> > >>> > >> Yes, but these should be outliers if we do estimation on a
>> > per-topic
>> > >>> > basis
>> > >>> > >> and if we target a conservative-enough compression ratio. The
>> > >>> producer
>> > >>> > >> should also avoid sending over the wire if it can be made
>> aware of
>> > >>> the
>> > >>> > >> max-message size limit on the broker, and split if it
>> 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-07 Thread Becket Qin
Hi Dong,

Thanks for the comments.

The patch is mostly for proof of concept in case there is any concern about
the implementation, which is indeed a little tricky.

The new metric has already been mentioned in the Public Interface Change
section.

I added the reasoning about how the compression ratio
improvement/deterioration steps are determined in the wiki.

Thanks,

Jiangjie (Becket) Qin


On Mon, Mar 6, 2017 at 4:42 PM, Dong Lin  wrote:

> Hey Becket,
>
> I am wondering if we should first vote for the KIP before reviewing the
> patch. I have two comments below:
>
> - Should we specify the new sensors as part of interface change in the KIP?
> - The KIP proposes to increase estimated compression ratio by 0.05 for each
> underestimation and decrement the estimation by 0.005 for each
> overestimation. Why are these two values chosen? I think there is some
> tradeoff in selecting the value. Can the KIP be more explicit about the
> tradeoff and explain how these two values would impact producer's
> performance?
>
> Thanks,
> Dong
>
>
> On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin  wrote:
>
> > I have updated the KIP based on the latest discussion. Please check and
> let
> > me know if there is any further concern.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin 
> wrote:
> >
> > > Actually second thought on this, rate might be better for two reasons:
> > > 1. Most of the metrics in the producer we already have are using rate
> > > instead of count.
> > > 2. If a service is bounced, the count will be reset to 0, but it does
> not
> > > affect rate.
> > >
> > > I'll make the change.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin 
> > wrote:
> > >
> > >> Hi Dong,
> > >>
> > >> Yes, there is a sensor in the patch about the split occurrence.
> > >>
> > >> Currently it is a count instead of rate. In practice, it seems count
> is
> > >> easier to use in this case. But I am open to change.
> > >>
> > >> Thanks,
> > >>
> > >> Jiangjie (Becket) Qin
> > >>
> > >> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin  wrote:
> > >>
> > >>> Hey Becket,
> > >>>
> > >>> I haven't looked at the patch yet. But since we are going to try the
> > >>> split-on-oversize solution, should the KIP also add a sensor that
> shows
> > >>> the
> > >>> rate of split per second and the probability of split?
> > >>>
> > >>> Thanks,
> > >>> Dong
> > >>>
> > >>>
> > >>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin 
> > wrote:
> > >>>
> > >>> > Just to clarify, the implementation is basically what I mentioned
> > above
> > >>> > (split/resend + adjusted estimation evolving algorithm) and
> changing
> > >>> the
> > >>> > compression ratio estimation to be per topic.
> > >>> >
> > >>> > Thanks,
> > >>> >
> > >>> > Jiangjie (Becket) Qin
> > >>> >
> > >>> > On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin 
> > >>> wrote:
> > >>> >
> > >>> > > I went ahead and have a patch submitted here:
> > >>> > > https://github.com/apache/kafka/pull/2638
> > >>> > >
> > >>> > > Per Joel's suggestion, I changed the compression ratio to be per
> > >>> topic as
> > >>> > > well. It seems working well. Since there is an important behavior
> > >>> change
> > >>> > > and a new sensor is added, I'll keep the KIP and update it
> > according.
> > >>> > >
> > >>> > > Thanks,
> > >>> > >
> > >>> > > Jiangjie (Becket) Qin
> > >>> > >
> > >>> > > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy  >
> > >>> wrote:
> > >>> > >
> > >>> > >> >
> > >>> > >> > Lets say we sent the batch over the wire and received a
> > >>> > >> > RecordTooLargeException, how do we split it as once we add the
> > >>> message
> > >>> > >> to
> > >>> > >> > the batch we loose the message level granularity. We will have
> > to
> > >>> > >> > decompress, do deep iteration and split and again compress.
> > right?
> > >>> > This
> > >>> > >> > looks like a performance bottle neck in case of multi topic
> > >>> producers
> > >>> > >> like
> > >>> > >> > mirror maker.
> > >>> > >> >
> > >>> > >>
> > >>> > >> Yes, but these should be outliers if we do estimation on a
> > per-topic
> > >>> > basis
> > >>> > >> and if we target a conservative-enough compression ratio. The
> > >>> producer
> > >>> > >> should also avoid sending over the wire if it can be made aware
> of
> > >>> the
> > >>> > >> max-message size limit on the broker, and split if it determines
> > >>> that a
> > >>> > >> record exceeds the broker's config. Ideally this should be part
> of
> > >>> topic
> > >>> > >> metadata but is not - so it could be off a periodic
> > describe-configs
> > >>> > >>  > >>> > >> Command+line+and+centralized+administrative+operations#KIP-
> > >>> > >> 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-06 Thread Dong Lin
Hey Becket,

I am wondering if we should first vote for the KIP before reviewing the
patch. I have two comments below:

- Should we specify the new sensors as part of interface change in the KIP?
- The KIP proposes to increase the estimated compression ratio by 0.05 for
each underestimation and decrease the estimate by 0.005 for each
overestimation. Why were these two values chosen? I think there is some
tradeoff in selecting the values. Can the KIP be more explicit about the
tradeoff and explain how these two values would impact the producer's
performance?

Thanks,
Dong
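
To make the numbers in this comment concrete, the asymmetric update rule
under discussion can be sketched as below. This is an illustrative model
only, not the KIP's actual code: the constant names and the 0.05 floor are
assumptions. The idea is that the estimate jumps up quickly on an
underestimation (which risks a rejected, split batch) and drifts down
slowly on an overestimation.

```python
# Illustrative sketch of the per-topic compression-ratio estimator discussed
# in this thread; not the actual Kafka producer implementation. The ratio is
# compressed size / uncompressed size, so lower means better compression.
UNDERESTIMATION_STEP = 0.05   # bump when a batch compressed worse than expected
OVERESTIMATION_STEP = 0.005   # decay when a batch compressed better than expected
MAX_RATIO = 1.0
MIN_RATIO = 0.05              # assumed floor, not specified by the KIP

_estimated_ratio = {}  # topic -> current estimate

def estimated_ratio(topic):
    # Start conservatively at 1.0, i.e. assume no compression benefit.
    return _estimated_ratio.get(topic, MAX_RATIO)

def observe_batch(topic, observed_ratio):
    """Update the per-topic estimate from one batch's actual ratio."""
    current = estimated_ratio(topic)
    if observed_ratio > current:
        # Underestimation: step up aggressively to avoid producing an
        # oversized batch on the next send.
        current = min(MAX_RATIO, current + UNDERESTIMATION_STEP)
    else:
        # Overestimation: creep down slowly to pack batches more densely.
        current = max(MIN_RATIO, current - OVERESTIMATION_STEP)
    _estimated_ratio[topic] = current
    return current
```

With a 10:1 step asymmetry like this, one underestimated batch undoes
roughly ten batches' worth of downward drift, which is exactly the tradeoff
the comment above asks the KIP to spell out.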


On Sat, Mar 4, 2017 at 11:42 AM, Becket Qin  wrote:

> I have updated the KIP based on the latest discussion. Please check and let
> me know if there is any further concern.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin  wrote:
>
> > Actually second thought on this, rate might be better for two reasons:
> > 1. Most of the metrics in the producer we already have are using rate
> > instead of count.
> > 2. If a service is bounced, the count will be reset to 0, but it does not
> > affect rate.
> >
> > I'll make the change.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin 
> wrote:
> >
> >> Hi Dong,
> >>
> >> Yes, there is a sensor in the patch about the split occurrence.
> >>
> >> Currently it is a count instead of rate. In practice, it seems count is
> >> easier to use in this case. But I am open to change.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin  wrote:
> >>
> >>> Hey Becket,
> >>>
> >>> I haven't looked at the patch yet. But since we are going to try the
> >>> split-on-oversize solution, should the KIP also add a sensor that shows
> >>> the
> >>> rate of split per second and the probability of split?
> >>>
> >>> Thanks,
> >>> Dong
> >>>
> >>>
> >>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin 
> wrote:
> >>>
> >>> > Just to clarify, the implementation is basically what I mentioned
> above
> >>> > (split/resend + adjusted estimation evolving algorithm) and changing
> >>> the
> >>> > compression ratio estimation to be per topic.
> >>> >
> >>> > Thanks,
> >>> >
> >>> > Jiangjie (Becket) Qin
> >>> >
> >>> > On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin 
> >>> wrote:
> >>> >
> >>> > > I went ahead and have a patch submitted here:
> >>> > > https://github.com/apache/kafka/pull/2638
> >>> > >
> >>> > > Per Joel's suggestion, I changed the compression ratio to be per
> >>> topic as
> >>> > > well. It seems working well. Since there is an important behavior
> >>> change
> >>> > > and a new sensor is added, I'll keep the KIP and update it
> according.
> >>> > >
> >>> > > Thanks,
> >>> > >
> >>> > > Jiangjie (Becket) Qin
> >>> > >
> >>> > > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy 
> >>> wrote:
> >>> > >
> >>> > >> >
> >>> > >> > Lets say we sent the batch over the wire and received a
> >>> > >> > RecordTooLargeException, how do we split it as once we add the
> >>> message
> >>> > >> to
> >>> > >> > the batch we loose the message level granularity. We will have
> to
> >>> > >> > decompress, do deep iteration and split and again compress.
> right?
> >>> > This
> >>> > >> > looks like a performance bottle neck in case of multi topic
> >>> producers
> >>> > >> like
> >>> > >> > mirror maker.
> >>> > >> >
> >>> > >>
> >>> > >> Yes, but these should be outliers if we do estimation on a
> per-topic
> >>> > basis
> >>> > >> and if we target a conservative-enough compression ratio. The
> >>> producer
> >>> > >> should also avoid sending over the wire if it can be made aware of
> >>> the
> >>> > >> max-message size limit on the broker, and split if it determines
> >>> that a
> >>> > >> record exceeds the broker's config. Ideally this should be part of
> >>> topic
> >>> > >> metadata but is not - so it could be off a periodic
> describe-configs
> >>> > >>  >>> > >> Command+line+and+centralized+administrative+operations#KIP-
> >>> > >> 4-Commandlineandcentralizedadministrativeoperations-Describe
> >>> > >> ConfigsRequest>
> >>> > >> (which isn't available yet). This doesn't remove the need to split
> >>> and
> >>> > >> recompress though.
> >>> > >>
> >>> > >>
> >>> > >> > On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin <
> >>> becket@gmail.com>
> >>> > >> wrote:
> >>> > >> >
> >>> > >> > > Hey Mayuresh,
> >>> > >> > >
> >>> > >> > > 1) The batch would be split when a RecordTooLargeException is
> >>> > >> received.
> >>> > >> > > 2) Not lower the actual compression ratio, but lower the
> >>> estimated
> >>> > >> > > compression ratio "according to" the Actual Compression
> >>> Ratio (ACR).
> >>> > >> > >
> >>> > >> > > An example, let's start with Estimated Compression Ratio
> (ECR) =
> >>> > 1.0.

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-04 Thread Becket Qin
I have updated the KIP based on the latest discussion. Please check and let
me know if there is any further concern.

Thanks,

Jiangjie (Becket) Qin

On Sat, Mar 4, 2017 at 10:56 AM, Becket Qin  wrote:

> Actually, on second thought, rate might be better for two reasons:
> 1. Most of the metrics in the producer we already have are using rate
> instead of count.
> 2. If a service is bounced, the count will be reset to 0, but it does not
> affect rate.
>
> I'll make the change.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin  wrote:
>
>> Hi Dong,
>>
>> Yes, there is a sensor in the patch about the split occurrence.
>>
>> Currently it is a count instead of a rate. In practice, it seems count is
>> easier to use in this case. But I am open to change.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin  wrote:
>>
>>> Hey Becket,
>>>
>>> I haven't looked at the patch yet. But since we are going to try the
>>> split-on-oversize solution, should the KIP also add a sensor that shows
>>> the
>>> rate of split per second and the probability of split?
>>>
>>> Thanks,
>>> Dong
>>>
>>>
>>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin  wrote:
>>>
>>> > Just to clarify, the implementation is basically what I mentioned above
>>> > (split/resend + adjusted estimation evolving algorithm) and changing
>>> the
>>> > compression ratio estimation to be per topic.
>>> >
>>> > Thanks,
>>> >
>>> > Jiangjie (Becket) Qin
>>> >
>>> > On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin 
>>> wrote:
>>> >
>>> > > I went ahead and submitted a patch here:
>>> > > https://github.com/apache/kafka/pull/2638
>>> > >
>>> > > Per Joel's suggestion, I changed the compression ratio to be per
>>> topic as
>>> > > well. It seems to be working well. Since there is an important behavior
>>> change
>>> > > and a new sensor is added, I'll keep the KIP and update it accordingly.
>>> > >
>>> > > Thanks,
>>> > >
>>> > > Jiangjie (Becket) Qin
>>> > >
>>> > > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy 
>>> wrote:
>>> > >
>>> > >> >
>>> > >> > Lets say we sent the batch over the wire and received a
>>> > >> > RecordTooLargeException, how do we split it as once we add the
>>> message
>>> > >> to
>>> > >> > the batch we lose the message level granularity. We will have to
>>> > >> > decompress, do a deep iteration, split, and compress again, right?
>>> > This
>>> > >> > looks like a performance bottleneck in case of multi-topic
>>> producers
>>> > >> like
>>> > >> > mirror maker.
>>> > >> >
>>> > >>
>>> > >> Yes, but these should be outliers if we do estimation on a per-topic
>>> > basis
>>> > >> and if we target a conservative-enough compression ratio. The
>>> producer
>>> > >> should also avoid sending over the wire if it can be made aware of
>>> the
>>> > >> max-message size limit on the broker, and split if it determines
>>> that a
>>> > >> record exceeds the broker's config. Ideally this should be part of
>>> topic
>>> > >> metadata but is not - so it could be off a periodic describe-configs
>>> > >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest>
>>> > >> (which isn't available yet). This doesn't remove the need to split
>>> and
>>> > >> recompress though.
>>> > >>
>>> > >>
>>> > >> > On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin <
>>> becket@gmail.com>
>>> > >> wrote:
>>> > >> >
>>> > >> > > Hey Mayuresh,
>>> > >> > >
>>> > >> > > 1) The batch would be split when a RecordTooLargeException is
>>> > >> received.
>>> > >> > > 2) Not lower the actual compression ratio, but lower the
>>> estimated
>>> > >> > > compression ratio "according to" the Actual Compression
>>> Ratio (ACR).
>>> > >> > >
>>> > >> > > An example, let's start with Estimated Compression Ratio (ECR) =
>>> > 1.0.
>>> > >> Say
>>> > >> > > the compression ratio of ACR is ~0.8, instead of letting the ECR
>>> > >> drop
>>> > >> > to
>>> > >> > > 0.8 very quickly, we only drop 0.001 every time when ACR < ECR.
>>> > >> However,
>>> > >> > > once we see an ACR > ECR, we increment ECR by 0.05. If a
>>> > >> > > RecordTooLargeException is received, we reset the ECR back to
>>> 1.0
>>> > and
>>> > >> > split
>>> > >> > > the batch.
>>> > >> > >
>>> > >> > > Thanks,
>>> > >> > >
>>> > >> > > Jiangjie (Becket) Qin
>>> > >> > >
>>> > >> > >
>>> > >> > >
>>> > >> > > On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
>>> > >> > > gharatmayures...@gmail.com> wrote:
>>> > >> > >
>>> > >> > > > Hi Becket,
>>> > >> > > >
>>> > >> > > > Seems like an interesting idea.
>>> > >> > > > I had couple of questions :
>>> > >> > > > 1) How do we decide when the batch should be split?
>> > >> > > > 2) What do you mean by slowly lowering the "actual" compression ratio?

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-04 Thread Becket Qin
Actually, on second thought, rate might be better for two reasons:
1. Most of the metrics in the producer we already have are using rate
instead of count.
2. If a service is bounced, the count will be reset to 0, but it does not
affect rate.
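To illustrate the second point, here is a toy windowed-rate metric (a simplified sketch with hypothetical names; it is not the actual Kafka metrics code, which lives in org.apache.kafka.common.metrics):

```python
class WindowedRate:
    """Events per second over a sliding time window (hypothetical sketch,
    not the actual Kafka Sensor/Rate implementation)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = []  # timestamps of recorded events

    def record(self, now):
        self.events.append(now)

    def measure(self, now):
        # Only events inside the window contribute, so a bounced process
        # (a fresh object) converges back to the same steady-state rate,
        # whereas a cumulative count would restart from 0.
        recent = [t for t in self.events if now - t <= self.window]
        return len(recent) / self.window


rate = WindowedRate(window_seconds=10.0)
for t in range(100):            # one split per "second" for 100 seconds
    rate.record(float(t))
print(rate.measure(100.0))      # -> 1.0 split/sec, regardless of total uptime
```

A freshly constructed `WindowedRate` that only saw the last ten events would report the same 1.0, which is the restart-friendliness argued for above.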

I'll make the change.

Thanks,

Jiangjie (Becket) Qin

On Sat, Mar 4, 2017 at 10:27 AM, Becket Qin  wrote:

> Hi Dong,
>
> Yes, there is a sensor in the patch about the split occurrence.
>
> Currently it is a count instead of a rate. In practice, it seems count is
> easier to use in this case. But I am open to change.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Mar 3, 2017 at 7:43 PM, Dong Lin  wrote:
>
>> Hey Becket,
>>
>> I haven't looked at the patch yet. But since we are going to try the
>> split-on-oversize solution, should the KIP also add a sensor that shows
>> the
>> rate of split per second and the probability of split?
>>
>> Thanks,
>> Dong
>>
>>
>> On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin  wrote:
>>
>> > Just to clarify, the implementation is basically what I mentioned above
>> > (split/resend + adjusted estimation evolving algorithm) and changing the
>> > compression ratio estimation to be per topic.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin 
>> wrote:
>> >
>> > > I went ahead and submitted a patch here:
>> > > https://github.com/apache/kafka/pull/2638
>> > >
>> > > Per Joel's suggestion, I changed the compression ratio to be per
>> topic as
>> > > well. It seems to be working well. Since there is an important behavior
>> change
>> > > and a new sensor is added, I'll keep the KIP and update it accordingly.
>> > >
>> > > Thanks,
>> > >
>> > > Jiangjie (Becket) Qin
>> > >
>> > > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy 
>> wrote:
>> > >
>> > >> >
>> > >> > Lets say we sent the batch over the wire and received a
>> > >> > RecordTooLargeException, how do we split it as once we add the
>> message
>> > >> to
>> > >> > the batch we lose the message level granularity. We will have to
>> > >> > decompress, do a deep iteration, split, and compress again, right?
>> > This
>> > >> > looks like a performance bottleneck in case of multi-topic
>> producers
>> > >> like
>> > >> > mirror maker.
>> > >> >
>> > >>
>> > >> Yes, but these should be outliers if we do estimation on a per-topic
>> > basis
>> > >> and if we target a conservative-enough compression ratio. The
>> producer
>> > >> should also avoid sending over the wire if it can be made aware of
>> the
>> > >> max-message size limit on the broker, and split if it determines
>> that a
>> > >> record exceeds the broker's config. Ideally this should be part of
>> topic
>> > >> metadata but is not - so it could be off a periodic describe-configs
>> > >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest>
>> > >> (which isn't available yet). This doesn't remove the need to split
>> and
>> > >> recompress though.
>> > >>
>> > >>
>> > >> > On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin <becket@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > > Hey Mayuresh,
>> > >> > >
>> > >> > > 1) The batch would be split when a RecordTooLargeException is
>> > >> received.
>> > >> > > 2) Not lower the actual compression ratio, but lower the
>> estimated
>> > >> > > compression ratio "according to" the Actual Compression
>> Ratio (ACR).
>> > >> > >
>> > >> > > An example, let's start with Estimated Compression Ratio (ECR) =
>> > 1.0.
>> > >> Say
>> > >> > > the compression ratio of ACR is ~0.8, instead of letting the ECR
>> > >> drop
>> > >> > to
>> > >> > > 0.8 very quickly, we only drop 0.001 every time when ACR < ECR.
>> > >> However,
>> > >> > > once we see an ACR > ECR, we increment ECR by 0.05. If a
>> > >> > > RecordTooLargeException is received, we reset the ECR back to 1.0
>> > and
>> > >> > split
>> > >> > > the batch.
>> > >> > >
>> > >> > > Thanks,
>> > >> > >
>> > >> > > Jiangjie (Becket) Qin
>> > >> > >
>> > >> > >
>> > >> > >
>> > >> > > On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
>> > >> > > gharatmayures...@gmail.com> wrote:
>> > >> > >
>> > >> > > > Hi Becket,
>> > >> > > >
>> > >> > > > Seems like an interesting idea.
>> > >> > > > I had couple of questions :
>> > >> > > > 1) How do we decide when the batch should be split?
>> > >> > > > 2) What do you mean by slowly lowering the "actual" compression
>> > >> ratio?
>> > >> > > > An example would really help here.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > >
>> > >> > > > Mayuresh
>> > >> > > >
>> > >> > > > On Fri, Feb 24, 2017 at 3:17 PM, Becket Qin <
>> becket@gmail.com
>> > >
>> > >> > > wrote:
>> > >> > > >
>> > >> > > > > Hi Jay,
>> > >> > > > >
>> > >> > > > > Yeah, I got your point.

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-03 Thread Dong Lin
Hey Becket,

I haven't looked at the patch yet. But since we are going to try the
split-on-oversize solution, should the KIP also add a sensor that shows the
rate of split per second and the probability of split?

Thanks,
Dong


On Fri, Mar 3, 2017 at 6:39 PM, Becket Qin  wrote:

> Just to clarify, the implementation is basically what I mentioned above
> (split/resend + adjusted estimation evolving algorithm) and changing the
> compression ratio estimation to be per topic.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin  wrote:
>
> > I went ahead and submitted a patch here:
> > https://github.com/apache/kafka/pull/2638
> >
> > Per Joel's suggestion, I changed the compression ratio to be per topic as
> > well. It seems to be working well. Since there is an important behavior change
> > and a new sensor is added, I'll keep the KIP and update it accordingly.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy  wrote:
> >
> >> >
> >> > Lets say we sent the batch over the wire and received a
> >> > RecordTooLargeException, how do we split it as once we add the message
> >> to
> >> > the batch we lose the message level granularity. We will have to
> >> > decompress, do a deep iteration, split, and compress again, right?
> This
> >> > looks like a performance bottleneck in case of multi-topic producers
> >> like
> >> > mirror maker.
> >> >
> >>
> >> Yes, but these should be outliers if we do estimation on a per-topic
> basis
> >> and if we target a conservative-enough compression ratio. The producer
> >> should also avoid sending over the wire if it can be made aware of the
> >> max-message size limit on the broker, and split if it determines that a
> >> record exceeds the broker's config. Ideally this should be part of topic
> >> metadata but is not - so it could be off a periodic describe-configs
> >> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest>
> >> (which isn't available yet). This doesn't remove the need to split and
> >> recompress though.
> >>
> >>
> >> > On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin 
> >> wrote:
> >> >
> >> > > Hey Mayuresh,
> >> > >
> >> > > 1) The batch would be split when a RecordTooLargeException is
> >> received.
> >> > > 2) Not lower the actual compression ratio, but lower the estimated
> >> > > compression ratio "according to" the Actual Compression Ratio (ACR).
> >> > >
> >> > > An example, let's start with Estimated Compression Ratio (ECR) =
> 1.0.
> >> Say
> >> > > the compression ratio of ACR is ~0.8, instead of letting the ECR
> >> drop
> >> > to
> >> > > 0.8 very quickly, we only drop 0.001 every time when ACR < ECR.
> >> However,
> >> > > once we see an ACR > ECR, we increment ECR by 0.05. If a
> >> > > RecordTooLargeException is received, we reset the ECR back to 1.0
> and
> >> > split
> >> > > the batch.
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jiangjie (Becket) Qin
> >> > >
> >> > >
> >> > >
> >> > > On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
> >> > > gharatmayures...@gmail.com> wrote:
> >> > >
> >> > > > Hi Becket,
> >> > > >
> >> > > > Seems like an interesting idea.
> >> > > > I had couple of questions :
> >> > > > 1) How do we decide when the batch should be split?
> >> > > > 2) What do you mean by slowly lowering the "actual" compression
> >> ratio?
> >> > > > An example would really help here.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Mayuresh
> >> > > >
> >> > > > On Fri, Feb 24, 2017 at 3:17 PM, Becket Qin <becket@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > Hi Jay,
> >> > > > >
> >> > > > > Yeah, I got your point.
> >> > > > >
> >> > > > > I think there might be a solution which does not require adding a
> >> new
> >> > > > > configuration. We can start from a very conservative compression
> >> > ratio
> >> > > > say
> >> > > > > 1.0 and lower it very slowly according to the actual compression
> >> > ratio
> >> > > > > until we hit a point that we have to split a batch. At that
> >> point, we
> >> > > > > exponentially back off on the compression ratio. The idea is
> >> somewhat
> >> > > > like
> >> > > > > TCP. This should help avoid frequent split.
> >> > > > >
> >> > > > > The upper bound of the batch size is also a little awkward today
> >> > > because
> >> > > > we
> >> > > > > say the batch size is based on compressed size, but users cannot
> >> set
> >> > it
> >> > > > to
> >> > > > > the max message size because that will result in oversized
> >> messages.
> >> > > With
> >> > > > > this change we will be able to allow the users to set the
> message
> >> > size
> >> > > to
> >> > > > > close to max message size.
> >> > > > >
> >> > > > > However the downside is that there could be latency spikes in the system in this case due to the splitting, especially when there are many messages that need to be split at the same time.

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-03 Thread Becket Qin
Just to clarify, the implementation is basically what I mentioned above
(split/resend + adjusted estimation evolving algorithm) and changing the
compression ratio estimation to be per topic.

Thanks,

Jiangjie (Becket) Qin

On Fri, Mar 3, 2017 at 6:36 PM, Becket Qin  wrote:

> I went ahead and submitted a patch here:
> https://github.com/apache/kafka/pull/2638
>
> Per Joel's suggestion, I changed the compression ratio to be per topic as
> well. It seems to be working well. Since there is an important behavior change
> and a new sensor is added, I'll keep the KIP and update it accordingly.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy  wrote:
>
>> >
>> > Lets say we sent the batch over the wire and received a
>> > RecordTooLargeException, how do we split it as once we add the message
>> to
>> > the batch we lose the message level granularity. We will have to
>> > decompress, do a deep iteration, split, and compress again, right? This
>> > looks like a performance bottleneck in case of multi-topic producers
>> like
>> > mirror maker.
>> >
>>
>> Yes, but these should be outliers if we do estimation on a per-topic basis
>> and if we target a conservative-enough compression ratio. The producer
>> should also avoid sending over the wire if it can be made aware of the
>> max-message size limit on the broker, and split if it determines that a
>> record exceeds the broker's config. Ideally this should be part of topic
>> metadata but is not - so it could be off a periodic describe-configs
>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest>
>> (which isn't available yet). This doesn't remove the need to split and
>> recompress though.
>>
>>
>> > On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin 
>> wrote:
>> >
>> > > Hey Mayuresh,
>> > >
>> > > 1) The batch would be split when a RecordTooLargeException is
>> received.
>> > > 2) Not lower the actual compression ratio, but lower the estimated
>> > > compression ratio "according to" the Actual Compression Ratio (ACR).
>> > >
>> > > An example, let's start with Estimated Compression Ratio (ECR) = 1.0.
>> Say
>> > > the compression ratio of ACR is ~0.8, instead of letting the ECR
>> drop
>> > to
>> > > 0.8 very quickly, we only drop 0.001 every time when ACR < ECR.
>> However,
>> > > once we see an ACR > ECR, we increment ECR by 0.05. If a
>> > > RecordTooLargeException is received, we reset the ECR back to 1.0 and
>> > split
>> > > the batch.
>> > >
>> > > Thanks,
>> > >
>> > > Jiangjie (Becket) Qin
>> > >
>> > >
>> > >
>> > > On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
>> > > gharatmayures...@gmail.com> wrote:
>> > >
>> > > > Hi Becket,
>> > > >
>> > > > Seems like an interesting idea.
>> > > > I had couple of questions :
>> > > > 1) How do we decide when the batch should be split?
>> > > > 2) What do you mean by slowly lowering the "actual" compression
>> ratio?
>> > > > An example would really help here.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Mayuresh
>> > > >
>> > > > On Fri, Feb 24, 2017 at 3:17 PM, Becket Qin 
>> > > wrote:
>> > > >
>> > > > > Hi Jay,
>> > > > >
>> > > > > Yeah, I got your point.
>> > > > >
>> > > > > I think there might be a solution which does not require adding a
>> new
>> > > > > configuration. We can start from a very conservative compression
>> > ratio
>> > > > say
>> > > > > 1.0 and lower it very slowly according to the actual compression
>> > ratio
>> > > > > until we hit a point that we have to split a batch. At that
>> point, we
>> > > > > exponentially back off on the compression ratio. The idea is
>> somewhat
>> > > > like
>> > > > > TCP. This should help avoid frequent split.
>> > > > >
>> > > > > The upper bound of the batch size is also a little awkward today
>> > > because
>> > > > we
>> > > > > say the batch size is based on compressed size, but users cannot
>> set
>> > it
>> > > > to
>> > > > > the max message size because that will result in oversized
>> messages.
>> > > With
>> > > > > this change we will be able to allow the users to set the message
>> > size
>> > > to
>> > > > > close to max message size.
>> > > > >
>> > > > > However the downside is that there could be latency spikes in the
>> > > system
>> > > > in
>> > > > > this case due to the splitting, especially when there are many
>> > messages
>> > > > > that need to be split at the same time. That could potentially be an
>> issue
>> > > for
>> > > > > some users.
>> > > > >
>> > > > > What do you think about this approach?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jiangjie (Becket) Qin
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Feb 23, 2017 at 1:31 PM, Jay Kreps 
>> wrote:
>> > > > >
>> > > > > > Hey Becket,
>> > > > > >
>> > > > > > Yeah that makes sense.

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-03-03 Thread Becket Qin
I went ahead and submitted a patch here:
https://github.com/apache/kafka/pull/2638

Per Joel's suggestion, I changed the compression ratio to be per topic as
well. It seems to be working well. Since there is an important behavior change
and a new sensor is added, I'll keep the KIP and update it accordingly.

Thanks,

Jiangjie (Becket) Qin

On Mon, Feb 27, 2017 at 3:50 PM, Joel Koshy  wrote:

> >
> > Lets say we sent the batch over the wire and received a
> > RecordTooLargeException, how do we split it as once we add the message to
> > the batch we lose the message level granularity. We will have to
> > decompress, do a deep iteration, split, and compress again, right? This
> > looks like a performance bottleneck in case of multi-topic producers
> like
> > mirror maker.
> >
>
> Yes, but these should be outliers if we do estimation on a per-topic basis
> and if we target a conservative-enough compression ratio. The producer
> should also avoid sending over the wire if it can be made aware of the
> max-message size limit on the broker, and split if it determines that a
> record exceeds the broker's config. Ideally this should be part of topic
> metadata but is not - so it could be off a periodic describe-configs
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest>
> (which isn't available yet). This doesn't remove the need to split and
> recompress though.
>
>
> > On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin 
> wrote:
> >
> > > Hey Mayuresh,
> > >
> > > 1) The batch would be split when a RecordTooLargeException is
> received.
> > > 2) Not lower the actual compression ratio, but lower the estimated
> > > compression ratio "according to" the Actual Compression Ratio (ACR).
> > >
> > > An example, let's start with Estimated Compression Ratio (ECR) = 1.0.
> Say
> > > the compression ratio of ACR is ~0.8, instead of letting the ECR
> drop
> > to
> > > 0.8 very quickly, we only drop 0.001 every time when ACR < ECR.
> However,
> > > once we see an ACR > ECR, we increment ECR by 0.05. If a
> > > RecordTooLargeException is received, we reset the ECR back to 1.0 and
> > split
> > > the batch.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > > On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
> > > gharatmayures...@gmail.com> wrote:
> > >
> > > > Hi Becket,
> > > >
> > > > Seems like an interesting idea.
> > > > I had couple of questions :
> > > > 1) How do we decide when the batch should be split?
> > > > 2) What do you mean by slowly lowering the "actual" compression
> ratio?
> > > > An example would really help here.
> > > >
> > > > Thanks,
> > > >
> > > > Mayuresh
> > > >
> > > > On Fri, Feb 24, 2017 at 3:17 PM, Becket Qin 
> > > wrote:
> > > >
> > > > > Hi Jay,
> > > > >
> > > > > Yeah, I got your point.
> > > > >
> > > > > I think there might be a solution which does not require adding a new
> > > > > configuration. We can start from a very conservative compression
> > ratio
> > > > say
> > > > > 1.0 and lower it very slowly according to the actual compression
> > ratio
> > > > > until we hit a point that we have to split a batch. At that point,
> we
> > > > > exponentially back off on the compression ratio. The idea is
> somewhat
> > > > like
> > > > > TCP. This should help avoid frequent split.
> > > > >
> > > > > The upper bound of the batch size is also a little awkward today
> > > because
> > > > we
> > > > > say the batch size is based on compressed size, but users cannot
> set
> > it
> > > > to
> > > > > the max message size because that will result in oversized
> messages.
> > > With
> > > > > this change we will be able to allow the users to set the message
> > size
> > > to
> > > > > close to max message size.
> > > > >
> > > > > However the downside is that there could be latency spikes in the
> > > system
> > > > in
> > > > > this case due to the splitting, especially when there are many
> > messages
> > > > > that need to be split at the same time. That could potentially be an
> issue
> > > for
> > > > > some users.
> > > > >
> > > > > What do you think about this approach?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Feb 23, 2017 at 1:31 PM, Jay Kreps 
> wrote:
> > > > >
> > > > > > Hey Becket,
> > > > > >
> > > > > > Yeah that makes sense.
> > > > > >
> > > > > > I agree that you'd really have to both fix the estimation (i.e.
> > make
> > > it
> > > > > per
> > > > > > topic or make it better estimate the high percentiles) AND have
> the
> > > > > > recovery mechanism. If you are underestimating often and then
> > paying
> > > a
> > > > > high
> > > > > > recovery price that won't fly.
> > > > > >
> > > > > > I think you take my main point though, which is just that I hate to expose these super low-level options to users.
> 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-27 Thread Joel Koshy
>
> Lets say we sent the batch over the wire and received a
> RecordTooLargeException, how do we split it as once we add the message to
> the batch we loose the message level granularity. We will have to
> decompress, do deep iteration and split and again compress. right? This
> looks like a performance bottle neck in case of multi topic producers like
> mirror maker.
>

Yes, but these should be outliers if we do estimation on a per-topic basis
and if we target a conservative-enough compression ratio. The producer
should also avoid sending over the wire if it can be made aware of the
max-message size limit on the broker, and split if it determines that a
record exceeds the broker's config. Ideally this should be part of topic
metadata but is not - so it could be off a periodic describe-configs
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-DescribeConfigsRequest>
(which isn't available yet). This doesn't remove the need to split and
recompress though.
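A proactive client-side split along those lines could look roughly like the sketch below (hypothetical names such as `max_batch_bytes`; this is not the actual producer code, and real sizes would be compressed estimates rather than raw lengths):

```python
def split_batch(records, max_batch_bytes):
    """Greedily repack records into sub-batches whose estimated size stays
    under the broker's limit (hypothetical sketch, not KafkaProducer code)."""
    batches, current, current_size = [], [], 0
    for rec in records:
        size = len(rec)
        if size > max_batch_bytes:
            # A single record over the limit cannot be fixed by splitting.
            raise ValueError("single record exceeds the broker limit")
        if current and current_size + size > max_batch_bytes:
            # Current sub-batch is full: seal it and start a new one.
            batches.append(current)
            current, current_size = [], 0
        current.append(rec)
        current_size += size
    if current:
        batches.append(current)
    return batches


chunks = split_batch([b"x" * 400] * 5, max_batch_bytes=1000)
print([len(c) for c in chunks])   # -> [2, 2, 1]
```

The greedy repacking keeps message-level granularity, which is exactly what is lost once the batch has been compressed, hence the decompress/iterate/recompress cost discussed above.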


> On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin  wrote:
>
> > Hey Mayuresh,
> >
> > 1) The batch would be split when a RecordTooLargeException is received.
> > 2) Not lower the actual compression ratio, but lower the estimated
> > compression ratio "according to" the Actual Compression Ratio (ACR).
> >
> > An example, let's start with Estimated Compression Ratio (ECR) = 1.0. Say
> > the compression ratio of ACR is ~0.8, instead of letting the ECR drop
> to
> > 0.8 very quickly, we only drop 0.001 every time when ACR < ECR. However,
> > once we see an ACR > ECR, we increment ECR by 0.05. If a
> > RecordTooLargeException is received, we reset the ECR back to 1.0 and
> split
> > the batch.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
> > gharatmayures...@gmail.com> wrote:
> >
> > > Hi Becket,
> > >
> > > Seems like an interesting idea.
> > > I had couple of questions :
> > > 1) How do we decide when the batch should be split?
> > > 2) What do you mean by slowly lowering the "actual" compression ratio?
> > > An example would really help here.
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Fri, Feb 24, 2017 at 3:17 PM, Becket Qin 
> > wrote:
> > >
> > > > Hi Jay,
> > > >
> > > > Yeah, I got your point.
> > > >
> > > > I think there might be a solution which does not require adding a new
> > > > configuration. We can start from a very conservative compression
> ratio
> > > say
> > > > 1.0 and lower it very slowly according to the actual compression
> ratio
> > > > until we hit a point that we have to split a batch. At that point, we
> > > > exponentially back off on the compression ratio. The idea is somewhat
> > > like
> > > > TCP. This should help avoid frequent split.
> > > >
> > > > The upper bound of the batch size is also a little awkward today
> > because
> > > we
> > > > say the batch size is based on compressed size, but users cannot set
> it
> > > to
> > > > the max message size because that will result in oversized messages.
> > With
> > > > this change we will be able to allow the users to set the message
> size
> > to
> > > > close to max message size.
> > > >
> > > > However the downside is that there could be latency spikes in the
> > system
> > > in
> > > > this case due to the splitting, especially when there are many
> messages
> > > > that need to be split at the same time. That could potentially be an issue
> > for
> > > > some users.
> > > >
> > > > What do you think about this approach?
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > >
> > > >
> > > > On Thu, Feb 23, 2017 at 1:31 PM, Jay Kreps  wrote:
> > > >
> > > > > Hey Becket,
> > > > >
> > > > > Yeah that makes sense.
> > > > >
> > > > > I agree that you'd really have to both fix the estimation (i.e.
> make
> > it
> > > > per
> > > > > topic or make it better estimate the high percentiles) AND have the
> > > > > recovery mechanism. If you are underestimating often and then
> paying
> > a
> > > > high
> > > > > recovery price that won't fly.
> > > > >
> > > > > I think you take my main point though, which is just that I hate to
> > > > expose
> > > > > these super low level options to users because it is so hard to
> > explain
> > > > to
> > > > > people what it means and how they should set it. So if it is
> possible
> > > to
> > > > > make either some combination of better estimation and splitting or
> > > better
> > > > > > tolerance of overage that would be preferable.
> > > > >
> > > > > -Jay
> > > > >
> > > > > > On Thu, Feb 23, 2017 at 11:51 AM, Becket Qin <becket@gmail.com>
> > > > wrote:
> > > > >
> > > > > > @Dong,
> > > > > >
> > > > > > Thanks for the comments. The default behavior of the producer
> won't
> > > > > change.
> > > > > > If the users want to use the uncompressed message size, they probably will also bump up the batch size to somewhere close to the max message size.

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-27 Thread Becket Qin
Hey Mayuresh,

1) The batch would be split when a RecordTooLargeException is received.
2) We don't lower the actual compression ratio; we lower the estimated
compression ratio according to the Actual Compression Ratio (ACR).

As an example, let's start with an Estimated Compression Ratio (ECR) of 1.0.
Say the ACR is ~0.8; instead of letting the ECR drop to 0.8 very quickly, we
lower it by only 0.001 each time ACR < ECR. However, once we see an ACR > ECR,
we increment the ECR by 0.05. If a RecordTooLargeException is received, we
reset the ECR back to 1.0 and split the batch.
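This adjustment scheme can be sketched as follows (hypothetical names and a minimal sketch under the constants above; not the actual KafkaProducer implementation):

```python
class CompressionRatioEstimator:
    """TCP-like slow-decrease / fast-increase estimator of the compression
    ratio (hypothetical sketch, not the actual producer code)."""

    DECREASE_STEP = 0.001  # applied when a batch compresses better than estimated
    INCREASE_STEP = 0.05   # applied when a batch compresses worse than estimated
    MAX_RATIO = 1.0        # most conservative estimate: no compression at all

    def __init__(self):
        self.estimated_ratio = self.MAX_RATIO

    def observe(self, actual_ratio):
        """Update the estimate after a batch is compressed."""
        if actual_ratio < self.estimated_ratio:
            # Creep down slowly, never below the observed ratio.
            self.estimated_ratio = max(actual_ratio,
                                       self.estimated_ratio - self.DECREASE_STEP)
        else:
            # Back off quickly when we underestimated.
            self.estimated_ratio = min(self.MAX_RATIO,
                                       self.estimated_ratio + self.INCREASE_STEP)

    def on_record_too_large(self):
        """Broker rejected the batch: reset to conservative and split."""
        self.estimated_ratio = self.MAX_RATIO


est = CompressionRatioEstimator()
for _ in range(100):              # 100 batches that actually compress to ~0.8
    est.observe(0.8)
print(round(est.estimated_ratio, 3))   # crept down 0.001 per batch -> ~0.9
```

The asymmetry (tiny decrease, larger increase, full reset on rejection) is what keeps splits rare while still letting the estimate track the real ratio.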

Thanks,

Jiangjie (Becket) Qin



On Mon, Feb 27, 2017 at 10:30 AM, Mayuresh Gharat <
gharatmayures...@gmail.com> wrote:

> Hi Becket,
>
> Seems like an interesting idea.
> I had couple of questions :
> 1) How do we decide when the batch should be split?
> 2) What do you mean by slowly lowering the "actual" compression ratio?
> An example would really help here.
>
> Thanks,
>
> Mayuresh
>
> On Fri, Feb 24, 2017 at 3:17 PM, Becket Qin  wrote:
>
> > Hi Jay,
> >
> > Yeah, I got your point.
> >
> > I think there might be a solution which does not require adding a new
> > configuration. We can start from a very conservative compression ratio
> say
> > 1.0 and lower it very slowly according to the actual compression ratio
> > until we hit a point that we have to split a batch. At that point, we
> > exponentially back off on the compression ratio. The idea is somewhat
> like
> > TCP. This should help avoid frequent split.
> >
> > The upper bound of the batch size is also a little awkward today because
> we
> > say the batch size is based on compressed size, but users cannot set it
> to
> > the max message size because that will result in oversized messages. With
> > this change we will be able to allow the users to set the message size to
> > close to max message size.
> >
> > However the downside is that there could be latency spikes in the system
> in
> > this case due to the splitting, especially when there are many messages
> > that need to be split at the same time. That could potentially be an issue for
> > some users.
> >
> > What do you think about this approach?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Thu, Feb 23, 2017 at 1:31 PM, Jay Kreps  wrote:
> >
> > > Hey Becket,
> > >
> > > Yeah that makes sense.
> > >
> > > I agree that you'd really have to both fix the estimation (i.e. make it
> > per
> > > topic or make it better estimate the high percentiles) AND have the
> > > recovery mechanism. If you are underestimating often and then paying a
> > high
> > > recovery price that won't fly.
> > >
> > > I think you take my main point though, which is just that I hate to
> > exposes
> > > these super low level options to users because it is so hard to explain
> > to
> > > people what it means and how they should set it. So if it is possible
> to
> > > make either some combination of better estimation and splitting or
> better
> > > tolerance of overage that would be preferrable.
> > >
> > > -Jay
> > >
> > > On Thu, Feb 23, 2017 at 11:51 AM, Becket Qin 
> > wrote:
> > >
> > > > @Dong,
> > > >
> > > > Thanks for the comments. The default behavior of the producer won't
> > > change.
> > > > If the users want to use the uncompressed message size, they probably
> > > will
> > > > also bump up the batch size to somewhere close to the max message
> size.
> > > > This would be in the document. BTW the default batch size is 16K
> which
> > is
> > > > pretty small.
> > > >
> > > > @Jay,
> > > >
> > > > Yeah, we actually had debated quite a bit internally what is the best
> > > > solution to this.
> > > >
> > > > I completely agree it is a bug. In practice we usually leave some
> > > headroom
> > > > to allow the compressed size to grow a little if the the original
> > > messages
> > > > are not compressible, for example, 1000 KB instead of exactly 1 MB.
> It
> > is
> > > > likely safe enough.
> > > >
> > > > The major concern for the rejected alternative is performance. It
> > largely
> > > > depends on how frequent we need to split a batch, i.e. how likely the
> > > > estimation can go off. If we only need to the split work
> occasionally,
> > > the
> > > > cost would be amortized so we don't need to worry about it too much.
> > > > However, it looks that for a producer with shared topics, the
> > estimation
> > > is
> > > > always off. As an example, consider two topics, one with compression
> > > ratio
> > > > 0.6 the other 0.2, assuming exactly same traffic, the average
> > compression
> > > > ratio would be roughly 0.4, which is not right for either of the
> > topics.
> > > So
> > > > almost half of the batches (of the topics with 0.6 compression ratio)
> > > will
> > > > end up larger than the configured batch size. When it comes to more
> > > topics
> > > > such as mirror maker, this becomes more unpredictable. To avoid
> > frequent
> > > > rejection 

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-27 Thread Mayuresh Gharat
Hi Becket,

Thanks for the explanation.
Regarding:
1) The batch would be split when a RecordTooLargeException is received.

Let's say we sent the batch over the wire and received a
RecordTooLargeException. How do we split it, given that once we add a
message to the batch we lose message-level granularity? We will have to
decompress, do a deep iteration, split, and compress again, right? This
looks like a performance bottleneck for multi-topic producers like mirror
maker.


Thanks,

Mayuresh

On Mon, Feb 27, 2017 at 10:51 AM, Becket Qin  wrote:

> Hey Mayuresh,
>
> 1) The batch would be split when a RecordTooLargeException is received.
> 2) Not lower the actual compression ratio, but lower the estimated
> compression ratio "according to" the Actual Compression Ratio (ACR).
>
> As an example, let's start with an Estimated Compression Ratio (ECR) of
> 1.0. Say the ACR is ~0.8. Instead of letting the ECR drop to 0.8 very
> quickly, we only decrease it by 0.001 each time we observe ACR < ECR.
> However, once we see an ACR > ECR, we increment the ECR by 0.05. If a
> RecordTooLargeException is received, we reset the ECR back to 1.0 and split
> the batch.
>
> Thanks,
>
> Jiangjie (Becket) Qin

Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-27 Thread Mayuresh Gharat
Hi Becket,

Seems like an interesting idea.
I had a couple of questions:
1) How do we decide when the batch should be split?
2) What do you mean by slowly lowering the "actual" compression ratio?
An example would really help here.

Thanks,

Mayuresh


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-24 Thread Becket Qin
Hi Jay,

Yeah, I got your point.

I think there might be a solution that does not require adding a new
configuration. We can start from a very conservative compression ratio, say
1.0, and lower it very slowly according to the actual compression ratio
until we hit a point where we have to split a batch. At that point, we
exponentially back off on the compression ratio. The idea is somewhat like
TCP congestion control. This should help avoid frequent splits.

The upper bound of the batch size is also a little awkward today: we say the
batch size is based on the compressed size, but users cannot set it to the
max message size because that would result in oversized messages. With this
change we would be able to let users set the batch size close to the max
message size.

However, the downside is that there could be latency spikes in the system
due to the splitting, especially when many batches need to be split at the
same time. That could potentially be an issue for some users.

What do you think about this approach?

Thanks,

Jiangjie (Becket) Qin




Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-23 Thread Jay Kreps
Hey Becket,

Yeah that makes sense.

I agree that you'd really have to both fix the estimation (i.e. make it per
topic or make it better estimate the high percentiles) AND have the
recovery mechanism. If you are underestimating often and then paying a high
recovery price that won't fly.

I think you take my main point though, which is just that I hate to expose
these super-low-level options to users because it is so hard to explain to
people what they mean and how they should be set. So if it is possible to
make some combination of better estimation and splitting, or better
tolerance of overage, that would be preferable.

-Jay


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-23 Thread Becket Qin
@Dong,

Thanks for the comments. The default behavior of the producer won't change.
If users want to use the uncompressed message size, they will probably also
bump up the batch size to somewhere close to the max message size. This will
be noted in the documentation. BTW, the default batch size is 16 KB, which
is pretty small.

@Jay,

Yeah, we actually debated quite a bit internally about the best solution to
this.

I completely agree it is a bug. In practice we usually leave some headroom
to allow the compressed size to grow a little if the original messages are
not compressible, for example, 1000 KB instead of exactly 1 MB. That is
likely safe enough.

The major concern for the rejected alternative is performance. It largely
depends on how frequently we need to split a batch, i.e. how likely the
estimation is to go off. If we only need to do the split work occasionally,
the cost would be amortized, so we don't need to worry about it too much.
However, it looks like for a producer with shared topics, the estimation is
always off. As an example, consider two topics, one with compression ratio
0.6 and the other 0.2. Assuming exactly the same traffic, the average
compression ratio would be roughly 0.4, which is not right for either of the
topics. So almost half of the batches (those of the topic with compression
ratio 0.6) will end up larger than the configured batch size. When it comes
to more topics, such as in mirror maker, this becomes more unpredictable. To
avoid frequent rejection/splitting of batches, we need to configure the
batch size pretty conservatively. This could actually hurt performance,
because we are shoehorning the highly compressible messages into a small
batch so that the other topics that are not that compressible will not
become too large with the same batch size. At LinkedIn, our batch size is
configured to 64 KB because of this. I think we may actually get better
batching if we just use the uncompressed message size and an 800 KB batch
size.
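As a hypothetical back-of-the-envelope check of the two-topic mismatch above (using the numbers from the example plus the 16 KB default batch size mentioned earlier in the thread):

```python
# Hypothetical illustration of why a producer-wide average compression
# ratio mis-sizes batches for every topic, using the two-topic example
# above (ratios 0.6 and 0.2, equal traffic) and a 16 KB batch.size.
def compressed_batch_bytes(batch_size, estimated_ratio, true_ratio):
    # The producer closes a batch once (estimated) compressed bytes reach
    # batch_size, i.e. after batch_size / estimated_ratio uncompressed bytes.
    uncompressed = batch_size / estimated_ratio
    return uncompressed * true_ratio

BATCH_SIZE = 16_000                # compressed-size target in bytes
blended = (0.6 + 0.2) / 2          # producer-wide average estimate = 0.4

over = compressed_batch_bytes(BATCH_SIZE, blended, 0.6)   # ~24,000 bytes
under = compressed_batch_bytes(BATCH_SIZE, blended, 0.2)  # ~8,000 bytes
# The 0.6 topic overshoots the target by ~50%; the 0.2 topic undershoots
# by ~50%. A per-topic estimate would hit the target for both.
```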

We did not think about loosening the message size restriction, but that
sounds like a viable solution given that the consumer can now fetch oversized
messages. One concern would be that on the broker side oversized messages
will bring more memory pressure. With KIP-92, we may mitigate that, but the
memory allocation for large messages may not be very GC friendly. I need to
think about this a little more.

Thanks,

Jiangjie (Becket) Qin



Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-22 Thread Jay Kreps
Hey Becket,

I get the problem we want to solve with this, but I don't think this is
something that makes sense as a user-controlled knob that everyone sending
data to Kafka has to think about. It is basically a bug, right?

First, as a technical question: is it true that using the uncompressed size
for batching actually guarantees that you observe the limit? I think that
implies that compression always makes the messages smaller, which I think is
usually true but is not guaranteed, right? E.g., if someone encrypts their
data, which tends to randomize it, and then enables compression, it could
get slightly bigger?

I also wonder if the rejected alternatives you describe couldn't be made to
work: basically try to be a bit better at estimation and recover when we
guess wrong. I don't think the memory usage should be a problem: isn't it
the same memory usage the consumer of that topic would need? And can't you
do the splitting and recompression in a streaming fashion? If we can make
the estimation miss rate low and the recovery cost just ~2x the normal cost
for that batch, that should be totally fine, right? (It's technically true
you might have to split more than once, but since you halve it each time,
you should get a number of halvings that is logarithmic in the miss size,
which, with better estimation, you'd hope would be super duper small.)
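The halving argument can be sketched as follows (illustrative Python; the function name and the 1 MB limit are assumptions, not anything from the producer code):

```python
import math

# Sketch of the halving argument above: if an oversized batch is split in
# half repeatedly until each piece fits, the number of splitting rounds
# grows only logarithmically with how badly the size estimate missed.
def halvings_needed(batch_bytes, limit_bytes):
    rounds = 0
    while batch_bytes > limit_bytes:
        batch_bytes /= 2
        rounds += 1
    return rounds

LIMIT = 1_000_000  # e.g. an assumed ~1 MB max message size
for miss_factor in (1.2, 2, 4, 10):
    rounds = halvings_needed(LIMIT * miss_factor, LIMIT)
    # The number of halvings equals ceil(log2(miss_factor)): even a 10x
    # miss costs only 4 rounds, and a small miss costs just 1.
    assert rounds == math.ceil(math.log2(miss_factor))
```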

Alternatively, maybe we could work on the other side of the problem and try
to make it so that a small miss on message size isn't a big problem. I think
the original issue was that the max message size and the fetch size were
tightly coupled: the way memory in the consumer worked, you really wanted
the fetch size to be as small as possible, because you'd use that much
memory per fetched partition, yet the consumer would get stuck if its fetch
size wasn't big enough. I think we made some progress on that issue, and
maybe more could be done there, so that a small bit of fuzziness around the
size would not be an issue?

-Jay



On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin  wrote:

> Hi folks,
>
> I would like to start the discussion thread on KIP-126. The KIP proposes
> adding a new configuration to KafkaProducer to allow batching based on
> uncompressed message size.
>
> Comments are welcome.
>
> The KIP wiki is here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
>
> Thanks,
>
> Jiangjie (Becket) Qin
>


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-22 Thread Dong Lin
Hey Becket,

I realized that Apurva has already raised similar questions. I think you
answered his question by saying that the request size will not be small. I
agree that there will be no impact on throughput if we can reach the request
size limit with compression estimation disabled. But I am not sure that will
be the case; this is at least a concern when MM is mirroring traffic for
only a few partitions with a high bytes-in rate. Thus I am wondering if we
should do the optimization proposed above.

Thanks,
Dong
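The throughput gap Dong describes in the quoted message below can be sketched with simple arithmetic (assumed numbers from his example):

```python
# Rough per-partition, per-round-trip arithmetic for the scenario Dong
# describes (assumed numbers from his example: batch.size = 100 KB,
# max.in.flight.requests.per.connection = 1, ~10x compression).
BATCH_SIZE = 100 * 1024      # producer batch.size in bytes
COMPRESSION_FACTOR = 10      # user data assumed to compress roughly 10:1

# Current behavior: batches are sized on compressed bytes, so each round
# trip carries ~BATCH_SIZE compressed bytes, i.e. ~10x that in user data.
user_bytes_today = BATCH_SIZE * COMPRESSION_FACTOR

# With compression estimation disabled: batches are sized on uncompressed
# bytes, so each round trip carries only BATCH_SIZE of user data.
user_bytes_disabled = BATCH_SIZE

# The 10x per-round-trip throughput gap Dong points out:
assert user_bytes_today // user_bytes_disabled == COMPRESSION_FACTOR
```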

On Wed, Feb 22, 2017 at 6:39 PM, Dong Lin  wrote:

> Hey Becket,
>
> Thanks for the KIP. I have one question here.
>
> Suppose the producer's batch.size = 100 KB and
> max.in.flight.requests.per.connection = 1. Since each ProduceRequest
> contains one batch per partition, it means that 100 KB of compressed data
> will be produced per partition per round-trip time in the current
> implementation. If we disable compression estimation with this KIP, then
> the producer can only produce 100 KB of uncompressed data per partition
> per round-trip time. Suppose the average compression ratio is 10; then
> there will be a 10X difference in the bytes transmitted per round-trip
> time. The impact on throughput can be big if mirror maker is producing to
> a remote cluster, even though the compression ratio may be the same.
>
> Given this observation, we should probably note in the KIP that users
> should bump up the producer's batch.size to the message.max.bytes
> configured on the broker, which by default is roughly 1 MB, to achieve
> the maximum possible throughput when compression estimation is disabled.
>
> Still, this can impact the throughput of a producer or MM that is producing
> highly compressible data. I think we can get around this problem by
> allowing each request to have multiple batches per partition, as long as the
> total size of these batches is <= the producer's batch.size config. Do you
> think it is worth doing?
>
> Thanks,
> Dong
>
>
>
> On Tue, Feb 21, 2017 at 7:56 PM, Mayuresh Gharat <
> gharatmayures...@gmail.com> wrote:
>
>> Apurva has a point that can be documented for this config.
>>
>> Overall, LGTM +1.
>>
>> Thanks,
>>
>> Mayuresh
>>
>> On Tue, Feb 21, 2017 at 6:41 PM, Becket Qin  wrote:
>>
>> > Hi Apurva,
>> >
>> > Yes, it is true that the request size might be much smaller if the
>> batching
>> > is based on uncompressed size. I will let the users know about this.
>> That
>> > said, in practice, this is probably fine. For example, at LinkedIn, our
>> > max message size is 1 MB and the compressed size would typically be 100 KB
>> > or larger; given that in most cases there are many partitions, the request
>> > size would not be too small (typically around a few MB).
>> >
>> > At LinkedIn we do have some topics that have varying compression ratios.
>> > Those are usually topics shared by different services, so the data may
>> > differ a lot even though the messages are in the same topic and have
>> > similar fields.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> > On Tue, Feb 21, 2017 at 6:17 PM, Apurva Mehta 
>> wrote:
>> >
>> > > Hi Becket, Thanks for the kip.
>> > >
>> > > I think one of the risks here is that when compression estimation is
>> > > disabled, you could have much smaller batches than expected, and
>> > throughput
>> > > could be hurt. It would be worth adding this to the documentation of
>> this
>> > > setting.
>> > >
>> > > Also, one of the rejected alternatives states that per topic
>> estimations
>> > > would not work when the compression of individual messages is
>> variable.
>> > > This is true in theory, but in practice one would expect Kafka topics
>> to
>> > > have fairly homogeneous data, and hence should compress evenly. I was
>> > > curious if you have data which shows otherwise.
>> > >
>> > > Thanks,
>> > > Apurva
>> > >
>> > > On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin 
>> > wrote:
>> > >
>> > > > Hi folks,
>> > > >
>> > > > I would like to start the discussion thread on KIP-126. The KIP
>> proposes
>> > > > adding a new configuration to KafkaProducer to allow batching based
>> on
>> > > > uncompressed message size.
>> > > >
>> > > > Comments are welcome.
>> > > >
>> > > > The KIP wiki is the following:
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> > > > 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jiangjie (Becket) Qin
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> -Regards,
>> Mayuresh R. Gharat
>> (862) 250-7125
>>
>
>


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-22 Thread Dong Lin
Hey Becket,

Thanks for the KIP. I have one question here.

Suppose producer's batch.size=100 KB,
max.in.flight.requests.per.connection=1. Since each ProduceRequest contains
one batch per partition, it means that 100 KB compressed data will be
produced per partition per round-trip time as of the current implementation. If
we disable compression estimation with this KIP, then the producer can only
produce 100 KB of uncompressed data per partition per round-trip time. Suppose
the average compression ratio is 10; then there will be a 10X difference in the
bytes that are transmitted per round-trip time. The impact on
throughput can be big if mirror maker is producing to a remote cluster, even
though the compression ratio may be the same.

Given this observation, we should probably note in the KIP that users should
bump up the producer's batch.size to the message.max.bytes configured on
the broker, which by default is roughly 1 MB, to achieve the maximum possible
throughput when compression estimation is disabled.

Still, this can impact the throughput of a producer or MM that is producing
highly compressible data. I think we can get around this problem by
allowing each request to have multiple batches per partition, as long as the
total size of these batches is <= the producer's batch.size config. Do you
think it is worth doing?

Thanks,
Dong



On Tue, Feb 21, 2017 at 7:56 PM, Mayuresh Gharat  wrote:

> Apurva has a point that can be documented for this config.
>
> Overall, LGTM +1.
>
> Thanks,
>
> Mayuresh
>
> On Tue, Feb 21, 2017 at 6:41 PM, Becket Qin  wrote:
>
> > Hi Apurva,
> >
> > Yes, it is true that the request size might be much smaller if the
> batching
> > is based on uncompressed size. I will let the users know about this. That
> > said, in practice, this is probably fine. For example, at LinkedIn, our
> > max message size is 1 MB and the compressed size would typically be 100 KB
> > or larger; given that in most cases there are many partitions, the request
> > size would not be too small (typically around a few MB).
> >
> > At LinkedIn we do have some topics that have varying compression ratios.
> > Those are usually topics shared by different services, so the data may
> > differ a lot even though the messages are in the same topic and have
> > similar fields.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Tue, Feb 21, 2017 at 6:17 PM, Apurva Mehta 
> wrote:
> >
> > > Hi Becket, Thanks for the kip.
> > >
> > > I think one of the risks here is that when compression estimation is
> > > disabled, you could have much smaller batches than expected, and
> > throughput
> > > could be hurt. It would be worth adding this to the documentation of
> this
> > > setting.
> > >
> > > Also, one of the rejected alternatives states that per topic
> estimations
> > > would not work when the compression of individual messages is variable.
> > > This is true in theory, but in practice one would expect Kafka topics
> to
> > > have fairly homogeneous data, and hence should compress evenly. I was
> > > curious if you have data which shows otherwise.
> > >
> > > Thanks,
> > > Apurva
> > >
> > > On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin 
> > wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I would like to start the discussion thread on KIP-126. The KIP
> proposes
> > > > adding a new configuration to KafkaProducer to allow batching based
> on
> > > > uncompressed message size.
> > > >
> > > > Comments are welcome.
> > > >
> > > > The KIP wiki is the following:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > >
> >
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-21 Thread Mayuresh Gharat
Apurva has a point that can be documented for this config.

Overall, LGTM +1.

Thanks,

Mayuresh

On Tue, Feb 21, 2017 at 6:41 PM, Becket Qin  wrote:

> Hi Apurva,
>
> Yes, it is true that the request size might be much smaller if the batching
> is based on uncompressed size. I will let the users know about this. That
> said, in practice, this is probably fine. For example, at LinkedIn, our max
> message size is 1 MB and the compressed size would typically be 100 KB or
> larger; given that in most cases there are many partitions, the request
> size would not be too small (typically around a few MB).
>
> At LinkedIn we do have some topics that have varying compression ratios.
> Those are usually topics shared by different services, so the data may differ
> a lot even though the messages are in the same topic and have similar fields.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Tue, Feb 21, 2017 at 6:17 PM, Apurva Mehta  wrote:
>
> > Hi Becket, Thanks for the kip.
> >
> > I think one of the risks here is that when compression estimation is
> > disabled, you could have much smaller batches than expected, and
> throughput
> > could be hurt. It would be worth adding this to the documentation of this
> > setting.
> >
> > Also, one of the rejected alternatives states that per topic estimations
> > would not work when the compression of individual messages is variable.
> > This is true in theory, but in practice one would expect Kafka topics to
> > have fairly homogeneous data, and hence should compress evenly. I was
> > curious if you have data which shows otherwise.
> >
> > Thanks,
> > Apurva
> >
> > On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin 
> wrote:
> >
> > > Hi folks,
> > >
> > > I would like to start the discussion thread on KIP-126. The KIP proposes
> > > adding a new configuration to KafkaProducer to allow batching based on
> > > uncompressed message size.
> > >
> > > Comments are welcome.
> > >
> > > The KIP wiki is the following:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> >
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-21 Thread Becket Qin
Hi Apurva,

Yes, it is true that the request size might be much smaller if the batching
is based on uncompressed size. I will let the users know about this. That
said, in practice, this is probably fine. For example, at LinkedIn, our max
message size is 1 MB and the compressed size would typically be 100 KB or
larger; given that in most cases there are many partitions, the request
size would not be too small (typically around a few MB).
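The arithmetic behind "a few MB" can be sketched as follows (the partition count is an assumed illustrative value; the other numbers come from the LinkedIn figures above):

```python
# Rough request-size estimate: a ProduceRequest carries one batch per
# partition, so with many partitions the request stays large even when
# each compressed batch is only ~100 KB.
max_message_size = 1 * 1024 * 1024     # broker max message size: 1 MB
typical_compressed_batch = 100 * 1024  # typical compressed batch: ~100 KB
partitions_per_request = 30            # assumed partition count per broker

request_size = typical_compressed_batch * partitions_per_request
print(request_size / (1024 * 1024))    # -> 2.9296875, i.e. "a few MB"
```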

At LinkedIn we do have some topics that have varying compression ratios. Those
are usually topics shared by different services, so the data may differ a lot
even though the messages are in the same topic and have similar fields.

Thanks,

Jiangjie (Becket) Qin


On Tue, Feb 21, 2017 at 6:17 PM, Apurva Mehta  wrote:

> Hi Becket, Thanks for the kip.
>
> I think one of the risks here is that when compression estimation is
> disabled, you could have much smaller batches than expected, and throughput
> could be hurt. It would be worth adding this to the documentation of this
> setting.
>
> Also, one of the rejected alternatives states that per topic estimations
> would not work when the compression of individual messages is variable.
> This is true in theory, but in practice one would expect Kafka topics to
> have fairly homogenous data, and hence should compress evenly. I was
> curious if you have data which shows otherwise.
>
> Thanks,
> Apurva
>
> On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin  wrote:
>
> > Hi folks,
> >
> > I would like to start the discussion thread on KIP-126. The KIP proposes
> > adding a new configuration to KafkaProducer to allow batching based on
> > uncompressed message size.
> >
> > Comments are welcome.
> >
> > The KIP wiki is the following:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
>


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-21 Thread Apurva Mehta
Hi Becket, Thanks for the kip.

I think one of the risks here is that when compression estimation is
disabled, you could have much smaller batches than expected, and throughput
could be hurt. It would be worth adding this to the documentation of this
setting.

Also, one of the rejected alternatives states that per-topic estimations
would not work when the compression of individual messages is variable.
This is true in theory, but in practice one would expect Kafka topics to
have fairly homogeneous data, and hence they should compress evenly. I was
curious if you have data which shows otherwise.
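For what it's worth, a synthetic sketch (toy payloads, not data from any real topic) shows how messages of similar size within one topic can compress very differently:

```python
import os
import zlib

# Toy illustration of a topic shared by different services: a repetitive,
# log-like payload versus an opaque one of the same size (os.urandom stands
# in for pre-compressed or encrypted data).
log_like = b"user=alice action=click page=home status=200 " * 100
opaque = os.urandom(len(log_like))

ratio_log = len(log_like) / len(zlib.compress(log_like))
ratio_opaque = len(opaque) / len(zlib.compress(opaque))

# The log-like payload compresses far better than the opaque one, so a
# single per-topic estimate would misjudge one kind of message or the other.
print(round(ratio_log, 1), round(ratio_opaque, 2))
```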

Thanks,
Apurva

On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin  wrote:

> Hi folks,
>
> I would like to start the discussion thread on KIP-126. The KIP proposes
> adding a new configuration to KafkaProducer to allow batching based on
> uncompressed message size.
>
> Comments are welcome.
>
> The KIP wiki is the following:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
>
> Thanks,
>
> Jiangjie (Becket) Qin
>


Re: [DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-21 Thread Onur Karaman
+1. LGTM. No major concerns.

On Tue, Feb 21, 2017 at 12:30 PM, Becket Qin  wrote:

> Hi folks,
>
> I would like to start the discussion thread on KIP-126. The KIP proposes
> adding a new configuration to KafkaProducer to allow batching based on
> uncompressed message size.
>
> Comments are welcome.
>
> The KIP wiki is the following:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size
>
> Thanks,
>
> Jiangjie (Becket) Qin
>


[DISCUSS] KIP-126 - Allow KafkaProducer to batch based on uncompressed size

2017-02-21 Thread Becket Qin
Hi folks,

I would like to start the discussion thread on KIP-126. The KIP proposes
adding a new configuration to KafkaProducer to allow batching based on
uncompressed message size.
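As a rough, hypothetical sketch of the two batching policies (the names, the constant, and the estimation logic here are simplified stand-ins for the producer's internals, not code from the KIP):

```python
BATCH_SIZE = 16 * 1024  # producer batch.size, in bytes

def batch_is_full_estimated(uncompressed_bytes, estimated_ratio):
    # Current behavior (simplified): the producer guesses the compressed
    # size using a running compression-ratio estimate and closes the
    # batch when the estimated *compressed* size reaches batch.size.
    return uncompressed_bytes / estimated_ratio >= BATCH_SIZE

def batch_is_full_uncompressed(uncompressed_bytes):
    # Proposed option with estimation disabled: close the batch when the
    # *uncompressed* size reaches batch.size, so the batch size no longer
    # depends on how accurate the ratio estimate happens to be.
    return uncompressed_bytes >= BATCH_SIZE

# With 64 KB of appended data and an estimated 4:1 ratio, the estimated
# policy considers the batch exactly full; the uncompressed policy would
# have closed it at 16 KB already.
print(batch_is_full_estimated(64 * 1024, 4.0))   # True
print(batch_is_full_uncompressed(64 * 1024))     # True
```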

Comments are welcome.

The KIP wiki is the following:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-126+-+Allow+KafkaProducer+to+batch+based+on+uncompressed+size

Thanks,

Jiangjie (Becket) Qin