Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-08 Thread Andrey Falko
On Sat, Jan 6, 2018 at 5:57 PM, Colin McCabe  wrote:
> On Thu, Jan 4, 2018, at 10:37, Jun Rao wrote:
>>
>> 4. The process of marking a partition as dirty requires updating every fetch
>> session that contains the partition. This may add some overhead. An
>> alternative approach is to check the difference between the cached fetch
>> offset and the HW (or LEO) when serving the fetch request.
>
> That's a good point.  The caching approach avoids needing to update every 
> fetch session when one of those numbers changes.  I think an even more 
> important advantage is that it's simpler to implement -- we don't have to 
> worry about forgetting to update a fetch session when one of those numbers 
> changes.  The disadvantage is some extra memory consumption per partition per 
> fetch session.
>
> I think the advantage, especially in terms of simplicity, might outweigh the 
> memory concern.  My initial implementation uses the caching approach.  I will 
> update the KIP once I have this working.
>

We're very interested in this KIP because it might improve one of our
topic-heavy clusters. I have a stress-test generating topics across a
number of Kafka brokers; if you'd like early and quick feedback on
your implementation let me know!

The discussion thread is very long, so hopefully I'm not asking
something that was asked before: does Kafka already expose
`FetchRequest` size for monitoring purposes? It might improve the KIP
to track the before-and-after behavior.

-Andrey


Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-06 Thread Colin McCabe
On Thu, Jan 4, 2018, at 10:37, Jun Rao wrote:
> Hi, Colin,
> 
> A few more comments on the KIP.
> 
> 1. About fetch sequence number. My understanding is that the sequence
> number is really used to provide better fencing for requests coming from
> different TCP sessions for a given client. Suppose a consumer issues a
> fetch request and doesn't get a response in time. The consumer will then
> retry the request on a new TCP session. In this case, it's important that
> the fetch session cached on the leader not be overridden by the request
> from the old TCP session after the request from the new TCP session has
> been processed. To support this, one way is to have an epoch field and have
> the consumer bump up the epoch on every new TCP session. The leader will
> reject requests with an old epoch. The value of a sequence number within
> the same TCP session is less clear to me. In any case, it would be
> useful to document the client behavior on request timeout.

Hi Jun,

I'll try to add some examples the next time I rework the KIP.

I think the biggest win with epoch / sequence number is the ability to keep 
your session ID the same after a TCP disconnect.
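That fencing idea could look roughly like the sketch below (Java, with illustrative class and method names; not the KIP's actual protocol or broker code): the leader accepts a request only if its epoch is no older than the newest one it has processed, so a delayed request from a dead TCP connection cannot clobber the session state written by the new connection.

```java
/**
 * Sketch of epoch-based fencing for a fetch session.  Class and method names
 * are illustrative assumptions, not the KIP's final protocol.
 */
class FetchSessionEpochSketch {
    private int currentEpoch = 0;

    /**
     * Accept a request only if its epoch is at least as new as the newest one
     * seen so far.  A retry on a new TCP connection bumps the epoch, so a
     * late-arriving request from the old connection is rejected and cannot
     * clobber newer session state.
     */
    synchronized boolean tryAccept(int requestEpoch) {
        if (requestEpoch < currentEpoch) {
            return false;  // stale request from an older connection
        }
        currentEpoch = requestEpoch;
        return true;
    }

    public static void main(String[] args) {
        FetchSessionEpochSketch session = new FetchSessionEpochSketch();
        System.out.println(session.tryAccept(1));  // true: first connection
        System.out.println(session.tryAccept(2));  // true: retry on a new connection
        System.out.println(session.tryAccept(1));  // false: late request from the old connection
    }
}
```

Because only the epoch is checked, the session ID itself can stay the same across the disconnect.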

> 
> 2. It would be useful to add some metrics for monitoring the usage of the
> session cache. For example, it would be useful to know how many slots are
> being used (or unused) and the eviction rate (to see if there is any churn).
> 
> 3. The partition shuffling algorithm in the KIP is a bit different from the
> current approach. In the current approach, the shuffling is more
> deterministic in that after sending data for a partition, the leader won't
> send new data to that partition until every other partition is given a
> chance.

Right.  The previous approach was maintaining a linked list and reordering it 
as needed; the proposed approach is rotating the start position by 1 on each 
incremental fetch.

I think the new approach is adequate for ensuring fairness.  The scenarios 
where rotation is needed are rare.  We also should generally expect those 
scenarios to feature frequent fetches.

We could probably do even better from a fairness point of view if we start 
fetching where we left off on the last incremental fetch...
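As a sketch of the rotation described above (illustrative code, not the broker's implementation): advancing the start position by one on each incremental fetch means every partition periodically comes first in the serving order, so a partition capped by the response size limit cannot be starved indefinitely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of rotating the partition serving order for fairness. */
class RotationSketch {
    /** Return the partitions in serving order, starting at {@code start}. */
    static <T> List<T> servingOrder(List<T> partitions, int start) {
        List<T> out = new ArrayList<>();
        int n = partitions.size();
        for (int i = 0; i < n; i++) {
            out.add(partitions.get((start + i) % n));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> parts = Arrays.asList("p0", "p1", "p2");
        // Each incremental fetch advances the start position by 1.
        System.out.println(servingOrder(parts, 0));  // [p0, p1, p2]
        System.out.println(servingOrder(parts, 1));  // [p1, p2, p0]
        System.out.println(servingOrder(parts, 2));  // [p2, p0, p1]
    }
}
```

Picking up where the previous response stopped, as suggested above, would just mean setting `start` from the last partition served instead of incrementing it by one.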

> 
> 4. The process of marking a partition as dirty requires updating every fetch
> session that contains the partition. This may add some overhead. An
> alternative approach is to check the difference between the cached fetch
> offset and the HW (or LEO) when serving the fetch request.

That's a good point.  The caching approach avoids needing to update every fetch 
session when one of those numbers changes.  I think an even more important 
advantage is that it's simpler to implement -- we don't have to worry about 
forgetting to update a fetch session when one of those numbers changes.  The 
disadvantage is some extra memory consumption per partition per fetch session.

I think the advantage, especially in terms of simplicity, might outweigh the 
memory concern.  My initial implementation uses the caching approach.  I will 
update the KIP once I have this working.
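To make the trade-off concrete, here is a rough sketch (field names are assumptions, not the broker's code) of the alternative from point 4: instead of pushing a dirty flag into every session, the leader decides at fetch time whether a partition has anything new by comparing the session's cached fetch offset and high watermark against the log's current state.

```java
/**
 * Sketch of the "check at fetch time" alternative for dirty detection.
 * Field names are illustrative, not the broker's actual code.
 */
class CachedPartitionSketch {
    final long fetchOffset;  // offset the session last fetched from
    final long cachedHw;     // high watermark the session last saw

    CachedPartitionSketch(long fetchOffset, long cachedHw) {
        this.fetchOffset = fetchOffset;
        this.cachedHw = cachedHw;
    }

    /**
     * The partition must appear in the incremental response if data has been
     * appended past the cached fetch offset, or the high watermark has moved.
     */
    boolean mustInclude(long logEndOffset, long currentHw) {
        return logEndOffset > fetchOffset || currentHw != cachedHw;
    }

    public static void main(String[] args) {
        CachedPartitionSketch p = new CachedPartitionSketch(100L, 90L);
        System.out.println(p.mustInclude(100L, 90L));  // false: nothing changed
        System.out.println(p.mustInclude(150L, 90L));  // true: new data appended
        System.out.println(p.mustInclude(100L, 95L));  // true: HW advanced
    }
}
```

The memory cost mentioned above is the two cached longs per partition per session.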

> 
> 5. The cache size is based on the number of slots. However, different slots
> could have different partitions. I am wondering if you have considered
> measuring the cache size by partitions.

Currently, the cache implements thrash protection by "protecting" new entries 
for a short amount of time (a few minutes).  While an entry is protected, it 
can't be evicted.  This is to avoid the situation where you have N+1 equally 
sized fetchers and N slots, and so every fetch triggers an eviction.

Unfortunately, this "protection" mechanism doesn't play well with the idea of 
partition-based caching.  It's easy to imagine a scenario where the cache is 
completely monopolized by small fetchers, because there are never enough 
partitions free to allow a bigger fetcher to enter the cache.

Another possible issue is that if the cache were based on the number of 
partitions, it would be harder to size it so that all the followers would 
always fit.  With a slot-based system, that's easy to do, just by setting slots >= cluster size.

I think these are solvable issues, but it might be best to start with something 
simple.  I think it should be possible to add more sophisticated caching 
strategies later pretty transparently, since the broker is in full control of 
what is cached.
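The protection mechanism described above might be sketched like this (the time window, names, and eviction policy are illustrative assumptions; the priority-based part of the actual proposal is omitted): an entry can only be evicted once it has been resident longer than the protection window, so N+1 equally sized fetchers contending for N slots do not evict each other on every fetch.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Sketch of a slot-limited session cache with new-entry protection. */
class SessionCacheSketch {
    static final long PROTECTION_MS = 2 * 60 * 1000L;  // illustrative window

    private final int slots;
    private final Map<Integer, Long> createdMs = new HashMap<>();  // sessionId -> creation time

    SessionCacheSketch(int slots) {
        this.slots = slots;
    }

    /** Try to cache a session, evicting an unprotected entry if the cache is full. */
    boolean tryAdd(int sessionId, long nowMs) {
        if (createdMs.size() < slots) {
            createdMs.put(sessionId, nowMs);  // free slot: no eviction needed
            return true;
        }
        Iterator<Map.Entry<Integer, Long>> it = createdMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Long> e = it.next();
            if (nowMs - e.getValue() >= PROTECTION_MS) {
                it.remove();  // resident long enough to be evictable
                createdMs.put(sessionId, nowMs);
                return true;
            }
        }
        return false;  // every resident entry is still protected
    }

    public static void main(String[] args) {
        SessionCacheSketch cache = new SessionCacheSketch(1);
        System.out.println(cache.tryAdd(1, 0L));             // true: free slot
        System.out.println(cache.tryAdd(2, 1000L));          // false: entry still protected
        System.out.println(cache.tryAdd(2, PROTECTION_MS));  // true: protection expired
    }
}
```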

regards,
Colin

> 
> Thanks,
> 
> Jun
> 
> 
> On Tue, Jan 2, 2018 at 6:34 PM, Colin McCabe  wrote:
> 
> > On Tue, Jan 2, 2018, at 04:46, Becket Qin wrote:
> > > Hi Colin,
> > >
> > > Good point about KIP-226. Maybe a separate broker epoch is needed
> > although
> > > it is a little awkward to let the consumer set this. So was there a
> > > solution to the frequent pause and resume scenario? Did I miss something?
> > >
> > > Thanks,
> > > Jiangjie (Becket) Qin
> >
> > Hi Becket,
> >
> > Allowing sessions to be re-initialized (as the current KIP does) makes
> > frequent pauses and resumes less painful, because the memory associated
> > with the old session can be reclaimed.

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-04 Thread Jun Rao
Hi, Colin,

A few more comments on the KIP.

1. About fetch sequence number. My understanding is that the sequence
number is really used to provide better fencing for requests coming from
different TCP sessions for a given client. Suppose a consumer issues a
fetch request and doesn't get a response in time. The consumer will then
retry the request on a new TCP session. In this case, it's important that
the fetch session cached on the leader not be overridden by the request
from the old TCP session after the request from the new TCP session has
been processed. To support this, one way is to have an epoch field and have
the consumer bump up the epoch on every new TCP session. The leader will
reject requests with an old epoch. The value of a sequence number within
the same TCP session is less clear to me. In any case, it would be
useful to document the client behavior on request timeout.

2. It would be useful to add some metrics for monitoring the usage of the
session cache. For example, it would be useful to know how many slots are
being used (or unused) and the eviction rate (to see if there is any churn).

3. The partition shuffling algorithm in the KIP is a bit different from the
current approach. In the current approach, the shuffling is more
deterministic in that after sending data for a partition, the leader won't
send new data to that partition until every other partition is given a
chance.

4. The process of marking a partition as dirty requires updating every fetch
session that contains the partition. This may add some overhead. An
alternative approach is to check the difference between the cached fetch
offset and the HW (or LEO) when serving the fetch request.

5. The cache size is based on the number of slots. However, different slots
could have different partitions. I am wondering if you have considered
measuring the cache size by partitions.
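A rough sketch of what the metrics in point 2 might look like (names and shapes are assumptions, not Kafka's actual metric names): a gauge for slots in use, and a counter from which a monitoring system can derive the eviction rate.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of session-cache metrics: slots in use and total evictions. */
class SessionCacheMetricsSketch {
    private final AtomicLong slotsUsed = new AtomicLong();
    private final AtomicLong evictions = new AtomicLong();

    void onSessionCreated() {
        slotsUsed.incrementAndGet();
    }

    void onSessionEvicted() {
        slotsUsed.decrementAndGet();
        evictions.incrementAndGet();  // a monitoring system derives a rate from this
    }

    long slotsUsed() {
        return slotsUsed.get();
    }

    long totalEvictions() {
        return evictions.get();
    }

    public static void main(String[] args) {
        SessionCacheMetricsSketch m = new SessionCacheMetricsSketch();
        m.onSessionCreated();
        m.onSessionCreated();
        m.onSessionEvicted();
        System.out.println(m.slotsUsed());       // 1
        System.out.println(m.totalEvictions());  // 1
    }
}
```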

Thanks,

Jun


On Tue, Jan 2, 2018 at 6:34 PM, Colin McCabe  wrote:

> On Tue, Jan 2, 2018, at 04:46, Becket Qin wrote:
> > Hi Colin,
> >
> > Good point about KIP-226. Maybe a separate broker epoch is needed
> although
> > it is a little awkward to let the consumer set this. So was there a
> > solution to the frequent pause and resume scenario? Did I miss something?
> >
> > Thanks,
> > Jiangjie (Becket) Qin
>
> Hi Becket,
>
> Allowing sessions to be re-initialized (as the current KIP does) makes
> frequent pauses and resumes less painful, because the memory associated
> with the old session can be reclaimed.  The only cost is sending a full
> fetch request once when the pause or resume is activated.
>
> There are other follow-on optimizations that we might want to do later,
> like allowing new partitions to be added to existing fetch sessions without
> a re-initialization, that could make this even more efficient.  But that's
> not in the current KIP, in order to avoid expanding the scope too much.
>
> best,
> Colin
>
> >
> > On Wed, Dec 27, 2017 at 1:40 PM, Colin McCabe 
> wrote:
> >
> > > On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> > > > Hi Colin,
> > > >
> > > > Thanks for the explanation. I want to clarify a bit more on my
> thoughts.
> > > >
> > > > I am fine with having a separate discussion as long as the follow-up
> > > > discussion will be incremental on top of this KIP instead of
> override the
> > > > protocol in this KIP.
> > >
> > > Hi Becket,
> > >
> > > Thanks for the clarification.  I do think that the changes we've been
> > > discussing would be incremental rather than completely replacing what
> we've
> > > talked about here.  See my responses inline below.
> > >
> > > >
> > > > I completely agree this KIP is useful by itself. That being said, we
> want
> > > > to avoid falling into a "local optimal" solution by just saying
> because
> > > it
> > > > solves the problem in this scope. I think we should also think if the
> > > > solution aligns with a "global optimal" (systematic optimal)
> solution as
> > > > well. That is why I brought up other considerations. If they turned
> out
> > > to
> > > > be orthogonal and should be addressed separately, that's good. But at
> > > least
> > > > it is worth thinking about the potential connections between those
> > > things.
> > > >
> > > > One example of such related consideration is the following two
> seemingly
> > > > unrelated things:
> > > >
> > > > 1. I might have missed the discussion, but it seems the concern of
> the
> > > > clients doing frequent pause and resume is still not addressed. Since
> > > this
> > > > is a pretty common use case for applications that want to have flow
> > > > control, or have prioritized consumption, or get consumption
> fairness, we
> > > > probably want to see how to handle this case. One of the solutions
> > > > might be
> > > > a long-lived session id spanning the clients' life time.
> > > >
> > > > 2. KAFKA-6029. The key problem is that the leader wants to know if a
> > > fetch
> > > > request is from a shutting down broker or from a restarted broker.

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-03 Thread Becket Qin
Hi Colin,

I see. I was under the impression that the sessions were immutable, but they
are actually not. In that case I don't have further concerns about the KIP. We
can do the future optimizations incrementally.

One minor thing: the KIP still says "epoch" instead of "sequence number" in
some places. We may want to replace those to avoid confusion.

Thanks,

Jiangjie (Becket) Qin

On Wed, Jan 3, 2018 at 9:37 AM, Colin McCabe  wrote:

> On Tue, Jan 2, 2018, at 23:49, Becket Qin wrote:
> > Thanks for the reply, Colin.
> >
> > My concern for the reinitialization is potential churn rather than
> > efficiency. The current KIP proposal uses the time and priority based
> > protection to avoid thrashing, but it is not clear to me if that is
> > sufficient. For example, consider topic creation/deletion. In those
> cases,
> > a lot of the replica fetchers will potentially need to re-establish their
> > sessions, and many client sessions may get evicted and thus need to
> > re-establish sessions as well. This would involve two round trips (due to
> > InvalidFetchSessionException), a potential metadata refresh, and backoff.
>
> Hi Becket,
>
> When a fetcher is adding or removing a partition, it can re-use its
> existing fetch session.  There is no cache churn, and nobody gets evicted,
> in this case.  The fetcher just has to send a full fetch request to
> establish what it wants the new partition set to be.
>
> best,
> Colin
>
> >
> > Admittedly it is probably not going to be worse than what we have now,
> but
> > such uncertain impact still worries me. Are we going to have the follow
> up
> > optimization discussion before the implementation of this KIP or are we
> > going to do it after? In the past we used to have separate KIPs for a
> > complicated feature but implement them together. Perhaps we can do the
> same
> > here if you want to limit the scope of this KIP.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> > On Tue, Jan 2, 2018 at 6:34 PM, Colin McCabe  wrote:
> >
> > > On Tue, Jan 2, 2018, at 04:46, Becket Qin wrote:
> > > > Hi Colin,
> > > >
> > > > Good point about KIP-226. Maybe a separate broker epoch is needed
> > > although
> > > > it is a little awkward to let the consumer set this. So was there a
> > > > solution to the frequent pause and resume scenario? Did I miss
> something?
> > > >
> > > > Thanks,
> > > > Jiangjie (Becket) Qin
> > >
> > > Hi Becket,
> > >
> > > Allowing sessions to be re-initialized (as the current KIP does) makes
> > > frequent pauses and resumes less painful, because the memory associated
> > > with the old session can be reclaimed.  The only cost is sending a full
> > > fetch request once when the pause or resume is activated.
> > >
> > > There are other follow-on optimizations that we might want to do later,
> > > like allowing new partitions to be added to existing fetch sessions
> without
> > > a re-initialization, that could make this even more efficient.  But
> that's
> > > not in the current KIP, in order to avoid expanding the scope too much.
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > > On Wed, Dec 27, 2017 at 1:40 PM, Colin McCabe 
> > > wrote:
> > > >
> > > > > On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> > > > > > Hi Colin,
> > > > > >
> > > > > > Thanks for the explanation. I want to clarify a bit more on my
> > > thoughts.
> > > > > >
> > > > > > I am fine with having a separate discussion as long as the
> follow-up
> > > > > > discussion will be incremental on top of this KIP instead of
> > > override the
> > > > > > protocol in this KIP.
> > > > >
> > > > > Hi Becket,
> > > > >
> > > > > Thanks for the clarification.  I do think that the changes we've
> been
> > > > > discussing would be incremental rather than completely replacing
> what
> > > we've
> > > > > talked about here.  See my responses inline below.
> > > > >
> > > > > >
> > > > > > I completely agree this KIP is useful by itself. That being
> said, we
> > > want
> > > > > > to avoid falling into a "local optimal" solution by just saying
> > > because
> > > > > it
> > > > > > solves the problem in this scope. I think we should also think
> if the
> > > > > > solution aligns with a "global optimal" (systematic optimal)
> > > solution as
> > > > > > well. That is why I brought up other considerations. If they
> turned
> > > out
> > > > > to
> > > > > > be orthogonal and should be addressed separately, that's good.
> But at
> > > > > least
> > > > > > it is worth thinking about the potential connections between
> those
> > > > > things.
> > > > > >
> > > > > > One example of such related consideration is the following two
> > > seemingly
> > > > > > unrelated things:
> > > > > >
> > > > > > 1. I might have missed the discussion, but it seems the concern
> of
> > > the
> > > > > > clients doing frequent pause and resume is still not addressed.
> Since
> > > > > this
> > > > > > is a pretty common use case for applications that want to have flow
> > > > > > control, or have prioritized consumption, or get consumption fairness,
> > > > > > we probably want to see how to handle this case.

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-03 Thread Colin McCabe
On Tue, Jan 2, 2018, at 23:49, Becket Qin wrote:
> Thanks for the reply, Colin.
> 
> My concern for the reinitialization is potential churn rather than
> efficiency. The current KIP proposal uses the time and priority based
> protection to avoid thrashing, but it is not clear to me if that is
> sufficient. For example, consider topic creation/deletion. In those cases,
> a lot of the replica fetchers will potentially need to re-establish their
> sessions, and many client sessions may get evicted and thus need to
> re-establish sessions as well. This would involve two round trips (due to
> InvalidFetchSessionException), a potential metadata refresh, and backoff.

Hi Becket,

When a fetcher is adding or removing a partition, it can re-use its existing 
fetch session.  There is no cache churn, and nobody gets evicted, in this case. 
 The fetcher just has to send a full fetch request to establish what it wants 
the new partition set to be.

best,
Colin

> 
> Admittedly it is probably not going to be worse than what we have now, but
> such uncertain impact still worries me. Are we going to have the follow up
> optimization discussion before the implementation of this KIP or are we
> going to do it after? In the past we used to have separate KIPs for a
> complicated feature but implement them together. Perhaps we can do the same
> here if you want to limit the scope of this KIP.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Tue, Jan 2, 2018 at 6:34 PM, Colin McCabe  wrote:
> 
> > On Tue, Jan 2, 2018, at 04:46, Becket Qin wrote:
> > > Hi Colin,
> > >
> > > Good point about KIP-226. Maybe a separate broker epoch is needed
> > although
> > > it is a little awkward to let the consumer set this. So was there a
> > > solution to the frequent pause and resume scenario? Did I miss something?
> > >
> > > Thanks,
> > > Jiangjie (Becket) Qin
> >
> > Hi Becket,
> >
> > Allowing sessions to be re-initialized (as the current KIP does) makes
> > frequent pauses and resumes less painful, because the memory associated
> > with the old session can be reclaimed.  The only cost is sending a full
> > fetch request once when the pause or resume is activated.
> >
> > There are other follow-on optimizations that we might want to do later,
> > like allowing new partitions to be added to existing fetch sessions without
> > a re-initialization, that could make this even more efficient.  But that's
> > not in the current KIP, in order to avoid expanding the scope too much.
> >
> > best,
> > Colin
> >
> > >
> > > On Wed, Dec 27, 2017 at 1:40 PM, Colin McCabe 
> > wrote:
> > >
> > > > On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> > > > > Hi Colin,
> > > > >
> > > > > Thanks for the explanation. I want to clarify a bit more on my
> > thoughts.
> > > > >
> > > > > I am fine with having a separate discussion as long as the follow-up
> > > > > discussion will be incremental on top of this KIP instead of
> > override the
> > > > > protocol in this KIP.
> > > >
> > > > Hi Becket,
> > > >
> > > > Thanks for the clarification.  I do think that the changes we've been
> > > > discussing would be incremental rather than completely replacing what
> > we've
> > > > talked about here.  See my responses inline below.
> > > >
> > > > >
> > > > > I completely agree this KIP is useful by itself. That being said, we
> > want
> > > > > to avoid falling into a "local optimal" solution by just saying
> > because
> > > > it
> > > > > solves the problem in this scope. I think we should also think if the
> > > > > solution aligns with a "global optimal" (systematic optimal)
> > solution as
> > > > > well. That is why I brought up other considerations. If they turned
> > out
> > > > to
> > > > > be orthogonal and should be addressed separately, that's good. But at
> > > > least
> > > > > it is worth thinking about the potential connections between those
> > > > things.
> > > > >
> > > > > One example of such related consideration is the following two
> > seemingly
> > > > > unrelated things:
> > > > >
> > > > > 1. I might have missed the discussion, but it seems the concern of
> > the
> > > > > clients doing frequent pause and resume is still not addressed. Since
> > > > this
> > > > > is a pretty common use case for applications that want to have flow
> > > > > control, or have prioritized consumption, or get consumption
> > fairness, we
> > > > > probably want to see how to handle this case. One of the solutions
> > > > > might be
> > > > > a long-lived session id spanning the clients' life time.
> > > > >
> > > > > 2. KAFKA-6029. The key problem is that the leader wants to know if a
> > > > fetch
> > > > > request is from a shutting down broker or from a restarted broker.
> > > > >
> > > > > The connection between those two issues is that both of them could be
> > > > > addressed by having a life-long session id for each client (or
> > fetcher,
> > > > to
> > > > > be more accurate). This may indicate that having a life-long session
> > > > > id might be a "global optimal" solution so it should be considered in
> > > > > this KIP.

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-02 Thread Becket Qin
Thanks for the reply, Colin.

My concern for the reinitialization is potential churn rather than
efficiency. The current KIP proposal uses the time and priority based
protection to avoid thrashing, but it is not clear to me if that is
sufficient. For example, consider topic creation/deletion. In those cases,
a lot of the replica fetchers will potentially need to re-establish their
sessions, and many client sessions may get evicted and thus need to
re-establish sessions as well. This would involve two round trips (due to
InvalidFetchSessionException), a potential metadata refresh, and backoff.

Admittedly it is probably not going to be worse than what we have now, but
such uncertain impact still worries me. Are we going to have the follow-up
optimization discussion before the implementation of this KIP or are we
going to do it after? In the past we used to have separate KIPs for a
complicated feature but implement them together. Perhaps we can do the same
here if you want to limit the scope of this KIP.

Thanks,

Jiangjie (Becket) Qin


On Tue, Jan 2, 2018 at 6:34 PM, Colin McCabe  wrote:

> On Tue, Jan 2, 2018, at 04:46, Becket Qin wrote:
> > Hi Colin,
> >
> > Good point about KIP-226. Maybe a separate broker epoch is needed
> although
> > it is a little awkward to let the consumer set this. So was there a
> > solution to the frequent pause and resume scenario? Did I miss something?
> >
> > Thanks,
> > Jiangjie (Becket) Qin
>
> Hi Becket,
>
> Allowing sessions to be re-initialized (as the current KIP does) makes
> frequent pauses and resumes less painful, because the memory associated
> with the old session can be reclaimed.  The only cost is sending a full
> fetch request once when the pause or resume is activated.
>
> There are other follow-on optimizations that we might want to do later,
> like allowing new partitions to be added to existing fetch sessions without
> a re-initialization, that could make this even more efficient.  But that's
> not in the current KIP, in order to avoid expanding the scope too much.
>
> best,
> Colin
>
> >
> > On Wed, Dec 27, 2017 at 1:40 PM, Colin McCabe 
> wrote:
> >
> > > On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> > > > Hi Colin,
> > > >
> > > > Thanks for the explanation. I want to clarify a bit more on my
> thoughts.
> > > >
> > > > I am fine with having a separate discussion as long as the follow-up
> > > > discussion will be incremental on top of this KIP instead of
> override the
> > > > protocol in this KIP.
> > >
> > > Hi Becket,
> > >
> > > Thanks for the clarification.  I do think that the changes we've been
> > > discussing would be incremental rather than completely replacing what
> we've
> > > talked about here.  See my responses inline below.
> > >
> > > >
> > > > I completely agree this KIP is useful by itself. That being said, we
> want
> > > > to avoid falling into a "local optimal" solution by just saying
> because
> > > it
> > > > solves the problem in this scope. I think we should also think if the
> > > > solution aligns with a "global optimal" (systematic optimal)
> solution as
> > > > well. That is why I brought up other considerations. If they turned
> out
> > > to
> > > > be orthogonal and should be addressed separately, that's good. But at
> > > least
> > > > it is worth thinking about the potential connections between those
> > > things.
> > > >
> > > > One example of such related consideration is the following two
> seemingly
> > > > unrelated things:
> > > >
> > > > 1. I might have missed the discussion, but it seems the concern of
> the
> > > > clients doing frequent pause and resume is still not addressed. Since
> > > this
> > > > is a pretty common use case for applications that want to have flow
> > > > control, or have prioritized consumption, or get consumption
> fairness, we
> > > > probably want to see how to handle this case. One of the solutions
> > > > might be
> > > > a long-lived session id spanning the clients' life time.
> > > >
> > > > 2. KAFKA-6029. The key problem is that the leader wants to know if a
> > > fetch
> > > > request is from a shutting down broker or from a restarted broker.
> > > >
> > > > The connection between those two issues is that both of them could be
> > > > addressed by having a life-long session id for each client (or
> fetcher,
> > > to
> > > > be more accurate). This may indicate that having a life long session
> id
> > > > might be a "global optimal" solution so it should be considered in
> this
> > > > KIP. Otherwise, a follow up KIP discussion for KAFKA-6029 may either
> > > > introduce a broker epoch unnecessarily (which will not be used by the
> > > > consumers at all) or override what we do in this KIP.
> > >
> > > Remember that a given follower will have more than one fetch session
> ID.
> > > Each fetcher thread will have its own session ID.  And we will
> eventually
> > > be able to dynamically add or remove fetcher threads using KIP-226.
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-02 Thread Colin McCabe
On Tue, Jan 2, 2018, at 04:46, Becket Qin wrote:
> Hi Colin,
> 
> Good point about KIP-226. Maybe a separate broker epoch is needed although
> it is a little awkward to let the consumer set this. So was there a
> solution to the frequent pause and resume scenario? Did I miss something?
> 
> Thanks,
> Jiangjie (Becket) Qin

Hi Becket,

Allowing sessions to be re-initialized (as the current KIP does) makes frequent 
pauses and resumes less painful, because the memory associated with the old 
session can be reclaimed.  The only cost is sending a full fetch request once 
when the pause or resume is activated.

There are other follow-on optimizations that we might want to do later, like 
allowing new partitions to be added to existing fetch sessions without a 
re-initialization, that could make this even more efficient.  But that's not in 
the current KIP, in order to avoid expanding the scope too much.
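The client-side behavior described above might be sketched as follows (names are illustrative assumptions, not the actual consumer code): when the desired partition set changes, e.g. on a pause or resume, the next request is a full fetch that re-initializes the session; otherwise the session stays incremental.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch of when a client must fall back to a full fetch request. */
class FetchRequestTypeSketch {
    enum Type { FULL, INCREMENTAL }

    private Set<String> sessionPartitions = new HashSet<>();  // partitions the session covers

    /** Decide the next request type for the desired partition set. */
    Type next(Set<String> desired) {
        if (desired.equals(sessionPartitions)) {
            return Type.INCREMENTAL;  // unchanged set: stay incremental
        }
        sessionPartitions = new HashSet<>(desired);  // re-initialize the session
        return Type.FULL;  // one full fetch pays for the change
    }

    public static void main(String[] args) {
        FetchRequestTypeSketch client = new FetchRequestTypeSketch();
        Set<String> assigned = new HashSet<>(Arrays.asList("t-0", "t-1"));
        System.out.println(client.next(assigned));  // FULL: session established
        System.out.println(client.next(assigned));  // INCREMENTAL: same set
        Set<String> paused = new HashSet<>(Arrays.asList("t-0"));  // t-1 paused
        System.out.println(client.next(paused));    // FULL: set changed
    }
}
```

The follow-on optimization mentioned above would relax the first branch so that adding a partition keeps the session incremental.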

best,
Colin

> 
> On Wed, Dec 27, 2017 at 1:40 PM, Colin McCabe  wrote:
> 
> > On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> > > Hi Colin,
> > >
> > > Thanks for the explanation. I want to clarify a bit more on my thoughts.
> > >
> > > I am fine with having a separate discussion as long as the follow-up
> > > discussion will be incremental on top of this KIP instead of override the
> > > protocol in this KIP.
> >
> > Hi Becket,
> >
> > Thanks for the clarification.  I do think that the changes we've been
> > discussing would be incremental rather than completely replacing what we've
> > talked about here.  See my responses inline below.
> >
> > >
> > > I completely agree this KIP is useful by itself. That being said, we want
> > > to avoid falling into a "local optimal" solution by just saying because
> > it
> > > solves the problem in this scope. I think we should also think if the
> > > solution aligns with a "global optimal" (systematic optimal) solution as
> > > well. That is why I brought up other considerations. If they turned out
> > to
> > > be orthogonal and should be addressed separately, that's good. But at
> > least
> > > it is worth thinking about the potential connections between those
> > things.
> > >
> > > One example of such related consideration is the following two seemingly
> > > unrelated things:
> > >
> > > 1. I might have missed the discussion, but it seems the concern of the
> > > clients doing frequent pause and resume is still not addressed. Since
> > this
> > > is a pretty common use case for applications that want to have flow
> > > control, or have prioritized consumption, or get consumption fairness, we
> > > probably want to see how to handle this case. One of the solution might
> > be
> > > a long-lived session id spanning the clients' life time.
> > >
> > > 2. KAFKA-6029. The key problem is that the leader wants to know if a
> > fetch
> > > request is from a shutting down broker or from a restarted broker.
> > >
> > > The connection between those two issues is that both of them could be
> > > addressed by having a life-long session id for each client (or fetcher,
> > to
> > > be more accurate). This may indicate that having a life long session id
> > > might be a "global optimal" solution so it should be considered in this
> > > KIP. Otherwise, a follow up KIP discussion for KAFKA-6029 may either
> > > introduce a broker epoch unnecessarily (which will not be used by the
> > > consumers at all) or override what we do in this KIP.
> >
> > Remember that a given follower will have more than one fetch session ID.
> > Each fetcher thread will have its own session ID.  And we will eventually
> > be able to dynamically add or remove fetcher threads using KIP-226.
> > Therefore, we can't use fetch session IDs to uniquely identify a given
> > broker incarnation.  Any time we increase the number of fetcher threads, a
> > new fetch session ID will show up.
> >
> > If we want to know if a fetch request is from a shutting down broker or
> > from a restarted broker, the most straightforward and robust way would
> > probably be to add an incarnation number for each broker.  ZK can track
> > this number.  This also helps with debugging and logging (you can tell
> > "aha-- this request came from the second incarnation, not the first.")
> >
> > > BTW, to clarify, the main purpose of returning the data at the index
> > > boundary was to get the same benefit of efficient incremental fetch for
> > > both low vol and high vol partitions, which is directly related to the
> > > primary goal in this KIP. The other things (such as avoiding binary
> > search)
> > > are just potential additional gain, and they are also brought up to see
> > if
> > > that could be a "global optimal" solution.
> >
> > I still think these are separate.  The primary goal of the KIP was to make
> > fetch requests where not all partitions are returning data more efficient.
> > This isn't really related to the goal of trying to make accessing
> > historical data more efficient.  In most cases, the data 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2018-01-02 Thread Becket Qin
Hi Colin,

Good point about KIP-226. Maybe a separate broker epoch is needed although
it is a little awkward to let the consumer set this. So was there a
solution to the frequent pause and resume scenario? Did I miss something?

Thanks,

Jiangjie (Becket) Qin

On Wed, Dec 27, 2017 at 1:40 PM, Colin McCabe  wrote:

> On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> > Hi Colin,
> >
> > Thanks for the explanation. I want to clarify a bit more on my thoughts.
> >
> > I am fine with having a separate discussion as long as the follow-up
> > discussion will be incremental on top of this KIP instead of overriding the
> > protocol in this KIP.
>
> Hi Becket,
>
> Thanks for the clarification.  I do think that the changes we've been
> discussing would be incremental rather than completely replacing what we've
> talked about here.  See my responses inline below.
>
> >
> > I completely agree this KIP is useful by itself. That being said, we want
> > to avoid falling into a "local optimal" solution by just saying because
> it
> > solves the problem in this scope. I think we should also think if the
> > solution aligns with a "global optimal" (systematic optimal) solution as
> > well. That is why I brought up other considerations. If they turned out
> to
> > be orthogonal and should be addressed separately, that's good. But at
> least
> > it is worth thinking about the potential connections between those
> things.
> >
> > One example of such related consideration is the following two seemingly
> > unrelated things:
> >
> > 1. I might have missed the discussion, but it seems the concern of the
> > clients doing frequent pause and resume is still not addressed. Since
> this
> > is a pretty common use case for applications that want to have flow
> > control, or have prioritized consumption, or get consumption fairness, we
> > probably want to see how to handle this case. One of the solutions might be
> > a long-lived session id spanning the clients' life time.
> >
> > 2. KAFKA-6029. The key problem is that the leader wants to know if a
> fetch
> > request is from a shutting down broker or from a restarted broker.
> >
> > The connection between those two issues is that both of them could be
> > addressed by having a life-long session id for each client (or fetcher,
> to
> > be more accurate). This may indicate that having a life long session id
> > might be a "global optimal" solution so it should be considered in this
> > KIP. Otherwise, a follow up KIP discussion for KAFKA-6029 may either
> > introduce a broker epoch unnecessarily (which will not be used by the
> > consumers at all) or override what we do in this KIP.
>
> Remember that a given follower will have more than one fetch session ID.
> Each fetcher thread will have its own session ID.  And we will eventually
> be able to dynamically add or remove fetcher threads using KIP-226.
> Therefore, we can't use fetch session IDs to uniquely identify a given
> broker incarnation.  Any time we increase the number of fetcher threads, a
> new fetch session ID will show up.
>
> If we want to know if a fetch request is from a shutting down broker or
> from a restarted broker, the most straightforward and robust way would
> probably be to add an incarnation number for each broker.  ZK can track
> this number.  This also helps with debugging and logging (you can tell
> "aha-- this request came from the second incarnation, not the first."
>
> > BTW, to clarify, the main purpose of returning the data at the index
> > boundary was to get the same benefit of efficient incremental fetch for
> > both low vol and high vol partitions, which is directly related to the
> > primary goal in this KIP. The other things (such as avoiding binary
> search)
> > are just potential additional gain, and they are also brought up to see
> if
> > that could be a "global optimal" solution.
>
> I still think these are separate.  The primary goal of the KIP was to make
> fetch requests where not all partitions are returning data more efficient.
> This isn't really related to the goal of trying to make accessing
> historical data more efficient.  In most cases, the data we're accessing is
> very recent data, and index lookups are not an issue.
>
> >
> > Some other replies below.
> > >In order for improvements to succeed, I think that it's important to
> > clearly define the scope and goals.  One good example of this was the
> > AdminClient KIP.  We deliberately avoided discussing new administrative
> > RPCs in that KIP, in order to limit the scope.  This kept the discussion
> > focused on the user interfaces and configuration, rather than on the
> > details of possible new RPCs.  Once the KIP was completed, it was easy for
> > us to add new RPCs later in separate KIPs.
> > Hmm, why is AdminClient related? All the discussion is about how to
> > make fetch more efficient, right?
> >
> > >Finally, it's not clear that the approach you are proposing is the right
> > way to go.  

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-27 Thread Colin McCabe
On Sat, Dec 23, 2017, at 09:15, Becket Qin wrote:
> Hi Colin,
> 
> Thanks for the explanation. I want to clarify a bit more on my thoughts.
> 
> I am fine with having a separate discussion as long as the follow-up
> discussion will be incremental on top of this KIP instead of overriding the
> protocol in this KIP.

Hi Becket,

Thanks for the clarification.  I do think that the changes we've been 
discussing would be incremental rather than completely replacing what we've 
talked about here.  See my responses inline below.

> 
> I completely agree this KIP is useful by itself. That being said, we want
> to avoid falling into a "local optimal" solution by just saying because it
> solves the problem in this scope. I think we should also think if the
> solution aligns with a "global optimal" (systematic optimal) solution as
> well. That is why I brought up other considerations. If they turned out to
> be orthogonal and should be addressed separately, that's good. But at least
> it is worth thinking about the potential connections between those things.
> 
> One example of such related consideration is the following two seemingly
> unrelated things:
> 
> 1. I might have missed the discussion, but it seems the concern of the
> clients doing frequent pause and resume is still not addressed. Since this
> is a pretty common use case for applications that want to have flow
> control, or have prioritized consumption, or get consumption fairness, we
> probably want to see how to handle this case. One of the solutions might be
> a long-lived session id spanning the clients' life time.
> 
> 2. KAFKA-6029. The key problem is that the leader wants to know if a fetch
> request is from a shutting down broker or from a restarted broker.
> 
> The connection between those two issues is that both of them could be
> addressed by having a life-long session id for each client (or fetcher, to
> be more accurate). This may indicate that having a life long session id
> might be a "global optimal" solution so it should be considered in this
> KIP. Otherwise, a follow up KIP discussion for KAFKA-6029 may either
> introduce a broker epoch unnecessarily (which will not be used by the
> consumers at all) or override what we do in this KIP.

Remember that a given follower will have more than one fetch session ID.  Each 
fetcher thread will have its own session ID.  And we will eventually be able to 
dynamically add or remove fetcher threads using KIP-226.  Therefore, we can't 
use fetch session IDs to uniquely identify a given broker incarnation.  Any 
time we increase the number of fetcher threads, a new fetch session ID will 
show up.

If we want to know if a fetch request is from a shutting down broker or from a 
restarted broker, the most straightforward and robust way would probably be to 
add an incarnation number for each broker.  ZK can track this number.  This 
also helps with debugging and logging (you can tell "aha -- this request came 
from the second incarnation, not the first.")
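The incarnation-number idea can be sketched with a toy model (all names here are hypothetical illustration, not Kafka code): ZooKeeper tracks a per-broker counter that is bumped on every restart, and the leader fences fetch requests that carry a stale value.

```python
# Hypothetical sketch of per-broker incarnation fencing. The registry stands
# in for the ZK-backed counter; the real mechanism would live in the broker.

class IncarnationRegistry:
    """Stands in for a ZK-tracked per-broker incarnation counter."""

    def __init__(self):
        self._latest = {}  # broker_id -> latest incarnation number

    def register_startup(self, broker_id):
        """Called when a broker (re)starts; returns its new incarnation."""
        self._latest[broker_id] = self._latest.get(broker_id, 0) + 1
        return self._latest[broker_id]

    def is_current(self, broker_id, incarnation):
        """Leader-side fencing check for an incoming fetch request."""
        return self._latest.get(broker_id) == incarnation


registry = IncarnationRegistry()
first = registry.register_startup(broker_id=1)   # first incarnation
second = registry.register_startup(broker_id=1)  # broker restarted

# A fetch tagged with the old incarnation is from the shut-down broker.
stale = not registry.is_current(1, first)
```

This also gives the logging benefit mentioned above: the incarnation number in a request immediately distinguishes the restarted broker from its previous life.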

> BTW, to clarify, the main purpose of returning the data at the index
> boundary was to get the same benefit of efficient incremental fetch for
> both low vol and high vol partitions, which is directly related to the
> primary goal in this KIP. The other things (such as avoiding binary search)
> are just potential additional gain, and they are also brought up to see if
> that could be a "global optimal" solution.

I still think these are separate.  The primary goal of the KIP was to make 
fetch requests where not all partitions are returning data more efficient.  
This isn't really related to the goal of trying to make accessing historical 
data more efficient.  In most cases, the data we're accessing is very recent 
data, and index lookups are not an issue.

> 
> Some other replies below.
> >In order for improvements to succeed, I think that it's important to
> clearly define the scope and goals.  One good example of this was the
> AdminClient KIP.  We deliberately avoided discussing new administrative
> RPCs in that KIP, in order to limit the scope.  This kept the discussion
> focused on the user interfaces and configuration, rather than on the
> details of possible new RPCs.  Once the KIP was completed, it was easy for
> us to add new RPCs later in separate KIPs.
> Hmm, why is AdminClient related? All the discussion is about how to
> make fetch more efficient, right?
> 
> >Finally, it's not clear that the approach you are proposing is the right
> way to go.  I think we would need to have a lot more discussion about it.
> One very big disadvantage is that it couples what we send back on the wire
> tightly to what is on the disk.  It's not clear that we want to do that.
> What if we want to change how things are stored in the future?  How does
> this work with clients' own concept of fetch sizes?  And so on, and so
> on.  This needs its own discussion thread.
> That might be true. However, the index file by definition is for the files
> stored 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-23 Thread Becket Qin
Hi Colin,

Thanks for the explanation. I want to clarify a bit more on my thoughts.

I am fine with having a separate discussion as long as the follow-up
discussion will be incremental on top of this KIP instead of overriding the
protocol in this KIP.

I completely agree this KIP is useful by itself. That being said, we want
to avoid falling into a "local optimal" solution by just saying because it
solves the problem in this scope. I think we should also think if the
solution aligns with a "global optimal" (systematic optimal) solution as
well. That is why I brought up other considerations. If they turned out to
be orthogonal and should be addressed separately, that's good. But at least
it is worth thinking about the potential connections between those things.

One example of such related consideration is the following two seemingly
unrelated things:

1. I might have missed the discussion, but it seems the concern of the
clients doing frequent pause and resume is still not addressed. Since this
is a pretty common use case for applications that want to have flow
control, or have prioritized consumption, or get consumption fairness, we
probably want to see how to handle this case. One of the solutions might be
a long-lived session id spanning the clients' life time.

2. KAFKA-6029. The key problem is that the leader wants to know if a fetch
request is from a shutting down broker or from a restarted broker.

The connection between those two issues is that both of them could be
addressed by having a life-long session id for each client (or fetcher, to
be more accurate). This may indicate that having a life long session id
might be a "global optimal" solution so it should be considered in this
KIP. Otherwise, a follow up KIP discussion for KAFKA-6029 may either
introduce a broker epoch unnecessarily (which will not be used by the
consumers at all) or override what we do in this KIP.
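Point 1 above can be illustrated with a toy Python model (not Kafka code). It assumes, purely for illustration, that any change to the fetched partition set forces the incremental session to be re-established with a full request; under that assumption, a consumer that frequently pauses and resumes partitions keeps degrading back to full fetches.

```python
# Toy model of the pause/resume concern: every change to the partition set
# invalidates the (assumed) incremental session, forcing a full fetch.

class TickingFetcher:
    def __init__(self, partitions):
        self.partitions = set(partitions)   # what the client wants to fetch
        self.session_partitions = None      # server-side view; None = no session
        self.full_fetches = 0
        self.incremental_fetches = 0

    def fetch(self):
        if self.session_partitions != self.partitions:
            # Partition set changed: re-establish session with a full request.
            self.session_partitions = set(self.partitions)
            self.full_fetches += 1
        else:
            self.incremental_fetches += 1

fetcher = TickingFetcher({"t-0", "t-1"})
for i in range(6):
    fetcher.fetch()
    if i % 2 == 0:
        # Simulate pausing/resuming partition t-1 between polls.
        fetcher.partitions.symmetric_difference_update({"t-1"})

# Every pause or resume forced a full fetch on the next poll:
# 4 full fetches versus only 2 incremental ones.
```

A long-lived session id, as suggested above, would let the session survive such membership changes instead of starting over.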

BTW, to clarify, the main purpose of returning the data at the index
boundary was to get the same benefit of efficient incremental fetch for
both low vol and high vol partitions, which is directly related to the
primary goal in this KIP. The other things (such as avoiding binary search)
are just potential additional gain, and they are also brought up to see if
that could be a "global optimal" solution.

Some other replies below.
>In order for improvements to succeed, I think that it's important to
clearly define the scope and goals.  One good example of this was the
AdminClient KIP.  We deliberately avoided discussing new administrative
RPCs in that KIP, in order to limit the scope.  This kept the discussion
focused on the user interfaces and configuration, rather than on the
details of possible new RPCs.  Once the KIP was completed, it was easy for
us to add new RPCs later in separate KIPs.
Hmm, why is AdminClient related? All the discussion is about how to
make fetch more efficient, right?

>Finally, it's not clear that the approach you are proposing is the right
way to go.  I think we would need to have a lot more discussion about it.
One very big disadvantage is that it couples what we send back on the wire
tightly to what is on the disk.  It's not clear that we want to do that.
What if we want to change how things are stored in the future?  How does
this work with clients' own concept of fetch sizes?  And so on, and so
on.  This needs its own discussion thread.
That might be true. However, the index file by definition is for the files
stored on the disk. So if we decide to change the storage layer to
something else, it seems natural to use some other suitable ways to get the
offsets efficiently.

>There are a lot of simpler solutions that might work as well or better.
For example, each partition could keep an in-memory LRU cache of the most
recently used offset to file position mappings.  Or we could have a thread
periodically touch the latest page or two of memory in the index file for
each partition, to make sure that it didn't fall out of the cache.  In some
offline discussions, some of these approaches have looked quite
promising.  I've even seen some good performance numbers for prototypes.
In any case, it's a separate problem which needs its own KIP, I think.
Those are indeed separate discussions. I did not intend to discuss them
in this KIP. Sorry about the confusion.

Thanks and Merry Christmas,

Jiangjie (Becket) Qin


On Sat, Dec 23, 2017 at 1:16 AM, Colin McCabe  wrote:

> On Fri, Dec 22, 2017, at 14:31, Becket Qin wrote:
> > >>
> > >> The point I want to make is that avoiding doing binary search on index
> > >> file and avoiding reading the log segments during fetch has some additional
> > >> benefits. So if the solution works for the current KIP, it might be a
> > >> better choice.
> >
> > >Let's discuss this in a follow-on KIP.
> >
> > If the discussion will potentially change the protocol in the current
> > proposal, would it be better to discuss it now instead of in a follow-up
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-23 Thread Colin McCabe
On Fri, Dec 22, 2017, at 14:31, Becket Qin wrote:
> >>
> >> The point I want to make is that avoiding doing binary search on index
> >> file and avoiding reading the log segments during fetch has some additional
> >> benefits. So if the solution works for the current KIP, it might be a
> >> better choice.
> 
> >Let's discuss this in a follow-on KIP.
> 
> If the discussion will potentially change the protocol in the current
> proposal, would it be better to discuss it now instead of in a follow-up
> KIP, so we don't have some protocol that immediately requires a change?

Hi Becket,

I think that the problem that you are discussing is different than the problem 
this KIP is designed to address.  This KIP is targeted at eliminating the 
wastefulness of re-transmitting information about partitions that haven't 
changed in every FetchRequest and FetchResponse.  The problem you are 
discussing is dealing with situations where the index file or the data file is 
not in the page cache, and therefore we take a page fault when doing an index 
lookup.
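The first goal -- not re-transmitting state for unchanged partitions -- can be sketched as follows. This is a deliberate simplification, not the KIP's actual wire protocol: the broker caches per-session, per-partition state and includes in the response only partitions whose state changed since the last exchange.

```python
# Simplified sketch of incremental fetch responses: the broker keeps a
# per-session cache of what it last told the client about each partition,
# and re-transmits only the partitions that changed.

def incremental_response(session_cache, partition_state):
    """Return only partitions whose state differs from the session's cache."""
    out = {}
    for partition, log_end_offset in partition_state.items():
        if session_cache.get(partition) != log_end_offset:
            out[partition] = log_end_offset
            session_cache[partition] = log_end_offset  # update cached view
    return out

cache = {}
# First response is effectively "full": every partition is new to the session.
r1 = incremental_response(cache, {"t-0": 100, "t-1": 50})
# Only t-0 advanced, so only t-0 is re-transmitted; the idle t-1 costs nothing.
r2 = incremental_response(cache, {"t-0": 120, "t-1": 50})
```

With thousands of mostly-idle partitions, this is exactly the cross-AZ bandwidth saving described in the next paragraph.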

This KIP is useful and valuable on its own.  For example, if you have brokers 
in a public cloud in different availability zones, you may wish to minimize the 
network traffic between them.  Therefore, you don't want every FetchRequest 
between brokers to be a full FetchRequest.  In that case, this KIP is very 
valuable.

In order for improvements to succeed, I think that it's important to clearly 
define the scope and goals.  One good example of this was the AdminClient KIP.  
We deliberately avoided discussing new administrative RPCs in that KIP, in 
order to limit the scope.  This kept the discussion focused on the user 
interfaces and configuration, rather than on the details of possible new RPCs.  
Once the KIP was completed, it was easy for us to add new RPCs later in 
separate KIPs.

While it's clear that there is probably even more we could do to optimize fetch 
requests, making them incremental seems like a good first cut.  I deliberately 
avoided changing the replication protocol in this KIP, because I think that 
it's a big enough change as-is.  If we want to change the replication protocol 
in the future, there is nothing preventing us... and this change will be a 
useful starting point.

Finally, it's not clear that the approach you are proposing is the right way to 
go.  I think we would need to have a lot more discussion about it.  One very 
big disadvantage is that it couples what we send back on the wire tightly to 
what is on the disk.  It's not clear that we want to do that.  What if we want 
to change how things are stored in the future?  How does this work with 
clients' own concept of fetch sizes?  And so on, and so on.  This needs its own 
discussion thread.

There are a lot of simpler solutions that might work as well or better.  For 
example, each partition could keep an in-memory LRU cache of the most recently 
used offset to file position mappings.  Or we could have a thread periodically 
touch the latest page or two of memory in the index file for each partition, to 
make sure that it didn't fall out of the cache.  In some offline discussions, 
some of these approaches have looked quite promising.  I've even seen some good 
performance numbers for prototypes.  In any case, it's a separate problem which 
needs its own KIP, I think.
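The LRU idea in the previous paragraph might look roughly like this sketch (capacity, names, and numbers are made up for illustration, not taken from Kafka):

```python
# A per-partition LRU cache from fetch offset to file position, so recently
# used offsets skip the index-file binary search (and a possible page fault).

from collections import OrderedDict

class OffsetPositionCache:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self._lru = OrderedDict()  # offset -> file position, oldest first

    def get(self, offset):
        if offset not in self._lru:
            return None  # caller falls back to the index-file binary search
        self._lru.move_to_end(offset)  # mark as most recently used
        return self._lru[offset]

    def put(self, offset, position):
        self._lru[offset] = position
        self._lru.move_to_end(offset)
        if len(self._lru) > self.capacity:
            self._lru.popitem(last=False)  # evict least recently used

cache = OffsetPositionCache(capacity=2)
cache.put(100, 4096)
cache.put(200, 8192)
cache.get(100)         # touch 100, so 200 becomes the eviction candidate
cache.put(300, 12288)  # exceeds capacity: evicts offset 200
```

Because followers usually fetch at or near the log end, even a tiny cache like this would absorb most lookups.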

best,
Colin

> 
> 
> On Tue, Dec 19, 2017 at 9:26 AM, Colin McCabe  wrote:
> 
> > On Tue, Dec 19, 2017, at 02:16, Jan Filipiak wrote:
> > > Sorry for coming back at this so late.
> > >
> > >
> > >
> > > On 11.12.2017 07:12, Colin McCabe wrote:
> > > > On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote:
> > > >> On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:
> > > >>> Hi,
> > > >>>
> > > >>> sorry for the late reply, busy times :-/
> > > >>>
> > > >>> I would ask you one thing maybe. Since the timeout
> > > >>> argument seems to be settled I have no further argument
> > > >>> from your side except the "i don't want to".
> > > >>>
> > > >>> Can you see that connection.max.idle.max is the exact time
> > > >>> that expresses "We expect the client to be away for this long,
> > > >>> and come back and continue"?
> > > >> Hi Jan,
> > > >>
> > > >> Sure, connection.max.idle.max is the exact time that we want to keep
> > > >> around a TCP session.  TCP sessions are relatively cheap, so we can
> > > >> afford to keep them around for 10 minutes by default.  Incremental
> > fetch
> > > >> state is less cheap, so we want to set a shorter timeout for it.  We
> > > >> also want new TCP sessions to be able to reuse an existing incremental
> > > >> fetch session rather than creating a new one and waiting for the old
> > one
> > > >> to time out.
> > > >>
> > > >>> also clarified some stuff inline
> > > >>>
> > > >>> Best Jan
> > > >>>
> > > >>>
> > > >>>
> > > >>>
> > > >>> On 05.12.2017 23:14, Colin McCabe wrote:
> > >  On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > > Hi Colin
> > > >
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-22 Thread Becket Qin
>>
>> The point I want to make is that avoiding doing binary search on index
>> file and avoiding reading the log segments during fetch has some additional
>> benefits. So if the solution works for the current KIP, it might be a
>> better choice.

>Let's discuss this in a follow-on KIP.

If the discussion will potentially change the protocol in the current
proposal, would it be better to discuss it now instead of in a follow-up
KIP, so we don't have some protocol that immediately requires a change?


On Tue, Dec 19, 2017 at 9:26 AM, Colin McCabe  wrote:

> On Tue, Dec 19, 2017, at 02:16, Jan Filipiak wrote:
> > Sorry for coming back at this so late.
> >
> >
> >
> > On 11.12.2017 07:12, Colin McCabe wrote:
> > > On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote:
> > >> On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:
> > >>> Hi,
> > >>>
> > >>> sorry for the late reply, busy times :-/
> > >>>
> > >>> I would ask you one thing maybe. Since the timeout
> > >>> argument seems to be settled I have no further argument
> > >>> from your side except the "i don't want to".
> > >>>
> > >>> Can you see that connection.max.idle.max is the exact time
> > >>> that expresses "We expect the client to be away for this long,
> > >>> and come back and continue"?
> > >> Hi Jan,
> > >>
> > >> Sure, connection.max.idle.max is the exact time that we want to keep
> > >> around a TCP session.  TCP sessions are relatively cheap, so we can
> > >> afford to keep them around for 10 minutes by default.  Incremental
> fetch
> > >> state is less cheap, so we want to set a shorter timeout for it.  We
> > >> also want new TCP sessions to be able to reuse an existing incremental
> > >> fetch session rather than creating a new one and waiting for the old
> one
> > >> to time out.
> > >>
> > >>> also clarified some stuff inline
> > >>>
> > >>> Best Jan
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On 05.12.2017 23:14, Colin McCabe wrote:
> >  On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > Hi Colin
> > >
> > > Addressing the topic of how to manage slots from the other thread.
> > > With tcp connections all this comes for free essentially.
> >  Hi Jan,
> > 
> >  I don't think that it's accurate to say that cache management
> "comes for
> >  free" by coupling the incremental fetch session with the TCP
> session.
> >  When a new TCP session is started by a fetch request, you still
> have to
> >  decide whether to grant that request an incremental fetch session or
> >  not.  If your answer is that you always grant the request, I would
> argue
> >  that you do not have cache management.
> > >>> First I would say, the client has a big say in this. If the client
> > >>> is not going to issue incremental requests, he shouldn't ask for a cache;
> > >>> when the client asks for the cache, we still have all options to deny.
> > >> To put it simply, we have to have some cache management above and
> beyond
> > >> just giving out an incremental fetch session to anyone who has a TCP
> > >> session.  Therefore, caching does not become simpler if you couple the
> > >> fetch session to the TCP session.
> > Simply giving out a fetch session to everyone with a connection is too
> > simple, but I think it plays well into the idea of consumers choosing to
> > use the feature, therefore only enabling it where it brings maximum gains
> > (replicas, MirrorMakers).
> > >>
> >  I guess you could argue that timeouts are cache management, but I
> don't
> >  find that argument persuasive.  Anyone could just create a lot of
> TCP
> >  sessions and use a lot of resources, in that case.  So there is
> >  essentially no limit on memory use.  In any case, TCP sessions don't
> >  help us implement fetch session timeouts.
> > >>> We still have all the options denying the request to keep the state.
> > >>> What you want seems like a max connections / IP safeguard.
> > >>> I can currently take down a broker with too many connections easily.
> > >>>
> > >>>
> > > I still would argue we disable it by default and make a flag in the
> > > broker to ask the leader to maintain the cache while replicating
> and also only
> > > have it optional in consumers (default to off) so one can turn it
> on
> > > where it really hurts.  MirrorMaker and audit consumers
> prominently.
> >  I agree with Jason's point from earlier in the thread.  Adding extra
> >  configuration knobs that aren't really necessary can harm usability.
> >  Certainly asking people to manually turn on a feature "where it
> really
> >  hurts" seems to fall in that category, when we could easily enable
> it
> >  automatically for them.
> > >>> This doesn't make much sense to me.
> > >> There are no tradeoffs to think about from the client's point of view:
> > >> it always wants an incremental fetch session.  So there is no benefit
> to
> > >> making the clients configure an extra setting.  Updating and 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-19 Thread Colin McCabe
On Tue, Dec 19, 2017, at 02:16, Jan Filipiak wrote:
> Sorry for coming back at this so late.
> 
> 
> 
> On 11.12.2017 07:12, Colin McCabe wrote:
> > On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote:
> >> On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:
> >>> Hi,
> >>>
> >>> sorry for the late reply, busy times :-/
> >>>
> >>> I would ask you one thing maybe. Since the timeout
> >>> argument seems to be settled I have no further argument
> >>> from your side except the "i don't want to".
> >>>
> >>> Can you see that connection.max.idle.max is the exact time
> >>> that expresses "We expect the client to be away for this long,
> >>> and come back and continue"?
> >> Hi Jan,
> >>
> >> Sure, connection.max.idle.max is the exact time that we want to keep
> >> around a TCP session.  TCP sessions are relatively cheap, so we can
> >> afford to keep them around for 10 minutes by default.  Incremental fetch
> >> state is less cheap, so we want to set a shorter timeout for it.  We
> >> also want new TCP sessions to be able to reuse an existing incremental
> >> fetch session rather than creating a new one and waiting for the old one
> >> to time out.
> >>
> >>> also clarified some stuff inline
> >>>
> >>> Best Jan
> >>>
> >>>
> >>>
> >>>
> >>> On 05.12.2017 23:14, Colin McCabe wrote:
>  On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > Hi Colin
> >
> > Addressing the topic of how to manage slots from the other thread.
> > With tcp connections all this comes for free essentially.
>  Hi Jan,
> 
>  I don't think that it's accurate to say that cache management "comes for
>  free" by coupling the incremental fetch session with the TCP session.
>  When a new TCP session is started by a fetch request, you still have to
>  decide whether to grant that request an incremental fetch session or
>  not.  If your answer is that you always grant the request, I would argue
>  that you do not have cache management.
> >>> First I would say, the client has a big say in this. If the client
> >>> is not going to issue incremental requests, he shouldn't ask for a cache;
> >>> when the client asks for the cache, we still have all options to deny.
> >> To put it simply, we have to have some cache management above and beyond
> >> just giving out an incremental fetch session to anyone who has a TCP
> >> session.  Therefore, caching does not become simpler if you couple the
> >> fetch session to the TCP session.
> Simply giving out a fetch session to everyone with a connection is too
> simple, but I think it plays well into the idea of consumers choosing to
> use the feature, therefore only enabling it where it brings maximum gains
> (replicas, MirrorMakers).
> >>
>  I guess you could argue that timeouts are cache management, but I don't
>  find that argument persuasive.  Anyone could just create a lot of TCP
>  sessions and use a lot of resources, in that case.  So there is
>  essentially no limit on memory use.  In any case, TCP sessions don't
>  help us implement fetch session timeouts.
> >>> We still have all the options denying the request to keep the state.
> >>> What you want seems like a max connections / IP safeguard.
> >>> I can currently take down a broker with too many connections easily.
> >>>
> >>>
> > I still would argue we disable it by default and make a flag in the
> > broker to ask the leader to maintain the cache while replicating and 
> > also only
> > have it optional in consumers (default to off) so one can turn it on
> > where it really hurts.  MirrorMaker and audit consumers prominently.
>  I agree with Jason's point from earlier in the thread.  Adding extra
>  configuration knobs that aren't really necessary can harm usability.
>  Certainly asking people to manually turn on a feature "where it really
>  hurts" seems to fall in that category, when we could easily enable it
>  automatically for them.
> >>> This doesn't make much sense to me.
> >> There are no tradeoffs to think about from the client's point of view:
> >> it always wants an incremental fetch session.  So there is no benefit to
> >> making the clients configure an extra setting.  Updating and managing
> >> client configurations is also more difficult than managing broker
> >> configurations for most users.
> >>
> >>> You also wanted to implement
> >>> a "turn of in case of bug"-knob. Having the client indicate if the
> >>> feauture will be used seems reasonable to me.,
> >> True.  However, if there is a bug, we could also roll back the client,
> >> so having this configuration knob is not strictly required.
> >>
> > Otherwise I left a few remarks in-line, which should help to understand
> > my view of the situation better
> >
> > Best Jan
> >
> >
> > On 05.12.2017 08:06, Colin McCabe wrote:
> >> On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> >>> On 03.12.2017 21:55, Colin McCabe wrote:
>  On Sat, 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-19 Thread Jan Filipiak

Sorry for coming back at this so late.



On 11.12.2017 07:12, Colin McCabe wrote:

On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote:

On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:

Hi,

sorry for the late reply, busy times :-/

I would ask you one thing maybe. Since the timeout
argument seems to be settled I have no further argument
from your side except the "i don't want to".

Can you see that connection.max.idle.max is the exact time
that expresses "We expect the client to be away for this long,
and come back and continue"?

Hi Jan,

Sure, connection.max.idle.max is the exact time that we want to keep
around a TCP session.  TCP sessions are relatively cheap, so we can
afford to keep them around for 10 minutes by default.  Incremental fetch
state is less cheap, so we want to set a shorter timeout for it.  We
also want new TCP sessions to be able to reuse an existing incremental
fetch session rather than creating a new one and waiting for the old one
to time out.
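The distinction above, a fetch-session timeout separate from and shorter than the TCP idle timeout, with sessions able to outlive connections, could be sketched like this (illustrative names and values only, not broker code):

```python
# Sketch of a fetch-session cache whose lifetime is decoupled from TCP:
# sessions expire on their own shorter timeout, and a reconnecting client
# can keep touching (reusing) an existing session.

import time

class FetchSessionCache:
    def __init__(self, session_timeout_s):
        self.session_timeout_s = session_timeout_s
        self._sessions = {}  # session_id -> last-used timestamp (seconds)

    def touch(self, session_id, now=None):
        """Record a use of the session, e.g. on each incremental fetch."""
        self._sessions[session_id] = time.time() if now is None else now

    def expire(self, now=None):
        """Drop sessions idle longer than the fetch-session timeout."""
        now = time.time() if now is None else now
        expired = [sid for sid, last in self._sessions.items()
                   if now - last > self.session_timeout_s]
        for sid in expired:
            del self._sessions[sid]
        return expired

cache = FetchSessionCache(session_timeout_s=120)
cache.touch("session-1", now=0)
cache.touch("session-2", now=100)
# At t=150: session-1 (idle 150s) expires; session-2 (idle 50s) survives,
# even if its TCP connection was dropped and re-established in between.
gone = cache.expire(now=150)
```

Note the eviction here is driven by session activity, not connection state, which is exactly why a shorter, independent timeout makes sense for the costlier session state.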


also clarified some stuff inline

Best Jan




On 05.12.2017 23:14, Colin McCabe wrote:

On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:

Hi Colin

Addressing the topic of how to manage slots from the other thread.
With tcp connections all this comes for free essentially.

Hi Jan,

I don't think that it's accurate to say that cache management "comes for
free" by coupling the incremental fetch session with the TCP session.
When a new TCP session is started by a fetch request, you still have to
decide whether to grant that request an incremental fetch session or
not.  If your answer is that you always grant the request, I would argue
that you do not have cache management.

First I would say, the client has a big say in this. If the client
is not going to issue incremental fetches, it shouldn't ask for a cache;
when the client asks for the cache, we still have all options to deny.

To put it simply, we have to have some cache management above and beyond
just giving out an incremental fetch session to anyone who has a TCP
session.  Therefore, caching does not become simpler if you couple the
fetch session to the TCP session.
Simply giving out a fetch session for everyone with a connection is too
simple, but I think it plays well into the idea of consumers choosing to
use the feature, therefore only enabling it where it brings maximum gains
(replicas, MirrorMakers).



I guess you could argue that timeouts are cache management, but I don't
find that argument persuasive.  Anyone could just create a lot of TCP
sessions and use a lot of resources, in that case.  So there is
essentially no limit on memory use.  In any case, TCP sessions don't
help us implement fetch session timeouts.

We still have all the options to deny the request to keep the state.
What you want seems like a max connections / IP safeguard.
I can currently take down a broker with too many connections easily.



I would still argue we disable it by default: add a flag in the broker
to ask the leader to maintain the cache while replicating, and also make
it optional in consumers (default to off) so one can turn it on where it
really hurts. MirrorMaker and audit consumers prominently.

I agree with Jason's point from earlier in the thread.  Adding extra
configuration knobs that aren't really necessary can harm usability.
Certainly asking people to manually turn on a feature "where it really
hurts" seems to fall in that category, when we could easily enable it
automatically for them.

This doesn't make much sense to me.

There are no tradeoffs to think about from the client's point of view:
it always wants an incremental fetch session.  So there is no benefit to
making the clients configure an extra setting.  Updating and managing
client configurations is also more difficult than managing broker
configurations for most users.


You also wanted to implement
a "turn off in case of a bug" knob. Having the client indicate whether the
feature will be used seems reasonable to me.

True.  However, if there is a bug, we could also roll back the client,
so having this configuration knob is not strictly required.


Otherwise I left a few remarks in-line, which should help to understand
my view of the situation better

Best Jan


On 05.12.2017 08:06, Colin McCabe wrote:

On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:

On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-14 Thread Colin McCabe
Hi all,

I think the KIP has progressed a lot and is ready for a vote soon.  I'll
call a vote tomorrow if there are no more comments.

best,
Colin

On Thu, Dec 14, 2017, at 17:22, Colin McCabe wrote:
> On Tue, Dec 12, 2017, at 11:48, Becket Qin wrote:
> > Hi Colin,
> > 
> > I am not completely sure, but I am hoping that when we do
> > FileChannel.transferTo() the OS will just use a fixed buffer to transfer
> > the data to the socket channel without polluting the page cache. But this
> > might not be true if we are using SSL.
> 
> Hi Becket,
> 
> sendfile always uses the page cache.  See this comment by Linus
> Torvalds: http://yarchive.net/comp/linux/sendfile.html
> 
> > sendfile() wants the source to be in the page cache, because the whole
> > point of sendfile() was to avoid a copy. 
> 
> > 
> > The point I want to make is that avoiding doing binary search on index
> > file and avoid reading the log segments during fetch has some additional
> > benefits. So if the solution works for the current KIP, it might be a
> > better choice.
> 
> Let's discuss this in a follow-on KIP.
> 
> > 
> > Regarding the fixed session for the entire life of the clients, it may be
> > also related to another issue we want to solve with broker epoch in
> > KAFKA-6029. If we can make sure the session id will not change over the
> > lifetime of clients, we can use that session id instead of creating a
> > separate broker epoch and add that to the FetchRequest.
> 
> These issues are not really related.  That JIRA is proposing a "broker
> epoch" that would uniquely identify different incarnations of the
> broker.  In contrast, the fetch session ID doesn't uniquely identify
> even a single client, because a single client can have multiple fetcher
> threads.  In that case, each thread performing a fetch would have a
> fetcher ID.  Even if you only have a single fetcher thread, a given
> follower will have a different fetch session ID on each different
> leader.
> 
> best,
> Colin
> 
> > 
> > Thanks,
> > 
> > Jiangjie (Becket) Qin
> > 
> > 
> > 
> > On Mon, Dec 11, 2017 at 3:25 PM, Colin McCabe  wrote:
> > 
> > > On Mon, Dec 11, 2017, at 14:51, Becket Qin wrote:
> > > > Hi Jun,
> > > >
> > > > Yes, I agree avoiding reading the log segment is not the primary goal 
> > > > for
> > > > this KIP. I brought this up because recently I saw a significant
> > > > throughput
> > > > impact when a broker is down for 20 - 30 min and rejoins a cluster. The
> > > > bytes in rate could drop by 50% when that broker is trying to catch up
> > > > with
> > > > the leaders even in a big cluster (a single broker should not have such
> > > > big
> > > > impact on the entire cluster).
> > >
> > > Hi Becket,
> > >
> > > It sounds like the broker was fetching older data which wasn't in the
> > > page cache?  That sounds like it could definitely have a negative impact
> > > on the cluster.  It is a little troubling if the impact is a 50% drop in
> > > throughput, though.
> > >
> > > It's a little unclear how to mitigate this, since old data is definitely
> > > not going to be in memory.  Maybe we need to work on making sure that
> > > slow fetches going on by one fetcher do not slow down all the other
> > > worker threads...?
> > >
> > > > And some users also reported such cascading
> > > > degradation, i.e. when one consumer lags behind, the other consumers 
> > > > will
> > > > also start to lag behind. So I think addressing this is an important
> > > > improvement. I will run some tests and see if returning at index boundary
> > > > to avoid the log scan would help address this issue. That being said, I
> > > > agree that we don't have to address this issue in this KIP. I can submit
> > > > another KIP later if avoiding the log segment scan helps.
> > >
> > > Thanks, that's really interesting.
> > >
> > > I agree that it might be better in a follow-on KIP.
> > >
> > > Is the goal to improve the cold-cache case?  Maybe avoid looking at the
> > > index file altogether (except for the initial setup)?  That would be a
> > > nice improvement for consumers fetching big sequential chunks of
> > > historic data.
> > >
> > > regards,
> > > Colin
> > >
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Mon, Dec 11, 2017 at 1:06 PM, Dong Lin  wrote:
> > > >
> > > > > Hey Colin,
> > > > >
> > > > > I went over the latest KIP wiki and have a few comments here.
> > > > >
> > > > > 1) The KIP says that client ID is a string if the session belongs to a
> > > > > Kafka consumer. And it is a numerical follower Id if the session
> > > belongs to
> > > > > a follower. Can we have a consistent type for the client Id?
> > > > >
> > > > > 2) "The numeric follower ID, if this fetch session belongs to a Kafka
> > > > > broker". If the broker has multiple replica fetcher thread, do they 
> > > > > all
> > > > > have the same follower Id in teh leader broker?
> > > > >
> > > > > 3) One of the condition 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-14 Thread Colin McCabe
On Tue, Dec 12, 2017, at 11:48, Becket Qin wrote:
> Hi Colin,
> 
> I am not completely sure, but I am hoping that when we do
> FileChannel.transferTo() the OS will just use a fixed buffer to transfer
> the data to the socket channel without polluting the page cache. But this
> might not be true if we are using SSL.

Hi Becket,

sendfile always uses the page cache.  See this comment by Linus
Torvalds: http://yarchive.net/comp/linux/sendfile.html

> sendfile() wants the source to be in the page cache, because the whole
> point of sendfile() was to avoid a copy. 

> 
> The point I want to make is that avoiding doing binary search on index
> file and avoid reading the log segments during fetch has some additional
> benefits. So if the solution works for the current KIP, it might be a
> better choice.

Let's discuss this in a follow-on KIP.

> 
> Regarding the fixed session for the entire life of the clients, it may be
> also related to another issue we want to solve with broker epoch in
> KAFKA-6029. If we can make sure the session id will not change over the
> lifetime of clients, we can use that session id instead of creating a
> separate broker epoch and add that to the FetchRequest.

These issues are not really related.  That JIRA is proposing a "broker
epoch" that would uniquely identify different incarnations of the
broker.  In contrast, the fetch session ID doesn't uniquely identify
even a single client, because a single client can have multiple fetcher
threads.  In that case, each thread performing a fetch would have a
fetcher ID.  Even if you only have a single fetcher thread, a given
follower will have a different fetch session ID on each different
leader.

best,
Colin

> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> On Mon, Dec 11, 2017 at 3:25 PM, Colin McCabe  wrote:
> 
> > On Mon, Dec 11, 2017, at 14:51, Becket Qin wrote:
> > > Hi Jun,
> > >
> > > Yes, I agree avoiding reading the log segment is not the primary goal for
> > > this KIP. I brought this up because recently I saw a significant
> > > throughput
> > > impact when a broker is down for 20 - 30 min and rejoins a cluster. The
> > > bytes in rate could drop by 50% when that broker is trying to catch up
> > > with
> > > the leaders even in a big cluster (a single broker should not have such
> > > big
> > > impact on the entire cluster).
> >
> > Hi Becket,
> >
> > It sounds like the broker was fetching older data which wasn't in the
> > page cache?  That sounds like it could definitely have a negative impact
> > on the cluster.  It is a little troubling if the impact is a 50% drop in
> > throughput, though.
> >
> > It's a little unclear how to mitigate this, since old data is definitely
> > not going to be in memory.  Maybe we need to work on making sure that
> > slow fetches going on by one fetcher do not slow down all the other
> > worker threads...?
> >
> > > And some users also reported such cascading
> > > degradation, i.e. when one consumer lags behind, the other consumers will
> > > also start to lag behind. So I think addressing this is an important
> > > improvement. I will run some tests and see if returning at index boundary
> > > to avoid the log scan would help address this issue. That being said, I
> > > agree that we don't have to address this issue in this KIP. I can submit
> > > another KIP later if avoiding the log segment scan helps.
> >
> > Thanks, that's really interesting.
> >
> > I agree that it might be better in a follow-on KIP.
> >
> > Is the goal to improve the cold-cache case?  Maybe avoid looking at the
> > index file altogether (except for the initial setup)?  That would be a
> > nice improvement for consumers fetching big sequential chunks of
> > historic data.
> >
> > regards,
> > Colin
> >
> >
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Mon, Dec 11, 2017 at 1:06 PM, Dong Lin  wrote:
> > >
> > > > Hey Colin,
> > > >
> > > > I went over the latest KIP wiki and have a few comments here.
> > > >
> > > > 1) The KIP says that client ID is a string if the session belongs to a
> > > > Kafka consumer. And it is a numerical follower Id if the session
> > belongs to
> > > > a follower. Can we have a consistent type for the client Id?
> > > >
> > > > 2) "The numeric follower ID, if this fetch session belongs to a Kafka
> > > > broker". If the broker has multiple replica fetcher thread, do they all
> > > > have the same follower Id in teh leader broker?
> > > >
> > > > 3) One of the conditions for evicting an existing session is that "The
> > new
> > > > session belongs to a follower, and the existing session belongs to a
> > > > regular consumer". I am not sure the session from follower should also
> > be
> > > > restricted by the newly added config. It seems that we will always
> > create
> > > > slots for FetchRequests from follower brokers. Maybe the
> > > > "max.incremental.fetch.session.cache.slots" should only be applied if
> > the
> > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-12 Thread Becket Qin
Hi Colin,

I am not completely sure, but I am hoping that when we do
FileChannel.transferTo() the OS will just use a fixed buffer to transfer
the data to the socket channel without polluting the page cache. But this
might not be true if we are using SSL.
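For reference, Java's FileChannel.transferTo() maps to the sendfile(2) syscall on Linux, which Python also exposes as os.sendfile. A small sketch of that zero-copy path (illustrative only; a Unix socketpair stands in for the real client connection, and small payloads are assumed so the socket buffer doesn't fill):

```python
import os
import socket
import tempfile

def sendfile_transfer(path):
    """Send a file's bytes into a socket via sendfile(2), the syscall
    behind FileChannel.transferTo() on Linux. Note sendfile always reads
    the source through the page cache; it cannot bypass it, which is the
    point made in the Torvalds comment linked in this thread."""
    left, right = socket.socketpair()
    try:
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            offset = 0
            while offset < size:
                # Kernel copies file pages straight to the socket; no
                # trip through a userspace buffer.
                sent = os.sendfile(left.fileno(), f.fileno(),
                                   offset, size - offset)
                if sent == 0:
                    break
                offset += sent
        left.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            chunk = right.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        left.close()
        right.close()
```

With SSL the zero-copy path is indeed lost, since the broker must read the data into userspace to encrypt it before writing to the socket.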

The point I want to make is that avoiding doing binary search on index file
and avoid reading the log segments during fetch has some additional
benefits. So if the solution works for the current KIP, it might be a
better choice.

Regarding the fixed session for the entire life of the clients, it may be
also related to another issue we want to solve with broker epoch in
KAFKA-6029. If we can make sure the session id will not change over the
lifetime of clients, we can use that session id instead of creating a
separate broker epoch and add that to the FetchRequest.

Thanks,

Jiangjie (Becket) Qin



On Mon, Dec 11, 2017 at 3:25 PM, Colin McCabe  wrote:

> On Mon, Dec 11, 2017, at 14:51, Becket Qin wrote:
> > Hi Jun,
> >
> > Yes, I agree avoiding reading the log segment is not the primary goal for
> > this KIP. I brought this up because recently I saw a significant
> > throughput
> > impact when a broker is down for 20 - 30 min and rejoins a cluster. The
> > bytes in rate could drop by 50% when that broker is trying to catch up
> > with
> > the leaders even in a big cluster (a single broker should not have such
> > big
> > impact on the entire cluster).
>
> Hi Becket,
>
> It sounds like the broker was fetching older data which wasn't in the
> page cache?  That sounds like it could definitely have a negative impact
> on the cluster.  It is a little troubling if the impact is a 50% drop in
> throughput, though.
>
> It's a little unclear how to mitigate this, since old data is definitely
> not going to be in memory.  Maybe we need to work on making sure that
> slow fetches going on by one fetcher do not slow down all the other
> worker threads...?
>
> > And some users also reported such cascading
> > degradation, i.e. when one consumer lags behind, the other consumers will
> > also start to lag behind. So I think addressing this is an important
> > improvement. I will run some tests and see if returning at index boundary
> > to avoid the log scan would help address this issue. That being said, I
> > agree that we don't have to address this issue in this KIP. I can submit
> > another KIP later if avoiding the log segment scan helps.
>
> Thanks, that's really interesting.
>
> I agree that it might be better in a follow-on KIP.
>
> Is the goal to improve the cold-cache case?  Maybe avoid looking at the
> index file altogether (except for the initial setup)?  That would be a
> nice improvement for consumers fetching big sequential chunks of
> historic data.
>
> regards,
> Colin
>
>
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Dec 11, 2017 at 1:06 PM, Dong Lin  wrote:
> >
> > > Hey Colin,
> > >
> > > I went over the latest KIP wiki and have a few comments here.
> > >
> > > 1) The KIP says that client ID is a string if the session belongs to a
> > > Kafka consumer. And it is a numerical follower Id if the session
> belongs to
> > > a follower. Can we have a consistent type for the client Id?
> > >
> > > 2) "The numeric follower ID, if this fetch session belongs to a Kafka
> > > broker". If the broker has multiple replica fetcher thread, do they all
> > > have the same follower Id in teh leader broker?
> > >
> > > 3) One of the conditions for evicting an existing session is that "The
> new
> > > session belongs to a follower, and the existing session belongs to a
> > > regular consumer". I am not sure the session from follower should also
> be
> > > restricted by the newly added config. It seems that we will always
> create
> > > slots for FetchRequests from follower brokers. Maybe the
> > > "max.incremental.fetch.session.cache.slots" should only be applied if
> the
> > > FetchRequest comes from a client consumer?
> > >
> > > 4) Not sure I fully understand how the "The last dirty sequence
> number" is
> > > used. It is mentioned that "Let P1 have a last dirty sequence number of
> > > 100, and P2 have a last dirty sequence number of 101. An incremental
> fetch
> > > request with sequence number 100 will return information about both P1
> and
> > > P2." But would be the fetch offset for P2 in this case, if the last
> fetch
> > > offset stored in the Fetch Session for P2 is associated with the last
> dirty
> > > sequence number 101 for P2? My gut feel is that you would have to
> > > store
> > > the fetch offset for sequence number 100 for P2 as well. Did I miss
> > > something here?
> > >
> > > Thanks,
> > > Dong
> > >
> > > On Sun, Dec 10, 2017 at 11:15 PM, Becket Qin 
> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > I see. Yes, that makes sense. Are we going to do that only for the
> > > fetches
> > > > whose per partition fetch size cannot reach the first index entry
> after
> > > the
> > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Colin McCabe
On Mon, Dec 11, 2017, at 14:51, Becket Qin wrote:
> Hi Jun,
> 
> Yes, I agree avoiding reading the log segment is not the primary goal for
> this KIP. I brought this up because recently I saw a significant
> throughput
> impact when a broker is down for 20 - 30 min and rejoins a cluster. The
> bytes in rate could drop by 50% when that broker is trying to catch up
> with
> the leaders even in a big cluster (a single broker should not have such
> big
> impact on the entire cluster).

Hi Becket,

It sounds like the broker was fetching older data which wasn't in the
page cache?  That sounds like it could definitely have a negative impact
on the cluster.  It is a little troubling if the impact is a 50% drop in
throughput, though.

It's a little unclear how to mitigate this, since old data is definitely
not going to be in memory.  Maybe we need to work on making sure that
slow fetches going on by one fetcher do not slow down all the other
worker threads...?

> And some users also reported such cascading
> degradation, i.e. when one consumer lags behind, the other consumers will
> also start to lag behind. So I think addressing this is an important
> improvement. I will run some tests and see if returning at index boundary
> to avoid the log scan would help address this issue. That being said, I
> agree that we don't have to address this issue in this KIP. I can submit
> another KIP later if avoiding the log segment scan helps.

Thanks, that's really interesting.

I agree that it might be better in a follow-on KIP.

Is the goal to improve the cold-cache case?  Maybe avoid looking at the
index file altogether (except for the initial setup)?  That would be a
nice improvement for consumers fetching big sequential chunks of
historic data.

regards,
Colin


> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> On Mon, Dec 11, 2017 at 1:06 PM, Dong Lin  wrote:
> 
> > Hey Colin,
> >
> > I went over the latest KIP wiki and have a few comments here.
> >
> > 1) The KIP says that client ID is a string if the session belongs to a
> > Kafka consumer. And it is a numerical follower Id if the session belongs to
> > a follower. Can we have a consistent type for the client Id?
> >
> > 2) "The numeric follower ID, if this fetch session belongs to a Kafka
> > broker". If the broker has multiple replica fetcher thread, do they all
> > have the same follower Id in teh leader broker?
> >
> > 3) One of the conditions for evicting an existing session is that "The new
> > session belongs to a follower, and the existing session belongs to a
> > regular consumer". I am not sure the session from follower should also be
> > restricted by the newly added config. It seems that we will always create
> > slots for FetchRequests from follower brokers. Maybe the
> > "max.incremental.fetch.session.cache.slots" should only be applied if the
> > FetchRequest comes from a client consumer?
> >
> > 4) Not sure I fully understand how the "The last dirty sequence number" is
> > used. It is mentioned that "Let P1 have a last dirty sequence number of
> > 100, and P2 have a last dirty sequence number of 101. An incremental fetch
> > request with sequence number 100 will return information about both P1 and
> > P2." But would be the fetch offset for P2 in this case, if the last fetch
> > offset stored in the Fetch Session for P2 is associated with the last dirty
> > sequence number 101 for P2? My gut feel is that you would have to stored
> > the fetch offset for sequence number 100 for P2 as well. Did I miss
> > something here?
> >
> > Thanks,
> > Dong
> >
> > On Sun, Dec 10, 2017 at 11:15 PM, Becket Qin  wrote:
> >
> > > Hi Jun,
> > >
> > > I see. Yes, that makes sense. Are we going to do that only for the
> > fetches
> > > whose per partition fetch size cannot reach the first index entry after
> > the
> > > fetch position, or are we going to do that for any fetch? If we do that
> > for
> > > any fetch, then we will still need to read the actual log segment, which
> > > could be expensive if the data is no longer in the cache. This hurts
> > > performance if some fetches are on the old log segments.
> > >
> > > I took a quick look on the clusters we have. The idle topic ratio varies
> > > depending on the usage of the cluster. For our metric cluster and
> > database
> > > replication clusters almost all the topics are actively used. For
> > tracking
> > > clusters, ~70% of topics have data coming in at different rates. For other
> > > clusters, such as queuing and data deployment, there are more idle topics
> > > and the traffic is more bursty (I don't have the exact number here).
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Sun, Dec 10, 2017 at 10:17 PM, Colin McCabe 
> > wrote:
> > >
> > > > On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> > > > > Hi, Jiangjie,
> > > > >
> > > > > What I described is almost the same as yours. The only extra thing is
> > > to
> > > > > scan the log segment 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Colin McCabe
On Mon, Dec 11, 2017, at 13:06, Dong Lin wrote:
> Hey Colin,
> 
> I went over the latest KIP wiki and have a few comments here.
> 
> 1) The KIP says that client ID is a string if the session belongs to a
> Kafka consumer. And it is a numerical follower Id if the session belongs
> to a follower. 

Hi Dong,

Right.  The issue is that replicas are identified by integers, whereas
consumers are identified by strings.

> Can we have a consistent type for the client Id?

We could use a string for both, perhaps?  Theoretically, a consumer
could also be named "broker 0" though, right?  So it would not be unique
any more.  Not sure what the best approach is here... what do you think?

> 2) "The numeric follower ID, if this fetch session belongs to a Kafka
> broker". If the broker has multiple replica fetcher thread, do they all
> have the same follower Id in teh leader broker?

Yes, every session created by the same replica will have the same
replica ID.  The fetch session ID will be different for each thread,
though.

> 3) One of the conditions for evicting an existing session is that "The new
> session belongs to a follower, and the existing session belongs to a
> regular consumer". I am not sure the session from follower should also be
> restricted by the newly added config. It seems that we will always create
> slots for FetchRequests from follower brokers. Maybe the
> "max.incremental.fetch.session.cache.slots" should only be applied if the
> FetchRequest comes from a client consumer?

Well, replicas may sometimes go down.  In that case, when they come back
up, they will create new fetch sessions.  So we need to be able to evict
the old fetch sessions from the cache.  In the course of normal
operation, though, replicas should always have incremental fetch
sessions, because they are prioritized over consumers.
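The priority rule described here can be sketched in a few lines (Python for illustration; the encoding of the rule is an assumption, the policy itself is from the discussion above):

```python
def can_evict(new_session, existing_session):
    """Illustrative sketch of the cache-eviction priority: when the
    session cache is full, a new follower (replica) session may evict an
    existing consumer session, because replicas are prioritized over
    consumers, but a consumer session never evicts a follower session."""
    return new_session["is_follower"] and not existing_session["is_follower"]
```

This keeps the invariant that, in steady state, replicas always hold incremental fetch sessions even when consumers are competing for slots.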

> 
> 4) Not sure I fully understand how the "The last dirty sequence number"
> is
> used. It is mentioned that "Let P1 have a last dirty sequence number of
> 100, and P2 have a last dirty sequence number of 101. An incremental
> fetch
> request with sequence number 100 will return information about both P1
> and
> P2." But would be the fetch offset for P2 in this case, if the last fetch
> offset stored in the Fetch Session for P2 is associated with the last
> dirty
> sequence number 101 for P2? My gut feel is that you would have to store
> the fetch offset for sequence number 100 for P2 as well. Did I miss
> something here?

It's OK to return the latest available data.  So if someone sends a
fetch request with seqno 100, and then you get new data, and then the
fetch request with seqno 100 gets resent, it's OK to include the new
data in the second response, even though it wasn't in the first
response.  We are NOT implementing snapshots here, or anything like
that.  We are just trying to guard against the "lost updates" case. 
Also, keep in mind that the fetch offset is not updated until it is sent
in a fetch request, just like now.
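A simplified sketch of this sequence-number check (Python for illustration; the field and error names here are assumptions, not the KIP's wire format, and retry handling is deliberately omitted):

```python
class IncrementalFetchSession:
    """Illustrative sketch: the sequence number guards against lost
    updates (e.g. requests from a stale TCP session), not snapshots."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.expected_seq = 0
        self.fetch_offsets = {}  # partition -> last fetch offset the client sent

    def handle_fetch(self, seq, offsets, latest_data):
        if seq != self.expected_seq:
            # Out-of-order or stale request: reject it and force the
            # client to fall back to a full fetch request.
            return "SEQUENCE_MISMATCH"
        # Fetch offsets advance only when sent in a request, just like now.
        self.fetch_offsets.update(offsets)
        self.expected_seq += 1
        # Returning the latest available data is fine, even data that
        # arrived after an earlier attempt with this sequence number.
        return latest_data
```

Note the response always reflects the latest data: there is no per-sequence-number snapshot to store, which answers the question about P2 above.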

best,
Colin

> 
> Thanks,
> Dong
> 
> On Sun, Dec 10, 2017 at 11:15 PM, Becket Qin 
> wrote:
> 
> > Hi Jun,
> >
> > I see. Yes, that makes sense. Are we going to do that only for the fetches
> > whose per partition fetch size cannot reach the first index entry after the
> > fetch position, or are we going to do that for any fetch? If we do that for
> > any fetch, then we will still need to read the actual log segment, which
> > could be expensive if the data is no longer in the cache. This hurts
> > performance if some fetches are on the old log segments.
> >
> > I took a quick look on the clusters we have. The idle topic ratio varies
> > depending on the usage of the cluster. For our metric cluster and database
> > replication clusters almost all the topics are actively used. For tracking
> > clusters, ~70% of topics have data coming in at different rates. For other
> > clusters, such as queuing and data deployment, there are more idle topics
> > and the traffic is more bursty (I don't have the exact number here).
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sun, Dec 10, 2017 at 10:17 PM, Colin McCabe  wrote:
> >
> > > On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> > > > Hi, Jiangjie,
> > > >
> > > > What I described is almost the same as yours. The only extra thing is
> > to
> > > > scan the log segment from the identified index entry a bit more to
> > find a
> > > > file position that ends at a message set boundary and is less than the
> > > > partition level fetch size. This way, we still preserve the current
> > > > semantic of not returning more bytes than fetch size unless there is a
> > > > single message set larger than the fetch size.
> > > >
> > > > In a typically cluster at LinkedIn, what's the percentage of idle
> > > > partitions?
> > >
> > > Yeah, that would be a great number to get.
> > >
> > > Of course, KIP-227 will also benefit partitions that are not completely
> > > idle.  For instance, a partition that's 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Colin McCabe
On Mon, Dec 11, 2017, at 13:17, Dong Lin wrote:
> On Thu, Dec 7, 2017 at 1:52 PM, Colin McCabe  wrote:
> 
> > On Wed, Dec 6, 2017, at 11:23, Becket Qin wrote:
> > > Hi Colin,
> > >
> > > >A full fetch request will certainly avoid any ambiguity here.  But now
> > > >we're back to sending full fetch requests whenever there are network
> > > >issues, which is worse than the current proposal.  And has the
> > > >congestion collapse problem I talked about earlier when the network is
> > > >wobbling.  We also don't get the other debuggability benefits of being
> > > >able to uniquely associate each update in the incremental fetch session
> > > >with a sequence number.
> > >
> > > I think we would want to optimize for the normal case instead of the
> > > failure case. The failure case is supposed to be rare and if that happens
> > > usually it requires human attention to fix anyways. So reducing the
> > > regular cost in the normal cases probably makes more sense.
> >
> 
> 
> Hmm.. let me chime in and ask a quick question on this.
> 
> My understanding of Becket's proposal is that the FetchRequest will not
> contain per-partition information in the normal cases. According to the
> latest KIP, it is said that "Incremental FetchRequests will only contain
> information about partitions which have changed on the follower". So if
> there is always data available for every partition on the broker, the
> FetchRequest will always contain per-partition information for every
> partition, which makes it essentially a full FetchRequest in the normal case.
> Did I miss something here?

Hi Dong,

I think your understanding is correct.  The KIP-227 proposal includes
information about changed partitions in the partition fetch request.  If
every partition has changed, every partition will be included.

I don't think this is a problem.  For one thing, if every partition has
changed, then every partition will have data, which means you will have
a really large FetchResponse.  In that case, most of your network
bandwidth goes to the response anyway, rather than to the request.  And
you cannot get rid of that overhead, because you actually need to fetch
that data.

In any case, I am very skeptical that clusters that have information for
every partition on every fetch request exist in the wild. Remember that,
by default, we return a response to the fetch request when any partition
gets even a single byte of data.  Let's say you have 10,000 partitions. 
Do you have 10,000 produce requests being handled to each partition in
between each fetch request?  All the time?  So that somehow you can
service 10,000 Produce RPCs before you send back the response to a
single pending FetchRequest?  That's not very believable, especially
when you start thinking about internal topics.
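A back-of-envelope calculation makes this concrete (all per-partition sizes below are illustrative assumptions, not measured values):

```python
# If every one of 10,000 partitions changed, the response must carry the
# actual data, so the request's per-partition overhead is a small slice
# of the total bytes on the wire.
partitions = 10_000
request_bytes_per_partition = 30      # topic/partition/offset fields (assumed)
response_bytes_per_partition = 1_024  # at least ~1 KiB of data each (assumed)

request_size = partitions * request_bytes_per_partition    # 300,000 bytes
response_size = partitions * response_bytes_per_partition  # ~10 MB
request_share = request_size / (request_size + response_size)
print(f"request is {request_share:.1%} of bytes transferred")
```

Even in this worst case, nearly all the bandwidth goes to the response, which has to be sent regardless.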

I think the only way you could get reasonably close to a true fully
loaded fetch response is if you tuned Kafka for high latency and high
bandwidth.  So you could increase the wait time before sending back any
responses, and increase the minimum response size.  But that's not the
scenario we're addressing here.
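For concreteness, the fetch-completion rule Colin is appealing to can be sketched roughly as follows. The parameter names mirror the real consumer configs fetch.min.bytes and fetch.max.wait.ms (defaults 1 byte and 500 ms), but the function itself is a simplified model, not the broker's actual purgatory code:

```python
# Simplified model of when the broker completes a pending fetch request.
# Defaults mirror the consumer configs fetch.min.bytes=1 and
# fetch.max.wait.ms=500; the real purgatory logic is more involved.
def should_respond(bytes_available, elapsed_ms, min_bytes=1, max_wait_ms=500):
    # Complete the fetch once enough bytes have accumulated across the
    # requested partitions, or once the maximum wait time has expired.
    return bytes_available >= min_bytes or elapsed_ms >= max_wait_ms

# With the defaults, a single byte arriving on any one partition completes
# the fetch immediately -- so a response rarely carries data for *every*
# partition at once.
print(should_respond(bytes_available=1, elapsed_ms=0))  # True

# Tuning for high latency / high bandwidth delays the response instead
# (illustrative values, not recommendations):
print(should_respond(1, 0, min_bytes=1048576, max_wait_ms=5000))  # False
```

This is why a "true fully loaded" fetch response only really appears under deliberate high-latency, high-bandwidth tuning.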

best,
Colin

> 
> 
> 
> > >
> > > Thanks,
> >
> > Hi Becket,
> >
> > I agree we should optimize for the normal case.  I believe that the
> > sequence number proposal I put forward does this.  All the competing
> > proposals have been strictly worse for both the normal and error cases.
> > For example, the proposal to rely on the TCP session to establish
> > ordering does not help the normal case.  But it does make the case where
> > there are network issues worse.  It also makes it harder for us to put a
> > limit on the amount of time we will cache, which is worse for the normal
> > case.
> >
> > best,
> > Colin
> >
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Wed, Dec 6, 2017 at 10:58 AM, Colin McCabe 
> > wrote:
> > >
> > > > On Wed, Dec 6, 2017, at 10:49, Jason Gustafson wrote:
> > > > > >
> > > > > > There is already a way in the existing proposal for clients to
> > change
> > > > > > the set of partitions they are interested in, while re-using their
> > same
> > > > > > session and session ID.  We don't need to change how sequence ID
> > works
> > > > > > in order to do this.
> > > > >
> > > > >
> > > > > There is some inconsistency in the KIP about this, so I wasn't sure.
> > In
> > > > > particular, you say this: " The FetchSession maintains information
> > about
> > > > > a specific set of relevant partitions.  Note that the set of relevant
> > > > > partitions is established when the FetchSession is created.  It
> > cannot be
> > > > > changed later." Maybe that could be clarified?
> > > >
> > > > That's a fair point-- I didn't fix this part of the KIP after making an
> > > > update below.  So it was definitely unclear.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > >
> > > > >
> > > > > > But how does the broker know that it needs to resend the data for
> > > > > > partition P?  After all, if the response 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Becket Qin
Hi Jun,

Yes, I agree avoiding reading the log segment is not the primary goal for
this KIP. I brought this up because recently I saw a significant throughput
impact when a broker is down for 20 - 30 min and rejoins a cluster. The
bytes-in rate could drop by 50% when that broker is trying to catch up with
the leaders, even in a big cluster (a single broker should not have such a
big impact on the entire cluster). And some users also reported such cascading
degradation, i.e. when one consumer lags behind, the other consumers will
also start to lag behind. So I think addressing this is an important
improvement. I will run some tests and see if returning at index boundaries
to avoid the log scan would help address this issue. That being said, I agree
that we don't have to address this issue in this KIP. I can submit another
KIP later if avoiding the log segment scan helps.

Thanks,

Jiangjie (Becket) Qin

On Mon, Dec 11, 2017 at 1:06 PM, Dong Lin  wrote:

> Hey Colin,
>
> I went over the latest KIP wiki and have a few comments here.
>
> 1) The KIP says that client ID is a string if the session belongs to a
> Kafka consumer. And it is a numerical follower Id if the session belongs to
> a follower. Can we have a consistent type for the client Id?
>
> 2) "The numeric follower ID, if this fetch session belongs to a Kafka
> broker". If the broker has multiple replica fetcher threads, do they all
> have the same follower ID in the leader broker?
>
> 3) One of the conditions for evicting an existing session is that "The new
> session belongs to a follower, and the existing session belongs to a
> regular consumer". I am not sure the session from a follower should also be
> restricted by the newly added config. It seems that we will always create
> slots for FetchRequests from follower brokers. Maybe
> "max.incremental.fetch.session.cache.slots" should only be applied if the
> FetchRequest comes from a client consumer?
>
> 4) Not sure I fully understand how the "last dirty sequence number" is
> used. It is mentioned that "Let P1 have a last dirty sequence number of
> 100, and P2 have a last dirty sequence number of 101. An incremental fetch
> request with sequence number 100 will return information about both P1 and
> P2." But what would be the fetch offset for P2 in this case, if the last
> fetch offset stored in the fetch session for P2 is associated with the last
> dirty sequence number 101 for P2? My gut feeling is that you would have to
> store the fetch offset for sequence number 100 for P2 as well. Did I miss
> something here?
>
> Thanks,
> Dong
>
> On Sun, Dec 10, 2017 at 11:15 PM, Becket Qin  wrote:
>
> > Hi Jun,
> >
> > I see. Yes, that makes sense. Are we going to do that only for the
> fetches
> > whose per partition fetch size cannot reach the first index entry after
> the
> > fetch position, or are we going to do that for any fetch? If we do that
> for
> > any fetch, then we will still need to read the actual log segment, which
> > could be expensive if the data is no longer in the cache. This hurts
> > performance if some fetches are on the old log segments.
> >
> > I took a quick look at the clusters we have. The idle topic ratio varies
> > depending on the usage of the cluster. For our metric cluster and
> > database replication clusters, almost all the topics are actively used.
> > For tracking clusters, ~70% of topics have data coming in at different
> > rates. For other clusters, such as queuing and data deployment, there are
> > more idle topics and the traffic is more bursty (I don't have the exact
> > numbers here).
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sun, Dec 10, 2017 at 10:17 PM, Colin McCabe 
> wrote:
> >
> > > On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> > > > Hi, Jiangjie,
> > > >
> > > > What I described is almost the same as yours. The only extra thing is
> > to
> > > > scan the log segment from the identified index entry a bit more to
> > find a
> > > > file position that ends at a message set boundary and is less than
> the
> > > > partition level fetch size. This way, we still preserve the current
> > > > semantic of not returning more bytes than fetch size unless there is
> a
> > > > single message set larger than the fetch size.
> > > >
> > > > In a typical cluster at LinkedIn, what's the percentage of idle
> > > > partitions?
> > >
> > > Yeah, that would be a great number to get.
> > >
> > > Of course, KIP-227 will also benefit partitions that are not completely
> > > idle.  For instance, a partition that's getting just one message a
> > > second will appear in many fetch requests, unless every other partition
> > > in the system is also only getting a low rate of incoming messages.
> > >
> > > regards,
> > > Colin
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin 
> > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Dong Lin
Hey Colin,

I went over the latest KIP wiki and have a few comments here.

1) The KIP says that client ID is a string if the session belongs to a
Kafka consumer. And it is a numerical follower Id if the session belongs to
a follower. Can we have a consistent type for the client Id?

2) "The numeric follower ID, if this fetch session belongs to a Kafka
broker". If the broker has multiple replica fetcher threads, do they all
have the same follower ID in the leader broker?

3) One of the conditions for evicting an existing session is that "The new
session belongs to a follower, and the existing session belongs to a
regular consumer". I am not sure the session from a follower should also be
restricted by the newly added config. It seems that we will always create
slots for FetchRequests from follower brokers. Maybe
"max.incremental.fetch.session.cache.slots" should only be applied if the
FetchRequest comes from a client consumer?

4) Not sure I fully understand how the "last dirty sequence number" is
used. It is mentioned that "Let P1 have a last dirty sequence number of
100, and P2 have a last dirty sequence number of 101. An incremental fetch
request with sequence number 100 will return information about both P1 and
P2." But what would be the fetch offset for P2 in this case, if the last
fetch offset stored in the fetch session for P2 is associated with the last
dirty sequence number 101 for P2? My gut feeling is that you would have to
store the fetch offset for sequence number 100 for P2 as well. Did I miss
something here?

Thanks,
Dong
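For concreteness, the per-partition bookkeeping being asked about in point 4 might look roughly like the sketch below. The class and field names are made up, and the inclusion rule is one reading of the P1/P2 example quoted from the KIP; the actual proposal may differ:

```python
# Hypothetical sketch of the "last dirty sequence number" bookkeeping
# described in the KIP text quoted above.  All names are illustrative.

class CachedPartition:
    def __init__(self, fetch_offset, last_dirty_sequence):
        self.fetch_offset = fetch_offset                # next offset to fetch
        self.last_dirty_sequence = last_dirty_sequence  # seq at which it last changed

class FetchSession:
    def __init__(self):
        self.partitions = {}  # (topic, partition) -> CachedPartition

    def partitions_to_include(self, request_sequence):
        # Include a partition if it became "dirty" at or after the sequence
        # number carried by the incremental fetch request.
        return [tp for tp, p in self.partitions.items()
                if p.last_dirty_sequence >= request_sequence]

session = FetchSession()
session.partitions[("t", 1)] = CachedPartition(500, last_dirty_sequence=100)  # P1
session.partitions[("t", 2)] = CachedPartition(800, last_dirty_sequence=101)  # P2

print(sorted(session.partitions_to_include(100)))  # [('t', 1), ('t', 2)]
print(sorted(session.partitions_to_include(101)))  # [('t', 2)]
```

Under this reading, a request with sequence 100 returns both partitions, which is exactly why Dong asks what fetch offset the session would use for P2 at that point.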

On Sun, Dec 10, 2017 at 11:15 PM, Becket Qin  wrote:

> Hi Jun,
>
> I see. Yes, that makes sense. Are we going to do that only for the fetches
> whose per partition fetch size cannot reach the first index entry after the
> fetch position, or are we going to do that for any fetch? If we do that for
> any fetch, then we will still need to read the actual log segment, which
> could be expensive if the data is no longer in the cache. This hurts
> performance if some fetches are on the old log segments.
>
> I took a quick look at the clusters we have. The idle topic ratio varies
> depending on the usage of the cluster. For our metric cluster and database
> replication clusters, almost all the topics are actively used. For tracking
> clusters, ~70% of topics have data coming in at different rates. For other
> clusters, such as queuing and data deployment, there are more idle topics
> and the traffic is more bursty (I don't have the exact numbers here).
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sun, Dec 10, 2017 at 10:17 PM, Colin McCabe  wrote:
>
> > On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> > > Hi, Jiangjie,
> > >
> > > What I described is almost the same as yours. The only extra thing is
> to
> > > scan the log segment from the identified index entry a bit more to
> find a
> > > file position that ends at a message set boundary and is less than the
> > > partition level fetch size. This way, we still preserve the current
> > > semantic of not returning more bytes than fetch size unless there is a
> > > single message set larger than the fetch size.
> > >
> > > In a typical cluster at LinkedIn, what's the percentage of idle
> > > partitions?
> >
> > Yeah, that would be a great number to get.
> >
> > Of course, KIP-227 will also benefit partitions that are not completely
> > idle.  For instance, a partition that's getting just one message a
> > second will appear in many fetch requests, unless every other partition
> > in the system is also only getting a low rate of incoming messages.
> >
> > regards,
> > Colin
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin 
> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Yes, we still need to handle the corner case. And you are right, it
> is
> > all
> > > > about trade-off between simplicity and the performance gain.
> > > >
> > > > I was thinking that the brokers always return at least
> > > > log.index.interval.bytes per partition to the consumer, just like we
> > will
> > > > return at least one message to the user. This way we don't need to
> > worry
> > > > about the case that the fetch size is smaller than the index
> interval.
> > We
> > > > may just need to let users know this behavior change.
> > > >
> > > > Not sure if I completely understand your solution, but I think we are
> > > > thinking about the same. i.e. for the first fetch asking for offset
> > x0, we
> > > > will need to do a binary search to find the position p0. and then the
> > > > broker will iterate over the index entries starting from the first
> > index
> > > > entry whose offset is greater than p0 until it reaches the index
> > entry(x1,
> > > > p1) so that p1 - p0 is just under the fetch size, but the next entry
> > will
> > > > exceed the fetch size. We then return the bytes from p0 to p1.
> > Meanwhile
> > > > the broker caches the next fetch (x1, 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Jun Rao
Hi, Jiangjie,

Thanks for the info. I was thinking of doing the scan of the log segment on
every fetch request, as we do today. The optimization for this KIP is
probably most useful for real-time consumption, in which case the log
segments that need to be accessed are likely still in the page cache.

Jun
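For readers following the index-scan idea debated here (Becket's description is quoted below), it might be sketched roughly as follows. The entry layout and the corner-case handling are illustrative only, based on what this thread describes, not the broker's actual implementation:

```python
# Rough sketch of serving a fetch from a cached offset-index slot, per the
# approach discussed in this thread.  All names are illustrative.

def serve_fetch(index_entries, cached_slot, fetch_size):
    """index_entries: sorted (offset, file_position) pairs for one segment.
    cached_slot: index of the entry at the next fetch offset (x1, p1).
    Returns (start_pos, end_pos, new_cached_slot)."""
    start_offset, start_pos = index_entries[cached_slot]
    end_slot = cached_slot
    # Advance while the next index entry still fits within fetch_size,
    # i.e. p1 - p0 stays just under the fetch size.
    while (end_slot + 1 < len(index_entries)
           and index_entries[end_slot + 1][1] - start_pos <= fetch_size):
        end_slot += 1
    if end_slot == cached_slot and end_slot + 1 < len(index_entries):
        # Corner case raised in the thread: return at least one index
        # interval even if it exceeds the per-partition fetch size.
        end_slot += 1
    return start_pos, index_entries[end_slot][1], end_slot

# Entries at roughly log.index.interval.bytes spacing (illustrative):
entries = [(0, 0), (10, 4096), (20, 8192), (30, 12288)]
print(serve_fetch(entries, 0, 9000))  # (0, 8192, 2): send bytes [0, 8192)
```

The returned slot is what the broker would cache as (x1, p1) for the next fetch, so subsequent requests iterate from there instead of repeating the binary search.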

On Sun, Dec 10, 2017 at 11:15 PM, Becket Qin  wrote:

> Hi Jun,
>
> I see. Yes, that makes sense. Are we going to do that only for the fetches
> whose per partition fetch size cannot reach the first index entry after the
> fetch position, or are we going to do that for any fetch? If we do that for
> any fetch, then we will still need to read the actual log segment, which
> could be expensive if the data is no longer in the cache. This hurts
> performance if some fetches are on the old log segments.
>
> I took a quick look at the clusters we have. The idle topic ratio varies
> depending on the usage of the cluster. For our metric cluster and database
> replication clusters, almost all the topics are actively used. For tracking
> clusters, ~70% of topics have data coming in at different rates. For other
> clusters, such as queuing and data deployment, there are more idle topics
> and the traffic is more bursty (I don't have the exact numbers here).
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sun, Dec 10, 2017 at 10:17 PM, Colin McCabe  wrote:
>
> > On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> > > Hi, Jiangjie,
> > >
> > > What I described is almost the same as yours. The only extra thing is
> to
> > > scan the log segment from the identified index entry a bit more to
> find a
> > > file position that ends at a message set boundary and is less than the
> > > partition level fetch size. This way, we still preserve the current
> > > semantic of not returning more bytes than fetch size unless there is a
> > > single message set larger than the fetch size.
> > >
> > > In a typical cluster at LinkedIn, what's the percentage of idle
> > > partitions?
> >
> > Yeah, that would be a great number to get.
> >
> > Of course, KIP-227 will also benefit partitions that are not completely
> > idle.  For instance, a partition that's getting just one message a
> > second will appear in many fetch requests, unless every other partition
> > in the system is also only getting a low rate of incoming messages.
> >
> > regards,
> > Colin
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin 
> wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Yes, we still need to handle the corner case. And you are right, it
> is
> > all
> > > > about trade-off between simplicity and the performance gain.
> > > >
> > > > I was thinking that the brokers always return at least
> > > > log.index.interval.bytes per partition to the consumer, just like we
> > will
> > > > return at least one message to the user. This way we don't need to
> > worry
> > > > about the case that the fetch size is smaller than the index
> interval.
> > We
> > > > may just need to let users know this behavior change.
> > > >
> > > > Not sure if I completely understand your solution, but I think we are
> > > > thinking about the same. i.e. for the first fetch asking for offset
> > x0, we
> > > > will need to do a binary search to find the position p0. and then the
> > > > broker will iterate over the index entries starting from the first
> > index
> > > > entry whose offset is greater than p0 until it reaches the index
> > entry(x1,
> > > > p1) so that p1 - p0 is just under the fetch size, but the next entry
> > will
> > > > exceed the fetch size. We then return the bytes from p0 to p1.
> > Meanwhile
> > > > the broker caches the next fetch (x1, p1). So when the next fetch
> > comes, it
> > > > will just iterate over the offset index entry starting at (x1, p1).
> > > >
> > > > It is true that in the above approach, the log compacted topic needs
> > to be
> > > > handled. It seems that this can be solved by checking whether the
> > cached
> > > > index and the new log index are still the same index object. If they
> > are
> > > > not the same, we can fall back to binary search with the cached
> > offset. It
> > > > is admittedly more complicated than the current logic, but given the
> > binary
> > > > search logic already exists, it seems the additional object sanity
> > check is
> > > > not too much work.
> > > >
> > > > Not sure if the above implementation is simple enough to justify the
> > > > performance improvement. Let me know if you see potential complexity.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:
> > > >
> > > > > Hi, Becket,
> > > > >
> > > > > Yes, I agree that it's rare to have the fetch size smaller than
> index
> > > > > interval. It's just that we still need additional code to handle
> the
> > rare
> > > > > case.
> > > > >
> > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-11 Thread Dong Lin
On Thu, Dec 7, 2017 at 1:52 PM, Colin McCabe  wrote:

> On Wed, Dec 6, 2017, at 11:23, Becket Qin wrote:
> > Hi Colin,
> >
> > >A full fetch request will certainly avoid any ambiguity here.  But now
> > >we're back to sending full fetch requests whenever there are network
> > >issues, which is worse than the current proposal.  And has the
> > >congestion collapse problem I talked about earlier when the network is
> > >wobbling.  We also don't get the other debuggability benefits of being
> > >able to uniquely associate each update in the incremental fetch session
> > >with a sequence number.
> >
> > I think we would want to optimize for the normal case instead of the
> > failure case. The failure case is supposed to be rare and if that happens
> > usually it requires human attention to fix anyways. So reducing the
> > regular cost in the normal cases probably makes more sense.
>


Hmm.. let me chime in and ask a quick question on this.

My understanding of Becket's proposal is that the FetchRequest will not
contain per-partition information in the normal cases. According to the
latest KIP, it is said that "Incremental FetchRequests will only contain
information about partitions which have changed on the follower". So if
there is always data available for every partition on the broker, the
FetchRequest will always contain per-partition information for every
partition, which makes it essentially a full FetchRequest in the normal case.
Did I miss something here?



> >
> > Thanks,
>
> Hi Becket,
>
> I agree we should optimize for the normal case.  I believe that the
> sequence number proposal I put forward does this.  All the competing
> proposals have been strictly worse for both the normal and error cases.
> For example, the proposal to rely on the TCP session to establish
> ordering does not help the normal case.  But it does make the case where
> there are network issues worse.  It also makes it harder for us to put a
> limit on the amount of time we will cache, which is worse for the normal
> case.
>
> best,
> Colin
>
> >
> > Jiangjie (Becket) Qin
> >
> > On Wed, Dec 6, 2017 at 10:58 AM, Colin McCabe 
> wrote:
> >
> > > On Wed, Dec 6, 2017, at 10:49, Jason Gustafson wrote:
> > > > >
> > > > > There is already a way in the existing proposal for clients to
> change
> > > > > the set of partitions they are interested in, while re-using their
> same
> > > > > session and session ID.  We don't need to change how sequence ID
> works
> > > > > in order to do this.
> > > >
> > > >
> > > > There is some inconsistency in the KIP about this, so I wasn't sure.
> In
> > > > particular, you say this: " The FetchSession maintains information
> about
> > > > a specific set of relevant partitions.  Note that the set of relevant
> > > > partitions is established when the FetchSession is created.  It
> cannot be
> > > > changed later." Maybe that could be clarified?
> > >
> > > That's a fair point-- I didn't fix this part of the KIP after making an
> > > update below.  So it was definitely unclear.
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > >
> > > > > But how does the broker know that it needs to resend the data for
> > > > > partition P?  After all, if the response had not been dropped, P
> would
> > > > > not have been resent, since it didn't change.  Under the existing
> > > > > scheme, the follower can look at lastDirtyEpoch to find this out.
>  In
> > > > > the new scheme, I don't see how it would know.
> > > >
> > > >
> > > > If a fetch response is lost, the epoch would be bumped by the client
> and
> > > > a
> > > > full fetch would be sent. Doesn't that solve the issue?
> > > >
> > > > -Jason
> > > >
> > > > On Wed, Dec 6, 2017 at 10:40 AM, Colin McCabe 
> > > wrote:
> > > >
> > > > > On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> > > > > > >
> > > > > > > Thinking about this again. I do see the reason that we want to
> > > have a
> > > > > epoch
> > > > > > > to avoid out of order registration of the interested set. But
> I am
> > > > > > > wondering if the following semantic would meet what we want
> better:
> > > > > > >  - Session Id: the id assigned to a single client for life long
> > > time.
> > > > > i.e
> > > > > > > it does not change when the interested partitions change.
> > > > > > >  - Epoch: the interested set epoch. Only updated when a full
> fetch
> > > > > request
> > > > > > > comes, which may result in the interested partition set change.
> > > > > > > This will ensure that the registered interested set will
> always be
> > > the
> > > > > > > latest registration. And the clients can change the interested
> > > > > partition
> > > > > > > set without creating another session.
> > > > > >
> > > > > >
> > > > > > I agree this is a bit more intuitive than the sequence number
> and the
> > > > > > ability to reuse the session is beneficial since it causes less
> > > waste of
> > > > > > the cache for session timeouts.
> > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-10 Thread Becket Qin
Hi Jun,

I see. Yes, that makes sense. Are we going to do that only for the fetches
whose per partition fetch size cannot reach the first index entry after the
fetch position, or are we going to do that for any fetch? If we do that for
any fetch, then we will still need to read the actual log segment, which
could be expensive if the data is no longer in the cache. This hurts
performance if some fetches are on the old log segments.

I took a quick look at the clusters we have. The idle topic ratio varies
depending on the usage of the cluster. For our metric cluster and database
replication clusters, almost all the topics are actively used. For tracking
clusters, ~70% of topics have data coming in at different rates. For other
clusters, such as queuing and data deployment, there are more idle topics
and the traffic is more bursty (I don't have the exact numbers here).

Thanks,

Jiangjie (Becket) Qin

On Sun, Dec 10, 2017 at 10:17 PM, Colin McCabe  wrote:

> On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> > Hi, Jiangjie,
> >
> > What I described is almost the same as yours. The only extra thing is to
> > scan the log segment from the identified index entry a bit more to find a
> > file position that ends at a message set boundary and is less than the
> > partition level fetch size. This way, we still preserve the current
> > semantic of not returning more bytes than fetch size unless there is a
> > single message set larger than the fetch size.
> >
> > In a typical cluster at LinkedIn, what's the percentage of idle
> > partitions?
>
> Yeah, that would be a great number to get.
>
> Of course, KIP-227 will also benefit partitions that are not completely
> idle.  For instance, a partition that's getting just one message a
> second will appear in many fetch requests, unless every other partition
> in the system is also only getting a low rate of incoming messages.
>
> regards,
> Colin
>
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin  wrote:
> >
> > > Hi Jun,
> > >
> > > Yes, we still need to handle the corner case. And you are right, it is
> all
> > > about trade-off between simplicity and the performance gain.
> > >
> > > I was thinking that the brokers always return at least
> > > log.index.interval.bytes per partition to the consumer, just like we
> will
> > > return at least one message to the user. This way we don't need to
> worry
> > > about the case that the fetch size is smaller than the index interval.
> We
> > > may just need to let users know this behavior change.
> > >
> > > Not sure if I completely understand your solution, but I think we are
> > > thinking about the same. i.e. for the first fetch asking for offset
> x0, we
> > > will need to do a binary search to find the position p0. and then the
> > > broker will iterate over the index entries starting from the first
> index
> > > entry whose offset is greater than p0 until it reaches the index
> entry(x1,
> > > p1) so that p1 - p0 is just under the fetch size, but the next entry
> will
> > > exceed the fetch size. We then return the bytes from p0 to p1.
> Meanwhile
> > > the broker caches the next fetch (x1, p1). So when the next fetch
> comes, it
> > > will just iterate over the offset index entry starting at (x1, p1).
> > >
> > > It is true that in the above approach, the log compacted topic needs
> to be
> > > handled. It seems that this can be solved by checking whether the
> cached
> > > index and the new log index are still the same index object. If they
> are
> > > not the same, we can fall back to binary search with the cached
> offset. It
> > > is admittedly more complicated than the current logic, but given the
> binary
> > > search logic already exists, it seems the additional object sanity
> check is
> > > not too much work.
> > >
> > > Not sure if the above implementation is simple enough to justify the
> > > performance improvement. Let me know if you see potential complexity.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:
> > >
> > > > Hi, Becket,
> > > >
> > > > Yes, I agree that it's rare to have the fetch size smaller than index
> > > > interval. It's just that we still need additional code to handle the
> rare
> > > > case.
> > > >
> > > > If you go this far, a more general approach (i.e., without returning
> at
> > > the
> > > > index boundary) is the following. We can cache the following
> metadata for
> > > > the next fetch offset: the file position in the log segment, the
> first
> > > > index slot at or after the file position. When serving a fetch
> request,
> > > we
> > > > scan the index entries from the cached index slot until we hit the
> fetch
> > > > size. We can then send the data at the message set boundary and
> update
> > > the
> > > > cached metadata for the next fetch offset. This is kind of
> complicated,
> > > but
> > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-10 Thread Colin McCabe
On Fri, Dec 8, 2017, at 16:56, Jun Rao wrote:
> Hi, Jiangjie,
> 
> What I described is almost the same as yours. The only extra thing is to
> scan the log segment from the identified index entry a bit more to find a
> file position that ends at a message set boundary and is less than the
> partition level fetch size. This way, we still preserve the current
> semantic of not returning more bytes than fetch size unless there is a
> single message set larger than the fetch size.
> 
> In a typical cluster at LinkedIn, what's the percentage of idle
> partitions?

Yeah, that would be a great number to get.

Of course, KIP-227 will also benefit partitions that are not completely
idle.  For instance, a partition that's getting just one message a
second will appear in many fetch requests, unless every other partition
in the system is also only getting a low rate of incoming messages.

regards,
Colin
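A quick back-of-the-envelope on the "one message a second" point above, under the assumption that fetch rounds are paced by the default 500 ms max wait (an illustrative simplification; real pacing depends on traffic elsewhere in the cluster):

```python
# Back-of-the-envelope: how often does a low-rate partition appear in
# incremental fetch requests?  Assumes fetch rounds are paced by the
# default 500 ms max wait; numbers are illustrative, not measured.
msgs_per_sec = 1
fetch_rounds_per_sec = 1000 / 500  # one round per 500 ms wait
appearance_ratio = min(1.0, msgs_per_sec / fetch_rounds_per_sec)
print(appearance_ratio)  # 0.5 -> the partition appears in ~half of all rounds
```

So even a near-idle partition still shows up in a large fraction of full fetch requests today, which is why KIP-227 helps beyond the completely idle case.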

> 
> Thanks,
> 
> Jun
> 
> 
> On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin  wrote:
> 
> > Hi Jun,
> >
> > Yes, we still need to handle the corner case. And you are right, it is all
> > about trade-off between simplicity and the performance gain.
> >
> > I was thinking that the brokers always return at least
> > log.index.interval.bytes per partition to the consumer, just like we will
> > return at least one message to the user. This way we don't need to worry
> > about the case that the fetch size is smaller than the index interval. We
> > may just need to let users know this behavior change.
> >
> > Not sure if I completely understand your solution, but I think we are
> > thinking about the same. i.e. for the first fetch asking for offset x0, we
> > will need to do a binary search to find the position p0. and then the
> > broker will iterate over the index entries starting from the first index
> > entry whose offset is greater than p0 until it reaches the index entry(x1,
> > p1) so that p1 - p0 is just under the fetch size, but the next entry will
> > exceed the fetch size. We then return the bytes from p0 to p1. Meanwhile
> > the broker caches the next fetch (x1, p1). So when the next fetch comes, it
> > will just iterate over the offset index entry starting at (x1, p1).
> >
> > It is true that in the above approach, the log compacted topic needs to be
> > handled. It seems that this can be solved by checking whether the cached
> > index and the new log index are still the same index object. If they are
> > not the same, we can fall back to binary search with the cached offset. It
> > is admittedly more complicated than the current logic, but given the binary
> > search logic already exists, it seems the additional object sanity check is
> > not too much work.
> >
> > Not sure if the above implementation is simple enough to justify the
> > performance improvement. Let me know if you see potential complexity.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> >
> > On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:
> >
> > > Hi, Becket,
> > >
> > > Yes, I agree that it's rare to have the fetch size smaller than index
> > > interval. It's just that we still need additional code to handle the rare
> > > case.
> > >
> > > If you go this far, a more general approach (i.e., without returning at
> > the
> > > index boundary) is the following. We can cache the following metadata for
> > > the next fetch offset: the file position in the log segment, the first
> > > index slot at or after the file position. When serving a fetch request,
> > we
> > > scan the index entries from the cached index slot until we hit the fetch
> > > size. We can then send the data at the message set boundary and update
> > the
> > > cached metadata for the next fetch offset. This is kind of complicated,
> > but
> > > probably not more than your approach if the corner case has to be
> > handled.
> > >
> > > In both the above approach and your approach, we need the additional
> > logic
> > > to handle compacted topic since a log segment (and therefore its index)
> > can
> > > be replaced between two consecutive fetch requests.
> > >
> > > Overall, I agree that the general approach that you proposed applies more
> > > widely since we get the benefit even when all topics are high volume.
> > It's
> > > just that it would be better if we could think of a simpler
> > implementation.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > That is true, but in reality it seems rare that the fetch size is
> > smaller
> > > > than index interval. In the worst case, we may need to do another look
> > > up.
> > > > In the future, when we have the mechanism to inform the clients about
> > the
> > > > broker configurations, the clients may want to configure
> > correspondingly
> > > as
> > > > well, e.g. max message size, max timestamp difference, etc.
> > > >
> > > > On the other hand, we are not guaranteeing 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-10 Thread Colin McCabe
On Sun, Dec 10, 2017, at 22:10, Colin McCabe wrote:
> On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:
> > Hi,
> > 
> > sorry for the late reply, busy times :-/
> > 
> > I would ask you one thing maybe. Since the timeout
> > argument seems to be settled I have no further argument
> > from your side except the "I don't want to".
> > 
> > Can you see that connection.max.idle.max is the exact time
> > that expresses "We expect the client to be away for this long,
> > and come back and continue"?
> 
> Hi Jan,
> 
> Sure, connection.max.idle.max is the exact time that we want to keep
> around a TCP session.  TCP sessions are relatively cheap, so we can
> afford to keep them around for 10 minutes by default.  Incremental fetch
> state is less cheap, so we want to set a shorter timeout for it.  We
> also want new TCP sessions to be able to reuse an existing incremental
> fetch session rather than creating a new one and waiting for the old one
> to time out.
> 
> > 
> > also clarified some stuff inline
> > 
> > Best Jan
> > 
> > 
> > 
> > 
> > On 05.12.2017 23:14, Colin McCabe wrote:
> > > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > >> Hi Colin
> > >>
> > >> Addressing the topic of how to manage slots from the other thread.
> > >> With tcp connections all this comes for free essentially.
> > > Hi Jan,
> > >
> > > I don't think that it's accurate to say that cache management "comes for
> > > free" by coupling the incremental fetch session with the TCP session.
> > > When a new TCP session is started by a fetch request, you still have to
> > > decide whether to grant that request an incremental fetch session or
> > > not.  If your answer is that you always grant the request, I would argue
> > > that you do not have cache management.
> > First I would say, the client has a big say in this. If the client
> > is not going to issue incremental fetches, it shouldn't ask for a cache.
> > When the client asks for the cache, we still have all options to deny it.
> 
> To put it simply, we have to have some cache management above and beyond
> just giving out an incremental fetch session to anyone who has a TCP
> session.  Therefore, caching does not become simpler if you couple the
> fetch session to the TCP session.
> 
> > 
> > >
> > > I guess you could argue that timeouts are cache management, but I don't
> > > find that argument persuasive.  Anyone could just create a lot of TCP
> > > sessions and use a lot of resources, in that case.  So there is
> > > essentially no limit on memory use.  In any case, TCP sessions don't
> > > help us implement fetch session timeouts.
> > We still have all the options to deny the request to keep the state.
> > What you want seems like a max-connections-per-IP safeguard.
> > I can currently take down a broker with too many connections easily.
> > 
> > 
> > >> I still would argue we disable it by default and make a flag in the
> > >> broker to ask the leader to maintain the cache while replicating and 
> > >> also only
> > >> have it optional in consumers (default to off) so one can turn it on
> > >> where it really hurts.  MirrorMaker and audit consumers prominently.
> > > I agree with Jason's point from earlier in the thread.  Adding extra
> > > configuration knobs that aren't really necessary can harm usability.
> > > Certainly asking people to manually turn on a feature "where it really
> > > hurts" seems to fall in that category, when we could easily enable it
> > > automatically for them.
> > This doesn't make much sense to me.
> 
> There are no tradeoffs to think about from the client's point of view:
> it always wants an incremental fetch session.  So there is no benefit to
> making the clients configure an extra setting.  Updating and managing
> client configurations is also more difficult than managing broker
> configurations for most users.
> 
> > You also wanted to implement
> > a "turn off in case of bug"-knob. Having the client indicate if the
> > feature will be used seems reasonable to me.
> 
> True.  However, if there is a bug, we could also roll back the client,
> so having this configuration knob is not strictly required.
> 
> > >
> > >> Otherwise I left a few remarks in-line, which should help to understand
> > >> my view of the situation better
> > >>
> > >> Best Jan
> > >>
> > >>
> > >> On 05.12.2017 08:06, Colin McCabe wrote:
> > >>> On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> >  On 03.12.2017 21:55, Colin McCabe wrote:
> > > On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> > >> Thanks for the explanation, Colin. A few more questions.
> > >>
> > >>> The session epoch is not complex.  It's just a number which 
> > >>> increments
> > >>> on each incremental fetch.  The session epoch is also useful for
> > >>> debugging-- it allows you to match up requests and responses when
> > >>> looking at log files.
> > >> Currently each request in Kafka has a correlation id to help match 
> > >> the
> > >> requests and 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-10 Thread Colin McCabe
On Fri, Dec 8, 2017, at 01:16, Jan Filipiak wrote:
> Hi,
> 
> sorry for the late reply, busy times :-/
> 
> I would ask you one thing maybe. Since the timeout
> argument seems to be settled I have no further argument
> from your side except the "I don't want to".
> 
> Can you see that connection.max.idle.max is the exact time
> that expresses "We expect the client to be away for this long,
> and come back and continue"?

Hi Jan,

Sure, connection.max.idle.max is the exact time that we want to keep
around a TCP session.  TCP sessions are relatively cheap, so we can
afford to keep them around for 10 minutes by default.  Incremental fetch
state is less cheap, so we want to set a shorter timeout for it.  We
also want new TCP sessions to be able to reuse an existing incremental
fetch session rather than creating a new one and waiting for the old one
to time out.
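The point above -- fetch session state that outlives any single TCP connection but has its own, shorter timeout -- can be sketched roughly as follows. This is a minimal illustration, not Kafka's actual implementation; the class and method names (`FetchSessionCache`, `getOrExpire`) are hypothetical.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: incremental fetch session state is cached with its own
// eviction timeout, shorter than the TCP connection idle timeout, and is
// looked up by session id rather than by connection.
class FetchSessionCache {
    static final long SESSION_TIMEOUT_MS = 2 * 60 * 1000;   // shorter than...
    static final long CONNECTION_IDLE_MS = 10 * 60 * 1000;  // ...the TCP idle default

    static class SessionState {
        long lastUsedMs;
        SessionState(long nowMs) { this.lastUsedMs = nowMs; }
    }

    private final Map<Integer, SessionState> sessions = new LinkedHashMap<>();

    synchronized void put(int sessionId, long nowMs) {
        sessions.put(sessionId, new SessionState(nowMs));
    }

    // A session can be fetched from any TCP connection, so a client that
    // reconnects can keep using its old incremental fetch session.
    synchronized SessionState getOrExpire(int sessionId, long nowMs) {
        expire(nowMs);
        SessionState s = sessions.get(sessionId);
        if (s != null) s.lastUsedMs = nowMs;
        return s; // null => the client must fall back to a full fetch
    }

    private void expire(long nowMs) {
        Iterator<Map.Entry<Integer, SessionState>> it = sessions.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue().lastUsedMs > SESSION_TIMEOUT_MS) it.remove();
        }
    }
}
```

The design point is simply that eviction is keyed on session activity, not connection liveness.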

> 
> also clarified some stuff inline
> 
> Best Jan
> 
> 
> 
> 
> On 05.12.2017 23:14, Colin McCabe wrote:
> > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> >> Hi Colin
> >>
> >> Addressing the topic of how to manage slots from the other thread.
> >> With tcp connections all this comes for free essentially.
> > Hi Jan,
> >
> > I don't think that it's accurate to say that cache management "comes for
> > free" by coupling the incremental fetch session with the TCP session.
> > When a new TCP session is started by a fetch request, you still have to
> > decide whether to grant that request an incremental fetch session or
> > not.  If your answer is that you always grant the request, I would argue
> > that you do not have cache management.
> First I would say, the client has a big say in this. If the client
> is not going to issue incremental fetches, it shouldn't ask for a cache.
> When the client asks for the cache, we still have all options to deny it.

To put it simply, we have to have some cache management above and beyond
just giving out an incremental fetch session to anyone who has a TCP
session.  Therefore, caching does not become simpler if you couple the
fetch session to the TCP session.

> 
> >
> > I guess you could argue that timeouts are cache management, but I don't
> > find that argument persuasive.  Anyone could just create a lot of TCP
> > sessions and use a lot of resources, in that case.  So there is
> > essentially no limit on memory use.  In any case, TCP sessions don't
> > help us implement fetch session timeouts.
> We still have all the options to deny the request to keep the state.
> What you want seems like a max-connections-per-IP safeguard.
> I can currently take down a broker with too many connections easily.
> 
> 
> >> I still would argue we disable it by default and make a flag in the
> >> broker to ask the leader to maintain the cache while replicating and also 
> >> only
> >> have it optional in consumers (default to off) so one can turn it on
> >> where it really hurts.  MirrorMaker and audit consumers prominently.
> > I agree with Jason's point from earlier in the thread.  Adding extra
> > configuration knobs that aren't really necessary can harm usability.
> > Certainly asking people to manually turn on a feature "where it really
> > hurts" seems to fall in that category, when we could easily enable it
> > automatically for them.
> This doesn't make much sense to me.

There are no tradeoffs to think about from the client's point of view:
it always wants an incremental fetch session.  So there is no benefit to
making the clients configure an extra setting.  Updating and managing
client configurations is also more difficult than managing broker
configurations for most users.

> You also wanted to implement
> a "turn off in case of bug"-knob. Having the client indicate if the
> feature will be used seems reasonable to me.

True.  However, if there is a bug, we could also roll back the client,
so having this configuration knob is not strictly required.

> >
> >> Otherwise I left a few remarks in-line, which should help to understand
> >> my view of the situation better
> >>
> >> Best Jan
> >>
> >>
> >> On 05.12.2017 08:06, Colin McCabe wrote:
> >>> On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
>  On 03.12.2017 21:55, Colin McCabe wrote:
> > On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> >> Thanks for the explanation, Colin. A few more questions.
> >>
> >>> The session epoch is not complex.  It's just a number which increments
> >>> on each incremental fetch.  The session epoch is also useful for
> >>> debugging-- it allows you to match up requests and responses when
> >>> looking at log files.
> >> Currently each request in Kafka has a correlation id to help match the
> >> requests and responses. Is epoch doing something differently?
> > Hi Becket,
> >
> > The correlation ID is used within a single TCP session, to uniquely
> > associate a request with a response.  The correlation ID is not unique
> > (and has no meaning) outside the context of that single TCP 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-08 Thread Jun Rao
Hi, Jiangjie,

What I described is almost the same as yours. The only extra thing is to
scan the log segment from the identified index entry a bit more to find a
file position that ends at a message set boundary and is less than the
partition level fetch size. This way, we still preserve the current
semantic of not returning more bytes than fetch size unless there is a
single message set larger than the fetch size.

In a typical cluster at LinkedIn, what's the percentage of idle
partitions?

Thanks,

Jun


On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin  wrote:

> Hi Jun,
>
> Yes, we still need to handle the corner case. And you are right, it is all
> about trade-off between simplicity and the performance gain.
>
> I was thinking that the brokers always return at least
> log.index.interval.bytes per partition to the consumer, just like we will
> return at least one message to the user. This way we don't need to worry
> about the case that the fetch size is smaller than the index interval. We
> may just need to let users know this behavior change.
>
> Not sure if I completely understand your solution, but I think we are
> thinking about the same. i.e. for the first fetch asking for offset x0, we
> will need to do a binary search to find the position p0. and then the
> broker will iterate over the index entries starting from the first index
> entry whose offset is greater than p0 until it reaches the index entry(x1,
> p1) so that p1 - p0 is just under the fetch size, but the next entry will
> exceed the fetch size. We then return the bytes from p0 to p1. Meanwhile
> the broker caches the next fetch (x1, p1). So when the next fetch comes, it
> will just iterate over the offset index entry starting at (x1, p1).
>
> It is true that in the above approach, the log compacted topic needs to be
> handled. It seems that this can be solved by checking whether the cached
> index and the new log index are still the same index object. If they are
> not the same, we can fall back to binary search with the cached offset. It
> is admittedly more complicated than the current logic, but given the binary
> search logic already exists, it seems the additional object sanity check is
> not too much work.
>
> Not sure if the above implementation is simple enough to justify the
> performance improvement. Let me know if you see potential complexity.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
> On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:
>
> > Hi, Becket,
> >
> > Yes, I agree that it's rare to have the fetch size smaller than index
> > interval. It's just that we still need additional code to handle the rare
> > case.
> >
> > If you go this far, a more general approach (i.e., without returning at
> the
> > index boundary) is the following. We can cache the following metadata for
> > the next fetch offset: the file position in the log segment, the first
> > index slot at or after the file position. When serving a fetch request,
> we
> > scan the index entries from the cached index slot until we hit the fetch
> > size. We can then send the data at the message set boundary and update
> the
> > cached metadata for the next fetch offset. This is kind of complicated,
> but
> > probably not more than your approach if the corner case has to be
> handled.
> >
> > In both the above approach and your approach, we need the additional
> logic
> > to handle compacted topic since a log segment (and therefore its index)
> can
> > be replaced between two consecutive fetch requests.
> >
> > Overall, I agree that the general approach that you proposed applies more
> > widely since we get the benefit even when all topics are high volume.
> It's
> > just that it would be better if we could think of a simpler
> implementation.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> >
> > > Hi Jun,
> > >
> > > That is true, but in reality it seems rare that the fetch size is
> smaller
> > > than index interval. In the worst case, we may need to do another look
> > up.
> > > In the future, when we have the mechanism to inform the clients about
> the
> > > broker configurations, the clients may want to configure
> correspondingly
> > as
> > > well, e.g. max message size, max timestamp difference, etc.
> > >
> > > On the other hand, we are not guaranteeing that the returned bytes in a
> > > partition is always bounded by the per partition fetch size, because we
> > are
> > > going to return at least one message, so the per partition fetch size
> > seems
> > > already a soft limit. Since we are introducing a new fetch protocol and
> > > this is related, it might be worth considering this option.
> > >
> > > BTW, one reason I bring this up again was because yesterday we had a
> > > presentation from Uber regarding the end to end latency. And they are
> > > seeing this binary search behavior impacting the latency due to page
> > in/out
> > > of the index file.
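The cached-metadata approach discussed in this exchange -- remember, for the next fetch offset, the file position and the first index slot at or after it, then scan forward instead of re-doing a binary search -- can be sketched like this. All names here are illustrative, not Kafka's log-layer internals, and the corner cases discussed above (fetch size below the index interval, compacted segments being replaced) are deliberately left out.

```java
import java.util.List;

// Hypothetical sketch of serving a fetch from cached index metadata.
class FetchPositionCache {
    static class IndexEntry {
        final long offset;    // first message offset at this index point
        final long position;  // byte position in the log segment file
        IndexEntry(long offset, long position) { this.offset = offset; this.position = position; }
    }

    static class Cached {
        final long fetchOffset;  // next offset the consumer will ask for
        final long filePosition; // byte position of that offset in the segment
        final int indexSlot;     // first index slot at or after filePosition
        Cached(long o, long p, int s) { fetchOffset = o; filePosition = p; indexSlot = s; }
    }

    // Scan index entries from the cached slot until the next entry would
    // exceed maxBytes; the bytes [cur.filePosition, endPos) are what gets
    // returned, and the returned Cached is stored for the following fetch.
    static Cached advance(Cached cur, List<IndexEntry> index, int maxBytes) {
        int slot = cur.indexSlot;
        long endPos = cur.filePosition;
        long endOffset = cur.fetchOffset;
        while (slot < index.size()
                && index.get(slot).position - cur.filePosition <= maxBytes) {
            endPos = index.get(slot).position;
            endOffset = index.get(slot).offset;
            slot++;
        }
        return new Cached(endOffset, endPos, slot);
    }
}
```

Only the first fetch for an offset needs a binary search; every subsequent fetch is a forward scan from the cached slot.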

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-08 Thread Jan Filipiak


On 08.12.2017 10:43, Ismael Juma wrote:

One correction below.

On Fri, Dec 8, 2017 at 11:16 AM, Jan Filipiak 
wrote:


We only check max.message.bytes too late to guard against consumer stalling.
We don't have a notion of max.networkpacket.size before we allocate the
bytebuffer to read it into.


We do: socket.request.max.bytes.

Ismael



Perfect, didn't know we had this in the meantime. :) Good that we have it.

It's a very good safeguard, and a nice fail-fast for dodgy clients or 
network interfaces.




Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-08 Thread Ismael Juma
One correction below.

On Fri, Dec 8, 2017 at 11:16 AM, Jan Filipiak 
wrote:

> We only check max.message.bytes too late to guard against consumer stalling.
> We don't have a notion of max.networkpacket.size before we allocate the
> bytebuffer to read it into.


We do: socket.request.max.bytes.

Ismael
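The safeguard being referenced -- rejecting an oversized request based on its size header before allocating a buffer for it -- works roughly like the sketch below. The class and method names are illustrative; only the config name `socket.request.max.bytes` (default 100 MiB) comes from the discussion above.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a fail-fast request size check: read the 4-byte
// size prefix and reject before any large allocation happens.
class RequestSizeGuard {
    // Illustrative limit, mirroring socket.request.max.bytes' default.
    static final int MAX_REQUEST_BYTES = 100 * 1024 * 1024;

    // Returns the payload size if acceptable; throws before allocating.
    static int checkSize(ByteBuffer sizeHeader) {
        int size = sizeHeader.getInt(0);
        if (size < 0 || size > MAX_REQUEST_BYTES)
            throw new IllegalArgumentException(
                "request of " + size + " bytes exceeds the configured limit");
        return size; // now safe to allocate a buffer of this size
    }
}
```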


Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-08 Thread Jan Filipiak

Hi,

sorry for the late reply, busy times :-/

I would ask you one thing maybe. Since the timeout
argument seems to be settled I have no further argument
from your side except the "I don't want to".

Can you see that connection.max.idle.max is the exact time
that expresses "We expect the client to be away for this long,
and come back and continue"?

also clarified some stuff inline

Best Jan




On 05.12.2017 23:14, Colin McCabe wrote:

On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:

Hi Colin

Addressing the topic of how to manage slots from the other thread.
With tcp connections all this comes for free essentially.

Hi Jan,

I don't think that it's accurate to say that cache management "comes for
free" by coupling the incremental fetch session with the TCP session.
When a new TCP session is started by a fetch request, you still have to
decide whether to grant that request an incremental fetch session or
not.  If your answer is that you always grant the request, I would argue
that you do not have cache management.

First I would say, the client has a big say in this. If the client
is not going to issue incremental fetches, it shouldn't ask for a cache.
When the client asks for the cache, we still have all options to deny it.



I guess you could argue that timeouts are cache management, but I don't
find that argument persuasive.  Anyone could just create a lot of TCP
sessions and use a lot of resources, in that case.  So there is
essentially no limit on memory use.  In any case, TCP sessions don't
help us implement fetch session timeouts.

We still have all the options to deny the request to keep the state.
What you want seems like a max-connections-per-IP safeguard.
I can currently take down a broker with too many connections easily.



I still would argue we disable it by default and make a flag in the
broker to ask the leader to maintain the cache while replicating and also only
have it optional in consumers (default to off) so one can turn it on
where it really hurts.  MirrorMaker and audit consumers prominently.

I agree with Jason's point from earlier in the thread.  Adding extra
configuration knobs that aren't really necessary can harm usability.
Certainly asking people to manually turn on a feature "where it really
hurts" seems to fall in that category, when we could easily enable it
automatically for them.

This doesn't make much sense to me. You also wanted to implement
a "turn off in case of bug"-knob. Having the client indicate if the feature
will be used seems reasonable to me.



Otherwise I left a few remarks in-line, which should help to understand
my view of the situation better

Best Jan


On 05.12.2017 08:06, Colin McCabe wrote:

On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:

On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside the context of that single TCP session.

Keep in mind, NetworkClient is in charge of TCP sessions, and generally
tries to hide that information from the upper layers of the code.  So
when you submit a request to NetworkClient, you don't know if that
request creates a TCP session, or reuses an existing one.

Unfortunately, this doesn't work.  Imagine the client misses an
increment fetch response about a partition.  And then the partition is
never updated after that.  The client has no way to know about the
partition, since it won't be included in any future incremental fetch
responses.  And there are no offsets to compare, since the partition is
simply omitted from the response.

I am curious in which situation the follower would miss a response
of a partition. If the entire FetchResponse is lost (e.g. timeout), the
follower would disconnect and retry. That will result in sending a full
FetchRequest.

Basically, you are proposing that we rely on TCP for reliable delivery
in a distributed system.  That isn't a good idea for a bunch of
different reasons.  First of all, TCP timeouts tend to be very long.  So
if the TCP session timing out is your error detection mechanism, you
have to wait minutes for messages to timeout.  Of course, we add a
timeout on top of that after which we declare the connection bad and
manually close it.  But just because the session is closed on one end
doesn't mean that the other end knows that it is closed.  So the leader
may have to wait quite a long time before TCP decides that yes,
connection X from the follower 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-07 Thread Colin McCabe
On Wed, Dec 6, 2017, at 17:07, Jun Rao wrote:
> Hi, Colin,
> 
> Thanks for the KIP. A few comments below.
> 
> 20. Not sure that I fully understand session ID and session epoch. Is
> session ID tied to a socket connection?

Hi Jun,

No, the session ID is not tied to a socket connection.

> That is if the client creates a
> new connection, will a new session ID be created?

The client can choose to reuse an existing fetch session, or create a
new one, as it chooses.

> 
> 21. "The partition becomes dirty when". It seems that we should include
> log end offset changes.

Good point.  We certainly want to consider the partition dirty when new
data is added.  I thought this was listed already in the wiki, but it
wasn't.  Fixed.

> 
> 22. The proposal introduces a response level of error code. Could we just
> set the same error code at the partition level? Not sure if we need to
> optimize the performance in the error case. This avoids the need for
> validating the consistency of error code at different levels.

If we receive an incremental fetch request with an invalid session ID,
we don't know what partitions were supposed to be included in the
incremental fetch.  That information is included in the session.  But
the session doesn't exist.  I suppose we could create a fake partition
entry and set an error code in for it.  But that seems like an ugly
hack.

In general, I think all batch requests should have both a request-level
error code and a batch-element-level error code.  Nearly all RPC systems
have this.  The fact that we lack it makes it impossible to handle
certain errors correctly.  For example, if someone sends an message with
an unknown API code, we can't respond at all, because we don't know the
schema of the unknown API type.  Instead we just have to close the
connection and hope the client goes away.

The fact that per-batch errors are lumped with per-request errors
complicates the error handling in the AdminClient as well.  For example,
we would like to handle NotControllerException in a general way, by
refreshing metadata and retrying the whole request.  But instead, we
have to have separate code to handle NotControllerException for each
different request type, because per-batch errors are encoded differently
for each request type.  So while I understand the desire to avoid
changing the error handling, I think adding a per-request error code is
probably better here.

> 
> 23. As other people have mentioned, currently, if the leader has more
> data to give than the request level max_bytes, the leader tries to give out
> the data in a fair way across partitions. This is currently achieved by
> rotating the partition list in the fetch request. We probably need to
> take that into consideration somehow if the full partition list is not always
> included in the fetch request.

That's a very good point!  I think the easiest way to handle it is to
lexicographically order the partitions, and then start at the partition
whose index is (sequence_number % num_partitions).  I'll include this in
the next update.

regards,
Colin
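The rotation idea sketched in the reply above -- lexicographically order the partitions and start serving at index (sequence_number % num_partitions), so no partition is starved when the response hits the request-level max_bytes limit -- could look like this. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: pick a fair serving order for partitions, rotated
// by the fetch sequence number so the starting partition changes each time.
class FairRotation {
    static List<String> serveOrder(List<String> partitions, long sequenceNumber) {
        List<String> sorted = new ArrayList<>(partitions);
        Collections.sort(sorted);                       // deterministic order
        int start = (int) (sequenceNumber % sorted.size());
        List<String> order = new ArrayList<>(sorted.subList(start, sorted.size()));
        order.addAll(sorted.subList(0, start));         // wrap around
        return order;
    }
}
```

Each partition eventually reaches the front of the order as the sequence number advances, which preserves the fairness of the current rotate-the-request-list behavior.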

> 
> Thanks,
> 
> Jun
> 
> 
> On Wed, Dec 6, 2017 at 11:20 AM, Colin McCabe  wrote:
> 
> > Hi Becket,
> >
> > Thanks for the ideas.  It's interesting to think about how we could
> > further optimize fetches.
> >
> > For now, I think sending offsets in the FetchRequest is still important,
> > because of how we do replication.  The leader considers followers caught
> > up with an offset when they fetch the following offset.  I think we
> > should hold off on changing how replication works in this KIP, and
> > consider these ideas for follow-on KIPs.
> >
> > (more responses below)
> >
> > On Tue, Dec 5, 2017, at 21:38, Becket Qin wrote:
> > > Hi Jun,
> > >
> > > That is true, but in reality it seems rare that the fetch size is smaller
> > > than index interval. In the worst case, we may need to do another look
> > > up. In the future, when we have the mechanism to inform the clients
> > about the
> > > broker configurations, the clients may want to configure correspondingly
> > > as well, e.g. max message size, max timestamp difference, etc.
> > >
> > > On the other hand, we are not guaranteeing that the returned bytes in a
> > > partition is always bounded by the per partition fetch size, because we
> > > are going to return at least one message, so the per partition fetch size
> > > seems already a soft limit. Since we are introducing a new fetch
> > protocol and
> > > this is related, it might be worth considering this option.
> > >
> > > BTW, one reason I bring this up again was because yesterday we had a
> > > presentation from Uber regarding the end to end latency. And they are
> > > seeing this binary search behavior impacting the latency due to page
> > > in/out of the index file.
> >
> > I'm surprised that the index file would get paged out.
> >
> > We don't really need to change the wire protocol to avoid looking at the
> > index file, though.
> >
> > For 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-07 Thread Colin McCabe
On Thu, Dec 7, 2017, at 08:57, Jason Gustafson wrote:
> Hey Colin,
> 
> A full fetch request will certainly avoid any ambiguity here.  But now
> > we're back to sending full fetch requests whenever there are network
> > issues, which is worse than the current proposal.  And has the
> > congestion collapse problem I talked about earlier when the network is
> > wobbling.
> 
> 
> The suggestion was to only bump the epoch (and do a full fetch) when we
> lose a fetch response or when the fetched partitions have changed. As far
> as I can tell, even with sequence numbers, we'd have to do the same.
> Maybe I'm missing something?

Hi Jason,

With the current proposal, you do not have to fall back to a full fetch
if you lose a response.  If the TCP session drops and you do not get a
response to your request, you can simply establish a new TCP session and
send the exact same response, and get the data you need.  This has the
added bonus that you don't have to change how you interact with
NetworkClient-- you can just resend the same message after an I/O error,
without introducing subtle bugs.  I will update the wiki page in a
little bit... hopefully this will be clearer then.

best,
Colin
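One way to picture why resending the identical request is safe under the scheme described above: the broker can accept either the next epoch (normal progress) or a repeat of the last epoch it served (a resend after an I/O error), and reject anything else. This is a toy model under those assumptions, not Kafka's actual session handling.

```java
// Purely illustrative: a broker-side epoch check that tolerates an exact
// resend of the last request, so a client can reconnect and replay it.
class EpochCheckSketch {
    private int lastEpoch = 0;

    String handle(int epoch) {
        if (epoch == lastEpoch + 1) {      // normal progress
            lastEpoch = epoch;
            return "DATA";
        } else if (epoch == lastEpoch) {   // duplicate resend after an I/O error
            return "DATA";
        } else {                           // out-of-order or stale request
            return "INVALID_FETCH_SESSION_EPOCH";
        }
    }
}
```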

> 
> -Jason
> 
> 
> On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin  wrote:
> 
> > Hi Jun,
> >
> > Yes, we still need to handle the corner case. And you are right, it is all
> > about trade-off between simplicity and the performance gain.
> >
> > I was thinking that the brokers always return at least
> > log.index.interval.bytes per partition to the consumer, just like we will
> > return at least one message to the user. This way we don't need to worry
> > about the case that the fetch size is smaller than the index interval. We
> > may just need to let users know this behavior change.
> >
> > Not sure if I completely understand your solution, but I think we are
> > thinking about the same. i.e. for the first fetch asking for offset x0, we
> > will need to do a binary search to find the position p0. and then the
> > broker will iterate over the index entries starting from the first index
> > entry whose offset is greater than p0 until it reaches the index entry(x1,
> > p1) so that p1 - p0 is just under the fetch size, but the next entry will
> > exceed the fetch size. We then return the bytes from p0 to p1. Meanwhile
> > the broker caches the next fetch (x1, p1). So when the next fetch comes, it
> > will just iterate over the offset index entry starting at (x1, p1).
> >
> > It is true that in the above approach, the log compacted topic needs to be
> > handled. It seems that this can be solved by checking whether the cached
> > index and the new log index are still the same index object. If they are
> > not the same, we can fall back to binary search with the cached offset. It
> > is admittedly more complicated than the current logic, but given the binary
> > search logic already exists, it seems the additional object sanity check is
> > not too much work.
> >
> > Not sure if the above implementation is simple enough to justify the
> > performance improvement. Let me know if you see potential complexity.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> >
> > On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:
> >
> > > Hi, Becket,
> > >
> > > Yes, I agree that it's rare to have the fetch size smaller than index
> > > interval. It's just that we still need additional code to handle the rare
> > > case.
> > >
> > > If you go this far, a more general approach (i.e., without returning at
> > the
> > > index boundary) is the following. We can cache the following metadata for
> > > the next fetch offset: the file position in the log segment, the first
> > > index slot at or after the file position. When serving a fetch request,
> > we
> > > scan the index entries from the cached index slot until we hit the fetch
> > > size. We can then send the data at the message set boundary and update
> > the
> > > cached metadata for the next fetch offset. This is kind of complicated,
> > but
> > > probably not more than your approach if the corner case has to be
> > handled.
> > >
> > > In both the above approach and your approach, we need the additional
> > logic
> > > to handle compacted topic since a log segment (and therefore its index)
> > can
> > > be replaced between two consecutive fetch requests.
> > >
> > > Overall, I agree that the general approach that you proposed applies more
> > > widely since we get the benefit even when all topics are high volume.
> > It's
> > > just that it would be better if we could think of a simpler
> > implementation.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > That is true, but in reality it seems rare that the fetch size is
> > smaller
> > > > than index interval. In the worst case, we may need to do another look
> > > up.
> > > > In the future, 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-07 Thread Colin McCabe
On Wed, Dec 6, 2017, at 11:23, Becket Qin wrote:
> Hi Colin,
> 
> >A full fetch request will certainly avoid any ambiguity here.  But now
> >we're back to sending full fetch requests whenever there are network
> >issues, which is worse than the current proposal.  And has the
> >congestion collapse problem I talked about earlier when the network is
> >wobbling.  We also don't get the other debuggability benefits of being
> >able to uniquely associate each update in the incremental fetch session
> >with a sequence number.
> 
> I think we would want to optimize for the normal case instead of the
> failure case. The failure case is supposed to be rare and if that happens
> usually it requires human attention to fix anyways. So reducing the
> regular cost in the normal cases probably makes more sense.
> 
> Thanks,

Hi Becket,

I agree we should optimize for the normal case.  I believe that the
sequence number proposal I put forward does this.  All the competing
proposals have been strictly worse for both the normal and error cases. 
For example, the proposal to rely on the TCP session to establish
ordering does not help the normal case.  But it does make the case where
there are network issues worse.  It also makes it harder for us to put a
limit on the amount of time we will cache, which is worse for the normal
case.

best,
Colin

> 
> Jiangjie (Becket) Qin
> 
> On Wed, Dec 6, 2017 at 10:58 AM, Colin McCabe  wrote:
> 
> > On Wed, Dec 6, 2017, at 10:49, Jason Gustafson wrote:
> > > >
> > > > There is already a way in the existing proposal for clients to change
> > > > the set of partitions they are interested in, while re-using their same
> > > > session and session ID.  We don't need to change how sequence ID works
> > > > in order to do this.
> > >
> > >
> > > There is some inconsistency in the KIP about this, so I wasn't sure. In
> > > particular, you say this: " The FetchSession maintains information about
> > > a specific set of relevant partitions.  Note that the set of relevant
> > > partitions is established when the FetchSession is created.  It cannot be
> > > changed later." Maybe that could be clarified?
> >
> > That's a fair point-- I didn't fix this part of the KIP after making an
> > update below.  So it was definitely unclear.
> >
> > best,
> > Colin
> >
> > >
> > >
> > > > But how does the broker know that it needs to resend the data for
> > > > partition P?  After all, if the response had not been dropped, P would
> > > > not have been resent, since it didn't change.  Under the existing
> > > > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > > > the new scheme, I don't see how it would know.
> > >
> > >
> > > If a fetch response is lost, the epoch would be bumped by the client and
> > > a
> > > full fetch would be sent. Doesn't that solve the issue?
> > >
> > > -Jason
> > >
> > > On Wed, Dec 6, 2017 at 10:40 AM, Colin McCabe 
> > wrote:
> > >
> > > > On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> > > > > >
> > > > > > Thinking about this again. I do see the reason that we want to
> > have a
> > > > epoch
> > > > > > to avoid out of order registration of the interested set. But I am
> > > > > > wondering if the following semantic would meet what we want better:
> > > > > >  - Session Id: the id assigned to a single client for life long
> > time.
> > > > i.e
> > > > > > it does not change when the interested partitions change.
> > > > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > > > request
> > > > > > comes, which may result in the interested partition set change.
> > > > > > This will ensure that the registered interested set will always be
> > the
> > > > > > latest registration. And the clients can change the interested
> > > > partition
> > > > > > set without creating another session.
> > > > >
> > > > >
> > > > > I agree this is a bit more intuitive than the sequence number and the
> > > > > ability to reuse the session is beneficial since it causes less
> > waste of
> > > > > the cache for session timeouts.
> > > >
> > > > Hi Jason,
> > > >
> > > > There is already a way in the existing proposal for clients to change
> > > > the set of partitions they are interested in, while re-using their same
> > > > session and session ID.  We don't need to change how sequence ID works
> > > > in order to do this.
> > > >
> > > > > controlled by the client and a bump of the epoch indicates a full
> > fetch
> > > > > request. The client should also bump the epoch if it fails to
> > receive a
> > > > > fetch response. This ensures that the broker cannot receive an old
> > > > > request after the client has reconnected and sent a new one which
> > > > > could cause an invalid session state.
> > > >
> > > > Hmm... I don't think this quite works.
> > > >
> > > > Let's suppose a broker sends out an incremental fetch response
> > > > containing new data for some partition P.  The sequence number of the

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-07 Thread Jason Gustafson
Hey Colin,

A full fetch request will certainly avoid any ambiguity here.  But now
> we're back to sending full fetch requests whenever there are network
> issues, which is worse than the current proposal.  And has the
> congestion collapse problem I talked about earlier when the network is
> wobbling.


The suggestion was to only bump the epoch (and do a full fetch) when we
lose a fetch response or when the fetched partitions have changed. As far
as I can tell, even with sequence numbers, we'd have to do the same. Maybe
I'm missing something?

-Jason
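To make the epoch-bump rule being discussed concrete, here is a small illustrative sketch of the client-side bookkeeping (this is not Kafka code; the class and field names are hypothetical): the epoch is bumped, and a full fetch sent, only when a previous response was lost or the interested partition set changed; otherwise the request is incremental under the same epoch.

```python
class FetchSessionState:
    """Hypothetical client-side state for one incremental fetch session."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.epoch = 0                 # bumped => next request is a full fetch
        self.partitions = set()        # partitions registered with the broker
        self.awaiting_response = False  # True if the last response never arrived

    def next_request(self, wanted_partitions):
        # A lost response or a changed interested set forces a full fetch.
        if self.awaiting_response or set(wanted_partitions) != self.partitions:
            self.epoch += 1
            self.partitions = set(wanted_partitions)
            full = True
        else:
            full = False
        self.awaiting_response = True
        return {"session_id": self.session_id, "epoch": self.epoch, "full": full}

    def handle_response(self):
        self.awaiting_response = False
```

Under this sketch, the normal case (steady partition set, no lost responses) sends only incremental fetches, matching the optimize-for-the-normal-case argument above.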


On Wed, Dec 6, 2017 at 6:57 PM, Becket Qin  wrote:

> Hi Jun,
>
> Yes, we still need to handle the corner case. And you are right, it is all
> about trade-off between simplicity and the performance gain.
>
> I was thinking that the brokers always return at least
> log.index.interval.bytes per partition to the consumer, just like we will
> return at least one message to the user. This way we don't need to worry
> about the case that the fetch size is smaller than the index interval. We
> may just need to let users know this behavior change.
>
> Not sure if I completely understand your solution, but I think we are
> thinking about the same. i.e. for the first fetch asking for offset x0, we
> will need to do a binary search to find the position p0. and then the
> broker will iterate over the index entries starting from the first index
> entry whose offset is greater than p0 until it reaches the index entry(x1,
> p1) so that p1 - p0 is just under the fetch size, but the next entry will
> exceed the fetch size. We then return the bytes from p0 to p1. Meanwhile
> the broker caches the next fetch (x1, p1). So when the next fetch comes, it
> will just iterate over the offset index entry starting at (x1, p1).
>
> It is true that in the above approach, the log compacted topic needs to be
> handled. It seems that this can be solved by checking whether the cached
> index and the new log index are still the same index object. If they are
> not the same, we can fall back to binary search with the cached offset. It
> is admittedly more complicated than the current logic, but given the binary
> search logic already exists, it seems the additional object sanity check is
> not too much work.
>
> Not sure if the above implementation is simple enough to justify the
> performance improvement. Let me know if you see potential complexity.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
>
> On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:
>
> > Hi, Becket,
> >
> > Yes, I agree that it's rare to have the fetch size smaller than index
> > interval. It's just that we still need additional code to handle the rare
> > case.
> >
> > If you go this far, a more general approach (i.e., without returning at
> the
> > index boundary) is the following. We can cache the following metadata for
> > the next fetch offset: the file position in the log segment, the first
> > index slot at or after the file position. When serving a fetch request,
> we
> > scan the index entries from the cached index slot until we hit the fetch
> > size. We can then send the data at the message set boundary and update
> the
> > cached metadata for the next fetch offset. This is kind of complicated,
> but
> > probably not more than your approach if the corner case has to be
> handled.
> >
> > In both the above approach and your approach, we need the additional
> logic
> > to handle compacted topic since a log segment (and therefore its index)
> can
> > be replaced between two consecutive fetch requests.
> >
> > Overall, I agree that the general approach that you proposed applies more
> > widely since we get the benefit even when all topics are high volume.
> It's
> > just that it would be better if we could think of a simpler
> implementation.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> >
> > > Hi Jun,
> > >
> > > That is true, but in reality it seems rare that the fetch size is
> smaller
> > > than index interval. In the worst case, we may need to do another look
> > up.
> > > In the future, when we have the mechanism to inform the clients about
> the
> > > broker configurations, the clients may want to configure
> correspondingly
> > as
> > > well, e.g. max message size, max timestamp difference, etc.
> > >
> > > On the other hand, we are not guaranteeing that the returned bytes in a
> > > partition is always bounded by the per partition fetch size, because we
> > are
> > > going to return at least one message, so the per partition fetch size
> > seems
> > > already a soft limit. Since we are introducing a new fetch protocol and
> > > this is related, it might be worth considering this option.
> > >
> > > BTW, one reason I bring this up again was because yesterday we had a
> > > presentation from Uber regarding the end to end latency. And they are
> > > seeing this binary search behavior impacting the latency 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Becket Qin
Hi Jun,

Yes, we still need to handle the corner case. And you are right, it is all
about trade-off between simplicity and the performance gain.

I was thinking that the brokers always return at least
log.index.interval.bytes per partition to the consumer, just like we will
return at least one message to the user. This way we don't need to worry
about the case where the fetch size is smaller than the index interval. We
may just need to let users know about this behavior change.

Not sure if I completely understand your solution, but I think we are
thinking about the same thing, i.e. for the first fetch asking for offset
x0, we will need to do a binary search to find the position p0, and then
the broker will iterate over the index entries starting from the first
index entry whose position is greater than p0 until it reaches the index
entry (x1, p1) such that p1 - p0 is just under the fetch size, but the next
entry would exceed the fetch size. We then return the bytes from p0 to p1.
Meanwhile the broker caches the next fetch (x1, p1). So when the next fetch
comes, it will just iterate over the offset index entries starting at (x1,
p1).

It is true that in the above approach, log compacted topics need to be
handled. It seems that this can be solved by checking whether the cached
index and the new log index are still the same index object. If they are
not the same, we can fall back to a binary search with the cached offset.
It is admittedly more complicated than the current logic, but given that
the binary search logic already exists, the additional object sanity check
seems like not too much work.

Not sure if the above implementation is simple enough to justify the
performance improvement. Let me know if you see potential complexity.

Thanks,

Jiangjie (Becket) Qin
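The same-index-object sanity check for compacted topics mentioned above could look roughly like this (an illustrative sketch, not Kafka code; `binary_search` and the cache tuple layout are stand-ins for the existing lookup machinery):

```python
def position_for(fetch_offset, cache, current_index, binary_search):
    """Use the cached position only if the segment's index was not replaced."""
    if cache is not None:
        cached_offset, cached_pos, cached_index = cache
        if cached_offset == fetch_offset and cached_index is current_index:
            return cached_pos                        # fast path: cache still valid
    # Index replaced by compaction, or cache miss: fall back to binary search.
    return binary_search(current_index, fetch_offset)
```

The identity check (`is`) is the "same index object" test: compaction swaps in a new segment and index object, which invalidates the cached position without any extra bookkeeping on the compaction path.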





On Wed, Dec 6, 2017 at 4:48 PM, Jun Rao  wrote:

> Hi, Becket,
>
> Yes, I agree that it's rare to have the fetch size smaller than index
> interval. It's just that we still need additional code to handle the rare
> case.
>
> If you go this far, a more general approach (i.e., without returning at the
> index boundary) is the following. We can cache the following metadata for
> the next fetch offset: the file position in the log segment, the first
> index slot at or after the file position. When serving a fetch request, we
> scan the index entries from the cached index slot until we hit the fetch
> size. We can then send the data at the message set boundary and update the
> cached metadata for the next fetch offset. This is kind of complicated, but
> probably not more than your approach if the corner case has to be handled.
>
> In both the above approach and your approach, we need the additional logic
> to handle compacted topic since a log segment (and therefore its index) can
> be replaced between two consecutive fetch requests.
>
> Overall, I agree that the general approach that you proposed applies more
> widely since we get the benefit even when all topics are high volume. It's
> just that it would be better if we could think of a simpler implementation.
>
> Thanks,
>
> Jun
>
> On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
>
> > Hi Jun,
> >
> > That is true, but in reality it seems rare that the fetch size is smaller
> > than index interval. In the worst case, we may need to do another look
> up.
> > In the future, when we have the mechanism to inform the clients about the
> > broker configurations, the clients may want to configure correspondingly
> as
> > well, e.g. max message size, max timestamp difference, etc.
> >
> > On the other hand, we are not guaranteeing that the returned bytes in a
> > partition is always bounded by the per partition fetch size, because we
> are
> > going to return at least one message, so the per partition fetch size
> seems
> > already a soft limit. Since we are introducing a new fetch protocol and
> > this is related, it might be worth considering this option.
> >
> > BTW, one reason I bring this up again was because yesterday we had a
> > presentation from Uber regarding the end to end latency. And they are
> > seeing this binary search behavior impacting the latency due to page
> in/out
> > of the index file.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> >
> > > Hi, Jiangjie,
> > >
> > > Not sure returning the fetch response at the index boundary is a
> general
> > > solution. The index interval is configurable. If one configures the
> index
> > > interval larger than the per partition fetch size, we probably have to
> > > return data not at the index boundary.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin 
> wrote:
> > >
> > > > Hi Colin,
> > > >
> > > > Thinking about this again. I do see the reason that we want to have a
> > > epoch
> > > > to avoid out of order registration of the interested set. But I am
> > > > wondering if 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Jun Rao
Hi, Colin,

Thanks for the KIP. A few comments below.

20. Not sure that I fully understand session ID and session epoch. Is
session ID tied to a socket connection? That is, if the client creates a new
connection, will a new session ID be created?

21. "The partition becomes dirty when". It seems that we should include log
end offset changes.

22. The proposal introduces a response-level error code. Could we just set
the same error code at the partition level? Not sure we need to optimize
performance in the error case, and this avoids having to validate the
consistency of error codes at different levels.

23. As other people have mentioned, currently, if the leader has more data
to give than the request level max_bytes, the leader tries to give out the
data in a fair way across partitions. This is currently achieved by
rotating the partition list in the fetch request. We probably need to take
that into consideration somehow if the full partition list is not always
included in the fetch request.

Thanks,

Jun
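As a rough illustration of the rotation-based fairness in point 23 (hypothetical names, not the broker's or client's actual code): the partition list is rotated after each fetch so that the greedy fill under the request-level max_bytes starts where the previous response left off, and no partition is permanently starved.

```python
def rotate_after(partitions, last_served):
    """Rotate the list so the next iteration starts just after last_served."""
    i = partitions.index(last_served)
    return partitions[i + 1:] + partitions[:i + 1]

def fill_response(partitions, sizes, max_bytes):
    """Greedily take bytes per partition, in list order, up to max_bytes."""
    served, budget = [], max_bytes
    for p in partitions:
        take = min(sizes[p], budget)
        if take > 0:
            served.append(p)
            budget -= take
        if budget == 0:
            break
    return served
```

With three equally loaded partitions and a budget covering only one and a half of them, rotation guarantees every partition is served across consecutive fetches; with a fixed partition order in an incremental session, that guarantee would need to come from somewhere else.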


On Wed, Dec 6, 2017 at 11:20 AM, Colin McCabe  wrote:

> Hi Becket,
>
> Thanks for the ideas.  It's interesting to think about how we could
> further optimize fetches.
>
> For now, I think sending offsets in the FetchRequest is still important,
> because of how we do replication.  The leader considers followers caught
> up with an offset when they fetch the following offset.  I think we
> should hold off on changing how replication works in this KIP, and
> consider these ideas for follow-on KIPs.
>
> (more responses below)
>
> On Tue, Dec 5, 2017, at 21:38, Becket Qin wrote:
> > Hi Jun,
> >
> > That is true, but in reality it seems rare that the fetch size is smaller
> > than index interval. In the worst case, we may need to do another look
> > up. In the future, when we have the mechanism to inform the clients
> about the
> > broker configurations, the clients may want to configure correspondingly
> > as well, e.g. max message size, max timestamp difference, etc.
> >
> > On the other hand, we are not guaranteeing that the returned bytes in a
> > partition is always bounded by the per partition fetch size, because we
> > are going to return at least one message, so the per partition fetch size
> > seems already a soft limit. Since we are introducing a new fetch
> protocol and
> > this is related, it might be worth considering this option.
> >
> > BTW, one reason I bring this up again was because yesterday we had a
> > presentation from Uber regarding the end to end latency. And they are
> > seeing this binary search behavior impacting the latency due to page
> > in/out of the index file.
>
> I'm surprised that the index file would get paged out.
>
> We don't really need to change the wire protocol to avoid looking at the
> index file, though.
>
> For example, if a follower makes a request for offset X, and the leader
> gives them back messages from X to Y, the leader can be pretty confident
> that the follower's next request will be for offset Y+1.  So the leader
> could cache the file index for offset Y+1 in memory.
>
> best,
> Colin
>
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> >
> > > Hi, Jiangjie,
> > >
> > > Not sure returning the fetch response at the index boundary is a
> general
> > > solution. The index interval is configurable. If one configures the
> index
> > > interval larger than the per partition fetch size, we probably have to
> > > return data not at the index boundary.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin 
> wrote:
> > >
> > > > Hi Colin,
> > > >
> > > > Thinking about this again. I do see the reason that we want to have a
> > > epoch
> > > > to avoid out of order registration of the interested set. But I am
> > > > wondering if the following semantic would meet what we want better:
> > > >  - Session Id: the id assigned to a single client for life long
> time. i.e
> > > > it does not change when the interested partitions change.
> > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > > request
> > > > comes, which may result in the interested partition set change.
> > > > This will ensure that the registered interested set will always be
> the
> > > > latest registration. And the clients can change the interested
> partition
> > > > set without creating another session.
> > > >
> > > > Also I want to bring up the way the leader respond to the
> FetchRequest
> > > > again. I think it would be a big improvement if we just return the
> > > > responses at index entry boundary or log end. There are a few
> benefits:
> > > > 1. The leader does not need the follower to provide the offsets,
> > > > 2. The fetch requests no longer need to do a binary search on the
> index,
> > > it
> > > > just need to do a linear access to the index file, which is much
> cache
> > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Jun Rao
Hi, Becket,

Yes, I agree that it's rare to have the fetch size smaller than index
interval. It's just that we still need additional code to handle the rare
case.

If you go this far, a more general approach (i.e., without returning at the
index boundary) is the following. We can cache the following metadata for
the next fetch offset: the file position in the log segment, the first
index slot at or after the file position. When serving a fetch request, we
scan the index entries from the cached index slot until we hit the fetch
size. We can then send the data at the message set boundary and update the
cached metadata for the next fetch offset. This is kind of complicated, but
probably not more than your approach if the corner case has to be handled.

In both the above approach and your approach, we need additional logic to
handle compacted topics, since a log segment (and therefore its index) can
be replaced between two consecutive fetch requests.

Overall, I agree that the general approach that you proposed applies more
widely since we get the benefit even when all topics are high volume. It's
just that it would be better if we could think of a simpler implementation.

Thanks,

Jun
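The cached-metadata approach described above might be sketched roughly as follows (pure illustration: an in-memory, sorted list of (offset, file_position) pairs stands in for Kafka's offset index, and the class is hypothetical, not Kafka's implementation):

```python
import bisect

class SegmentReader:
    """Serves fetches from one log segment using cached index metadata."""

    def __init__(self, index_entries):
        self.index = index_entries   # sorted [(offset, file_position), ...]
        self.cache = None            # (next_fetch_offset, position, slot)

    def _binary_search(self, offset):
        # Cold path: find the last index entry at or before the fetch offset.
        slot = bisect.bisect_right([o for o, _ in self.index], offset) - 1
        return self.index[slot][1], slot

    def serve(self, fetch_offset, fetch_size):
        if self.cache and self.cache[0] == fetch_offset:
            _, pos, slot = self.cache        # warm path: no binary search
        else:
            pos, slot = self._binary_search(fetch_offset)
        # Scan forward to the last index entry whose position fits the budget.
        end = slot
        while (end + 1 < len(self.index)
               and self.index[end + 1][1] - pos <= fetch_size):
            end += 1
        if end == slot:
            # Corner case discussed above: the fetch size is smaller than one
            # index interval, so return at least one interval to make progress.
            end = min(slot + 1, len(self.index) - 1)
        next_offset, end_pos = self.index[end]
        self.cache = (next_offset, end_pos, end)  # metadata for the next fetch
        return pos, end_pos                       # byte range to send
```

In the expected case where the follower's next fetch offset matches the cached one, serving costs a linear scan of a few index entries rather than a binary search; the compaction caveat (the index object being replaced between fetches) still has to be handled on top of this sketch.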

On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:

> Hi Jun,
>
> That is true, but in reality it seems rare that the fetch size is smaller
> than index interval. In the worst case, we may need to do another look up.
> In the future, when we have the mechanism to inform the clients about the
> broker configurations, the clients may want to configure correspondingly as
> well, e.g. max message size, max timestamp difference, etc.
>
> On the other hand, we are not guaranteeing that the returned bytes in a
> partition is always bounded by the per partition fetch size, because we are
> going to return at least one message, so the per partition fetch size seems
> already a soft limit. Since we are introducing a new fetch protocol and
> this is related, it might be worth considering this option.
>
> BTW, one reason I bring this up again was because yesterday we had a
> presentation from Uber regarding the end to end latency. And they are
> seeing this binary search behavior impacting the latency due to page in/out
> of the index file.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
>
> > Hi, Jiangjie,
> >
> > Not sure returning the fetch response at the index boundary is a general
> > solution. The index interval is configurable. If one configures the index
> > interval larger than the per partition fetch size, we probably have to
> > return data not at the index boundary.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin  wrote:
> >
> > > Hi Colin,
> > >
> > > Thinking about this again. I do see the reason that we want to have a
> > epoch
> > > to avoid out of order registration of the interested set. But I am
> > > wondering if the following semantic would meet what we want better:
> > >  - Session Id: the id assigned to a single client for life long time.
> i.e
> > > it does not change when the interested partitions change.
> > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > request
> > > comes, which may result in the interested partition set change.
> > > This will ensure that the registered interested set will always be the
> > > latest registration. And the clients can change the interested
> partition
> > > set without creating another session.
> > >
> > > Also I want to bring up the way the leader respond to the FetchRequest
> > > again. I think it would be a big improvement if we just return the
> > > responses at index entry boundary or log end. There are a few benefits:
> > > 1. The leader does not need the follower to provide the offsets,
> > > 2. The fetch requests no longer need to do a binary search on the
> index,
> > it
> > > just need to do a linear access to the index file, which is much cache
> > > friendly.
> > >
> > > Assuming the leader can get the last returned offsets to the clients
> > > cheaply, I am still not sure why it is necessary for the followers to
> > > repeat the offsets in the incremental fetch every time. Intuitively it
> > > should only update the offsets when the leader has wrong offsets, in
> most
> > > cases, the incremental fetch request should just be empty. Otherwise we
> > may
> > > not be saving much when there are continuous small requests going to
> each
> > > partition, which could be normal for some low latency systems.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > > On Tue, Dec 5, 2017 at 2:14 PM, Colin McCabe 
> wrote:
> > >
> > > > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > > > Hi Colin
> > > > >
> > > > > Addressing the topic of how to manage slots from the other thread.
> > > > > With tcp connections all this comes for free essentially.
> > > >
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Becket Qin
>For now, I think sending offsets in the FetchRequest is still important,
>because of how we do replication.  The leader considers followers caught
>up with an offset when they fetch the following offset.  I think we
>should hold off on changing how replication works in this KIP, and
>consider these ideas for follow-on KIPs.

I agree that now may not be the best time to do the optimization of
fetching at index entry boundaries. It might be better to do this after we
can return the broker configurations to the clients.
But since we are already discussing changing the fetch protocol, it may be
worth exploring other options to understand what we could achieve.

>For example, if a follower makes a request for offset X, and the leader
>gives them back messages from X to Y, the leader can be pretty confident
>that the follower's next request will be for offset Y+1.  So the leader
>could cache the file index for offset Y+1 in memory.

The offset lookup is to find the position, not the offset. For example, if
we have the following three messages:
m1: offset=0, position=0
m2: offset=1, position=500
m3: offset=2, position=1500

Now a fetch response returns bytes from 0 to 1000. The actual message
returned is only m1 because m2 is from position 500 to position 1499. So
the next fetch request will ask for offset 1. In order to know the position
of offset 1, we need to do the binary search.
Returning at the index entry boundary has the benefit of knowing both
offset and position because an index entry already contains offset and
corresponding position. That is why we can save the binary search.

Thanks,

Jiangjie (Becket) Qin
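The m1/m2/m3 example above can be checked numerically (purely illustrative; m3's end position is chosen arbitrarily since the original example does not give one):

```python
# (offset, start_position, end_position) for the three messages above.
messages = [
    (0, 0, 500),      # m1
    (1, 500, 1500),   # m2
    (2, 1500, 2600),  # m3 (end position is an assumption)
]

def complete_messages(start_pos, end_pos):
    """Offsets of messages fully contained in the returned byte range."""
    return [off for off, s, e in messages if s >= start_pos and e <= end_pos]

returned = complete_messages(0, 1000)    # response covers bytes 0..999
next_fetch_offset = returned[-1] + 1     # the consumer asks for offset 1 next
```

Only m1 fits in the first 1000 bytes, so the next fetch asks for offset 1, whose file position is not known without another index lookup; returning at an index entry boundary sidesteps that lookup because the entry carries both offset and position.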





On Wed, Dec 6, 2017 at 11:20 AM, Colin McCabe  wrote:

> Hi Becket,
>
> Thanks for the ideas.  It's interesting to think about how we could
> further optimize fetches.
>
> For now, I think sending offsets in the FetchRequest is still important,
> because of how we do replication.  The leader considers followers caught
> up with an offset when they fetch the following offset.  I think we
> should hold off on changing how replication works in this KIP, and
> consider these ideas for follow-on KIPs.
>
> (more responses below)
>
> On Tue, Dec 5, 2017, at 21:38, Becket Qin wrote:
> > Hi Jun,
> >
> > That is true, but in reality it seems rare that the fetch size is smaller
> > than index interval. In the worst case, we may need to do another look
> > up. In the future, when we have the mechanism to inform the clients
> about the
> > broker configurations, the clients may want to configure correspondingly
> > as well, e.g. max message size, max timestamp difference, etc.
> >
> > On the other hand, we are not guaranteeing that the returned bytes in a
> > partition is always bounded by the per partition fetch size, because we
> > are going to return at least one message, so the per partition fetch size
> > seems already a soft limit. Since we are introducing a new fetch
> protocol and
> > this is related, it might be worth considering this option.
> >
> > BTW, one reason I bring this up again was because yesterday we had a
> > presentation from Uber regarding the end to end latency. And they are
> > seeing this binary search behavior impacting the latency due to page
> > in/out of the index file.
>
> I'm surprised that the index file would get paged out.
>
> We don't really need to change the wire protocol to avoid looking at the
> index file, though.
>
> For example, if a follower makes a request for offset X, and the leader
> gives them back messages from X to Y, the leader can be pretty confident
> that the follower's next request will be for offset Y+1.  So the leader
> could cache the file index for offset Y+1 in memory.
>
> best,
> Colin
>
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> >
> > > Hi, Jiangjie,
> > >
> > > Not sure returning the fetch response at the index boundary is a
> general
> > > solution. The index interval is configurable. If one configures the
> index
> > > interval larger than the per partition fetch size, we probably have to
> > > return data not at the index boundary.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin 
> wrote:
> > >
> > > > Hi Colin,
> > > >
> > > > Thinking about this again. I do see the reason that we want to have a
> > > epoch
> > > > to avoid out of order registration of the interested set. But I am
> > > > wondering if the following semantic would meet what we want better:
> > > >  - Session Id: the id assigned to a single client for life long
> time. i.e
> > > > it does not change when the interested partitions change.
> > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > > request
> > > > comes, which may result in the interested partition set change.
> > > > This will ensure that the registered interested set will always be
> the
> > > > latest 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Becket Qin
Hi Colin,

>A full fetch request will certainly avoid any ambiguity here.  But now
>we're back to sending full fetch requests whenever there are network
>issues, which is worse than the current proposal.  And has the
>congestion collapse problem I talked about earlier when the network is
>wobbling.  We also don't get the other debuggability benefits of being
>able to uniquely associate each update in the incremental fetch session
>with a sequence number.

I think we would want to optimize for the normal case instead of the
failure case. The failure case is supposed to be rare and if that happens
usually it requires human attention to fix anyways. So reducing the regular
cost in the normal cases probably makes more sense.

Thanks,

Jiangjie (Becket) Qin

On Wed, Dec 6, 2017 at 10:58 AM, Colin McCabe  wrote:

> On Wed, Dec 6, 2017, at 10:49, Jason Gustafson wrote:
> > >
> > > There is already a way in the existing proposal for clients to change
> > > the set of partitions they are interested in, while re-using their same
> > > session and session ID.  We don't need to change how sequence ID works
> > > in order to do this.
> >
> >
> > There is some inconsistency in the KIP about this, so I wasn't sure. In
> > particular, you say this: " The FetchSession maintains information about
> > a specific set of relevant partitions.  Note that the set of relevant
> > partitions is established when the FetchSession is created.  It cannot be
> > changed later." Maybe that could be clarified?
>
> That's a fair point-- I didn't fix this part of the KIP after making an
> update below.  So it was definitely unclear.
>
> best,
> Colin
>
> >
> >
> > > But how does the broker know that it needs to resend the data for
> > > partition P?  After all, if the response had not been dropped, P would
> > > not have been resent, since it didn't change.  Under the existing
> > > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > > the new scheme, I don't see how it would know.
> >
> >
> > If a fetch response is lost, the epoch would be bumped by the client and
> > a
> > full fetch would be sent. Doesn't that solve the issue?
> >
> > -Jason
> >
> > On Wed, Dec 6, 2017 at 10:40 AM, Colin McCabe 
> wrote:
> >
> > > On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> > > > >
> > > > > Thinking about this again. I do see the reason that we want to
> have a
> > > epoch
> > > > > to avoid out of order registration of the interested set. But I am
> > > > > wondering if the following semantic would meet what we want better:
> > > > >  - Session Id: the id assigned to a single client for life long
> time.
> > > i.e
> > > > > it does not change when the interested partitions change.
> > > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > > request
> > > > > comes, which may result in the interested partition set change.
> > > > > This will ensure that the registered interested set will always be
> the
> > > > > latest registration. And the clients can change the interested
> > > partition
> > > > > set without creating another session.
> > > >
> > > >
> > > > I agree this is a bit more intuitive than the sequence number and the
> > > > ability to reuse the session is beneficial since it causes less
> waste of
> > > > the cache for session timeouts.
> > >
> > > Hi Jason,
> > >
> > > There is already a way in the existing proposal for clients to change
> > > the set of partitions they are interested in, while re-using their same
> > > session and session ID.  We don't need to change how sequence ID works
> > > in order to do this.
> > >
> > > > controlled by the client and a bump of the epoch indicates a full
> fetch
> > > > request. The client should also bump the epoch if it fails to
> receive a
> > > > fetch response. This ensures that the broker cannot receive an old
> > > > request after the client has reconnected and sent a new one which
> > > > could cause an invalid session state.
> > >
> > > Hmm... I don't think this quite works.
> > >
> > > Let's suppose a broker sends out an incremental fetch response
> > > containing new data for some partition P.  The sequence number of the
> > > fetch response is 100.  If the follower loses the response, under this
> > > proposed scheme, the follower bumps up the sequence number up to 101
> and
> > > retries.
> > >
> > > But how does the broker know that it needs to resend the data for
> > > partition P?  After all, if the response had not been dropped, P would
> > > not have been resent, since it didn't change.  Under the existing
> > > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > > the new scheme, I don't see how it would know.
> > >
> > > In summary, the incremental fetch sequence ID is useful inside the
> > > broker as well as outside it.
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > > -Jason
> > > >
> > > >
> > > > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Colin McCabe
Hi Becket,

Thanks for the ideas.  It's interesting to think about how we could
further optimize fetches.

For now, I think sending offsets in the FetchRequest is still important,
because of how we do replication.  The leader considers followers caught
up with an offset when they fetch the following offset.  I think we
should hold off on changing how replication works in this KIP, and
consider these ideas for follow-on KIPs.

(more responses below)

On Tue, Dec 5, 2017, at 21:38, Becket Qin wrote:
> Hi Jun,
> 
> That is true, but in reality it seems rare that the fetch size is smaller
> than index interval. In the worst case, we may need to do another look
> up. In the future, when we have the mechanism to inform the clients about the
> broker configurations, the clients may want to configure correspondingly
> as well, e.g. max message size, max timestamp difference, etc.
> 
> On the other hand, we are not guaranteeing that the returned bytes in a
> partition is always bounded by the per partition fetch size, because we
> are going to return at least one message, so the per partition fetch size
> seems already a soft limit. Since we are introducing a new fetch protocol and
> this is related, it might be worth considering this option.
> 
> BTW, one reason I bring this up again was because yesterday we had a
> presentation from Uber regarding the end to end latency. And they are
> seeing this binary search behavior impacting the latency due to page
> in/out of the index file.

I'm surprised that the index file would get paged out.

We don't really need to change the wire protocol to avoid looking at the
index file, though.

For example, if a follower makes a request for offset X, and the leader
gives them back messages from X to Y, the leader can be pretty confident
that the follower's next request will be for offset Y+1.  So the leader
could cache the file index for offset Y+1 in memory.
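
That read-ahead idea could look roughly like this (a toy sketch; all names and structure here are hypothetical, not from the Kafka codebase):

```python
# Illustrative sketch: after serving a fetch that ends at offset Y, the
# leader remembers the file position where offset Y+1 starts, so the
# follower's expected next fetch can skip the offset-index binary search.

class ReadAheadIndexCache:
    def __init__(self):
        # (replica_id, partition) -> (predicted_next_offset, file_position)
        self._hints = {}

    def record_response(self, replica_id, partition, last_offset, next_file_pos):
        # Predict that this follower will next ask for last_offset + 1.
        self._hints[(replica_id, partition)] = (last_offset + 1, next_file_pos)

    def lookup(self, replica_id, partition, requested_offset):
        hint = self._hints.get((replica_id, partition))
        if hint and hint[0] == requested_offset:
            return hint[1]   # cache hit: no index search needed
        return None          # miss: fall back to the binary search

cache = ReadAheadIndexCache()
cache.record_response(replica_id=1, partition="t0-0", last_offset=499,
                      next_file_pos=81920)
print(cache.lookup(1, "t0-0", 500))  # hit -> 81920
print(cache.lookup(1, "t0-0", 700))  # miss -> None
```

The point is only that the hint is checked before the index, so the common sequential-fetch pattern never touches the index file at all.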

best,
Colin

> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> 
> On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> 
> > Hi, Jiangjie,
> >
> > Not sure returning the fetch response at the index boundary is a general
> > solution. The index interval is configurable. If one configures the index
> > interval larger than the per partition fetch size, we probably have to
> > return data not at the index boundary.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin  wrote:
> >
> > > Hi Colin,
> > >
> > > Thinking about this again. I do see the reason that we want to have a
> > epoch
> > > to avoid out of order registration of the interested set. But I am
> > > wondering if the following semantic would meet what we want better:
> > >  - Session Id: the id assigned to a single client for life long time. i.e
> > > it does not change when the interested partitions change.
> > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > request
> > > comes, which may result in the interested partition set change.
> > > This will ensure that the registered interested set will always be the
> > > latest registration. And the clients can change the interested partition
> > > set without creating another session.
> > >
> > > Also I want to bring up the way the leader respond to the FetchRequest
> > > again. I think it would be a big improvement if we just return the
> > > responses at index entry boundary or log end. There are a few benefits:
> > > 1. The leader does not need the follower to provide the offsets,
> > > 2. The fetch requests no longer need to do a binary search on the index,
> > it
> > > just need to do a linear access to the index file, which is much cache
> > > friendly.
> > >
> > > Assuming the leader can get the last returned offsets to the clients
> > > cheaply, I am still not sure why it is necessary for the followers to
> > > repeat the offsets in the incremental fetch every time. Intuitively it
> > > should only update the offsets when the leader has wrong offsets, in most
> > > cases, the incremental fetch request should just be empty. Otherwise we
> > may
> > > not be saving much when there are continuous small requests going to each
> > > partition, which could be normal for some low latency systems.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > > On Tue, Dec 5, 2017 at 2:14 PM, Colin McCabe  wrote:
> > >
> > > > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > > > Hi Colin
> > > > >
> > > > > Addressing the topic of how to manage slots from the other thread.
> > > > > With tcp connections all this comes for free essentially.
> > > >
> > > > Hi Jan,
> > > >
> > > > I don't think that it's accurate to say that cache management "comes
> > for
> > > > free" by coupling the incremental fetch session with the TCP session.
> > > > When a new TCP session is started by a fetch request, you still have to
> > > > decide whether to grant that request an incremental fetch session or

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Colin McCabe
On Wed, Dec 6, 2017, at 10:49, Jason Gustafson wrote:
> >
> > There is already a way in the existing proposal for clients to change
> > the set of partitions they are interested in, while re-using their same
> > session and session ID.  We don't need to change how sequence ID works
> > in order to do this.
> 
> 
> There is some inconsistency in the KIP about this, so I wasn't sure. In
> particular, you say this: " The FetchSession maintains information about
> a specific set of relevant partitions.  Note that the set of relevant
> partitions is established when the FetchSession is created.  It cannot be
> changed later." Maybe that could be clarified?

That's a fair point-- I didn't fix this part of the KIP after making an
update below.  So it was definitely unclear.

best,
Colin

> 
> 
> > But how does the broker know that it needs to resend the data for
> > partition P?  After all, if the response had not been dropped, P would
> > not have been resent, since it didn't change.  Under the existing
> > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > the new scheme, I don't see how it would know.
> 
> 
> If a fetch response is lost, the epoch would be bumped by the client and
> a
> full fetch would be sent. Doesn't that solve the issue?
> 
> -Jason
> 
> On Wed, Dec 6, 2017 at 10:40 AM, Colin McCabe  wrote:
> 
> > On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> > > >
> > > > Thinking about this again. I do see the reason that we want to have a
> > epoch
> > > > to avoid out of order registration of the interested set. But I am
> > > > wondering if the following semantic would meet what we want better:
> > > >  - Session Id: the id assigned to a single client for life long time.
> > i.e
> > > > it does not change when the interested partitions change.
> > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > request
> > > > comes, which may result in the interested partition set change.
> > > > This will ensure that the registered interested set will always be the
> > > > latest registration. And the clients can change the interested
> > partition
> > > > set without creating another session.
> > >
> > >
> > > I agree this is a bit more intuitive than the sequence number and the
> > > ability to reuse the session is beneficial since it causes less waste of
> > > the cache for session timeouts.
> >
> > Hi Jason,
> >
> > There is already a way in the existing proposal for clients to change
> > the set of partitions they are interested in, while re-using their same
> > session and session ID.  We don't need to change how sequence ID works
> > in order to do this.
> >
> > > controlled by the client and a bump of the epoch indicates a full fetch
> > > request. The client should also bump the epoch if it fails to receive a
> > > fetch response. This ensures that the broker cannot receive an old
> > > request after the client has reconnected and sent a new one which
> > > could cause an invalid session state.
> >
> > Hmm... I don't think this quite works.
> >
> > Let's suppose a broker sends out an incremental fetch response
> > containing new data for some partition P.  The sequence number of the
> > fetch response is 100.  If the follower loses the response, under this
> > proposed scheme, the follower bumps up the sequence number up to 101 and
> > retries.
> >
> > But how does the broker know that it needs to resend the data for
> > partition P?  After all, if the response had not been dropped, P would
> > not have been resent, since it didn't change.  Under the existing
> > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > the new scheme, I don't see how it would know.
> >
> > In summary, the incremental fetch sequence ID is useful inside the
> > broker as well as outside it.
> >
> > best,
> > Colin
> >
> > >
> > > -Jason
> > >
> > >
> > > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > That is true, but in reality it seems rare that the fetch size is
> > smaller
> > > > than index interval. In the worst case, we may need to do another look
> > up.
> > > > In the future, when we have the mechanism to inform the clients about
> > the
> > > > broker configurations, the clients may want to configure
> > correspondingly as
> > > > well, e.g. max message size, max timestamp difference, etc.
> > > >
> > > > On the other hand, we are not guaranteeing that the returned bytes in a
> > > > partition is always bounded by the per partition fetch size, because
> > we are
> > > > going to return at least one message, so the per partition fetch size
> > seems
> > > > already a soft limit. Since we are introducing a new fetch protocol and
> > > > this is related, it might be worth considering this option.
> > > >
> > > > BTW, one reason I bring this up again was because yesterday we had a
> > > > presentation from Uber regarding the end to end latency. And they are
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Colin McCabe
On Wed, Dec 6, 2017, at 10:49, Jason Gustafson wrote:
> >
> > There is already a way in the existing proposal for clients to change
> > the set of partitions they are interested in, while re-using their same
> > session and session ID.  We don't need to change how sequence ID works
> > in order to do this.
> 
> 
> There is some inconsistency in the KIP about this, so I wasn't sure. In
> particular, you say this: " The FetchSession maintains information about
> a
> specific set of relevant partitions.  Note that the set of relevant
> partitions is established when the FetchSession is created.  It cannot be
> changed later." Maybe that could be clarified?
> 
> 
> > But how does the broker know that it needs to resend the data for
> > partition P?  After all, if the response had not been dropped, P would
> > not have been resent, since it didn't change.  Under the existing
> > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > the new scheme, I don't see how it would know.
> 
> 
> If a fetch response is lost, the epoch would be bumped by the client and
> a full fetch would be sent. Doesn't that solve the issue?

A full fetch request will certainly avoid any ambiguity here.  But now
we're back to sending full fetch requests whenever there are network
issues, which is worse than the current proposal, and it has the
congestion-collapse problem I talked about earlier when the network is
wobbling.  We also don't get the other debuggability benefits of being
able to uniquely associate each update in the incremental fetch session
with a sequence number.
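
To get a feel for the scale of that amplification, a back-of-the-envelope sketch (the partition counts and entry size below are made up purely for illustration):

```python
# If every lost response forces a full fetch, the request size during a
# network blip is proportional to the total partition count rather than
# to the number of partitions that actually changed.

PARTITIONS = 50_000              # partitions tracked by the session
CHANGED = 100                    # partitions with new data per interval
BYTES_PER_PARTITION_ENTRY = 24   # rough size of one partition entry

incremental = CHANGED * BYTES_PER_PARTITION_ENTRY
full = PARTITIONS * BYTES_PER_PARTITION_ENTRY
print(incremental, full, full // incremental)  # 2400 1200000 500
```

With numbers like these, each retry under the full-fetch fallback costs hundreds of times more request bytes, exactly when the network is least able to absorb them.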

best,
Colin

> 
> -Jason
> 
> On Wed, Dec 6, 2017 at 10:40 AM, Colin McCabe  wrote:
> 
> > On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> > > >
> > > > Thinking about this again. I do see the reason that we want to have a
> > epoch
> > > > to avoid out of order registration of the interested set. But I am
> > > > wondering if the following semantic would meet what we want better:
> > > >  - Session Id: the id assigned to a single client for life long time.
> > i.e
> > > > it does not change when the interested partitions change.
> > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > request
> > > > comes, which may result in the interested partition set change.
> > > > This will ensure that the registered interested set will always be the
> > > > latest registration. And the clients can change the interested
> > partition
> > > > set without creating another session.
> > >
> > >
> > > I agree this is a bit more intuitive than the sequence number and the
> > > ability to reuse the session is beneficial since it causes less waste of
> > > the cache for session timeouts.
> >
> > Hi Jason,
> >
> > There is already a way in the existing proposal for clients to change
> > the set of partitions they are interested in, while re-using their same
> > session and session ID.  We don't need to change how sequence ID works
> > in order to do this.
> >
> > > controlled by the client and a bump of the epoch indicates a full fetch
> > > request. The client should also bump the epoch if it fails to receive a
> > > fetch response. This ensures that the broker cannot receive an old
> > > request after the client has reconnected and sent a new one which
> > > could cause an invalid session state.
> >
> > Hmm... I don't think this quite works.
> >
> > Let's suppose a broker sends out an incremental fetch response
> > containing new data for some partition P.  The sequence number of the
> > fetch response is 100.  If the follower loses the response, under this
> > proposed scheme, the follower bumps up the sequence number up to 101 and
> > retries.
> >
> > But how does the broker know that it needs to resend the data for
> > partition P?  After all, if the response had not been dropped, P would
> > not have been resent, since it didn't change.  Under the existing
> > scheme, the follower can look at lastDirtyEpoch to find this out.   In
> > the new scheme, I don't see how it would know.
> >
> > In summary, the incremental fetch sequence ID is useful inside the
> > broker as well as outside it.
> >
> > best,
> > Colin
> >
> > >
> > > -Jason
> > >
> > >
> > > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > That is true, but in reality it seems rare that the fetch size is
> > smaller
> > > > than index interval. In the worst case, we may need to do another look
> > up.
> > > > In the future, when we have the mechanism to inform the clients about
> > the
> > > > broker configurations, the clients may want to configure
> > correspondingly as
> > > > well, e.g. max message size, max timestamp difference, etc.
> > > >
> > > > On the other hand, we are not guaranteeing that the returned bytes in a
> > > > partition is always bounded by the per partition fetch size, because
> > we are
> > > > going to return at least one message, so the per partition fetch 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Colin McCabe
On Wed, Dec 6, 2017, at 10:02, Jason Gustafson wrote:
> Hey Colin,
> 
> How about this approach?  We have a tunable for minimum eviction time
> > (default 2 minutes).  We cannot evict a client before this timeout has
> > expired.  We also have a tunable for total number of cache slots.  We
> > never cache more than this number of incremental fetch sessions.
> 
> 
> I think that sounds reasonable. For the sake of discussion, one thing I
> was
> thinking about is whether we should decouple sessionId allocation from
> cache slot usage. By that I mean that every fetcher gets a sessionId, but
> it may or may not occupy a cache slot. The fetch response itself could
> indicate whether the next fetch can be incremental or not. The benefit is
> that you can continue to track the frequency of the fetches for a session
> even after its state has been evicted from the cache. That may lead to
> better cache utilization since we can avoid creating new sessions for
> slow fetchers.

Hmm, that's a really interesting idea.  I agree it would be nice to
leave the door open to more optimizations in the future.  Having an ID
for those fetchers would help with that.  Yeah, let me see if I can
include that in the next update...

I will also try to get rid of some of the "magic numbers" in the current
proposal, like using ID 0 to indicate the lack of a session, or using
epoch 0 to indicate reinitializing the session.  As an old C programmer
I have an irrational fondness for magic numbers.  But it's probably
better for maintainability and extensibility just to have flags for this
stuff.
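
A rough sketch combining the eviction tunables quoted above with Jason's decoupled-sessionId idea (all names, parameters, and the LRU policy here are illustrative assumptions, not the KIP's actual design):

```python
import itertools
import time

class FetchSessionCache:
    """Every fetcher gets a sessionId, but only max_slots sessions hold
    cached incremental state, and a cached session cannot be evicted
    before min_eviction_secs have passed since it was last used."""

    def __init__(self, max_slots, min_eviction_secs, clock=time.monotonic):
        self.max_slots = max_slots
        self.min_eviction = min_eviction_secs
        self.clock = clock
        self._ids = itertools.count(1)
        self.slots = {}   # session_id -> timestamp of last use

    def register(self):
        # sessionId allocation is decoupled from cache-slot usage, so the
        # broker can keep tracking a fetcher even when it holds no slot.
        return next(self._ids)

    def try_cache(self, session_id):
        now = self.clock()
        if session_id in self.slots:
            self.slots[session_id] = now
            return True
        if len(self.slots) < self.max_slots:
            self.slots[session_id] = now
            return True
        # Cache full: evict the least-recently-used session, but only if
        # it has gone unused for at least min_eviction_secs.
        victim, last_used = min(self.slots.items(), key=lambda kv: kv[1])
        if now - last_used >= self.min_eviction:
            del self.slots[victim]
            self.slots[session_id] = now
            return True
        return False   # no slot; the response would say "send full fetches"

t = [0.0]
cache = FetchSessionCache(max_slots=1, min_eviction_secs=120,
                          clock=lambda: t[0])
a, b = cache.register(), cache.register()
print(cache.try_cache(a))   # True: a free slot exists
print(cache.try_cache(b))   # False: the only slot was used too recently
t[0] = 200.0
print(cache.try_cache(b))   # True: session a is now old enough to evict
```

The fetch response would then carry a flag telling the client whether its next request may be incremental, which is what lets slow fetchers keep their sessionId while losing their slot.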

best,


> 
> -Jason
> 
> 
> 
> 
> 
> 
> On Wed, Dec 6, 2017 at 9:32 AM, Jason Gustafson 
> wrote:
> 
> > Thinking about this again. I do see the reason that we want to have a epoch
> >> to avoid out of order registration of the interested set. But I am
> >> wondering if the following semantic would meet what we want better:
> >>  - Session Id: the id assigned to a single client for life long time. i.e
> >> it does not change when the interested partitions change.
> >>  - Epoch: the interested set epoch. Only updated when a full fetch request
> >> comes, which may result in the interested partition set change.
> >> This will ensure that the registered interested set will always be the
> >> latest registration. And the clients can change the interested partition
> >> set without creating another session.
> >
> >
> > I agree this is a bit more intuitive than the sequence number and the
> > ability to reuse the session is beneficial since it causes less waste of
> > the cache for session timeouts. I would say that the epoch should be
> > controlled by the client and a bump of the epoch indicates a full fetch
> > request. The client should also bump the epoch if it fails to receive a
> > fetch response. This ensures that the broker cannot receive an old request
> > after the client has reconnected and sent a new one which could cause an
> > invalid session state.
> >
> > -Jason
> >
> >
> > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> >
> >> Hi Jun,
> >>
> >> That is true, but in reality it seems rare that the fetch size is smaller
> >> than index interval. In the worst case, we may need to do another look up.
> >> In the future, when we have the mechanism to inform the clients about the
> >> broker configurations, the clients may want to configure correspondingly
> >> as
> >> well, e.g. max message size, max timestamp difference, etc.
> >>
> >> On the other hand, we are not guaranteeing that the returned bytes in a
> >> partition is always bounded by the per partition fetch size, because we
> >> are
> >> going to return at least one message, so the per partition fetch size
> >> seems
> >> already a soft limit. Since we are introducing a new fetch protocol and
> >> this is related, it might be worth considering this option.
> >>
> >> BTW, one reason I bring this up again was because yesterday we had a
> >> presentation from Uber regarding the end to end latency. And they are
> >> seeing this binary search behavior impacting the latency due to page
> >> in/out
> >> of the index file.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >>
> >>
> >> On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> >>
> >> > Hi, Jiangjie,
> >> >
> >> > Not sure returning the fetch response at the index boundary is a general
> >> > solution. The index interval is configurable. If one configures the
> >> index
> >> > interval larger than the per partition fetch size, we probably have to
> >> > return data not at the index boundary.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin 
> >> wrote:
> >> >
> >> > > Hi Colin,
> >> > >
> >> > > Thinking about this again. I do see the reason that we want to have a
> >> > epoch
> >> > > to avoid out of order registration of the interested set. But I am
> >> > > wondering if the following 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Jason Gustafson
>
> There is already a way in the existing proposal for clients to change
> the set of partitions they are interested in, while re-using their same
> session and session ID.  We don't need to change how sequence ID works
> in order to do this.


There is some inconsistency in the KIP about this, so I wasn't sure. In
particular, you say this: " The FetchSession maintains information about a
specific set of relevant partitions.  Note that the set of relevant
partitions is established when the FetchSession is created.  It cannot be
changed later." Maybe that could be clarified?


> But how does the broker know that it needs to resend the data for
> partition P?  After all, if the response had not been dropped, P would
> not have been resent, since it didn't change.  Under the existing
> scheme, the follower can look at lastDirtyEpoch to find this out.   In
> the new scheme, I don't see how it would know.


If a fetch response is lost, the epoch would be bumped by the client and a
full fetch would be sent. Doesn't that solve the issue?

-Jason

On Wed, Dec 6, 2017 at 10:40 AM, Colin McCabe  wrote:

> On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> > >
> > > Thinking about this again. I do see the reason that we want to have a
> epoch
> > > to avoid out of order registration of the interested set. But I am
> > > wondering if the following semantic would meet what we want better:
> > >  - Session Id: the id assigned to a single client for life long time.
> i.e
> > > it does not change when the interested partitions change.
> > >  - Epoch: the interested set epoch. Only updated when a full fetch
> request
> > > comes, which may result in the interested partition set change.
> > > This will ensure that the registered interested set will always be the
> > > latest registration. And the clients can change the interested
> partition
> > > set without creating another session.
> >
> >
> > I agree this is a bit more intuitive than the sequence number and the
> > ability to reuse the session is beneficial since it causes less waste of
> > the cache for session timeouts.
>
> Hi Jason,
>
> There is already a way in the existing proposal for clients to change
> the set of partitions they are interested in, while re-using their same
> session and session ID.  We don't need to change how sequence ID works
> in order to do this.
>
> > controlled by the client and a bump of the epoch indicates a full fetch
> > request. The client should also bump the epoch if it fails to receive a
> > fetch response. This ensures that the broker cannot receive an old
> > request after the client has reconnected and sent a new one which
> > could cause an invalid session state.
>
> Hmm... I don't think this quite works.
>
> Let's suppose a broker sends out an incremental fetch response
> containing new data for some partition P.  The sequence number of the
> fetch response is 100.  If the follower loses the response, under this
> proposed scheme, the follower bumps up the sequence number up to 101 and
> retries.
>
> But how does the broker know that it needs to resend the data for
> partition P?  After all, if the response had not been dropped, P would
> not have been resent, since it didn't change.  Under the existing
> scheme, the follower can look at lastDirtyEpoch to find this out.   In
> the new scheme, I don't see how it would know.
>
> In summary, the incremental fetch sequence ID is useful inside the
> broker as well as outside it.
>
> best,
> Colin
>
> >
> > -Jason
> >
> >
> > On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> >
> > > Hi Jun,
> > >
> > > That is true, but in reality it seems rare that the fetch size is
> smaller
> > > than index interval. In the worst case, we may need to do another look
> up.
> > > In the future, when we have the mechanism to inform the clients about
> the
> > > broker configurations, the clients may want to configure
> correspondingly as
> > > well, e.g. max message size, max timestamp difference, etc.
> > >
> > > On the other hand, we are not guaranteeing that the returned bytes in a
> > > partition is always bounded by the per partition fetch size, because
> we are
> > > going to return at least one message, so the per partition fetch size
> seems
> > > already a soft limit. Since we are introducing a new fetch protocol and
> > > this is related, it might be worth considering this option.
> > >
> > > BTW, one reason I bring this up again was because yesterday we had a
> > > presentation from Uber regarding the end to end latency. And they are
> > > seeing this binary search behavior impacting the latency due to page
> in/out
> > > of the index file.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > > On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> > >
> > > > Hi, Jiangjie,
> > > >
> > > > Not sure returning the fetch response at the index boundary is a
> general
> > > > solution. The index interval 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Colin McCabe
On Wed, Dec 6, 2017, at 09:32, Jason Gustafson wrote:
> >
> > Thinking about this again. I do see the reason that we want to have a epoch
> > to avoid out of order registration of the interested set. But I am
> > wondering if the following semantic would meet what we want better:
> >  - Session Id: the id assigned to a single client for life long time. i.e
> > it does not change when the interested partitions change.
> >  - Epoch: the interested set epoch. Only updated when a full fetch request
> > comes, which may result in the interested partition set change.
> > This will ensure that the registered interested set will always be the
> > latest registration. And the clients can change the interested partition
> > set without creating another session.
> 
> 
> I agree this is a bit more intuitive than the sequence number and the
> ability to reuse the session is beneficial since it causes less waste of
> the cache for session timeouts.

Hi Jason,

There is already a way in the existing proposal for clients to change
the set of partitions they are interested in, while re-using their same
session and session ID.  We don't need to change how sequence ID works
in order to do this.

> controlled by the client and a bump of the epoch indicates a full fetch
> request. The client should also bump the epoch if it fails to receive a
> fetch response. This ensures that the broker cannot receive an old
> request after the client has reconnected and sent a new one which
> could cause an invalid session state.

Hmm... I don't think this quite works.  

Let's suppose a broker sends out an incremental fetch response
containing new data for some partition P.  The sequence number of the
fetch response is 100.  If the follower loses the response, under this
proposed scheme, the follower bumps the sequence number up to 101 and
retries.

But how does the broker know that it needs to resend the data for
partition P?  After all, if the response had not been dropped, P would
not have been resent, since it didn't change.  Under the existing
scheme, the follower can look at lastDirtyEpoch to find this out.   In
the new scheme, I don't see how it would know.  

In summary, the incremental fetch sequence ID is useful inside the
broker as well as outside it.
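
The per-partition tracking described above can be modeled in a few lines (a toy illustration; the names, and the use of a plain sequence number, are mine rather than the KIP's wire format):

```python
# The broker stamps each partition with the sequence number of the last
# response that included new data for it (analogous to lastDirtyEpoch).
# On a retry it resends every partition dirtied after the sequence number
# of the last response the follower actually received.

class BrokerSession:
    def __init__(self):
        self.last_dirty = {}   # partition -> sequence of last update sent

    def send_update(self, partition, sequence):
        self.last_dirty[partition] = sequence

    def partitions_to_resend(self, last_acked_sequence):
        # Everything sent after the follower's last acknowledged response
        # may have been lost and must be included again.
        return {p for p, seq in self.last_dirty.items()
                if seq > last_acked_sequence}

session = BrokerSession()
session.send_update("P", sequence=100)  # response 100 carried P; it is lost
session.send_update("Q", sequence=99)   # Q last changed in an acked response
# The follower retries; the last response it received was sequence 99.
print(session.partitions_to_resend(last_acked_sequence=99))  # {'P'}
```

Without the per-partition sequence stamps, the broker has no way to compute that set, which is why a bare epoch bump forces a full fetch instead.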

best,
Colin

> 
> -Jason
> 
> 
> On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
> 
> > Hi Jun,
> >
> > That is true, but in reality it seems rare that the fetch size is smaller
> > than index interval. In the worst case, we may need to do another look up.
> > In the future, when we have the mechanism to inform the clients about the
> > broker configurations, the clients may want to configure correspondingly as
> > well, e.g. max message size, max timestamp difference, etc.
> >
> > On the other hand, we are not guaranteeing that the returned bytes in a
> > partition is always bounded by the per partition fetch size, because we are
> > going to return at least one message, so the per partition fetch size seems
> > already a soft limit. Since we are introducing a new fetch protocol and
> > this is related, it might be worth considering this option.
> >
> > BTW, one reason I bring this up again was because yesterday we had a
> > presentation from Uber regarding the end to end latency. And they are
> > seeing this binary search behavior impacting the latency due to page in/out
> > of the index file.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
> >
> > > Hi, Jiangjie,
> > >
> > > Not sure returning the fetch response at the index boundary is a general
> > > solution. The index interval is configurable. If one configures the index
> > > interval larger than the per partition fetch size, we probably have to
> > > return data not at the index boundary.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin  wrote:
> > >
> > > > Hi Colin,
> > > >
> > > > Thinking about this again. I do see the reason that we want to have a
> > > epoch
> > > > to avoid out of order registration of the interested set. But I am
> > > > wondering if the following semantic would meet what we want better:
> > > >  - Session Id: the id assigned to a single client for life long time.
> > i.e
> > > > it does not change when the interested partitions change.
> > > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > > request
> > > > comes, which may result in the interested partition set change.
> > > > This will ensure that the registered interested set will always be the
> > > > latest registration. And the clients can change the interested
> > partition
> > > > set without creating another session.
> > > >
> > > > Also I want to bring up the way the leader respond to the FetchRequest
> > > > again. I think it would be a big improvement if we just return the
> > > > responses at index entry boundary or log end. There are a 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Jason Gustafson
Hey Colin,

> How about this approach?  We have a tunable for minimum eviction time
> (default 2 minutes).  We cannot evict a client before this timeout has
> expired.  We also have a tunable for total number of cache slots.  We
> never cache more than this number of incremental fetch sessions.


I think that sounds reasonable. For the sake of discussion, one thing I was
thinking about is whether we should decouple sessionId allocation from
cache slot usage. By that I mean that every fetcher gets a sessionId, but
it may or may not occupy a cache slot. The fetch response itself could
indicate whether the next fetch can be incremental or not. The benefit is
that you can continue to track the frequency of the fetches for a session
even after its state has been evicted from the cache. That may lead to
better cache utilization since we can avoid creating new sessions for slow
fetchers.
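The eviction policy quoted above (a minimum eviction time plus a bounded number of cache slots) could be sketched roughly as follows. This is a minimal sketch under assumed semantics, not Kafka's implementation; the class name and method signatures are illustrative.

```java
import java.util.*;

// Illustrative sketch of the cache policy discussed above: at most maxSlots
// cached incremental fetch sessions, and a cached session may only be evicted
// once minEvictionMs has elapsed since it was cached.
public class SessionCache {
    private final int maxSlots;
    private final long minEvictionMs;
    // sessionId -> time the session was cached; LinkedHashMap keeps insertion order,
    // so the first entry is always the oldest session.
    private final LinkedHashMap<Integer, Long> cachedAtMs = new LinkedHashMap<>();

    public SessionCache(int maxSlots, long minEvictionMs) {
        this.maxSlots = maxSlots;
        this.minEvictionMs = minEvictionMs;
    }

    // Try to cache a new session at time nowMs; returns true if it got a slot.
    public boolean tryAdd(int sessionId, long nowMs) {
        if (cachedAtMs.size() < maxSlots) {
            cachedAtMs.put(sessionId, nowMs);
            return true;
        }
        // Cache is full: evict the oldest session, but only if it is old enough.
        Map.Entry<Integer, Long> oldest = cachedAtMs.entrySet().iterator().next();
        if (nowMs - oldest.getValue() >= minEvictionMs) {
            cachedAtMs.remove(oldest.getKey());
            cachedAtMs.put(sessionId, nowMs);
            return true;
        }
        return false; // caller falls back to a full (non-incremental) fetch
    }
}
```

Under Jason's decoupling idea, a fetcher denied a slot (the `false` return here) would still keep its sessionId and simply be told to fetch non-incrementally, so its fetch frequency could still be tracked for later admission.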

-Jason






On Wed, Dec 6, 2017 at 9:32 AM, Jason Gustafson  wrote:

> Thinking about this again. I do see the reason that we want to have a epoch
>> to avoid out of order registration of the interested set. But I am
>> wondering if the following semantic would meet what we want better:
>>  - Session Id: the id assigned to a single client for life long time. i.e
>> it does not change when the interested partitions change.
>>  - Epoch: the interested set epoch. Only updated when a full fetch request
>> comes, which may result in the interested partition set change.
>> This will ensure that the registered interested set will always be the
>> latest registration. And the clients can change the interested partition
>> set without creating another session.
>
>
> I agree this is a bit more intuitive than the sequence number and the
> ability to reuse the session is beneficial since it causes less waste of
> the cache for session timeouts. I would say that the epoch should be
> controlled by the client and a bump of the epoch indicates a full fetch
> request. The client should also bump the epoch if it fails to receive a
> fetch response. This ensures that the broker cannot receive an old request
> after the client has reconnected and sent a new one which could cause an
> invalid session state.
>
> -Jason
>
>
> On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:
>
>> Hi Jun,
>>
>> That is true, but in reality it seems rare that the fetch size is smaller
>> than index interval. In the worst case, we may need to do another look up.
>> In the future, when we have the mechanism to inform the clients about the
>> broker configurations, the clients may want to configure correspondingly
>> as
>> well, e.g. max message size, max timestamp difference, etc.
>>
>> On the other hand, we are not guaranteeing that the returned bytes in a
>> partition is always bounded by the per partition fetch size, because we
>> are
>> going to return at least one message, so the per partition fetch size
>> seems
>> already a soft limit. Since we are introducing a new fetch protocol and
>> this is related, it might be worth considering this option.
>>
>> BTW, one reason I bring this up again was because yesterday we had a
>> presentation from Uber regarding the end to end latency. And they are
>> seeing this binary search behavior impacting the latency due to page
>> in/out
>> of the index file.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>>
>> On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
>>
>> > Hi, Jiangjie,
>> >
>> > Not sure returning the fetch response at the index boundary is a general
>> > solution. The index interval is configurable. If one configures the
>> index
>> > interval larger than the per partition fetch size, we probably have to
>> > return data not at the index boundary.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin 
>> wrote:
>> >
>> > > Hi Colin,
>> > >
>> > > Thinking about this again. I do see the reason that we want to have a
>> > epoch
>> > > to avoid out of order registration of the interested set. But I am
>> > > wondering if the following semantic would meet what we want better:
>> > >  - Session Id: the id assigned to a single client for life long time.
>> i.e
>> > > it does not change when the interested partitions change.
>> > >  - Epoch: the interested set epoch. Only updated when a full fetch
>> > request
>> > > comes, which may result in the interested partition set change.
>> > > This will ensure that the registered interested set will always be the
>> > > latest registration. And the clients can change the interested
>> partition
>> > > set without creating another session.
>> > >
>> > > Also I want to bring up the way the leader respond to the FetchRequest
>> > > again. I think it would be a big improvement if we just return the
>> > > responses at index entry boundary or log end. There are a few
>> benefits:
>> > > 1. The leader does not need the follower to provide the offsets,
>> > > 2. The fetch 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-06 Thread Jason Gustafson
>
> Thinking about this again. I do see the reason that we want to have a epoch
> to avoid out of order registration of the interested set. But I am
> wondering if the following semantic would meet what we want better:
>  - Session Id: the id assigned to a single client for life long time. i.e
> it does not change when the interested partitions change.
>  - Epoch: the interested set epoch. Only updated when a full fetch request
> comes, which may result in the interested partition set change.
> This will ensure that the registered interested set will always be the
> latest registration. And the clients can change the interested partition
> set without creating another session.


I agree this is a bit more intuitive than the sequence number, and the
ability to reuse the session is beneficial since it wastes less of the
cache on session timeouts. I would say that the epoch should be
controlled by the client, and a bump of the epoch indicates a full fetch
request. The client should also bump the epoch if it fails to receive a
fetch response. This ensures that the broker cannot receive an old request
after the client has reconnected and sent a new one, which could otherwise
cause invalid session state.
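The client-side epoch rule described here could be sketched as below. This is a hypothetical illustration of the rule, not a real Kafka client class: the client owns the epoch, bumps it to signal a full fetch, and also bumps it after a lost response so a stale in-flight request carries a stale epoch and cannot corrupt broker-side session state.

```java
// Hypothetical sketch of client-controlled epoch handling for incremental
// fetch sessions. Names are illustrative, not Kafka's actual client API.
public class FetchEpoch {
    private int epoch = 0;
    private boolean fullFetchNeeded = true; // first fetch is always full

    // Epoch to attach to the next fetch request.
    public int nextRequestEpoch() {
        if (fullFetchNeeded) {
            epoch++;               // epoch bump => broker treats this as a full fetch
            fullFetchNeeded = false;
        }
        return epoch;              // unchanged epoch => incremental fetch
    }

    // Called when a fetch response is lost or the connection fails.
    public void onResponseLost() {
        // Force a full fetch next time. An old request arriving late at the
        // broker now carries a stale epoch and will be rejected.
        fullFetchNeeded = true;
    }
}
```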

-Jason


On Tue, Dec 5, 2017 at 9:38 PM, Becket Qin  wrote:

> Hi Jun,
>
> That is true, but in reality it seems rare that the fetch size is smaller
> than index interval. In the worst case, we may need to do another look up.
> In the future, when we have the mechanism to inform the clients about the
> broker configurations, the clients may want to configure correspondingly as
> well, e.g. max message size, max timestamp difference, etc.
>
> On the other hand, we are not guaranteeing that the returned bytes in a
> partition is always bounded by the per partition fetch size, because we are
> going to return at least one message, so the per partition fetch size seems
> already a soft limit. Since we are introducing a new fetch protocol and
> this is related, it might be worth considering this option.
>
> BTW, one reason I bring this up again was because yesterday we had a
> presentation from Uber regarding the end to end latency. And they are
> seeing this binary search behavior impacting the latency due to page in/out
> of the index file.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:
>
> > Hi, Jiangjie,
> >
> > Not sure returning the fetch response at the index boundary is a general
> > solution. The index interval is configurable. If one configures the index
> > interval larger than the per partition fetch size, we probably have to
> > return data not at the index boundary.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin  wrote:
> >
> > > Hi Colin,
> > >
> > > Thinking about this again. I do see the reason that we want to have a
> > epoch
> > > to avoid out of order registration of the interested set. But I am
> > > wondering if the following semantic would meet what we want better:
> > >  - Session Id: the id assigned to a single client for life long time.
> i.e
> > > it does not change when the interested partitions change.
> > >  - Epoch: the interested set epoch. Only updated when a full fetch
> > request
> > > comes, which may result in the interested partition set change.
> > > This will ensure that the registered interested set will always be the
> > > latest registration. And the clients can change the interested
> partition
> > > set without creating another session.
> > >
> > > Also I want to bring up the way the leader respond to the FetchRequest
> > > again. I think it would be a big improvement if we just return the
> > > responses at index entry boundary or log end. There are a few benefits:
> > > 1. The leader does not need the follower to provide the offsets,
> > > 2. The fetch requests no longer need to do a binary search on the
> index,
> > it
> > > just need to do a linear access to the index file, which is much cache
> > > friendly.
> > >
> > > Assuming the leader can get the last returned offsets to the clients
> > > cheaply, I am still not sure why it is necessary for the followers to
> > > repeat the offsets in the incremental fetch every time. Intuitively it
> > > should only update the offsets when the leader has wrong offsets, in
> most
> > > cases, the incremental fetch request should just be empty. Otherwise we
> > may
> > > not be saving much when there are continuous small requests going to
> each
> > > partition, which could be normal for some low latency systems.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > >
> > > On Tue, Dec 5, 2017 at 2:14 PM, Colin McCabe 
> wrote:
> > >
> > > > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > > > Hi Colin
> > > > >
> > > > > Addressing the topic of how to manage slots from the other thread.
> > > > > With tcp connections all this comes for free 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Becket Qin
Hi Jun,

That is true, but in reality it seems rare that the fetch size is smaller
than the index interval. In the worst case, we may need to do another lookup.
In the future, when we have a mechanism to inform the clients about the
broker configurations, the clients may want to configure themselves
correspondingly as well, e.g. max message size, max timestamp difference, etc.

On the other hand, we are not guaranteeing that the returned bytes in a
partition are always bounded by the per-partition fetch size, because we are
going to return at least one message, so the per-partition fetch size is
already a soft limit. Since we are introducing a new fetch protocol and
this is related, it might be worth considering this option.

BTW, one reason I bring this up again is that yesterday we had a
presentation from Uber regarding end-to-end latency, and they are
seeing this binary search behavior impact latency due to paging in/out
of the index file.

Thanks,

Jiangjie (Becket) Qin



On Tue, Dec 5, 2017 at 5:55 PM, Jun Rao  wrote:

> Hi, Jiangjie,
>
> Not sure returning the fetch response at the index boundary is a general
> solution. The index interval is configurable. If one configures the index
> interval larger than the per partition fetch size, we probably have to
> return data not at the index boundary.
>
> Thanks,
>
> Jun
>
> On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin  wrote:
>
> > Hi Colin,
> >
> > Thinking about this again. I do see the reason that we want to have a
> epoch
> > to avoid out of order registration of the interested set. But I am
> > wondering if the following semantic would meet what we want better:
> >  - Session Id: the id assigned to a single client for life long time. i.e
> > it does not change when the interested partitions change.
> >  - Epoch: the interested set epoch. Only updated when a full fetch
> request
> > comes, which may result in the interested partition set change.
> > This will ensure that the registered interested set will always be the
> > latest registration. And the clients can change the interested partition
> > set without creating another session.
> >
> > Also I want to bring up the way the leader respond to the FetchRequest
> > again. I think it would be a big improvement if we just return the
> > responses at index entry boundary or log end. There are a few benefits:
> > 1. The leader does not need the follower to provide the offsets,
> > 2. The fetch requests no longer need to do a binary search on the index,
> it
> > just need to do a linear access to the index file, which is much cache
> > friendly.
> >
> > Assuming the leader can get the last returned offsets to the clients
> > cheaply, I am still not sure why it is necessary for the followers to
> > repeat the offsets in the incremental fetch every time. Intuitively it
> > should only update the offsets when the leader has wrong offsets, in most
> > cases, the incremental fetch request should just be empty. Otherwise we
> may
> > not be saving much when there are continuous small requests going to each
> > partition, which could be normal for some low latency systems.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> >
> > On Tue, Dec 5, 2017 at 2:14 PM, Colin McCabe  wrote:
> >
> > > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > > Hi Colin
> > > >
> > > > Addressing the topic of how to manage slots from the other thread.
> > > > With tcp connections all this comes for free essentially.
> > >
> > > Hi Jan,
> > >
> > > I don't think that it's accurate to say that cache management "comes
> for
> > > free" by coupling the incremental fetch session with the TCP session.
> > > When a new TCP session is started by a fetch request, you still have to
> > > decide whether to grant that request an incremental fetch session or
> > > not.  If your answer is that you always grant the request, I would
> argue
> > > that you do not have cache management.
> > >
> > > I guess you could argue that timeouts are cache management, but I don't
> > > find that argument persuasive.  Anyone could just create a lot of TCP
> > > sessions and use a lot of resources, in that case.  So there is
> > > essentially no limit on memory use.  In any case, TCP sessions don't
> > > help us implement fetch session timeouts.
> > >
> > > > I still would argue we disable it by default and make a flag in the
> > > > broker to ask the leader to maintain the cache while replicating and
> > > also only
> > > > have it optional in consumers (default to off) so one can turn it on
> > > > where it really hurts.  MirrorMaker and audit consumers prominently.
> > >
> > > I agree with Jason's point from earlier in the thread.  Adding extra
> > > configuration knobs that aren't really necessary can harm usability.
> > > Certainly asking people to manually turn on a feature "where it really
> > > hurts" seems to fall in that category, when we could easily enable it
> > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Jun Rao
Hi, Jiangjie,

Not sure returning the fetch response at the index boundary is a general
solution. The index interval is configurable. If one configures the index
interval larger than the per partition fetch size, we probably have to
return data not at the index boundary.

Thanks,

Jun

On Tue, Dec 5, 2017 at 4:17 PM, Becket Qin  wrote:

> Hi Colin,
>
> Thinking about this again. I do see the reason that we want to have a epoch
> to avoid out of order registration of the interested set. But I am
> wondering if the following semantic would meet what we want better:
>  - Session Id: the id assigned to a single client for life long time. i.e
> it does not change when the interested partitions change.
>  - Epoch: the interested set epoch. Only updated when a full fetch request
> comes, which may result in the interested partition set change.
> This will ensure that the registered interested set will always be the
> latest registration. And the clients can change the interested partition
> set without creating another session.
>
> Also I want to bring up the way the leader respond to the FetchRequest
> again. I think it would be a big improvement if we just return the
> responses at index entry boundary or log end. There are a few benefits:
> 1. The leader does not need the follower to provide the offsets,
> 2. The fetch requests no longer need to do a binary search on the index, it
> just need to do a linear access to the index file, which is much cache
> friendly.
>
> Assuming the leader can get the last returned offsets to the clients
> cheaply, I am still not sure why it is necessary for the followers to
> repeat the offsets in the incremental fetch every time. Intuitively it
> should only update the offsets when the leader has wrong offsets, in most
> cases, the incremental fetch request should just be empty. Otherwise we may
> not be saving much when there are continuous small requests going to each
> partition, which could be normal for some low latency systems.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
>
> On Tue, Dec 5, 2017 at 2:14 PM, Colin McCabe  wrote:
>
> > On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > > Hi Colin
> > >
> > > Addressing the topic of how to manage slots from the other thread.
> > > With tcp connections all this comes for free essentially.
> >
> > Hi Jan,
> >
> > I don't think that it's accurate to say that cache management "comes for
> > free" by coupling the incremental fetch session with the TCP session.
> > When a new TCP session is started by a fetch request, you still have to
> > decide whether to grant that request an incremental fetch session or
> > not.  If your answer is that you always grant the request, I would argue
> > that you do not have cache management.
> >
> > I guess you could argue that timeouts are cache management, but I don't
> > find that argument persuasive.  Anyone could just create a lot of TCP
> > sessions and use a lot of resources, in that case.  So there is
> > essentially no limit on memory use.  In any case, TCP sessions don't
> > help us implement fetch session timeouts.
> >
> > > I still would argue we disable it by default and make a flag in the
> > > broker to ask the leader to maintain the cache while replicating and
> > also only
> > > have it optional in consumers (default to off) so one can turn it on
> > > where it really hurts.  MirrorMaker and audit consumers prominently.
> >
> > I agree with Jason's point from earlier in the thread.  Adding extra
> > configuration knobs that aren't really necessary can harm usability.
> > Certainly asking people to manually turn on a feature "where it really
> > hurts" seems to fall in that category, when we could easily enable it
> > automatically for them.
> >
> > >
> > > Otherwise I left a few remarks in-line, which should help to understand
> > > my view of the situation better
> > >
> > > Best Jan
> > >
> > >
> > > On 05.12.2017 08:06, Colin McCabe wrote:
> > > > On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> > > >>
> > > >> On 03.12.2017 21:55, Colin McCabe wrote:
> > > >>> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> > >  Thanks for the explanation, Colin. A few more questions.
> > > 
> > > > The session epoch is not complex.  It's just a number which
> > increments
> > > > on each incremental fetch.  The session epoch is also useful for
> > > > debugging-- it allows you to match up requests and responses when
> > > > looking at log files.
> > >  Currently each request in Kafka has a correlation id to help match
> > the
> > >  requests and responses. Is epoch doing something differently?
> > > >>> Hi Becket,
> > > >>>
> > > >>> The correlation ID is used within a single TCP session, to uniquely
> > > >>> associate a request with a response.  The correlation ID is not
> > unique
> > > >>> (and has no meaning) outside the context of that single TCP
> session.
> > > >>>
> > > >>> Keep in mind, 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Becket Qin
Hi Colin,

Thinking about this again, I do see the reason that we want to have an epoch
to avoid out-of-order registration of the interested set. But I am
wondering if the following semantics would meet what we want better:
 - Session Id: the id assigned to a single client for its entire lifetime, i.e.
it does not change when the interested partitions change.
 - Epoch: the interested-set epoch, updated only when a full fetch request
comes in, which may result in a change to the interested partition set.
This will ensure that the registered interested set is always the
latest registration, and that clients can change the interested partition
set without creating another session.

Also, I want to bring up again the way the leader responds to the FetchRequest.
I think it would be a big improvement if we just returned the
responses at an index entry boundary or the log end. There are a few benefits:
1. The leader does not need the follower to provide the offsets.
2. The fetch requests no longer need to do a binary search on the index; they
just need to do a linear scan of the index file, which is much more cache
friendly.

Assuming the leader can get the last returned offsets to the clients
cheaply, I am still not sure why it is necessary for the followers to
repeat the offsets in the incremental fetch every time. Intuitively, they
should only update the offsets when the leader has wrong offsets; in most
cases, the incremental fetch request should just be empty. Otherwise we may
not be saving much when there are continuous small requests going to each
partition, which can be normal for some low-latency systems.
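The linear-access benefit Becket describes can be sketched as follows. This is an illustrative model only (the index structure here is a plain sorted array, not Kafka's OffsetIndex): if every response ends at an index entry boundary, the leader can resume by walking forward from the previous index position, since fetch offsets only move forward, instead of binary-searching the whole index each time.

```java
// Sketch (illustrative, assumed semantics): a cursor over a partition's offset
// index that finds the index entry covering a fetch offset by scanning forward
// from the last position, rather than binary searching.
public class IndexCursor {
    private final long[] indexOffsets; // sorted base offsets of index entries
    private int pos = 0;               // index entry where the last response ended

    public IndexCursor(long[] indexOffsets) {
        this.indexOffsets = indexOffsets;
    }

    // Return the largest index entry offset <= fetchOffset. Because fetch
    // offsets are monotonically increasing, this is a linear, cache-friendly
    // walk that touches only a few adjacent index entries per call.
    public long nextBoundary(long fetchOffset) {
        while (pos + 1 < indexOffsets.length && indexOffsets[pos + 1] <= fetchOffset) {
            pos++;
        }
        return indexOffsets[pos];
    }
}
```

Jun's counterpoint still applies: if the configured index interval is larger than the per-partition fetch size, responses cannot always end on such a boundary.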

Thanks,

Jiangjie (Becket) Qin




On Tue, Dec 5, 2017 at 2:14 PM, Colin McCabe  wrote:

> On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> > Hi Colin
> >
> > Addressing the topic of how to manage slots from the other thread.
> > With tcp connections all this comes for free essentially.
>
> Hi Jan,
>
> I don't think that it's accurate to say that cache management "comes for
> free" by coupling the incremental fetch session with the TCP session.
> When a new TCP session is started by a fetch request, you still have to
> decide whether to grant that request an incremental fetch session or
> not.  If your answer is that you always grant the request, I would argue
> that you do not have cache management.
>
> I guess you could argue that timeouts are cache management, but I don't
> find that argument persuasive.  Anyone could just create a lot of TCP
> sessions and use a lot of resources, in that case.  So there is
> essentially no limit on memory use.  In any case, TCP sessions don't
> help us implement fetch session timeouts.
>
> > I still would argue we disable it by default and make a flag in the
> > broker to ask the leader to maintain the cache while replicating and
> also only
> > have it optional in consumers (default to off) so one can turn it on
> > where it really hurts.  MirrorMaker and audit consumers prominently.
>
> I agree with Jason's point from earlier in the thread.  Adding extra
> configuration knobs that aren't really necessary can harm usability.
> Certainly asking people to manually turn on a feature "where it really
> hurts" seems to fall in that category, when we could easily enable it
> automatically for them.
>
> >
> > Otherwise I left a few remarks in-line, which should help to understand
> > my view of the situation better
> >
> > Best Jan
> >
> >
> > On 05.12.2017 08:06, Colin McCabe wrote:
> > > On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> > >>
> > >> On 03.12.2017 21:55, Colin McCabe wrote:
> > >>> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> >  Thanks for the explanation, Colin. A few more questions.
> > 
> > > The session epoch is not complex.  It's just a number which
> increments
> > > on each incremental fetch.  The session epoch is also useful for
> > > debugging-- it allows you to match up requests and responses when
> > > looking at log files.
> >  Currently each request in Kafka has a correlation id to help match
> the
> >  requests and responses. Is epoch doing something differently?
> > >>> Hi Becket,
> > >>>
> > >>> The correlation ID is used within a single TCP session, to uniquely
> > >>> associate a request with a response.  The correlation ID is not
> unique
> > >>> (and has no meaning) outside the context of that single TCP session.
> > >>>
> > >>> Keep in mind, NetworkClient is in charge of TCP sessions, and
> generally
> > >>> tries to hide that information from the upper layers of the code.  So
> > >>> when you submit a request to NetworkClient, you don't know if that
> > >>> request creates a TCP session, or reuses an existing one.
> > > Unfortunately, this doesn't work.  Imagine the client misses an
> > > increment fetch response about a partition.  And then the
> partition is
> > > never updated after that.  The client has no way to know about the
> > > partition, since it won't be included in 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Colin McCabe
On Tue, Dec 5, 2017, at 13:13, Jan Filipiak wrote:
> Hi Colin
> 
> Addressing the topic of how to manage slots from the other thread.
> With tcp connections all this comes for free essentially.

Hi Jan,

I don't think that it's accurate to say that cache management "comes for
free" by coupling the incremental fetch session with the TCP session. 
When a new TCP session is started by a fetch request, you still have to
decide whether to grant that request an incremental fetch session or
not.  If your answer is that you always grant the request, I would argue
that you do not have cache management.

I guess you could argue that timeouts are cache management, but I don't
find that argument persuasive.  Anyone could just create a lot of TCP
sessions and use a lot of resources, in that case.  So there is
essentially no limit on memory use.  In any case, TCP sessions don't
help us implement fetch session timeouts.

> I still would argue we disable it by default and make a flag in the
> broker to ask the leader to maintain the cache while replicating and also only
> have it optional in consumers (default to off) so one can turn it on 
> where it really hurts.  MirrorMaker and audit consumers prominently.

I agree with Jason's point from earlier in the thread.  Adding extra
configuration knobs that aren't really necessary can harm usability. 
Certainly asking people to manually turn on a feature "where it really
hurts" seems to fall in that category, when we could easily enable it
automatically for them.

> 
> Otherwise I left a few remarks in-line, which should help to understand
> my view of the situation better
> 
> Best Jan
> 
> 
> On 05.12.2017 08:06, Colin McCabe wrote:
> > On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> >>
> >> On 03.12.2017 21:55, Colin McCabe wrote:
> >>> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
>  Thanks for the explanation, Colin. A few more questions.
> 
> > The session epoch is not complex.  It's just a number which increments
> > on each incremental fetch.  The session epoch is also useful for
> > debugging-- it allows you to match up requests and responses when
> > looking at log files.
>  Currently each request in Kafka has a correlation id to help match the
>  requests and responses. Is epoch doing something differently?
> >>> Hi Becket,
> >>>
> >>> The correlation ID is used within a single TCP session, to uniquely
> >>> associate a request with a response.  The correlation ID is not unique
> >>> (and has no meaning) outside the context of that single TCP session.
> >>>
> >>> Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> >>> tries to hide that information from the upper layers of the code.  So
> >>> when you submit a request to NetworkClient, you don't know if that
> >>> request creates a TCP session, or reuses an existing one.
> > Unfortunately, this doesn't work.  Imagine the client misses an
> > increment fetch response about a partition.  And then the partition is
> > never updated after that.  The client has no way to know about the
> > partition, since it won't be included in any future incremental fetch
> > responses.  And there are no offsets to compare, since the partition is
> > simply omitted from the response.
>  I am curious about in which situation would the follower miss a response
>  of a partition. If the entire FetchResponse is lost (e.g. timeout), the
>  follower would disconnect and retry. That will result in sending a full
>  FetchRequest.
> >>> Basically, you are proposing that we rely on TCP for reliable delivery
> >>> in a distributed system.  That isn't a good idea for a bunch of
> >>> different reasons.  First of all, TCP timeouts tend to be very long.  So
> >>> if the TCP session timing out is your error detection mechanism, you
> >>> have to wait minutes for messages to timeout.  Of course, we add a
> >>> timeout on top of that after which we declare the connection bad and
> >>> manually close it.  But just because the session is closed on one end
> >>> doesn't mean that the other end knows that it is closed.  So the leader
> >>> may have to wait quite a long time before TCP decides that yes,
> >>> connection X from the follower is dead and not coming back, even though
> >>> gremlins ate the FIN packet which the follower attempted to translate.
> >>> If the cache state is tied to that TCP session, we have to keep that
> >>> cache around for a much longer time than we should.
> >> Hi,
> >>
> >> I see this from a different perspective. The cache expiry time
> >> has the same semantic as idle connection time in this scenario.
> >> It is the time range we expect the client to come back an reuse
> >> its broker side state. I would argue that on close we would get an
> >> extra shot at cleaning up the session state early. As opposed to
> >> always wait for that duration for expiry to happen.
> > Hi Jan,
> >
> > The idea here is that the incremental 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Ted Yu
Thanks for responding, Colin.

bq. If we have a bunch of small fetch sessions and a bigger client comes
in, we might have to evict many small sessions to fit the bigger one.

Suppose there were N small fetch sessions and 1 big fetch session comes in.
If the plan is to use number of partitions to approximate heap consumption,
that should be good enough, IMHO.
Evicting only one of the N small fetch sessions may not release enough
memory, since the big session would increase the total partition count a lot.
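Ted's partition-count accounting can be sketched concretely. The sketch below is purely illustrative (the names and the idea of a "partition budget" are assumptions, not anything from the KIP): it counts how many small sessions would have to go to fit one big session, with heap approximated by total partition count.

```java
import java.util.List;

/**
 * Illustrative sketch only: approximate each session's heap cost by its
 * partition count, and compute how many cached sessions must be evicted
 * so that an incoming session fits under a total-partition budget.
 */
class HeapBudgetSketch {
    /**
     * @param cached   partition counts of currently cached sessions, in eviction order
     * @param incoming partition count of the new session
     * @param budget   total partition budget standing in for heap
     * @return number of evictions needed, or -1 if it can never fit
     */
    static int evictionsNeeded(List<Integer> cached, int incoming, int budget) {
        if (incoming > budget) return -1;  // too big even for an empty cache
        int used = cached.stream().mapToInt(Integer::intValue).sum();
        int evicted = 0;
        for (int partitions : cached) {
            if (used + incoming <= budget) break;
            used -= partitions;            // evict this session's partitions
            evicted++;
        }
        return evicted;
    }
}
```

This makes the point visible: one big session can require evicting several small ones at once, which a simple per-slot count never has to do.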

Cheers

On Tue, Dec 5, 2017 at 1:44 PM, Colin McCabe  wrote:

> On Tue, Dec 5, 2017, at 11:24, Ted Yu wrote:
> > bq. We also have a tunable for total number of cache slots. We never
> > cache
> > more than this number of incremental fetch sessions.
> >
> > Is it possible to manage the cache based on heap consumption instead of
> > number of slots ?
> > It seems heap estimation can be done by counting PartitionData (along
> > with overhead for related Map structure).
>
> Hi Ted,
>
> That's an interesting idea.  I think it starts to get complicated,
> though.
>
> For example, suppose we later implement incrementally adding partitions
> to the fetch session.  When a fetch session adds more partitions, it
> uses more memory.  So should this trigger an eviction?
>
> If we have a bunch of small fetch sessions and a bigger client comes in,
> we might have to evict many small sessions to fit the bigger one.  But
> we probably do want to fit the bigger one in, since bigger requests gain
> proportionally more from being incremental.
>
> [ Small digression: In general fetch requests have some fixed cost plus
> a variable cost based on the number of partitions.  The more partitions
> you add, the more the variable cost comes to dominate.  Therefore, it is
> especially good to make big fetch requests into incremental fetch
> requests.  Small fetch requests for one or two partitions may not gain
> much, since their cost is dominated by the fixed cost anyway (message
> header, TCP overhead, IP packet overhead, etc.) ]
>
> Overall, I would still lean towards limiting the number of incremental
> fetch sessions, rather than trying to create a per-partition data memory
> limit.  I think the complexity is probably not worth it.  The memory
> limit is more of a sanity check anyway, than a fine-grained limit.  If
> we can get the really big clients using incremental fetches, and the
> followers using incremental fetches, we have captured most of the
> benefits.  I'm curious if there is a more elegant way to limit
> per-partition that I may have missed, though?
>
> best,
> Colin
>
>
> >
> > Cheers
> >
> > On Tue, Dec 5, 2017 at 11:02 AM, Colin McCabe 
> wrote:
> >
> > > On Tue, Dec 5, 2017, at 08:51, Jason Gustafson wrote:
> > > > Hi Colin,
> > > >
> > > > Thanks for the response. A couple replies:
> > > >
> > > >
> > > > > I’m a bit ambivalent about letting the client choose the session
> > > > > timeout.  What if clients choose timeouts that are too long?
> Hmm
> > > > > I do agree the timeout should be sized proportional to
> > > > > max.poll.interval.ms.
> > > >
> > > >
> > > > We have solved this in other cases by letting the broker enforce a
> > > > maximum timeout. After thinking about it a bit, it's probably
> overkill
> > > in this
> > > > case since the caching is just an optimization. Instead of stressing
> over
> > > > timeouts and such, I am actually wondering if we just need a
> reasonable
> > > > session cache eviction policy. For example, when the number of slots
> is
> > > > exceeded, perhaps you evict the session with the fewest partitions
> or the
> > > > one with the largest interval between fetches. We could give
> priority to
> > > > the replicas. Perhaps this might let us get rid of a few of the
> configs.
> > >
> > > I agree that it would be nice to get rid of the tunable for eviction
> > > time.  However, I'm concerned that if we do, we might run into cache
> > > thrashing.  For example, if we have N cache slots and N+1 clients that
> > > are all fetching continuously, we might have to evict a client on every
> > > single fetch.  It would be much better to give a cache slot to N
> clients
> > > and let the last client do full fetch requests.
> > >
> > > Perhaps we could mitigate this problem by evicting the smallest fetch
> > > session-- the one that is for the smallest number of partitions.  This
> > > would allow "big" clients that fetch many partitions (e.g. MirrorMaker)
> > > to get priority.  But then you run into the problem where someone
> > > fetches a huge number of partitions, and then goes away for a long
> time,
> > > and you never reuse that cache memory.
> > >
> > > How about this approach?  We have a tunable for minimum eviction time
> > > (default 2 minutes).  We cannot evict a client before this timeout has
> > > expired.  We also have a tunable for total number of cache slots.  We
> > > never cache more than this number of incremental fetch sessions.
> > >
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Colin McCabe
On Tue, Dec 5, 2017, at 11:24, Ted Yu wrote:
> bq. We also have a tunable for total number of cache slots. We never
> cache
> more than this number of incremental fetch sessions.
> 
> Is it possible to manage the cache based on heap consumption instead of
> number of slots ?
> It seems heap estimation can be done by counting PartitionData (along
> with overhead for related Map structure).

Hi Ted,

That's an interesting idea.  I think it starts to get complicated,
though.

For example, suppose we later implement incrementally adding partitions
to the fetch session.  When a fetch session adds more partitions, it
uses more memory.  So should this trigger an eviction?

If we have a bunch of small fetch sessions and a bigger client comes in,
we might have to evict many small sessions to fit the bigger one.  But
we probably do want to fit the bigger one in, since bigger requests gain
proportionally more from being incremental.

[ Small digression: In general fetch requests have some fixed cost plus
a variable cost based on the number of partitions.  The more partitions
you add, the more the variable cost comes to dominate.  Therefore, it is
especially good to make big fetch requests into incremental fetch
requests.  Small fetch requests for one or two partitions may not gain
much, since their cost is dominated by the fixed cost anyway (message
header, TCP overhead, IP packet overhead, etc.) ]
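The digression can be made concrete with some back-of-envelope arithmetic. The byte counts below are made up purely for illustration (nothing in the thread specifies them):

```java
/**
 * Back-of-envelope sketch of fixed vs. variable fetch-request cost.
 * FIXED_BYTES and PER_PARTITION_BYTES are invented for illustration.
 */
class FetchCostSketch {
    static final int FIXED_BYTES = 100;         // message header, TCP/IP overhead, etc.
    static final int PER_PARTITION_BYTES = 40;  // per-partition entry in the request

    static int requestBytes(int partitions) {
        return FIXED_BYTES + partitions * PER_PARTITION_BYTES;
    }

    /** Fraction of a full request saved by sending only the changed partitions. */
    static double incrementalSavings(int totalPartitions, int changedPartitions) {
        return 1.0 - (double) requestBytes(changedPartitions)
                   / requestBytes(totalPartitions);
    }
}
```

With these numbers, a 1000-partition fetch with 10 changed partitions saves roughly 99% of the bytes, while a 2-partition fetch with 1 changed partition saves under a quarter -- the fixed cost dominates the small request either way.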

Overall, I would still lean towards limiting the number of incremental
fetch sessions, rather than trying to create a per-partition data memory
limit.  I think the complexity is probably not worth it.  The memory
limit is more of a sanity check anyway, than a fine-grained limit.  If
we can get the really big clients using incremental fetches, and the
followers using incremental fetches, we have captured most of the
benefits.  I'm curious if there is a more elegant way to limit
per-partition that I may have missed, though?

best,
Colin


> 
> Cheers
> 
> On Tue, Dec 5, 2017 at 11:02 AM, Colin McCabe  wrote:
> 
> > On Tue, Dec 5, 2017, at 08:51, Jason Gustafson wrote:
> > > Hi Colin,
> > >
> > > Thanks for the response. A couple replies:
> > >
> > >
> > > > I’m a bit ambivalent about letting the client choose the session
> > > > timeout.  What if clients choose timeouts that are too long? Hmm
> > > > I do agree the timeout should be sized proportional to
> > > > max.poll.interval.ms.
> > >
> > >
> > > We have solved this in other cases by letting the broker enforce a
> > > maximum timeout. After thinking about it a bit, it's probably overkill
> > in this
> > > case since the caching is just an optimization. Instead of stressing over
> > > timeouts and such, I am actually wondering if we just need a reasonable
> > > session cache eviction policy. For example, when the number of slots is
> > > exceeded, perhaps you evict the session with the fewest partitions or the
> > > one with the largest interval between fetches. We could give priority to
> > > the replicas. Perhaps this might let us get rid of a few of the configs.
> >
> > I agree that it would be nice to get rid of the tunable for eviction
> > time.  However, I'm concerned that if we do, we might run into cache
> > thrashing.  For example, if we have N cache slots and N+1 clients that
> > are all fetching continuously, we might have to evict a client on every
> > single fetch.  It would be much better to give a cache slot to N clients
> > and let the last client do full fetch requests.
> >
> > Perhaps we could mitigate this problem by evicting the smallest fetch
> > session-- the one that is for the smallest number of partitions.  This
> > would allow "big" clients that fetch many partitions (e.g. MirrorMaker)
> > to get priority.  But then you run into the problem where someone
> > fetches a huge number of partitions, and then goes away for a long time,
> > and you never reuse that cache memory.
> >
> > How about this approach?  We have a tunable for minimum eviction time
> > (default 2 minutes).  We cannot evict a client before this timeout has
> > expired.  We also have a tunable for total number of cache slots.  We
> > never cache more than this number of incremental fetch sessions.
> >
> > Sessions become eligible for eviction after 2 minutes, whether or not
> > the session is active.
> > Fetch Request A will evict Fetch Request B if and only if:
> > 1. A has been active in the last 2 minutes and B has not, OR
> > 2. A was made by a follower and B was made by a consumer, OR
> > 3. A has more partitions than B, OR
> > 4. A is newer than B
> >
> > Then, in a setup where consumers are fetching different numbers of
> > partitions, we will eventually converge on giving incremental fetch
> > sessions to the big consumers, and not to the small consumers.  In a
> > setup where consumers are all of equal size but the cache is too small
> > for all of them, we still thrash, but slowly.  Nobody can be evicted
> > before their 2 minutes are up.  So in 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Jan Filipiak

Hi Colin

Addressing the topic of how to manage slots from the other thread.
With tcp connections all this comes for free essentially.
I would still argue we disable it by default: add a flag in the broker
to ask the leader to maintain the cache while replicating, and make it
optional in consumers (default to off), so one can turn it on only
where it really hurts.

MirrorMaker and audit consumers prominently.

Otherwise I left a few remarks in-line, which should help you understand
my view of the situation better

Best Jan


On 05.12.2017 08:06, Colin McCabe wrote:

On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:


On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside the context of that single TCP session.

Keep in mind, NetworkClient is in charge of TCP sessions, and generally
tries to hide that information from the upper layers of the code.  So
when you submit a request to NetworkClient, you don't know if that
request creates a TCP session, or reuses an existing one.

Unfortunately, this doesn't work.  Imagine the client misses an
increment fetch response about a partition.  And then the partition is
never updated after that.  The client has no way to know about the
partition, since it won't be included in any future incremental fetch
responses.  And there are no offsets to compare, since the partition is
simply omitted from the response.

I am curious about in which situation would the follower miss a response
of a partition. If the entire FetchResponse is lost (e.g. timeout), the
follower would disconnect and retry. That will result in sending a full
FetchRequest.

Basically, you are proposing that we rely on TCP for reliable delivery
in a distributed system.  That isn't a good idea for a bunch of
different reasons.  First of all, TCP timeouts tend to be very long.  So
if the TCP session timing out is your error detection mechanism, you
have to wait minutes for messages to timeout.  Of course, we add a
timeout on top of that after which we declare the connection bad and
manually close it.  But just because the session is closed on one end
doesn't mean that the other end knows that it is closed.  So the leader
may have to wait quite a long time before TCP decides that yes,
connection X from the follower is dead and not coming back, even though
gremlins ate the FIN packet which the follower attempted to transmit.
If the cache state is tied to that TCP session, we have to keep that
cache around for a much longer time than we should.

Hi,

I see this from a different perspective. The cache expiry time
has the same semantic as idle connection time in this scenario.
It is the time range we expect the client to come back and reuse
its broker-side state. I would argue that on close we would get an
extra shot at cleaning up the session state early, as opposed to
always waiting for that duration for expiry to happen.

Hi Jan,

The idea here is that the incremental fetch cache expiry time can be
much shorter than the TCP session timeout.  In general the TCP session
timeout is common to all TCP connections, and very long.  To make these
numbers a little more concrete, the TCP session timeout is often
configured to be 2 hours on Linux.  (See
https://www.cyberciti.biz/tips/linux-increasing-or-decreasing-tcp-sockets-timeouts.html
)  The timeout I was proposing for incremental fetch sessions was one or
two minutes at most.

Currently this is taken care of by connections.max.idle.ms on the broker,
which defaults to something like a few minutes.

Also something we could let the client change if we really wanted to.
So there is no need to worry about coupling our implementation to some 
timeouts

given by the OS; with TCP one always has full control over the worst-case
times, plus one gets the extra shot at cleaning up early when the close
comes through, which is the majority of cases.




Secondly, from a software engineering perspective, it's not a good idea
to try to tightly tie together TCP and our code.  We would have to
rework how we interact with NetworkClient so that we are aware of things
like TCP sessions closing or opening.  We would have to be careful
preserve the ordering of incoming messages when doing things like
putting incoming requests on to a queue to be processed by multiple
threads.  It's just a lot of complexity to add, and there's no upside.

I see the point here. And I had a 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Ted Yu
bq. We also have a tunable for total number of cache slots. We never cache
more than this number of incremental fetch sessions.

Is it possible to manage the cache based on heap consumption instead of
number of slots ?
It seems heap estimation can be done by counting PartitionData (along with
overhead for related Map structure).

Cheers

On Tue, Dec 5, 2017 at 11:02 AM, Colin McCabe  wrote:

> On Tue, Dec 5, 2017, at 08:51, Jason Gustafson wrote:
> > Hi Colin,
> >
> > Thanks for the response. A couple replies:
> >
> >
> > > I’m a bit ambivalent about letting the client choose the session
> > > timeout.  What if clients choose timeouts that are too long? Hmm
> > > I do agree the timeout should be sized proportional to
> > > max.poll.interval.ms.
> >
> >
> > We have solved this in other cases by letting the broker enforce a
> > maximum timeout. After thinking about it a bit, it's probably overkill
> in this
> > case since the caching is just an optimization. Instead of stressing over
> > timeouts and such, I am actually wondering if we just need a reasonable
> > session cache eviction policy. For example, when the number of slots is
> > exceeded, perhaps you evict the session with the fewest partitions or the
> > one with the largest interval between fetches. We could give priority to
> > the replicas. Perhaps this might let us get rid of a few of the configs.
>
> I agree that it would be nice to get rid of the tunable for eviction
> time.  However, I'm concerned that if we do, we might run into cache
> thrashing.  For example, if we have N cache slots and N+1 clients that
> are all fetching continuously, we might have to evict a client on every
> single fetch.  It would be much better to give a cache slot to N clients
> and let the last client do full fetch requests.
>
> Perhaps we could mitigate this problem by evicting the smallest fetch
> session-- the one that is for the smallest number of partitions.  This
> would allow "big" clients that fetch many partitions (e.g. MirrorMaker)
> to get priority.  But then you run into the problem where someone
> fetches a huge number of partitions, and then goes away for a long time,
> and you never reuse that cache memory.
>
> How about this approach?  We have a tunable for minimum eviction time
> (default 2 minutes).  We cannot evict a client before this timeout has
> expired.  We also have a tunable for total number of cache slots.  We
> never cache more than this number of incremental fetch sessions.
>
> Sessions become eligible for eviction after 2 minutes, whether or not
> the session is active.
> Fetch Request A will evict Fetch Request B if and only if:
> 1. A has been active in the last 2 minutes and B has not, OR
> 2. A was made by a follower and B was made by a consumer, OR
> 3. A has more partitions than B, OR
> 4. A is newer than B
>
> Then, in a setup where consumers are fetching different numbers of
> partitions, we will eventually converge on giving incremental fetch
> sessions to the big consumers, and not to the small consumers.  In a
> setup where consumers are all of equal size but the cache is too small
> for all of them, we still thrash, but slowly.  Nobody can be evicted
> before their 2 minutes are up.  So in general, the overhead of the extra
> full requests is still low.  If someone makes a big request and then
> shuts down, they get cleaned up after 2 minutes, because of condition
> #1.  And there are only two tunables needed: cache size and eviction
> time.
>
> >
> > The main reason is if there is a bug in the incremental fetch feature.
> > >
> >
> > This was in response to my question about removing the consumer config.
> > And sure, any new feature may have bugs, but that's what we have testing
> for
> > ;). Users can always fall back to a previous version if there are any
> > major problems. As you know, it's tough removing configs once they are
> there,
> > so I think we should try to add them only if they make sense in the long
> > term.
>
> That's a fair point.  I guess if we do need to disable incremental
> fetches in production because of a bug, we can modify the broker
> configuration to do so (by setting 0 cache slots).
>
> best,
> Colin
>
> >
> > Thanks,
> > Jason
> >
> > On Mon, Dec 4, 2017 at 11:06 PM, Colin McCabe 
> wrote:
> >
> > > On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> > > >
> > > >
> > > > On 03.12.2017 21:55, Colin McCabe wrote:
> > > > > On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> > > > >> Thanks for the explanation, Colin. A few more questions.
> > > > >>
> > > > >>> The session epoch is not complex.  It's just a number which
> > > increments
> > > > >>> on each incremental fetch.  The session epoch is also useful for
> > > > >>> debugging-- it allows you to match up requests and responses when
> > > > >>> looking at log files.
> > > > >> Currently each request in Kafka has a correlation id to help
> match the
> > > > >> requests and responses. Is epoch 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Colin McCabe
On Tue, Dec 5, 2017, at 08:51, Jason Gustafson wrote:
> Hi Colin,
> 
> Thanks for the response. A couple replies:
> 
> 
> > I’m a bit ambivalent about letting the client choose the session
> > timeout.  What if clients choose timeouts that are too long? Hmm
> > I do agree the timeout should be sized proportional to
> > max.poll.interval.ms.
> 
> 
> We have solved this in other cases by letting the broker enforce a
> maximum timeout. After thinking about it a bit, it's probably overkill in this
> case since the caching is just an optimization. Instead of stressing over
> timeouts and such, I am actually wondering if we just need a reasonable
> session cache eviction policy. For example, when the number of slots is
> exceeded, perhaps you evict the session with the fewest partitions or the
> one with the largest interval between fetches. We could give priority to
> the replicas. Perhaps this might let us get rid of a few of the configs.

I agree that it would be nice to get rid of the tunable for eviction
time.  However, I'm concerned that if we do, we might run into cache
thrashing.  For example, if we have N cache slots and N+1 clients that
are all fetching continuously, we might have to evict a client on every
single fetch.  It would be much better to give a cache slot to N clients
and let the last client do full fetch requests.

Perhaps we could mitigate this problem by evicting the smallest fetch
session-- the one that is for the smallest number of partitions.  This
would allow "big" clients that fetch many partitions (e.g. MirrorMaker)
to get priority.  But then you run into the problem where someone
fetches a huge number of partitions, and then goes away for a long time,
and you never reuse that cache memory.

How about this approach?  We have a tunable for minimum eviction time
(default 2 minutes).  We cannot evict a client before this timeout has
expired.  We also have a tunable for total number of cache slots.  We
never cache more than this number of incremental fetch sessions.

Sessions become eligible for eviction after 2 minutes, whether or not
the session is active.
Fetch Request A will evict Fetch Request B if and only if:
1. A has been active in the last 2 minutes and B has not, OR
2. A was made by a follower and B was made by a consumer, OR
3. A has more partitions than B, OR
4. A is newer than B

Then, in a setup where consumers are fetching different numbers of
partitions, we will eventually converge on giving incremental fetch
sessions to the big consumers, and not to the small consumers.  In a
setup where consumers are all of equal size but the cache is too small
for all of them, we still thrash, but slowly.  Nobody can be evicted
before their 2 minutes are up.  So in general, the overhead of the extra
full requests is still low.  If someone makes a big request and then
shuts down, they get cleaned up after 2 minutes, because of condition
#1.  And there are only two tunables needed: cache size and eviction
time.
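These eviction rules compose naturally into a precedence check. The sketch below uses invented names (it is not Kafka's actual implementation) and hard-codes the 2-minute floor from the proposal:

```java
/**
 * Illustrative sketch of the proposed eviction precedence (rules 1-4),
 * not Kafka's actual code. evicts(a, b) answers: should incoming
 * session a displace cached session b?
 */
class FetchSessionEvictionSketch {
    static final long MIN_EVICTION_MS = 2 * 60 * 1000L;  // 2-minute floor

    static class Session {
        final boolean follower;   // replica fetcher, as opposed to a consumer
        final int partitions;     // partitions covered by the session
        final long lastUsedMs;    // time of the last fetch
        final long createdMs;

        Session(boolean follower, int partitions, long lastUsedMs, long createdMs) {
            this.follower = follower;
            this.partitions = partitions;
            this.lastUsedMs = lastUsedMs;
            this.createdMs = createdMs;
        }
    }

    /** Nobody can be evicted before their 2 minutes are up. */
    static boolean eligibleForEviction(Session s, long nowMs) {
        return nowMs - s.createdMs >= MIN_EVICTION_MS;
    }

    /** True iff a takes precedence over b under rules 1-4 above. */
    static boolean evicts(Session a, Session b, long nowMs) {
        boolean aActive = nowMs - a.lastUsedMs < MIN_EVICTION_MS;
        boolean bActive = nowMs - b.lastUsedMs < MIN_EVICTION_MS;
        if (aActive != bActive) return aActive;                  // rule 1
        if (a.follower != b.follower) return a.follower;         // rule 2
        if (a.partitions != b.partitions)
            return a.partitions > b.partitions;                  // rule 3
        return a.createdMs > b.createdMs;                        // rule 4
    }
}
```

Note the rules are tried strictly in order, so a recently active consumer still outranks an idle follower.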

> 
> The main reason is if there is a bug in the incremental fetch feature.
> >
> 
> This was in response to my question about removing the consumer config.
> And sure, any new feature may have bugs, but that's what we have testing for
> ;). Users can always fall back to a previous version if there are any
> major problems. As you know, it's tough removing configs once they are there,
> so I think we should try to add them only if they make sense in the long
> term.

That's a fair point.  I guess if we do need to disable incremental
fetches in production because of a bug, we can modify the broker
configuration to do so (by setting 0 cache slots).

best,
Colin

> 
> Thanks,
> Jason
> 
> On Mon, Dec 4, 2017 at 11:06 PM, Colin McCabe  wrote:
> 
> > On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> > >
> > >
> > > On 03.12.2017 21:55, Colin McCabe wrote:
> > > > On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> > > >> Thanks for the explanation, Colin. A few more questions.
> > > >>
> > > >>> The session epoch is not complex.  It's just a number which
> > increments
> > > >>> on each incremental fetch.  The session epoch is also useful for
> > > >>> debugging-- it allows you to match up requests and responses when
> > > >>> looking at log files.
> > > >> Currently each request in Kafka has a correlation id to help match the
> > > >> requests and responses. Is epoch doing something differently?
> > > > Hi Becket,
> > > >
> > > > The correlation ID is used within a single TCP session, to uniquely
> > > > associate a request with a response.  The correlation ID is not unique
> > > > (and has no meaning) outside the context of that single TCP session.
> > > >
> > > > Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> > > > tries to hide that information from the upper layers of the code.  So
> > > > when you submit a request to NetworkClient, you don't know if that
> > > > request creates a TCP session, or reuses an existing one.
> > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Colin McCabe
On Sun, Dec 3, 2017, at 16:28, Becket Qin wrote:
> >The correlation ID is used within a single TCP session, to uniquely
> >associate a request with a response.  The correlation ID is not unique
> >(and has no meaning) outside the context of that single TCP session.
> >
> >Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> >tries to hide that information from the upper layers of the code.  So
> >when you submit a request to NetworkClient, you don't know if that
> >request creates a TCP session, or reuses an existing one.
> 
> Hmm, the correlation id is an application level information in each Kafka
> request. It is maintained by o.a.k.c.NetworkClient. It is not associated
> with TCP sessions. So even if the TCP session disconnects and reconnects,
> the correlation id is not reset and will still be monotonically increasing.

Hi Becket,

That's a fair point.  I was thinking of previous RPC systems I worked
with.  But in Kafka, you're right that the correlation ID is maintained
by a single counter in NetworkClient, rather than being a counter
per-connection.

In any case, the correlation ID is there in order to associate a request
with a response within a single TCP session.  It's not unique, even on a
single node, if there is more than one NetworkClient.  It will get reset
to 0 any time we restart the process or re-create the NetworkClient
object.
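To make the contrast concrete: a correlation ID is nothing more than a per-client counter, so it carries no meaning across client restarts. (A minimal sketch, not Kafka's actual NetworkClient code:)

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch: a correlation ID is just a monotonically increasing counter
 * scoped to one client object; recreate the client and it starts over.
 * The fetch session epoch, by contrast, must survive reconnects.
 */
class CorrelationIdSketch {
    private final AtomicInteger correlation = new AtomicInteger(0);

    int nextCorrelationId() {
        return correlation.getAndIncrement();
    }
}
```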

> 
> Maybe I did not make it clear. I am not suggesting anything relying on
> TCP or the transport layer. Everything is handled at the application layer. From the
> client's perspective, the timeout is not defined as a TCP timeout; it is
> defined as the upper bound of time it will wait before receiving a
> response. If the client did not receive a response before the timeout is
> reached, it will just retry. My suggestion was that as long as a
> FetchRequest needs to be retried, no matter for what reason, we just use
> a full FetchRequest. This does not depend on NetworkClient implementations,
> i.e. regardless of whether the retry is on the existing TCP connection or
> a new one.

So, with this proposal, if the TCP session drops, then the client needs
to retransmit, right?  That's why I said this proposal couples the TCP
session with the incremental fetch session.  In general, I don't see why
you would want to couple those two things.

If the network is under heavy load, it might cause a few TCP sessions to
drop.  If a dropped TCP session means that someone has to fall back to
sending a much larger full fetch request, that's a positive feedback
loop.  It could lead to congestion collapse.

In general, I think that the current KIP proposal, which allows an
incremental fetch session to persist across multiple TCP sessions, is
superior to a proposal which doesn't allow that.  It also avoids
worrying about message reordering within the server due to multiple
worker threads and delayed requests.  It's just simpler, easier, and
more efficient to have the sequence number than to not have it.

> 
> The question we are trying to answer here is essentially how to let the
> leader and followers agree on the messages in the log. And we are
> comparing
> the following two solutions:
> 1. Use something like a TCP ACK with epoch at Request/Response level.
> 2. Piggy back the leader knowledge at partition level for the follower to
> confirm.

The existing KIP proposal is not really similar to a TCP ACK.  A TCP ACK
involves sending back an actual ACK packet.  The KIP-227 proposal just
has an incrementing sequence number which the client increments each
time it successfully receives a response.

> 
> Personally I think (2) is better because (2) is more direct. The leader
> is the one who maintains all the state (LEOs) of the followers. At the end
> of the day, the leader just wants to make sure all those states are correct.
> (2) directly confirms those states with the followers instead of
> inferring that from a epoch.

The problem is that when using incremental updates, we can't "directly
confirm" that the follower and the leader are in sync.  For example,
suppose the follower loses a response which gives an update for some
partition.  Then, the partition is not changed after that.  The follower
has no way of knowing that that data is missing, just by looking at the
responses.  That's why it is so important to keep the follower and the
leader in lockstep by using the sequence number.
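A hedged sketch of that lockstep check, with hypothetical names (the actual KIP-227 implementation differs in details such as session IDs):

```java
/**
 * Illustrative broker-side epoch check for an incremental fetch session.
 * Epoch 0 conventionally means "full fetch"; a matching non-zero epoch
 * proves the client processed the previous response.
 */
class SessionEpochSketch {
    private int expectedEpoch = 0;  // the next epoch the broker will accept

    synchronized boolean validate(int requestEpoch) {
        if (requestEpoch == 0) {    // full fetch: rebuild state, restart lockstep
            expectedEpoch = 1;
            return true;
        }
        if (requestEpoch != expectedEpoch) {
            // Out of step: a response was missed or replayed. Reject so the
            // client falls back to a full fetch rather than silently diverging.
            return false;
        }
        expectedEpoch++;            // advance only on a confirmed round trip
        return true;
    }
}
```

A rejected epoch is exactly the "the client missed a response" case: the broker cannot know which partitions the client is missing, so the only safe recovery is a full fetch.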

> Note that there is a subtle but maybe important
> difference between our use case of epoch and TCP seq. The difference is
> that a TCP ACK confirms all the packets with a lower seq has been
> received.
> In our case, a high epoch request does not mean all the data in the
> previous response was successful. So in the KIP, the statement of "When
> the leader receives a fetch request with epoch N + 1, it knows that the data
> it sent back for the fetch request with epoch N was successfully processed
> by the follower." could be tricky or expensive to make right in some cases.

Hmm, let me 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-05 Thread Jason Gustafson
Hi Colin,

Thanks for the response. A couple replies:


> I’m a bit ambivalent about letting the client choose the session
> timeout.  What if clients choose timeouts that are too long? Hmm
> I do agree the timeout should be sized proportional to
> max.poll.interval.ms.


We have solved this in other cases by letting the broker enforce a maximum
timeout. After thinking about it a bit, it's probably overkill in this case
since the caching is just an optimization. Instead of stressing over
timeouts and such, I am actually wondering if we just need a reasonable
session cache eviction policy. For example, when the number of slots is
exceeded, perhaps you evict the session with the fewest partitions or the
one with the largest interval between fetches. We could give priority to
the replicas. Perhaps this might let us get rid of a few of the configs.

The main reason is if there is a bug in the incremental fetch feature.
>

This was in response to my question about removing the consumer config. And
sure, any new feature may have bugs, but that's what we have testing for
;). Users can always fall back to a previous version if there are any major
problems. As you know, it's tough removing configs once they are there, so
I think we should try to add them only if they make sense in the long term.

Thanks,
Jason

On Mon, Dec 4, 2017 at 11:06 PM, Colin McCabe  wrote:

> On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> >
> >
> > On 03.12.2017 21:55, Colin McCabe wrote:
> > > On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> > >> Thanks for the explanation, Colin. A few more questions.
> > >>
> > >>> The session epoch is not complex.  It's just a number which
> increments
> > >>> on each incremental fetch.  The session epoch is also useful for
> > >>> debugging-- it allows you to match up requests and responses when
> > >>> looking at log files.
> > >> Currently each request in Kafka has a correlation id to help match the
> > >> requests and responses. Is epoch doing something differently?
> > > Hi Becket,
> > >
> > > The correlation ID is used within a single TCP session, to uniquely
> > > associate a request with a response.  The correlation ID is not unique
> > > (and has no meaning) outside the context of that single TCP session.
> > >
> > > Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> > > tries to hide that information from the upper layers of the code.  So
> > > when you submit a request to NetworkClient, you don't know if that
> > > request creates a TCP session, or reuses an existing one.
> > >>> Unfortunately, this doesn't work.  Imagine the client misses an
> > >>> incremental fetch response about a partition.  And then the partition
> is
> > >>> never updated after that.  The client has no way to know about the
> > >>> partition, since it won't be included in any future incremental fetch
> > >>> responses.  And there are no offsets to compare, since the partition
> is
> > >>> simply omitted from the response.
> > >> I am curious about in which situation would the follower miss a
> response
> > >> of a partition. If the entire FetchResponse is lost (e.g. timeout),
> the
> > >> follower would disconnect and retry. That will result in sending a
> full
> > >> FetchRequest.
> > > Basically, you are proposing that we rely on TCP for reliable delivery
> > > in a distributed system.  That isn't a good idea for a bunch of
> > > different reasons.  First of all, TCP timeouts tend to be very long.
> So
> > > if the TCP session timing out is your error detection mechanism, you
> > > have to wait minutes for messages to timeout.  Of course, we add a
> > > timeout on top of that after which we declare the connection bad and
> > > manually close it.  But just because the session is closed on one end
> > > doesn't mean that the other end knows that it is closed.  So the leader
> > > may have to wait quite a long time before TCP decides that yes,
> > > connection X from the follower is dead and not coming back, even though
> > > gremlins ate the FIN packet which the follower attempted to transmit.
> > > If the cache state is tied to that TCP session, we have to keep that
> > > cache around for a much longer time than we should.
> > Hi,
> >
> > I see this from a different perspective. The cache expiry time
> > has the same semantic as idle connection time in this scenario.
> > It is the time range we expect the client to come back and reuse
> > its broker side state. I would argue that on close we would get an
> > extra shot at cleaning up the session state early. As opposed to
> > always waiting for that duration for expiry to happen.
>
> Hi Jan,
>
> The idea here is that the incremental fetch cache expiry time can be
> much shorter than the TCP session timeout.  In general the TCP session
> timeout is common to all TCP connections, and very long.  To make these
> numbers a little more concrete, the TCP session timeout is often
> configured to be 2 hours on Linux.  (See
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-04 Thread Colin McCabe
On Mon, Dec 4, 2017, at 02:27, Jan Filipiak wrote:
> 
> 
> On 03.12.2017 21:55, Colin McCabe wrote:
> > On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> >> Thanks for the explanation, Colin. A few more questions.
> >>
> >>> The session epoch is not complex.  It's just a number which increments
> >>> on each incremental fetch.  The session epoch is also useful for
> >>> debugging-- it allows you to match up requests and responses when
> >>> looking at log files.
> >> Currently each request in Kafka has a correlation id to help match the
> >> requests and responses. Is epoch doing something differently?
> > Hi Becket,
> >
> > The correlation ID is used within a single TCP session, to uniquely
> > associate a request with a response.  The correlation ID is not unique
> > (and has no meaning) outside the context of that single TCP session.
> >
> > Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> > tries to hide that information from the upper layers of the code.  So
> > when you submit a request to NetworkClient, you don't know if that
> > request creates a TCP session, or reuses an existing one.
> >>> Unfortunately, this doesn't work.  Imagine the client misses an
> >>> incremental fetch response about a partition.  And then the partition is
> >>> never updated after that.  The client has no way to know about the
> >>> partition, since it won't be included in any future incremental fetch
> >>> responses.  And there are no offsets to compare, since the partition is
> >>> simply omitted from the response.
> >> I am curious about in which situation would the follower miss a response
> >> of a partition. If the entire FetchResponse is lost (e.g. timeout), the
> >> follower would disconnect and retry. That will result in sending a full
> >> FetchRequest.
> > Basically, you are proposing that we rely on TCP for reliable delivery
> > in a distributed system.  That isn't a good idea for a bunch of
> > different reasons.  First of all, TCP timeouts tend to be very long.  So
> > if the TCP session timing out is your error detection mechanism, you
> > have to wait minutes for messages to timeout.  Of course, we add a
> > timeout on top of that after which we declare the connection bad and
> > manually close it.  But just because the session is closed on one end
> > doesn't mean that the other end knows that it is closed.  So the leader
> > may have to wait quite a long time before TCP decides that yes,
> > connection X from the follower is dead and not coming back, even though
> > gremlins ate the FIN packet which the follower attempted to transmit.
> > If the cache state is tied to that TCP session, we have to keep that
> > cache around for a much longer time than we should.
> Hi,
> 
> I see this from a different perspective. The cache expiry time
> has the same semantic as idle connection time in this scenario.
> It is the time range we expect the client to come back and reuse
> its broker side state. I would argue that on close we would get an
> extra shot at cleaning up the session state early. As opposed to
> always waiting for that duration for expiry to happen.

Hi Jan,

The idea here is that the incremental fetch cache expiry time can be
much shorter than the TCP session timeout.  In general the TCP session
timeout is common to all TCP connections, and very long.  To make these
numbers a little more concrete, the TCP session timeout is often
configured to be 2 hours on Linux.  (See
https://www.cyberciti.biz/tips/linux-increasing-or-decreasing-tcp-sockets-timeouts.html
)  The timeout I was proposing for incremental fetch sessions was one or
two minutes at most.

> 
> > Secondly, from a software engineering perspective, it's not a good idea
> > to try to tightly tie together TCP and our code.  We would have to
> > rework how we interact with NetworkClient so that we are aware of things
> > like TCP sessions closing or opening.  We would have to be careful to
> > preserve the ordering of incoming messages when doing things like
> > putting incoming requests on to a queue to be processed by multiple
> > threads.  It's just a lot of complexity to add, and there's no upside.
> I see the point here. And I had a small chat with Dong Lin already
> making me aware of this. I tried out the approaches and propose the 
> following:
> 
> The client starts by doing a full fetch. It then does incremental fetches.
> The connection to the broker dies and is re-established by NetworkClient 
> under the hood.
> The broker sees an incremental fetch without having state => returns
> error:
> Client sees the error, does a full fetch and goes back to incrementally 
> fetching.
> 
> having this 1 additional error round trip is essentially the same as 
> when something
> with the sessions or epoch changed unexpectedly to the client (say
> expiry).
> 
> So it's nothing extra added, but the conditions are easier to evaluate.
> Especially since we do everything with NetworkClient. Other implementers 
> on the
> protocol are 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-04 Thread Colin McCabe
On Mon, Dec 4, 2017, at 16:44, Jason Gustafson wrote:
> Hi Colin,
>
> Just a few minor points/suggestions:
>
> 1. The last stable offset can only be advanced when the high watermark
> advances, so I think you can ignore it in your designation of a "dirty"
> partition.

Hi Jason,

Good catch.

>
> 2. I think the fetch "epoch" is more properly a "sequence number" in its
> current usage. The use of "epoch" makes me think of fencing, which is not
> the case here. Alternatively, you could make it a proper epoch which the
> client might bump when it fails to receive a fetch response. Seems like
> that would be enough to address the potential reordering issues that you
> have alluded to on TCP disconnects. Using the correlationId as suggested
> above may also be viable, but I think we have so far resisted using this
> field for higher-level purposes.

Yeah, “fetch sequence number” is a better description.  The client will
resend the incremental request with the same sequence number if the
response to the first request was dropped.
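The resend behavior described here can be sketched as a small broker-side check. All names are assumptions for illustration, and the real KIP implementation may handle mismatches differently; the point is only that a retry with the previous sequence number is distinguishable from an out-of-order request:

```java
// Sketch (names assumed) of the broker-side sequence check: the
// expected sequence number advances by one per incremental fetch; a
// retry carrying the previous sequence is tolerated, and anything else
// forces the client back to a full fetch.
class FetchSequenceChecker {
    enum Result { OK, RETRY, INVALID }

    private int expectedSeq = 1;

    Result onIncrementalFetch(int seq) {
        if (seq == expectedSeq) {
            expectedSeq++;           // processed: expect the next sequence
            return Result.OK;
        }
        if (seq == expectedSeq - 1) {
            return Result.RETRY;     // response to 'seq' was likely dropped
        }
        return Result.INVALID;       // out of order: session must be reset
    }
}
```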
>
> 3. I'm wondering if a broker config is the best way to control the session
> timeout. The consumer is designed for applications which handle processing
> in the poll loop which means that the rate of fetches is effectively tied
> (or at least related) to the rate poll() is being called by the
> application. This is governed on the client by max.poll.interval.ms. If the
> consumer is not fetching too frequently, maybe the benefit of caching is
> not that high anyway, so it might be nice to prevent these consumers from
> tying up the cache in the first place.
>

I’m a bit ambivalent about letting the client choose the session
timeout.  What if clients choose timeouts that are too long? Hmm
I do agree the timeout should be sized proportional to
max.poll.interval.ms.
> 4. Is there any reason why the consumer wouldn't always attempt to use the
> incremental fetch? In other words, do we need the config to enable it?
> Feels a bit low-level for users to be thinking about and we have a way
> for the broker to refuse to create a session if it seems not worthwhile.
The main reason is if there is a bug in the incremental fetch feature.

>
> 5. The use of the non-zero sessionId in the fetch response for one of the
> cases is a little odd. Can you explain a bit more why this is needed? It
> seems like the only case the client needs to distinguish is whether the
> session was created or not.

It has two uses— to let the client know the id of the session which was
created, and to make it easier to read the fetch responses in the logs.
>
> By the way, it would be helpful to make the new or modified fields in the
> Fetch API bold. One other nit: can you add the separate IncrementalFetch
> API to the list of rejected alternatives?
>

Yeah, I’ll add that in.

Best,
Colin

> Thanks,
> Jason
>
> On Mon, Dec 4, 2017 at 2:27 AM, Jan Filipiak wrote:
>
> >
> >
> > On 03.12.2017 21:55, Colin McCabe wrote:
> >
> >> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> >>
> >>> Thanks for the explanation, Colin. A few more questions.
> >>>
> >>> The session epoch is not complex.  It's just a number which increments
> >>> on each incremental fetch.  The session epoch is also useful for
> >>> debugging-- it allows you to match up requests and responses when
> >>> looking at log files.
> 
> >>> Currently each request in Kafka has a correlation id to help match the
> >>> requests and responses. Is epoch doing something differently?
> >>>
> >> Hi Becket,
> >>
> >> The correlation ID is used within a single TCP session, to uniquely
> >> associate a request with a response.  The correlation ID is not unique
> >> (and has no meaning) outside the context of that single TCP session.
> >>
> >> Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> >> tries to hide that information from the upper layers of the code.  So
> >> when you submit a request to NetworkClient, you don't know if that
> >> request creates a TCP session, or reuses an existing one.
> >>
> >>> Unfortunately, this doesn't work.  Imagine the client misses an
> >>> incremental fetch response about a partition.  And then the partition is
> >>> never updated after that.  The client has no way to know about the
> >>> partition, since it won't be included in any future incremental fetch
> >>> responses.  And there are no offsets to compare, since the partition is
> >>> simply omitted from the response.
> 
> >>> I am curious about in which situation would the follower miss a response
> >>> of a partition. If the entire FetchResponse is lost (e.g. timeout), the
> >>> follower would disconnect and retry. That will result in sending a full
> >>> FetchRequest.
> >>>
> >> Basically, you are proposing that we rely on TCP for reliable delivery
> >> in 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-04 Thread Jason Gustafson
Hi Colin,

Just a few minor points/suggestions:

1. The last stable offset can only be advanced when the high watermark
advances, so I think you can ignore it in your designation of a "dirty"
partition.

2. I think the fetch "epoch" is more properly a "sequence number" in its
current usage. The use of "epoch" makes me think of fencing, which is not
the case here. Alternatively, you could make it a proper epoch which the
client might bump when it fails to receive a fetch response. Seems like
that would be enough to address the potential reordering issues that you
have alluded to on TCP disconnects. Using the correlationId as suggested
above may also be viable, but I think we have so far resisted using this
field for higher-level purposes.

3. I'm wondering if a broker config is the best way to control the session
timeout. The consumer is designed for applications which handle processing
in the poll loop which means that the rate of fetches is effectively tied
(or at least related) to the rate poll() is being called by the
application. This is governed on the client by max.poll.interval.ms. If the
consumer is not fetching too frequently, maybe the benefit of caching is
not that high anyway, so it might be nice to prevent these consumers from
tying up the cache in the first place.

4. Is there any reason why the consumer wouldn't always attempt to use the
incremental fetch? In other words, do we need the config to enable it?
Feels a bit low-level for users to be thinking about and we have a way for
the broker to refuse to create a session if it seems not worthwhile.

5. The use of the non-zero sessionId in the fetch response for one of the
cases is a little odd. Can you explain a bit more why this is needed? It
seems like the only case the client needs to distinguish is whether the
session was created or not.

By the way, it would be helpful to make the new or modified fields in the
Fetch API bold. One other nit: can you add the separate IncrementalFetch
API to the list of rejected alternatives?

Thanks,
Jason

On Mon, Dec 4, 2017 at 2:27 AM, Jan Filipiak 
wrote:

>
>
> On 03.12.2017 21:55, Colin McCabe wrote:
>
>> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
>>
>>> Thanks for the explanation, Colin. A few more questions.
>>>
>>> The session epoch is not complex.  It's just a number which increments
 on each incremental fetch.  The session epoch is also useful for
 debugging-- it allows you to match up requests and responses when
 looking at log files.

>>> Currently each request in Kafka has a correlation id to help match the
>>> requests and responses. Is epoch doing something differently?
>>>
>> Hi Becket,
>>
>> The correlation ID is used within a single TCP session, to uniquely
>> associate a request with a response.  The correlation ID is not unique
>> (and has no meaning) outside the context of that single TCP session.
>>
>> Keep in mind, NetworkClient is in charge of TCP sessions, and generally
>> tries to hide that information from the upper layers of the code.  So
>> when you submit a request to NetworkClient, you don't know if that
>> request creates a TCP session, or reuses an existing one.
>>
>>> Unfortunately, this doesn't work.  Imagine the client misses an
 incremental fetch response about a partition.  And then the partition is
 never updated after that.  The client has no way to know about the
 partition, since it won't be included in any future incremental fetch
 responses.  And there are no offsets to compare, since the partition is
 simply omitted from the response.

>>> I am curious about in which situation would the follower miss a response
>>> of a partition. If the entire FetchResponse is lost (e.g. timeout), the
>>> follower would disconnect and retry. That will result in sending a full
>>> FetchRequest.
>>>
>> Basically, you are proposing that we rely on TCP for reliable delivery
>> in a distributed system.  That isn't a good idea for a bunch of
>> different reasons.  First of all, TCP timeouts tend to be very long.  So
>> if the TCP session timing out is your error detection mechanism, you
>> have to wait minutes for messages to timeout.  Of course, we add a
>> timeout on top of that after which we declare the connection bad and
>> manually close it.  But just because the session is closed on one end
>> doesn't mean that the other end knows that it is closed.  So the leader
>> may have to wait quite a long time before TCP decides that yes,
>> connection X from the follower is dead and not coming back, even though
>> gremlins ate the FIN packet which the follower attempted to transmit.
>> If the cache state is tied to that TCP session, we have to keep that
>> cache around for a much longer time than we should.
>>
> Hi,
>
> I see this from a different perspective. The cache expiry time
> has the same semantic as idle connection time in this scenario.
> It is the time range we expect the client to come back and reuse

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-04 Thread Jan Filipiak



On 03.12.2017 21:55, Colin McCabe wrote:

On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:

Thanks for the explanation, Colin. A few more questions.


The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside the context of that single TCP session.

Keep in mind, NetworkClient is in charge of TCP sessions, and generally
tries to hide that information from the upper layers of the code.  So
when you submit a request to NetworkClient, you don't know if that
request creates a TCP session, or reuses an existing one.

Unfortunately, this doesn't work.  Imagine the client misses an
incremental fetch response about a partition.  And then the partition is
never updated after that.  The client has no way to know about the
partition, since it won't be included in any future incremental fetch
responses.  And there are no offsets to compare, since the partition is
simply omitted from the response.

I am curious about in which situation would the follower miss a response
of a partition. If the entire FetchResponse is lost (e.g. timeout), the
follower would disconnect and retry. That will result in sending a full
FetchRequest.

Basically, you are proposing that we rely on TCP for reliable delivery
in a distributed system.  That isn't a good idea for a bunch of
different reasons.  First of all, TCP timeouts tend to be very long.  So
if the TCP session timing out is your error detection mechanism, you
have to wait minutes for messages to timeout.  Of course, we add a
timeout on top of that after which we declare the connection bad and
manually close it.  But just because the session is closed on one end
doesn't mean that the other end knows that it is closed.  So the leader
may have to wait quite a long time before TCP decides that yes,
connection X from the follower is dead and not coming back, even though
gremlins ate the FIN packet which the follower attempted to transmit.
If the cache state is tied to that TCP session, we have to keep that
cache around for a much longer time than we should.

Hi,

I see this from a different perspective. The cache expiry time
has the same semantic as idle connection time in this scenario.
It is the time range we expect the client to come back and reuse
its broker side state. I would argue that on close we would get an
extra shot at cleaning up the session state early. As opposed to
always waiting for that duration for expiry to happen.


Secondly, from a software engineering perspective, it's not a good idea
to try to tightly tie together TCP and our code.  We would have to
rework how we interact with NetworkClient so that we are aware of things
like TCP sessions closing or opening.  We would have to be careful to
preserve the ordering of incoming messages when doing things like
putting incoming requests on to a queue to be processed by multiple
threads.  It's just a lot of complexity to add, and there's no upside.

I see the point here. And I had a small chat with Dong Lin already
making me aware of this. I tried out the approaches and propose the 
following:


The client starts by doing a full fetch. It then does incremental fetches.
The connection to the broker dies and is re-established by NetworkClient 
under the hood.

The broker sees an incremental fetch without having state => returns error:
Client sees the error, does a full fetch and goes back to incrementally 
fetching.


having this 1 additional error round trip is essentially the same as 
when something

with the sessions or epoch changed unexpectedly to the client (say expiry).

So it's nothing extra added, but the conditions are easier to evaluate.
Especially since we do everything with NetworkClient. Other implementers
of the protocol are free to optimize this and not do the erroneous
round trip on the new connection.
It's a great plus that the client can know when the error is going to
happen, instead of the server always having to report back when something
changes unexpectedly for the client.
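The fallback flow Jan describes can be sketched like this. It is an illustration with invented names (`FakeBroker`, `FetchLoop`), not NetworkClient code: the client tries an incremental fetch first, and if the broker no longer knows the session, it pays one extra round trip for a full fetch and then resumes incremental fetching.

```java
// Minimal broker stand-in for the sketch; in real Kafka this would be
// the fetch session state kept by the leader (all names assumed).
interface FakeBroker {
    boolean incrementalFetch(int sessionId);  // false if the session is unknown
    int fullFetch();                          // creates a session, returns its id
}

// Sketch of the fallback flow: incremental fetch first; on an unknown
// session (e.g. the connection was re-established and broker state was
// lost), do one full fetch and carry on incrementally.
class FetchLoop {
    private final FakeBroker broker;
    private int sessionId = -1;  // -1 means no session yet
    int fullFetches = 0;         // counts the extra round trips taken

    FetchLoop(FakeBroker broker) { this.broker = broker; }

    void poll() {
        if (sessionId < 0 || !broker.incrementalFetch(sessionId)) {
            sessionId = broker.fullFetch();  // the one extra round trip
            fullFetches++;
        }
    }
}
```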



Imagine that I made an argument that client IDs are "complex" and should
be removed from our APIs.  After all, we can just look at the remote IP
address and TCP port of each connection.  Would you think that was a
good idea?  The client ID is useful when looking at logs.  For example,
if a rebalance is having problems, you want to know what clients were
having a problem.  So having the client ID field to guide you is
actually much less "complex" in practice than not having an ID.

I still can't follow why the correlation idea will not help here.
Correlating 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-03 Thread Becket Qin
>The correlation ID is used within a single TCP session, to uniquely
>associate a request with a response.  The correlation ID is not unique
>(and has no meaning) outside the context of that single TCP session.
>
>Keep in mind, NetworkClient is in charge of TCP sessions, and generally
>tries to hide that information from the upper layers of the code.  So
>when you submit a request to NetworkClient, you don't know if that
>request creates a TCP session, or reuses an existing one.

Hmm, the correlation id is an application level information in each Kafka
request. It is maintained by o.a.k.c.NetworkClient. It is not associated
with TCP sessions. So even the TCP session disconnects and reconnects, the
correlation id is not reset and will still be monotonically increasing.

Maybe I did not make it clear. I am not suggesting anything relying on TCP
or transport layer. Everything is handled at application layer. From the
clients perspective, the timeout is not defined as TCP timeout, it is
defined as the upper bound of time it will wait before receiving a
response. If the client did not receive a response before the timeout is
reached, it will just retry. My suggestion was that as long as a
FetchRequest needs to be retried, no matter for what reason, we just use a
full FetchRequest. This does not depend on NetworkClient implementations,
i.e. regardless of whether the retry is on the existing TCP connection or a
new one.

The question we are trying to answer here is essentially how to let the
leader and followers agree on the messages in the log. And we are comparing
the following two solutions:
1. Use something like a TCP ACK with epoch at Request/Response level.
2. Piggy back the leader knowledge at partition level for the follower to
confirm.

Personally I think (2) is better because (2) is more direct. The leader is
the one who maintains all the state (LEOs) of the followers. At the end of
the day, the leader just wants to make sure all those states are correct.
(2) directly confirms those states with the followers instead of inferring
that from a epoch. Note that there is a subtle but maybe important
difference between our use case of epoch and TCP seq. The difference is
that a TCP ACK confirms all the packets with a lower seq has been received.
In our case, a high epoch request does not mean all the data in the
previous response was successful. So in the KIP, the statement of "When the
leader receives a fetch request with epoch N + 1, it knows that the data it
sent back for the fetch request with epoch N was successfully processed by
the follower." could be tricky or expensive to make right in some cases.
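Option (2), confirming follower state per partition, can be sketched as follows. The names are assumptions for illustration: the leader simply takes the fetch offsets carried in each request as direct confirmation of the follower's position, instead of inferring that a whole previous response was processed from a bumped epoch.

```java
// Sketch of option (2): the leader updates its view of a follower's
// position directly from the fetch offsets in each request, partition
// by partition. No per-request epoch inference is needed.
class FollowerStateTracker {
    // partition -> last fetch offset the follower reported
    private final java.util.Map<String, Long> followerOffsets =
        new java.util.HashMap<>();

    /** Each fetched partition carries its offset; confirm it directly. */
    void onFetchRequest(java.util.Map<String, Long> fetchOffsets) {
        fetchOffsets.forEach(followerOffsets::put);
    }

    /** Returns the confirmed offset, or -1 if never reported. */
    long offsetOf(String partition) {
        return followerOffsets.getOrDefault(partition, -1L);
    }
}
```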

Not sure if we have considered this, but when thinking of the above
comparison, the following two potential issues came up:

1. Thinking about the case of a consumer. If consumer.seek() or
consumer.pause() is called, the consumer has essentially updated its
interested set of topics or positions. This will need a full FetchRequest
to update the position on the leader, and thus create a new session. Now if
users call seek()/pause() very often, the broker could run out of fetch
session slots pretty quickly.

2. Corrupted messages. If a fetch response has a corrupt message, the
follower will back off for a while and try fetching again. During the back
off period, the follower will not be fetching from the partition with the
corrupt message. And after the back off the partition will be added back.
With the current design, it seems the follower will need to keep creating
new sessions.

In the above two cases, it might still be useful to let the session id be
unique for each client instance (just like the producer id for the
idempotent produce) and allow the client to update the leader side
interested partitions and position with full FetchRequest without creating
a new session id.
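The seek()/pause() churn can be illustrated with a small sketch (invented names, not consumer internals): under the current proposal, any change to the interested partition set invalidates the broker-side session, so a full fetch — and a brand new session — is required each time.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration (assumed names) of why frequent seek()/pause() calls are
// a concern: whenever the client's interested partition set changes, the
// broker-side session no longer matches, forcing a full fetch that, in
// the current design, creates a brand new session.
class SessionedFetcher {
    private Set<String> sessionPartitions = null;  // null: no session yet
    int sessionsCreated = 0;

    /** Returns true if this poll required a full fetch (a new session). */
    boolean poll(Set<String> interested) {
        if (sessionPartitions == null || !sessionPartitions.equals(interested)) {
            sessionPartitions = new HashSet<>(interested);  // full fetch
            sessionsCreated++;
            return true;
        }
        return false;  // incremental fetch against the existing session
    }
}
```

Becket's suggestion amounts to letting such a full fetch update the existing session in place instead of bumping `sessionsCreated`.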

Thanks,

Jiangjie (Becket) Qin



On Sun, Dec 3, 2017 at 12:55 PM, Colin McCabe  wrote:

> On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> > Thanks for the explanation, Colin. A few more questions.
> >
> > >The session epoch is not complex.  It's just a number which increments
> > >on each incremental fetch.  The session epoch is also useful for
> > >debugging-- it allows you to match up requests and responses when
> > >looking at log files.
> >
> > Currently each request in Kafka has a correlation id to help match the
> > requests and responses. Is epoch doing something differently?
>
> Hi Becket,
>
> The correlation ID is used within a single TCP session, to uniquely
> associate a request with a response.  The correlation ID is not unique
> (and has no meaning) outside the context of that single TCP session.
>
> Keep in mind, NetworkClient is in charge of TCP sessions, and generally
> tries to hide that information from the upper layers of the code.  So
> when you submit a request to NetworkClient, you don't know if that
> request creates a TCP session, or reuses an existing one.
>
> >
> > >Unfortunately, this 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-03 Thread Colin McCabe
On Sat, Dec 2, 2017, at 23:21, Becket Qin wrote:
> Thanks for the explanation, Colin. A few more questions.
> 
> >The session epoch is not complex.  It's just a number which increments
> >on each incremental fetch.  The session epoch is also useful for
> >debugging-- it allows you to match up requests and responses when
> >looking at log files.
> 
> Currently each request in Kafka has a correlation id to help match the
> requests and responses. Is epoch doing something differently?

Hi Becket,

The correlation ID is used within a single TCP session, to uniquely
associate a request with a response.  The correlation ID is not unique
(and has no meaning) outside the context of that single TCP session.

Keep in mind, NetworkClient is in charge of TCP sessions, and generally
tries to hide that information from the upper layers of the code.  So
when you submit a request to NetworkClient, you don't know if that
request creates a TCP session, or reuses an existing one.

> 
> >Unfortunately, this doesn't work.  Imagine the client misses an
> >incremental fetch response about a partition.  And then the partition is
> >never updated after that.  The client has no way to know about the
> >partition, since it won't be included in any future incremental fetch
> >responses.  And there are no offsets to compare, since the partition is
> >simply omitted from the response.
> 
> I am curious about in which situation would the follower miss a response
> of a partition. If the entire FetchResponse is lost (e.g. timeout), the
> follower would disconnect and retry. That will result in sending a full
> FetchRequest.

Basically, you are proposing that we rely on TCP for reliable delivery
in a distributed system.  That isn't a good idea for a bunch of
different reasons.  First of all, TCP timeouts tend to be very long.  So
if the TCP session timing out is your error detection mechanism, you
have to wait minutes for messages to timeout.  Of course, we add a
timeout on top of that after which we declare the connection bad and
manually close it.  But just because the session is closed on one end
doesn't mean that the other end knows that it is closed.  So the leader
may have to wait quite a long time before TCP decides that yes,
connection X from the follower is dead and not coming back, even though
gremlins ate the FIN packet which the follower attempted to transmit. 
If the cache state is tied to that TCP session, we have to keep that
cache around for a much longer time than we should.

Secondly, from a software engineering perspective, it's not a good idea
to try to tightly tie together TCP and our code.  We would have to
rework how we interact with NetworkClient so that we are aware of things
like TCP sessions closing or opening.  We would have to be careful to
preserve the ordering of incoming messages when doing things like
putting incoming requests on to a queue to be processed by multiple
threads.  It's just a lot of complexity to add, and there's no upside.

Imagine that I made an argument that client IDs are "complex" and should
be removed from our APIs.  After all, we can just look at the remote IP
address and TCP port of each connection.  Would you think that was a
good idea?  The client ID is useful when looking at logs.  For example,
if a rebalance is having problems, you want to know what clients were
having a problem.  So having the client ID field to guide you is
actually much less "complex" in practice than not having an ID.

Similarly, if metadata responses had epoch numbers (simple incrementing
numbers), we would not have to debug problems like clients accidentally
getting old metadata from servers that had been partitioned off from the
network for a while.  Clients would know the difference between old and
new metadata.  So putting epochs in to the metadata request is much less
"complex" operationally, even though it's an extra field in the request.
 This has been discussed before on the mailing list.

So I think the bottom line for me is that having the session ID and
session epoch, while it adds two extra fields, reduces operational
complexity and increases debuggability.  It avoids tightly coupling us
to assumptions about reliable ordered delivery which tend to be violated
in practice in multiple layers of the stack.  Finally, it avoids the
necessity of refactoring NetworkClient.
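The session-id / epoch scheme described above can be sketched as follows. This is a minimal illustration, not the KIP's actual implementation: the structure, function names, and error strings are assumptions made for the sketch, chosen only to show how an epoch mismatch lets the leader detect a lost response or a duplicated/delayed request.

```python
class FetchSession:
    """Leader-side state for one incremental fetch session (illustrative)."""
    def __init__(self, session_id):
        self.session_id = session_id
        self.next_epoch = 1      # epoch expected on the next incremental fetch
        self.partitions = {}     # (topic, partition) -> cached fetch state


def validate_incremental_fetch(sessions, session_id, epoch):
    """Return an error name if the request cannot be matched to a live
    session; otherwise accept it and advance the expected epoch."""
    session = sessions.get(session_id)
    if session is None:
        # Unknown session: the follower must fall back to a full fetch.
        return "FETCH_SESSION_ID_NOT_FOUND"
    if epoch != session.next_epoch:
        # A response was lost, or a duplicate/delayed request arrived.
        # The epoch mismatch is what lets the leader detect this.
        return "INVALID_FETCH_SESSION_EPOCH"
    session.next_epoch += 1
    return None
```

Note how replaying the same epoch twice is rejected: without the epoch, the leader could not tell a retransmitted request from the next one.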

best,
Colin


> If an error such as NotLeaderForPartition is
> returned for some partitions, the follower can always send a full
> FetchRequest. Is there a scenario in which only some of the partitions in a
> FetchResponse are lost?
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Sat, Dec 2, 2017 at 2:37 PM, Colin McCabe  wrote:
> 
> > On Fri, Dec 1, 2017, at 11:46, Dong Lin wrote:
> > > On Thu, Nov 30, 2017 at 9:37 AM, Colin McCabe 
> > wrote:
> > >
> > > > On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote:
> > > > > Hey Colin,
> > > > >
> > > > > Thanks much for the update. I have a few 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-03 Thread Jan Filipiak


On 02.12.2017 23:34, Colin McCabe wrote:

On Thu, Nov 30, 2017, at 23:29, Jan Filipiak wrote:

Hi,

this discussion is going a little bit far from what I intended this
thread for.
I can see all of this being related.

To let you guys know what I am currently thinking is the following:

I do think the handling of ids and epochs is rather complicated. I think
the complexity comes from aiming for too much.

1. Currently all the work is towards making the FetchRequest
completely empty. This brings all sorts of pain, because the broker then
needs to know what it sent even though it tries to use sendfile as much
as possible.
2. Currently all the work is towards also keeping the fetch request empty
across TCP sessions.

In this thread I aimed to relax our goals with regards to point 2.
Connection resets for us are really the exception, and I would argue that
introducing complexity to spare one full request on a connection reset is
not worth it. Therefore I argued to keep the server-side information with
the session instead of somewhere global. It's not going to bring in the
results.

As the discussion unfolds I also want to challenge our approach for
point 1.
I do not see a reason to introduce complexity (especially on the fetch
response path). Did we consider that the client just sends the offsets it
wants to fetch, skips the topic-partition description, and we use the
order to match the information on the broker side again? This would also
reduce the fetch sizes a lot while skipping a ton of complexity.

Hi Jan,

We need to solve the problem of the fetch request taking
O(num_partitions) space and time to process.  A solution that keeps the
O(num_partitions) behavior, but improves it by a constant factor,
doesn't really solve the problem.  And omitting some partition
information, but leaving other partition information in place,
definitely falls in that category, wouldn't you agree?  Also, as others
have noted, if you omit the partition IDs, you run into a lot of
problems surrounding changes in the partition membership.

best,
Colin

Hi Colin,

I agree that a fetch request sending only offsets still grows with the
number of partitions. On processing time, I can only follow as far as
parsing is concerned; I don't see what extra work a broker has to do for
received offsets compared to cached offsets.

Given we still have the 100,000-partition case, a FetchRequest as I
suggest would safely get below 1 MB. How much of an improvement this
really is depends on your setup.


Say you have all of these in one topic: you are effectively saving maybe
50% already.

As you increase the number of topics, and depending on how long your
topic names are, you get extra savings.
In my playground cluster this is 160 topics, 10 partitions on average, 2
brokers on average, mean topic name length 54, and replication factor 2.
This would result in a saving of 5.5 bytes per topic-partition fetched. So
from currently 21.5 bytes per topic-partition it would go down to
basically 8, almost a 2/3 saving. On our production cluster, which has a
higher broker-to-replication-factor ratio, the savings are bigger. The
average number of replicated partitions per topic there is ~3. This is
roughly a 75% saving in fetch request size. For us, since we have many
slowly changing smaller topics, varint encoding of offsets would give
another big boost, as many offsets fit into 2-3 bytes.
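To make the varint point concrete, here is a minimal unsigned varint encoder (7 data bits per byte, high bit as continuation flag, the same general scheme protobuf uses). It is shown only to illustrate why small, slowly moving offsets fit in 1-3 bytes instead of a fixed 8-byte INT64; it is not Kafka's actual wire format.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned varint:
    7 bits per byte, most significant bit set on all but the last byte."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)
```

An offset like 100 encodes in 1 byte, 10,000 in 2 bytes, and 1,000,000 in 3 bytes, versus 8 bytes for a fixed-width INT64 per partition.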


I do not quite understand what it means to omit partition ids with
respect to changing ownership. The partition id can be retrieved by
ordinal position from the broker's cache. The broker serving the fetch
request should not care whether this consumer owns the partition in terms
of its group membership. If the broker is no longer the leader of the
partition, it can return "not leader for partition" as usual. Maybe you
can point me to where this has been explained, as I couldn't really find
a place where it became clear to me.
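Jan's ordinal-matching idea might be sketched like this. The function name and its return convention are invented for the sketch; it also makes visible the membership-change concern raised elsewhere in the thread: if the cached partition list and the offset list no longer line up, the broker cannot match them safely.

```python
def match_by_ordinal(cached_partitions, offsets):
    """Match a bare list of fetch offsets back to topic-partitions by
    position, using the ordered partition list the broker cached from the
    client's last full request.

    Returns a {(topic, partition): offset} dict, or None if the lists no
    longer line up (partition membership changed), in which case the
    client would have to fall back to a full FetchRequest."""
    if len(offsets) != len(cached_partitions):
        return None
    return dict(zip(cached_partitions, offsets))
```

The happy path saves the topic-partition description entirely; the failure path shows what extra protocol machinery a real design would still need.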

I think a saving of 75% and more is realistic and, even though still
linear in the number of partitions fetched, a very practical approach
that fits the design principle "the consumer decides" a lot better. I am
still trying to fully understand how the plan is to update the offsets
broker-side. No need to explain that here, as I think I know where to
look it up; I guess it introduces a lot of complexity with sendfile and
an additional index lookup, and I have a hard time believing it will pay
off, both in source code complexity and efficiency.


I intend to send you an answer on the other threads as soon as I get to
it. Hope this explains my view of the size trade-off well enough. Would
very much appreciate your opinion.

Best Jan




Hope these ideas are interesting

best Jan


On 01.12.2017 01:47, Becket Qin wrote:

Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems to introduce some complexity. It would be good if
we don't have to maintain the epoch.
2. If all the partitions have data returned (even a few messages), the next
fetch 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-02 Thread Becket Qin
Thanks for the explanation, Colin. A few more questions.

>The session epoch is not complex.  It's just a number which increments
>on each incremental fetch.  The session epoch is also useful for
>debugging-- it allows you to match up requests and responses when
>looking at log files.

Currently each request in Kafka has a correlation id to help match the
requests and responses. Is epoch doing something differently?

>Unfortunately, this doesn't work.  Imagine the client misses an
>incremental fetch response about a partition.  And then the partition is
>never updated after that.  The client has no way to know about the
>partition, since it won't be included in any future incremental fetch
>responses.  And there are no offsets to compare, since the partition is
>simply omitted from the response.

I am curious in which situation the follower would miss a response for
a partition. If the entire FetchResponse is lost (e.g. timeout), the
follower would disconnect and retry. That will result in sending a full
FetchRequest. If an error such as NotLeaderForPartition is
returned for some partitions, the follower can always send a full
FetchRequest. Is there a scenario in which only some of the partitions in
a FetchResponse are lost?

Thanks,

Jiangjie (Becket) Qin


On Sat, Dec 2, 2017 at 2:37 PM, Colin McCabe  wrote:

> On Fri, Dec 1, 2017, at 11:46, Dong Lin wrote:
> > On Thu, Nov 30, 2017 at 9:37 AM, Colin McCabe 
> wrote:
> >
> > > On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote:
> > > > Hey Colin,
> > > >
> > > > Thanks much for the update. I have a few questions below:
> > > >
> > > > 1. I am not very sure that we need Fetch Session Epoch. It seems that
> > > > Fetch
> > > > Session Epoch is only needed to help leader distinguish between "a
> full
> > > > fetch request" and "a full fetch request and request a new
> incremental
> > > > fetch session". Alternatively, follower can also indicate "a full
> fetch
> > > > request and request a new incremental fetch session" by setting Fetch
> > > > Session ID to -1 without using Fetch Session Epoch. Does this make
> sense?
> > >
> > > Hi Dong,
> > >
> > > The fetch session epoch is very important for ensuring correctness.  It
> > > prevents corrupted or incomplete fetch data due to network reordering
> or
> > > loss.
> > >
> > > For example, consider a scenario where the follower sends a fetch
> > > request to the leader.  The leader responds, but the response is lost
> > > because of network problems which affected the TCP session.  In that
> > > case, the follower must establish a new TCP session and re-send the
> > > incremental fetch request.  But the leader does not know that the
> > > follower didn't receive the previous incremental fetch response.  It is
> > > only the incremental fetch epoch which lets the leader know that it
> > > needs to resend that data, and not data which comes afterwards.
> > >
> > > You could construct similar scenarios with message reordering,
> > > duplication, etc.  Basically, this is a stateful protocol on an
> > > unreliable network, and you need to know whether the follower got the
> > > previous data you sent before you move on.  And you need to handle
> > > issues like duplicated or delayed requests.  These issues do not affect
> > > the full fetch request, because it is not stateful-- any full fetch
> > > request can be understood and properly responded to in isolation.
> > >
> >
> > Thanks for the explanation. This makes sense. On the other hand I would
> > be interested in learning more about whether Becket's solution can help
> > simplify the protocol by not having the echo field and whether that is
> > worth doing.
>
> Hi Dong,
>
> I commented about this in the other thread.  A solution which doesn't
> maintain session information doesn't work here.
>
> >
> >
> >
> > >
> > > >
> > > > 2. It is said that Incremental FetchRequest will include partitions
> whose
> > > > fetch offset or maximum number of fetch bytes has been changed. If
> > > > follower's logStartOffet of a partition has changed, should this
> > > > partition also be included in the next FetchRequest to the leader?
> > > Otherwise, it
> > > > may affect the handling of DeleteRecordsRequest because leader may
> not
> > > know
> > > > the corresponding data has been deleted on the follower.
> > >
> > > Yeah, the follower should include the partition if the logStartOffset
> > > has changed.  That should be spelled out on the KIP.  Fixed.
> > >
> > > >
> > > > 3. In the section "Per-Partition Data", a partition is not considered
> > > > dirty if its log start offset has changed. Later in the section
> > > "FetchRequest
> > > > Changes", it is said that incremental fetch responses will include a
> > > > partition if its logStartOffset has changed. It seems inconsistent.
> Can
> > > > you update the KIP to clarify it?
> > > >
> > >
> > > In the "Per-Partition Data" section, it does say that logStartOffset
> > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-02 Thread Colin McCabe
On Fri, Dec 1, 2017, at 11:46, Dong Lin wrote:
> On Thu, Nov 30, 2017 at 9:37 AM, Colin McCabe  wrote:
> 
> > On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote:
> > > Hey Colin,
> > >
> > > Thanks much for the update. I have a few questions below:
> > >
> > > 1. I am not very sure that we need Fetch Session Epoch. It seems that
> > > Fetch
> > > Session Epoch is only needed to help leader distinguish between "a full
> > > fetch request" and "a full fetch request and request a new incremental
> > > fetch session". Alternatively, follower can also indicate "a full fetch
> > > request and request a new incremental fetch session" by setting Fetch
> > > Session ID to -1 without using Fetch Session Epoch. Does this make sense?
> >
> > Hi Dong,
> >
> > The fetch session epoch is very important for ensuring correctness.  It
> > prevents corrupted or incomplete fetch data due to network reordering or
> > loss.
> >
> > For example, consider a scenario where the follower sends a fetch
> > request to the leader.  The leader responds, but the response is lost
> > because of network problems which affected the TCP session.  In that
> > case, the follower must establish a new TCP session and re-send the
> > incremental fetch request.  But the leader does not know that the
> > follower didn't receive the previous incremental fetch response.  It is
> > only the incremental fetch epoch which lets the leader know that it
> > needs to resend that data, and not data which comes afterwards.
> >
> > You could construct similar scenarios with message reordering,
> > duplication, etc.  Basically, this is a stateful protocol on an
> > unreliable network, and you need to know whether the follower got the
> > previous data you sent before you move on.  And you need to handle
> > issues like duplicated or delayed requests.  These issues do not affect
> > the full fetch request, because it is not stateful-- any full fetch
> > request can be understood and properly responded to in isolation.
> >
> 
> Thanks for the explanation. This makes sense. On the other hand I would
> be interested in learning more about whether Becket's solution can help
> simplify the protocol by not having the epoch field and whether that is
> worth doing.

Hi Dong,

I commented about this in the other thread.  A solution which doesn't
maintain session information doesn't work here.

> 
> 
> 
> >
> > >
> > > 2. It is said that Incremental FetchRequest will include partitions whose
> > > fetch offset or maximum number of fetch bytes has been changed. If
> > > follower's logStartOffet of a partition has changed, should this
> > > partition also be included in the next FetchRequest to the leader?
> > Otherwise, it
> > > may affect the handling of DeleteRecordsRequest because leader may not
> > know
> > > the corresponding data has been deleted on the follower.
> >
> > Yeah, the follower should include the partition if the logStartOffset
> > has changed.  That should be spelled out on the KIP.  Fixed.
> >
> > >
> > > 3. In the section "Per-Partition Data", a partition is not considered
> > > dirty if its log start offset has changed. Later in the section
> > "FetchRequest
> > > Changes", it is said that incremental fetch responses will include a
> > > partition if its logStartOffset has changed. It seems inconsistent. Can
> > > you update the KIP to clarify it?
> > >
> >
> > In the "Per-Partition Data" section, it does say that logStartOffset
> > changes make a partition dirty, though, right?  The first bullet point
> > is:
> >
> > > * The LogCleaner deletes messages, and this changes the log start offset
> > of the partition on the leader., or
> >
> 
> Ah I see. I think I didn't notice this because the statement assumes that
> the LogStartOffset in the leader only changes due to the LogCleaner. In fact
> the LogStartOffset can change on the leader due to either log retention or
> DeleteRecordsRequest. I haven't verified whether LogCleaner can change
> LogStartOffset though. It may be a bit better to just say that a
> partition is considered dirty if LogStartOffset changes.

I agree.  It should be straightforward to just resend the partition if
logStartOffset changes.
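The dirtiness rule agreed on here can be sketched as a simple comparison between the per-partition values cached for the fetch session and the current log state. The field names below are assumptions for illustration, not the KIP's exact schema; the point is that the log start offset is checked alongside the high watermark and log end offset.

```python
from collections import namedtuple

# Illustrative per-partition state tracked for an incremental fetch session.
PartitionState = namedtuple(
    "PartitionState", ["log_start_offset", "high_watermark", "log_end_offset"])


def is_dirty(cached: PartitionState, current: PartitionState) -> bool:
    """A partition must appear in the next incremental FetchResponse if
    anything the follower tracks has moved since the cached copy --
    including the log start offset, per the discussion above (retention,
    DeleteRecordsRequest, or the LogCleaner can all move it)."""
    return (cached.log_end_offset != current.log_end_offset
            or cached.high_watermark != current.high_watermark
            or cached.log_start_offset != current.log_start_offset)
```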

> 
> 
> >
> > > 4. In "Fetch Session Caching" section, it is said that each broker has a
> > > limited number of slots. How is this number determined? Does this require
> > > a new broker config for this number?
> >
> > Good point.  I added two broker configuration parameters to control this
> > number.
> >
> 
> I am curious to see whether we can avoid some of these new configs. For
> example, incremental.fetch.session.cache.slots.per.broker is probably not
> necessary because if a leader knows that a FetchRequest comes from a
> follower, we probably want the leader to always cache the information
> from that follower. Does this make sense?

Yeah, maybe we can avoid having
incremental.fetch.session.cache.slots.per.broker.
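The slot-limited cache behavior discussed here (and the answer to Dong's question 4 below: a full cache is not an error, the session id just stays 0) might be sketched as follows. This is a toy illustration under stated assumptions; eviction policy, per-broker accounting, and config names are deliberately left out.

```python
import itertools

class FetchSessionCache:
    """Bounded cache of incremental fetch sessions. When no slot is free,
    a new fetch simply gets session id 0, meaning no incremental session
    was created and the client keeps sending full fetch requests."""
    def __init__(self, max_slots: int):
        self.max_slots = max_slots
        self.sessions = {}                 # session_id -> cached partitions
        self._next_id = itertools.count(1) # session id 0 is reserved

    def maybe_create_session(self, partitions):
        if len(self.sessions) >= self.max_slots:
            return 0  # no slot available: not an error, just no session
        session_id = next(self._next_id)
        self.sessions[session_id] = dict(partitions)
        return session_id
```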

> 
> Maybe we can discuss the config later after there is agreement 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-02 Thread Colin McCabe
On Thu, Nov 30, 2017, at 23:29, Jan Filipiak wrote:
> Hi,
> 
> this discussion is going a little bit far from what I intended this 
> thread for.
> I can see all of this being related.
> 
> To let you guys know what I am currently thinking is the following:
> 
> I do think the handling of Id's and epoch is rather complicated. I think 
> the complexity
> comes from aiming for too much.
> 
> 1. Currently all the work is towards making fetchRequest
> completely empty. This brings all sorts of pain with regards to the 
> broker actually needs
> to know what he send even though it tries to use sendfile as much as 
> possible.
> 2. Currently all the work is towards also making empty fetch request 
> across TCP sessions.
> 
> In this thread I aimed to relax our goals with regards to point 2. 
> Connection resets for us
> are really the exceptions and I would argue, trying to introduce 
> complexity for sparing
> 1 full request on connection reset is not worth it. Therefore I argued 
> to keep the Server
> side information with the session instead of somewhere global. It's not 
> going to bring in the
> results.
> 
> As the discussion unfolds I also want to challenge our approach for 
> point 1.
> I do not see a reason to introduce complexity (and
>   especially on the fetch answer path). Did we consider that from the 
> client we just send the offsets
> we want to fetch and skip the topic partition description and just use 
> the order to match the information
> on the broker side again? This would also reduce the fetch sizes a lot 
> while skipping a ton of complexity.

Hi Jan,

We need to solve the problem of the fetch request taking
O(num_partitions) space and time to process.  A solution that keeps the
O(num_partitions) behavior, but improves it by a constant factor,
doesn't really solve the problem.  And omitting some partition
information, but leaving other partition information in place,
definitely falls in that category, wouldn't you agree?  Also, as others
have noted, if you omit the partition IDs, you run into a lot of
problems surrounding changes in the partition membership.

best,
Colin

> 
> Hope these ideas are interesting
> 
> best Jan
> 
> 
> On 01.12.2017 01:47, Becket Qin wrote:
> > Hi Colin,
> >
> > Thanks for updating the KIP. I have two comments:
> >
> > 1. The session epoch seems to introduce some complexity. It would be good if
> > we don't have to maintain the epoch.
> > 2. If all the partitions have data returned (even a few messages), the next
> > fetch would be equivalent to a full request. This means the clusters with
> > continuously small throughput may not save much from the incremental fetch.
> >
> > I am wondering if we can avoid session epoch maintenance and address the
> > fetch efficiency in general with some modifications to the solution. Not
> > sure if the following would work, but just want to give my ideas.
> >
> > To solve 1, the basic idea is to let the leader return the partition data
> > with its expected client's position for each partition. If the client
> > disagree with the leader's expectation, a full FetchRequest is then sent to
> > ask the leader to update the client's position.
> > To solve 2, when possible, we just let the leader to infer the clients
> > position instead of asking the clients to provide the position, so the
> > incremental fetch can be empty in most cases.
> >
> > More specifically, the protocol will have the following change.
> > 1. Add a new flag called FullFetch to the FetchRequest.
> > 1) A full FetchRequest is the same as the current FetchRequest with
> > FullFetch=true.
> > 2) An incremental FetchRequest is always empty with FullFetch=false.
> > 2. Add a new field called ExpectedPosition(INT64) to each partition data in
> > the FetchResponse.
> >
> > The leader logic:
> > 1. The leader keeps a map from client-id (client-uuid) to the interested
> > partitions of that client. For each interested partition, the leader keeps
> > the client's position for that client.
> > 2. When the leader receives a full fetch request (FullFetch=true), the
> > leader
> >  1) replaces the interested partitions for the client id with the
> > partitions in that full fetch request.
> >  2) updates the client position with the offset specified in that full
> > fetch request.
> >  3) if the client is a follower, update the high watermark, etc.
> > 3. When the leader receives an incremental fetch request (typically empty),
> > the leader returns the data from all the interested partitions (if any)
> > according to the position in the interested partitions map.
> > 4. In the FetchResponse, the leader will include an ExpectedFetchingOffset
> > that the leader thinks the client is fetching at. The value is the client
> > position of the partition in the interested partition map. This is just to
> > confirm with the client that the client position in the leader is correct.
> > 5. After sending back the FetchResponse, the leader updates the position of

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-02 Thread Colin McCabe
On Thu, Nov 30, 2017, at 16:47, Becket Qin wrote:
> Hi Colin,
> 
> Thanks for updating the KIP. I have two comments:
> 
> 1. The session epoch seems to introduce some complexity. It would be good
> if we don't have to maintain the epoch.

The session epoch is not complex.  It's just a number which increments
on each incremental fetch.  The session epoch is also useful for
debugging-- it allows you to match up requests and responses when
looking at log files.

> 2. If all the partitions have data returned (even a few messages), the
> next fetch would be equivalent to a full request. This means the clusters with
> continuously small throughput may not save much from the incremental
> fetch.

Clients who want to minimize latency will very often end up getting
mostly-empty FetchResponses.  If you want the lowest latency when using
acks=all, you will want all the followers to get FetchResponses even if
only a single byte has changed on a single partition.  Indeed, the
default value for replica.fetch.min.bytes is 1.  So we are very, very
eager about propagating changes back to the followers.

Even assuming that there is someone producing at full blast to every
single partition 24/7 (already an extremely unlikely scenario), if the
producers are using linger.ms > 0, they will be performing some
batching.  So it should still be common for followers to get fetch
responses that have no information about a lot of partitions, because a
batch just arrived for partition 1, but the producers are still
accumulating messages to send in the next batches for partitions 2, 3,
4, etc.

> 
> I am wondering if we can avoid session epoch maintenance and address the
> fetch efficiency in general with some modifications to the solution. Not
> sure if the following would work, but just want to give my ideas.
> 
> To solve 1, the basic idea is to let the leader return the partition data
> with its expected client's position for each partition. If the client
> disagree with the leader's expectation, a full FetchRequest is then sent
> to ask the leader to update the client's position.

Unfortunately, this doesn't work.  Imagine the client misses an
incremental fetch response about a partition.  And then the partition is
never updated after that.  The client has no way to know about the
partition, since it won't be included in any future incremental fetch
responses.  And there are no offsets to compare, since the partition is
simply omitted from the response.

best,
Colin

> To solve 2, when possible, we just let the leader to infer the clients
> position instead of asking the clients to provide the position, so the
> incremental fetch can be empty in most cases.
> 
> More specifically, the protocol will have the following change.
> 1. Add a new flag called FullFetch to the FetchRequest.
>1) A full FetchRequest is the same as the current FetchRequest with
> FullFetch=true.
>2) An incremental FetchRequest is always empty with FullFetch=false.
> 2. Add a new field called ExpectedPosition(INT64) to each partition data
> in
> the FetchResponse.
> 
> The leader logic:
> 1. The leader keeps a map from client-id (client-uuid) to the interested
> partitions of that client. For each interested partition, the leader
> keeps
> the client's position for that client.
> 2. When the leader receives a full fetch request (FullFetch=true), the
> leader
> 1) replaces the interested partitions for the client id with the
> partitions in that full fetch request.
> 2) updates the client position with the offset specified in that full
> fetch request.
> 3) if the client is a follower, update the high watermark, etc.
> 3. When the leader receives an incremental fetch request (typically
> empty),
> the leader returns the data from all the interested partitions (if any)
> according to the position in the interested partitions map.
> 4. In the FetchResponse, the leader will include an
> ExpectedFetchingOffset
> that the leader thinks the client is fetching at. The value is the client
> position of the partition in the interested partition map. This is just
> to
> confirm with the client that the client position in the leader is
> correct.
> 5. After sending back the FetchResponse, the leader updates the position
> of
> the client's interested partitions. (There may be some overhead for the
> leader to know of offsets, but I think the trick of returning at index
> entry boundary or log end will work efficiently).
> 6. The leader will expire the client interested partitions if the client
> hasn't fetch for some time. And if an incremental request is received
> when
> the map does not contain the client info, an error will be returned to
> the
> client to ask for a FullFetch.
> 
> The clients logic:
> 1. Start with sending a full FetchRequest, including partitions and
> offsets.
> 2. When get a response, check the ExpectedOffsets in the fetch response
> and
> see if that matches the current log end.
> 1) If the ExpectedFetchOffset matches the 
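Becket's leader-side bookkeeping (the "interested partitions" map with tracked client positions, quoted above) might be sketched like this. All names are invented for the sketch; it shows the full-fetch replacement step, serving an empty incremental fetch from tracked state, echoing ExpectedPosition, and optimistically advancing positions after the response (step 5).

```python
class LeaderFetchState:
    """Per-client state under Becket's proposal: the partitions the client
    is interested in, and the position the leader believes the client has
    reached in each of them."""
    def __init__(self):
        self.positions = {}  # (topic, partition) -> expected client offset

    def handle_full_fetch(self, requested):
        # FullFetch=true: replace the interested partitions and positions
        # with exactly what the request carries.
        self.positions = dict(requested)

    def handle_incremental_fetch(self, log_end_offsets):
        # Empty incremental request: serve data from every interested
        # partition whose log has advanced past the tracked position,
        # echoing expected_position so the client can detect disagreement
        # and fall back to a full fetch.
        response = {}
        for tp, pos in list(self.positions.items()):
            leo = log_end_offsets.get(tp, pos)
            if leo > pos:
                response[tp] = {"fetch_from": pos, "expected_position": pos}
                self.positions[tp] = leo  # optimistic advance (step 5)
        return response
```

A lost response is exactly the weak spot Colin points out: the leader advances its tracked position even if the client never receives the data, and nothing in a later (empty) incremental response lets the client discover the gap.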

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-01 Thread Dong Lin
On Thu, Nov 30, 2017 at 9:37 AM, Colin McCabe  wrote:

> On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote:
> > Hey Colin,
> >
> > Thanks much for the update. I have a few questions below:
> >
> > 1. I am not very sure that we need Fetch Session Epoch. It seems that
> > Fetch
> > Session Epoch is only needed to help leader distinguish between "a full
> > fetch request" and "a full fetch request and request a new incremental
> > fetch session". Alternatively, follower can also indicate "a full fetch
> > request and request a new incremental fetch session" by setting Fetch
> > Session ID to -1 without using Fetch Session Epoch. Does this make sense?
>
> Hi Dong,
>
> The fetch session epoch is very important for ensuring correctness.  It
> prevents corrupted or incomplete fetch data due to network reordering or
> loss.
>
> For example, consider a scenario where the follower sends a fetch
> request to the leader.  The leader responds, but the response is lost
> because of network problems which affected the TCP session.  In that
> case, the follower must establish a new TCP session and re-send the
> incremental fetch request.  But the leader does not know that the
> follower didn't receive the previous incremental fetch response.  It is
> only the incremental fetch epoch which lets the leader know that it
> needs to resend that data, and not data which comes afterwards.
>
> You could construct similar scenarios with message reordering,
> duplication, etc.  Basically, this is a stateful protocol on an
> unreliable network, and you need to know whether the follower got the
> previous data you sent before you move on.  And you need to handle
> issues like duplicated or delayed requests.  These issues do not affect
> the full fetch request, because it is not stateful-- any full fetch
> request can be understood and properly responded to in isolation.
>

Thanks for the explanation. This makes sense. On the other hand I would be
interested in learning more about whether Becket's solution can help
simplify the protocol by not having the echo field and whether that is
worth doing.



>
> >
> > 2. It is said that Incremental FetchRequest will include partitions whose
> > fetch offset or maximum number of fetch bytes has been changed. If
> > follower's logStartOffet of a partition has changed, should this
> > partition also be included in the next FetchRequest to the leader?
> Otherwise, it
> > may affect the handling of DeleteRecordsRequest because leader may not
> know
> > the corresponding data has been deleted on the follower.
>
> Yeah, the follower should include the partition if the logStartOffset
> has changed.  That should be spelled out on the KIP.  Fixed.
>
> >
> > 3. In the section "Per-Partition Data", a partition is not considered
> > dirty if its log start offset has changed. Later in the section
> "FetchRequest
> > Changes", it is said that incremental fetch responses will include a
> > partition if its logStartOffset has changed. It seems inconsistent. Can
> > you update the KIP to clarify it?
> >
>
> In the "Per-Partition Data" section, it does say that logStartOffset
> changes make a partition dirty, though, right?  The first bullet point
> is:
>
> > * The LogCleaner deletes messages, and this changes the log start offset
> of the partition on the leader., or
>

Ah I see. I think I didn't notice this because the statement assumes that
the LogStartOffset in the leader only changes due to the LogCleaner. In fact
the LogStartOffset can change on the leader due to either log retention or
DeleteRecordsRequest. I haven't verified whether LogCleaner can change
LogStartOffset though. It may be a bit better to just say that a partition
is considered dirty if LogStartOffset changes.


>
> > 4. In "Fetch Session Caching" section, it is said that each broker has a
> > limited number of slots. How is this number determined? Does this require
> > a new broker config for this number?
>
> Good point.  I added two broker configuration parameters to control this
> number.
>

I am curious to see whether we can avoid some of these new configs. For
example, incremental.fetch.session.cache.slots.per.broker is probably not
necessary because if a leader knows that a FetchRequest comes from a
follower, we probably want the leader to always cache the information from
that follower. Does this make sense?

Maybe we can discuss the config later after there is agreement on how the
protocol would look like.


>
> > What is the error code if broker does
> > not have new log for the incoming FetchRequest?
>
> Hmm, is there a typo in this question?  Maybe you meant to ask what
> happens if there is no new cache slot for the incoming FetchRequest?
> That's not an error-- the incremental fetch session ID just gets set to
> 0, indicating no incremental fetch session was created.
>

Yeah there is a typo. You have answered my question.


>
> >
> > 5. Can you clarify what happens if follower adds a partition to the
> > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-01 Thread Jan Filipiak

BTW:

the shuffle problem would exist in all our solutions. An empty fetch
request has the same issue of what order to serve the topics and
partitions in. So my suggestion does not introduce this problem.


Best Jan

On 01.12.2017 08:29, Jan Filipiak wrote:

Hi,

this discussion is going a little bit far from what I intended this
thread for.

I can see all of this being related.

To let you guys know, what I am currently thinking is the following:

I do think the handling of IDs and epochs is rather complicated. I
think the complexity comes from aiming for too much.

1. Currently all the work is towards making the FetchRequest
completely empty. This brings all sorts of pain, because the broker
needs to know what it sent even though it tries to use sendfile as much
as possible.
2. Currently all the work is also towards keeping the fetch request
empty across TCP sessions.

In this thread I aimed to relax our goals with regards to point 2.
Connection resets for us are really the exception, and I would argue
that introducing complexity to spare one full request on a connection
reset is not worth it. Therefore I argued to keep the server-side
information with the session instead of somewhere global. It's not
going to bring in the results.

As the discussion unfolds I also want to challenge our approach for
point 1. I do not see a reason to introduce complexity (especially on
the fetch answer path). Did we consider having the client just send the
offsets it wants to fetch, skipping the topic and partition
descriptions, and using the order to match the information on the
broker side again? This would also reduce the fetch sizes a lot while
skipping a ton of complexity.
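To make the order-based matching idea concrete, here is a minimal sketch. All names and structures are illustrative assumptions; it presumes the broker keeps the session's ordered partition list, which is not anything specified in the KIP:

```python
# Sketch of Jan's suggestion: the client omits the topic/partition
# descriptions and sends only an ordered list of offsets; the broker zips
# them against the ordered partition list it already holds for the session.

def build_incremental_request(fetch_offsets):
    """Client side: just the offsets, in the session's agreed partition order."""
    return {"offsets": list(fetch_offsets)}

def match_request(session_partitions, request):
    """Broker side: recover (topic, partition, offset) tuples by position."""
    offsets = request["offsets"]
    if len(offsets) != len(session_partitions):
        # The orders no longer line up; the client must send a full request.
        raise ValueError("order mismatch: full fetch request required")
    return [(topic, part, off)
            for (topic, part), off in zip(session_partitions, offsets)]

session = [("logs", 0), ("logs", 1), ("metrics", 0)]
req = build_incremental_request([100, 250, 7])
assert match_request(session, req) == [
    ("logs", 0, 100), ("logs", 1, 250), ("metrics", 0, 7)]
```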


Hope these ideas are interesting

best Jan


On 01.12.2017 01:47, Becket Qin wrote:

Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems to introduce some complexity. It would be good if
we didn't have to maintain the epoch.
2. If all the partitions have data returned (even a few messages), the next
fetch would be equivalent to a full request. This means clusters with
continuously small throughput may not save much from the incremental fetch.


I am wondering if we can avoid session epoch maintenance and address the
fetch efficiency in general with some modifications to the solution. Not
sure if the following would work, but just want to give my ideas.

To solve 1, the basic idea is to let the leader return the partition data
with its expected client position for each partition. If the client
disagrees with the leader's expectation, a full FetchRequest is then sent to
ask the leader to update the client's position.
To solve 2, when possible, we just let the leader infer the client's
position instead of asking the client to provide it, so the
incremental fetch can be empty in most cases.

More specifically, the protocol will have the following change.
1. Add a new flag called FullFetch to the FetchRequest.
1) A full FetchRequest is the same as the current FetchRequest with
FullFetch=true.
2) An incremental FetchRequest is always empty with FullFetch=false.
2. Add a new field called ExpectedPosition(INT64) to each partition 
data in

the FetchResponse.

The leader logic:
1. The leader keeps a map from client-id (client-uuid) to the interested
partitions of that client. For each interested partition, the leader 
keeps

the client's position for that client.
2. When the leader receives a full fetch request (FullFetch=true), the
leader
 1) replaces the interested partitions for the client id with the
partitions in that full fetch request.
 2) updates the client position with the offset specified in that 
full

fetch request.
 3) if the client is a follower, update the high watermark, etc.
3. When the leader receives an incremental fetch request (typically 
empty),

the leader returns the data from all the interested partitions (if any)
according to the position in the interested partitions map.
4. In the FetchResponse, the leader will include an 
ExpectedFetchingOffset
that the leader thinks the client is fetching at. The value is the 
client
position of the partition in the interested partition map. This is 
just to
confirm with the client that the client position in the leader is 
correct.
5. After sending back the FetchResponse, the leader updates the position of
the client's interested partitions. (There may be some overhead for the
leader to know the offsets, but I think the trick of returning at an index
entry boundary or the log end will work efficiently.)
6. The leader will expire the client's interested partitions if the client
hasn't fetched for some time. And if an incremental request is received when
the map does not contain the client info, an error will be returned to the
client to ask for a FullFetch.

The clients logic:
1. Start with sending a full FetchRequest, including partitions and 
offsets.
2. When it gets a response, check the 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-12-01 Thread Jan Filipiak

Hi,

good catch about the rotation.
This is probably not too big a blocker. Plenty of ideas spring to mind
for how this could be done; maybe one could offer different algorithms here
(nothing, a random shuffle, a client-sent bitmask of which partitions to
fetch first, broker-side logic... many more).


Thank you for considering my ideas. I am pretty convinced we don't need
to aim for the 100% empty fetch request across TCP sessions. Maybe my ideas
offer decent tradeoffs.

Best Jan





On 01.12.2017 08:43, Becket Qin wrote:

Hi Jan,

I agree that we probably don't want to make the protocol too complicated
just for exception cases.

The current FetchRequest contains an ordered list of partitions that may
rotate based on priority, so it is somewhat difficult to do the order
matching. But you brought up a good point about order: we may want to
migrate the rotation logic from the clients to the server. I am not sure
whether this will introduce complexity to the broker; intuitively it seems fine.
The logic would basically be similar to the draining logic in the
RecordAccumulator of the producer.
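As a rough illustration of the server-side rotation mentioned above (analogous in spirit to the draining logic in the producer's RecordAccumulator; the class and all names below are invented for this sketch):

```python
# Illustrative server-side rotation: each serve starts from a moving index
# so no partition is permanently starved when response size limits truncate
# the list before it is fully served.

class PartitionRotator:
    def __init__(self, partitions):
        self.partitions = list(partitions)
        self.start = 0  # index of the partition served first on the next pass

    def next_order(self):
        # Rotate the list so a different partition leads each time.
        order = self.partitions[self.start:] + self.partitions[:self.start]
        self.start = (self.start + 1) % len(self.partitions)
        return order

rot = PartitionRotator(["p0", "p1", "p2"])
assert rot.next_order() == ["p0", "p1", "p2"]
assert rot.next_order() == ["p1", "p2", "p0"]  # p1 leads on the second pass
```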

Thanks,

Jiangjie (Becket) Qin

On Thu, Nov 30, 2017 at 11:29 PM, Jan Filipiak 
wrote:


Hi,

this discussion is going a little bit far from what I intended this thread
for.
I can see all of this being related.

To let you guys know, what I am currently thinking is the following:

I do think the handling of IDs and epochs is rather complicated. I think
the complexity comes from aiming for too much.

1. Currently all the work is towards making the FetchRequest
completely empty. This brings all sorts of pain, because the broker
needs to know what it sent even though it tries to use sendfile as much as
possible.
2. Currently all the work is also towards keeping the fetch request empty
across TCP sessions.

In this thread I aimed to relax our goals with regards to point 2.
Connection resets for us are really the exception, and I would argue that
introducing complexity to spare one full request on a connection reset is
not worth it. Therefore I argued to keep the server-side information with
the session instead of somewhere global. It's not going to bring in the
results.

As the discussion unfolds I also want to challenge our approach for point
1. I do not see a reason to introduce complexity (especially on the fetch
answer path). Did we consider having the client just send the offsets it
wants to fetch, skipping the topic and partition descriptions, and using
the order to match the information on the broker side again? This would
also reduce the fetch sizes a lot while skipping a ton of complexity.

Hope these ideas are interesting

best Jan



On 01.12.2017 01:47, Becket Qin wrote:


Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems introducing some complexity. It would be good
if
we don't have to maintain the epoch.
2. If all the partitions has data returned (even a few messages), the next
fetch would be equivalent to a full request. This means the clusters with
continuously small throughput may not save much from the incremental
fetch.

I am wondering if we can avoid session epoch maintenance and address the
fetch efficiency in general with some modifications to the solution. Not
sure if the following would work, but just want to give my ideas.

To solve 1, the basic idea is to let the leader return the partition data
with its expected client's position for each partition. If the client
disagree with the leader's expectation, a full FetchRequest is then sent
to
ask the leader to update the client's position.
To solve 2, when possible, we just let the leader to infer the clients
position instead of asking the clients to provide the position, so the
incremental fetch can be empty in most cases.

More specifically, the protocol will have the following change.
1. Add a new flag called FullFetch to the FetchRequest.
 1) A full FetchRequest is the same as the current FetchRequest with
FullFetch=true.
 2) An incremental FetchRequest is always empty with FullFetch=false.
2. Add a new field called ExpectedPosition(INT64) to each partition data
in
the FetchResponse.

The leader logic:
1. The leader keeps a map from client-id (client-uuid) to the interested
partitions of that client. For each interested partition, the leader keeps
the client's position for that client.
2. When the leader receives a full fetch request (FullFetch=true), the
leader
  1) replaces the interested partitions for the client id with the
partitions in that full fetch request.
  2) updates the client position with the offset specified in that full
fetch request.
  3) if the client is a follower, update the high watermark, etc.
3. When the leader receives an incremental fetch request (typically
empty),
the leader returns the data from all the interested partitions (if any)
according to the position in the interested partitions map.
4. In the FetchResponse, the leader will 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-30 Thread Becket Qin
Hi Jan,

I agree that we probably don't want to make the protocol too complicated
just for exception cases.

The current FetchRequest contains an ordered list of partitions that may
rotate based on priority, so it is somewhat difficult to do the order
matching. But you brought up a good point about order: we may want to
migrate the rotation logic from the clients to the server. I am not sure
whether this will introduce complexity to the broker; intuitively it seems fine.
The logic would basically be similar to the draining logic in the
RecordAccumulator of the producer.

Thanks,

Jiangjie (Becket) Qin

On Thu, Nov 30, 2017 at 11:29 PM, Jan Filipiak 
wrote:

> Hi,
>
> this discussion is going a little bit far from what I intended this thread
> for.
> I can see all of this being related.
>
> To let you guys know, what I am currently thinking is the following:
>
> I do think the handling of IDs and epochs is rather complicated. I think
> the complexity comes from aiming for too much.
>
> 1. Currently all the work is towards making the FetchRequest
> completely empty. This brings all sorts of pain, because the broker
> needs to know what it sent even though it tries to use sendfile as much as
> possible.
> 2. Currently all the work is also towards keeping the fetch request
> empty across TCP sessions.
>
> In this thread I aimed to relax our goals with regards to point 2.
> Connection resets for us are really the exception, and I would argue that
> introducing complexity to spare one full request on a connection reset is
> not worth it. Therefore I argued to keep the server-side information with
> the session instead of somewhere global. It's not going to bring in the
> results.
>
> As the discussion unfolds I also want to challenge our approach for point
> 1. I do not see a reason to introduce complexity (especially on the fetch
> answer path). Did we consider having the client just send the offsets it
> wants to fetch, skipping the topic and partition descriptions, and using
> the order to match the information on the broker side again? This would
> also reduce the fetch sizes a lot while skipping a ton of complexity.
>
> Hope these ideas are interesting
>
> best Jan
>
>
>
> On 01.12.2017 01:47, Becket Qin wrote:
>
>> Hi Colin,
>>
>> Thanks for updating the KIP. I have two comments:
>>
>> 1. The session epoch seems to introduce some complexity. It would be good
>> if we didn't have to maintain the epoch.
>> 2. If all the partitions have data returned (even a few messages), the next
>> fetch would be equivalent to a full request. This means clusters with
>> continuously small throughput may not save much from the incremental
>> fetch.
>>
>> I am wondering if we can avoid session epoch maintenance and address the
>> fetch efficiency in general with some modifications to the solution. Not
>> sure if the following would work, but just want to give my ideas.
>>
>> To solve 1, the basic idea is to let the leader return the partition data
>> with its expected client position for each partition. If the client
>> disagrees with the leader's expectation, a full FetchRequest is then sent
>> to ask the leader to update the client's position.
>> To solve 2, when possible, we just let the leader infer the client's
>> position instead of asking the client to provide it, so the
>> incremental fetch can be empty in most cases.
>>
>> More specifically, the protocol will have the following change.
>> 1. Add a new flag called FullFetch to the FetchRequest.
>> 1) A full FetchRequest is the same as the current FetchRequest with
>> FullFetch=true.
>> 2) An incremental FetchRequest is always empty with FullFetch=false.
>> 2. Add a new field called ExpectedPosition(INT64) to each partition data
>> in
>> the FetchResponse.
>>
>> The leader logic:
>> 1. The leader keeps a map from client-id (client-uuid) to the interested
>> partitions of that client. For each interested partition, the leader keeps
>> the client's position for that client.
>> 2. When the leader receives a full fetch request (FullFetch=true), the
>> leader
>>  1) replaces the interested partitions for the client id with the
>> partitions in that full fetch request.
>>  2) updates the client position with the offset specified in that full
>> fetch request.
>>  3) if the client is a follower, update the high watermark, etc.
>> 3. When the leader receives an incremental fetch request (typically
>> empty),
>> the leader returns the data from all the interested partitions (if any)
>> according to the position in the interested partitions map.
>> 4. In the FetchResponse, the leader will include an ExpectedFetchingOffset
>> that the leader thinks the client is fetching at. The value is the client
>> position of the partition in the interested partition map. This is just to
>> confirm with the client that the client position in the leader is correct.
>> 5. After sending back the 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-30 Thread Jan Filipiak

Hi,

this discussion is going a little bit far from what I intended this
thread for.

I can see all of this being related.

To let you guys know, what I am currently thinking is the following:

I do think the handling of IDs and epochs is rather complicated. I
think the complexity comes from aiming for too much.

1. Currently all the work is towards making the FetchRequest
completely empty. This brings all sorts of pain, because the broker
needs to know what it sent even though it tries to use sendfile as much
as possible.
2. Currently all the work is also towards keeping the fetch request
empty across TCP sessions.

In this thread I aimed to relax our goals with regards to point 2.
Connection resets for us are really the exception, and I would argue
that introducing complexity to spare one full request on a connection
reset is not worth it. Therefore I argued to keep the server-side
information with the session instead of somewhere global. It's not
going to bring in the results.

As the discussion unfolds I also want to challenge our approach for
point 1. I do not see a reason to introduce complexity (especially on
the fetch answer path). Did we consider having the client just send the
offsets it wants to fetch, skipping the topic and partition
descriptions, and using the order to match the information on the
broker side again? This would also reduce the fetch sizes a lot while
skipping a ton of complexity.


Hope these ideas are interesting

best Jan


On 01.12.2017 01:47, Becket Qin wrote:

Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems to introduce some complexity. It would be good if
we didn't have to maintain the epoch.
2. If all the partitions have data returned (even a few messages), the next
fetch would be equivalent to a full request. This means clusters with
continuously small throughput may not save much from the incremental fetch.

I am wondering if we can avoid session epoch maintenance and address the
fetch efficiency in general with some modifications to the solution. Not
sure if the following would work, but just want to give my ideas.

To solve 1, the basic idea is to let the leader return the partition data
with its expected client position for each partition. If the client
disagrees with the leader's expectation, a full FetchRequest is then sent to
ask the leader to update the client's position.
To solve 2, when possible, we just let the leader infer the client's
position instead of asking the client to provide it, so the
incremental fetch can be empty in most cases.

More specifically, the protocol will have the following change.
1. Add a new flag called FullFetch to the FetchRequest.
1) A full FetchRequest is the same as the current FetchRequest with
FullFetch=true.
2) An incremental FetchRequest is always empty with FullFetch=false.
2. Add a new field called ExpectedPosition(INT64) to each partition data in
the FetchResponse.

The leader logic:
1. The leader keeps a map from client-id (client-uuid) to the interested
partitions of that client. For each interested partition, the leader keeps
the client's position for that client.
2. When the leader receives a full fetch request (FullFetch=true), the
leader
 1) replaces the interested partitions for the client id with the
partitions in that full fetch request.
 2) updates the client position with the offset specified in that full
fetch request.
 3) if the client is a follower, update the high watermark, etc.
3. When the leader receives an incremental fetch request (typically empty),
the leader returns the data from all the interested partitions (if any)
according to the position in the interested partitions map.
4. In the FetchResponse, the leader will include an ExpectedFetchingOffset
that the leader thinks the client is fetching at. The value is the client
position of the partition in the interested partition map. This is just to
confirm with the client that the client position in the leader is correct.
5. After sending back the FetchResponse, the leader updates the position of
the client's interested partitions. (There may be some overhead for the
leader to know the offsets, but I think the trick of returning at an index
entry boundary or the log end will work efficiently.)
6. The leader will expire the client's interested partitions if the client
hasn't fetched for some time. And if an incremental request is received when
the map does not contain the client info, an error will be returned to the
client to ask for a FullFetch.

The clients logic:
1. Start with sending a full FetchRequest, including partitions and offsets.
2. When it gets a response, check the ExpectedOffsets in the fetch response and
see if that matches the current log end.
 1) If the ExpectedFetchOffset matches the current log end, the next
fetch request will be an incremental fetch request.
 2) if the ExpectedFetchOffset does not match the current log end, the
next 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-30 Thread Becket Qin
Hi Colin,

Thanks for updating the KIP. I have two comments:

1. The session epoch seems to introduce some complexity. It would be good if
we didn't have to maintain the epoch.
2. If all the partitions have data returned (even a few messages), the next
fetch would be equivalent to a full request. This means clusters with
continuously small throughput may not save much from the incremental fetch.

I am wondering if we can avoid session epoch maintenance and address the
fetch efficiency in general with some modifications to the solution. Not
sure if the following would work, but just want to give my ideas.

To solve 1, the basic idea is to let the leader return the partition data
with its expected client position for each partition. If the client
disagrees with the leader's expectation, a full FetchRequest is then sent to
ask the leader to update the client's position.
To solve 2, when possible, we just let the leader infer the client's
position instead of asking the client to provide it, so the
incremental fetch can be empty in most cases.

More specifically, the protocol will have the following change.
1. Add a new flag called FullFetch to the FetchRequest.
   1) A full FetchRequest is the same as the current FetchRequest with
FullFetch=true.
   2) An incremental FetchRequest is always empty with FullFetch=false.
2. Add a new field called ExpectedPosition(INT64) to each partition data in
the FetchResponse.

The leader logic:
1. The leader keeps a map from client-id (client-uuid) to the interested
partitions of that client. For each interested partition, the leader keeps
the client's position for that client.
2. When the leader receives a full fetch request (FullFetch=true), the
leader
1) replaces the interested partitions for the client id with the
partitions in that full fetch request.
2) updates the client position with the offset specified in that full
fetch request.
3) if the client is a follower, update the high watermark, etc.
3. When the leader receives an incremental fetch request (typically empty),
the leader returns the data from all the interested partitions (if any)
according to the position in the interested partitions map.
4. In the FetchResponse, the leader will include an ExpectedFetchingOffset
that the leader thinks the client is fetching at. The value is the client
position of the partition in the interested partition map. This is just to
confirm with the client that the client position in the leader is correct.
5. After sending back the FetchResponse, the leader updates the position of
the client's interested partitions. (There may be some overhead for the
leader to know the offsets, but I think the trick of returning at an index
entry boundary or the log end will work efficiently.)
6. The leader will expire the client's interested partitions if the client
hasn't fetched for some time. And if an incremental request is received when
the map does not contain the client info, an error will be returned to the
client to ask for a FullFetch.

The clients logic:
1. Start with sending a full FetchRequest, including partitions and offsets.
2. When it gets a response, check the ExpectedOffsets in the fetch response and
see if that matches the current log end.
1) If the ExpectedFetchOffset matches the current log end, the next
fetch request will be an incremental fetch request.
2) if the ExpectedFetchOffset does not match the current log end, the
next fetch request will be a full FetchRequest.
3. Whenever the partition offset is actively changed (e.g. consumer.seek(),
follower log truncation, etc), a full FetchRequest is sent.
4. Whenever the interested partition set changes (e.g.
consumer.subscribe()/assign() is called, replica reassignment happens), a
full FetchRequest is sent.
5. Whenever the client needs to retry a fetch, a FullFetch is sent.

The benefits of this approach are:
1. Regardless of the traffic pattern in the cluster, in most cases the
fetch request will be empty.
2. No need to maintain session epochs.
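A minimal sketch of the leader-side bookkeeping proposed above, using simple in-memory structures (all names are illustrative assumptions, not a definitive implementation or the KIP's wire format):

```python
# Sketch of the proposal's leader state: a map from client id to that
# client's interested partitions and expected fetch positions.

class LeaderFetchState:
    def __init__(self):
        # client_id -> {(topic, partition): expected_fetch_position}
        self.sessions = {}

    def handle_full_fetch(self, client_id, partitions):
        """FullFetch=true: replace the interested set and positions wholesale."""
        self.sessions[client_id] = dict(partitions)

    def handle_incremental_fetch(self, client_id):
        """FullFetch=false (empty request): serve from remembered positions.
        An unknown client (e.g. expired state) must be told to send a FullFetch."""
        if client_id not in self.sessions:
            raise KeyError("unknown client; respond with an error asking for FullFetch")
        return dict(self.sessions[client_id])

    def advance(self, client_id, topic_partition, new_position):
        """After sending a response, record where the client should now be."""
        self.sessions[client_id][topic_partition] = new_position

state = LeaderFetchState()
state.handle_full_fetch("follower-1", {("logs", 0): 100})
state.advance("follower-1", ("logs", 0), 180)
assert state.handle_incremental_fetch("follower-1") == {("logs", 0): 180}
```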

What do you think?

Thanks,

Jiangjie (Becket) Qin


On Thu, Nov 30, 2017 at 9:37 AM, Colin McCabe  wrote:

> On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote:
> > Hey Colin,
> >
> > Thanks much for the update. I have a few questions below:
> >
> > 1. I am not very sure that we need Fetch Session Epoch. It seems that
> > Fetch
> > Session Epoch is only needed to help leader distinguish between "a full
> > fetch request" and "a full fetch request and request a new incremental
> > fetch session". Alternatively, follower can also indicate "a full fetch
> > request and request a new incremental fetch session" by setting Fetch
> > Session ID to -1 without using Fetch Session Epoch. Does this make sense?
>
> Hi Dong,
>
> The fetch session epoch is very important for ensuring correctness.  It
> prevents corrupted or incomplete fetch data due to network reordering or
> loss.
>
> For example, consider a scenario where the follower sends a fetch
> request 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-30 Thread Colin McCabe
On Wed, Nov 29, 2017, at 18:59, Dong Lin wrote:
> Hey Colin,
> 
> Thanks much for the update. I have a few questions below:
> 
> 1. I am not very sure that we need Fetch Session Epoch. It seems that
> Fetch
> Session Epoch is only needed to help leader distinguish between "a full
> fetch request" and "a full fetch request and request a new incremental
> fetch session". Alternatively, follower can also indicate "a full fetch
> request and request a new incremental fetch session" by setting Fetch
> Session ID to -1 without using Fetch Session Epoch. Does this make sense?

Hi Dong,

The fetch session epoch is very important for ensuring correctness.  It
prevents corrupted or incomplete fetch data due to network reordering or
loss.

For example, consider a scenario where the follower sends a fetch
request to the leader.  The leader responds, but the response is lost
because of network problems which affected the TCP session.  In that
case, the follower must establish a new TCP session and re-send the
incremental fetch request.  But the leader does not know that the
follower didn't receive the previous incremental fetch response.  It is
only the incremental fetch epoch which lets the leader know that it
needs to resend that data, and not data which comes afterwards.

You could construct similar scenarios with message reordering,
duplication, etc.  Basically, this is a stateful protocol on an
unreliable network, and you need to know whether the follower got the
previous data you sent before you move on.  And you need to handle
issues like duplicated or delayed requests.  These issues do not affect
the full fetch request, because it is not stateful-- any full fetch
request can be understood and properly responded to in isolation.
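As a toy illustration of the sequencing argument above, here is a simplified model in which the leader caches its last response so a retried request carrying the same epoch gets the same data back. The names and the response-caching behavior are assumptions made for illustration, not the KIP's actual mechanism:

```python
# Toy model of the fetch session epoch. The leader tracks the last epoch it
# served; a request that reuses that epoch means the follower never saw the
# response, so the same data is returned again instead of being skipped.

class FetchSession:
    def __init__(self):
        self.last_epoch = 0
        self.last_response = None

    def handle(self, epoch, build_new_response):
        if epoch == self.last_epoch:
            # Retry after a lost response: resend rather than advance.
            return self.last_response
        if epoch == self.last_epoch + 1:
            # In-order request: serve fresh data and move the epoch forward.
            self.last_response = build_new_response()
            self.last_epoch = epoch
            return self.last_response
        # Reordered or duplicated beyond repair: force a full fetch request.
        raise RuntimeError("invalid session epoch: send a full request")

s = FetchSession()
assert s.handle(1, lambda: "batch-1") == "batch-1"
assert s.handle(2, lambda: "batch-2") == "batch-2"
# The follower's retry reuses epoch 2 and receives the same batch again:
assert s.handle(2, lambda: "never built") == "batch-2"
```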

> 
> 2. It is said that Incremental FetchRequest will include partitions whose
> fetch offset or maximum number of fetch bytes has been changed. If
> follower's logStartOffset of a partition has changed, should this
> partition also be included in the next FetchRequest to the leader? Otherwise,
> it may affect the handling of DeleteRecordsRequest because the leader may not
> know the corresponding data has been deleted on the follower.

Yeah, the follower should include the partition if the logStartOffset
has changed.  That should be spelled out on the KIP.  Fixed.

> 
> 3. In the section "Per-Partition Data", a partition is not considered
> dirty if its log start offset has changed. Later in the section "FetchRequest
> Changes", it is said that incremental fetch responses will include a
> partition if its logStartOffset has changed. It seems inconsistent. Can
> you update the KIP to clarify it?
> 

In the "Per-Partition Data" section, it does say that logStartOffset
changes make a partition dirty, though, right?  The first bullet point
is:

> * The LogCleaner deletes messages, and this changes the log start offset of 
> the partition on the leader, or

> 4. In "Fetch Session Caching" section, it is said that each broker has a
> limited number of slots. How is this number determined? Does this require
> a new broker config for this number?

Good point.  I added two broker configuration parameters to control this
number.

> What is the error code if broker does
> not have new log for the incoming FetchRequest?

Hmm, is there a typo in this question?  Maybe you meant to ask what
happens if there is no new cache slot for the incoming FetchRequest? 
That's not an error-- the incremental fetch session ID just gets set to
0, indicating no incremental fetch session was created.

> 
> 5. Can you clarify what happens if follower adds a partition to the
> ReplicaFetcherThread after receiving a LeaderAndIsrRequest? Does the leader
> need to generate a new session for this ReplicaFetcherThread or does it
> re-use the existing session?  If it uses a new session, is the old session
> actively deleted from the slot?

The basic idea is that you can't make changes, except by sending a full
fetch request.  However, perhaps we can allow the client to re-use its
existing session ID.  If the client sets sessionId = id, epoch = 0, it
could re-initialize the session.

> 
> 
> BTW, I think it may be useful if the KIP can include the example workflow
> of how this feature will be used in case of partition change and so on.

Yeah, that might help.

best,
Colin

> 
> Thanks,
> Dong
> 
> 
> On Wed, Nov 29, 2017 at 12:13 PM, Colin McCabe 
> wrote:
> 
> > I updated the KIP with the ideas we've been discussing.
> >
> > best,
> > Colin
> >
> > On Tue, Nov 28, 2017, at 08:38, Colin McCabe wrote:
> > > On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote:
> > > > Hi Colin, thank you  for this KIP, it can become a really useful thing.
> > > >
> > > > I just scanned through the discussion so far and wanted to start a
> > > > thread to make a decision about keeping the
> > > > cache with the Connection / Session or having some sort of UUID-indexed
> > > > global Map.
> > > >
> > > > Sorry if that has 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-29 Thread Dong Lin
Hey Colin,

Thanks much for the update. I have a few questions below:

1. I am not very sure that we need Fetch Session Epoch. It seems that Fetch
Session Epoch is only needed to help leader distinguish between "a full
fetch request" and "a full fetch request and request a new incremental
fetch session". Alternatively, follower can also indicate "a full fetch
request and request a new incremental fetch session" by setting Fetch
Session ID to -1 without using Fetch Session Epoch. Does this make sense?

2. It is said that Incremental FetchRequest will include partitions whose
fetch offset or maximum number of fetch bytes has been changed. If
follower's logStartOffset of a partition has changed, should this partition
also be included in the next FetchRequest to the leader? Otherwise, it may
affect the handling of DeleteRecordsRequest because leader may not know the
corresponding data has been deleted on the follower.

3. In the section "Per-Partition Data", a partition is not considered dirty
if its log start offset has changed. Later in the section "FetchRequest
Changes", it is said that incremental fetch responses will include a
partition if its logStartOffset has changed. It seems inconsistent. Can you
update the KIP to clarify it?

4. In "Fetch Session Caching" section, it is said that each broker has a
limited number of slots. How is this number determined? Does this require a
new broker config for this number? What is the error code if broker does
not have new log for the incoming FetchRequest?

5. Can you clarify what happens if follower adds a partition to the
ReplicaFetcherThread after receiving LeaderAndIsrRequest? Does leader needs
to generate a new session for this ReplicaFetcherThread or does it re-use
the existing session? If it uses a new session, is the old session actively
deleted from the slot?


BTW, I think it may be useful if the KIP can include the example workflow
of how this feature will be used in case of partition change and so on.

Thanks,
Dong


On Wed, Nov 29, 2017 at 12:13 PM, Colin McCabe  wrote:

> I updated the KIP with the ideas we've been discussing.
>
> best,
> Colin
>
> On Tue, Nov 28, 2017, at 08:38, Colin McCabe wrote:
> > On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote:
> > > Hi Colin, thank you  for this KIP, it can become a really useful thing.
> > >
> > > I just scanned through the discussion so far and wanted to start a
> > > thread to make as decision about keeping the
> > > cache with the Connection / Session or having some sort of UUID indN
> exed
> > > global Map.
> > >
> > > Sorry if that has been settled already and I missed it. In this case
> > > could anyone point me to the discussion?
> >
> > Hi Jan,
> >
> > I don't think anyone has discussed the idea of tying the cache to an
> > individual TCP session yet.  I agree that since the cache is intended to
> > be used only by a single follower or client, it's an interesting thing
> > to think about.
> >
> > I guess the obvious disadvantage is that whenever your TCP session
> > drops, you have to make a full fetch request rather than an incremental
> > one.  It's not clear to me how often this happens in practice -- it
> > probably depends a lot on the quality of the network.  From a code
> > perspective, it might also be a bit difficult to access data associated
> > with the Session from classes like KafkaApis (although we could refactor
> > it to make this easier).
> >
> > It's also clear that even if we tie the cache to the session, we still
> > have to have limits on the number of caches we're willing to create.
> > And probably we should reserve some cache slots for each follower, so
> > that clients don't take all of them.
> >
> > >
> > > Id rather see a protocol in which the client is hinting the broker
> that,
> > > he is going to use the feature instead of a client
> > > realizing that the broker just offered the feature (regardless of
> > > protocol version which should only indicate that the feature
> > > would be usable).
> >
> > Hmm.  I'm not sure what you mean by "hinting."  I do think that the
> > server should have the option of not accepting incremental requests from
> > specific clients, in order to save memory space.
> >
> > > This seems to work better with a per
> > > connection/session attached Metadata than with a Map and could allow
> for
> > > easier client implementations.
> > > It would also make Client-side code easier as there wouldn't be any
> > > Cache-miss error Messages to handle.
> >
> > It is nice not to have to handle cache-miss responses, I agree.
> > However, TCP sessions aren't exposed to most of our client-side code.
> > For example, when the Producer creates a message and hands it off to the
> > NetworkClient, the NC will transparently re-connect and re-send a
> > message if the first send failed.  The higher-level code will not be
> > informed about whether the TCP session was re-established, whether an
> > existing TCP session was used, and so on.  

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-29 Thread Colin McCabe
I updated the KIP with the ideas we've been discussing.

best,
Colin

On Tue, Nov 28, 2017, at 08:38, Colin McCabe wrote:
> On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote:
> > Hi Colin, thank you  for this KIP, it can become a really useful thing.
> > 
> > I just scanned through the discussion so far and wanted to start a 
> > thread to make as decision about keeping the
> > cache with the Connection / Session or having some sort of UUID indN exed 
> > global Map.
> > 
> > Sorry if that has been settled already and I missed it. In this case 
> > could anyone point me to the discussion?
> 
> Hi Jan,
> 
> I don't think anyone has discussed the idea of tying the cache to an
> individual TCP session yet.  I agree that since the cache is intended to
> be used only by a single follower or client, it's an interesting thing
> to think about.
> 
> I guess the obvious disadvantage is that whenever your TCP session
> drops, you have to make a full fetch request rather than an incremental
> one.  It's not clear to me how often this happens in practice -- it
> probably depends a lot on the quality of the network.  From a code
> perspective, it might also be a bit difficult to access data associated
> with the Session from classes like KafkaApis (although we could refactor
> it to make this easier).
> 
> It's also clear that even if we tie the cache to the session, we still
> have to have limits on the number of caches we're willing to create. 
> And probably we should reserve some cache slots for each follower, so
> that clients don't take all of them.
> 
> > 
> > Id rather see a protocol in which the client is hinting the broker that, 
> > he is going to use the feature instead of a client
> > realizing that the broker just offered the feature (regardless of 
> > protocol version which should only indicate that the feature
> > would be usable).
> 
> Hmm.  I'm not sure what you mean by "hinting."  I do think that the
> server should have the option of not accepting incremental requests from
> specific clients, in order to save memory space.
> 
> > This seems to work better with a per 
> > connection/session attached Metadata than with a Map and could allow for
> > easier client implementations.
> > It would also make Client-side code easier as there wouldn't be any 
> > Cache-miss error Messages to handle.
> 
> It is nice not to have to handle cache-miss responses, I agree. 
> However, TCP sessions aren't exposed to most of our client-side code. 
> For example, when the Producer creates a message and hands it off to the
> NetworkClient, the NC will transparently re-connect and re-send a
> message if the first send failed.  The higher-level code will not be
> informed about whether the TCP session was re-established, whether an
> existing TCP session was used, and so on.  So overall I would still lean
> towards not coupling this to the TCP session...
> 
> best,
> Colin
> 
> > 
> >   Thank you again for the KIP. And again, if this was clarified already 
> > please drop me a hint where I could read about it.
> > 
> > Best Jan
> > 
> > 
> > 
> > 
> > 
> > On 21.11.2017 22:02, Colin McCabe wrote:
> > > Hi all,
> > >
> > > I created a KIP to improve the scalability and latency of FetchRequest:
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability
> > >
> > > Please take a look.
> > >
> > > cheers,
> > > Colin
> >


Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-28 Thread Colin McCabe
On Mon, Nov 27, 2017, at 22:30, Jan Filipiak wrote:
> Hi Colin, thank you  for this KIP, it can become a really useful thing.
> 
> I just scanned through the discussion so far and wanted to start a 
> thread to make as decision about keeping the
> cache with the Connection / Session or having some sort of UUID indN exed 
> global Map.
> 
> Sorry if that has been settled already and I missed it. In this case 
> could anyone point me to the discussion?

Hi Jan,

I don't think anyone has discussed the idea of tying the cache to an
individual TCP session yet.  I agree that since the cache is intended to
be used only by a single follower or client, it's an interesting thing
to think about.

I guess the obvious disadvantage is that whenever your TCP session
drops, you have to make a full fetch request rather than an incremental
one.  It's not clear to me how often this happens in practice -- it
probably depends a lot on the quality of the network.  From a code
perspective, it might also be a bit difficult to access data associated
with the Session from classes like KafkaApis (although we could refactor
it to make this easier).

It's also clear that even if we tie the cache to the session, we still
have to have limits on the number of caches we're willing to create. 
And probably we should reserve some cache slots for each follower, so
that clients don't take all of them.

> 
> Id rather see a protocol in which the client is hinting the broker that, 
> he is going to use the feature instead of a client
> realizing that the broker just offered the feature (regardless of 
> protocol version which should only indicate that the feature
> would be usable).

Hmm.  I'm not sure what you mean by "hinting."  I do think that the
server should have the option of not accepting incremental requests from
specific clients, in order to save memory space.

> This seems to work better with a per 
> connection/session attached Metadata than with a Map and could allow for
> easier client implementations.
> It would also make Client-side code easier as there wouldn't be any 
> Cache-miss error Messages to handle.

It is nice not to have to handle cache-miss responses, I agree. 
However, TCP sessions aren't exposed to most of our client-side code. 
For example, when the Producer creates a message and hands it off to the
NetworkClient, the NC will transparently re-connect and re-send a
message if the first send failed.  The higher-level code will not be
informed about whether the TCP session was re-established, whether an
existing TCP session was used, and so on.  So overall I would still lean
towards not coupling this to the TCP session...

best,
Colin

> 
>   Thank you again for the KIP. And again, if this was clarified already 
> please drop me a hint where I could read about it.
> 
> Best Jan
> 
> 
> 
> 
> 
> On 21.11.2017 22:02, Colin McCabe wrote:
> > Hi all,
> >
> > I created a KIP to improve the scalability and latency of FetchRequest:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability
> >
> > Please take a look.
> >
> > cheers,
> > Colin
> 


Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-27 Thread Dong Lin
Hey Colin,


On Mon, Nov 27, 2017 at 2:36 PM, Colin McCabe  wrote:

> On Sat, Nov 25, 2017, at 21:25, Dong Lin wrote:
> > Hey Colin,
> >
> > Thanks for the reply. Please see my comments inline.
> >
> > On Sat, Nov 25, 2017 at 3:33 PM, Colin McCabe 
> wrote:
> >
> > > On Fri, Nov 24, 2017, at 22:06, Dong Lin wrote:
> > > > Hey Colin,
> > > >
> > > > Thanks for the reply! Please see my comment inline.
> > > >
> > > > On Fri, Nov 24, 2017 at 9:39 PM, Colin McCabe 
> > > wrote:
> > > >
> > > > > On Thu, Nov 23, 2017, at 18:35, Dong Lin wrote:
> > > > > > Hey Colin,
> > > > > >
> > > > > > Thanks for the KIP! This is definitely useful when there are many
> > > idle
> > > > > > partitions in the clusters.
> > > > > >
> > > > > > Just in case it is useful, I will provide some number here. We
> > > observe
> > > > > > that for a clsuter that have around 2.5k partitions per broker,
> the
> > > > > > ProduceRequestTotal time average value is around 25 ms. For a
> cluster
> > > > > > with 2.5k partitions per broker whose AllTopicsBytesInRate is
> only
> > > > > around 6
> > > > > > MB/s, the ProduceRequestTotalTime average value is around 180 ms,
> > > most of
> > > > > > which is spent on ProduceRequestRemoteTime. The increased
> > > > > > ProduceRequestTotalTime significantly reduces throughput of
> producers
> > > > > > with ack=all. I think this KIP can help address this problem.
> > > > >
> > > > > Hi Dong,
> > > > >
> > > > > Thanks for the numbers.  It's good to have empirical confirmation
> that
> > > > > this will help!
> > > > >
> > > > > >
> > > > > > Here are some of my ideas on the current KIP:
> > > > > >
> > > > > > - The KIP says that the follower will include a partition in
> > > > > > the IncrementalFetchRequest if the LEO of the partition has been
> > > updated.
> > > > > > It seems that doing so may prevent leader from knowing
> information
> > > (e.g.
> > > > > > LogStartOffset) of the follower that will otherwise be included
> in
> > > the
> > > > > > FetchRequest. Maybe we should have a paragraph to explicitly
> define
> > > the
> > > > > > full criteria of when the fetcher should include a partition in
> the
> > > > > > FetchResponse and probably include logStartOffset as part of the
> > > > > > criteria?
> > > > >
> > > > > Hmm.  That's a good point... we should think about whether we need
> to
> > > > > send partition information in an incremental update when the LSO
> > > > > changes.
> > > > >
> > > > > Sorry if this is a dumb question, but what does the leader do with
> the
> > > > > logStartOffset of the followers?  When does the leader need to
> know it?
> > > > > Also, how often do we expect it to be changed by the LogCleaner?
> > > > >
> > > >
> > >
> > > Hi Dong,
> > >
> > > > The leader uses logStartOffset of the followers to determine the
> > > > logStartOffset of the partition. It is needed to handle
> > > > DeleteRecordsRequest. It can be changed if the log is deleted on the
> > > > follower due to log retention.
> > >
> > > Is there really a big advantage to the leader caching the LSO for each
> > > follower?  I guess it allows you to avoid sending the
> > > DeleteRecordsRequest to followers that you know have already deleted
> the
> > > records in question.  But the leader can just broadcast the request to
> > > all the followers.  This uses less network bandwidth than sending a
> > > single batch of records with acks=all.
> > >
> >
> > This is probably not just about caching. leader uses the LSO in the
> > FetchRequest from follower to figure out whether DeleteRecordsRequest can
> > succeed. Thus if follower does not send FetchRequest, leader will not
> > know the information needed for handling DeleteRecordsRequest. It is
> possible
> > to change the procedure for handling DeleteRecordsRequest. It is just
> that
> > the KIP probably needs to specify the change in more detail and we need
> to
> > understand whether this is the best approach.
>
> Hi Dong,
>
> That's a good point.  Do we have information on how frequently the LSO
> changes?  If it changes infrequently, maybe we should simply include
> this information in the incremental fetch response (as you suggested
> below).  Hmm... how frequently do we expect the LogCleaner to change
> this number?
>

LSO change caused by log retention should happen much less frequent than
the frequency of FetchRequest from follower. I don't exactly remember how
often LogClean can change LSO though..


>
> >
> > IMO the work in this KIP can be divided into three parts:
> >
> > 1) follower can skip a partition in the FetchRequest if the information
> > of that partition (i.e. those fields in FETCH_REQUEST_PARTITION_V5) does
> not
> > change in comparison to the last FetchRequest from this follower.
> > 2) the leader can skip a partition in the FetchResponse if the
> > information of that partition (i.e. those fields in
> FETCH_RESPONSE_PARTITION_V5) has
> > not changed in 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-27 Thread Jan Filipiak

Hi Colin, thank you  for this KIP, it can become a really useful thing.

I just scanned through the discussion so far and wanted to start a 
thread to make as decision about keeping the
cache with the Connection / Session or having some sort of UUID indexed 
global Map.


Sorry if that has been settled already and I missed it. In this case 
could anyone point me to the discussion?


Id rather see a protocol in which the client is hinting the broker that, 
he is going to use the feature instead of a client
realizing that the broker just offered the feature (regardless of 
protocol version which should only indicate that the feature
would be usable). This seems to work better with a per 
connection/session attached Metadata than with a Map and could allow for

easier client implementations.
It would also make Client-side code easier as there wouldn't be any 
Cache-miss error Messages to handle.


 Thank you again for the KIP. And again, if this was clarified already 
please drop me a hint where I could read about it.


Best Jan





On 21.11.2017 22:02, Colin McCabe wrote:

Hi all,

I created a KIP to improve the scalability and latency of FetchRequest:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability

Please take a look.

cheers,
Colin




Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-27 Thread Colin McCabe
On Sat, Nov 25, 2017, at 21:25, Dong Lin wrote:
> Hey Colin,
> 
> Thanks for the reply. Please see my comments inline.
> 
> On Sat, Nov 25, 2017 at 3:33 PM, Colin McCabe  wrote:
> 
> > On Fri, Nov 24, 2017, at 22:06, Dong Lin wrote:
> > > Hey Colin,
> > >
> > > Thanks for the reply! Please see my comment inline.
> > >
> > > On Fri, Nov 24, 2017 at 9:39 PM, Colin McCabe 
> > wrote:
> > >
> > > > On Thu, Nov 23, 2017, at 18:35, Dong Lin wrote:
> > > > > Hey Colin,
> > > > >
> > > > > Thanks for the KIP! This is definitely useful when there are many
> > idle
> > > > > partitions in the clusters.
> > > > >
> > > > > Just in case it is useful, I will provide some number here. We
> > observe
> > > > > that for a clsuter that have around 2.5k partitions per broker, the
> > > > > ProduceRequestTotal time average value is around 25 ms. For a cluster
> > > > > with 2.5k partitions per broker whose AllTopicsBytesInRate is only
> > > > around 6
> > > > > MB/s, the ProduceRequestTotalTime average value is around 180 ms,
> > most of
> > > > > which is spent on ProduceRequestRemoteTime. The increased
> > > > > ProduceRequestTotalTime significantly reduces throughput of producers
> > > > > with ack=all. I think this KIP can help address this problem.
> > > >
> > > > Hi Dong,
> > > >
> > > > Thanks for the numbers.  It's good to have empirical confirmation that
> > > > this will help!
> > > >
> > > > >
> > > > > Here are some of my ideas on the current KIP:
> > > > >
> > > > > - The KIP says that the follower will include a partition in
> > > > > the IncrementalFetchRequest if the LEO of the partition has been
> > updated.
> > > > > It seems that doing so may prevent leader from knowing information
> > (e.g.
> > > > > LogStartOffset) of the follower that will otherwise be included in
> > the
> > > > > FetchRequest. Maybe we should have a paragraph to explicitly define
> > the
> > > > > full criteria of when the fetcher should include a partition in the
> > > > > FetchResponse and probably include logStartOffset as part of the
> > > > > criteria?
> > > >
> > > > Hmm.  That's a good point... we should think about whether we need to
> > > > send partition information in an incremental update when the LSO
> > > > changes.
> > > >
> > > > Sorry if this is a dumb question, but what does the leader do with the
> > > > logStartOffset of the followers?  When does the leader need to know it?
> > > > Also, how often do we expect it to be changed by the LogCleaner?
> > > >
> > >
> >
> > Hi Dong,
> >
> > > The leader uses logStartOffset of the followers to determine the
> > > logStartOffset of the partition. It is needed to handle
> > > DeleteRecordsRequest. It can be changed if the log is deleted on the
> > > follower due to log retention.
> >
> > Is there really a big advantage to the leader caching the LSO for each
> > follower?  I guess it allows you to avoid sending the
> > DeleteRecordsRequest to followers that you know have already deleted the
> > records in question.  But the leader can just broadcast the request to
> > all the followers.  This uses less network bandwidth than sending a
> > single batch of records with acks=all.
> >
> 
> This is probably not just about caching. leader uses the LSO in the
> FetchRequest from follower to figure out whether DeleteRecordsRequest can
> succeed. Thus if follower does not send FetchRequest, leader will not
> know the information needed for handling DeleteRecordsRequest. It is possible
> to change the procedure for handling DeleteRecordsRequest. It is just that
> the KIP probably needs to specify the change in more detail and we need to
> understand whether this is the best approach.

Hi Dong,

That's a good point.  Do we have information on how frequently the LSO
changes?  If it changes infrequently, maybe we should simply include
this information in the incremental fetch response (as you suggested
below).  Hmm... how frequently do we expect the LogCleaner to change
this number?

> 
> IMO the work in this KIP can be divided into three parts:
> 
> 1) follower can skip a partition in the FetchRequest if the information
> of that partition (i.e. those fields in FETCH_REQUEST_PARTITION_V5) does not
> change in comparison to the last FetchRequest from this follower.
> 2) the leader can skip a partition in the FetchResponse if the
> information of that partition (i.e. those fields in 
> FETCH_RESPONSE_PARTITION_V5) has
> not changed in comparison to the last FetchResponse to this follower.
> 3) we can further skip a partition in FetchRequest (or FetchResponse) if
> the fields that have changed (e.g. LSO in the FetchRequest) does not need
> to be sent.
> 
> It seems to me that 1) and 2) are the most important part of the KIP.
> These two parts are "safe" to do in the sense that no information will be lost
> even if we skip these partitions in the FetchRequest/FetchResponse. It
> also seems that these two parts can achieve the main goal of this KIP 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-25 Thread Dong Lin
Hey Colin,

Thanks for the reply. Please see my comments inline.

On Sat, Nov 25, 2017 at 3:33 PM, Colin McCabe  wrote:

> On Fri, Nov 24, 2017, at 22:06, Dong Lin wrote:
> > Hey Colin,
> >
> > Thanks for the reply! Please see my comment inline.
> >
> > On Fri, Nov 24, 2017 at 9:39 PM, Colin McCabe 
> wrote:
> >
> > > On Thu, Nov 23, 2017, at 18:35, Dong Lin wrote:
> > > > Hey Colin,
> > > >
> > > > Thanks for the KIP! This is definitely useful when there are many
> idle
> > > > partitions in the clusters.
> > > >
> > > > Just in case it is useful, I will provide some number here. We
> observe
> > > > that for a clsuter that have around 2.5k partitions per broker, the
> > > > ProduceRequestTotal time average value is around 25 ms. For a cluster
> > > > with 2.5k partitions per broker whose AllTopicsBytesInRate is only
> > > around 6
> > > > MB/s, the ProduceRequestTotalTime average value is around 180 ms,
> most of
> > > > which is spent on ProduceRequestRemoteTime. The increased
> > > > ProduceRequestTotalTime significantly reduces throughput of producers
> > > > with ack=all. I think this KIP can help address this problem.
> > >
> > > Hi Dong,
> > >
> > > Thanks for the numbers.  It's good to have empirical confirmation that
> > > this will help!
> > >
> > > >
> > > > Here are some of my ideas on the current KIP:
> > > >
> > > > - The KIP says that the follower will include a partition in
> > > > the IncrementalFetchRequest if the LEO of the partition has been
> updated.
> > > > It seems that doing so may prevent leader from knowing information
> (e.g.
> > > > LogStartOffset) of the follower that will otherwise be included in
> the
> > > > FetchRequest. Maybe we should have a paragraph to explicitly define
> the
> > > > full criteria of when the fetcher should include a partition in the
> > > > FetchResponse and probably include logStartOffset as part of the
> > > > criteria?
> > >
> > > Hmm.  That's a good point... we should think about whether we need to
> > > send partition information in an incremental update when the LSO
> > > changes.
> > >
> > > Sorry if this is a dumb question, but what does the leader do with the
> > > logStartOffset of the followers?  When does the leader need to know it?
> > > Also, how often do we expect it to be changed by the LogCleaner?
> > >
> >
>
> Hi Dong,
>
> > The leader uses logStartOffset of the followers to determine the
> > logStartOffset of the partition. It is needed to handle
> > DeleteRecordsRequest. It can be changed if the log is deleted on the
> > follower due to log retention.
>
> Is there really a big advantage to the leader caching the LSO for each
> follower?  I guess it allows you to avoid sending the
> DeleteRecordsRequest to followers that you know have already deleted the
> records in question.  But the leader can just broadcast the request to
> all the followers.  This uses less network bandwidth than sending a
> single batch of records with acks=all.
>

This is probably not just about caching. leader uses the LSO in the
FetchRequest from follower to figure out whether DeleteRecordsRequest can
succeed. Thus if follower does not send FetchRequest, leader will not know
the information needed for handling DeleteRecordsRequest. It is possible to
change the procedure for handling DeleteRecordsRequest. It is just that the
KIP probably needs to specify the change in more detail and we need to
understand whether this is the best approach.

IMO the work in this KIP can be divided into three parts:

1) follower can skip a partition in the FetchRequest if the information of
that partition (i.e. those fields in FETCH_REQUEST_PARTITION_V5) does not
change in comparison to the last FetchRequest from this follower.
2) the leader can skip a partition in the FetchResponse if the information
of that partition (i.e. those fields in FETCH_RESPONSE_PARTITION_V5) has
not changed in comparison to the last FetchResponse to this follower.
3) we can further skip a partition in FetchRequest (or FetchResponse) if
the fields that have changed (e.g. LSO in the FetchRequest) does not need
to be sent.

It seems to me that 1) and 2) are the most important part of the KIP. These
two parts are "safe" to do in the sense that no information will be lost
even if we skip these partitions in the FetchRequest/FetchResponse. It also
seems that these two parts can achieve the main goal of this KIP because if
a partition does not have inactive traffic, mostly likely the corresponding
fields in FETCH_REQUEST_PARTITION_V5 and FETCH_RESPONSE_PARTITION_V5 will
not change, and therefore this partition can be skipped in most
FetchRequest and FetchResponse.

On the other hand, the part 3) can possibly be a useful optimization but it
can also be a bit unsafe and require more discussion. For example, if we
skip a partition in the FetchRequest when its LSO has changed, this can
potentially affect the handling of DeleteRecordsRequest. It is possible
that we can 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-25 Thread Colin McCabe
On Fri, Nov 24, 2017, at 22:06, Dong Lin wrote:
> Hey Colin,
> 
> Thanks for the reply! Please see my comment inline.
> 
> On Fri, Nov 24, 2017 at 9:39 PM, Colin McCabe  wrote:
> 
> > On Thu, Nov 23, 2017, at 18:35, Dong Lin wrote:
> > > Hey Colin,
> > >
> > > Thanks for the KIP! This is definitely useful when there are many idle
> > > partitions in the clusters.
> > >
> > > Just in case it is useful, I will provide some number here. We observe
> > > that for a clsuter that have around 2.5k partitions per broker, the
> > > ProduceRequestTotal time average value is around 25 ms. For a cluster
> > > with 2.5k partitions per broker whose AllTopicsBytesInRate is only
> > around 6
> > > MB/s, the ProduceRequestTotalTime average value is around 180 ms, most of
> > > which is spent on ProduceRequestRemoteTime. The increased
> > > ProduceRequestTotalTime significantly reduces throughput of producers
> > > with ack=all. I think this KIP can help address this problem.
> >
> > Hi Dong,
> >
> > Thanks for the numbers.  It's good to have empirical confirmation that
> > this will help!
> >
> > >
> > > Here are some of my ideas on the current KIP:
> > >
> > > - The KIP says that the follower will include a partition in
> > > the IncrementalFetchRequest if the LEO of the partition has been updated.
> > > It seems that doing so may prevent leader from knowing information (e.g.
> > > LogStartOffset) of the follower that will otherwise be included in the
> > > FetchRequest. Maybe we should have a paragraph to explicitly define the
> > > full criteria of when the fetcher should include a partition in the
> > > FetchResponse and probably include logStartOffset as part of the
> > > criteria?
> >
> > Hmm.  That's a good point... we should think about whether we need to
> > send partition information in an incremental update when the LSO
> > changes.
> >
> > Sorry if this is a dumb question, but what does the leader do with the
> > logStartOffset of the followers?  When does the leader need to know it?
> > Also, how often do we expect it to be changed by the LogCleaner?
> >
> 

Hi Dong,

> The leader uses logStartOffset of the followers to determine the
> logStartOffset of the partition. It is needed to handle
> DeleteRecordsRequest. It can be changed if the log is deleted on the
> follower due to log retention.

Is there really a big advantage to the leader caching the LSO for each
follower?  I guess it allows you to avoid sending the
DeleteRecordsRequest to followers that you know have already deleted the
records in question.  But the leader can just broadcast the request to
all the followers.  This uses less network bandwidth than sending a
single batch of records with acks=all.

> 
> 
> >
> > > - It seems that every time the set of partitions in the
> > > ReplicaFetcherThread is changed, or if follower restarts, a new UUID will
> > > be generated in the leader and leader will add a new entry in the
> > > in-memory  map to map the UUID to list of partitions (and other metadata
> > such as
> > > fetch offset). This map with grow over time depending depending on the
> > > frequency of events such as partition movement or broker restart. As you
> > mentioned,
> > > we probably need to timeout entries in this map. But there is also
> > > tradeoff  in this timeout -- large timeout increase memory usage whereas
> > smaller
> > > timeout increases frequency of the full FetchRequest. Could you specify
> > > the default value of this timeout and probably also explain how it
> > affects
> > > the performance of this KIP?
> >
> > Right, there are definitely some tradeoffs here.
> >
> > Since fetches happen very frequently, I think even a short UUID cache
> > expiration time of a minute or two should already be enough to ensure
> > that 99%+ of all fetch requests are incremental fetch requests.  I think
> > the idea of partitioning the cache per broker is a good one which will
> > let us limit memory consumption even more.
> >
> > If replica fetcher threads do change their partition assignments often,
> > we could also add a special "old UUID to uncache" field to the
> > FetchRequest as well.  That would avoid having to wait for the full
> > minute to clear the UUID cache.  That's probably not  necessary,
> > though...
> >
> 
> I think expiration time of a minute is two is probably reasonable. Yeah
> we
> can discuss it further after the KIP is updated. Thanks!
> 
> 
> >
> > > Also, do you think we can avoid having duplicate
> > > entries from the same ReplicaFetcher (in case of partition set change) by
> > > using brokerId+fetcherThreadIndex as the UUID?
> >
> > My concern about that is that if two messages get reordered somehow, or
> > an update gets lost, the view of partitions which the fetcher thread has
> > could diverge from the view which the leader has.  Also, UUIDs work for
> > consumers, but clearly consumers cannot use a
> > brokerID+fetcherThreadIndex.  It's simpler to have one system than two.
> >
> 
> 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-24 Thread Dong Lin
Hey Colin,

Thanks for the reply! Please see my comment inline.

On Fri, Nov 24, 2017 at 9:39 PM, Colin McCabe  wrote:

> On Thu, Nov 23, 2017, at 18:35, Dong Lin wrote:
> > Hey Colin,
> >
> > Thanks for the KIP! This is definitely useful when there are many idle
> > partitions in the clusters.
> >
> > Just in case it is useful, I will provide some number here. We observe
> > that for a cluster that has around 2.5k partitions per broker, the
> > ProduceRequestTotalTime average value is around 25 ms. For a cluster
> > with 2.5k partitions per broker whose AllTopicsBytesInRate is only
> around 6
> > MB/s, the ProduceRequestTotalTime average value is around 180 ms, most of
> > which is spent on ProduceRequestRemoteTime. The increased
> > ProduceRequestTotalTime significantly reduces throughput of producers
> > with ack=all. I think this KIP can help address this problem.
>
> Hi Dong,
>
> Thanks for the numbers.  It's good to have empirical confirmation that
> this will help!
>
> >
> > Here are some of my ideas on the current KIP:
> >
> > - The KIP says that the follower will include a partition in
> > the IncrementalFetchRequest if the LEO of the partition has been updated.
> > It seems that doing so may prevent leader from knowing information (e.g.
> > LogStartOffset) of the follower that will otherwise be included in the
> > FetchRequest. Maybe we should have a paragraph to explicitly define the
> > full criteria of when the fetcher should include a partition in the
> > FetchResponse and probably include logStartOffset as part of the
> > criteria?
>
> Hmm.  That's a good point... we should think about whether we need to
> send partition information in an incremental update when the LSO
> changes.
>
> Sorry if this is a dumb question, but what does the leader do with the
> logStartOffset of the followers?  When does the leader need to know it?
> Also, how often do we expect it to be changed by the LogCleaner?
>

The leader uses logStartOffset of the followers to determine the
logStartOffset of the partition. It is needed to handle
DeleteRecordsRequest. It can be changed if the log is deleted on the
follower due to log retention.
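
For illustration, here is a minimal Python sketch of that rule (the function
name and data shapes are invented; this is not actual Kafka code):

```python
# The leader can only advance the partition-level log start offset once every
# replica has moved past it, so it takes the minimum of the log start offsets
# reported by the replicas.
def partition_log_start_offset(replica_log_start_offsets):
    """replica_log_start_offsets: dict of brokerId -> reported log start offset."""
    return min(replica_log_start_offsets.values())

# Example: a DeleteRecordsRequest asked to delete records up to offset 50, but
# one follower has only deleted up to 30, so the partition-level value is 30.
print(partition_log_start_offset({1: 50, 2: 30, 3: 50}))  # 30
```

This is why the leader needs the followers' logStartOffset in the fetch path:
the partition-level value can only move forward once every replica reports a
new one.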


>
> > - It seems that every time the set of partitions in the
> > ReplicaFetcherThread is changed, or if follower restarts, a new UUID will
> > be generated in the leader and leader will add a new entry in the
> > in-memory  map to map the UUID to list of partitions (and other metadata
> such as
> > fetch offset). This map will grow over time depending on the
> > frequency of events such as partition movement or broker restart. As you
> mentioned,
> > we probably need to time out entries in this map. But there is also a
> > tradeoff in this timeout -- a large timeout increases memory usage whereas a
> smaller
> > timeout increases frequency of the full FetchRequest. Could you specify
> > the default value of this timeout and probably also explain how it
> affects
> > the performance of this KIP?
>
> Right, there are definitely some tradeoffs here.
>
> Since fetches happen very frequently, I think even a short UUID cache
> expiration time of a minute or two should already be enough to ensure
> that 99%+ of all fetch requests are incremental fetch requests.  I think
> the idea of partitioning the cache per broker is a good one which will
> let us limit memory consumption even more.
>
> If replica fetcher threads do change their partition assignments often,
> we could also add a special "old UUID to uncache" field to the
> FetchRequest as well.  That would avoid having to wait for the full
> minute to clear the UUID cache.  That's probably not  necessary,
> though...
>

I think an expiration time of a minute or two is probably reasonable. Yeah we
can discuss it further after the KIP is updated. Thanks!


>
> > Also, do you think we can avoid having duplicate
> > entries from the same ReplicaFetcher (in case of partition set change) by
> > using brokerId+fetcherThreadIndex as the UUID?
>
> My concern about that is that if two messages get reordered somehow, or
> an update gets lost, the view of partitions which the fetcher thread has
> could diverge from the view which the leader has.  Also, UUIDs work for
> consumers, but clearly consumers cannot use a
> brokerID+fetcherThreadIndex.  It's simpler to have one system than two.
>

Yeah this can be a problem if two messages are lost or reordered somehow. I
am just wondering whether there actually exists a scenario where the
messages can be reordered between the ReplicaFetcherThread and the leader. My gut
feel is that since the ReplicaFetcherThread talks to leader using a single
TCP connection with inflight requests = 1, out-of-order delivery probably
should not happen. I may be wrong though. What do you think?


>
> >
> > I agree with the previous comments that 1) ideally we want to evolve the
> > existing FetchRequest instead of adding a new request type; and
> > 2) KIP hopefully can also apply to replication 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-24 Thread Colin McCabe
On Thu, Nov 23, 2017, at 18:35, Dong Lin wrote:
> Hey Colin,
> 
> Thanks for the KIP! This is definitely useful when there are many idle
> partitions in the clusters.
> 
> Just in case it is useful, I will provide some number here. We observe
> that for a cluster that has around 2.5k partitions per broker, the
> ProduceRequestTotalTime average value is around 25 ms. For a cluster
> with 2.5k partitions per broker whose AllTopicsBytesInRate is only around 6
> MB/s, the ProduceRequestTotalTime average value is around 180 ms, most of
> which is spent on ProduceRequestRemoteTime. The increased
> ProduceRequestTotalTime significantly reduces throughput of producers
> with ack=all. I think this KIP can help address this problem.

Hi Dong,

Thanks for the numbers.  It's good to have empirical confirmation that
this will help!

> 
> Here are some of my ideas on the current KIP:
> 
> - The KIP says that the follower will include a partition in
> the IncrementalFetchRequest if the LEO of the partition has been updated.
> It seems that doing so may prevent leader from knowing information (e.g.
> LogStartOffset) of the follower that will otherwise be included in the
> FetchRequest. Maybe we should have a paragraph to explicitly define the
> full criteria of when the fetcher should include a partition in the
> FetchResponse and probably include logStartOffset as part of the
> criteria?

Hmm.  That's a good point... we should think about whether we need to
send partition information in an incremental update when the LSO
changes.

Sorry if this is a dumb question, but what does the leader do with the
logStartOffset of the followers?  When does the leader need to know it? 
Also, how often do we expect it to be changed by the LogCleaner?

> - It seems that every time the set of partitions in the
> ReplicaFetcherThread is changed, or if follower restarts, a new UUID will
> be generated in the leader and leader will add a new entry in the
> in-memory  map to map the UUID to list of partitions (and other metadata such 
> as
> fetch offset). This map will grow over time depending on the
> frequency of events such as partition movement or broker restart. As you 
> mentioned,
> we probably need to time out entries in this map. But there is also a
> tradeoff in this timeout -- a large timeout increases memory usage whereas a
> smaller
> timeout increases frequency of the full FetchRequest. Could you specify
> the default value of this timeout and probably also explain how it affects
> the performance of this KIP?

Right, there are definitely some tradeoffs here.

Since fetches happen very frequently, I think even a short UUID cache
expiration time of a minute or two should already be enough to ensure
that 99%+ of all fetch requests are incremental fetch requests.  I think
the idea of partitioning the cache per broker is a good one which will
let us limit memory consumption even more.

If replica fetcher threads do change their partition assignments often,
we could also add a special "old UUID to uncache" field to the
FetchRequest as well.  That would avoid having to wait for the full
minute to clear the UUID cache.  That's probably not  necessary,
though...

> Also, do you think we can avoid having duplicate
> entries from the same ReplicaFetcher (in case of partition set change) by
> using brokerId+fetcherThreadIndex as the UUID?

My concern about that is that if two messages get reordered somehow, or
an update gets lost, the view of partitions which the fetcher thread has
could diverge from the view which the leader has.  Also, UUIDs work for
consumers, but clearly consumers cannot use a 
brokerID+fetcherThreadIndex.  It's simpler to have one system than two.

> 
> I agree with the previous comments that 1) ideally we want to evolve the
> existing FetchRequest instead of adding a new request type; and
> 2) KIP hopefully can also apply to replication service such as e.g.
> MirrorMaker. In addition, ideally we probably want to implement the new
> logic in a separate class without having to modify the existing class
> (e.g. Log, LogManager) so that the implementation and design can be simpler
> going forward. Motivated by these concepts, I am wondering if the following
> alternative design may be worth thinking.
> 
> Here are the details of a potentially feasible alternative approach.
> 
> *Protocol change: *
> 
> - We add a fetcherId of string type in the FetchRequest. This fetcherId
> is similar to a UUID and helps the leader correlate the fetcher (i.e.
> ReplicaFetcherThread or MM consumer) with the state of the fetcher. This
> fetcherId is determined by the fetcher. For most consumers this fetcherId
> is null. For ReplicaFetcherThread this fetcherId = brokerId +
> threadIndex.
> For MM this is groupId+someIndex.

As Jay pointed out earlier, there are other consumers besides
MirrorMaker that might want to take advantage of incremental fetch
requests.  He gave the example of the HDFS connector, but there are many

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-23 Thread Dong Lin
Hey Colin,

Thanks for the KIP! This is definitely useful when there are many idle
partitions in the clusters.

Just in case it is useful, I will provide some number here. We observe that
for a cluster that has around 2.5k partitions per broker, the
ProduceRequestTotalTime average value is around 25 ms. For a cluster with
2.5k partitions per broker whose AllTopicsBytesInRate is only around 6
MB/s, the ProduceRequestTotalTime average value is around 180 ms, most of
which is spent on ProduceRequestRemoteTime. The increased
ProduceRequestTotalTime significantly reduces throughput of producers with
ack=all. I think this KIP can help address this problem.

Here are some of my ideas on the current KIP:

- The KIP says that the follower will include a partition in
the IncrementalFetchRequest if the LEO of the partition has been updated.
It seems that doing so may prevent leader from knowing information (e.g.
LogStartOffset) of the follower that will otherwise be included in the
FetchRequest. Maybe we should have a paragraph to explicitly define the
full criteria of when the fetcher should include a partition in the
FetchResponse and probably include logStartOffset as part of the criteria?

- It seems that every time the set of partitions in the
ReplicaFetcherThread is changed, or if follower restarts, a new UUID will
be generated in the leader and leader will add a new entry in the in-memory
map to map the UUID to list of partitions (and other metadata such as fetch
offset). This map will grow over time depending on the frequency
of events such as partition movement or broker restart. As you mentioned,
we probably need to time out entries in this map. But there is also a tradeoff
in this timeout -- a large timeout increases memory usage whereas a smaller
timeout increases frequency of the full FetchRequest. Could you specify the
default value of this timeout and probably also explain how it affects the
performance of this KIP? Also, do you think we can avoid having duplicate
entries from the same ReplicaFetcher (in case of partition set change) by
using brokerId+fetcherThreadIndex as the UUID?

I agree with the previous comments that 1) ideally we want to evolve the
existing FetchRequest instead of adding a new request type; and 2)
KIP hopefully can also apply to replication service such as e.g.
MirrorMaker. In addition, ideally we probably want to implement the new
logic in a separate class without having to modify the existing class (e.g.
Log, LogManager) so that the implementation and design can be simpler going
forward. Motivated by these concepts, I am wondering if the following
alternative design may be worth thinking.

Here are the details of a potentially feasible alternative approach.

*Protocol change: *

- We add a fetcherId of string type in the FetchRequest. This fetcherId is
similar to a UUID and helps the leader correlate the fetcher (i.e.
ReplicaFetcherThread or MM consumer) with the state of the fetcher. This
fetcherId is determined by the fetcher. For most consumers this fetcherId
is null. For ReplicaFetcherThread this fetcherId = brokerId + threadIndex.
For MM this is groupId+someIndex.

*Proposed change in leader broker:*

- A new class FetcherHandler will be used in the leader to map the
fetcherId to state of the fetcher. The state of the fetcher is a list of
FETCH_REQUEST_PARTITION_V0 for selected partitions.

- After leader receives a FetchRequest, it first transforms the
FetchRequest by doing request = FetcherHandler.addPartition(request) before
giving this request to KafkaApis.handle(request). If the fetcherId in
this request is null, this method does not make any change. Otherwise, it
takes the list of FETCH_REQUEST_PARTITION_V0 associated with this fetcherId
and append it to the given request. The state of a new non-null fetcherId
is an empty list.

- The KafkaApis.handle(request) will process the request and generate a
response. All existing logic in ReplicaManager, LogManager and so on does
not need to be changed.

- The leader calls response = FetcherHandler.removePartition(response)
before sending the response back to the fetcher.
FetcherHandler.removePartition(response)
enumerates all partitions in the response. If a partition is "empty" (e.g.
no records to be sent), this partition and its FETCH_REQUEST_PARTITION_V0
in the original FetchRequest is added to the state of this fetcherId; and
this partition is removed from the response. If the partition is not
"empty", the partition is removed from the state of this fetcherId.
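
The FetcherHandler flow described above could be sketched roughly like this in
Python (method names follow the proposal, but the request/response shapes are
simplified to plain dicts for illustration; this is not actual broker code):

```python
# Sketch of the proposed FetcherHandler: requests are modelled as
# {partition: fetch_offset} and responses as {partition: [records]}.
class FetcherHandler:
    def __init__(self):
        self.state = {}  # fetcherId -> {partition: fetch_offset}

    def add_partitions(self, fetcher_id, request):
        """Re-add the partitions the fetcher omitted because they were empty."""
        if fetcher_id is None:
            return request
        merged = dict(self.state.setdefault(fetcher_id, {}))
        merged.update(request)  # offsets explicitly sent by the fetcher win
        return merged

    def remove_partitions(self, fetcher_id, request, response):
        """Strip empty partitions from the response, remembering their offsets."""
        if fetcher_id is None:
            return response
        cached = self.state.setdefault(fetcher_id, {})
        trimmed = {}
        for partition, records in response.items():
            if records:                  # non-empty: fetcher will advance itself
                cached.pop(partition, None)
                trimmed[partition] = records
            else:                        # empty: leader re-adds it next round
                cached[partition] = request[partition]
        return trimmed

handler = FetcherHandler()
req = handler.add_partitions("broker1-0", {"tA-0": 10, "tB-0": 100})
resp = handler.remove_partitions("broker1-0", req, {"tA-0": [], "tB-0": ["m"]})
print(resp)                        # {'tB-0': ['m']}
print(handler.state["broker1-0"])  # {'tA-0': 10}
```

The next incremental request from "broker1-0" could then omit tA-0 entirely,
and add_partitions would restore it at offset 10 before KafkaApis sees it.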

*Proposed change in the ReplicaFetcherThread:*

- In addition to the set of assigned partitions, the ReplicaFetcherThread
also keeps track of the subset of assigned partitions which were non-empty
in the last FetchResponse. This is initialized to be the set of assigned
partitions. Then it is updated every time a FetchResponse is received. The
FetchRequest constructed by the ReplicaFetcherThread includes exactly this
subset of assigned partitions.

Here is 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-23 Thread Becket Qin
Hi Ismael,

Yes, you are right. The metadata may not help for multiple fetch threads or
the consumer case. Session based approach is probably better in this case.

The optimization of only returning data at the offset index entry boundary
may still be worth considering. It also helps improve the index lookup in
general.

@Jun,
Good point about log-compacted topics. Perhaps we can make sure the read will
always operate on the original segment file even if a compacted log
segment is swapped in. Combining this with the above solution which always
returns the data at the index boundary when possible, it seems we can avoid
the additional look up safely.
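
A rough Python sketch of that index-boundary idea (the shapes are hypothetical:
index entries are (offset, byte_position) pairs, plus the log end):

```python
# Return the highest offset-index entry (or the log end offset) whose byte
# position still fits in the partition fetch size. If one exists, the leader
# knows the last returned offset without an extra log scan; None means it has
# to fall back to an offset lookup.
def pick_end_boundary(index_entries, log_end, start_pos, max_bytes):
    best = None
    for offset, pos in index_entries + [log_end]:
        if pos <= start_pos:
            continue                     # boundary at or before the fetch start
        if pos - start_pos <= max_bytes:
            best = offset                # keep the highest boundary that fits
    return best

entries = [(100, 4096), (200, 8192), (300, 12288)]
print(pick_end_boundary(entries, log_end=(350, 14000), start_pos=0, max_bytes=9000))   # 200
print(pick_end_boundary(entries, log_end=(350, 14000), start_pos=0, max_bytes=20000))  # 350
```

This matches the example in the thread: a log with LEO 350 and index entries at
100, 200, 300 always returns data ending at 100, 200, 300 or 350 when the fetch
size allows.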

Thanks,

Jiangjie (Becket) Qin


On Thu, Nov 23, 2017 at 9:31 AM, Jun Rao  wrote:

> Yes, caching the log segment position after the index lookup may work. One
> subtle issue is that for a compacted topic, the underlying log segment may
> have changed between two consecutive fetch requests, and we need to think
> through the impact of that.
>
> Thanks,
>
> Jun
>
> On Wed, Nov 22, 2017 at 7:54 PM, Colin McCabe  wrote:
>
> > Oh, I see the issue now.  The broker uses sendfile() and sends some
> > message data without knowing what the ending offset is.  To learn that,
> we
> > would need another index access.
> >
> > However, when we do that index->offset lookup, we know that the next
> > offset->index lookup (done in the following fetch request) will be for
> the
> > same offset.  So we should be able to cache the result (the index).
> Also:
> > Does the operating system’s page cache help us here?
> >
> > Best,
> > Colin
> >
> > On Wed, Nov 22, 2017, at 16:53, Jun Rao wrote:
> > > Hi, Colin,
> > >
> > > After step 3a, do we need to update the cached offset in the leader to
> be
> > > the last offset in the data returned in the fetch response? If so, we
> > > need
> > > another offset index lookup since the leader only knows that it gives
> out
> > > X
> > > bytes in the fetch response, but not the last offset in those X bytes.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe 
> > wrote:
> > >
> > > > On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
> > > > > Hi, Colin,
> > > > >
> > > > > When fetching data for a partition, the leader needs to translate
> the
> > > > > fetch offset to a position in a log segment with an index lookup.
> If
> > the
> > > > fetch
> > > > > request now also needs to cache the offset for the next fetch
> > request,
> > > > > there will be an extra offset index lookup.
> > > >
> > > > Hmm.  So the way I was thinking about it was, with an incremental
> fetch
> > > > request, for each partition:
> > > >
> > > > 1a. the leader consults its cache to find the offset it needs to use
> > for
> > > > the fetch request
> > > > 2a. the leader performs a lookup to translate the offset to a file
> > index
> > > > 3a. the leader reads the data from the file
> > > >
> > > > In contrast, with a full fetch request, for each partition:
> > > >
> > > > 1b. the leader looks at the FetchRequest to find the offset it needs
> to
> > > > use for the fetch request
> > > > 2b. the leader performs a lookup to translate the offset to a file
> > index
> > > > 3b. the leader reads the data from the file
> > > >
> > > > It seems like there is only one offset index lookup in both cases?
> The
> > > > key point is that the cache in step #1a is not stored on disk.  Or
> > maybe
> > > > I'm missing something here.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > > The offset index lookup can
> > > > > potentially be expensive since it could require disk I/Os. One way
> to
> > > > > optimize this a bit is to further cache the log segment position
> for
> > the
> > > > > next offset. The tricky issue is that for a compacted topic, the
> > > > > underlying
> > > > > log segment could have changed between two consecutive fetch
> > requests. We
> > > > > could potentially make that case work, but the logic will be more
> > > > > complicated.
> > > > >
> > > > > Another thing is that it seems that the proposal only saves the
> > metadata
> > > > > overhead if there are low volume topics. If we use Jay's suggestion
> > of
> > > > > including 0 partitions in subsequent fetch requests, it seems that
> we
> > > > > could
> > > > > get the metadata saving even if all topics have continuous traffic.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Wed, Nov 22, 2017 at 1:14 PM, Colin McCabe 
> > > > wrote:
> > > > >
> > > > > > On Tue, Nov 21, 2017, at 22:11, Jun Rao wrote:
> > > > > > > Hi, Jay,
> > > > > > >
> > > > > > > I guess in your proposal the leader has to cache the last
> offset
> > > > given
> > > > > > > back for each partition so that it knows from which offset to
> > serve
> > > > the
> > > > > > next
> > > > > > > fetch request.
> > > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Just to clarify, 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-23 Thread Jun Rao
Yes, caching the log segment position after the index lookup may work. One
subtle issue is that for a compacted topic, the underlying log segment may
have changed between two consecutive fetch requests, and we need to think
through the impact of that.
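
One way to make such a cache safe, sketched in Python (the structure and the
segment-identity field are invented for illustration, not an actual Kafka
design): key the cached position by the segment it was computed against, so a
swapped-in compacted segment misses the cache rather than returning a stale
position.

```python
# Cache the (offset -> byte position) result of the last index lookup together
# with the identity of the segment it came from.
class PositionCache:
    def __init__(self):
        self.entry = None  # (segment_id, offset, byte_position)

    def put(self, segment_id, offset, position):
        self.entry = (segment_id, offset, position)

    def get(self, segment_id, offset):
        if self.entry and self.entry[0] == segment_id and self.entry[1] == offset:
            return self.entry[2]
        return None  # miss: fall back to the normal offset-index lookup

cache = PositionCache()
cache.put(segment_id=7, offset=350, position=14000)
print(cache.get(7, 350))  # 14000
print(cache.get(8, 350))  # None (segment was swapped, e.g. by the log cleaner)
```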

Thanks,

Jun

On Wed, Nov 22, 2017 at 7:54 PM, Colin McCabe  wrote:

> Oh, I see the issue now.  The broker uses sendfile() and sends some
> message data without knowing what the ending offset is.  To learn that, we
> would need another index access.
>
> However, when we do that index->offset lookup, we know that the next
> offset->index lookup (done in the following fetch request) will be for the
> same offset.  So we should be able to cache the result (the index).  Also:
> Does the operating system’s page cache help us here?
>
> Best,
> Colin
>
> On Wed, Nov 22, 2017, at 16:53, Jun Rao wrote:
> > Hi, Colin,
> >
> > After step 3a, do we need to update the cached offset in the leader to be
> > the last offset in the data returned in the fetch response? If so, we
> > need
> > another offset index lookup since the leader only knows that it gives out
> > X
> > bytes in the fetch response, but not the last offset in those X bytes.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe 
> wrote:
> >
> > > On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
> > > > Hi, Colin,
> > > >
> > > > When fetching data for a partition, the leader needs to translate the
> > > > fetch offset to a position in a log segment with an index lookup. If
> the
> > > fetch
> > > > request now also needs to cache the offset for the next fetch
> request,
> > > > there will be an extra offset index lookup.
> > >
> > > Hmm.  So the way I was thinking about it was, with an incremental fetch
> > > request, for each partition:
> > >
> > > 1a. the leader consults its cache to find the offset it needs to use
> for
> > > the fetch request
> > > 2a. the leader performs a lookup to translate the offset to a file
> index
> > > 3a. the leader reads the data from the file
> > >
> > > In contrast, with a full fetch request, for each partition:
> > >
> > > 1b. the leader looks at the FetchRequest to find the offset it needs to
> > > use for the fetch request
> > > 2b. the leader performs a lookup to translate the offset to a file
> index
> > > 3b. the leader reads the data from the file
> > >
> > > It seems like there is only one offset index lookup in both cases?  The
> > > key point is that the cache in step #1a is not stored on disk.  Or
> maybe
> > > I'm missing something here.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > > The offset index lookup can
> > > > potentially be expensive since it could require disk I/Os. One way to
> > > > optimize this a bit is to further cache the log segment position for
> the
> > > > next offset. The tricky issue is that for a compacted topic, the
> > > > underlying
> > > > log segment could have changed between two consecutive fetch
> requests. We
> > > > could potentially make that case work, but the logic will be more
> > > > complicated.
> > > >
> > > > Another thing is that it seems that the proposal only saves the
> metadata
> > > > overhead if there are low volume topics. If we use Jay's suggestion
> of
> > > > including 0 partitions in subsequent fetch requests, it seems that we
> > > > could
> > > > get the metadata saving even if all topics have continuous traffic.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Nov 22, 2017 at 1:14 PM, Colin McCabe 
> > > wrote:
> > > >
> > > > > On Tue, Nov 21, 2017, at 22:11, Jun Rao wrote:
> > > > > > Hi, Jay,
> > > > > >
> > > > > > I guess in your proposal the leader has to cache the last offset
> > > given
> > > > > > back for each partition so that it knows from which offset to
> serve
> > > the
> > > > > next
> > > > > > fetch request.
> > > > >
> > > > > Hi Jun,
> > > > >
> > > > > Just to clarify, the leader has to cache the last offset for each
> > > > > follower / UUID in the original KIP-227 proposal as well.  Sorry if
> > > that
> > > > > wasn't clear.
> > > > >
> > > > > > This is doable but it means that the leader needs to do an
> > > > > > additional index lookup per partition to serve a fetch request.
> Not
> > > sure
> > > > > > if the benefit from the lighter fetch request obviously offsets
> the
> > > > > > additional index lookup though.
> > > > >
> > > > > The runtime impact should be a small constant factor at most,
> right?
> > > > > You would just have a mapping between UUID and the latest offset in
> > > each
> > > > > partition data structure.  It seems like the runtime impact of
> looking
> > > > > up the fetch offset in a hash table (or small array) in the
> in-memory
> > > > > partition data structure should be very similar to the runtime
> impact
> > > of
> > > > > looking up the fetch offset in the FetchRequest.
> > > > >
> > > > > The extra memory consumption per partition 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-23 Thread Ismael Juma
Hi James,

There are 2 options being discussed.

Option A is similar to the existing approach where the follower informs the
leader of offsets it has seen by asking for the next ones. We just skip the
partitions where the offset hasn't changed.

Option B involves the leader keeping track of the offsets returned to the
follower. So, when the follower does the next incremental request (with no
partitions), the leader assumes that the previously returned offsets were
stored by the follower. An important invariant is that the follower can
only send an empty incremental fetch request if the previous response was
successfully processed. What does the follower do if there was an issue
processing _some_ of the partitions in the response? The simplest option
would be to send a full fetch request. An alternative would be for the
follower to send an incremental fetch request with some offsets (overrides
to what the leader expects) although that adds even more complexity (i.e.
it's a combination of options A and B) and may not be worth it.
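
A toy Python model of Option B (all names invented): the leader caches the
offsets it last returned per fetch session, treats an empty incremental request
as an implicit acknowledgement, and lets a full fetch request reset its cached
positions.

```python
# Leader-side state for one follower's fetch session under Option B.
class LeaderSession:
    def __init__(self, initial_offsets):
        self.fetch_offsets = dict(initial_offsets)  # partition -> next offset

    def handle_full_fetch(self, offsets):
        # A full request states the follower's view explicitly (recovery path).
        self.fetch_offsets = dict(offsets)

    def handle_incremental_fetch(self, log_end_offsets):
        # Empty request body: the previous response was processed, so serve
        # from the cached positions and advance them to what we return now.
        response = {}
        for p, start in self.fetch_offsets.items():
            end = log_end_offsets[p]
            if end > start:
                response[p] = (start, end)  # range of offsets returned
                self.fetch_offsets[p] = end
        return response

s = LeaderSession({"tA-0": 10})
print(s.handle_incremental_fetch({"tA-0": 13}))  # {'tA-0': (10, 13)}
print(s.handle_incremental_fetch({"tA-0": 13}))  # {} (nothing new)
```

The invariant above is what makes the empty request safe: if the follower
failed to process part of the previous response, it must fall back to a full
fetch (handle_full_fetch here) rather than implicitly acknowledge.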

Ismael

On Thu, Nov 23, 2017 at 4:58 AM, James Cheng  wrote:

> I think the discussion may have already cover this but just in case...
>
> How does the leader decide when a newly written message is "committed"
> enough to hand out to consumers?
>
> When a message is produced and is stored to the disk of the leader, the
> message is not considered "committed" until it has hit all replicas in the
> ISR. Only at that point will the leader decide to hand out the message to
> normal consumers.
>
> In the current protocol, I believe the leader has to wait for 2 fetch
> requests from a follower before it considers the message committed: One to
> fetch the uncommitted message, and another to fetch anything after that. It
> is the fetch offset in the 2nd fetch that tells the leader that the
> follower now has the uncommitted message.
>
> As an example:
> 1a. Newly produced messages at offsets 10,11,12. Saved to leader, not yet
> replicated to followers.
> 2a. Follower asks for messages starting at offset 10. Leader hands out
> messages 10,11,12
> 3a. Follower asks for messages starting at offset 13. Based on that fetch
> request, the leader concludes that the follower already has messages
> 10,11,12, and so will now hand messages 10,11,12 out to consumers.
>
> How will the new protocol handle that? How will the leader know that the
> follower already has messages 10,11,12?
>
> In particular, how will the new protocol handle the case when not all
> partitions are returned in each request?
>
> Another example:
> 1b. Newly produced messages to topic A at offsets 10,11,12. Saved to
> leader, not yet replicated to followers.
> 2b. Newly produced 1MB message to topic B at offset 100. Saved to leader,
> not yet replicated to follower.
> 3b. Follower asks for messages from topic A starting at offset 10, and
> messages from topic B starting at offset 100.
> 4b. Leader decides to send to the follower the 1MB message at topic B
> offset 100. Due to replica.fetch.max.bytes, it only sends that single
> message to the follower.
> 5b. Follower asks for messages from topic A starting at offset 10, and
> messages from topic B starting at offset 101. Leader concludes that topic B
> offset 100 has been replicated and so can be handed out to consumers. Topic
> A messages 10,11,12 are not yet replicated and so cannot yet be handled out
> to consumers.
>
> In this particular case, the follower made no progress on replicating the
> new messages from topic A.
>
> How will the new protocol handle this scenario?
>
> -James
>
> > On Nov 22, 2017, at 7:54 PM, Colin McCabe  wrote:
> >
> > Oh, I see the issue now.  The broker uses sendfile() and sends some
> > message data without knowing what the ending offset is.  To learn that,
> > we would need another index access.
> > However, when we do that index->offset lookup, we know that the next
> offset-
> >> index lookup (done in the following fetch request) will be for the same
> > offset.  So we should be able to cache the result (the index).  Also:
> > Does the operating system’s page cache help us here?
> > Best,
> > Colin
> >
> > On Wed, Nov 22, 2017, at 16:53, Jun Rao wrote:
> >> Hi, Colin,
> >>
> >> After step 3a, do we need to update the cached offset in the
> >> leader to be> the last offset in the data returned in the fetch
> response? If so, we> need
> >> another offset index lookup since the leader only knows that it
> >> gives out> X
> >> bytes in the fetch response, but not the last offset in those X bytes.>
> >> Thanks,
> >>
> >> Jun
> >>
> >> On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe
> >>  wrote:>
> >>> On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
>  Hi, Colin,
> 
>  When fetching data for a partition, the leader needs to
>  translate the> > > fetch offset to a position in a log segment with
> an index lookup.
>  If the> > fetch
>  request now also needs to cache 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-23 Thread Ismael Juma
Hi Becket,

Relying on the cluster metadata doesn't seem like it would work if there
are multiple fetcher threads, right? It also doesn't work for the consumer
case, which Jay suggested would be good to handle.

Ismael

On Thu, Nov 23, 2017 at 2:21 AM, Becket Qin  wrote:

> Thanks for the KIP, Colin. It is an interesting idea.
>
> Thinking about the fetch protocol, at a high level, currently the following
> conveys two type of information:
> 1) what partitions I am interested in
> 2) where I am on those partitions, i.e. offsets
>
> An extreme optimization would be letting the leader know both 1) and 2)
> then the fetch request could be almost empty. I think we may be able
> achieve this when there is no leader migration.
>
> For 1) we actually kind of already have the information on each broker,
> which is the metadata. We have found that in many cases a versioned
> metadata is very helpful. With the metadata generation we can achieve 1),
> i.e. the follower does not need to tell the leader what it is interested
> in. More specifically, assuming we add a generation to the metadata, in the
> fetch request the follower will include a metadata generation, if the
> generation matches the generation of the metadata on the leader, the leader
> will send back a response indicating that the leader knows the follower's
> interested set of partitions, so there is no need to send a full fetch
> request. Otherwise, the follower still needs to send a full fetch request
> in the next round. This will achieve the goal that, unless there is a leader
> migration, the followers do not need to send full requests.
>
> There are other benefits of having a metadata generation. Those are
> orthogonal to this discussion. But since we may need it elsewhere, we need
> to introduce it at some point.
>
> For 2), there are two options: A) as Jun said, the leader can do a lookup
> to find the last offset sent back to the follower for each
> partition, or B) the follower sends back the updated log end offset in the
> next fetch request. If we do (A), one potential optimization is that we can
> let the leader always return the offsets at index boundary or log end
> offset. For example, consider a log whose log end offset is 350, and the
> index file has an entry at offset 100, 200, 300. The leader will always try
> to return bytes at the offset boundary or log end, i.e. for each fetch
> response, the leader will try to return the data up to the highest offset
> index entry as long as the data could fit into the fetch size of the
> partition, so it could be either 100, 200, 300 or 350(LEO). If so, the
> leader will know the last returned offset without an additional log scan.
> If the leader was not able to return at the index boundary or log end
> offset, e.g. the fetch size is too small or the index bytes interval is too
> large, the leader could then fall back to looking up the offset. Alternatively,
> the leader can set a flag in the fetch response asking the follower to
> provide the fetch offset in the next fetch request.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Wed, Nov 22, 2017 at 4:53 PM, Jun Rao  wrote:
>
> > Hi, Colin,
> >
> > After step 3a, do we need to update the cached offset in the leader to be
> > the last offset in the data returned in the fetch response? If so, we
> need
> > another offset index lookup since the leader only knows that it gives
> out X
> > bytes in the fetch response, but not the last offset in those X bytes.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe 
> wrote:
> >
> > > On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
> > > > Hi, Colin,
> > > >
> > > > When fetching data for a partition, the leader needs to translate the
> > > > fetch offset to a position in a log segment with an index lookup. If
> > the
> > > fetch
> > > > request now also needs to cache the offset for the next fetch
> request,
> > > > there will be an extra offset index lookup.
> > >
> > > Hmm.  So the way I was thinking about it was, with an incremental fetch
> > > request, for each partition:
> > >
> > > 1a. the leader consults its cache to find the offset it needs to use
> for
> > > the fetch request
> > > 2a. the leader performs a lookup to translate the offset to a file
> index
> > > 3a. the leader reads the data from the file
> > >
> > > In contrast, with a full fetch request, for each partition:
> > >
> > > 1b. the leader looks at the FetchRequest to find the offset it needs to
> > > use for the fetch request
> > > 2b. the leader performs a lookup to translate the offset to a file
> index
> > > 3b. the leader reads the data from the file
> > >
> > > It seems like there is only one offset index lookup in both cases?  The
> > > key point is that the cache in step #1a is not stored on disk.  Or
> maybe
> > > I'm missing something here.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > > The offset index lookup can
> > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread James Cheng
I think the discussion may have already covered this, but just in case...

How does the leader decide when a newly written message is "committed" enough 
to hand out to consumers?

When a message is produced and is stored to the disk of the leader, the message 
is not considered "committed" until it has hit all replicas in the ISR. Only at 
that point will the leader decide to hand out the message to normal consumers.

In the current protocol, I believe the leader has to wait for 2 fetch requests 
from a follower before it considers the message committed: one to fetch the 
uncommitted message, and another to fetch anything after that. It is the fetch 
offset in the 2nd fetch that tells the leader that the follower now has the 
uncommitted message.

As an example:
1a. Newly produced messages at offsets 10,11,12. Saved to leader, not yet 
replicated to followers.
2a. Follower asks for messages starting at offset 10. Leader hands out messages 
10,11,12
3a. Follower asks for messages starting at offset 13. Based on that fetch 
request, the leader concludes that the follower already has messages 10,11,12, 
and so will now hand messages 10,11,12 out to consumers.

How will the new protocol handle that? How will the leader know that the 
follower already has messages 10,11,12?

In particular, how will the new protocol handle the case when not all 
partitions are returned in each request?

Another example:
1b. Newly produced messages to topic A at offsets 10,11,12. Saved to leader, 
not yet replicated to followers.
2b. Newly produced 1MB message to topic B at offset 100. Saved to leader, not 
yet replicated to follower.
3b. Follower asks for messages from topic A starting at offset 10, and messages 
from topic B starting at offset 100.
4b. Leader decides to send to the follower the 1MB message at topic B offset 
100. Due to replica.fetch.max.bytes, it only sends that single message to the 
follower.
5b. Follower asks for messages from topic A starting at offset 10, and messages 
from topic B starting at offset 101. Leader concludes that topic B offset 100 
has been replicated and so can be handed out to consumers. Topic A messages 
10,11,12 are not yet replicated and so cannot yet be handed out to consumers.

In this particular case, the follower made no progress on replicating the new 
messages from topic A.

How will the new protocol handle this scenario?
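
As a toy model of the commit rule I'm describing (Python, with invented names -- not Kafka's actual replica code), where a follower's fetch offset acknowledges everything before it:

```python
class PartitionLeader:
    """Toy model of how a leader advances the high watermark (HW).

    A follower's fetch offset acknowledges all earlier offsets, so the
    HW here is the minimum fetch offset seen across the in-sync followers.
    Names and structure are illustrative only.
    """

    def __init__(self, isr):
        self.log_end_offset = 0
        self.high_watermark = 0
        # Last fetch offset seen from each in-sync follower.
        self.follower_fetch_offsets = {f: 0 for f in isr}

    def append(self, n_messages):
        self.log_end_offset += n_messages

    def on_fetch(self, follower, fetch_offset):
        # The fetch offset tells us the follower has everything < fetch_offset.
        self.follower_fetch_offsets[follower] = fetch_offset
        self.high_watermark = min(self.follower_fetch_offsets.values())

leader = PartitionLeader(isr=["follower-1"])
leader.append(13)                     # offsets 0..12 exist on the leader
leader.on_fetch("follower-1", 10)     # first fetch: asks for offset 10
assert leader.high_watermark == 10    # 10,11,12 not yet committed
leader.on_fetch("follower-1", 13)     # second fetch proves 10,11,12 replicated
assert leader.high_watermark == 13    # now consumers can see 10,11,12
```

The question above is what plays the role of `on_fetch` when a partition is omitted from an incremental fetch request.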

-James

> On Nov 22, 2017, at 7:54 PM, Colin McCabe  wrote:
> 
> Oh, I see the issue now.  The broker uses sendfile() and sends some
> message data without knowing what the ending offset is.  To learn that,
> we would need another index access.
> However, when we do that index->offset lookup, we know that the next
> offset->index lookup (done in the following fetch request) will be for the same
> offset.  So we should be able to cache the result (the index).  Also:
> Does the operating system’s page cache help us here?
> Best,
> Colin
> 
> On Wed, Nov 22, 2017, at 16:53, Jun Rao wrote:
>> Hi, Colin,
>> 
>> After step 3a, do we need to update the cached offset in the
>> leader to be the last offset in the data returned in the fetch response?
>> If so, we need
>> another offset index lookup since the leader only knows that it
>> gives out X
>> bytes in the fetch response, but not the last offset in those X bytes.
>> Thanks,
>> 
>> Jun
>> 
>> On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe
>>  wrote:
>>> On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
>>>> Hi, Colin,
>>>> 
>>>> When fetching data for a partition, the leader needs to
>>>> translate the fetch offset to a position in a log segment with an
>>>> index lookup.
>>>> If the fetch
>>>> request now also needs to cache the offset for the next fetch
>>>> request, there will be an extra offset index lookup.
>>> 
>>> Hmm.  So the way I was thinking about it was, with an
>>> incremental fetch request, for each partition:
>>> 
>>> 1a. the leader consults its cache to find the offset it needs to
>>> use for the fetch request
>>> 2a. the leader performs a lookup to translate the offset to a
>>> file index
>>> 3a. the leader reads the data from the file
>>> 
>>> In contrast, with a full fetch request, for each partition:
>>> 
>>> 1b. the leader looks at the FetchRequest to find the offset it
>>> needs to use for the fetch request
>>> 2b. the leader performs a lookup to translate the offset to a
>>> file index
>>> 3b. the leader reads the data from the file
>>> 
>>> It seems like there is only one offset index lookup in both
>>> cases?  The key point is that the cache in step #1a is not stored on
>>> disk.
>>> Or maybe I'm missing something here.
>>> 
>>> best,
>>> Colin
>>> 
>>> 
>>>> The offset index lookup can
>>>> potentially be expensive since it could require disk I/Os. One
>>>> way to optimize this a bit is to further cache the log segment
>>>> position
>>>> for the next offset. The tricky issue is that for a compacted topic, 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread Colin McCabe
Oh, I see the issue now.  The broker uses sendfile() and sends some
message data without knowing what the ending offset is.  To learn that,
we would need another index access.
However, when we do that index->offset lookup, we know that the next
offset->index lookup (done in the following fetch request) will be for the same
offset.  So we should be able to cache the result (the index).  Also:
Does the operating system’s page cache help us here?
Best,
Colin

On Wed, Nov 22, 2017, at 16:53, Jun Rao wrote:
> Hi, Colin,
>
> After step 3a, do we need to update the cached offset in the
> leader to be the last offset in the data returned in the fetch response? If
> so, we need
> another offset index lookup since the leader only knows that it
> gives out X
> bytes in the fetch response, but not the last offset in those X bytes.
>
> Thanks,
>
> Jun
>
> On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe
>  wrote:
> > On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
> > > Hi, Colin,
> > >
> > > When fetching data for a partition, the leader needs to
> > > translate the fetch offset to a position in a log segment with an
> > > index lookup.
> > > If the fetch
> > > request now also needs to cache the offset for the next fetch
> > > request, there will be an extra offset index lookup.
> >
> > Hmm.  So the way I was thinking about it was, with an
> > incremental fetch request, for each partition:
> >
> > 1a. the leader consults its cache to find the offset it needs to
> > use for the fetch request
> > 2a. the leader performs a lookup to translate the offset to a
> > file index
> > 3a. the leader reads the data from the file
> >
> > In contrast, with a full fetch request, for each partition:
> >
> > 1b. the leader looks at the FetchRequest to find the offset it
> > needs to use for the fetch request
> > 2b. the leader performs a lookup to translate the offset to a
> > file index
> > 3b. the leader reads the data from the file
> >
> > It seems like there is only one offset index lookup in both
> > cases?  The key point is that the cache in step #1a is not stored on
> > disk.
> > Or maybe I'm missing something here.
> >
> > best,
> > Colin
> >
> >
> > > The offset index lookup can
> > > potentially be expensive since it could require disk I/Os. One
> > > way to optimize this a bit is to further cache the log segment
> > > position
> > > for the next offset. The tricky issue is that for a compacted topic,
> > > the
> > > underlying
> > > log segment could have changed between two consecutive fetch
> > > requests. We could potentially make that case work, but the logic
> > > will be more complicated.
> > >
> > > Another thing is that it seems that the proposal only saves the
> > > metadata overhead if there are low volume topics. If we use Jay's
> > > suggestion of including 0 partitions in subsequent fetch requests,
> > > it seems
> > > that we could
> > > get the metadata saving even if all topics have continuous
> > > traffic.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > > On Wed, Nov 22, 2017 at 1:14 PM, Colin McCabe
> > > wrote:
> > >
> > > > On Tue, Nov 21, 2017, at 22:11, Jun Rao wrote:
> > > > > Hi, Jay,
> > > > >
> > > > > I guess in your proposal the leader has to cache the last
> > > > > offset given
> > > > > back for each partition so that it knows from which offset to
> > > > > serve the
> > > > > next
> > > > > fetch request.
> > > >
> > > > Hi Jun,
> > > >
> > > > Just to clarify, the leader has to cache the last offset for
> > > > each follower / UUID in the original KIP-227 proposal as well.
> > > > Sorry if that
> > > > wasn't clear.
> > > >
> > > > > This is doable but it means that the leader needs to do an
> > > > > additional index lookup per partition to serve a fetch
> > > > > request. Not sure
> > > > > if the benefit from the lighter fetch request obviously
> > > > > offsets the
> > > > > additional index lookup though.
> > > >
> > > > The runtime impact should be a small constant factor at most,
> > > > right?
> > > > You would just have a mapping between UUID and the latest offset
> > > > in each
> > > > partition data structure.  It seems like the runtime impact of
> > > > looking up the fetch offset in a hash table (or small array) in
> > > > the in-memory
> > > > partition data structure should be very similar to the runtime
> > > > impact of
> > > > looking up the fetch offset in the FetchRequest.
> > > >
> > > > The extra memory consumption per partition is O(num_brokers),
> > > > which is
> > > > essentially a small constant.  (The fact that brokers can have
> > > > multiple
> > > > UUIDs due to parallel fetches is a small wrinkle.  But we can
> > > > place an
> > > > upper bound on the number of UUIDs permitted per broker.)
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > >
> > > > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread Becket Qin
Thanks for the KIP, Colin. It is an interesting idea.

Thinking about the fetch protocol, at a high level, currently the following
conveys two types of information:
1) what partitions I am interested in
2) where I am on those partitions, i.e. offsets

An extreme optimization would be letting the leader know both 1) and 2)
then the fetch request could be almost empty. I think we may be able to
achieve this when there is no leader migration.

For 1) we actually kind of already have the information on each broker,
which is the metadata. We have found that in many cases versioned
metadata is very helpful. With a metadata generation we can achieve 1),
i.e. the follower does not need to tell the leader what it is interested
in. More specifically, assuming we add a generation to the metadata, in the
fetch request the follower will include a metadata generation, if the
generation matches the generation of the metadata on the leader, the leader
will send back a response indicating that the leader knows the follower's
interested set of partitions, so there is no need to send a full fetch
request. Otherwise, the follower still needs to send a full fetch request
in the next round. This will achieve the goal that, unless there is a leader
migration, the followers do not need to send full requests.
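
A rough sketch of that generation check (Python, with made-up field names -- just to illustrate the handshake, not actual broker code):

```python
def handle_fetch(leader_metadata_generation, request):
    """Toy model: if the follower's metadata generation matches the
    leader's, the leader already knows the follower's interested set of
    partitions and the request can omit the partition list; otherwise
    the follower must send a full fetch request in the next round."""
    if request["metadata_generation"] == leader_metadata_generation:
        # Partition set is implied; serve from the leader's cached view.
        return {"need_full_fetch": False}
    # Stale generation (e.g. a leader migration happened): the follower
    # has to resend the complete partition list.
    return {"need_full_fetch": True}

assert handle_fetch(7, {"metadata_generation": 7}) == {"need_full_fetch": False}
assert handle_fetch(8, {"metadata_generation": 7}) == {"need_full_fetch": True}
```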

There are other benefits of having a metadata generation. Those are
orthogonal to this discussion. But since we may need it elsewhere, we need
to introduce it at some point.

For 2), there are two options: A) as Jun said, the leader can do a lookup
to find the last offset sent back to the follower for each
partition, or B) the follower sends back the updated log end offset in the
next fetch request. If we do (A), one potential optimization is that we can
let the leader always return the offsets at index boundary or log end
offset. For example, consider a log whose log end offset is 350, and the
index file has an entry at offset 100, 200, 300. The leader will always try
to return bytes at the offset boundary or log end, i.e. for each fetch
response, the leader will try to return the data up to the highest offset
index entry as long as the data could fit into the fetch size of the
partition, so it could be either 100, 200, 300 or 350(LEO). If so, the
leader will know the last returned offset without an additional log scan.
If the leader was not able to return at the index boundary or log end
offset, e.g. the fetch size is too small or the index bytes interval is too
large, the leader could then fall back to looking up the offset. Alternatively,
the leader can set a flag in the fetch response asking the follower to
provide the fetch offset in the next fetch request.
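
To make the boundary rule in (A) concrete, here is a toy sketch (Python; `position_of` is a stand-in for the offset index, and all names are invented):

```python
def choose_return_offset(index_entries, log_end_offset, fetch_offset,
                         position_of, max_bytes):
    """Toy model of option (A): always try to end the response exactly at
    an offset-index boundary or at the log end, so the leader knows the
    last returned offset without an extra index scan.

    index_entries: sorted offsets that have index entries (e.g. [100, 200, 300])
    position_of:   maps an offset to a byte position (illustrative stand-in)
    Returns the chosen end offset, or None => fall back to an offset lookup.
    """
    candidates = [o for o in index_entries if o > fetch_offset] + [log_end_offset]
    start = position_of(fetch_offset)
    best = None
    for boundary in candidates:          # ascending, so keep the highest fit
        if position_of(boundary) - start <= max_bytes:
            best = boundary
    return best

pos = lambda o: o * 100                  # pretend each offset is 100 bytes
# Index entries at 100, 200, 300; LEO 350; follower fetching from offset 0.
assert choose_return_offset([100, 200, 300], 350, 0, pos, 25_000) == 200
assert choose_return_offset([100, 200, 300], 350, 0, pos, 100_000) == 350
assert choose_return_offset([100, 200, 300], 350, 0, pos, 5_000) is None
```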

Thanks,

Jiangjie (Becket) Qin


On Wed, Nov 22, 2017 at 4:53 PM, Jun Rao  wrote:

> Hi, Colin,
>
> After step 3a, do we need to update the cached offset in the leader to be
> the last offset in the data returned in the fetch response? If so, we need
> another offset index lookup since the leader only knows that it gives out X
> bytes in the fetch response, but not the last offset in those X bytes.
>
> Thanks,
>
> Jun
>
> On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe  wrote:
>
> > On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
> > > Hi, Colin,
> > >
> > > When fetching data for a partition, the leader needs to translate the
> > > fetch offset to a position in a log segment with an index lookup. If
> the
> > fetch
> > > request now also needs to cache the offset for the next fetch request,
> > > there will be an extra offset index lookup.
> >
> > Hmm.  So the way I was thinking about it was, with an incremental fetch
> > request, for each partition:
> >
> > 1a. the leader consults its cache to find the offset it needs to use for
> > the fetch request
> > 2a. the leader performs a lookup to translate the offset to a file index
> > 3a. the leader reads the data from the file
> >
> > In contrast, with a full fetch request, for each partition:
> >
> > 1b. the leader looks at the FetchRequest to find the offset it needs to
> > use for the fetch request
> > 2b. the leader performs a lookup to translate the offset to a file index
> > 3b. the leader reads the data from the file
> >
> > It seems like there is only one offset index lookup in both cases?  The
> > key point is that the cache in step #1a is not stored on disk.  Or maybe
> > I'm missing something here.
> >
> > best,
> > Colin
> >
> >
> > > The offset index lookup can
> > > potentially be expensive since it could require disk I/Os. One way to
> > > optimize this a bit is to further cache the log segment position for
> the
> > > next offset. The tricky issue is that for a compacted topic, the
> > > underlying
> > > log segment could have changed between two consecutive fetch requests.
> We
> > > could potentially make that case work, but the logic will be more
> > > complicated.
> > >
> > > Another thing is that it seems that the proposal only saves the
> metadata
> > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread Jun Rao
Hi, Colin,

After step 3a, do we need to update the cached offset in the leader to be
the last offset in the data returned in the fetch response? If so, we need
another offset index lookup since the leader only knows that it gives out X
bytes in the fetch response, but not the last offset in those X bytes.

Thanks,

Jun
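
For concreteness, the per-session bookkeeping under discussion (steps 1a-3a quoted below) can be modeled as a toy sketch -- Python, invented names, not the real broker code -- including the update this question is about:

```python
class FetchSession:
    """Toy model of incremental-fetch bookkeeping on the leader.

    The leader caches, per partition, the next offset to serve.  After a
    read it must learn the last offset contained in the returned bytes
    (the extra lookup in question) before it can advance the cache."""

    def __init__(self):
        self.next_offset = {}        # partition -> offset to serve next

    def serve(self, partition, read_log):
        # 1a. consult the session cache instead of the FetchRequest
        fetch_offset = self.next_offset.get(partition, 0)
        # 2a/3a. translate offset -> file position and read; read_log is a
        # stand-in returning (records, last_offset_in_returned_bytes)
        records, last_offset = read_log(partition, fetch_offset)
        # The update being discussed: without knowing last_offset, the
        # leader cannot advance the cached offset for the next fetch.
        self.next_offset[partition] = last_offset + 1
        return records

session = FetchSession()
fake_log = lambda p, off: (list(range(off, off + 3)), off + 2)
assert session.serve("tp0", fake_log) == [0, 1, 2]
assert session.next_offset["tp0"] == 3
assert session.serve("tp0", fake_log) == [3, 4, 5]
```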

On Wed, Nov 22, 2017 at 4:01 PM, Colin McCabe  wrote:

> On Wed, Nov 22, 2017, at 14:09, Jun Rao wrote:
> > Hi, Colin,
> >
> > When fetching data for a partition, the leader needs to translate the
> > fetch offset to a position in a log segment with an index lookup. If the
> fetch
> > request now also needs to cache the offset for the next fetch request,
> > there will be an extra offset index lookup.
>
> Hmm.  So the way I was thinking about it was, with an incremental fetch
> request, for each partition:
>
> 1a. the leader consults its cache to find the offset it needs to use for
> the fetch request
> 2a. the leader performs a lookup to translate the offset to a file index
> 3a. the leader reads the data from the file
>
> In contrast, with a full fetch request, for each partition:
>
> 1b. the leader looks at the FetchRequest to find the offset it needs to
> use for the fetch request
> 2b. the leader performs a lookup to translate the offset to a file index
> 3b. the leader reads the data from the file
>
> It seems like there is only one offset index lookup in both cases?  The
> key point is that the cache in step #1a is not stored on disk.  Or maybe
> I'm missing something here.
>
> best,
> Colin
>
>
> > The offset index lookup can
> > potentially be expensive since it could require disk I/Os. One way to
> > optimize this a bit is to further cache the log segment position for the
> > next offset. The tricky issue is that for a compacted topic, the
> > underlying
> > log segment could have changed between two consecutive fetch requests. We
> > could potentially make that case work, but the logic will be more
> > complicated.
> >
> > Another thing is that it seems that the proposal only saves the metadata
> > overhead if there are low volume topics. If we use Jay's suggestion of
> > including 0 partitions in subsequent fetch requests, it seems that we
> > could
> > get the metadata saving even if all topics have continuous traffic.
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Wed, Nov 22, 2017 at 1:14 PM, Colin McCabe 
> wrote:
> >
> > > On Tue, Nov 21, 2017, at 22:11, Jun Rao wrote:
> > > > Hi, Jay,
> > > >
> > > > I guess in your proposal the leader has to cache the last offset
> given
> > > > back for each partition so that it knows from which offset to serve
> the
> > > next
> > > > fetch request.
> > >
> > > Hi Jun,
> > >
> > > Just to clarify, the leader has to cache the last offset for each
> > > follower / UUID in the original KIP-227 proposal as well.  Sorry if
> that
> > > wasn't clear.
> > >
> > > > This is doable but it means that the leader needs to do an
> > > > additional index lookup per partition to serve a fetch request. Not
> sure
> > > > if the benefit from the lighter fetch request obviously offsets the
> > > > additional index lookup though.
> > >
> > > The runtime impact should be a small constant factor at most, right?
> > > You would just have a mapping between UUID and the latest offset in
> each
> > > partition data structure.  It seems like the runtime impact of looking
> > > up the fetch offset in a hash table (or small array) in the in-memory
> > > partition data structure should be very similar to the runtime impact
> of
> > > looking up the fetch offset in the FetchRequest.
> > >
> > > The extra memory consumption per partition is O(num_brokers), which is
> > > essentially a small constant.  (The fact that brokers can have multiple
> > > UUIDs due to parallel fetches is a small wrinkle.  But we can place an
> > > upper bound on the number of UUIDs permitted per broker.)
> > >
> > > best,
> > > Colin
> > >
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, Nov 21, 2017 at 7:03 PM, Jay Kreps  wrote:
> > > >
> > > > > I think the general thrust of this makes a ton of sense.
> > > > >
> > > > > I don't love that we're introducing a second type of fetch
> request. I
> > > think
> > > > > the motivation is for compatibility, right? But isn't that what
> > > versioning
> > > > > is for? Basically to me although the modification we're making
> makes
> > > sense,
> > > > > the resulting protocol doesn't really seem like something you would
> > > design
> > > > > this way from scratch.
> > > > >
> > > > > I think I may be misunderstanding the semantics of the partitions
> in
> > > > > IncrementalFetchRequest. I think the intention is that the server
> > > remembers
> > > > > the partitions you last requested, and the partitions you specify
> in
> > > the
> > > > > request are added to this set. This is a bit odd though because you
> > > can add
> > > > > partitions but I don't see how you 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread Ismael Juma
Hi Jay,

On Thu, Nov 23, 2017 at 12:01 AM, Jay Kreps  wrote:

> I was also thinking there could be mechanical improvements that would help
> efficiency such as sharing topic name or TopicPartition objects to reduce
> the footprint in a flyweight style.


Coincidentally, I was thinking of this earlier today and implemented a
quick sketch.

The topic name case is a win in both reduced memory usage and reduced
allocation since we can get the interned topic name string (if it exists)
from the ByteBuffer directly (with very little allocation). For the
TopicPartition case, it seems hard to avoid allocating the TopicPartition
instance itself (which is lightweight if the topic name string has already
been interned), so the benefit would be reduced memory usage only.
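
A rough model of the topic-name part of that sketch (the real code would work on the request ByteBuffer in Java; the Python names here are invented):

```python
class TopicNameCache:
    """Toy flyweight cache: map the raw wire bytes of a topic name to a
    single shared str, so repeated requests for the same topic neither
    re-decode nor re-allocate the name."""

    def __init__(self):
        self._by_bytes = {}

    def get(self, raw: bytes) -> str:
        name = self._by_bytes.get(raw)
        if name is None:
            name = raw.decode("utf-8")   # allocate only on first sight
            self._by_bytes[raw] = name
        return name

cache = TopicNameCache()
a = cache.get(b"my-topic")
b = cache.get(b"my-topic")
assert a is b                            # same object, no second allocation
```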

Ismael


Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread Colin McCabe
On Wed, Nov 22, 2017, at 16:01, Jay Kreps wrote:
> Hey Colin,
> 
> WRT memory management I think what you are saying is that you would add a
> field to the fetch request which would request that the server cache the
> set of partitions and the response would have a field indicating whether
> that happened or not. This would allow a bound on memory.

Yeah.

> I was also thinking there could be mechanical improvements that would
> help efficiency such as sharing topic name or TopicPartition objects to reduce
> the footprint in a flyweight style. If you think about it there is
> already some memory overhead on a per-connection basis for socket buffers and
> purgatory so a little more might be okay.

We could just implement our own version of String#intern.  Apparently
the default one is really bad, but you could probably create a much
better one with ConcurrentHashMap.  See
https://stackoverflow.com/questions/10624232/performance-penalty-of-string-intern
.  Strings are just one thing, though: there is a lot of other stuff
like builders, partition objects, serde containers, temporary objects
scala creates, and so on.  Algorithmic improvements are a lot more
exciting than micro-optimizations here, I think.
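
As a sketch of that idea (modeled in Python for brevity; the Java version would key a ConcurrentHashMap and use putIfAbsent instead of String.intern):

```python
_intern_pool = {}

def fast_intern(s: str) -> str:
    """Toy user-level intern: keep one shared copy per distinct string in
    a plain hash map rather than the JVM's native intern table.  Here
    dict.setdefault plays the role of ConcurrentHashMap.putIfAbsent."""
    return _intern_pool.setdefault(s, s)

# Two dynamically built strings with the same value...
x = fast_intern("topic-" + str(1))
y = fast_intern("topic-" + str(1))
assert x == y and x is y                 # ...deduplicated to one object
```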

P.S. Hopefully, newer GCs like Shenandoah will improve our GC
performance.  [ https://wiki.openjdk.java.net/display/shenandoah/Main ]

regards,
Colin

> 
> -Jay
> 
> On Wed, Nov 22, 2017 at 1:46 PM, Colin McCabe  wrote:
> 
> > On Wed, Nov 22, 2017, at 13:43, Colin McCabe wrote:
> > > On Wed, Nov 22, 2017, at 13:08, Jay Kreps wrote:
> > > > Okay yeah, what I said didn't really work or make sense. Ismael's
> > > > interpretation is better.
> > > >
> > > > Couple of things to point out:
> > > >
> > > >1. I'm less sure that replication has a high partition count and
> > > >consumers don't. There are definitely use cases for consumers that
> > > >subscribe to everything (e.g. load all my data into HDFS) as well as
> > > >super high partition count topics. In a bigger cluster it is
> > unlikely a
> > > >given node is actually replicating that many partitions from another
> > > >particular node (though perhaps in aggregate the effect is the
> > same).
> > > >I think it would clearly be desirable to have a solution that
> > targeted
> > > >both the consumer and replication if that were achievable.
> > >
> > > Hmm.  I hadn't considered the possibility that consumers might want to
> > > subscribe to a huge number of topics.  That's a fair point (especially
> > > with the replication example).
> > >
> > > >I agree with the concern on memory, but perhaps there could be a
> > way to
> > > >be smart about the memory usage?
> > >
> > > One approach would be to let clients compete for a configurable number
> > > of cache slots on the broker.  So only the first N clients to ask for an
> > > incremental fetch request UUID would receive one.  You could combine
> > > this with making the clients not request an incremental fetch request
> > > unless they were following more than some configurable number of
> > > partitions (like 10).  That way you wouldn't waste all your cache slots
> > > on clients that were only following 1 or 2 partitions, and hence
> > > wouldn't benefit much from the optimization.
> >
> > By the way, I was envisioning the cache slots as something that would
> > time out.  So if a client created an incremental fetch UUID and then
> > disappeared, we'd eventually purge its cached offsets and let someone
> > else use the memory.
> >
> > 
> >
> > >
> > > This is basically a bet on the idea that if you have clients following a
> > > huge number of partitions, you probably will only have a limited number
> > > of such clients.  Arguably, if you have a huge number of clients
> > > following a huge number of partitions, you are going to have performance
> > > problems anyway.
> > >
> > > >2. For the question of one request vs two, one difference in values
> > > >here may be that it sounds like you are proposing a less ideal
> > protocol to
> > > >simplify the broker code. To me the protocol is really *the*
> > > >fundamental interface in Kafka and we should really strive to make
> > that
> > > >something that is beautiful and makes sense on its own (without
> > needing
> > > >to understand the history of how we got there). I think there may
> > well
> > > >be such an explanation for the two API version (as you kind of said
> > with
> > > >your HDFS analogy) but really making it clear how these two APIs are
> > > >different and how they interact is key. Like, basically I think we
> > should
> > > >be able to explain it from scratch in such a way that it is obvious
> > you'd
> > > >have these two things as the fundamental primitives for fetching
> > data.
> > >
> > > I can see some arguments for having a single API.  One is that both
> > > incremental and full fetch requests will travel along a similar code
> > > 

Re: [DISCUSS] KIP-227: Introduce Incremental FetchRequests to Increase Partition Scalability

2017-11-22 Thread Jay Kreps
Hey Colin,

WRT memory management I think what you are saying is that you would add a
field to the fetch request which would request that the server cache the
set of partitions and the response would have a field indicating whether
that happened or not. This would allow a bound on memory.

I was also thinking there could be mechanical improvements that would help
efficiency such as sharing topic name or TopicPartition objects to reduce
the footprint in a flyweight style. If you think about it there is already
some memory overhead on a per-connection basis for socket buffers and
purgatory so a little more might be okay.

-Jay

On Wed, Nov 22, 2017 at 1:46 PM, Colin McCabe  wrote:

> On Wed, Nov 22, 2017, at 13:43, Colin McCabe wrote:
> > On Wed, Nov 22, 2017, at 13:08, Jay Kreps wrote:
> > > Okay yeah, what I said didn't really work or make sense. Ismael's
> > > interpretation is better.
> > >
> > > Couple of things to point out:
> > >
> > >1. I'm less sure that replication has a high partition count and
> > >consumers don't. There are definitely use cases for consumers that
> > >subscribe to everything (e.g. load all my data into HDFS) as well as
> > >super high partition count topics. In a bigger cluster it is
> unlikely a
> > >given node is actually replicating that many partitions from another
> > >particular node (though perhaps in aggregate the effect is the
> same).
> > >I think it would clearly be desirable to have a solution that
> targeted
> > >both the consumer and replication if that were achievable.
> >
> > Hmm.  I hadn't considered the possibility that consumers might want to
> > subscribe to a huge number of topics.  That's a fair point (especially
> > with the replication example).
> >
> > >I agree with the concern on memory, but perhaps there could be a
> way to
> > >be smart about the memory usage?
> >
> > One approach would be to let clients compete for a configurable number
> > of cache slots on the broker.  So only the first N clients to ask for an
> > incremental fetch request UUID would receive one.  You could combine
> > this with having clients request an incremental fetch UUID only when
> > they were following more than some configurable number of
> > partitions (like 10).  That way you wouldn't waste all your cache slots
> > on clients that were only following 1 or 2 partitions, and hence
> > wouldn't benefit much from the optimization.
>
> By the way, I was envisioning the cache slots as something that would
> time out.  So if a client created an incremental fetch UUID and then
> disappeared, we'd eventually purge its cached offsets and let someone
> else use the memory.
>
> C.
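The cache-slot scheme Colin describes (a bounded number of slots, a minimum-partition threshold so small consumers don't occupy them, and a time-out so abandoned sessions are reclaimed) can be sketched as follows. This is an illustrative sketch under assumed parameters; `SessionSlots`, `try_grant`, and the specific defaults are all invented for the example.

```python
import time

class SessionSlots:
    def __init__(self, max_slots=1000, min_partitions=10, ttl_secs=120):
        self.max_slots = max_slots
        self.min_partitions = min_partitions
        self.ttl_secs = ttl_secs
        self.slots = {}  # session_id -> (partition set, last-used time)
        self.next_id = 1

    def try_grant(self, partitions, now=None):
        now = time.monotonic() if now is None else now
        self._expire(now)
        # Don't waste a slot on a client that barely benefits.
        if len(partitions) < self.min_partitions:
            return None
        if len(self.slots) >= self.max_slots:
            return None
        sid = self.next_id
        self.next_id += 1
        self.slots[sid] = (set(partitions), now)
        return sid

    def touch(self, sid, now=None):
        # Called on each incremental fetch to keep the session alive.
        now = time.monotonic() if now is None else now
        if sid in self.slots:
            parts, _ = self.slots[sid]
            self.slots[sid] = (parts, now)

    def _expire(self, now):
        dead = [sid for sid, (_, last) in self.slots.items()
                if now - last > self.ttl_secs]
        for sid in dead:
            del self.slots[sid]
```

A client that disappears simply stops touching its slot, and the next grant attempt after `ttl_secs` reclaims the memory, which is the purge behavior described above.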
>
> >
> > This is basically a bet on the idea that if you have clients following a
> > huge number of partitions, you probably will only have a limited number
> > of such clients.  Arguably, if you have a huge number of clients
> > following a huge number of partitions, you are going to have performance
> > problems anyway.
> >
> > >2. For the question of one request vs two, one difference in values
> > >here may be that it sounds like you are proposing a less ideal
> > >protocol to simplify the broker code. To me the protocol is really
> > >*the* fundamental interface in Kafka and we should really strive to
> > >make that something that is beautiful and makes sense on its own
> > >(without needing to understand the history of how we got there). I
> > >think there may well be such an explanation for the two API version
> > >(as you kind of said with your HDFS analogy) but really making it
> > >clear how these two APIs are different and how they interact is key.
> > >Like, basically I think we should be able to explain it from scratch
> > >in such a way that it is obvious you'd have these two things as the
> > >fundamental primitives for fetching data.
> >
> > I can see some arguments for having a single API.  One is that both
> > incremental and full fetch requests will travel along a similar code
> > path.  There will also be a lot of the same fields in both the request
> > and the response.  Separating the APIs means duplicating those fields
> > (like max_wait_time, min_bytes, isolation_level, etc.)
> >
> > The argument for having two APIs is that some fields will be present
> > in incremental requests and not in full ones, and vice versa.  For
> > example, incremental requests will have a UUID, whereas full requests
> > will not.  And clearly, the interpretation of some fields will be a bit
> > different.  For example, incremental requests will only return
> > information about changed partitions, whereas full requests will return
> > information about all partitions in the request.
> >
> > On the whole, maybe having a single API makes more sense?  There really
> > would be a lot of duplicated fields if we split the APIs.
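A single unified request along those lines might carry the shared fields once, with the session fields deciding the interpretation. A hedged sketch: only `max_wait_time`, `min_bytes`, and `isolation_level` come from the discussion above, and every other name here is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict

SESSION_NONE = 0  # no session: a full, stateless fetch

@dataclass
class FetchRequest:
    # Fields shared by full and incremental fetches (listed once,
    # rather than duplicated across two request types).
    max_wait_time_ms: int
    min_bytes: int
    isolation_level: int
    # Session field: anything other than SESSION_NONE marks the
    # request as incremental.
    session_id: int = SESSION_NONE
    # partition -> fetch offset; for an incremental request, only the
    # partitions that changed since the previous request.
    partitions: Dict[str, int] = field(default_factory=dict)

    def is_incremental(self):
        return self.session_id != SESSION_NONE
```

The duplication argument is visible directly in the schema: splitting this into two request types would copy the first three fields into both, while the unified form pays only for the session fields that full fetches ignore.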
> >
> > best,
> > Colin
> >
> > >
> > > -Jay
> > >
> > > 
