[jira] [Commented] (KAFKA-2397) leave group request

2015-10-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961597#comment-14961597
 ] 

ASF GitHub Bot commented on KAFKA-2397:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/103


> leave group request
> ---
>
> Key: KAFKA-2397
> URL: https://issues.apache.org/jira/browse/KAFKA-2397
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Onur Karaman
> Assignee: Onur Karaman
> Priority: Minor
> Fix For: 0.9.0.0
>
>
> Let's say every consumer in a group has session timeout s. Currently, if a 
> consumer leaves the group, the worst case time to stabilize the group is 2s 
> (s to detect the consumer failure + s for the rebalance window). If a 
> consumer instead can declare they are leaving the group, the worst case time 
> to stabilize the group would just be the s associated with the rebalance 
> window.
> This is a low priority optimization!
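The arithmetic in the description above can be sketched as a small model. This is only an illustration: the function name and the 30-second value are assumptions chosen for the example, not taken from the patch.

```python
def worst_case_stabilization(session_timeout_s: float, explicit_leave: bool) -> float:
    """Worst-case seconds until the group is stable after a member leaves.

    Without an explicit leave, the coordinator needs up to one session
    timeout to detect the departure, then up to one more for the rebalance
    window. With an explicit leave, the detection term drops out.
    """
    detection = 0.0 if explicit_leave else session_timeout_s
    rebalance_window = session_timeout_s
    return detection + rebalance_window


s = 30.0  # illustrative session timeout in seconds (assumed value)
print(worst_case_stabilization(s, explicit_leave=False))  # 60.0, i.e. 2s
print(worst_case_stabilization(s, explicit_leave=True))   # 30.0, i.e. s
```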



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2397) leave group request

2015-10-14 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957784#comment-14957784
 ] 

Onur Karaman commented on KAFKA-2397:
-

Cool. It sounds like we all generally agree on the explicit request. Does a 
committer want to review the pull request?



[jira] [Commented] (KAFKA-2397) leave group request

2015-10-14 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957718#comment-14957718
 ] 

Guozhang Wang commented on KAFKA-2397:
--

For consumer shutdown, in the ZK-based consumer we will immediately delete its 
ephemeral node so that other members will be notified, hence now in the new 
consumer without this fix we are effectively introducing a regression.

For consumer hard failure cases though, with the ZK-based old consumer it also 
takes a session timeout period to be detected; so I feel modifying the socket 
server to propagate client-id information to the coordinator in order to 
improve on this case may be overkill.



[jira] [Commented] (KAFKA-2397) leave group request

2015-10-14 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957550#comment-14957550
 ] 

Jay Kreps commented on KAFKA-2397:
--

Dunno if we closed the loop on the approach.

[~onurkaraman] Yeah the pro of the TCP approach is that all clients get it 
automatically even in hard app failure cases.

The downside is that the implementation on the server side is more involved and 
there is some risk of unnecessary rebalances if there are situations that cause 
the connection to be lost.

Another aspect is the ability to implement a shutdown without rebalance for 
quick restarts. This is particularly useful for stream processing where there 
is associated state that takes work to rebuild. I don't think this can be 
implemented easily with the TCP connection approach.

I think I'm on board with doing it explicitly.

I think the other question is whether it is cleaner to add a field to the ack 
or make a custom request. I don't have a strong opinion on that though in 
general I think fewer requests is better.



[jira] [Commented] (KAFKA-2397) leave group request

2015-10-13 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955673#comment-14955673
 ] 

Onur Karaman commented on KAFKA-2397:
-

My pull request had diverged again from trunk, so I force pushed a rebase that 
just cleans up the conflicts.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-04 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731373#comment-14731373
 ] 

Jiangjie Qin commented on KAFKA-2397:
-

[~jkreps] using TCP close to signal disconnect does have merits. It works 
whether the client process crashes or closes normally. It is just not very 
clear to me whether it is worth doing here.

The price we pay here is that we have to propagate every connection close at 
the network layer to the coordinator. From the server logs I saw at LinkedIn, 
socket closure is quite frequent. Todd even submitted a patch to change that 
particular log to debug level. These could just be the ad-hoc SyncProducer in 
the old consumer refreshing metadata. Maybe I'm overly concerned, but I am a 
bit worried about the noise here.

I don't know in which cases a TCP connection might be closed. Proxies were 
mentioned earlier; maybe also some workload balancer / firewall / gateway, etc. 
I feel it might be another unnecessary assumption/dependency we introduce that 
is not buying us much.

Another thing I am not sure about is how often an application process crashes, 
other than when people do a kill -9. In most cases there are multiple threads 
in an application. If an uncaught exception is thrown, usually only that thread 
dies and the process will hang but not exit unless the application exits 
explicitly, as mirror maker does. In that case, is it reasonable to expect 
client.close() to be called in the application shutdown hook or some finally 
block? (That may not be the case for other languages like C, though.) If 
using TCP close mainly addresses kill -9, it is very likely that the session 
timeout has already been reached by the time people manually kill the process.
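The "close in a finally block" pattern mentioned above might look roughly like the sketch below. SketchConsumer is a hypothetical stand-in, not a real Kafka client class, and the list it appends to only imitates what the coordinator would receive.

```python
class SketchConsumer:
    """Hypothetical stand-in for a consumer client; not a real Kafka API."""

    def __init__(self, coordinator_log):
        self.coordinator_log = coordinator_log
        self.closed = False

    def poll(self):
        # Simulates an uncaught application error in the processing loop.
        raise RuntimeError("simulated uncaught application error")

    def close(self):
        # Assumption: close() is where the explicit leave group is sent.
        if not self.closed:
            self.coordinator_log.append("LeaveGroup")
            self.closed = True


def run(consumer):
    try:
        consumer.poll()
    finally:
        # Runs on normal exit and on exceptions, but not on kill -9.
        consumer.close()


log = []
try:
    run(SketchConsumer(log))
except RuntimeError:
    pass  # the application would log the failure here
```

In this model the leave notification is sent even though poll() failed. A kill -9 skips the finally block entirely, which is the case the session timeout still has to cover.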

> leave group request
> ---
>
> Key: KAFKA-2397
> URL: https://issues.apache.org/jira/browse/KAFKA-2397
> Project: Kafka
> Issue Type: Sub-task
> Components: consumer
> Reporter: Onur Karaman
> Assignee: Onur Karaman
> Priority: Minor
> Fix For: 0.8.3
>
>
> Let's say every consumer in a group has session timeout s. Currently, if a 
> consumer leaves the group, the worst case time to stabilize the group is 2s 
> (s to detect the consumer failure + s for the rebalance window). If a 
> consumer instead can declare they are leaving the group, the worst case time 
> to stabilize the group would just be the s associated with the rebalance 
> window.
> This is a low priority optimization!





[jira] [Commented] (KAFKA-2397) leave group request

2015-09-04 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731364#comment-14731364
 ] 

Onur Karaman commented on KAFKA-2397:
-

Here's a summary of what I think each approach has to offer over the other.

Pros of LeaveGroupRequest:
1. simplicity. I think the logic is pretty straightforward, keeps the network 
and api layers separated, fits well with existing patterns, and doesn't require 
a complicated refactoring.
2. opens up the possibility for tooling that lets you kick out a consumer and 
trigger a rebalance. This might be a useful admin tool for when things go wrong.
3. opens up the possibility for rolling bounces without triggering a rebalance. 
We can modify the consumer to have a close and closeNow (close sends out a 
LeaveGroupRequest, closeNow doesn't). The application can persist the consumer 
id somewhere. The consumer can initially try out the persisted consumer id 
after it comes back up.

Pros of tcp disconnect:
1. rebalance gets triggered on process death. This would be a con if you want 
the possibility for rolling bounces without triggering a rebalance.
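The close / closeNow idea from point 3 of the LeaveGroupRequest list can be sketched with a toy coordinator. Everything here is hypothetical: ToyCoordinator, ToyConsumer, close_now and the id format are names made up for the illustration, and the rule that a known member id rejoins without a rebalance is an assumption of the model.

```python
class ToyCoordinator:
    """Toy model of group membership; not the broker's actual logic."""

    def __init__(self):
        self.members = set()
        self.rebalances = 0
        self._next_id = 0

    def join(self, member_id=None):
        # Assumption: a still-known member id rejoining causes no rebalance.
        if member_id in self.members:
            return member_id
        self._next_id += 1
        new_id = f"consumer-{self._next_id}"
        self.members.add(new_id)
        self.rebalances += 1
        return new_id

    def leave(self, member_id):
        # Models an explicit LeaveGroupRequest.
        if member_id in self.members:
            self.members.remove(member_id)
            self.rebalances += 1


class ToyConsumer:
    def __init__(self, coordinator, persisted_id=None):
        self.coordinator = coordinator
        self.member_id = coordinator.join(persisted_id)

    def close(self):
        self.coordinator.leave(self.member_id)

    def close_now(self):
        # Shuts down without notifying the coordinator.
        pass


coordinator = ToyCoordinator()
first = ToyConsumer(coordinator)
second = ToyConsumer(coordinator)
baseline = coordinator.rebalances

# Rolling bounce: persist the id, closeNow, come back with the same id.
saved_id = first.member_id
first.close_now()
first = ToyConsumer(coordinator, persisted_id=saved_id)
quiet_bounce = coordinator.rebalances - baseline

# Regular bounce: close sends the leave, restart joins as a new member.
second.close()
second = ToyConsumer(coordinator)
loud_bounce = coordinator.rebalances - baseline - quiet_bounce
```

In this model the closeNow bounce adds no rebalances, while the close bounce adds two (one for the leave, one for the rejoin). The model ignores the session timeout, so it only holds if the bounced consumer returns before its session expires.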

P.S: I'm going to be on vacation from tonight until Tuesday, so my 
responsiveness will be a bit spotty. I think Jiangjie may be in a similar 
situation.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-04 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731091#comment-14731091
 ] 

Jay Kreps commented on KAFKA-2397:
--

[~hachikuji] Makes sense.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-04 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731073#comment-14731073
 ] 

Jason Gustafson commented on KAFKA-2397:


[~jkreps] Not having a way to properly leave the group is pretty painful in 
testing, which I imagine users are going to be doing a lot of initially. I 
think it also makes rolling upgrades trickier if you don't have it since you 
have to allow additional time for the group to stabilize after each machine is 
upgraded. The ideal workflow to minimize rebalance overhead would probably be 
to shut down one instance, let the group stabilize, then restart it. If you just 
restart the instance, then the whole group will have to pause until the old 
member's session timeout has expired (although you can also get around this by 
persisting the consumer id).

Anyway, I'd rather have something if possible, but I agree that it could be 
pushed to another release if we think the TCP option is the way forward.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-04 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731052#comment-14731052
 ] 

Jay Kreps commented on KAFKA-2397:
--

Couple thoughts:
1. [~hachikuji] Does this need to be in the next release? This is really an 
optimization we can do at any time right? How bad of a user experience is it 
not to have it?
2. Does anyone have a concrete idea of where using TCP close to signal 
disconnect falls short? [~becket_qin] I think you are saying this is a problem 
but when is it actually a problem? This might be one where broader input could 
help...
3. We shouldn't end up with two different ways to do the same thing just 
because two ways have been proposed and we aren't sure yet which is best. This 
just means we aren't done thinking through the design. I think likely zero is 
better than two, right?



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-04 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14730633#comment-14730633
 ] 

Onur Karaman commented on KAFKA-2397:
-

My pull request had diverged from trunk, so I force pushed a rebase that just 
cleans up the conflicts.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-03 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729800#comment-14729800
 ] 

Jason Gustafson commented on KAFKA-2397:


[~becket_qin] I'm not sure anyone was suggesting replacing the session timeout 
if TCP disconnect was used to signal group departure. I think you need session 
timeout regardless of whether we have an explicit leave group request or we use 
the TCP disconnect. I am also a little concerned about #3, but I don't actually 
know of any cases where network issues will cause a disconnect. In general, my 
feeling is that the advantages of the TCP disconnect (in particular the ability 
to detect hard crashes more swiftly) are not worth the cost of exposing the 
lower level network layer in the consumer coordinator. At the moment, however, 
my main concern is more pragmatic: the window for a big change like that is 
starting to close.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-03 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729769#comment-14729769
 ] 

Jiangjie Qin commented on KAFKA-2397:
-

[~ewencp] [~hachikuji] Some thoughts on this. I agree with [~ewencp] that we 
should follow one protocol but not both. Personally I like explicit leave group 
request better.

The goals we want to achieve are:
1. When a consumer actually dies, we don't want to wait for too long before a 
rebalance.
2. When a consumer exits normally, we want to trigger a rebalance soon.
3. If there are some jitters or network issues, etc., we want to have some 
tolerance for that.

Using the TCP connection to signify liveness will satisfy 2. 
For 1, if the TCP connection timeout is super long it won't work. That's why we 
introduced the session timeout. 
For 3, using the TCP connection to signify liveness might cause problems.

An explicit leave group request makes it clear that a member will only be 
excluded from a group when it exits normally or its session times out. So all 
three goals are met.

An important related scenario worth thinking about is bouncing a consumer. 
Without a leave group request, it is possible to bounce a client without 
triggering a rebalance as long as the consumer shuts down and then comes back 
before the session timeout. If we send a leave group request explicitly, 
bouncing a consumer means there will be two rebalances (which I think is the 
correct behavior). So making rebalance cheap and fast is very important.
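The bounce scenario above can be written down as a small counting model. It is a sketch under stated assumptions: the function name is hypothetical, and the no-leave case assumes the consumer rejoins under its old member id.

```python
def rebalances_for_bounce(downtime_s: float,
                          session_timeout_s: float,
                          sends_leave_group: bool) -> int:
    """Count rebalances caused by stopping and restarting one consumer."""
    if sends_leave_group:
        # One rebalance when the member leaves, one when it rejoins.
        return 2
    if downtime_s < session_timeout_s:
        # The coordinator never notices the absence.
        return 0
    # One rebalance on session expiry, one when the consumer rejoins.
    return 2


for downtime in (5.0, 45.0):
    for leave in (False, True):
        count = rebalances_for_bounce(downtime, 30.0, leave)
        print(f"downtime={downtime}s leave_group={leave} rebalances={count}")
```

Only the quick bounce without a leave group request avoids rebalancing in this model, which is why the cost of a single rebalance matters once the explicit request is in place.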



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-02 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728410#comment-14728410
 ] 

Ewen Cheslack-Postava commented on KAFKA-2397:
--

[~hachikuji] My primary objection to that plan is that it might lead to us 
maintaining more complicated code if we support leave group via two mechanisms 
instead of one, and it's also more stuff the user has to understand.

On the other hand, I can see a case for supporting both: explicit leave group 
via a message is great for forcing the coordinator to trigger a rebalance ASAP, 
whereas an implicit leave group is a nice way to allow for fast reconnect in 
the case of a network hiccup without affecting membership/requiring another 
round but also allows the broker to boot the consumer from the group without 
waiting for the full session interval (and the client can also take this into 
account, stopping consumption after a heartbeat interval during which it cannot 
connect rather than waiting for a full session timeout). But since I'm not 
really clear on when we'd see such network hiccups that wouldn't be masked by 
TCP anyway, I'm not sure this is worth the more complicated model.

It does sound like it's probably complicated -- or at least a lot of code 
changes -- to make the lower level connection management and higher level 
protocol stuff coordinate. Since this issue actually slows things down for me 
on a daily basis now, I think the explicit leave group would make sense to get 
committed.



[jira] [Commented] (KAFKA-2397) leave group request

2015-09-02 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14727809#comment-14727809
 ] 

Jason Gustafson commented on KAFKA-2397:


Bumping this issue. One nice thing about the current patch is its simplicity 
(should be similar to the un-heartbeat approach). I wonder if it would be a 
bad thing to support explicit group departure with this patch and implicit 
departure with TCP disconnect? Then we could let this patch go through and 
consider the TCP disconnect in another JIRA.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-14 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696602#comment-14696602
 ] 

Ewen Cheslack-Postava commented on KAFKA-2397:
--

[~becket_qin] I was more worried about figuring out what behavior was 
preferable first, then figuring out how to make it work with our code. I 
realize we'd need to expose some events in the lower level up to the code 
layered on it, but I don't see anything wrong with doing that, it just requires 
tracking some more state and relaying events as you described. Kicking the 
member out based on TCP disconnection seemed to cover more cases, so unless 
there was a problem with it, I figured it's worth the effort to try to make it 
work that way.

Any system tests that forcibly kill Copycat workers are going to have the same 
issues I'm running into now. That isn't a huge problem since it's ok for some 
tests to take a long time, but it does have other impacts as well; for example, 
that means that a crashed process will hold on to any assignments it has for up 
to the full session timeout, in which case those assignments will not be 
processed (which, for Copycat, could potentially mean 30s worth of data lost if 
the source data is ephemeral, such as metrics).

[~hachikuji] I thought about proxies, but I couldn't come up with a scenario 
where the TCP connection to the coordinator would be closed due to a very short 
transient issue. But I definitely won't claim I know that it will never be the 
case or that I know all the weird things proxies might do under a variety of 
scenarios or configurations...

One problem with requiring an explicit leave group request/flag is that any 
crash still takes a lot of time to free up assigned partitions and keeps any 
members who are behaving properly from continuing to process their assigned 
work (since they discover the need for rebalance, invoke the rebalance revoked 
callback, and join group immediately). This means any process that crashes can 
gum up the works for all the other processes. And some people prefer the fail 
fast, crash and recover by restarting the process approach, so while they would 
obviously prefer crashes not happen, they also might expect to encounter this 
scenario semi-regularly and then find things grinding to a halt for 30s at a 
time.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-13 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696551#comment-14696551
 ] 

Onur Karaman commented on KAFKA-2397:
-

Sooo... ship it?!



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-13 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696351#comment-14696351
 ] 

Jason Gustafson commented on KAFKA-2397:


[~ewencp] The only case I can come up with is an application timeout on the 
client (e.g. if the heartbeat was delayed by a transient network issue), and 
that can be fixed by always ensuring that the timeout for coordinator requests 
is longer than the session timeout. My unease mostly has to do with 
proxy/tunnel situations where I don't know that TCP always behaves properly. 
Perhaps you know enough to know whether this is an issue? In any case, it seems 
like we all agree that we need a way to leave the group properly. My preference 
is probably for Onur's patch as it stands now.
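The timeout constraint mentioned above (the coordinator request timeout should be longer than the session timeout) could be checked with something like the following. The function and parameter names are hypothetical and the values are illustrative only.

```python
def validate_timeouts(request_timeout_ms: int, session_timeout_ms: int) -> None:
    """Reject settings where a coordinator request can time out on the client
    before the coordinator would have expired the session."""
    if request_timeout_ms <= session_timeout_ms:
        raise ValueError(
            f"request timeout ({request_timeout_ms} ms) must be longer than "
            f"session timeout ({session_timeout_ms} ms)"
        )


validate_timeouts(request_timeout_ms=40_000, session_timeout_ms=30_000)  # accepted

try:
    validate_timeouts(request_timeout_ms=10_000, session_timeout_ms=30_000)
    rejected = False
except ValueError:
    rejected = True
```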



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-13 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696334#comment-14696334
 ] 

Jiangjie Qin commented on KAFKA-2397:
-

[~ewencp] Are you saying that the coordinator should kick the consumer out of 
the group once its TCP connection is closed? I think the problem is that this 
breaks the layering we have on the broker side. TCP connections are maintained 
only by SocketServer and are not exposed to KafkaApiThreads, so SocketServer 
does not know which consumer a particular connection is associated with. 
To let the coordinator know about a TCP connection closure, the coordinator 
would need to keep a consumer-socket map, SocketServer would need to put an 
event on the request queue to signal the disconnection, and the coordinator 
would need to check whether that socket is associated with some consumer.
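
To make the extra bookkeeping concrete, here is a minimal sketch of the consumer-socket map and the disconnect event described above. It is not Kafka code: `Coordinator`, `handle_join`, `handle_disconnect`, the connection ids, and the `("disconnect", conn_id)` event shape are all hypothetical names chosen for illustration.

```python
from collections import deque


class Coordinator:
    """Hypothetical coordinator that tracks which connection each member uses."""

    def __init__(self):
        self.members = {}           # group -> set of consumer ids
        self.conn_to_consumer = {}  # connection id -> (group, consumer id)

    def handle_join(self, conn_id, group, consumer_id):
        self.members.setdefault(group, set()).add(consumer_id)
        self.conn_to_consumer[conn_id] = (group, consumer_id)

    def handle_disconnect(self, conn_id):
        """Return the group that must rebalance, or None if nothing changes."""
        entry = self.conn_to_consumer.pop(conn_id, None)
        if entry is None:
            return None  # the socket did not belong to any group member
        group, consumer_id = entry
        self.members[group].discard(consumer_id)
        return group


request_queue = deque()
coordinator = Coordinator()
coordinator.handle_join("conn-1", "group-a", "consumer-1")
coordinator.handle_join("conn-2", "group-a", "consumer-2")

# The socket layer notices closed connections and enqueues events for them.
request_queue.append(("disconnect", "conn-1"))  # a group member
request_queue.append(("disconnect", "conn-9"))  # some unrelated client

to_rebalance = []
while request_queue:
    kind, conn_id = request_queue.popleft()
    if kind == "disconnect":
        group = coordinator.handle_disconnect(conn_id)
        if group is not None:
            to_rebalance.append(group)
```

Only the first event leads to a rebalance; the second is dropped because no consumer is registered for that connection, which is the check the coordinator would have to make on every disconnect.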



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-13 Thread Ewen Cheslack-Postava (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696327#comment-14696327
 ] 

Ewen Cheslack-Postava commented on KAFKA-2397:
--

Can we be more concrete about what we think the odd side effects would be if it 
were tied to the TCP session? What would happen that would cause the TCP 
connection to actually close rather than waiting around for a long time for the 
normal TCP timeout? I'm struggling to come up with a scenario that would 
actually kill the connection but in which I wouldn't want to kick the member 
out of the group.

I'm running into annoying issues since we don't have any mechanism to leave the 
group currently. Initially it was just manifesting in my manual testing with 
Copycat where if I restarted a sink task, it would take a while for it to start 
processing data because it had to wait for the process that I had just killed 
to be kicked out of the group. This is a bit annoying with the default session 
timeout of 30s, but workable. However, I'm also working on system tests and now 
it's making me set quite large timeouts for some steps (which otherwise should 
be more like < 1s to complete) and therefore makes the tests run a lot slower.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-06 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660279#comment-14660279
 ] 

Jay Kreps commented on KAFKA-2397:
--

[~guozhang] Basically "session" would be the concept that ties a connection to 
the business logic layer. The session would be exposed with the request. I 
haven't thought through how this would work, but maybe in handleProduce you 
could do
   session.addOnClose("clear-produce-purgatory", () => removeRequestFromPurgatory(id))
and in the purgatory when the request is completed you'd do
   session.removeOnClose("clear-produce-purgatory")

A similar mechanism would work based on the join-group request to rebalance the 
group on connection close.
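
As a rough illustration of the idea, here is a minimal sketch of a session that holds named on-close actions. It is written in Python rather than against the broker code, and `Session`, `add_on_close`, `remove_on_close`, and the toy `purgatory` dict are hypothetical stand-ins, not existing Kafka APIs.

```python
class Session:
    """Hypothetical per-connection session holding named on-close actions."""

    def __init__(self):
        self._on_close = {}

    def add_on_close(self, name, action):
        self._on_close[name] = action

    def remove_on_close(self, name):
        # No-op if the action was already removed or never registered.
        self._on_close.pop(name, None)

    def close(self):
        # Run each remaining action once, then forget all of them.
        actions, self._on_close = self._on_close, {}
        for action in actions.values():
            action()


# Toy purgatory: request id -> pending request.
purgatory = {42: "pending produce request"}

session = Session()
session.add_on_close("clear-produce-purgatory", lambda: purgatory.pop(42, None))

# The connection drops before the request completes, so the action runs.
session.close()
```

If the request had completed first, the purgatory would call `remove_on_close("clear-produce-purgatory")` and closing the session would then do nothing for that request.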

Like I said, I didn't think this through and don't really advocate it. Like 
Jason I think there could be odd side effects.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-06 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14660269#comment-14660269
 ] 

Jason Gustafson commented on KAFKA-2397:


[~jkreps] Yeah, TCP is pretty resilient to network weirdness. I was mostly 
thinking of client-side timeouts, which may end up exposed in configuration. The 
only thing the client can do if a request times out is disconnect and try 
again. Perhaps we'd want to keep any timeouts with the coordinator out of 
configuration if we tried this approach. I was also wondering if there were 
some tunneling situations which could make the connection unstable.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-05 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659525#comment-14659525
 ] 

Guozhang Wang commented on KAFKA-2397:
--

[~jkreps] I need to look into the session implementation in the socket server, 
but just to be more concrete: are you suggesting adding the session-id to all 
handleXXXRequest methods in KafkaApis, and adding another handleSessionTimeout 
to KafkaApis as well?



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-05 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659470#comment-14659470
 ] 

Jay Kreps commented on KAFKA-2397:
--

[~guozhang] I haven't thought this through but here is the basic idea. For the 
purgatory case I think when you added something to the purgatory you would also 
add a "shutdown action" to the session that would delete the item on session 
termination. The session concept is what ties the KafkaApi layer to the network 
layer so this could be added in handleProduce().

[~hachikuji] Makes sense. Theoretically TCP should handle it, but yeah, anything 
that broke the TCP connections would trigger a rebalance storm.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-05 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659275#comment-14659275
 ] 

Jason Gustafson commented on KAFKA-2397:


[~jkreps] I think the disconnect approach could be interesting if it was 
tractable in the code, but I'm a little concerned that it would lead to 
spurious rebalancing due to ephemeral network events. This might not be a big 
deal when the consumers are in the same data center as the Kafka cluster, but 
it could be a bigger problem if they have to cross the Internet. I wonder if 
you could even get into some bad situations where network instability leads to 
constant rebalancing as consumers leave and immediately join repeatedly.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-05 Thread Gwen Shapira (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659185#comment-14659185
 ] 

Gwen Shapira commented on KAFKA-2397:
-

No, SocketServer may be aware of the host:port of the client, but not the clientID.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-05 Thread Guozhang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659158#comment-14659158
 ] 

Guozhang Wang commented on KAFKA-2397:
--

I once considered letting KafkaApis handle connection closure when I was 
working on the purgatory redesign, to purge the requests as mentioned by Jay. The 
difficulty is that at the socket server the connection is not logically tied 
to a client (although in fact it is), while to handle request / client 
failure events we need to pass a client-id into the API layer. A lot has 
changed in SocketServer since then, so I do not know whether we can now infer 
the client-id (or more specifically the consumer-id in this case) from 
SocketServer.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-03 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652686#comment-14652686
 ] 

Jay Kreps commented on KAFKA-2397:
--

Nice summary [~onurkaraman].

I agree that adding a field to heartbeat is functionally equivalent to a 
leave_group request/resp. The reason for preferring that was just to reduce the 
conceptual weight of the protocol.

A second idea that I'm not sure is good: rather than having either a new 
request or a heartbeat it would be possible to use the TCP connection closure 
for this. The advantage would be ANY process death that didn't also kill the OS 
would then be detectable without any client participation needed. The downside 
is that (1) the server change would be slightly more involved, and (2) you 
wouldn't be able to close the connection for other reasons.

The complexity of implementation is that currently only the network layer knows 
about socket closes. However, we were already introducing a session concept for 
the security work, which allows the KafkaApi layer to have access to 
cross-request state such as the authenticated identity. We could make it 
possible to add shutdown actions to the session that would trigger this; or 
alternatively we could add a way to add onSocketClose actions directly to the 
network layer.

This same feature would actually be useful for the purgatory. Currently when a 
connection is closed, I don't think that requests in purgatory are removed. If 
the purgatory timeout is very small this is okay, but it is very common for 
people to ask for NO timeout, in which case each connection close potentially 
leaks memory. I think we kind of "fixed" this by just overriding the max wait 
time, but purging purgatory on shutdown is obviously preferable.









[jira] [Commented] (KAFKA-2397) leave group request

2015-08-03 Thread Onur Karaman (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652386#comment-14652386
 ] 

Onur Karaman commented on KAFKA-2397:
-

Hey everyone.

There's a difference between the best, expected, and worst case rebalance time.

Trunk
-
A consumer leaves at t = 0 and the coordinator detects the failure at t = s. 
The rebalance window can close as soon as all the existing consumers rejoin and 
as late as the maximum member session timeout.

The time to stabilize since the consumer failure is something like:
{code}
t = s + rebalance_timeout
{code}
Best case: The coordinator receives all of the remaining consumers' heartbeats 
immediately after t = s. All of the remaining consumers rejoin immediately 
after receiving the heartbeat response. So everything is done by *t ~= s*.

Expected case: The coordinator receives all of the remaining heartbeats at t = 
4s/3 because consumers will typically figure out the rebalance after s/3 (an 
oversimplification. Consumers of a group actually have staggered heartbeat 
intervals). All of the remaining consumers eventually rejoin 
(coordinator_join_group_request_receival_delay). So everything is done by *t ~= 
s + (s/3 + coordinator_join_group_request_receival_delay)*.

Worst case: All of the consumers in the group somehow fail to get notified of 
the rebalance until the very last possible moment and rejoin the group just before 
the rebalance window ends: *t = s + s*.

LeaveGroupRequest
-
A consumer leaves at t = 0 and sends out the LeaveGroupRequest. The rebalance 
window can close as soon as all the existing consumers rejoin and as late as 
the maximum member session timeout.

The LeaveGroupRequest would cut down the time to stabilize since the consumer 
failure to something like:
{code}
t = coordinator_leave_group_request_receival_delay + rebalance_timeout
{code}
Best case: A consumer leaves at t = 0, sends out the LeaveGroupRequest, and the 
coordinator immediately receives the LeaveGroupRequest. The coordinator 
receives all of the remaining consumers' heartbeats immediately after t = 0. 
All of the remaining consumers rejoin immediately after receiving the heartbeat 
response. So everything is done by *t ~= 0*.

Expected case: A consumer leaves at t = 0, sends out the LeaveGroupRequest, and 
the coordinator receives the LeaveGroupRequest at t = 
coordinator_leave_group_request_receival_delay. All of the remaining consumers 
eventually rejoin (coordinator_join_group_request_receival_delay). So 
everything is done by *t ~= coordinator_leave_group_request_receival_delay + 
(s/3 + coordinator_join_group_request_receival_delay)*. I'm assuming 
coordinator_leave_group_request_receival_delay << s.

Worst case: A consumer leaves at t = 0, sends out the LeaveGroupRequest, and 
the coordinator receives the LeaveGroupRequest at t = 
coordinator_leave_group_request_receival_delay. All of the consumers in the 
group somehow fail to get notified of the rebalance until the very last possible 
moment and rejoin the group just before the rebalance window ends: *t = 
coordinator_leave_group_request_receival_delay + s*. I'm assuming 
coordinator_leave_group_request_receival_delay << s.

Absolute worst case: The LeaveGroupRequest somehow got dropped before reaching 
the coordinator. The heartbeat would time out on the coordinator anyway and hit 
the existing *t = s + s* behavior.

Summary
-
So I guess the absolute worst case behavior hasn't changed if the 
LeaveGroupRequest was somehow dropped, but everything else should get better by 
about s.
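
The cases above can be collected into a small model. This is a minimal sketch in Python, with times in seconds. The function name `stabilize_times` and the parameters `leave_delay` and `join_delay` are illustrative only; they stand for coordinator_leave_group_request_receival_delay and coordinator_join_group_request_receival_delay, and the expected case keeps the same s/3 simplification.

```python
def stabilize_times(s, leave_delay=None, join_delay=0.0):
    """Return (best, expected, worst) time to stabilize after a consumer leaves.

    s           -- session timeout shared by every consumer in the group
    leave_delay -- delay before the coordinator receives the LeaveGroupRequest,
                   or None for trunk, where the failure is detected at t = s
    join_delay  -- delay before the coordinator receives the join requests
    """
    # Time until the coordinator knows that the consumer is gone.
    detect = s if leave_delay is None else leave_delay

    best = detect                            # everyone rejoins immediately
    expected = detect + s / 3 + join_delay   # members notice after about s/3
    worst = detect + s                       # the whole rebalance window is used
    return best, expected, worst


# Example with the 30s default session timeout mentioned in this thread.
trunk = stabilize_times(30.0, join_delay=0.1)
with_leave = stabilize_times(30.0, leave_delay=0.05, join_delay=0.1)
```

With these inputs every case of `with_leave` is smaller than the matching case of `trunk` by `s - leave_delay`, which is the "about s" improvement. A dropped LeaveGroupRequest is the same as calling the function with `leave_delay=None`.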

P.S: To avoid confusion, it's probably best to state whether you're talking 
about the behavior in trunk or the proposed behavior with LeaveGroupRequest.

I prefer having a separate LeaveGroupRequest, but that's less of the focus here.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-03 Thread Jiangjie Qin (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652231#comment-14652231
 ] 

Jiangjie Qin commented on KAFKA-2397:
-

I would prefer extending the heartbeat to indicate leaving the group. There will 
always be a delay of up to 1/3 of the session timeout before the rebalance is 
triggered on all the consumers in the group, given that the broker always 
triggers the rebalance via the heartbeat response. That is probably fine.
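
A minimal sketch of the heartbeat-with-flag variant, to show where that delay comes from. Everything here is hypothetical: `Group`, the `leaving` flag, and the response strings are illustrative and not the actual wire protocol, and the heartbeat interval is assumed to be one third of the session timeout.

```python
class Group:
    """Hypothetical coordinator-side state for one consumer group."""

    def __init__(self, members):
        self.members = set(members)
        self.rebalancing = False

    def handle_heartbeat(self, consumer_id, leaving=False):
        """Return the response the member would see for this heartbeat."""
        if consumer_id not in self.members:
            return "unknown"
        if leaving:
            # The member declares its departure instead of timing out.
            self.members.discard(consumer_id)
            self.rebalancing = True
            return "left"
        # Remaining members only learn about the rebalance on their next
        # heartbeat response.
        return "rejoin" if self.rebalancing else "ok"


def worst_case_notice_delay(session_timeout):
    # Assumes a heartbeat interval of one third of the session timeout.
    return session_timeout / 3


group = Group(["consumer-1", "consumer-2"])
before = group.handle_heartbeat("consumer-2")               # normal heartbeat
left = group.handle_heartbeat("consumer-1", leaving=True)   # flagged heartbeat
after = group.handle_heartbeat("consumer-2")                # told to rejoin
```

Here `consumer-2` finds out about the rebalance only when its next heartbeat response arrives, so with a 30s session timeout it may wait up to `worst_case_notice_delay(30.0)` seconds after the leave is processed.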



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-03 Thread Jason Gustafson (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652198#comment-14652198
 ] 

Jason Gustafson commented on KAFKA-2397:


[~onurkaraman] I like this idea. Wouldn't the expected rebalance time actually 
be just the heartbeat interval since that's how long it would take the other 
group members to see the need to rebalance and send the new join group request? 
I think [~jkreps] was also suggesting to implement this as an "un-heartbeat" 
(i.e. with a flag on the heartbeat request), but I'm not sure if there was a 
strong reason to prefer that over the explicit request.



[jira] [Commented] (KAFKA-2397) leave group request

2015-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14650693#comment-14650693
 ] 

ASF GitHub Bot commented on KAFKA-2397:
---

GitHub user onurkaraman opened a pull request:

https://github.com/apache/kafka/pull/103

KAFKA-2397: leave group request



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/onurkaraman/kafka leave-group

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/103.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #103


commit 24d7c931f17f34211e3cac69a678ae0d3980396a
Author: Onur Karaman 
Date:   2015-07-31T08:52:44Z

leave group request



