RE: closing session on socket close vs waiting for timeout

2010-09-10 Thread Fournier, Camille F. [Tech]
Ben, could you explain a bit more why you think this won't work? I'm trying to 
decide if I should put in the work to take the POC I wrote and complete it, but 
I don't really want to waste my time if there's a fundamental reason it's a bad 
idea.

Thanks,
Camille

-Original Message-
From: Benjamin Reed [mailto:br...@yahoo-inc.com] 
Sent: Wednesday, September 08, 2010 4:03 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

unfortunately, that only works on the standalone server.

ben


Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Benjamin Reed
 the problem is that followers don't track session timeouts. they track 
when they last heard from the sessions that are connected to them and 
they periodically propagate this information to the leader. the leader 
is the one that expires the session. your technique only works when the 
client is connected to the leader.
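
The split described above can be sketched roughly as follows. This is a simplified model and not ZooKeeper's actual code: the class names FollowerTouchTable and LeaderSessionTracker and all of their methods are hypothetical, and the clock is passed in explicitly so the example stays deterministic. The follower only records which sessions it has heard from; the leader alone turns that information into deadlines and expirations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical follower-side table: remembers sessions heard from since the last ping. */
final class FollowerTouchTable {
    // sessionId -> negotiated session timeout in milliseconds
    private final Map<Long, Integer> touched = new HashMap<>();

    void touch(long sessionId, int timeoutMs) {
        touched.put(sessionId, timeoutMs);
    }

    /** Returns the touches collected so far and resets the table, as if sent with a ping. */
    Map<Long, Integer> snapshotAndClear() {
        Map<Long, Integer> copy = new HashMap<>(touched);
        touched.clear();
        return copy;
    }
}

/** Hypothetical leader-side tracker: the only place where sessions are expired. */
final class LeaderSessionTracker {
    // sessionId -> absolute deadline in milliseconds
    private final Map<Long, Long> expiresAt = new HashMap<>();

    /** Extends deadlines for every session a follower reported as alive. */
    void applyTouches(Map<Long, Integer> touches, long nowMs) {
        for (Map.Entry<Long, Integer> e : touches.entrySet()) {
            expiresAt.put(e.getKey(), nowMs + e.getValue());
        }
    }

    /** Removes and returns every session whose deadline has passed. */
    List<Long> expire(long nowMs) {
        List<Long> expired = new ArrayList<>();
        expiresAt.entrySet().removeIf(e -> {
            if (e.getValue() <= nowMs) {
                expired.add(e.getKey());
                return true;
            }
            return false;
        });
        return expired;
    }
}
```

In this model a follower that sees a socket close has no deadline of its own to shorten; whatever it wants to happen has to reach the leader, either in the touch table or as an explicit request.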


one thing you can do is generate a close request for the socket and push 
that through the system. that will cause it to get propagated through 
the followers and processed at the leader. it would also allow you to 
get your functionality without touching the processing pipeline.


the thing that worries me about this functionality in general is that 
network anomalies can cause a whole raft of sessions to get expired in 
this way. for example, you have 3 servers with load spread well; there 
is a networking glitch that causes clients to abandon a server; suddenly
1/3 of your clients will get expired sessions.
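
As a worked illustration of the example above, the sketch below compares the two policies for the clients of one server whose connections all drop at once. It is a toy model under stated assumptions: the class GlitchImpact, its methods, and the reconnect delays are hypothetical, every client is assumed to be alive, and each client is assumed to reconnect to another server after the given delay.

```java
/** Hypothetical toy model of a glitch that disconnects every client of one server. */
final class GlitchImpact {
    /** Expire-on-socket-close: every dropped connection costs a session. */
    static int expiredOnSocketClose(long[] reconnectDelaysMs) {
        return reconnectDelaysMs.length;
    }

    /** Timeout-based expiry: only clients slower than the session timeout lose their session. */
    static int expiredOnTimeout(long[] reconnectDelaysMs, long sessionTimeoutMs) {
        int expired = 0;
        for (long delayMs : reconnectDelaysMs) {
            if (delayMs > sessionTimeoutMs) {
                expired++;
            }
        }
        return expired;
    }
}
```

With 300 clients spread evenly over 3 servers, the first policy expires all 100 sessions of the affected server, while the second expires only those clients that fail to reconnect in time.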


ben

On 09/10/2010 12:17 PM, Fournier, Camille F. [Tech] wrote:

Ben, could you explain a bit more why you think this won't work? I'm trying to 
decide if I should put in the work to take the POC I wrote and complete it, but 
I don't really want to waste my time if there's a fundamental reason it's a bad 
idea.

Thanks,
Camille


Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Benjamin Reed
 ah dang, i should have said generate a close request for the session 
and push that through the system.


ben

On 09/10/2010 01:01 PM, Benjamin Reed wrote:

   the problem is that followers don't track session timeouts. they track
when they last heard from the sessions that are connected to them and
they periodically propagate this information to the leader. the leader
is the one that expires the session. your technique only works when the
client is connected to the leader.

one thing you can do is generate a close request for the socket and push
that through the system. that will cause it to get propagated through
the followers and processed at the leader. it would also allow you to
get your functionality without touching the processing pipeline.

the thing that worries me about this functionality in general is that
network anomalies can cause a whole raft of sessions to get expired in
this way. for example, you have 3 servers with load spread well; there
is a networking glitch that causes clients to abandon a server; suddenly
1/3 of your clients will get expired sessions.

ben


RE: closing session on socket close vs waiting for timeout

2010-09-10 Thread Fournier, Camille F. [Tech]
Well, this did work when I tried it on clients connected to follower nodes in 
the cluster. The followers still propagate back their touch table to the leader 
when they ping, correct? And then the leader does the same logic as when it 
touches a session? As long as the SessionTrackerImpl lines 173-176 are 
commented out (in the 3.3 branch), you can touch with a lowered expire time, 
and so long as that session does not re-connect during that shortened interval, 
the leader will properly time out the session. Again, I haven't thought through 
this logic extremely closely so it is possible I am missing some implications 
(like, what happens when two clients are connected to the same session on 
different servers, but then, what in the world does that even mean?), but it 
does work in the basic case so far as my testing has shown. I'll try to get the 
go-ahead to submit a ticket with the code so you can look at it yourself. 
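
The effect of removing that guard can be sketched as follows. This is not the SessionTrackerImpl code from the 3.3 branch: the class TouchSketch, its allowShortening flag, and the explicit clock argument are hypothetical, and the real tracker's bucketing of expiry times is left out. The sketch only shows the behavior being discussed: with the guard in place a touch never moves a deadline earlier, and without it a touch with a smaller timeout shortens the session's remaining lifetime.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of a session tracker whose touch may or may not shorten a deadline. */
final class TouchSketch {
    private final Map<Long, Long> expireTime = new HashMap<>();
    private final boolean allowShortening;

    TouchSketch(boolean allowShortening) {
        this.allowShortening = allowShortening;
    }

    /** Touches a session and returns the deadline in effect afterwards. */
    long touch(long sessionId, long nowMs, long timeoutMs) {
        long proposed = nowMs + timeoutMs;
        Long current = expireTime.get(sessionId);
        if (current != null && current >= proposed && !allowShortening) {
            // Guard in place: the existing, later deadline wins.
            return current;
        }
        // Guard removed, or the new deadline is later: take the proposed deadline.
        expireTime.put(sessionId, proposed);
        return proposed;
    }
}
```

A server that sees a socket error could then touch the session once with the minimum timeout before closing the connection; if the client does not reconnect within that shorter window, the session expires early.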

I agree that if a network glitch causes a bunch of live clients to appear to 
close their sockets on the server they are connected to, you could have 
problems. This could be somewhat mitigated by setting the minSessionTimeout to 
a value that allows for some slop, but not the amount of slop you want to allow 
for something like a full gc on a client. But what I'm not sure about is the 
glitch that actually causes this sort of observed behavior. I am not a 
networking expert, but in my experience I've seen network glitches that cause 
sockets to appear to be live that are actually dead, but not vice-versa. Can 
you tell me what would cause a socket closure with otherwise alive client and 
server? 

Thanks for the feedback,
Camille

-Original Message-
From: Benjamin Reed [mailto:br...@yahoo-inc.com] 
Sent: Friday, September 10, 2010 4:02 PM
To: Fournier, Camille F. [Tech]
Cc: 'zookeeper-user@hadoop.apache.org'
Subject: Re: closing session on socket close vs waiting for timeout

  the problem is that followers don't track session timeouts. they track 
when they last heard from the sessions that are connected to them and 
they periodically propagate this information to the leader. the leader 
is the one that expires the session. your technique only works when the 
client is connected to the leader.

one thing you can do is generate a close request for the socket and push 
that through the system. that will cause it to get propagated through 
the followers and processed at the leader. it would also allow you to 
get your functionality without touching the processing pipeline.

the thing that worries me about this functionality in general is that 
network anomalies can cause a whole raft of sessions to get expired in 
this way. for example, you have 3 servers with load spread well; there 
is a networking glitch that causes clients to abandon a server; suddenly
1/3 of your clients will get expired sessions.

ben


Re: closing session on socket close vs waiting for timeout

2010-09-10 Thread Ted Dunning
A switch failure could do that, I think.

On Fri, Sep 10, 2010 at 1:49 PM, Fournier, Camille F. [Tech] 
camille.fourn...@gs.com wrote:

 I am not a networking expert, but in my experience I've seen network
 glitches that cause sockets to appear to be live that are actually dead, but
 not vice-versa. Can you tell me what would cause a socket closure with
 otherwise alive client and server?


RE: closing session on socket close vs waiting for timeout

2010-09-08 Thread Fournier, Camille F. [Tech]
This would be the ideal solution to this problem I think. 
Poking around the (3.3) code to figure out how hard it would be to implement, I 
figure one way to do it would be to modify the session timeout to the min 
session timeout and touch the connection before calling close when you get 
certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in 
touch session that returns if the tickTime is greater than the expire time) and 
it worked (in the standalone server anyway). Interesting solution, or total 
hack that will not work beyond most basic test case?

C

(forgive lack of actual code in this email)

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Tuesday, September 07, 2010 1:11 PM
To: zookeeper-user@hadoop.apache.org
Cc: Benjamin Reed
Subject: Re: closing session on socket close vs waiting for timeout

This really is, just as Ben says, a problem of false positives and false
negatives in detecting session expiration.

On the other hand, the current algorithm isn't really using all the
information available.  The current algorithm is using time since last
client-initiated heartbeat.  The new proposal is somewhat worse in that it
proposes to use just the boolean has-TCP-disconnect-happened.

Perhaps it would be better to use multiple features in order to decrease
both false positives and false negatives.

For instance, I could imagine that we use the following features:

- time since last client heartbeat or disconnect or reconnect

- what was the last event? (a heartbeat or a disconnect or a reconnect)

Then the expiration algorithm could use a relatively long time since last
heartbeat and a relatively short time since last disconnect to mark a
session as disconnected.

Wouldn't this avoid expiration during GC and cluster partition and cause
expiration quickly after a client disconnect?
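
The two-feature rule proposed above can be sketched as a single decision function. This is only an illustration of the idea, not an existing ZooKeeper API: the class TwoFeatureExpiry, the LastEvent enum, and both timeout parameters are hypothetical, and treating a reconnect like a heartbeat is an assumption the proposal does not spell out.

```java
/** Hypothetical expiry rule using both the last event type and the time since it. */
final class TwoFeatureExpiry {
    enum LastEvent { HEARTBEAT, DISCONNECT, RECONNECT }

    /**
     * A session whose last event was a disconnect is given only the short window;
     * any other last event is given the long, heartbeat-based window.
     */
    static boolean isExpired(LastEvent lastEvent,
                             long msSinceLastEvent,
                             long heartbeatTimeoutMs,
                             long disconnectTimeoutMs) {
        long limitMs = (lastEvent == LastEvent.DISCONNECT)
                ? disconnectTimeoutMs
                : heartbeatTimeoutMs;
        return msSinceLastEvent > limitMs;
    }
}
```

With a 30 second heartbeat window and a 2 second disconnect window, a client paused for 3 seconds by GC (last event: heartbeat) keeps its session, while a client whose socket closed 3 seconds ago and has not reconnected is expired.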


On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote:

 That's a good point, however with suitable documentation, warnings and such
 it seems like a reasonable feature to provide for those users who require
 it. Used in moderation it seems fine to me. Perhaps we also make it
 configurable at the server level for those administrators/ops who don't
 want to deal with it (disable the feature entirely, or only enable on
 particular servers, etc...).

 Patrick

 On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:

  if this mechanism were used very often, we would get a huge number of
  session expirations when a server fails. you are trading fast error
  detection for the ability to tolerate temporary network and server
  outages.

  to be honest this seems like something that in theory sounds like it will
  work in practice, but once deployed we start getting session expirations
  for cases that we really do not want or expect.
 
  ben
 
 
  On 09/01/2010 12:47 PM, Patrick Hunt wrote:
 
  Ben, in this case the session would be tied directly to the connection,
  we'd explicitly deny session re-establishment for this session type (so
  4 would fail). Would that address your concern, others?
 
  Patrick
 
  On 09/01/2010 10:03 AM, Benjamin Reed wrote:
 
 
  i'm a bit skeptical that this is going to work out properly. a server
  may receive a socket reset even though the client is still alive:
 
  1) client sends a request to a server
  2) client is partitioned from the server
  3) server starts trying to send response
  4) client reconnects to a different server
  5) partition heals
  6) server gets a reset from client
 
  at step 6 i don't think you want to delete the ephemeral nodes.
 
  ben
 
  On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
 
 
  Yes that's right. Which network issues can cause the socket to close
  without the initiating process closing the socket? In my limited
  experience in this area network issues were more prone to leave dead
  sockets open rather than vice versa so I don't know what to look out
  for.
 
  Thanks,
  Camille
 
  -Original Message-
  From: Dave Wright [mailto:wrig...@gmail.com]
  Sent: Tuesday, August 31, 2010 1:14 PM
  To: zookeeper-user@hadoop.apache.org
  Subject: Re: closing session on socket close vs waiting for timeout
 
  I think he's saying that if the socket closes because of a crash (i.e.
  not a normal zookeeper close request) then the session stays alive until
  the session timeout, which is of course true since ZK allows reconnection
  and resumption of the session in case of disconnect due to network issues.
 
  -Dave Wright
 
  On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 
 
  That doesn't sound right to me.
 
  Is there a Zookeeper expert in the house?
 
  On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
  camille.fourn...@gs.com  wrote:
 
 
 
  I foolishly did not investigate the ZK code closely enough and it seems
  that closing the socket still waits for the session timeout to remove
  the session.
 
 
 
 
 
 



Re: closing session on socket close vs waiting for timeout

2010-09-08 Thread Benjamin Reed

unfortunately, that only works on the standalone server.

ben

On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:

This would be the ideal solution to this problem I think.
Poking around the (3.3) code to figure out how hard it would be to implement, I 
figure one way to do it would be to modify the session timeout to the min 
session timeout and touch the connection before calling close when you get 
certain exceptions in NIOServerCnxn.doIO. I did this (removing the code in 
touch session that returns if the tickTime is greater than the expire time) and 
it worked (in the standalone server anyway). Interesting solution, or total 
hack that will not work beyond most basic test case?

C

(forgive lack of actual code in this email)


Re: closing session on socket close vs waiting for timeout

2010-09-08 Thread Ted Dunning
To get it to work in a cluster, what would be necessary?

A new message to the leader to describe connection loss?

On Wed, Sep 8, 2010 at 1:03 PM, Benjamin Reed br...@yahoo-inc.com wrote:

 unfortunately, that only works on the standalone server.

 ben

 On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:

 This would be the ideal solution to this problem I think.
 Poking around the (3.3) code to figure out how hard it would be to
 implement, I figure one way to do it would be to modify the session timeout
 to the min session timeout and touch the connection before calling close
 when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing
 the code in touch session that returns if the tickTime is greater than the
 expire time) and it worked (in the standalone server anyway). Interesting
 solution, or total hack that will not work beyond most basic test case?

 C

 (forgive lack of actual code in this email)

 -Original Message-
 From: Ted Dunning [mailto:ted.dunn...@gmail.com]
 Sent: Tuesday, September 07, 2010 1:11 PM
 To: zookeeper-user@hadoop.apache.org
 Cc: Benjamin Reed
 Subject: Re: closing session on socket close vs waiting for timeout

 This really is, just as Ben says, a problem of false positives and false
 negatives in detecting session expiration.

 On the other hand, the current algorithm isn't really using all the
 information available.  The current algorithm is using time since last
 client-initiated heartbeat.  The new proposal is somewhat worse in that it
 proposes to use just the boolean has-TCP-disconnect-happened.

 Perhaps it would be better to use multiple features in order to decrease
 both false positives and false negatives.

 For instance, I could imagine that we use the following features:

 - time since last client heartbeat or disconnect or reconnect

 - what was the last event? (a heartbeat or a disconnect or a reconnect)

 Then the expiration algorithm could use a relatively long time since last
 heartbeat and a relatively short time since last disconnect to mark a
 session as disconnected.

 Wouldn't this avoid expiration during GC and cluster partition and cause
 expiration quickly after a client disconnect?
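
The two-feature rule proposed above could be sketched roughly as follows. This is only an illustration in Python: the names and the 30 second / 2 second thresholds are assumptions made for the example, not values from ZooKeeper.

```python
from dataclasses import dataclass
from enum import Enum


class Event(Enum):
    HEARTBEAT = "heartbeat"
    DISCONNECT = "disconnect"
    RECONNECT = "reconnect"


@dataclass
class SessionState:
    last_event: Event       # what was the last event?
    last_event_time: float  # when it happened, in seconds


def is_expired(state, now, heartbeat_timeout=30.0, disconnect_timeout=2.0):
    """Hypothetical policy: short fuse after a disconnect, long fuse otherwise."""
    idle = now - state.last_event_time
    if state.last_event is Event.DISCONNECT:
        return idle > disconnect_timeout
    # A heartbeat or reconnect means the client was recently seen alive.
    return idle > heartbeat_timeout
```

A client stuck in a 10 second GC pause after a heartbeat is kept, while a client whose socket dropped 10 seconds ago and never came back is expired.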


 On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote:

 That's a good point, however with suitable documentation, warnings and such
 it seems like a reasonable feature to provide for those users who require
 it. Used in moderation it seems fine to me. Perhaps we also make it
 configurable at the server level for those administrators/ops who don't want
 to deal with it (disable the feature entirely, or only enable on particular
 servers, etc...).

 Patrick

 On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:

 if this mechanism were used very often, we would get a huge number of
 session expirations when a server fails. you are trading fast error
 detection for the ability to tolerate temporary network and server outages.

 to be honest this seems like something that in theory sounds like it will
 work in practice, but once deployed we start getting session expirations for
 cases that we really do not want or expect.

 ben


 On 09/01/2010 12:47 PM, Patrick Hunt wrote:



 Ben, in this case the session would be tied directly to the connection,
 we'd explicitly deny session re-establishment for this session type (so
 4 would fail). Would that address your concern, others?

 Patrick

 On 09/01/2010 10:03 AM, Benjamin Reed wrote:




 i'm a bit skeptical that this is going to work out properly. a server
 may receive a socket reset even though the client is still alive:

 1) client sends a request to a server
 2) client is partitioned from the server
 3) server starts trying to send response
 4) client reconnects to a different server
 5) partition heals
 6) server gets a reset from client

 at step 6 i don't think you want to delete the ephemeral nodes.

 ben

 On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:




 Yes that's right. Which network issues can cause the socket to close
 without the initiating process closing the socket? In my limited
 experience in this area network issues were more prone to leave dead
 sockets open rather than vice versa so I don't know what to look out for.

 Thanks,
 Camille

 -Original Message-
 From: Dave Wright [mailto:wrig...@gmail.com]
 Sent: Tuesday, August 31, 2010 1:14 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: closing session on socket close vs waiting for timeout

 I think he's saying that if the socket closes because of a crash (i.e. not a
 normal zookeeper close request) then the session stays alive until the
 session timeout, which is of course true since ZK allows reconnection and
 resumption of the session in case of disconnect due to network issues.

 -Dave Wright

 On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 That doesn't sound right to me.

 Is there a Zookeeper

RE: closing session on socket close vs waiting for timeout

2010-09-08 Thread Fournier, Camille F. [Tech]
Yes, Ben, would you give some more details as to why it doesn't work in a 
cluster? I think I am seeing it work ok in cluster mode as well with some basic 
tests. There are probably other major problems with this but I would appreciate 
any direction you could give as to what might go wrong here.

Thanks,
C

-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com] 
Sent: Wednesday, September 08, 2010 4:51 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

To get it to work in a cluster, what would be necessary?

A new message to the leader to describe connection loss?

On Wed, Sep 8, 2010 at 1:03 PM, Benjamin Reed br...@yahoo-inc.com wrote:

 unfortunately, that only works on the standalone server.

 ben

 On 09/08/2010 12:52 PM, Fournier, Camille F. [Tech] wrote:

 This would be the ideal solution to this problem I think.
 Poking around the (3.3) code to figure out how hard it would be to
 implement, I figure one way to do it would be to modify the session timeout
 to the min session timeout and touch the connection before calling close
 when you get certain exceptions in NIOServerCnxn.doIO. I did this (removing
 the code in touch session that returns if the tickTime is greater than the
 expire time) and it worked (in the standalone server anyway). Interesting
 solution, or total hack that will not work beyond most basic test case?

 C

 (forgive lack of actual code in this email)

 -Original Message-
 From: Ted Dunning [mailto:ted.dunn...@gmail.com]
 Sent: Tuesday, September 07, 2010 1:11 PM
 To: zookeeper-user@hadoop.apache.org
 Cc: Benjamin Reed
 Subject: Re: closing session on socket close vs waiting for timeout

 This really is, just as Ben says, a problem of false positives and false
 negatives in detecting session expiration.

 On the other hand, the current algorithm isn't really using all the
 information available.  The current algorithm is using time since last
 client-initiated heartbeat.  The new proposal is somewhat worse in that it
 proposes to use just the boolean has-TCP-disconnect-happened.

 Perhaps it would be better to use multiple features in order to decrease
 both false positives and false negatives.

 For instance, I could imagine that we use the following features:

 - time since last client heartbeat or disconnect or reconnect

 - what was the last event? (a heartbeat or a disconnect or a reconnect)

 Then the expiration algorithm could use a relatively long time since last
 heartbeat and a relatively short time since last disconnect to mark a
 session as disconnected.

 Wouldn't this avoid expiration during GC and cluster partition and cause
 expiration quickly after a client disconnect?


 On Mon, Sep 6, 2010 at 11:26 PM, Patrick Hunt ph...@apache.org wrote:

 That's a good point, however with suitable documentation, warnings and such
 it seems like a reasonable feature to provide for those users who require
 it. Used in moderation it seems fine to me. Perhaps we also make it
 configurable at the server level for those administrators/ops who don't want
 to deal with it (disable the feature entirely, or only enable on particular
 servers, etc...).

 Patrick

 On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:

 if this mechanism were used very often, we would get a huge number of
 session expirations when a server fails. you are trading fast error
 detection for the ability to tolerate temporary network and server outages.

 to be honest this seems like something that in theory sounds like it will
 work in practice, but once deployed we start getting session expirations for
 cases that we really do not want or expect.

 ben


 On 09/01/2010 12:47 PM, Patrick Hunt wrote:



 Ben, in this case the session would be tied directly to the connection,
 we'd explicitly deny session re-establishment for this session type (so
 4 would fail). Would that address your concern, others?

 Patrick

 On 09/01/2010 10:03 AM, Benjamin Reed wrote:




 i'm a bit skeptical that this is going to work out properly. a server
 may receive a socket reset even though the client is still alive:

 1) client sends a request to a server
 2) client is partitioned from the server
 3) server starts trying to send response
 4) client reconnects to a different server
 5) partition heals
 6) server gets a reset from client

 at step 6 i don't think you want to delete the ephemeral nodes.

 ben

 On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:




 Yes that's right. Which network issues can cause the socket to close
 without the initiating process closing the socket? In my limited
 experience in this area network issues were more prone to leave dead
 sockets open rather than vice versa so I don't know what to look out for.

 Thanks,
 Camille

 -Original Message-
 From: Dave Wright [mailto:wrig...@gmail.com]
 Sent: Tuesday, August 31, 2010 1:14 PM
 To: zookeeper-user

Re: closing session on socket close vs waiting for timeout

2010-09-07 Thread Patrick Hunt
That's a good point, however with suitable documentation, warnings and such
it seems like a reasonable feature to provide for those users who require
it. Used in moderation it seems fine to me. Perhaps we also make it
configurable at the server level for those administrators/ops who don't want
to deal with it (disable the feature entirely, or only enable on particular
servers, etc...).

Patrick

On Mon, Sep 6, 2010 at 2:10 PM, Benjamin Reed br...@yahoo-inc.com wrote:

 if this mechanism were used very often, we would get a huge number of
 session expirations when a server fails. you are trading fast error
 detection for the ability to tolerate temporary network and server outages.

 to be honest this seems like something that in theory sounds like it will
 work in practice, but once deployed we start getting session expirations for
 cases that we really do not want or expect.

 ben


 On 09/01/2010 12:47 PM, Patrick Hunt wrote:

 Ben, in this case the session would be tied directly to the connection,
 we'd explicitly deny session re-establishment for this session type (so
 4 would fail). Would that address your concern, others?

 Patrick

 On 09/01/2010 10:03 AM, Benjamin Reed wrote:


 i'm a bit skeptical that this is going to work out properly. a server
 may receive a socket reset even though the client is still alive:

 1) client sends a request to a server
 2) client is partitioned from the server
 3) server starts trying to send response
 4) client reconnects to a different server
 5) partition heals
 6) server gets a reset from client

 at step 6 i don't think you want to delete the ephemeral nodes.

 ben

 On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:


 Yes that's right. Which network issues can cause the socket to close
 without the initiating process closing the socket? In my limited
 experience in this area network issues were more prone to leave dead
 sockets open rather than vice versa so I don't know what to look out
 for.

 Thanks,
 Camille

 -Original Message-
 From: Dave Wright [mailto:wrig...@gmail.com]
 Sent: Tuesday, August 31, 2010 1:14 PM
 To: zookeeper-user@hadoop.apache.org
 Subject: Re: closing session on socket close vs waiting for timeout

 I think he's saying that if the socket closes because of a crash (i.e. not a
 normal zookeeper close request) then the session stays alive until the
 session timeout, which is of course true since ZK allows reconnection and
 resumption of the session in case of disconnect due to network issues.

 -Dave Wright

 On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 That doesn't sound right to me.

 Is there a Zookeeper expert in the house?

 On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
 camille.fourn...@gs.com  wrote:



 I foolishly did not investigate the ZK code closely enough and it seems
 that closing the socket still waits for the session timeout to remove the
 session.








Re: closing session on socket close vs waiting for timeout

2010-09-06 Thread Benjamin Reed
if this mechanism were used very often, we would get a huge number of 
session expirations when a server fails. you are trading fast error 
detection for the ability to tolerate temporary network and server outages.


to be honest this seems like something that in theory sounds like it 
will work in practice, but once deployed we start getting session 
expirations for cases that we really do not want or expect.
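
The trade-off can be made concrete with a small, purely illustrative Python sketch. The function name and the failover delays are hypothetical; it simply counts how many sessions are lost when one server dies and its clients fail over to other servers.

```python
def expired_sessions(reconnect_delays, session_timeout, expire_on_disconnect):
    """Count sessions lost when one server dies and its clients fail over.

    reconnect_delays: seconds each client needs to reach another server.
    """
    if expire_on_disconnect:
        # Every dropped socket ends its session, however fast the client is.
        return len(reconnect_delays)
    # Timeout-based: only clients slower than the timeout lose their session.
    return sum(1 for delay in reconnect_delays if delay > session_timeout)
```

For four clients with delays of 0.2, 0.5, 1.0 and 45.0 seconds and a 30 second timeout, expiring on disconnect loses all four sessions, while the timeout-based rule loses only the one that took 45 seconds.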


ben

On 09/01/2010 12:47 PM, Patrick Hunt wrote:

Ben, in this case the session would be tied directly to the connection,
we'd explicitly deny session re-establishment for this session type (so
4 would fail). Would that address your concern, others?

Patrick

On 09/01/2010 10:03 AM, Benjamin Reed wrote:
   

i'm a bit skeptical that this is going to work out properly. a server
may receive a socket reset even though the client is still alive:

1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

ben

On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:
 

Yes that's right. Which network issues can cause the socket to close
without the initiating process closing the socket? In my limited
experience in this area network issues were more prone to leave dead
sockets open rather than vice versa so I don't know what to look out for.

Thanks,
Camille

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com]
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e. not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
camille.fourn...@gs.com  wrote:

 

I foolishly did not investigate the ZK code closely enough and it seems
that closing the socket still waits for the session timeout to remove the
session.
   
 




Re: closing session on socket close vs waiting for timeout

2010-09-01 Thread Benjamin Reed
i'm a bit skeptical that this is going to work out properly. a server 
may receive a socket reset even though the client is still alive:


1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.
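
The six steps above can be replayed in a toy model. This is a minimal Python sketch under loud assumptions: the `Ensemble` class, the single shared session table, and the `/locks/owner` path are all hypothetical and only mimic the shape of the scenario, not ZooKeeper's implementation.

```python
class Ensemble:
    """Toy model: one shared session table, as if replicated across servers."""

    def __init__(self, expire_on_reset):
        self.expire_on_reset = expire_on_reset
        self.ephemerals = {}   # session id -> set of ephemeral node paths
        self.attached_to = {}  # session id -> server currently serving it

    def connect(self, session, server):
        self.ephemerals.setdefault(session, set())
        self.attached_to[session] = server

    def create_ephemeral(self, session, path):
        self.ephemerals[session].add(path)

    def socket_reset(self, session, server):
        # The old server sees its stale connection die.
        if self.expire_on_reset:
            # Proposed behaviour: a reset ends the session outright.
            self.ephemerals.pop(session, None)
            self.attached_to.pop(session, None)
        elif self.attached_to.get(session) == server:
            # Timeout-based behaviour: just detach and let the timeout decide.
            self.attached_to[session] = None


def run_scenario(expire_on_reset):
    """Return True if the ephemeral node survives the six steps."""
    zk = Ensemble(expire_on_reset)
    zk.connect("s1", "A")                      # 1) client sends a request to A
    zk.create_ephemeral("s1", "/locks/owner")
    # 2) client is partitioned from A; 3) A starts trying to send a response
    zk.connect("s1", "B")                      # 4) client reconnects to B
    # 5) partition heals
    zk.socket_reset("s1", "A")                 # 6) A gets a reset from client
    return "/locks/owner" in zk.ephemerals.get("s1", set())
```

In this model `run_scenario(False)` keeps the node, while `run_scenario(True)` deletes it even though the client is alive and connected to B.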

ben

On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:

Yes that's right. Which network issues can cause the socket to close without 
the initiating process closing the socket? In my limited experience in this 
area network issues were more prone to leave dead sockets open rather than vice 
versa so I don't know what to look out for.

Thanks,
Camille

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com]
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e. not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
camille.fourn...@gs.com  wrote:

 

I foolishly did not investigate the ZK code closely enough and it seems
that closing the socket still waits for the session timeout to remove the
session.
   
 




Re: closing session on socket close vs waiting for timeout

2010-09-01 Thread Patrick Hunt
Ben, in this case the session would be tied directly to the connection, 
we'd explicitly deny session re-establishment for this session type (so 
4 would fail). Would that address your concern, others?
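
A minimal sketch of what such a connection-bound session type might look like, in Python. All names here are hypothetical; the only behaviour taken from the proposal is that re-establishment is denied for this session type and that the session ends with its connection.

```python
class SessionRejected(Exception):
    """Raised when a client may not attach to the requested session."""


class SessionTable:
    def __init__(self):
        self.sessions = {}  # session id -> {"bound": bool, "conn": str or None}

    def create(self, session_id, conn, connection_bound=False):
        self.sessions[session_id] = {"bound": connection_bound, "conn": conn}

    def reattach(self, session_id, new_conn):
        session = self.sessions.get(session_id)
        if session is None:
            raise SessionRejected("unknown or expired session")
        if session["bound"]:
            # Step 4 of the scenario fails for this session type.
            raise SessionRejected("connection-bound session cannot move")
        session["conn"] = new_conn

    def connection_closed(self, session_id, conn):
        session = self.sessions.get(session_id)
        if session is None or session["conn"] != conn:
            return  # stale connection; the session lives elsewhere or is gone
        if session["bound"]:
            del self.sessions[session_id]  # session ends with its connection
        else:
            session["conn"] = None         # ordinary session waits for timeout
```

An ordinary session can still move to a new connection and ignores a later reset on the old one; a bound session can only ever end.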


Patrick

On 09/01/2010 10:03 AM, Benjamin Reed wrote:

i'm a bit skeptical that this is going to work out properly. a server
may receive a socket reset even though the client is still alive:

1) client sends a request to a server
2) client is partitioned from the server
3) server starts trying to send response
4) client reconnects to a different server
5) partition heals
6) server gets a reset from client

at step 6 i don't think you want to delete the ephemeral nodes.

ben

On 08/31/2010 01:41 PM, Fournier, Camille F. [Tech] wrote:

Yes that's right. Which network issues can cause the socket to close
without the initiating process closing the socket? In my limited
experience in this area network issues were more prone to leave dead
sockets open rather than vice versa so I don't know what to look out for.

Thanks,
Camille

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com]
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e. not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

That doesn't sound right to me.

Is there a Zookeeper expert in the house?

On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech]
camille.fourn...@gs.com wrote:


I foolishly did not investigate the ZK code closely enough and it seems
that closing the socket still waits for the session timeout to remove the
session.




Re: closing session on socket close vs waiting for timeout

2010-08-31 Thread Dave Wright
I think he's saying that if the socket closes because of a crash (i.e. not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 That doesn't sound right to me.

 Is there a Zookeeper expert in the house?

 On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] 
 camille.fourn...@gs.com wrote:

  I foolishly did not investigate the ZK code closely enough and it seems
  that closing the socket still waits for the session timeout to remove the
  session.



RE: closing session on socket close vs waiting for timeout

2010-08-31 Thread Fournier, Camille F. [Tech]
Yes that's right. Which network issues can cause the socket to close without 
the initiating process closing the socket? In my limited experience in this 
area network issues were more prone to leave dead sockets open rather than vice 
versa so I don't know what to look out for.

Thanks,
Camille 

-Original Message-
From: Dave Wright [mailto:wrig...@gmail.com] 
Sent: Tuesday, August 31, 2010 1:14 PM
To: zookeeper-user@hadoop.apache.org
Subject: Re: closing session on socket close vs waiting for timeout

I think he's saying that if the socket closes because of a crash (i.e. not a
normal zookeeper close request) then the session stays alive until the
session timeout, which is of course true since ZK allows reconnection and
resumption of the session in case of disconnect due to network issues.

-Dave Wright

On Tue, Aug 31, 2010 at 1:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 That doesn't sound right to me.

 Is there a Zookeeper expert in the house?

 On Tue, Aug 31, 2010 at 8:58 AM, Fournier, Camille F. [Tech] 
 camille.fourn...@gs.com wrote:

  I foolishly did not investigate the ZK code closely enough and it seems
  that closing the socket still waits for the session timeout to remove the
  session.