Re: [Dovecot] TIMO HELP! director ring wont stay connected

2012-09-11 Thread Timo Sirainen
On 3.9.2012, at 21.26, Kelsey Cummings wrote:

 Sep  3 09:22:42 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/right disconnected
 Sep  3 09:22:45 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected
 Sep  3 09:22:49 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected
 Sep  3 09:22:53 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected
 Sep  3 09:22:54 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/left disconnected
 Sep  3 09:22:59 b.director. b dovecot: director: Error: Director 10.10.10.71:9321/left disconnected
 Sep  3 09:23:02 a.director. a dovecot: director: Error: Director 10.10.10.37:9321/right disconnected

All of these connections had finished handshaking. They simply seemed to drop 
for no apparent reason. I found one possible cause that could explain all of 
this:

http://hg.dovecot.org/dovecot-2.1/rev/24e791bbcf69

A user's weak status is remembered until either all directors are shut down or 
all directors have acknowledged it.

 Sep  3 09:23:02 a.director. a dovecot: director: Warning: director: Couldn't connect to right side, we must be the only director left

The user weakness should have been removed at this point, but it looks like 
the code for that is missing in the single-director case. I'll fix that soon.

 Sep  3 09:23:32 a.director. a dovecot: director: Error: director: User foo host lookup failed: Timeout - queued for 47 secs (Ring synced for 30 secs, weak user, user refreshed 47 secs ago)
 Sep  3 09:23:32 a.director. a dovecot: director: Error: director: User bar host lookup failed: Timeout - queued for 38 secs (Ring synced for 30 secs, weak user, user refreshed 38 secs ago)

These are the weak users causing the trouble.

This improves logging: http://hg.dovecot.org/dovecot-2.1/rev/27d3289e1f5c



Re: [Dovecot] TIMO HELP! director ring wont stay connected

2012-09-04 Thread Kelsey Cummings

On 09/03/12 12:06, Timo Sirainen wrote:

On 3.9.2012, at 21.26, Kelsey Cummings wrote:


I've had a two-director ring up and running with production load on 2.1.8 with 
around 10,000 active connections for two weeks and everything has been working 
great - until this morning.

There isn't anything obvious in the logs beyond the fact that the director 
connections started bouncing. It was not resolved by reloads, restarts, or an 
upgrade to 2.1.9 (directors only).


Did you try stopping both and then starting them again? That clears up all the 
state they have.


I stopped both directors last night and they were able to stay in sync 
after they were restarted. Could corruption of the in-memory state lead 
to the connections being dropped?


If this happens again I'll try to get a tcpdump and an strace so the bug 
can get squashed.


-K


Re: [Dovecot] TIMO HELP! director ring wont stay connected

2012-09-04 Thread Timo Sirainen
On 3.9.2012, at 21.26, Kelsey Cummings wrote:

 passdb {
  args = proxy=y nopassword=y
  driver = static
 }

I wonder if someone was doing a ton of logins for different usernames? This 
kind of setup where director doesn't verify the username can be attacked that 
way.



Re: [Dovecot] TIMO HELP! director ring wont stay connected

2012-09-04 Thread Timo Sirainen
On 5.9.2012, at 3.58, Timo Sirainen wrote:

 On 3.9.2012, at 21.26, Kelsey Cummings wrote:
 
 passdb {
 args = proxy=y nopassword=y
 driver = static
 }
 
 I wonder if someone was doing a ton of logins for different usernames? This 
 kind of setup where director doesn't verify the username can be attacked that 
 way.

Although the extra users should be freed from memory after 15 minutes.

Hmm. Once Dovecot supports moving existing connections from one backend server 
to another without the client noticing anything, the director could be 
simplified by using consistent hashing. When the number of backends changes, 
the director could start moving connections to their proper backends. During 
this move, new connections would be handled as follows: 1) if the old backend 
is the same as the new backend, just forward the connection there; 2) if they 
differ, request an immediate move of that user's existing connections and wait 
for it to finish before letting the new connection proceed. Alternatively, if 
the user isn't being moved at that moment, forward the connection to the old 
server and let it be part of the later move.

The main difference here is that directors wouldn't need to keep track of 
user-to-backend associations. The moving period could still be a bit tricky to 
handle well, especially since the situation can change again while a previous 
move is still going on.
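
To make the consistent hashing idea above a bit more concrete, here is a rough 
Python sketch. It is not Dovecot code and not how the director works today; 
the names HashRing, backend_for and route, the backend addresses and the 
vnodes count are all made up for illustration. It only shows that the backend 
can be computed from the username alone, and how the two cases for a new 
connection could be told apart by comparing the ring before and after a 
backend change.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is randomized per process.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring mapping usernames to backends (illustrative only)."""

    def __init__(self, backends, vnodes=100):
        # Each backend gets `vnodes` points on the ring to even out the load.
        self._points = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def backend_for(self, user: str) -> str:
        # First ring point at or after the user's hash, wrapping around.
        i = bisect.bisect_left(self._keys, _hash(user)) % len(self._keys)
        return self._points[i][1]


def route(user, old_ring, new_ring):
    """Decide what to do with a new connection while backends are changing."""
    old, new = old_ring.backend_for(user), new_ring.backend_for(user)
    if old == new:
        return ("forward", new)  # case 1: nothing to move
    return ("move_then_forward", new)  # case 2: move existing sessions first


# Hypothetical backend addresses: three backends, then a fourth is added.
old_ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
new_ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
users = [f"user{i}" for i in range(1000)]
moved = [u for u in users if route(u, old_ring, new_ring)[0] != "forward"]
print(f"{len(moved)} of {len(users)} users would need to move")
```

With this kind of ring, adding a backend only reassigns users to the new 
backend; everyone else stays where they were, which is what would keep the 
number of moves small. The sketch ignores the harder part mentioned above, 
where the set of backends changes again while a move is still in progress.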

Re: [Dovecot] TIMO HELP! director ring wont stay connected

2012-09-04 Thread Kelsey Cummings

On 9/4/2012 5:58 PM, Timo Sirainen wrote:

On 3.9.2012, at 21.26, Kelsey Cummings wrote:


passdb {
  args = proxy=y nopassword=y
  driver = static
}


I wonder if someone was doing a ton of logins for different usernames? This 
kind of setup where director doesn't verify the username can be attacked that 
way.


It doesn't look like there was a higher than normal number of failed 
logins leading up to the connection issues.  I'm going to write some 
more stats collection tools to track state on the directors and see what 
comes of it.
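
As a possible starting point for that kind of stats collection, here is a 
rough Python sketch that tallies the director disconnect errors per peer and 
side. The regular expression is only an assumption based on the log lines 
quoted in this thread, and the names DISCONNECT and tally_disconnects are 
made up for illustration.

```python
import re
from collections import Counter

# Assumed format, taken from the disconnect lines quoted in this thread.
DISCONNECT = re.compile(
    r"director: Error: Director (?P<peer>\S+)/(?P<side>left|right) disconnected"
)


def tally_disconnects(lines):
    """Count disconnect errors per (peer, side); other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = DISCONNECT.search(line)
        if m:
            counts[(m["peer"], m["side"])] += 1
    return counts


sample = [
    "Sep  3 09:22:42 b.director. b dovecot: director: Error: Director "
    "10.10.10.71:9321/right disconnected",
    "Sep  3 09:22:45 a.director. a dovecot: director: Error: Director "
    "10.10.10.37:9321/left disconnected",
    "Sep  3 09:22:49 b.director. b dovecot: director: Error: Director "
    "10.10.10.71:9321/left disconnected",
    "Sep  3 09:23:02 a.director. a dovecot: director: Warning: director: "
    "Couldn't connect to right side, we must be the only director left",
]

counts = tally_disconnects(sample)
for (peer, side), n in sorted(counts.items()):
    print(peer, side, n)
```

Feeding it the syslog file line by line instead of the sample list would give 
a quick picture of which side of the ring is flapping and how often.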


Can the director proxy validate the username via a unix pw lookup but 
not check the password?


--
Kelsey Cummings - k...@corp.sonic.net  sonic.net, inc.
System Architect  2260 Apollo Way
707.522.1000  Santa Rosa, CA 95407


Re: [Dovecot] TIMO HELP! director ring wont stay connected

2012-09-03 Thread Kelsey Cummings

On 9/3/2012 12:06 PM, Timo Sirainen wrote:

Did you try stopping both and then starting them again? That clears up all the 
state they have.


I'm not sure that they were both down when restarting them and will try 
this tonight.



If the state clearing doesn't help, maybe this has something to do with the OS, 
or the network really is having some issues.


I can't rule that out, but there are no signs of any hardware, OS, or 
network-related issues.


Thanks for getting the ring status into doveadm, by the way. At least 
our monitoring caught this quickly.


--
Kelsey Cummings - k...@corp.sonic.net  sonic.net, inc.
System Architect  2260 Apollo Way
707.522.1000  Santa Rosa, CA 95407