Re: [Dovecot] Fwd: Re: Dotlock dovecot-uidlist errors / NFS / High Load
Stan,

On 1/20/11 7:45 PM, "Stan Hoeppner" wrote:
> What you're supposed to do, and what VMWare recommends, is to run ntpd
> _only in the ESX host_ and _not_ in each guest. According to:
> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1006427

Did you read the document you linked? As was mentioned on this list fairly
recently, that has not been the recommendation for quite some time. To the
contrary:

===
NTP Recommendations

Note: In all cases use NTP instead of VMware Tools periodic time
synchronization. (...) When using NTP in the guest, disable VMware Tools
periodic time synchronization.
===

We run the guests with divider=10, periodic timesync disabled, and NTP on
both the host and the guest. We have not had any time problems in several
years of operation.

-Brad
Re: [Dovecot] SSD drives are really fast running Dovecot
On 1/14/11 8:59 PM, "Brandon Davidson" wrote:
> I work for central IS, so this is the first stage of a consolidated
> service offering that we anticipate may encompass all of our staff and
> faculty. We bought what we could with what we had, anticipating that
> usage will grow over time as individual units migrate off their existing
> infrastructure.
>
> 1/3 of the available capacity is passive 3rd-site disaster-recovery. The
> remaining 2 sites each host both an active and a passive copy of each
> mail store; we design to be able to sustain a site outage without loss
> of service. Each site has extra space for several years of growth,
> database restores, and archival / records retention reserves.

Oh, and you probably don't even want to think about what we did for our
Dovecot infrastructure. Clustered NFS servers with seamless failover,
snapshotting, and real-time block-level replication aren't cheap.

The students and faculty/staff not supported by an existing Exchange
environment aren't getting any less support, I'll say that much. Folks
trust us with their education, their livelihoods, and their personal
lives. I'd like to think that 'my fellow taxpayers' understand the
importance of what we do and appreciate the measures we take to ensure
the integrity and availability of their data.

-Brad
Re: [Dovecot] SSD drives are really fast running Dovecot
Stan,

On 1/14/11 7:09 PM, "Stan Hoeppner" wrote:
> The average size of an email worldwide today is less than 4KB, less than
> one typical filesystem block.
>
> 28TB / 4KB = 28,000,000,000,000 bytes / 4096 bytes = 6,835,937,500 =
> 6.8 billion emails / 5,000 users =
> 1,367,188 emails per user
>
> 6.8 billion emails is "not much anymore" for a 5,000 seat org?

You obviously don't live in the same world I do. Have you ever been part
of a grant approval process and seen what kinds of files are exchanged,
and with what frequency? Complied with retention and archival policies?
Dealt with folks who won't (or can't) delete a message once they've
received it?

Blithely applying some inexplicable figure you've pulled out of
who-knows-where and extrapolating from that hardly constitutes prudent
planning. We based our requirement on real numbers observed in our
environment, expected growth, and our budget cycle. How do you plan? More
blind averaging?

> How much did that 252TB NetApp cost the university? $300k? $700k? Just a
> drop in the bucket right? Do you think that was a smart purchasing
> decision, given your state's $3.8 Billion deficit?

You're close, if a bit high with one of your guesses. Netapp is good to
Education. Not that it matters - you know very little about the financial
state of my institution or how capital expenditures work within my
department's funding model. I suppose I shouldn't be surprised though;
you seem to be very skilled at taking a little bit of information and
making a convincing-sounding argument about it... regardless of how much
you actually know.

> For comparison, as of Feb 2009, the entire digital online content of the
> Library of Congress was only 74TB. And you just purchased 252TB just for
> email for a 5,000 head count subsection of a small state university's
> population?

I work for central IS, so this is the first stage of a consolidated
service offering that we anticipate may encompass all of our staff and
faculty.
We bought what we could with what we had, anticipating that usage will
grow over time as individual units migrate off their existing
infrastructure. Again, you're guessing and casting aspersions.

This is enterprise storage; I'm not sure that you know what this actually
means either. With Netapp you generally lose on the order of 35-45% due
to right-sizing, RAID, spares, and aggregate/volume/snapshot reserves.
What's left will be carved up into LUNs and presented to the hosts.

1/3 of the available capacity is passive 3rd-site disaster-recovery. The
remaining 2 sites each host both an active and a passive copy of each
mail store; we design to be able to sustain a site outage without loss of
service. Each site has extra space for several years of growth, database
restores, and archival / records retention reserves. That's how 16TB of
active mail can end up requiring 252TB of raw disk.

Doing things right can be expensive, but it's usually cheaper in the long
run than doing it wrong. It's like looking into a whole other world for
you, isn't it? No Newegg parts here...

-Brad
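[Editor's note: the capacity breakdown described above can be sketched numerically. The 40% loss figure (midpoint of the quoted 35-45%) and the even split between copies are assumptions for illustration, not figures from the post.]

```python
# Illustrative reconstruction: how ~16 TB of active mail can consume
# ~252 TB of raw disk. Percentages are assumptions for this sketch.
raw_tb = 252.0
usable_tb = raw_tb * (1 - 0.40)        # ~40% lost to right-sizing, RAID, spares, reserves
dr_tb = usable_tb / 3                  # 1/3 reserved for passive 3rd-site DR
per_site_tb = (usable_tb - dr_tb) / 2  # two active sites share the rest
per_copy_tb = per_site_tb / 2          # each site holds an active AND a passive copy
active_mail_tb = 16.0
headroom_tb = per_copy_tb - active_mail_tb  # growth / restores / retention reserve

print(round(usable_tb, 1), round(per_copy_tb, 1), round(headroom_tb, 1))
```

Under these assumed numbers, each of the five mail-store copies gets roughly 25 TB, leaving single-digit terabytes of headroom per copy over the 16 TB of active mail.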
Re: [Dovecot] Ongoing performance issues with 2.0.x
Stan,

On 11/8/10 10:39 AM, "Stan Hoeppner" wrote:
> However, if CONFIG_HZ=1000 you're generating WAY too many interrupts/sec
> to the timer, ESPECIALLY on an 8 core machine. This will exacerbate the
> high context switching problem. On an 8 vCPU (and physical CPU) machine
> you should have CONFIG_HZ=100 or a tickless kernel. You may get by
> using 250, but anything higher than that is trouble.

On modern kernels you can boot with "divider=10" to take the HZ from 1000
down to 100 at boot time - no rebuilding necessary.

-Brad
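[Editor's note: as a sketch of where the "divider=10" parameter goes on a RHEL 5-era guest, the kernel line in grub.conf gets the extra argument. Paths, kernel version, and volume names below are illustrative, not from the post.]

```
# /boot/grub/grub.conf (kernel version and root device illustrative)
title CentOS (2.6.18-194.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-194.el5 ro root=/dev/VolGroup00/LogVol00 divider=10
        initrd /initrd-2.6.18-194.el5.img
```

With divider=10, a kernel built with CONFIG_HZ=1000 ticks at an effective 100 Hz, so the change takes effect at the next reboot with no rebuild.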
Re: [Dovecot] remote hot site, IMAP replication or cluster over WAN
Stan,

On 11/1/10 7:30 PM, "Stan Hoeppner" wrote:
> 1. How many of you have a remote site hot backup Dovecot IMAP server?

+1

> 2. How are you replicating mailbox data to the hot backup system?
> C. Other

Netapp Fabric MetroCluster, active IMAP/POP3 nodes at both sites mounting
storage over NFS, and active/standby hardware load balancers in front.
Probably more than most folks can afford, but it's pretty bulletproof.

-Brad
Re: [Dovecot] Problem with namespace (maybe bug?)
Timo,

On 10/28/10 5:13 AM, "Timo Sirainen" wrote:
>> . list (subscribed) "" "*"
>> * LIST (\Subscribed \NonExistent) "/"
>> "Shared/tester2/sdfgsg/gsdfgf/vtyjyfgj/rtdhrthxs/zhfhg"
>> . OK List completed.
>
> Looks like a bug, yeah. Should be fixed in v2.0. I don't know if it's
> worth the trouble to fix it in v1.2 anymore though..

I think it's differently broken in 2.0 in certain configurations. See
http://www.dovecot.org/list/dovecot/2010-October/054310.html

It's possible that the same change to SUBSCRIPTIONS would fix it on 1.2?

-Brad
Re: [Dovecot] Retrieving unread message count
Timo,

On 10/17/10 4:20 PM, "Timo Sirainen" wrote:
> On 18.10.2010, at 0.19, Brandon Davidson wrote:
>> Other than actually calling THREAD and counting the resulting groups,
>> is there a good way to get a count of threads?
>
> Nope, that's the only way.

It looks like draft-ietf-morg-inthread-01 dropped THREADROOT/THREADLEAF,
which is too bad. It would be really nice to be able to do something
like:

s01 SEARCH RETURN (COUNT) THREADROOT INTHREAD

It looks like it's still pretty much a work in progress, but are you
planning on implementing any more of the INTHREAD stuff?

-Brad
Re: [Dovecot] Retrieving unread message count
Timo,

On 10/17/10 3:56 PM, "Timo Sirainen" wrote:
> The reason why STATUS is mentioned to be possibly slow is to discourage
> clients from doing a STATUS to all mailboxes.
>
> STATUS is definitely faster than SELECT+SEARCH with all IMAP servers.

That's what I figured, thanks! Other than actually calling THREAD and
counting the resulting groups, is there a good way to get a count of
threads?

-Brad
[Dovecot] Retrieving unread message count
Timo,

I'm working with a webmail client that periodically polls unread message
counts for a list of folders. It currently does this by doing a LIST or
LSUB and then iterating across all of the folders, running a SEARCH ALL
UNSEEN, and counting the resulting UID list.

Eventually I'd like to see it using RFC 5819 LIST-EXTENDED, but that
requires a fair bit of work. In the meantime I'm trying to speed up the
existing iteration. I've got it working using 'STATUS "mailbox"
(UNSEEN)', but the language in RFC 3501 suggests that this may be slow.
There is a counterproposal to use RFC 4731 ESEARCH and do 'SELECT
"MAILBOX"'; 'SEARCH RETURN (COUNT) UNSEEN'.

From an IMAP server perspective, which do you anticipate would be faster?
From a client perspective it seems like STATUS would be better since it
involves fewer round-trips to the server and less output parsing, but
given the warnings in the RFCs there is concern that it may in fact be
more expensive.

-Brad
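[Editor's note: the round-trip difference between the two approaches can be sketched by generating the command sequences. Folder names and tag prefixes are illustrative; responses are omitted.]

```python
# Each list element represents one client->server round trip.
def status_poll(mailboxes):
    # One STATUS per mailbox, no SELECT needed.
    return [f'a{i} STATUS "{m}" (UNSEEN)' for i, m in enumerate(mailboxes)]

def esearch_poll(mailboxes):
    # SELECT plus ESEARCH per mailbox: twice the round trips.
    cmds = []
    for i, m in enumerate(mailboxes):
        cmds.append(f'b{i} SELECT "{m}"')
        cmds.append(f'c{i} SEARCH RETURN (COUNT) UNSEEN')
    return cmds

folders = ["INBOX", "Sent", "Lists/dovecot"]
print(len(status_poll(folders)), len(esearch_poll(folders)))  # 3 vs 6
```

This only counts protocol round trips from the client side; it says nothing about server-side cost, which is the question the post asks.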
Re: [Dovecot] Significant performance problems
Chris,

On 10/6/10 9:42 PM, "Chris Hobbs" wrote:
> 3) Modified my NFS mount with noatime to reduce i/o hits there. Need to
> figure out what Brad's suggestions about readahead on the server mean.

It's been a while since I mucked with Linux as an NFS server; I've been
on Netapp for a while. There may be fewer knobs than I recall.

> I do have one more idea I'll throw out there. Everything I've got here
> is virtual. I only have the one Dovecot/Postfix server running now, and
> the impression I get from you all is that that should be adequate for my
> load. What would the collective opinion be of simply removing the NFS
> server altogether and mounting the virtual disk holding my messages
> directly to the dovecot server?

If you're not planning on doing some sort of HA failover or load
balancing, and have the option to make your storage direct-attached
instead of NAS, it might be worth trying. There's not much to be gained
from NFS in a single-node configuration.

-Brad
Re: [Dovecot] Broken SELECT ""/EXAMINE ""
Michael,

On 9/1/10 12:18 AM, "Michael M. Slusarz" wrote:
> imapproxy *should* really be using UNSELECT, but that looks like a
> different (imapproxy) bug.

I run imapproxy too. If you're using Dovecot 2.0, set:

imap_capability = +UNSELECT IDLE

Imapproxy is naive and only reads capabilities from the initial banner -
it doesn't refresh them after login. If you make sure they're in the
initial capability list it will behave properly.

-Brad
Re: [Dovecot] nfs director
Noel,

On 8/26/10 11:28 PM, "Noel Butler" wrote:
> I just fail to see why adding more complexity, and essentially making
> $9K load balancers redundant, is the way of the future.

To each their own. If your setup works without it, then fine, don't use
it... but I don't see why you feel the need to disparage it either. It's
hardly bloat; those of us with larger installations do find it useful.
IIRC it was sponsored development, and was running in production for a
large ISP from the very moment it was released.

-Brad
Re: [Dovecot] nfs director
Noel,

On 8/26/10 9:59 PM, "Noel Butler" wrote:
>> I fail to see advantage if anything it add in more point of failure,
>> with
>
> i agree with this and it is why we dont use it
>
> we use dovecots deliver with postfix and have noticed no problems, not
> to say there was none, but if so, we dont notice it.

We might be a slightly larger install than you (60k users, mail on FAS
3170 Metrocluster), but we have noticed corruption issues and the
director is definitely going to see use in our shop.

We still use Sendmail+procmail for delivery, so no issue there... but
we've got hordes of IMAP users that will leave a client running at home,
at their desk, on their phone, and then will use Webmail on their laptop.
Without the director, all of these sessions end up on different backend
mailservers, and it's basically a crapshoot which Dovecot instance
notices a new message first. NFS locking being what it is, odds are an
index will get corrupted sooner or later, and when this happens the
user's mail 'disappears' until Dovecot can reindex it. The users
inevitably freak out and call the helpdesk, who tells them to close and
reopen their mail client.

Maybe you're small enough to not run into problems, or maybe your users
just have lower expectations or a higher pain threshold than ours. Either
way, it's unpleasant for everyone involved, and quite easy to solve with
the director proxy. Timo has been saying for YEARS that you need
user-node affinity if you're doing NFS, and now he's done something about
it. If you've already got a load balancer, then just point the balancer
at a pool of directors, and then point the directors at your existing
mailserver pool.

For health monitoring on the directors, check out:
http://github.com/brandond/poolmon

-Brad
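[Editor's note: the user-node affinity idea described above can be sketched in a few lines. Dovecot's director uses its own vhost/ring algorithm; the simple hash-mod scheme and host names below are assumptions for illustration only.]

```python
import zlib

# Illustrative backend pool (not from the original post).
BACKENDS = ["mail1.example.edu", "mail2.example.edu", "mail3.example.edu"]

def backend_for(user):
    # A stable hash of the username means every session for that user
    # (home, desk, phone, webmail) lands on the same backend, so only
    # one Dovecot instance ever touches that user's index files.
    return BACKENDS[zlib.crc32(user.encode()) % len(BACKENDS)]

# The mapping is deterministic: repeated logins always hit the same node.
print(backend_for("jdoe") == backend_for("jdoe"))  # True
```

The point of the affinity is exactly what the post describes: with all of a user's sessions pinned to one node, NFS lock contention between Dovecot instances on that user's indexes goes away.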
[Dovecot] Login process connection routing
Timo,

Just out of curiosity, how are incoming connections routed to login
processes when run with:

service imap-login {
  service_count = 0
}

I've been playing with this on our test director, and the process
connection counts look somewhat unbalanced. I'm wondering if there are
any performance issues with having a single process handle so many
connections. It seems fine (system load is actually lower than with
service_count = 1), but I thought I'd ask.

/usr/sbin/dovecot
 \_ dovecot/imap-login
 \_ dovecot/imap-login [1 connections (1 TLS)]
 \_ dovecot/imap-login
 \_ dovecot/imap-login [5 connections (5 TLS)]
 \_ dovecot/imap-login [1 connections (1 TLS)]
 \_ dovecot/imap-login [4 connections (4 TLS)]
 \_ dovecot/imap-login [1 connections (1 TLS)]
 \_ dovecot/imap-login [1 connections (1 TLS)]
 \_ dovecot/imap-login [315 connections (315 TLS)]
 \_ dovecot/imap-login [63 connections (63 TLS)]
 \_ dovecot/imap-login [12 connections (12 TLS)]
 \_ dovecot/imap-login
 \_ dovecot/imap-login [10 connections (10 TLS)]
 \_ dovecot/imap-login [2 connections (2 TLS)]
 \_ dovecot/imap-login [370 connections (370 TLS)]
 \_ dovecot/imap-login [24 connections (24 TLS)]

-Brad
Re: [Dovecot] Doveadm director flush/remove
Timo,

On 7/19/10 9:38 AM, "Timo Sirainen" wrote:
> http://hg.dovecot.org/dovecot-2.0/rev/f178792fb820 fixes it?

It makes it further before crashing. Trace attached. I still wonder why
it's timing out in the first place. Didn't you change it to reset the
timeout as long as it's still getting data from the userdb?

-Brad

[Attachment: auth-worker-gdb_2.txt]
Re: [Dovecot] Doveadm director flush/remove
Timo,

>>> Maybe this fixes it: http://hg.dovecot.org/dovecot-2.0/rev/cfd15170dff7
>>
>> Nope, still crashes with the same stack. I'll rebuild with -g and
>> report back.

Here we go. Attached; hopefully Entourage won't mangle the line wrap.

-Brad

[Attachment: auth-worker-gdb.txt]
Re: [Dovecot] Doveadm director flush/remove
Timo,

On 7/17/10 11:06 AM, "Timo Sirainen" wrote:
>> Here's a stack trace. Standard null function pointer. No locals; I
>> think I'd have to recompile to get additional information.
>>
>> #0  0x0000000000000000 in ?? ()
>> #1  0x00415a71 in auth_worker_destroy ()
>> #2  0x00415416 in auth_worker_call_timeout ()
>
> Maybe this fixes it: http://hg.dovecot.org/dovecot-2.0/rev/cfd15170dff7

Nope, still crashes with the same stack. I'll rebuild with -g and report
back.

-Brad
Re: [Dovecot] Doveadm director flush/remove
Timo,

On 7/16/10 4:23 AM, "Timo Sirainen" wrote:
>> Jul 16 01:50:44 cc-popmap7 dovecot: auth: Error: auth worker: Aborted
>> request: Lookup timed out
>> Jul 16 01:50:44 cc-popmap7 dovecot: master: Error: service(auth): child
>> 1607 killed with signal 11 (core dumps disabled)
>
> I don't think that above change should have caused any crashes, so
> backtrace would be nice.

Here's a stack trace. Standard null function pointer. No locals; I think
I'd have to recompile to get additional information.

#0  0x0000000000000000 in ?? ()
#1  0x00415a71 in auth_worker_destroy ()
#2  0x00415416 in auth_worker_call_timeout ()
#3  0x0038b3e5273d in io_loop_handle_timeouts_real () from /usr/lib64/dovecot/libdovecot.so.0
#4  0x0038b3e52797 in io_loop_handle_timeouts () from /usr/lib64/dovecot/libdovecot.so.0
#5  0x0038b3e53958 in io_loop_handler_run () from /usr/lib64/dovecot/libdovecot.so.0
#6  0x0038b3e527dd in io_loop_run () from /usr/lib64/dovecot/libdovecot.so.0
#7  0x0038b3e3b926 in master_service_run () from /usr/lib64/dovecot/libdovecot.so.0
#8  0x004184b1 in main ()

-Brad
Re: [Dovecot] Doveadm director flush/remove
Timo,

On 7/15/10 4:18 PM, "Timo Sirainen" wrote:
>>> Jul 15 13:46:24 cc-popmap7 dovecot: auth: Error: auth worker: Aborted
>>> request: Lookup timed out
>>> Jul 15 13:53:25 cc-popmap7 dovecot: auth: Error: getpwent() failed: No
>>> such file or directory
>
> Also see if http://hg.dovecot.org/dovecot-2.0/rev/d13c1043096e fixes
> this or if there are other timeouts?

Now I get:

Jul 16 01:50:44 cc-popmap7 dovecot: auth: Error: auth worker: Aborted request: Lookup timed out
Jul 16 01:50:44 cc-popmap7 dovecot: master: Error: service(auth): child 1607 killed with signal 11 (core dumps disabled)

Should I try to grab a core, or do you have a good idea where this is
coming from? Seems suspiciously similar to the crash with '-f userlist'.

-Brad
Re: [Dovecot] Doveadm director flush/remove
Timo,

On 7/15/10 4:12 PM, "Timo Sirainen" wrote:
>> Maybe there could be a parameter to get the user list from a file (one
>> username per line) instead of userdb.
>
> Added -f parameter for this.

Awesome! I dumped a userlist (one username per line) which it seems to
read through quite quickly; unfortunately I get...

[r...@cc-popmap7 ~]# doveadm director map -f userlist.txt
Segmentation fault

(lots of pread/mmap snipped)
pread(9, "user0\nuser1\nuser2\nuser3\nuser4"..., 8189, 393042) = 8189
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2acbc3aae000
pread(9, "user5\nuser6\nuser7\nuser8\nuser9"..., 8188, 401231) = 36
pread(9, "", 8152, 401267) = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---

#1  0x2b55977ec1d0 in auth_connection_close () from /usr/lib64/dovecot/libdovecot.so.0
#2  0x2b55977ec258 in auth_master_deinit () from /usr/lib64/dovecot/libdovecot.so.0
#3  0x0040a059 in user_file_get_user_list ()
#4  0x0040a22f in cmd_director_map ()
#5  0x0040897d in doveadm_try_run_multi_word ()
#6  0x00408aab in doveadm_try_run ()
#7  0x00408e0f in main ()

-Brad
Re: [Dovecot] Director proxy timeout
On 7/13/10 4:53 AM, "Timo Sirainen" wrote:
> Hmm. "Between"? Is it doing CAPABILITY before or after login or both?
> That anyway sounds different from the idle timeout problem..

I added some additional logging to imapproxy and it looks like it's
actually getting stuck in a few different commands. It just depends on
what it's trying to do when the connection gets wedged.

What I'm seeing is that from time to time an imapproxy -> imap-login
proxy connection will get stuck and cease responding to commands. After a
while the PHP client will time out and give up, after which the stuck
connection goes back to the pool, and continues to get reused and cause
hangs until I either restart imapproxy or kill off the imap-login proxy
that the stuck socket is connected to.

If I attach to the stuck imap-login process, it's waiting in:

#0  0x00385c0c6070 in __write_nocancel () from /lib64/libc.so.6
#1  0x003c5620c9a1 in login_proxy_state_notify () from /usr/lib64/dovecot/libdovecot-login.so.0
#2  0x003c5620c026 in login_proxy_notify () from /usr/lib64/dovecot/libdovecot-login.so.0
#3  0x003c55e52521 in io_loop_handle_timeouts_real () from /usr/lib64/dovecot/libdovecot.so.0
#4  0x003c55e5257b in io_loop_handle_timeouts () from /usr/lib64/dovecot/libdovecot.so.0
#5  0x003c55e5373c in io_loop_handler_run () from /usr/lib64/dovecot/libdovecot.so.0
#6  0x003c55e525c1 in io_loop_run () from /usr/lib64/dovecot/libdovecot.so.0
#7  0x003c55e3b896 in master_service_run () from /usr/lib64/dovecot/libdovecot.so.0
#8  0x003c5620dc4b in main () from /usr/lib64/dovecot/libdovecot-login.so.0
#9  0x00385c01d994 in __libc_start_main () from /lib64/libc.so.6
#10 0x0000000000402019 in _start ()

If I tcpdump the stuck connection, I can see that imapproxy sends
something to the imap-login proxy when new clients are connected, but I'm
not sure what since it's SSL encrypted. The response is an empty ack
packet. I'm going to try disabling SSL between imapproxy and the director
to see if I can figure out what it's sending.
All in all I'm having a hard time debugging it since it only seems to
happen when there are a decent number of users active. I'm not at all
convinced that it's Dovecot's fault, but if you have any suggestions or
things that I could do to see what the imap-login proxy or backend think
is going on, I'd be much in your debt.

-Brad
[Dovecot] Doveadm director flush/remove
I've got a couple more issues with the doveadm director interface:

1) If I use "doveadm director remove" to disable a host with active
users, the director seems to lose track of users mapped to that host. I
guess I would expect it to tear down any active sessions by killing the
login proxies, as if I'd done 'doveadm director add HOSTNAME 0 && doveadm
director flush HOSTNAME' before removing it? Here's what I see with an
active open connection:

[r...@cc-popmap7 ~]# doveadm director status brandond
Current: 10.142.0.179 (expires 2010-07-14 01:26:14)
Hashed: 10.142.0.179
Initial config: 10.142.0.161
[r...@cc-popmap7 ~]# doveadm director remove 10.142.0.179
[r...@cc-popmap7 ~]# doveadm director status brandond
Current: not assigned
Hashed: 10.142.0.174
Initial config: 10.142.0.161

2) "doveadm director flush" returns the wrong usage:

[r...@cc-popmap7 ~]# doveadm director flush
doveadm director remove [-a ]

3) "doveadm director flush all" breaks the ring:

[r...@cc-popmap7 ~]# doveadm director flush all
Jul 14 01:26:33 cc-popmap7 dovecot: director: Error: Director 10.142.0.180:1234/right disconnected
Jul 14 01:26:33 cc-popmap7 dovecot: director: Error: Director 10.142.0.180:1234/left disconnected
Jul 14 01:26:33 oh-popmap7 dovecot: director: Error: director(10.142.0.162:1234/left): Invalid HOST-FLUSH args
Jul 14 01:26:33 oh-popmap7 dovecot: director: Error: director(10.142.0.162:1234/right): Invalid HOST-FLUSH args

For some reason, flushing a host address only disconnects one side:

[r...@cc-popmap7 ~]# doveadm director flush 10.142.0.160
Jul 14 01:28:23 cc-popmap7 dovecot: director: Error: Director 10.142.0.180:1234/right disconnected
Jul 14 01:28:23 oh-popmap7 dovecot: director: Error: director(10.142.0.162:1234/left): Invalid HOST-FLUSH args

-Brad
Re: [Dovecot] v2.0.rc2 released
Timo,

On 7/11/10 10:58 AM, "Timo Sirainen" wrote:
>> dsync in hg tip is failing tests:
>
> Fixed now, as well as another dsync bug.

Looks good! The new doveadm director status output is a little odd,
though. The 'mail server ip' column is way wide (I guess it adjusts to
term size though?) and the users column got renamed to vhosts, so now
there are two vhosts columns.

-Brad
Re: [Dovecot] dovecot director service
Timo,

On 7/11/10 12:06 PM, "Timo Sirainen" wrote:
>> Pretty much anything built into Dovecot would be an improvement over an
>> external script from my point of view.
>
> Yeah, some day I guess..

Well, I would definitely make use of it if you ever get around to coding
it.

>> With a script I have to deal with all kinds of questions like, which
>> director do I have my script log in to?
>
> Any one of them.

Sure, but that adds additional complexity to the script (selecting a
host, retrying, etc)... or sticking the director/doveadm interface behind
a load balancer if possible.

>> What happens if it goes down? What happens if the monitoring host is
>> down?
>
> Add redundancy :)

Sure, but it's easier for me if I don't have to worry about it ;)

> Well, you could make the doveadm interface available via TCP port, but
> that really should be firewalled well. Hmm. It wouldn't be difficult to
> patch doveadm director to also support connecting to host:port rather
> than unix socket.

That would be pretty awesome. Right now I could try to talk Director
protocol from the management node to inject commands, but the Directors
only accept connections from other hosts listed in the ring, right? So
having doveadm function over TCP would be a big plus.

-Brad
Re: [Dovecot] TLS Issue
Leander,

On 7/10/10 2:14 PM, "Leander S." wrote:
> "You have attempted to establish a connection with "server". However,
> the security certificate presented belongs to "*.server". It is
> possible, though unlikely, that someone may be trying to intercept your
> communication with this web site."

IIRC, wildcard certificates are only valid for subdomains. *.domain.com
would be valid for a.domain.com or b.domain.com, but not domain.com. It
also relies upon the client supporting wildcard certs.

-Brad
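[Editor's note: the matching rule described above - a wildcard covers exactly one subdomain label, not the bare domain - can be sketched as follows. This is a simplified illustration, not a full RFC-compliant hostname verifier.]

```python
def wildcard_matches(cert_name, host):
    # Exact names must match case-insensitively.
    if not cert_name.startswith("*."):
        return cert_name.lower() == host.lower()
    # "*.domain.com" covers exactly one extra label: a.domain.com matches,
    # but domain.com and a.b.domain.com do not.
    suffix = cert_name[2:].lower()
    head, sep, tail = host.lower().partition(".")
    return bool(sep) and head != "" and tail == suffix

assert wildcard_matches("*.domain.com", "a.domain.com")
assert not wildcard_matches("*.domain.com", "domain.com")
assert not wildcard_matches("*.domain.com", "a.b.domain.com")
```

This is why a cert for "*.server" triggers the warning when the client connects to plain "server": the bare name has no extra label for the wildcard to consume.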
Re: [Dovecot] v2.0.rc2 released
dsync in hg tip is failing tests:

test-dsync-brain.c:176: Assert failed: test_dsync_mailbox_create_equals(&box_event.box, &src_boxes[6])
test-dsync-brain.c:180: Assert failed: test_dsync_mailbox_create_equals(&box_event.box, &dest_boxes[6])
Segmentation fault

I'm currently using rev 77f244924009; I'm not sure when it started.

-Brad

On 7/9/10 3:14 PM, "Timo Sirainen" wrote:
> http://dovecot.org/releases/2.0/rc/dovecot-2.0.rc2.tar.gz
> http://dovecot.org/releases/2.0/rc/dovecot-2.0.rc2.tar.gz.sig
Re: [Dovecot] dovecot director service
On 7/9/10 12:01 AM, "Xavier Pons" wrote:
> I think this new funcionalities would be perfect (necessary ;-) ) for a
> complete load balanced/high availability mail system.

Timo, what you described sounds great. Pretty much anything built into
Dovecot would be an improvement over an external script from my point of
view. With a script I have to deal with all kinds of questions like,
which director do I have my script log in to? What happens if it goes
down? What happens if the monitoring host is down? I'd probably end up
trying to put the director port behind the load-balancer and figuring out
some way to get my script to talk Director protocol to add/remove mail
servers, and that would just be ugly ;)

>> Yeah. Any good naming ideas for the doveadm director command? :)
>
> something like doveadm director servers ?!?!

I'm not sure either. Maybe 'doveadm director ring'? I thought of
suggesting that 'status' report the ring status, and the current output
go to something like 'mailhosts'. After a moment of consideration, I
realized that all the current director commands (add, remove, status) act
on the mailhost list, not the director list, and so in that sense
'doveadm director' is really more like 'doveadm director-mailhosts' to
begin with.

-Brad
Re: [Dovecot] dovecot director service
Xavier,

On 7/8/10 1:29 AM, "Xavier Pons" wrote:
> Yes, we will have two hardware balancers in front of proxies. Thus, the
> director service will detect failures of backend servers and not forward
> sessions to them? how does it detect if a backend server is alive or
> not?

IIRC, it does not detect failures of backend servers. It's up to you to
detect outages and react appropriately. The folks that sponsored Director
development apparently have a monitoring script that removes downed nodes
by running something like 'ssh directorhost doveadm director remove
backendhost', and then re-adds them when they come back up.

I'm not sure how I'm going to handle this myself, as our monitoring
system only checks every 5 minutes, and our existing load balancers check
and add/remove nodes every 20 seconds or so. 5 minutes would be a long
time to have Dovecot trying to send users to a non-functional server.

> The command 'doveadm director status', gives information about status of
> backend mailservers or of director servers?

Just the backend servers - it shows backend server addresses and how many
users they're each assigned, or details on a specific user mapping. I am
not aware of a way to get Dovecot to output the director ring status.
That would be nice though, to be able to list the directors and how many
connections they're each proxying.

You might read through this thread, which starts here:
http://www.dovecot.org/list/dovecot/2010-May/049189.html
And continues later here:
http://www.dovecot.org/list/dovecot/2010-June/049425.html

-Brad
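[Editor's note: the reconciliation step such a monitoring script would perform can be sketched as a pure function. The host addresses below are illustrative, and the actual sponsored script was not posted; only the 'doveadm director add/remove' commands it drives are from the thread.]

```python
def director_commands(previously_up, currently_up):
    # Compare the last known-good backend set against the current health
    # check results, and emit the doveadm commands to reconcile them.
    cmds = [f"doveadm director remove {h}"
            for h in sorted(previously_up - currently_up)]
    cmds += [f"doveadm director add {h}"
             for h in sorted(currently_up - previously_up)]
    return cmds

# One backend went down, another came back up:
print(director_commands({"10.0.0.1", "10.0.0.2"}, {"10.0.0.2", "10.0.0.3"}))
# -> ['doveadm director remove 10.0.0.1', 'doveadm director add 10.0.0.3']
```

A real script would run these via ssh (or locally on a director node) on every polling cycle; as the post notes, the polling interval bounds how long users keep getting sent to a dead backend.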
Re: [Dovecot] dovecot evaluation on a 30 gb mailbox
Timo,

On 6/24/10 4:23 AM, "Timo Sirainen" wrote:
>> I'd recommend also installing and configuring imapproxy - it can be
>> beneficial with squirrelmail.
>
> Do you have any real world numbers about installations with and without
> imapproxy?

We run imapproxy behind our Roundcube instance, and our old in-house Perl
mail system has a custom equivalent written in C that also does some
caching of folder metadata and message headers. We run a proxy instance
on each of the webmail hosts, with the communication between the web
application and the proxy being done in cleartext, but with the proxy ->
Dovecot communication secured over SSL.

Besides preventing a lot of extra SSL handshakes and login/logout
actions, it also helps tie a user session to a single backend node in our
pool of IMAP servers. It seems like there might also be other benefits to
having Dovecot not tear down all of the user session state between page
loads.

A lot of this stuff might be nice to see in the Director some day. If
there was an option to not immediately close the Director proxy's backend
connections when the user logs out (i.e. leave the connection active and
logged in for X seconds, and reuse it if the user logs in to the Director
again), and if the auth caching works as well as you say, then I could
definitely see a day where we replace imapproxy with a director instance
on the webmail host.

-Brad
Re: [Dovecot] 'doveadm who' enhancement request
On 6/2/10 7:33 PM, "Timo Sirainen" wrote:
>> I wonder if they can stand up to 10k+ concurrent proxied connections
>> though?
>
> I'd think so.

I could probably give that a try, but I'll have a hard time convincing
folks to do that until after 2.0 has been out of beta for a bit. Maybe
after summer term then... October or so? I can certainly start testing it
in advance of that though.

> Also another thought :) I guess you have now two login services for
> imap and pop3, one for proxy and one for backend?

No, just the one for each right now. I haven't figured out how to do that
yet ;) Just multiple imap-login/pop3-login blocks with different
inet_listener addresses?

> You could do the same for auth processes and override the other one's
> settings. Something like:
>
> # disable default auth process for proxy lookups
> service auth {
>   executable = auth -o passdb/proxy/args=ignore.conf
>   unix_listener auth-login {
>     mode = 0
>   }
> }
>
> service auth-proxy {
>   unix_listener auth-login {
>     user = dovecot
>     mode = 0600
>   }
> }

How do I tell different login services to use different auth backends? Is
it the first argument to the login process executable? So like:

service imap-login {
  executable = imap-login auth-proxy
  inet_listener imaps {
    address = 1.2.3.4
  }
}

I'm still trying to grok what all the different config bits mean and
imply.

-Brad
Re: [Dovecot] RTFM: Manual pages for Dovecot v2.0
Pascal,

On 5/31/10 11:40 PM, "Pascal Volk" wrote:
> I've spent some time for the fine manual. What's new?
>
> Location: http://hg.localdomain.org/dovecot-2.0-man
> So I don't have to flood the wiki with attachments.
> As soon as the manual pages are complete, they will be included in the
> Dovecot source tree.

This is fantastic. When I get a moment, I'll definitely read them over. I
spent a good bit of time getting a Dovecot 2.0 test system set up this
weekend and found myself flipping back and forth between Timo's release
announcements, sample configurations in the tarball, and the raw source
code. Doveconf in particular doesn't even seem to provide help text
listing available command line flags, so a man page is very welcome.

Documentation of which configuration options work where would also be
particularly nice to see. The new syntax is incredibly powerful but also
very complex. It appears that there are some things that will pass
doveconf checks but will either cause errors or be ignored by the actual
code at runtime.

-Brad
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo, On 5/31/10 6:56 PM, "Timo Sirainen" wrote: > > Oh, you're right. For auth settings currently only protocol blocks work. It > was a bit too much trouble to make local/remote blocks to work. :) That's too bad! Any hope of getting support for this and director+proxy_maybe anytime soon? -Brad
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo, On 5/31/10 5:34 PM, "Brandon Davidson" wrote: > > Still not sure why it's not proxying though. The config looks good but it's > still using PAM even for the external IP. I played with subnet masks instead of IPs and using remote instead of local, as well as setting auth_cache_size = 0, but no dice. It still seems to ignore the block and only use the global definition, even if doveconf -f lip= shows that it's expanding it properly. -Brad
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo,

On 5/31/10 5:09 PM, "Timo Sirainen" wrote:
>
> Right .. it doesn't work exactly like that I guess. Or I don't remember :)
> Easiest to test with:
>
> doveconf -f lip=128.223.142.138 -n

That looks better:

[r...@cc-popmap7 ~]# doveconf -f lip=128.223.142.138 -h |grep -B1 -A7 passdb
}
passdb {
  args = /etc/dovecot/proxy-sqlite.conf
  deny = no
  driver = sql
  master = no
  pass = no
}
passdb {
  args =
  deny = no
  driver = pam
  master = no
  pass = no
}
plugin {
--
local 128.223.142.138 {
  passdb {
    args = /etc/dovecot/proxy-sqlite.conf
    driver = sql
  }
}

Still not sure why it's not proxying though. The config looks good but it's still using PAM even for the external IP.

-Brad
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo,

On 5/31/10 4:36 PM, "Timo Sirainen" wrote:
>
> The passdbs and userdbs are checked in the order they're defined. You could
> add them at the bottom. Or probably more easily:
>
> local 128.223.143.138 {
>   passdb {
>     driver = sql
>     args = ..
>   }
>
>   passdb {
>     driver = pam
>   }
>   userdb {
>     driver = passwd
>   }

Ahh, OK. For some reason I was assuming that the best match was used. Unfortunately that doesn't seem to work either. I've got it set up just as you recommended:

[r...@cc-popmap7 ~]# cat /etc/dovecot/dovecot.conf | nl | grep -B1 -A4 passdb
    35  local 128.223.142.138 {
    36    passdb {
    37      driver = sql
    38      args = /etc/dovecot/proxy-sqlite.conf
    39    }
    40  }
    41  passdb {
    42    driver = pam
    43  }
    44  userdb {
    45    driver = passwd

It still doesn't respect the driver for that local block, and uses PAM for everything:

May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: client in: AUTH 1 PLAIN service=imap secured lip=128.223.142.138 rip=128.223.162.22 lport=993 rport=57067 resp=
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: pam(brandond,128.223.162.22): lookup service=dovecot
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: pam(brandond,128.223.162.22): #1/1 style=1 msg=Password:
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: pam(brandond,128.223.162.22): #1/1 style=1 msg=LDAP Password:
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: client out: OK 1 user=brandond
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: master in: REQUEST 1 56521d19a5592fd2206241cfc0ca658020b0b
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: passwd(brandond,128.223.162.22): lookup
May 31 16:48:16 cc-popmap7 dovecot: auth: Debug: master out: USER 1 brandond system_groups_user=brandond uid=41027 gid=91 home=/home10/brandond
May 31 16:48:16 cc-popmap7 dovecot: imap-login: Login: user=, method=PLAIN, rip=128.223.162.22, lip=128.223.142.138, TLS, mailpid=5667

Interestingly enough, if I run 'doveconf -n' it doesn't seem to be retaining the order I specified. The local section is dropped down to the very end:

[r...@cc-popmap7 ~]# doveconf -n | nl | grep -B1 -A4 passdb
    31  }
    32  passdb {
    33    driver = pam
    34  }
    35  plugin {
    36    quota = fs:user:inode_per_mail
--
    82  local 128.223.142.138 {
    83    passdb {
    84      args = /etc/dovecot/proxy-sqlite.conf
    85      driver = sql
    86    }
    87  }

Ideas?

-Brad
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo,

On 5/31/10 4:13 PM, "Timo Sirainen" wrote:
> You need to put the other passdb/userdb to the external IP:
>
> local 1.2.3.4 {
>>   userdb {
>>     driver = passwd
>>   }
>>   passdb {
>>     driver = sql
>>     args = /etc/dovecot/proxy-sqlite.conf
>>   }
>
> }

It still doesn't seem to work. I tried this, with no userdb/passdb outside a local block:

local 128.223.142.138 {
  userdb {
    driver = passwd
  }
  passdb {
    driver = sql
    args = /etc/dovecot/proxy-sqlite.conf
  }
}

local 10.142.0.162 {
  userdb {
    driver = passwd
  }
  passdb {
    driver = pam
  }
}

But I got this error in the log file upon connecting to the external IP:

May 31 16:20:42 cc-popmap7 dovecot: auth: Fatal: No passdbs specified in configuration file. PLAIN mechanism needs one
May 31 16:20:42 cc-popmap7 dovecot: master: Error: service(auth): command startup failed, throttling
May 31 16:20:42 cc-popmap7 dovecot: master: Error: service(director): child 5339 killed with signal 11 (core dumps disabled)
May 31 16:20:42 cc-popmap7 dovecot: master: Error: service(director): command startup failed, throttling

So I added a global passdb/userdb:

userdb {
  driver = passwd
}
passdb {
  driver = pam
}

local 128.223.142.138 {
  userdb {
    driver = passwd
  }
  passdb {
    driver = sql
    args = /etc/dovecot/proxy-sqlite.conf
  }
}

local 10.142.0.162 {
  userdb {
    driver = passwd
  }
  passdb {
    driver = pam
  }
}

And again it uses the global passdb for all requests, ignoring the contents of the local blocks.

-Brad
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo,

On 5/31/10 6:04 AM, "Timo Sirainen" wrote:
> Well .. maybe you could use separate services. Have the proxy listen on
> public IP and the backend listen on localhost. Then you can do:
>
> local_ip 127.0.0.1 {
>   passdb {
>     ..
>   }
> }
>
> and things like that. I think it would work, but I haven't actually
> tried.

It doesn't seem to be honoring the passdb setting within the local block. I've got a single host set up with director, and itself listed as a mail server:

director_servers = 128.223.142.138
director_mail_servers = 128.223.142.138

userdb {
  driver = passwd
}
passdb {
  driver = sql
  args = /etc/dovecot/proxy-sqlite.conf
}

local 127.0.0.1 {
  passdb {
    driver = pam
  }
}

If I telnet to localhost and attempt to log in, the logs show:

May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: client in: AUTH 1 PLAIN service=imap secured lip=127.0.0.1 rip=127.0.0.1 lport=143 rport=60417 resp=
May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: sql(brandond,127.0.0.1): query: SELECT null AS password, 'Y' AS nopassword, 'Y' AS proxy
May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: client out: OK 1 user=brandond proxy pass=
May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: client in: AUTH 1 PLAIN service=imap secured lip=128.223.142.138 rip=128.223.142.138 lport=143 rport=44453 resp=
May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: sql(brandond,128.223.142.138): query: SELECT null AS password, 'Y' AS nopassword, 'Y' AS proxy
May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: client out: OK 1 user=brandond proxy pass=
May 31 14:39:34 cc-popmap7 dovecot: imap-login: Error: Proxying loops to itself: user=, method=PLAIN, rip=128.223.142.138, lip=128.223.142.138, secured, mailpid=0
May 31 14:39:34 cc-popmap7 dovecot: auth: Debug: new auth connection: pid=4700
May 31 14:39:34 cc-popmap7 dovecot: imap-login: Disconnected (auth failed, 1 attempts): user=, method=PLAIN, rip=128.223.142.138, lip=128.223.142.138, secured, mailpid=0

Even if the alternate passdb worked, how would I get it to connect to the backend on localhost? It looks like the proxy connection comes in over the external IP even if it's to itself, as the external address is what's specified as the proxy destination by the director.

I do have a private network that I run NFS over; I suppose I could run the proxy on the external, backend on the internal, and use only the internal IPs in the mailserver list. I've also tried that, but it doesn't seem to work either, due to the passdb setting not being honored within local|remote blocks. Even if it did, wouldn't it still complain about the proxy looping back to itself, since both lip and rip would be local addresses? Unless the loopback check just compares to see if they're the same...

Either way, it seems like having proxy_maybe work with the director service would make the whole setup a lot simpler.

> There's not yet a static passdb .. perhaps there should be. But you
> could use e.g. sqlite backend for the proxy and use:
>
> password_query = select null as password, 'Y' as nopassword, 'Y' as
> proxy

That seems to work well enough, with the major caveat noted above.
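For reference, the sqlite-backed static proxy passdb that produced the queries in the log above could look something like this. The file name comes from the thread; the database path is a made-up placeholder, and since the query returns only constants the database contents never matter. A sketch, not a verified config:

```
# /etc/dovecot/proxy-sqlite.conf (sketch; /etc/dovecot/proxy.db is a
# placeholder path -- the query never reads any table)
driver = sqlite
connect = /etc/dovecot/proxy.db

# Every lookup "succeeds" with proxy=y and no password check; the
# director service then fills in the destination host.
password_query = SELECT null AS password, 'Y' AS nopassword, 'Y' AS proxy
```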
Re: [Dovecot] A new director service in v2.0 for NFS installations
Timo, After straightening out some issues with Axel's spec file, I'm back to poking at this. On 5/25/10 3:14 PM, "Timo Sirainen" wrote: > So instead of having separate proxies and mail servers, have only hybrids > everywhere? I guess it would almost work, except proxy_maybe isn't yet > compatible with director. That's actually a bit annoying to implement.. You > could of course run two separate Dovecot instances, but that also can be a bit > annoying. Would I have to run two separate instances, or could I just set up multiple login services on different ports; one set to proxy (forwarding the password to the remote server) and one set to not? I suppose each login service would have to use a different authdb, which I don't know how to do. > No. The director service simply adds "host" field to auth lookup replies if > the original reply had proxy=y but didn't have host field. Interesting. It sounds like proxying requires a database query that will return 'proxy=y' as part of the auth lookup. It would be nice to have a static password authdb for proxying that didn't require a real database backend. I'm using PAM now, and don't see a good way to enable proxying. The wiki also says that there's a way to let the proxy backend handle authentication, but I don't see an example of that anywhere. > Yes. So the connections between the proxies should be pretty fast. I think the > maximum bytes transferred per user is 38. Cool. > The proxies always try to keep connecting to next available server (i.e. if > the next server won't connect, it tries one further away until it finally > connects to something or reaches itself). So the segmentation could happen > only if there was no network connection between the two segments. Ahh, OK - good to know. That sounds like a good way to do it. Can I confirm my understanding of a few other things? 
It looks like the mailserver list is initially populated from director_mail_servers, but can be changed by discovering hosts from other directors or by adding/removing hosts with doveadm. Since the initial host list is not written back into the config file, changes made with doveadm are not persistent across service restarts. Does 'doveadm director ' need to be run against each director individually, or will the changes be sent around the ring? If a new host comes up with a mailserver in its list that has been removed by doveadm, will the handshake remove it from the list?

The list of director servers used to build the ring is read from director_servers, and cannot be changed at runtime. A host finds its position within the ring based on its order within the list, and connects to the hosts to its left and right until it has a connection on either side and can successfully send a test message around the ring. Is that all correct?

What happens if some hosts have only a subset, or different subsets, of a group of hosts in their mail server or director server list?

Thanks!

-Brad
Re: [Dovecot] beta5 builds under RHEL
On 5/30/10 2:49 PM, "Axel Thimm" wrote:
>
> How are your %optflags (which is the same as $RPM_OPT_FLAGS) merged
> into the build if it is not passed to make? And it would yield the
> same CFLAGS as above (merged default optflags with what configure adds
> to it).

They're exported by the %configure macro, and configure writes the combined CFLAGS, CXXFLAGS, and FFLAGS into the Makefile... so it's not necessary (and possibly detrimental) to both export them before configuring and pass them explicitly to make, as the command-line CFLAGS option overrides the Makefile CFLAGS declaration that includes -std=gnu99.

My point is, if I don't include CFLAGS="..." in my call to make in the spec file, it builds fine, and *does* include all the necessary optflags. Give it a try.

-Brad
Re: [Dovecot] beta5 builds under RHEL
Axel,

On 5/30/10 10:22 AM, "Axel Thimm" wrote:
>>
>> Oh, the spec file overrides CFLAGS and doesn't contain -std=gnu99?
>>
>
> The config.log for RHEL5/x86_64 says:
>
> CFLAGS='-std=gnu99 -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2
> -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
> -mtune=generic -Wall -W -Wmissing-prototypes -Wmissing-declarations
> -Wpointer-arith -Wchar-subscripts -Wformat=2 -Wbad-function-cast
> -Wstrict-aliasing=2 -I/usr/kerberos/include '

It may be a specfile issue after all. %configure exports CFLAGS before calling ./configure, which should be sufficient to get any needed options into the Makefile, merged with whatever configure auto-detects (including -std=gnu99). Your spec also calls make with CFLAGS="$RPM_OPT_FLAGS", which overrides everything and omits -std=gnu99 unless it is specifically included by the packager. If I remove that and just call 'make', it works fine - my %optflags are merged in with the CFLAGS from configure and the build completes without error.

-Brad
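A sketch of the %build section as described above (this is not Axel's actual spec file, just the shape of the fix being proposed):

```
%build
# %configure exports CFLAGS=$RPM_OPT_FLAGS before running ./configure,
# and configure merges in its own additions such as -std=gnu99,
# writing the combined flags into the generated Makefile.
%configure

# A plain make picks up the merged CFLAGS from the Makefile. Passing
# CFLAGS="$RPM_OPT_FLAGS" on the make command line would override the
# Makefile's CFLAGS and silently drop -std=gnu99.
make %{?_smp_mflags}
```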
Re: [Dovecot] beta5 builds under RHEL
Axel, On 5/30/10 3:39 AM, "Axel Thimm" wrote: > > Now it is more consistent and looks like a change between 4.1.2 and > 4.4.1. > > Maybe in the older gcc -std=gnu99 didn't set __USE_ISOC99 and thus the > missing constants were not defined? If I '%define optflags -std=gnu99' in the spec it builds just fine, so I don't think it's a compiler problem. Maybe a libtool issue? -Brad
Re: [Dovecot] beta5 builds under RHEL
Axel,

On 5/30/10 12:05 AM, "Axel Thimm" wrote:
> beta4 built under RHEL4, RHEL5 and RHEL6 (the latter being the public
> beta). beta5 now builds only for RHEL5, the other two fail with:
>
> strnum.c: In function `str_to_llong':
> strnum.c:139: error: `LLONG_MIN' undeclared (first use in this function)
> strnum.c:139: error: (Each undeclared identifier is reported only once
> strnum.c:139: error: for each function it appears in.)

FWIW, the build fails with the same error within my CentOS 5 Mock build environment. I'm not sure what I've got set up differently than you, but I'm using a slightly tweaked version of your spec file and a pretty vanilla Mock 0.6 setup.

-Brad
Re: [Dovecot] quick question
Hi David,

> -----Original Message-----
> From: David Halik
>
> It looks like we're still working towards a layer 7 solution anyway.
> Right now we have one of our student programmers hacking Perdition with
> a new plugin for dynamic username caching, storage, and automatic failover.
> If we get it working I can send you the basics if you're interested.

I'd definitely be glad to take a look at what you come up with! I'm still leaning towards MySQL with quick local fallback, but I'm nowhere near committed to anything.

On a side note, we've been running with the two latest maildir patches in production for a few days now. The last few days we've been seeing a lot of lock failures:

Feb 10 04:06:02 cc-popmap6p dovecot: imap-login: Login: user=, method=PLAIN, rip=67.223.67.45, lip=128.223.142.39, TLS, mailpid=12881
Feb 10 04:08:03 oh-popmap3p dovecot: imap-login: Login: user=, method=PLAIN, rip=67.223.67.45, lip=128.223.142.39, TLS, mailpid=9569
Feb 10 04:09:02 cc-popmap6p dovecot: imap: user=, rip=67.223.67.45, pid=12881: Timeout while waiting for lock for transaction log file /home6/pellerin/.imapidx/.INBOX/dovecot.index.log
Feb 10 04:09:02 cc-popmap6p dovecot: imap: user=, rip=67.223.67.45, pid=12881: Our dotlock file /home6/pellerin/Maildir/dovecot-uidlist.lock was modified (1265803562 vs 1265803684), assuming it wa
Feb 10 04:09:02 cc-popmap6p dovecot: imap: user=, rip=67.223.67.45, pid=12881: Connection closed bytes=31/772
Feb 10 04:11:04 oh-popmap3p dovecot: imap: user=, rip=67.223.67.45, pid=9569: Timeout while waiting for lock for transaction log file /home6/pellerin/.imapidx/.INBOX/dovecot.index.log
Feb 10 04:11:04 oh-popmap3p dovecot: imap: user=, rip=67.223.67.45, pid=9569: Our dotlock file /home6/pellerin/Maildir/dovecot-uidlist.lock was deleted (locked 180 secs ago, touched 180 secs ago)
Feb 10 04:11:04 oh-popmap3p dovecot: imap: user=, rip=67.223.67.45, pid=9569: Connection closed bytes=18/465

I'm not sure if this is just because it's trying more diligently to make sure it's got the latest info, and is therefore hitting locks where it didn't previously... but it's been hanging our clients and requiring manual intervention to clear. We've been removing the lock file and killing any active dovecot sessions, which seems to resolve things for a while. Just thought I'd see if this was happening to anyone else.

-Brad
Re: [Dovecot] quick question
Hi David, > -Original Message- > From: David Halik > > I've been running both patches and so far they're stable with no new > crashes, but I haven't really seen any "better" behavior, so I don't > know if it's accomplishing anything. =) > > Still seeing entire uidlist list dupes after the list goes stale. I > think that was what we were originally discussing. I wasn't able to roll the patched packages into production until this morning, but so far I'm seeing the same thing as you - no real change in behavior. I guess that brings us back to Timo's possibility number two? -Brad
Re: [Dovecot] proxy_maybe regex
David, > -Original Message- > From: dovecot-bounces+brandond=uoregon@dovecot.org [mailto:dovecot- > > There are ways of doing this in mysql, with heartbeats etc (which we've > discussed before), but then I'm back to mysql again. Maybe mysql just > has to be the way to go in this case. > > Brad, any more investigation into this? I've been mulling it over in my head, but haven't had a chance to actually build up a test environment and start playing with it yet. I got some other things (Blackboard, for those that can sympathize) dropped in my lap, and that's been consuming the majority of my time. I do like the possibility of falling back to a local connection if the database goes away. I am curious to see how it behaves if the database is corrupt, database server is down, host is offline, and so on. All that plus figuring out the best schema, queries, cleanup, etc of course ;) -Brad
Re: [Dovecot] quick question
Timo, On 1/25/10 12:31 PM, "Timo Sirainen" wrote: > > I don't think it's immediate.. But it's probably something like: > > - notice it's not working -> reconnect > - requests are queued > - reconnect fails, hopefully soon, but MySQL connect at least fails in max. > 10 seconds > - reconnect timeout is added, which doubles after each failure > - requests are failed while it's not trying to connect Hmm, that's not great. Is that tunable at all? Cursory examination shows that it's hardcoded in src/lib-sql/driver-mysql.c, so I guess not. I suppose I could also get around to playing with multi-master replication so I at least have a SQL server available at each of the sites that I have Dovecot servers... -Brad
Re: [Dovecot] quick question
Timo,

> -----Original Message-----
> From: Timo Sirainen [mailto:t...@iki.fi]
>
> On 25.1.2010, at 21.30, Brandon Davidson wrote:
> > If it could be set up to just fall back to
> > using a local connection in the event of a SQL server outage, that might
> > help things a bit. Anyone know how that might work?
>
> Well, you can always fall back to LDAP if SQL isn't working.. Just something
> like:
>
> passdb sql {
>   ..
> }
> passdb ldap {
>   ..
> }

Or just 'passdb pam { ... }' for the second one in our case, since we're using system auth with pam_ldap/nss_ldap. Is the SQL connection/query timeout configurable? It would be nice to make a very cursory attempt at proxying, and immediately give up and use a local connection if anything isn't working.

-Brad
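Spelled out for this setup, the fallback Brad describes would look something like the v1.x-style fragment below. The SQL args file name is a placeholder, and whether auth falls through cleanly on a SQL *server outage* (as opposed to a user-not-found result) is exactly what the thread is questioning:

```
# Sketch: try the SQL proxy lookup first; if it yields nothing, fall
# back to PAM (system auth via pam_ldap/nss_ldap).
# /etc/dovecot/dovecot-sql.conf is a placeholder file name.
passdb sql {
  args = /etc/dovecot/dovecot-sql.conf
}
passdb pam {
}
```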
Re: [Dovecot] quick question
David,

> Though we aren't using NFS we do have a BigIP directing IMAP and POP3
> traffic to multiple dovecot stores. We use mysql authentication and the
> "proxy_maybe" option to keep users on the correct box. My tests using an
> external proxy box didn't significantly reduce the load on the stores
> compared to proxy_maybe. And you don't have to manage another
> box/config. Since you only need to keep users on the _same_ box and not
> the _correct_ box, if you're using mysql authentication you could hash
> the username or domain to a particular IP address:
>
> SELECT CONCAT('192.168.1.', ORD(UPPER(SUBSTRING('%d', 1, 1)))) AS host,
> 'Y' AS proxy_maybe, ...
>
> Just assign IP addresses 192.168.1.48-90 to your dovecot servers. Shift
> the range by adding or subtracting to the ORD. A mysql function would
> likely work just as well. If a server goes down, move its IP. You could
> probably make pairs with heartbeat or some monitoring software to do it
> automatically.

Timo posted a similar suggestion recently, and I might try to find some time to proof this out over the next few weeks. I liked his idea of storing the user's current server in the database and proxying to that, with fallback to a local connection if they're new or their current server is unavailable. The table cleanup and pool monitoring would probably be what I'd worry most about testing.

Unfortunately we're currently using LDAP auth via PAM... so even if I could get the SQL and monitoring issues resolved, I think I'd have a hard time convincing my peers that adding a SQL server as a single point of failure was a good idea. If it could be set up to just fall back to using a local connection in the event of a SQL server outage, that might help things a bit. Anyone know how that might work?

-Brad
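The affinity scheme Brad describes (remember each user's current server, proxy_maybe to it) could be sketched as a dovecot-sql.conf fragment. The user_host table, its column names, and the connect line are all assumptions, not anything from the thread, and the password handling is elided just as in David's example:

```
# dovecot-sql.conf sketch (hypothetical: user_host table, db host, and
# its upkeep are assumptions; some external job would have to record
# and expire each user's current server).
driver = mysql
connect = host=db.example.com dbname=mail user=dovecot

# If the user has a recorded host, proxy_maybe sends them there (or
# handles the login locally when the host is this server). If no row
# comes back, auth can fall through to a later passdb for a local login.
password_query = \
  SELECT host, 'Y' AS proxy_maybe, ... \
  FROM user_host WHERE username = '%u'
```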
Re: [Dovecot] quick question
David,

> -----Original Message-----
> From: David Halik [mailto:dha...@jla.rutgers.edu]
>
> *sigh*, it looks like there still might be the occasional user-visible
> issue. I was hoping that once the assert stopped happening, and the
> process stayed alive, that the users wouldn't see their inbox disappear
> and reappear. Apparently, this is still happening occasionally.
>
> I just had a user experience this with TB 2, and after looking at the logs
> I found the good ole' stale nfs message:

Hmm, that's disappointing to hear. I haven't received any new reports from our helpdesk, so maybe it's at least less visible?

> For now they'll just have to live with it until I either get proxy_maybe
> setup, or some other solution.

Let me know if you come up with anything. I'm not sure we want to add MySQL as a dependency for our mail service... but I'm at least curious to see how things perform with session affinity. I'll add it to my long list of things to play with when I have time for such things...

-Brad
Re: [Dovecot] quick question
David,

On 1/22/10 12:34 PM, "David Halik" wrote:
>
> We currently have IP session 'sticky' on our L4's and it didn't help all
> that much. Yes, it reduces thrashing on the backend, but ultimately it
> won't help the corruption. Like you said, multiple logins will still go
> to different servers when the IPs are different.
>
> How is your webmail architecture set up? We're using imapproxy to spread
> them out across the same load balancer, so essentially all traffic
> from outside and inside gets balanced. The trick is we have an internal
> load-balanced virtual IP that spreads the load out for webmail on
> private IP space. If they were to go outside they would get NAT'd as one
> outbound IP, so we just go inside and get the benefit of balancing.

We have two webmail interfaces - one is an old in-house open-source project called Alphamail, the new one is Roundcube. Both of them point at the same VIP that we point users at, with no special rules. We're running straight round-robin L4 connection distribution, with no least-connections or sticky-client rules. We've been running this way for about 3 years, I think... I've only been here a year. We made a number of changes in sequence starting about three and a half years ago - Linux NFS to Netapp, Courier to Dovecot, mbox to Maildir+, LVS to F5 BigIP; not necessarily in that order. At no point have we ever had any sort of session affinity.

> That's where we are, and as long as the corruptions stay user-invisible,
> I'm fine with it. Crashes seem to be the only user-visible issue so far,
> with "noac" being out of the question unless they buy a ridiculously
> expensive filer.

Yeah, as long as the users don't see it, I'm happy to live with the messages in the log file.

-Brad
Re: [Dovecot] quick question
Cor, On 1/22/10 1:05 PM, "Cor Bosman" wrote: > > Pretty much the same as us as well. 35 imap servers. 10 pop servers. > clustered pair of 6080s, with about 250 15K disks. We're seeing some > corruption as well. I myself am using imap extensively and regularly have > problems with my inbox disappearing. Im not running the patch yet though. Is > 1.2.10 imminent or should i just patch 1.2.9? You guys must serve a pretty heavy load. What's your peak connection count across all those machines? How's the load? We recently went through a hardware replacement cycle, and were targeting < 25% utilization at peak load so we can lose one of our sites (half of our machines are in each site) without running into any capacity problems. We're actually at closer to 10% at peak, if that... Probably less now that we've disabled noac. Dovecot is fantastic :) -Brad
Re: [Dovecot] quick question
David,

> -----Original Message-----
> From: dovecot-bounces+brandond=uoregon@dovecot.org [mailto:dovecot-
>
> Our physical setup is 10 Centos 5.4 x86_64 IMAP/POP servers, all with
> the same NFS backend where the index, control, and Maildirs for the
> users reside. Accessing this are direct connections from clients, plus
> multiple squirrelmail webservers, and pine users, all at the same time,
> with layer 4 switch connection load balancing.
>
> Each server has an average of about 400 connections, for a total of
> around 4000 concurrent during a normal business day. This is out of a
> possible user population of about 15,000.
>
> All our dovecot servers syslog to one machine, and on average I see
> about 50-75 instances of file corruption per day. I'm not counting each
> line, since some instances of corruption generate a log message for each
> uid that's wrong. This is just me counting "user A was corrupted once at
> 10:00, user B was corrupted at 10:25", for example.

We have a very similar setup - 8 POP/IMAP servers running RHEL 5.4, Dovecot 1.2.9 (+ patches), an F5 BigIP load balancer cluster (active/standby) in an L4 profile distributing connections round-robin, maildirs on two Netapp filers (clustered 3070s with 54k RPM SATA disks), 10k peak concurrent connections for 45k total accounts.

We used to run with the noac mount option, but performance was abysmal, and we were approaching 80% CPU utilization on the filers at peak load. After removing noac, our CPU is down around 30%, and our NFS ops/sec rate is maybe 1/10th of what it used to be. The downside to this is that we've started seeing significantly more crashing and mailbox corruption. Timo's latest patch seems to have fixed the crashing, but the corruption just seems to be the cost of distributing users at random across our backend servers.
We've thought about enabling IP-based session affinity on the load balancer, but this would concentrate the load of our webmail clients, as well as not really solve the problem for users that leave clients open on multiple systems. I've done a small bit of looking at nginx's IMAP proxy support, but it's not really set up to do what we want, and it would require moving the IMAP virtual server off our load balancers and on to something significantly less supportable. Having the dovecot processes 'talk amongst themselves' to synchronize things, or go into proxy mode automatically, would be fantastic.

Anyway, that's where we're at with the issue. As a data point for your discussion with your boss:

* With 'noac', we would see maybe one or two 'Corrupt' errors a day. Most of these were related to users going over quota.
* After removing 'noac', we saw 5-10 'Corrupt' errors and 20-30 crashes a day. The crashes were highly visible to the users, as their mailbox would appear to be empty until the rebuild completed.
* Since applying the latest patch, we've seen no crashes, and 60-70 'Corrupt' errors a day. We have not had any new user complaints.

Hope that helps,

-Brad
Re: [Dovecot] 1.2.9 imap crash with backtrace
Hi David,

On 1/14/10 3:13 PM, "David Halik" wrote:
>
> FYI, we backed out of the "noac" change today. When our 20K accounts
> started coming to work, the NetApp NFS server was pushing 70% CPU usage
> and 25K NFS ops/s, which resulted in all kinds of other havoc as normal
> services started becoming slow. This server usually runs around 25% and
> 5K, so such a large increase in load was too much to handle.
>
> During the 12-hour window I didn't see a single uid error, as expected,
> but the fix was worse than the problem.

We're pretty loath to go back to noac as well. We will probably disable process log throttling (mail_log_max_lines_per_sec = 0) to increase the reindex speed until Timo comes up with a fix for the crash. This should at least help the users "get their mail back" in a more reasonable timeframe.

-Brad
Re: [Dovecot] 1.2.9 imap crash with backtrace
Timo, > -Original Message- > From: Timo Sirainen > > 1721 is not in the recs[] list, since it's sorted and the first one is 1962. > > So there's something weird going on why it's in the filename hash table, but > not in the array. I'll try to figure it out later.. I hope your move is going well, and you get settled in and your internet hooked up soon. It's got to be a rough process! Just for the record, we continue to see this crash fairly frequently with a small subset of our users, enough so that they have started to complain to the helpdesk staff about their mail 'disappearing and then reappearing.' One user in particular has a mail client left open from three hosts and has hit it 23 times in the last week, and 10 times today. If there's any more information I can collect or anything I can do to help get this resolved, please let me know! -Brad
Re: [Dovecot] dovecot-1.2.8 imap crash (with backtrace)
Timo,

On 12/23/09 8:37 AM, "David Halik" wrote:
> I switched all of our servers to dotlock_use_excl=no last night, but
> we're still seeing the errors:

We too have set dotlock_use_excl = no. I'm not seeing the "Stale NFS file handle" message any more, but I am still seeing a crash. The crashes seem to be leaving the indexes in a bad state:

Dec 23 09:07:44 oh-popmap3p dovecot: imap: user=, rip=x.x.x.x, pid=30101: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 23 09:07:44 oh-popmap3p dovecot: imap: user=, rip=x.x.x.x, pid=30101: Raw backtrace: imap [0x4d8986] -> imap [0x4d97b0] -> imap(i_fatal+0) [0x4d8c7a] -> imap [0x44f2cc] -> imap [0x44f814] -> imap [0x4500a2] -> imap(maildir_uidlist_refresh+0x9d) [0x450686] -> imap [0x44bff1] -> imap [0x44c0a8] -> imap [0x44c178] -> imap(maildir_storage_sync_init+0x7c) [0x44c6e6] -> imap(mailbox_sync_init+0x44) [0x489922] -> imap(imap_sync_init+0xab) [0x42e02b] -> imap [0x41ccc4] -> imap [0x41cd26] -> imap [0x4733be] -> imap [0x4e4171] -> imap(io_loop_handle_timeouts+0x1d) [0x4e41ce] -> imap(io_loop_handler_run+0x86) [0x4e4f29] -> imap(io_loop_run+0x3b) [0x4e4214] -> imap(main+0xa6) [0x4300af] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3217e1d994] -> imap [0x419aa9]
Dec 23 09:07:45 oh-popmap3p dovecot: dovecot: child 30101 (imap) killed with signal 6 (core dumped)
Dec 23 09:09:16 cc-popmap3p dovecot: imap: user=, rip=x.x.x.x, pid=5975: Corrupted index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: invalid record size
Dec 23 09:09:17 oh-popmap2p dovecot: imap: user=, rip=y.y.y.y, pid=3279: read() failed with index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: Input/output error
Dec 23 09:09:38 cc-popmap3p dovecot: imap: user=, rip=x.x.x.x, pid=5975: Corrupted index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: invalid record size
Dec 23 09:18:12 cc-popmap3p dovecot: imap: user=, rip=x.x.x.x, pid=5975: Corrupted index cache file /home16/cnisser/.imapidx/.INBOX/dovecot.index.cache: invalid record size

We're also seeing another odd error that seems to be unrelated to the crashes, but seemed like it bears reporting. Reading of uidlists and cache files seems to intermittently fail with EIO. It doesn't seem to tie in with anything else, and I don't see any corresponding NFS errors in the system log.

Dec 23 09:31:06 oh-popmap4p dovecot: imap: user=, rip=a.a.a.a, pid=7641: read(/home6/joet/Maildir/dovecot-uidlist) failed: Input/output error
Dec 23 09:53:17 cc-popmap2p dovecot: imap: user=, rip=b.b.b.b, pid=12840: read(/home3/catm/Maildir/dovecot-uidlist) failed: Input/output error
Dec 23 09:59:38 cc-popmap5p dovecot: imap: user=, rip=c.c.c.c, pid=13539: read() failed with index cache file /home15/kforrist/.imapidx/.INBOX/dovecot.index.cache: Input/output error

-Brad
Re: [Dovecot] dovecot-1.2.8 imap crash (with backtrace)
We've started seeing the maildir_uidlist_records_array_delete assert crash as well. It always seems to be preceded by a 'stale NFS file handle' error from the same user on a different connection.

Dec 22 10:12:20 oh-popmap5p dovecot: imap: user=, rip=a.a.a.a, pid=2439: fdatasync(/home11/apbao/Maildir/dovecot-uidlist) failed: Stale NFS file handle
Dec 22 10:12:20 oh-popmap5p dovecot: imap: user=, rip=a.a.a.a, pid=2439: /home11/apbao/Maildir/dovecot-uidlist: next_uid was lowered (2642 -> 2641, hdr=2641)
Dec 22 11:17:26 cc-popmap2p dovecot: imap: user=, rip=b.b.b.b, pid=28088: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 22 11:17:26 cc-popmap2p dovecot: imap: user=, rip=b.b.b.b, pid=28088: Raw backtrace: imap [0x4d8986] -> imap [0x4d97b0] -> imap(i_fatal+0) [0x4d8c7a] -> imap [0x44f2cc] -> imap [0x44f814] -> imap [0x4500a2] -> imap(maildir_uidlist_refresh+0x9d) [0x450686] -> imap [0x44bff1] -> imap [0x44c0a8] -> imap [0x44c178] -> imap(maildir_storage_sync_init+0x7c) [0x44c6e6] -> imap(mailbox_sync_init+0x44) [0x489922] -> imap(imap_sync_init+0xab) [0x42e02b] -> imap [0x42f107] -> imap(cmd_sync_delayed+0x1c6) [0x42f663] -> imap(client_handle_input+0x119) [0x4244d4] -> imap(client_input+0xb4) [0x424594] -> imap(io_loop_handler_run+0x17d) [0x4e5020] -> imap(io_loop_run+0x3b) [0x4e4214] -> imap(main+0xa6) [0x4300af] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3c4ea1d994] -> imap [0x419aa9]
Dec 22 11:17:26 cc-popmap2p dovecot: dovecot: child 28088 (imap) killed with signal 6 (core dumped)
Dec 22 13:16:49 cc-popmap3p dovecot: imap: user=, rip=x.x.x.x, pid=3908: fdatasync(/home2/ndunn/Maildir/dovecot-uidlist) failed: Stale NFS file handle
Dec 22 13:25:16 cc-popmap3p dovecot: imap: user=, rip=y.y.y.y, pid=3228: Panic: file maildir-uidlist.c: line 403 (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
Dec 22 13:25:16 cc-popmap3p dovecot: imap: user=, rip=y.y.y.y, pid=3228: Raw backtrace: imap [0x4d8986] -> imap [0x4d97b0] -> imap(i_fatal+0) [0x4d8c7a] -> imap [0x44f2cc] -> imap [0x44f814] -> imap [0x4500a2] -> imap(maildir_uidlist_refresh+0x9d) [0x450686] -> imap [0x44bff1] -> imap [0x44c0a8] -> imap [0x44c178] -> imap(maildir_storage_sync_init+0x7c) [0x44c6e6] -> imap(mailbox_sync_init+0x44) [0x489922] -> imap(imap_sync_init+0xab) [0x42e02b] -> imap [0x42f107] -> imap(cmd_sync_delayed+0x1c6) [0x42f663] -> imap(client_handle_input+0x119) [0x4244d4] -> imap(client_input+0xb4) [0x424594] -> imap(io_loop_handler_run+0x17d) [0x4e5020] -> imap(io_loop_run+0x3b) [0x4e4214] -> imap(main+0xa6) [0x4300af] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x3e5021d994] -> imap [0x419aa9]
Dec 22 13:25:16 cc-popmap3p dovecot: dovecot: child 3228 (imap) killed with signal 6 (core dumped)

I will note that we did not start seeing this crash until we took 'noac' out of our NFS mount options, as discussed on this list late last week. On the other hand, load on our NFS server (as measured in IOPS) has dropped by a factor of 10.

-Brad

> -----Original Message-----
> From: dovecot-bounces+brandond=uoregon@dovecot.org [mailto:dovecot-
> bounces+brandond=uoregon@dovecot.org] On Behalf Of David Halik
> Sent: Tuesday, December 22, 2009 7:48 AM
> To: dovecot@dovecot.org
> Subject: Re: [Dovecot] dovecot-1.2.8 imap crash (with backtrace)
>
> I'm seeing both of these dumps on multiple users now with 1.2.9, so I
> went ahead and did backtraces for them both.
>
> maildir_uidlist_records_array_delete panic: http://pastebin.com/f20614d8
> ns_get_listed_prefix panic: http://pastebin.com/f1420194c
>
> On 12/21/2009 12:43 PM, David Halik wrote:
> >
> > Just wanted to update you that I just upgraded all of our servers to
> > 1.2.9 and I'm still seeing the array_delete panic:
> >
> > Dec 21 12:10:16 gehenna11.rutgers.edu dovecot: IMAP(user1): Panic:
> > file maildir-uidlist.c: line 403
> > (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
> > Dec 21 12:15:12 gehenna19.rutgers.edu dovecot: IMAP(user2): Panic:
> > file maildir-uidlist.c: line 403
> > (maildir_uidlist_records_array_delete): assertion failed: (pos != NULL)
> >
> > I also started receiving a good deal of these, but only from one user
> > so far:
> >
> > Dec 21 12:16:42 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic:
> > file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed:
> > (match == IMAP_MATCH_YES)
> > Dec 21 12:18:20 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic:
> > file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed:
> > (match == IMAP_MATCH_YES)
> > Dec 21 12:18:20 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic:
> > file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed:
> > (match == IMAP_MATCH_YES)
> > Dec 21 12:19:57 gehenna14.rutgers.edu dovecot: IMAP(user3): Panic:
> > file cmd-list.c: line 242 (ns_get_listed_prefix): assertion failed:
> > (match == IMAP_MATCH_YES)
[Dovecot] Maildir on NFS - attribute caching question
Hi Timo,

We've been running Dovecot with Maildir on NFS for quite a while - since back in the 1.0 days, I believe; I'm somewhat new here. Anyway... The wiki article on NFS states that 1.1 and newer will flush attribute caches if necessary with mail_nfs_storage=yes. We're running 1.2.8 with that set, as well as mail_nfs_index=yes, mmap_disable=yes and fsync_disable=no. We have a pool of POP/IMAP and SMTP machines that are accessing the maildirs, and can't guarantee any sort of user session affinity to a particular host.

We also mount our NFS shares with 'noac', which is what I'm writing to ask about. I'd like to stop doing that for performance reasons. Do you see any issues with taking that out of the mount options, given our environment?

Thanks,
-Brad
Re: [Dovecot] 1.2.7: recs[i]->uid < rec-> uid
Timo,

> -----Original Message-----
> > I'm not really sure why these are happening. I anyway changed them from
> > being assert-crashes to just logged errors. I'm interested to find out
> > what it logs now and if there are any user-visible errors.
> > http://hg.dovecot.org/dovecot-1.2/rev/e47eb506eebd
>
> FWIW, I'm seeing this on 1.2.8 as well - just for one user so far. I'll try
> applying this patch, and report if I see anything else logged.

The user who encountered an assert crash prior to this patch now seems to be working properly. I am not aware of any errors presented to the client, but the logs show the following on the first login after application of the patch:

Nov 25 07:51:28 oh-popmap1p dovecot: imap: user=, rip=x.x.x.x, pid=13702: /home6/youm/Maildir/.Deleted Messages/dovecot-uidlist: uid=24464 exists in index, but not in uidlist
Nov 25 07:51:28 oh-popmap1p dovecot: imap: user=, rip=x.x.x.x, pid=13702: /home6/youm/Maildir/.Deleted Messages/dovecot-uidlist: uid=24520 exists in index, but not in uidlist
Nov 25 07:51:28 oh-popmap1p dovecot: imap: user=, rip=x.x.x.x, pid=13702: /home6/youm/Maildir/.Deleted Messages/dovecot-uidlist: uid=24532 exists in index, but not in uidlist

I have not seen it repeated since.

-Brad
Re: [Dovecot] 1.2.7: recs[i]->uid < rec-> uid
> -----Original Message-----
> On Sun, 2009-11-22 at 23:54 +0100, Edgar Fuß wrote:
> > I'm getting this Panic with some users on dovecot-1.2.7:
> >
> > Panic: file maildir-uidlist.c: line 1242
> > (maildir_uidlist_records_drop_expunges): assertion failed:
> > (recs[i]->uid < rec->uid)
>
> I'm not really sure why these are happening. I anyway changed them from
> being assert-crashes to just logged errors. I'm interested to find out
> what it logs now and if there are any user-visible errors.
> http://hg.dovecot.org/dovecot-1.2/rev/e47eb506eebd

FWIW, I'm seeing this on 1.2.8 as well - just for one user so far. I'll try applying this patch, and report if I see anything else logged.

For the record, the old epoll_ctl issue was resolved by the patch that reordered the fd closes. We ran 1.2.6 with that patch for quite a while and it never recurred.

-Brad
Re: [Dovecot] pop3-login: Fatal: io_loop_handle_add: epoll_ctl(1, 5):
Hi Timo,

> -----Original Message-----
> From: Timo Sirainen [mailto:t...@iki.fi]
>
> On Thu, 2009-10-29 at 12:08 -0700, Brandon Davidson wrote:
> > I haven't applied the fd leak detection patch, but I do have lsof output
> > and a core file available here:
> > http://uoregon.edu/~brandond/dovecot-1.2.6/
>
> There's no 0,12 in the lsof list.. Annoying, I can't seem to find what
> it is. 0,10 is inotify, 0,11 is epoll, but 0,12 just doesn't show up
> anywhere.

It looks like eventpoll uses dynamic minor allocation (.minor = MISC_DYNAMIC_MINOR), so it could well be that this is just what it got on his system because something else loaded and requested a dynamic minor before eventpoll did. A better check (if one is necessary) might be to see whether the minor of the leaked device differs from the minor of the epoll device right after creation.

> The core file is also pretty useless without the exact same binaries and
> libraries that produced it.

RPMs are now in that directory.

> You could also set login_process_per_connection=no and this should go
> away, because then it only creates login processes at startup and can't
> fail randomly later.

Are there any downsides to doing this?

-Brad
Re: [Dovecot] pop3-login: Fatal: io_loop_handle_add: epoll_ctl(1, 5):
We've had this reoccur twice this week. In both cases, it seems to hit a swath of machines all within a few minutes. For some reason it's been limited to the master serving POP3 only. In all cases, the logging socket at fd 5 had gone missing.

I haven't applied the fd leak detection patch, but I do have lsof output and a core file available here: http://uoregon.edu/~brandond/dovecot-1.2.6/

Timo, is there anything else I can collect to assist in debugging this? I'd rather not go back to 1.2.4, but my coworkers are becoming annoyed at having to restart the master processes every few days.

-Brad
Re: [Dovecot] pop3-login: Fatal: io_loop_handle_add: epoll_ctl(1, 5):
Hi Marco,

On 10/22/09 1:50 AM, "Marco Nenciarini" wrote:
> This morning it happened another time, another time during the daily
> cron execution.
>
> Oct 22 06:26:57 server dovecot: pop3-login: Panic: Leaked file fd 5: dev
> 0.12 inode 1005
> Oct 22 06:26:57 server dovecot: dovecot: Temporary failure in creating
> login processes, slowing down for now
> Oct 22 06:26:57 server dovecot: dovecot: child 21311 (login) killed with
> signal 6 (core dumps disabled)
>
> I have dovecot 1.2.6 with Timo's patch to check leaked descriptors.

I rebuilt the binaries on our hosts with optimization disabled, and I'm still waiting for it to reoccur so I can gather file descriptor information and a core. I don't have the leak-detect patch applied.

Let's see what Timo has to say about that log file bit. Since it seems to happen to you fairly frequently, it might be worth enabling core dumps as well?

-Brad
Re: [Dovecot] NFS random redirects
Thomas,

On 10/22/09 1:29 AM, "Thomas Hummel" wrote:
> On Wed, Oct 21, 2009 at 09:39:22AM -0700, Brandon Davidson wrote:
>> As a contrasting data point, we run NFS + random redirects with almost no
>> problems.
>
> Thanks for your answer as well.
>
> What mailbox format are you using ?

We switched to Maildir a while back due to performance issues with mbox, primarily centered around locking and the cost of rewriting the entire file when one message changes. We haven't looked back since.

Our config is pretty vanilla - users in LDAP (via pam_ldap), standard UNIX home directory layout, Sendmail on the MTA hosts.

-Brad
Re: [Dovecot] NFS random redirects
On 10/21/09 8:59 AM, "Guy" wrote:
> Our current setup uses two NFS mounts accessed simultaneously by two
> servers. Our load balancing tries to keep a user on the same server whenever
> possible. Initially we just had roundrobin load balancing which led to index
> corruption.
> The problems we've had with that corruption have simply been that some
> messages are displayed twice or not displayed at all in mail clients.
> Deletion of the corrupted index allowed Dovecot to recreate it correctly, so
> the client can't do anything about it. You'd probably have to do it manually
> or have some sort of web interface for users to do it themselves.
>
> I certainly wouldn't use NFS with multiple servers accessing it again for
> Dovecot. Looking at a clustered FS on SAN solution at the moment.

As a contrasting data point, we run NFS + random redirects with almost no problems. We host ~7TB of mail for ~45k users with a peak of 10k concurrent IMAP connections, and maybe a handful of POP3. We make absolutely no effort to ensure that connections from the same user or IP are routed to the same server.

We do occasionally see index corruption, but it is almost always related to the user going over quota and Dovecot being unable to write to the logs. If we wanted to solve this problem, we could move the indexes off to a second tier of storage; it is a very minor issue, though. Locking has not been a problem at all.

I will say that this may be a situation where you get what you pay for. We've invested a fair amount of money in our storage system (NetApp), server pool (RHEL5), and networking technology (F5 BigIP LTM). Our mail is spread across 16 volumes on two filers, and we are careful to stress-test the servers and storage backend before rolling out major upgrades.

That is not, of course, to neglect the value of things that are free - like Dovecot! Many thanks to Timo for maintaining such a wonderful piece of software!

-Brad
Re: [Dovecot] master:Error @2.0 TLS login, "Too large auth data_size sent"
On Red Hat based distros, do:

echo 'DAEMON_COREFILE_LIMIT="unlimited"' >> /etc/sysconfig/dovecot && service dovecot restart

Might be worth putting in the wiki if it's not there already?

-Brad

> -----Original Message-----
> ==> /var/log/dovecot/dovecot.log <==
> Oct 15 09:07:33 master: Info: Dovecot v2.0.alpha1 starting up (core
> dumps disabled)
>
> how do i enable coredumps?
Re: [Dovecot] Dovecot 1.2.6 segfault in imap_fetch_begin
Hi Timo,

> -----Original Message-----
> From: Timo Sirainen [mailto:t...@iki.fi]
>
> This just shouldn't be happening. Are you using NFS? Anyway this should
> replace the crash with a nicer error message:
> http://hg.dovecot.org/dovecot-1.2/rev/6c6460531514

Yes, we've got a pool of servers with Maildir on NFS and quotas enabled. Occasionally users run out of space and the indexes get corrupted or out of sync. Our Helpdesk staff will increase their quota or help them delete things, and Dovecot logs a stream of "Expunged message reappeared" and "Duplicate file entry" messages as it straightens things out. This is a fairly common occurrence given the size of our user base, so I'm assuming this is the root cause... but this is the first time I've seen Dovecot crash as a result.

-Brad
Re: [Dovecot] pop3-login: Fatal: io_loop_handle_add: epoll_ctl(1, 5):
I seem to have run into the same issue on two of our 12 Dovecot servers this morning:

Oct 15 03:41:51 oh-popmap5p dovecot: dovecot: child 7529 (login) returned error 89 (Fatal failure)
Oct 15 03:41:51 oh-popmap5p dovecot: dovecot: child 7532 (login) returned error 89 (Fatal failure)
Oct 15 03:41:51 oh-popmap5p dovecot: dovecot: Temporary failure in creating login processes, slowing down for now
Oct 15 03:41:51 oh-popmap5p dovecot: dovecot: Temporary failure in creating login processes, slowing down for now
Oct 15 03:41:51 oh-popmap5p dovecot: imap-login: Fatal: io_loop_handle_add: epoll_ctl(1, 5): Operation not permitted
Oct 15 03:41:51 oh-popmap5p dovecot: dovecot: Created login processes successfully, unstalling
Oct 15 03:41:51 oh-popmap5p dovecot: pop3-login: Fatal: io_loop_handle_add: epoll_ctl(1, 5): Operation not permitted
Oct 15 03:41:51 oh-popmap5p dovecot: dovecot: Created login processes successfully, unstalling
Oct 15 03:41:52 oh-popmap5p dovecot: dovecot: child 7576 (login) returned error 89 (Fatal failure)
Oct 15 03:41:52 oh-popmap5p dovecot: dovecot: Temporary failure in creating login processes, slowing down for now

All 12 of our servers are running Dovecot 1.2.6; all of them were upgraded from 1.2.4 and restarted at Oct 13 04:00 by a cron job that updates packages from our internal Yum repo. Only two of the servers encountered this issue.

We run two separate master processes on each host - one for IMAP, one for POP3. The IMAP service runs with a significantly increased login_max_processes_count, and continued to serve user requests. The POP3 service hit the max login process limit and stopped accepting new connections, which triggered our alerting system.

For what it's worth, I was able to kill -HUP the master processes on both machines and things seemed to return to normal. I also took the precaution of killing off the pop3 login processes to get new connections accepted.

Timo, is there any more information I could gather about this issue? We've got a fairly large pool of machines, and odds are it will crop up again if we wait long enough.

Thanks,
-Brad
Re: [Dovecot] Dovecot 1.2.6 segfault in imap_fetch_begin
Timo,

> -----Original Message-----
> -O2 compiling has dropped one stage from the backtrace, but I think this
> will fix the crash:
>
> I guess it would be time for 1.2.7 somewhat soon..

Thanks! As always, you're one step ahead of us with the bug fixes! I've got one more for you that just popped up. I'm guessing that it's also due to expunging causing sequence numbers to get mixed up, and one of the existing patches will fix it? The error from the logs is:

Panic: file mail-transaction-log-view.c: line 108 (mail_transaction_log_view_set): assertion failed: (min_file_seq <= max_file_seq)
Raw backtrace: imap [0x49e4a0] -> imap [0x49e503] -> imap [0x49db66] -> imap(mail_transaction_log_view_set+0x4ac) [0x48651c] -> imap(mail_index_view_sync_begin+0xe5) [0x480055] -> imap(index_mailbox_sync_init+0x7f) [0x45e84f] -> imap(maildir_storage_sync_init+0x100) [0x43cd30] -> imap(imap_sync_init+0x67) [0x428257] -> imap(cmd_sync_delayed+0x174) [0x4284a4] -> imap(client_handle_input+0x19e) [0x420aee] -> imap(client_input+0x5f) [0x4214df] -> imap(io_loop_handler_run+0xf8) [0x4a61f8] -> imap(io_loop_run+0x1d) [0x4a530d] -> imap(main+0x620) [0x428da0] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x31d5a1d994] -> imap [0x419a89]
dovecot: child 11758 (imap) killed with signal 6 (core dumped)

Backtrace and such here: http://uoregon.edu/~brandond/dovecot-1.2.6/bt2.txt

Thanks again,
-Brad
[Dovecot] Dovecot 1.2.6 segfault in imap_fetch_begin
We recently upgraded from Dovecot 1.2.4 to 1.2.6 (with the sieve patches, of course). Everything has been running quite well since the upgrade, and the occasional assert-crash when expunging has gone away.

However, one of our users seems to have triggered a new issue. She's been the only one to see it, but whenever she logs in, her imap process segfaults immediately. It appears that the crash is a null pointer dereference in the array library, but I'm not sure what code is at fault for calling in without checking array validity... or even if I'm on the right track.

Backtraces and some further information are available here; cores available on request. http://uoregon.edu/~brandond/dovecot-1.2.6/bt.txt

Thanks,
-Brad
[Dovecot] Dovecot 1.2.4 - assertion crash in view_lookup_seq_range
Hi all,

We have a number of machines running Dovecot 1.2.4 that have been assert-crashing occasionally. It looks like it occurs when users expunge their mailboxes, but I'm not sure, as I can't reproduce it myself. The error in the logs is:

Oct 6 07:33:09 oh-popmap3p dovecot: imap: user=, rip=, pid=11931: Panic: file mail-index-view.c: line 264 (view_lookup_seq_range): assertion failed: (first_uid > 0)
Oct 6 07:33:09 oh-popmap3p dovecot: imap: user=, rip=, pid=11931: Raw backtrace: imap [0x49e130] -> imap [0x49e193] -> imap [0x49d816] -> imap [0x47e462] -> imap(mail_index_lookup_seq+0x12) [0x47e022] -> imap(mail_index_view_sync_begin+0x36a) [0x47ffba] -> imap(index_mailbox_sync_init+0x7f) [0x45e56f] -> imap(maildir_storage_sync_init+0x100) [0x43cb70] -> imap(imap_sync_init+0x67) [0x428177] -> imap(cmd_sync_delayed+0x174) [0x4283c4] -> imap(client_handle_input+0x19e) [0x420a0e] -> imap(client_input+0x5f) [0x4213ff] -> imap(io_loop_handler_run+0xf8) [0x4a5e98] -> imap(io_loop_run+0x1d) [0x4a4fad] -> imap(main+0x620) [0x428cc0] -> /lib64/libc.so.6(__libc_start_main+0xf4) [0x323dc1d994] -> imap [0x4199f9]
Oct 6 07:33:09 oh-popmap3p dovecot: dovecot: child 11931 (imap) killed with signal 6 (core dumped)

GDB stack information and some additional details are available here: http://uoregon.edu/~brandond/dovecot-1.2.4/stack.txt

We are planning to go to 1.2.6 sometime in the next week or two, but I thought I'd try to track this particular error down just in case it's still an issue after the upgrade.

-Brad
Re: [Dovecot] Dovecot 1.2.5 segfaults.
Tom,

Tom Diehl wrote:
> I just updated to dovecot 1.2.5 on centos5. 1.2.4 did not show this problem.
> I am going to roll back for the time being but I am willing to do whatever I
> need to to fix this. This is an x86_64 system. filesystem is ext3.
> I am now seeing the following in the logs:
>
> Sep 22 17:31:06 vfoggy kernel: imap[18644]: segfault at rip rsp 7fff83e31c88 error 14

I think I just posted a patch for your issue. It's possible there is another null function call in 1.2.5, but I'd bet against it. I can provide updated RPMs for testing if you are interested.

-Brandon
[Dovecot] Segfault in quota-fs plugin
Hi all,

We recently attempted to update our Dovecot installation to version 1.2.5. After doing so, we noticed a constant stream of crash messages in our log file:

Sep 22 15:58:41 hostname dovecot: imap-login: Login: user=, method=PLAIN, rip=X.X.X.X, lip=X.X.X.X, TLS
Sep 22 15:58:41 hostname dovecot: dovecot: child 6339 (imap) killed with signal 11 (core dumps disabled)

We rolled back to version 1.2.4, and installed 1.2.5 on a test system - something we'll have to make sure to do *before* rolling new versions into production. After examining a few core files from the test system, it looks like the recent changes to the quota plugin (specifically the maildir backend's late-initialization fix) have broken the other backends. Stack trace and further debugging are available here: http://uoregon.edu/~brandond/dovecot-1.2.5/bt.txt

The relevant code seems to have been added in changeset 9380: http://hg.dovecot.org/dovecot-1.2/rev/fe063e0d7109

Specifically, quota.c line 447 does not check whether the backend implements init_limits before calling it, resulting in a null function call for all backends that do not. Since this crash would appear to affect all quota backends other than maildir, it should be pretty easy to reproduce.

I've attached a patch which seems to fix the obvious code issue. I can't guarantee it's the correct fix, since this is my first poke at the Dovecot source, but it seems to have stopped the crashing on our test host.

Regards,
-Brandon

[Attachment: dovecot-1.2.5-check-init_limits.patch]