Hey all, I've been struggling with this for a while, and it dawned on me that I can't possibly be the only one doing this. Here's the skinny:

I'm using the vanilla UW imap-2001a release to serve IMAP, IMAPS, POP, and POPS for all users at my university. All of these protocols are answered by the host email.mtu.edu. This hostname is really load balanced across 4 identical Sun 420Rs running Solaris 8 release 10/1 (patched appropriately) via F5 BigIP load balancers. Authentication is performed via pam_ldap from padl.com. The home directories for the users are NFS version 3 (non-UDP) mounted on each 420R as /export/homes/*. The physical storage is a Network Appliance F820 with two disk shelves and is only accessible by the 420Rs. That's the setup.

The problem is that, every now and again, one of the servers (usually an IMAP server) goes "crazy". Usually the "craziness" occurs when someone leaves themselves logged in at work, then attempts to check mail from home (probably with a client different from the one at work). More than likely the load balancer places them on a physically different IMAP server than the original. The new server is stuck checking mail, and the "old" server tends to end up in a loop with the following truss output repeated over and over again:

    12407: sigprocmask(SIG_SETMASK, 0xFEA0BDE4, 0x00000000) = 0
    12407: lwp_sema_post(0x001748F0) = 0
    12407: lwp_sema_wait(0x001748F0) = 0
    12407: lwp_mutex_wakeup(0xFEE05560) = 0
    12407: lwp_mutex_lock(0xFEE05560) = 0
    12407: setitimer(ITIMER_REAL, 0xFEA0B730, 0x00000000) = 0
    12407: sigprocmask(SIG_SETMASK, 0xFEE0AD70, 0x00000000) = 0
    12407: setcontext(0xFEA0B6C8)
    12407: sigprocmask(SIG_BLOCK, 0xFEDFFA00, 0x00000000) = 0
    12407: setitimer(ITIMER_REAL, 0xFEA0BC68, 0x00000000) = 0
    12407: sigprocmask(SIG_UNBLOCK, 0xFEDFFA00, 0x00000000) = 0
    12407: Received signal #14, SIGALRM, in lwp_sema_wait() [caught]
    12407: lwp_sema_wait(0xFEDFFA10) Err#91 ERESTART

Other imapd processes are possibly running as the offending user on one or more servers, but they appear to be stuck in an lwp_sema_wait or lwp_sema_cond call.
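
To confirm that a captured trace really is one fixed sequence of calls repeating, rather than something that merely looks similar, a small script can help. The following is a minimal Python sketch under stated assumptions: the function names `syscall_names` and `find_repeating_cycle` are hypothetical helpers invented for illustration, not part of truss or UW imapd, and the parser assumes each call appears on its own line in the `PID: name(args)` form shown above. Lines that are not calls, such as the "Received signal" notice, are skipped.

```python
import re

# Assumed line shape: "12407: lwp_sema_wait(0x001748F0) = 0"
# Group 1 is the PID, group 2 is the call name.
CALL_RE = re.compile(r"^\s*(\d+):\s+([A-Za-z_][A-Za-z0-9_]*)\(")


def syscall_names(truss_text, pid=None):
    """Return the call names in truss_text, in order.

    Lines that do not look like "PID: name(...)" are ignored.
    If pid is given, only lines for that process are kept.
    """
    names = []
    for line in truss_text.splitlines():
        match = CALL_RE.match(line)
        if match and (pid is None or match.group(1) == str(pid)):
            names.append(match.group(2))
    return names


def find_repeating_cycle(names, min_repeats=2):
    """Return the shortest block of calls that the whole list repeats.

    The list must cover at least min_repeats full copies of the block.
    Returns None when no such block exists.
    """
    n = len(names)
    for period in range(1, n // min_repeats + 1):
        # A list is periodic with this period if every element
        # equals the element one period later.
        if all(names[i] == names[i + period] for i in range(n - period)):
            return names[:period]
    return None
```

One possible use is to save the truss output of the looping process to a file, pass its contents to `syscall_names`, and hand the result to `find_repeating_cycle`. A non-None result gives the exact cycle to compare across servers or against a healthy imapd process.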

This does not seem to be a locking problem, especially since it's not really locking across NFS. It would appear to be a threading problem or some sort of race condition, but I've been tracing it for 4 days now and I can't find it. Has anyone seen this happen, or is anyone load balancing successfully in a similar fashion? Ideas, comments, anything could be helpful. This occurs with IMAP and IMAPS or any combination thereof. The problem is highly reproducible using Netscape Messenger 4.79 and Outlook Express.

--
Regards,
----------------------------------------------------------------
| Todd Piket                        | Email: [EMAIL PROTECTED] |
| Programmer/Analyst                | Phone: (906) 487-1720    |
| Distributed Computing Services    |                          |
| Michigan Technological University |                          |
----------------------------------------------------------------

--
-----------------------------------------------------------------
For information about this mailing list, and its archives, see:
http://www.washington.edu/imap/imap-list.html
-----------------------------------------------------------------
