Load balancing...

Todd Piket Mon, 29 Apr 2002 13:57:09 -0700

Hey all,

I've been struggling with this for a bit, and then it dawned on me that
I can't possibly be the only one doing this.  Here's the skinny:


I'm running the vanilla UW imap-2001a release to serve IMAP, IMAPS,
POP, and POPS for all users at my university.  All of these protocols
are answered by the host email.mtu.edu.  That hostname is actually load
balanced across four identical Sun 420Rs running Solaris 8 release 10/1
(patched appropriately) via F5 BigIP load balancers.  Authentication is
performed via pam_ldap from padl.com.

The users' home directories are mounted on each 420R as
/export/homes/* using NFS version 3 (non-UDP).  The physical storage is
a Network Appliance F820 with two disk shelves, and it is accessible
only from the 420Rs.

That's the setup.  The problem is that, every now and again, one of the
servers (usually an IMAP server) goes "crazy".  This usually happens
when someone leaves themselves logged in at work and then tries to
check mail from home (probably with a different client from the one at
work).  More than likely, the load balancer places the second session
on a physically different IMAP server than the first.  The new server
gets stuck checking mail, and the "old" server tends to end up in a
loop, with the following truss output repeated over and over again:

12407:  sigprocmask(SIG_SETMASK, 0xFEA0BDE4, 0x00000000) = 0
12407:  lwp_sema_post(0x001748F0)                       = 0
12407:  lwp_sema_wait(0x001748F0)                       = 0
12407:  lwp_mutex_wakeup(0xFEE05560)                    = 0
12407:  lwp_mutex_lock(0xFEE05560)                      = 0
12407:  setitimer(ITIMER_REAL, 0xFEA0B730, 0x00000000)  = 0
12407:  sigprocmask(SIG_SETMASK, 0xFEE0AD70, 0x00000000) = 0
12407:  setcontext(0xFEA0B6C8)
12407:  sigprocmask(SIG_BLOCK, 0xFEDFFA00, 0x00000000)  = 0
12407:  setitimer(ITIMER_REAL, 0xFEA0BC68, 0x00000000)  = 0
12407:  sigprocmask(SIG_UNBLOCK, 0xFEDFFA00, 0x00000000) = 0
12407:      Received signal #14, SIGALRM, in lwp_sema_wait() [caught]
12407:  lwp_sema_wait(0xFEDFFA10)                       Err#91 ERESTART
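
In case it helps anyone compare traces from their own servers, here is
a minimal Python sketch that tallies the system calls in a saved truss
capture.  The function name and the regular expression are my own
illustration, not part of truss or imapd, and the sample is just a few
lines copied from the output above.

```python
import re
from collections import Counter

# Matches truss lines of the form "PID:  syscall(args) ...".
# Signal notices ("Received signal ...") have no "(" directly after the
# first word, so they do not match and are skipped.
TRUSS_CALL = re.compile(r"^\s*(\d+):\s+(\w+)\(")


def count_syscalls(truss_text):
    """Count how often each system call name appears in truss output."""
    counts = Counter()
    for line in truss_text.splitlines():
        match = TRUSS_CALL.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# A few lines taken from the trace in this message.
SAMPLE = """\
12407:  setitimer(ITIMER_REAL, 0xFEA0BC68, 0x00000000)  = 0
12407:  sigprocmask(SIG_UNBLOCK, 0xFEDFFA00, 0x00000000) = 0
12407:      Received signal #14, SIGALRM, in lwp_sema_wait() [caught]
12407:  lwp_sema_wait(0xFEDFFA10)                       Err#91 ERESTART
"""

if __name__ == "__main__":
    for name, n in count_syscalls(SAMPLE).most_common():
        print(name, n)
```

A process stuck in the loop should show the same handful of calls
growing at roughly the same rate between two captures.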

Other imapd processes may also be running as the affected user on one
or more servers, but they appear to be stuck in an lwp_sema_wait or
lwp_sema_cond call.  This does not seem to be a locking problem,
especially since it's not really locking across NFS.  It looks like a
threading problem or some sort of race condition, but I've been tracing
it for four days now and I can't find it.  Has anyone seen this happen,
or is anyone load balancing successfully in a similar fashion?  Any
ideas or comments would be helpful.

This occurs with IMAP, IMAPS, or any combination of the two.  The
problem is easily reproducible using Netscape Messenger 4.79 and
Outlook Express.
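
To make the suspected scenario concrete, here is a minimal Python
sketch of how one user's two sessions can land on different servers.
It assumes plain round-robin distribution, which may not be what the
BigIP actually does, and the server and user names are made up for
illustration.

```python
from itertools import cycle

# Hypothetical backend names standing in for the four 420Rs.
BACKENDS = ["imap1", "imap2", "imap3", "imap4"]


def assign_round_robin(connections, backends=BACKENDS):
    """Hand each incoming connection to the next backend in turn.

    Round-robin is an assumption, not the known BigIP policy.
    Returns a list of (user, client, backend) tuples.
    """
    pool = cycle(backends)
    return [(user, client, next(pool)) for user, client in connections]


def users_on_multiple_backends(assignments):
    """Return {user: sorted backends} for users spread over several backends."""
    seen = {}
    for user, _client, backend in assignments:
        seen.setdefault(user, set()).add(backend)
    return {user: sorted(b) for user, b in seen.items() if len(b) > 1}


if __name__ == "__main__":
    # alice stays logged in at work, then connects again from home.
    connections = [
        ("alice", "work client"),
        ("bob", "work client"),
        ("alice", "home client"),
    ]
    assignments = assign_round_robin(connections)
    print(users_on_multiple_backends(assignments))
```

With nothing tying a user to one server, the second session ends up on
a different machine from the first while both work on the same mailbox
over NFS.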

-- 
Regards,

 ------------------------------------------------------------
| Todd Piket                        | Email: [EMAIL PROTECTED] |
| Programmer/Analyst                | Phone: (906) 487-1720  |
| Distributed Computing Services    |                        |
| Michigan Technological University |                        |
 ------------------------------------------------------------
-- 
-----------------------------------------------------------------
 For information about this mailing list, and its archives, see: 
 http://www.washington.edu/imap/imap-list.html
-----------------------------------------------------------------
