Hi,
Our qmail system has been working great for over 4 months now, and we've
got it to do everything we need so far. However, there's an issue that my
boss has pointed out to me and I'm having trouble finding a good answer to
the problem.
We have a cluster of "front end" machines that handle receiving and sending
mail. Each machine has a locally mounted /var/qmail setup, and they all
share a /mailpath that is mounted via NFS. All delivery is to Maildirs so
we have no concurrency issues.
If one front end machine goes down it's no big deal; we have a layer 4
switch that will remove that one from the cluster. The problem is when the
NFS server goes down, or when a front end machine stays up but its link to
the NFS server is interrupted for whatever reason. It appears that if the
front end machines can't deliver to the NFS server, the mail will be
immediately bounced instead of being requeued with a transient error.
I wasn't sure what could be causing this at first, but then I looked at our
setup and realized it's probably failing in the /var/qmail/users/assign
lookup for qmail-local. The lines look like:
+cnmnetwork.com-:cnmnetwork.com:201:201:/mailpath/vpop/cn/cnmnetwork.com:-::
where "/mailpath/vpop/cn/cnmnetwork.com" is the root path for mail for that
domain, and therefore is where the .qmail-default file is stored.
What I think is happening: qmail-local attempts to change to the root path,
and the chdir for THAT fails because the NFS mount is down. Around line 90
in qmail-local.c:
if (chdir(dir) == -1) { if (error_temp(errno)) _exit(1); _exit(2); }
Now according to the documentation, temporary failures are supposed to exit
with code 111, but for the moment I'll assume the "error_temp" logic is
working as it's supposed to. The question then becomes: why is qmail-local
apparently interpreting the error (NFS read timeout?) as permanent rather
than temporary?
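To make the question concrete, here is a minimal sketch of the soft/hard
split I'd expect around that chdir. It is not qmail's code:
looks_temporary() and chdir_exit_code() are hypothetical names, and the
errno list is my own assumption, not what error_temp actually checks. The
100 and 111 exit codes are the hard and soft error codes described in the
qmail-command documentation.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical classifier, NOT qmail's error_temp():
   the set of errno values treated as temporary is an assumption. */
int looks_temporary(int e)
{
    switch (e) {
    case EINTR:      /* interrupted system call */
    case EIO:        /* low-level I/O error */
    case ENOMEM:     /* out of memory */
    case EAGAIN:     /* resource temporarily unavailable */
    case ETIMEDOUT:  /* e.g. a soft-mounted NFS request timing out */
#ifdef ESTALE
    case ESTALE:     /* stale NFS file handle */
#endif
        return 1;
    default:
        return 0;    /* everything else, including ENOENT, is permanent */
    }
}

/* Returns the exit code a delivery program would use after trying to
   enter dir: 0 on success, 111 for a soft error, 100 for a hard error. */
int chdir_exit_code(const char *dir)
{
    if (chdir(dir) == 0)
        return 0;
    return looks_temporary(errno) ? 111 : 100;
}
```

With a split like this, ENOENT counts as permanent, so if the path simply
doesn't exist under the bare mount point while NFS is unmounted, the chdir
would fail hard instead of timing out. I haven't confirmed that this is
what is happening here.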
Any help would be appreciated. I've gone looking through a good bit of the
source code but I'm having trouble figuring out exactly where the message
is bouncing. I'd also like to hear from anyone else who runs a setup like
this and find out how you deal with the inevitable occasional NFS server
crash. Ideally I'd like qmail to figure out that the mount is missing and
just queue without attempting delivery but I'm not sure how to do this
automagically.
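The kind of check I have in mind would look roughly like the sketch below.
It treats a directory as mounted when its device number differs from its
parent's, and returns 111 (soft error) otherwise. The function names are
hypothetical, and where to hook this into the delivery path is the part I
haven't worked out.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path looks like a mount point, 0 if it does not,
   and -1 if it cannot be examined. Heuristic only: a bind mount from
   the same filesystem would not be detected. */
int is_mount_point(const char *path)
{
    struct stat here, parent;
    char up[4096];
    int n = snprintf(up, sizeof up, "%s/..", path);

    if (n < 0 || n >= (int)sizeof up)
        return -1;                      /* path too long for the buffer */
    if (stat(path, &here) == -1)
        return -1;
    if (stat(up, &parent) == -1)
        return -1;
    if (here.st_dev != parent.st_dev)
        return 1;                       /* different filesystem than parent */
    if (here.st_ino == parent.st_ino)
        return 1;                       /* the root directory is its own parent */
    return 0;
}

/* Exit code for a pre-delivery check: 0 if the mount is present,
   111 (soft error, try again later) in every other case. */
int mount_check_exit_code(const char *path)
{
    return is_mount_point(path) == 1 ? 0 : 111;
}
```

Called with "/mailpath", this would defer instead of bounce whenever the
mount is absent. One caveat: on a hard-mounted NFS filesystem whose server
is unreachable, the stat call itself may block, so this only helps when the
filesystem is actually unmounted or the mount is soft.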
Thanks much-
shag
=====
Judd Bourgeois | CNM Network +1 (805) 520-7170
Software Architect | 1900 Los Angeles Avenue, 2nd Floor
[EMAIL PROTECTED] | Simi Valley, CA 93065
To ignore evil is to become an accomplice to it.
-- Martin Luther King, Jr.