Hello. A few weeks ago I found out after the fact that an important email
someone had sent me had never gotten to me. I dismissed it at first,
thinking it must be their crappy Microsoft mail client or outgoing mail
server. Surely it couldn't be our UNIX-based mail server. ;^>
But then it happened again, and then yet again: important work emails that
never reached me. The three losses were from disparate senders and
domains.
Clearly the problem was on our side. I asked around and couldn't find any
other instances of people losing mail, so my mail client (nmh) was under
suspicion, but since its POP code has been pretty stable since about the
mid-1980s (!), I decided to investigate the server side first.
With a lot of log-surfing on the server, which is running qpopper 3.0.2 on
Solaris 2.5, I figured out that the two most recent mail losses coincided
with a mail server crash (unfortunately not all that rare an occurrence due
to an apparent hardware problem we have yet to figure out).
The crash didn't occur around the time of the mail delivery, so it was not
sendmail messing up here. Instead, the crashes occurred during or shortly
after POP3 number-of-messages queries by my mail-checking scripts. The
"Stats:" line in the log file before the crash showed I had messages
waiting, but the next one after the crash showed they were gone (and I know
by the timing that I was not doing any message-pulling during these
crashes).
It looks to me like what's happening is this: my scripts do a POP3 connect
(which I do more often than anyone else, explaining why only _I_ have
noticed mail loss), and my spool is emptied out of /var/mail/<user> into
/var/mail/.<user>.pop. The machine then crashes, and once it's back up
again, my spool is zero-length and the temp_drop is overwritten by the
first POP3 check after the reboot.
I didn't pore through the code exhaustively, but I couldn't find anything
that would prevent this. Shouldn't there be code that checks whether the
temp_drop file already exists and merges its messages back into the spool
before doing anything else?
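
A minimal sketch of such a check might look like the following. This is not
qpopper code: the function name recover_temp_drop() and its two path
arguments are hypothetical, and the sketch assumes the caller already holds
whatever locks the real server takes on the spool. It also simply appends
the leftover messages to the end of the spool, which may not preserve
delivery order.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of qpopper: append any messages left in
 * temp_path back onto spool_path, then empty temp_path.
 * Assumes the caller holds the necessary mailbox locks.
 * Returns the number of bytes recovered (0 if there was nothing to do),
 * or -1 on error. */
long recover_temp_drop(const char *spool_path, const char *temp_path)
{
    struct stat st;
    char buf[8192];
    ssize_t n;
    long total = 0;
    int tfd, sfd;

    if (stat(temp_path, &st) == -1)
        return (errno == ENOENT) ? 0 : -1;  /* no leftover file at all */
    if (st.st_size == 0)
        return 0;                           /* leftover file is empty */

    tfd = open(temp_path, O_RDWR);
    if (tfd == -1)
        return -1;
    sfd = open(spool_path, O_WRONLY | O_APPEND | O_CREAT, 0660);
    if (sfd == -1) {
        close(tfd);
        return -1;
    }

    /* Copy the stranded messages onto the end of the spool. */
    while ((n = read(tfd, buf, sizeof buf)) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(sfd, buf + off, (size_t)(n - off));
            if (w <= 0) {
                close(tfd);
                close(sfd);
                return -1;
            }
            off += w;
        }
        total += n;
    }

    /* Empty the temp drop only after the spool copy is safely on disk, so
     * a crash in the middle can duplicate mail but not lose it. */
    if (n == -1 || fsync(sfd) == -1 || ftruncate(tfd, 0) == -1) {
        close(tfd);
        close(sfd);
        return -1;
    }
    close(tfd);
    close(sfd);
    return total;
}
```

The idea would be to call something like this at the start of every session,
before the spool is copied into the temp_drop.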
As I understand things, the only way to rule out any possibility of
overwriting an existing temp_drop file is to create it atomically, with
O_EXCL specified along with O_CREAT on the open() call. This is not being
done in 3.0.2, nor has it been fixed in subsequent versions. Here's line
1487 of qpopper 4.0.3's pop_dropcopy.c:

    dfd = open ( p->temp_drop, O_RDWR | O_CREAT, 0660 );

This should be:

    dfd = open ( p->temp_drop, O_RDWR | O_CREAT | O_EXCL, 0660 );

Even with that change alone, you'd prevent the mail loss that I'm seeing,
since an existing temp_drop could no longer be silently reused.
Ideally, though, there should also be appropriate checking of errno, and
if it's EEXIST, temp_drop's contents should be merged back into the mail
spool so that the stranded messages actually get delivered.
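
As a rough illustration of the errno check, here is a small sketch. The
wrapper name open_temp_drop() and its "leftover" flag are made up for this
example and don't appear in qpopper. It only reports that the file already
existed; the merge back into the spool is left to the caller.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical wrapper, not qpopper code. Creates temp_path atomically.
 * On return, *leftover is 1 if the file already existed, meaning an
 * earlier session may have left messages in it that should be merged back
 * into the spool before the file is reused. Returns an fd, or -1 with
 * errno set. */
int open_temp_drop(const char *temp_path, int *leftover)
{
    int fd;

    *leftover = 0;
    fd = open(temp_path, O_RDWR | O_CREAT | O_EXCL, 0660);
    if (fd == -1 && errno == EEXIST) {
        /* The file was already there: open it without O_CREAT or O_TRUNC
         * so its contents stay intact for the caller to recover. */
        *leftover = 1;
        fd = open(temp_path, O_RDWR);
    }
    return fd;
}
```

If the file disappears between the two open() calls, the second one fails
with ENOENT, and a caller could simply retry.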
--
Dan Harkless
SpeedGate Communications, Inc.