Updates following some off-line discussions and debugging with John on
IRC.  I've cc'd gshapiro@ because the problem appears to be sendmail,
rather than the FreeBSD kernel.

On 2010-Feb-23 12:35:22 +1100, John Marshall <john.marsh...@riverwillow.com.au> 
wrote:
>Environment: sendmail 8.14.4 on FreeBSD 8.0-RELEASE-p2

Note that this is stock sendmail.org sendmail, not the sendmail in either
the base system or the port.

>I posted about this in comp.mail.sendmail and was told...
>
>> sleep() should be one of these calls:
>> 
>>         if (njobs == 0 && WorkGrp[wgrp].wg_lowqintvl < MIN_SLEEP_TIME)
>>                 sleep(MIN_SLEEP_TIME);
>>         else if (WorkGrp[wgrp].wg_lowqintvl <= 0)
>>                 sleep(QueueIntvl > 0 ? QueueIntvl : MIN_SLEEP_TIME);
>>         else
>>                 sleep(WorkGrp[wgrp].wg_lowqintvl);

Whilst it's true that the code calls sleep(), it's not calling
sleep(3) in the FreeBSD libc.  Instead it's calling a sleep() defined
in libsm/clock.c - which is a horrible maze of #ifdefs.

John has pre-processed that code and the result is at:
http://www.riverwillow.net.au/~john/sm/clock.preprocessed

At a quick look, the code is broken: sm_seteventm() generates a
one-off timer using setitimer(2), which will send SIGALRM when it
expires.  sm_releasesignal() then unblocks SIGALRM.  In theory, the
SIGALRM could be delivered anywhere after the (!SmSleepDone) test and
before pause() is called - in which case the signal has already been
handled by the time pause() is entered, the wakeup is lost and pause()
will sleep until some unrelated signal arrives (possibly forever).

On 2010-Feb-24 08:13:06 +1100, John Marshall <john.marsh...@riverwillow.com.au> 
wrote:
>My ktrace file was created with 'ktrace -g 48501'.  I have the result of
>'kdump -R -p 48504' available at:
>
> <http://www.riverwillow.net.au/~john/8_0/rwsrv04_201002240725.kdump.gz>

The syscall pattern near the end of this file is significantly different
from that elsewhere in the file - with gettimeofday(), sigprocmask() and
sigsuspend() looping fairly rapidly.  Interestingly, sigsuspend() is
returning EINTR but no signal is reported.  I'm not sure what could
cause this.

This syscall pattern looks like the while() loop in sendmail's sleep().
The loop does appear to exit on that occasion but not on the following
one, though the reason for the difference is unclear.

Overall, there appears to be a race condition in sendmail, and
something in the 8.0 signal handling seems to make this race easier
to lose.

Going back to the original clock.c source code, the other thing that
is obvious is the HAVE_NANOSLEEP block - if this were active, sleep()
would call nanosleep(2) and the whole signal mess would be avoided.
It's not clear when that code was added but clock.c has not been
touched for many years.  The sendmail in FreeBSD-8.0 contains no
other reference to HAVE_NANOSLEEP.  sendmail 8.14.4 (in 8-STABLE)
has HAVE_NANOSLEEP enabled on Solaris 11 only.
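
For comparison, a sleep built directly on nanosleep(2) needs no timer,
handler or flag at all.  The following is only a sketch of that
approach, not the contents of the HAVE_NANOSLEEP block in clock.c:
nanosleep_seconds() is a made-up name, and resuming the sleep after an
unrelated signal is my assumption about the desired behaviour.

```c
#include <errno.h>
#include <time.h>

/*
 * Sleep for `seconds` using nanosleep(2).  If a signal interrupts the
 * call, carry on sleeping for the time that was left.
 * Returns 0 on success, -1 on any error other than EINTR.
 */
int nanosleep_seconds(unsigned int seconds)
{
    struct timespec req, rem;

    req.tv_sec = (time_t)seconds;
    req.tv_nsec = 0;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;
        req = rem;    /* rem holds the unslept time after EINTR */
    }
    return 0;
}
```

Because the kernel tracks the remaining time itself, there is no
test-then-wait window for a signal to fall into.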

Is there any reason why HAVE_NANOSLEEP is not defined for FreeBSD?
Looking back through the commit logs, nanosleep(2) was implemented in
sys/kern/kern_time.c v1.23 on Thu May 8 14:16:25 1997 UTC - that's
just before RELENG_2_2.

-- 
Peter Jeremy
