Hi All,

Sorry for being so verbose, but I was not really sure which of these details are important for understanding the source of the problem :)
I installed a new machine into production on Thursday (19 Apr 2007). The machine has been running ever since:

$ uptime
 21:10:43 up 2 days, 10:11, 2 users, load average: 0.02, 0.04, 0.01

The machine is a dual-core Xeon 3GHz with HyperThreading enabled (so "grep processor /proc/cpuinfo | wc -l" says "4"). It is running Gentoo Linux with kernel 2.6.19-gentoo-r5, x86_64.

Now for my problem. I installed the system with Vixie-Cron, and crond appears to be running: it shows up in the process list, and ps reports it in the Ss state. So far, very normal.

According to the logs, until April 20th, 19:00, jobs ran exactly when they were scheduled (I have a job that runs every 15 minutes). From 19:00 until 23:16:55 (note the 1 minute and 55 seconds past the quarter hour) there is complete silence in the logs. Cron then resumed running, with the same offset from the quarter hour, until 00:02am on Apr 21.

A bit later, at 2:31am, ntpd reports that it had no servers available to sync against. This doesn't seem related, but I am mentioning it anyway, as cron is, after all, time based.

From that time on I occasionally see Gentoo's run-crons acting at some hours, like 2:50am, 03:07:17am (where it removes the lastrun of cron.daily) and 04:26:15am (where it ran cron.weekly). At 03:14:25am I also see ntpd synchronized again against 192.43.244.18, stratum 1. There is another run-crons at 05:40am, and weirdly enough, cron kicks back to life on Apr 21 at 07:45:41am with my regular 15-minute task, which it executes once.

After that, silence until 09:00am, when only CERTAIN tasks are executed (the every-15-minutes one is NOT), and again silence until 17:16:13, when the every-15 kicks in. It then runs at 17:33 and 17:46, and since then, silence again.

I tried restarting cron at 20:51:27, and the restart got logged. It didn't seem to have any effect. At 21:20:31 I see a run-crons, yet the every-15 still does not run.
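To make the silent periods easier to spot than by reading the log by eye, the run times of the every-15-minutes job could be checked for gaps. This is only a minimal sketch: it assumes the run times have already been pulled out of the log as datetime values (no log parsing is attempted here), the name find_gaps is made up for illustration, and only the 19:00 and 23:16:55 entries come from the log described above, while the other sample times are invented.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected=timedelta(minutes=15),
              slack=timedelta(minutes=1)):
    """Return (previous, current, gap) for each pair of consecutive runs
    that are further apart than expected + slack."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > expected + slack:
            gaps.append((prev, cur, delta))
    return gaps


if __name__ == "__main__":
    # 19:00 and 23:16:55 are the times mentioned in the post;
    # the remaining entries are illustrative filler.
    runs = [
        datetime(2007, 4, 20, 18, 30, 0),
        datetime(2007, 4, 20, 18, 45, 0),
        datetime(2007, 4, 20, 19, 0, 0),
        datetime(2007, 4, 20, 23, 16, 55),
        datetime(2007, 4, 20, 23, 31, 55),
    ]
    for prev, cur, delta in find_gaps(runs):
        print(f"gap of {delta} between {prev} and {cur}")
```

The one-minute slack is an arbitrary choice so that small scheduling jitter is not reported as a gap.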
I tried stracing the crond process, and I got the following:

Process 6796 attached - interrupt to quit
stat("crontabs", {st_mode=S_IFDIR|0750, st_size=120, ...}) = 0
stat("/etc/cron.d", {st_mode=S_IFDIR|0755, st_size=96, ...}) = 0
stat("/etc/crontab", {st_mode=S_IFREG|0644, st_size=1365, ...}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x40272b, [], SA_RESTORER|SA_RESTART, 0x2b28e1e255c0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({9, 0},

I gather this should have exited from sleeping pretty quickly, but it did not. Only after some time did I get this (the first line is the continuation of the last snippet):

{9, 0}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b28e2125e10) = 6941
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x40272b, [], SA_RESTORER|SA_RESTART, 0x2b28e1e255c0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({10, 0}, 0x7fffc8ed59c0) = ? ERESTART_RESTARTBLOCK (To be restarted)
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigreturn(0x11) = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 6941
wait4(-1, 0x7fffc8ed59dc, WNOHANG, NULL) = -1 ECHILD (No child processes)
stat("crontabs", {st_mode=S_IFDIR|0750, st_size=120, ...}) = 0
stat("/etc/cron.d", {st_mode=S_IFDIR|0755, st_size=96, ...}) = 0
stat("/etc/crontab", {st_mode=S_IFREG|0644, st_size=1365, ...}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x40272b, [], SA_RESTORER|SA_RESTART, 0x2b28e1e255c0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({30, 0},

I ran the same strace on cron on my computer at home. There it sleeps for a different period (namely 60), but that doesn't look important: the sleep on my home computer ends rather quickly, and I see many scans of crontabs, /etc/cron.d and /etc/crontab, which I guess is normal behavior.
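To check whether a sleep on this machine really lasts longer than requested, independent of cron, something like the following could be run on both machines and compared. It is a rough sketch under assumptions: measure_sleep is a made-up helper, and Python's time.sleep is not guaranteed to issue the same nanosleep() call that crond makes, so it is only a stand-in for the calls seen in the trace above.

```python
import time


def measure_sleep(requested):
    """Sleep for `requested` seconds and return how long that took
    according to the monotonic clock and the wall clock."""
    mono_start = time.monotonic()
    wall_start = time.time()
    time.sleep(requested)
    mono_elapsed = time.monotonic() - mono_start
    wall_elapsed = time.time() - wall_start
    return mono_elapsed, wall_elapsed


if __name__ == "__main__":
    REQUESTED = 1.0  # seconds; arbitrary value chosen for a quick check
    for _ in range(3):
        mono, wall = measure_sleep(REQUESTED)
        print(f"requested {REQUESTED:.1f}s  "
              f"monotonic {mono:.3f}s  wall clock {wall:.3f}s")
```

If both clocks are driven by the same misbehaving time source, the numbers may look fine even when the sleep is visibly too long, so it is worth timing one run against a watch as well.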
So I am thinking that maybe something is wrong with nanosleep(). But what can it be? My guess is that it is related to time drifting between all those CPUs, but in that case, why did it work fine in the beginning?

I haven't tried rebooting the machine, which might solve the problem (either temporarily or for good), because I want to solve it, or at least understand it, before a reboot possibly makes it go away.

So, any hint from you guys will be greatly appreciated.

Thanks,
-- Shimi