I've been having what seemed to be random crashes that left nothing in the logs, until I noticed that they always happen just after 2:02 (while my daily cron jobs are running), so they're not random after all. Here are the last 3 crashes, from April 10, May 6 and May 9. You can see that there are no log entries after 2:02 until I do a hard reboot:

----- 1 ------
Apr 10 01:58:01 shlomo1 crond[9786]: (root) CMD (/data1/myscripts/myADSLtest)
Apr 10 02:00:01 shlomo1 crond[9811]: (root) CMD (/data1/myscripts/myADSLtest)
Apr 10 02:00:01 shlomo1 crond[9812]: (root) CMD (/data1/myscripts/myAlive)
Apr 10 02:01:01 shlomo1 crond[9830]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
Apr 10 02:02:01 shlomo1 crond[9845]: (root) CMD (/data1/myscripts/myADSLtest)
Apr 10 02:02:01 shlomo1 crond[9846]: (root) CMD (nice -n 19 time run-parts /etc/cron.daily)
Apr 10 02:02:02 shlomo1 anacron[9856]: Updated timestamp for job `cron.daily' to 2008-04-10
Apr 10 02:02:02 shlomo1 /etc/cron.daily/awffull[9859]: the /tmp/awffull.lock file was found indicating an error. Maybe awffull is still running...
Apr 10 02:02:03 shlomo1 logrotate: ALERT exited abnormally with [1]
Apr 10 05:38:51 shlomo1 syslogd 1.4.2: restart.
Apr 10 05:38:51 shlomo1 kernel: klogd 1.4.2, log source = /proc/kmsg started.
Apr 10 05:38:51 shlomo1 kernel: Linux version 2.6.22.12-desktop586-1mdv ([EMAIL PROTECTED]) (gcc version 4.2.2 20070909 (prerelease) (4.2.2-0.RC.1mdv2008.0)) #1 SMP Tue Nov 20 08:09:17 EST 2007

----- 2 ------
May 6 01:58:01 shlomo1 crond[21897]: (root) CMD (/data1/myscripts/myADSLtest)
May 6 02:00:01 shlomo1 crond[21916]: (root) CMD (/data1/myscripts/myAlive)
May 6 02:00:01 shlomo1 crond[21917]: (root) CMD (/data1/myscripts/myADSLtest)
May 6 02:01:01 shlomo1 crond[21937]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
May 6 02:02:01 shlomo1 crond[21951]: (root) CMD (/data1/myscripts/myADSLtest)
May 6 02:02:01 shlomo1 crond[21952]: (root) CMD (nice -n 19 time run-parts /etc/cron.daily)
May 6 02:02:02 shlomo1 anacron[21962]: Updated timestamp for job `cron.daily' to 2008-05-06
May 6 02:02:02 shlomo1 /etc/cron.daily/awffull[21965]: the /tmp/awffull.lock file was found indicating an error. Maybe awffull is still running...
May 6 02:02:03 shlomo1 logrotate: ALERT exited abnormally with [1]
May 6 04:47:50 shlomo1 syslogd 1.4.2: restart.
May 6 04:47:50 shlomo1 kernel: klogd 1.4.2, log source = /proc/kmsg started.
May 6 04:47:50 shlomo1 kernel: Linux version 2.6.22.12-desktop586-1mdv ([EMAIL PROTECTED]) (gcc version 4.2.2 20070909 (prerelease) (4.2.2-0.RC.1mdv2008.0)) #1 SMP Tue Nov 20 08:09:17 EST 2007

----- 3 ------
May 9 01:58:01 shlomo1 crond[27692]: (root) CMD (/data1/myscripts/myADSLtest)
May 9 02:00:01 shlomo1 crond[27708]: (root) CMD (/data1/myscripts/myAlive)
May 9 02:00:01 shlomo1 crond[27709]: (root) CMD (/data1/myscripts/myADSLtest)
May 9 02:01:01 shlomo1 crond[27726]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
May 9 02:02:01 shlomo1 crond[27741]: (root) CMD (/data1/myscripts/myADSLtest)
May 9 02:02:01 shlomo1 crond[27742]: (root) CMD (nice -n 19 time run-parts /etc/cron.daily)
May 9 02:02:01 shlomo1 anacron[27752]: Updated timestamp for job `cron.daily' to 2008-05-09
May 9 02:02:01 shlomo1 /etc/cron.daily/awffull[27755]: the /tmp/awffull.lock file was found indicating an error. Maybe awffull is still running...
May 9 02:02:02 shlomo1 logrotate: ALERT exited abnormally with [1]
May 9 05:36:05 shlomo1 syslogd 1.4.2: restart.
May 9 05:36:05 shlomo1 kernel: klogd 1.4.2, log source = /proc/kmsg started.
May 9 05:36:05 shlomo1 kernel: Linux version 2.6.22.12-desktop586-1mdv ([EMAIL PROTECTED]) (gcc version 4.2.2 20070909 (prerelease) (4.2.2-0.RC.1mdv2008.0)) #1 SMP Tue Nov 20 08:09:17 EST 2007

The common factor "seems" to be a problem with logrotate, but that's not the cause. Here's an example of logrotate aborting and NOT causing a crash. In fact, it seems logrotate gives that error every day. The "strange" thing is that all the logs seem to be properly rotated, despite the error message.

May 7 01:58:01 shlomo1 crond[2870]: (root) CMD (/data1/myscripts/myADSLtest)
May 7 02:00:01 shlomo1 crond[2888]: (root) CMD (/data1/myscripts/myAlive)
May 7 02:00:01 shlomo1 crond[2889]: (root) CMD (/data1/myscripts/myADSLtest)
May 7 02:01:01 shlomo1 crond[2906]: (root) CMD (nice -n 19 run-parts /etc/cron.hourly)
May 7 02:02:01 shlomo1 crond[2920]: (root) CMD (/data1/myscripts/myADSLtest)
May 7 02:02:01 shlomo1 crond[2921]: (root) CMD (nice -n 19 time run-parts /etc/cron.daily)
May 7 02:02:01 shlomo1 anacron[2931]: Updated timestamp for job `cron.daily' to 2008-05-07
May 7 02:02:01 shlomo1 /etc/cron.daily/awffull[2934]: the /tmp/awffull.lock file was found indicating an error. Maybe awffull is still running...
May 7 02:02:02 shlomo1 logrotate: ALERT exited abnormally with [1]
May 7 02:04:01 shlomo1 crond[3112]: (root) CMD (/data1/myscripts/myADSLtest)
May 7 02:06:01 shlomo1 crond[3138]: (root) CMD (/data1/myscripts/myADSLtest)
May 7 02:08:02 shlomo1 crond[3153]: (root) CMD (/data1/myscripts/myADSLtest)
May 7 02:09:02 shlomo1 crond[3164]: (root) CMD ([ -d /var/lib/php ] && find /var/lib/php/ -type f -mmin +$(/usr/lib/php/maxlifetime) -print0 | xargs -r -0 rm)

So, how do I find out what's causing the crash? My guess is that it's one of the daily cron jobs, but how can I find out which? Since the crashes happen at irregular intervals (sometimes 3 or 4 weeks apart and sometimes 2 days apart), it's not a simple matter of disabling some of the jobs to see if that solves the problem. That approach could take months.

BTW, here's a list of the daily cron jobs. My guess is that the problem is a job running after logrotate, so that leaves 8 possibilities.

[EMAIL PROTECTED] cron.daily]$ ls -l
total 56
-rwxr-xr-x 1 root root  276 2007-08-17 02:56 0anacron*
-rwxr-xr-x 1 root root 2575 2007-09-01 13:56 awffull*
-rwxr-xr-x 1 root root  396 2007-11-16 23:00 getskyepg*
-rwxr-xr-x 1 root root  400 2007-08-28 21:44 hylafax*
-rwxr-xr-x 1 root root   37 2007-01-28 19:59 logcheck*
-rwxr-xr-x 1 root root  180 2007-07-19 23:57 logrotate*
-rwxr-xr-x 1 root root  410 2007-08-31 01:48 makewhatis.cron*
-rwxr-xr-x 1 root root  137 2007-09-24 17:26 mlocate.cron*
lrwxrwxrwx 1 root root   27 2008-01-02 05:56 msec -> /usr/share/msec/security.sh*
-rwxr-xr-x 1 root root  431 2006-02-05 22:56 my-aa-findlargefiles*
lrwxrwxrwx 1 root root   26 2008-01-02 20:16 myRPMlist -> /data1/myscripts/myRPMlist*
-rwxr-xr-x 1 root root  167 2005-01-10 12:51 reoback*
-rwxr-xr-x 1 root root  118 2007-10-02 12:09 rpm*
-rwxr-xr-x 1 root root  101 2007-11-20 19:55 tetex.cron*
-rwxr-xr-x 1 root root  371 2007-08-08 18:35 tmpwatch*
-rwxr-xr-x 1 root root  315 2007-09-05 13:24 tripwire-check*

Can anyone suggest how to debug this problem? I did think of one idea and I'd like comments or suggestions. I could add several cron jobs to run after each of the "real" jobs (or add a line to each existing job) to send myself an e-mail, so that I know which jobs have run and can see when the e-mails stop coming. However, I'm not sure whether the daily cron jobs can overlap. For example, is it possible that job number 2 starts before job number 1 has ended? If so, then my idea probably wouldn't work.

--
Shlomo Solomon
http://the-solomons.net
Sent by KMail (KDE 3.5.7) on LINUX Mandriva 2008.0
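
The marker idea described above could be sketched roughly as follows. This is only a sketch under assumptions, not a tested fix: it replaces the e-mails with lines appended to a local trace file, and it calls sync after every line so that the marker has a chance of reaching the disk before a hard crash. The function names (trace, run_traced) and the script name trace-parts are hypothetical. It runs the jobs strictly one after another in shell glob order, which sidesteps the overlap question for the wrapper itself, but a job that puts work into the background would still be reported as ended too early. Unlike the real run-parts, it makes no attempt to skip backup files such as *~ or *.rpmsave.

```shell
#!/bin/sh
# trace-parts (hypothetical name): run every executable in a directory,
# one at a time, writing START/END markers to a trace file.

trace() {
    # $1 = trace file, remaining arguments = message text
    tracefile=$1
    shift
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$tracefile"
    # Flush filesystem buffers so the marker is more likely to survive a crash.
    sync
}

run_traced() {
    # $1 = directory holding the jobs, $2 = trace file
    dir=$1
    log=$2
    for job in "$dir"/*; do
        # Skip anything that is not an executable regular file
        # (symlinks such as msec are followed).
        [ -f "$job" ] && [ -x "$job" ] || continue
        name=$(basename "$job")
        trace "$log" "START $name"
        rc=0
        "$job" || rc=$?
        trace "$log" "END $name rc=$rc"
    done
}

# Only run when called with exactly two arguments: <directory> <trace file>
if [ "$#" -eq 2 ]; then
    run_traced "$1" "$2"
fi
```

Assuming the script were saved as, say, /usr/local/sbin/trace-parts (a made-up path), the "run-parts /etc/cron.daily" part of the existing crontab entry could be swapped for "/usr/local/sbin/trace-parts /etc/cron.daily /var/log/cron-daily-trace.log". After the next crash and reboot, the last START line in the trace file without a matching END line would point at the job that was running when the machine went down.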
