All four of our machines running OpenACS went down simultaneously on March
31, and they all went down again as soon as we rebooted them.  We finally
traced the problem to the procedures in ad-robot_filter.tcl.  After we
commented out the lines at the end of that file (everything following
"#Check to see if the robots table needs to be updated"),
the machines stayed up and have been up ever since.

It was really weird.  We tried setting the system date to different dates:
other years, the end of other quarters ... whatever permutation we could
think of.  March 31, 2001 was the only date that caused a crash, and did so
reliably.

 Rocael Hernandez wrote on Wed, 25 Jul 2001:
> I have tracked the times when my server is unavailable and compared them
> with the aolserver logs, and I found a coincidence.
> In /var/log/messages, this is the gap between the last access and the
> restart signal:
>
> Jul 23 21:27:48 localhost su(pam_unix)[1103]: session opened for user root by nsadmin(uid=502)
> Jul 23 21:46:21 localhost syslogd 1.4-0: restart.
>
>
> then in my log file of one of the services is this:
> [23/Jul/2001:21:18:26 -0400] "GET /uptime.txt HTTP/1.0" 200 7 "" ""
> [23/Jul/2001:21:27:28][987.1026][-sched-] Notice: Running scheduled proc wd_mail_errors...
> [23/Jul/2001:21:27:28][987.1026][-sched-] Notice: Looking for errors...
> [23/Jul/2001:21:55:10][1039.1024][-main-] Notice: nsmain: AOLserver/3.2+ad12 starting
> [23/Jul/2001:21:55:10][1039.1024][-main-] Notice: nsmain: security info: uid=502, euid=502, gid=501, egid=501
>
>
> check out the times:
> Jul 23 21:27:48 last linux box log
> [23/Jul/2001:21:27:28] last server log
>
> This specific server is running OpenACS, and using nsd76 (that's because it
> has no problems with Spanish characters).
>
> I have seen other coincidences between this service and the last access
> time. Any suggestions?
>
> Thank you,
> Rocael.
>
>
>
> Patrick Giagnocavo <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I can think of two cases that you might want to check:
>
> 1.  You don't have enough swap space, and when scheduled procs run during
> the middle of the night you run out of swap and the machine dies.
>
> 2. Could it be a sometimes-bad RAM module? To check this, get the Mersenne
> prime tester program from www.mersenne.org. Ignore the setup part, just get
> it to run the tests. It will heavily stress your RAM and the CPU's cache
> and memory interface. Note: it will use all your available CPU and a few
> megs of RAM, but it only takes CPU time that no other process is using.
>
> The only other thing I could suggest would be to run a cron script that
> grabs a web page every 10 minutes or so and emails the result to you.  Get
> the nstelemetry.adp file from aolserver.com and then set up your cron
> script to grab the page via lynx and email it to you. Then you will have a
> chance at catching the error shortly before it occurs, or at least getting
> some useful diagnostics.
>
> Cordially
>
> Patrick Giagnocavo
> [EMAIL PROTECTED]
> OpenACS Hosting:  www.zill.net
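
For what it's worth, the cron idea quoted above could be sketched roughly
like this. This is only a minimal sketch, not something anyone in this
thread actually ran: it uses Python instead of lynx, and the URL, the
addresses, the SMTP host and the --send flag are all placeholders I made up.
It assumes nstelemetry.adp is installed at the top of the server's page root.

```python
import smtplib
import sys
import urllib.request
from datetime import datetime, timezone
from email.message import EmailMessage

# Every value below is a placeholder (an assumption), not from the thread.
STATUS_URL = "http://localhost/nstelemetry.adp"
MAIL_FROM = "monitor@example.com"
MAIL_TO = "admin@example.com"
SMTP_HOST = "localhost"


def fetch_page(url, timeout=30):
    """Return the page body as text, or a description of the failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # URLError and socket timeouts are OSError subclasses. A failed
        # fetch is itself useful to see, so it is reported, not raised.
        return f"FETCH FAILED: {exc!r}"


def build_report(body, url, now=None):
    """Wrap the fetched page in a mail message with a timestamped subject."""
    now = now or datetime.now(timezone.utc)
    msg = EmailMessage()
    msg["Subject"] = f"Server status {now:%Y-%m-%d %H:%M} UTC"
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content(f"Fetched {url}\n\n{body}")
    return msg


def main():
    """Fetch the status page once and mail it."""
    body = fetch_page(STATUS_URL)
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(build_report(body, STATUS_URL))


# Only send when asked explicitly, so importing the file has no side effects.
if __name__ == "__main__" and "--send" in sys.argv:
    main()
```

Run from cron every 10 minutes with a crontab entry along the lines of
"*/10 * * * * python3 /path/to/status_mail.py --send" (the path is
hypothetical). The last message received before an outage then shows the
server's state shortly before it went down.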
