Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-31 Thread Seena Kasmai





You are right Andrew, we are using ACS and I believe the version is 2.2.3. Now the info tclversion says 8.3, but the info patchlevel says 8.3.2, also the directory is aolserver/lib/tcl8.3/, so not sure what is running right now. 

I've been digging into the application, but since everything looks healthy and no error is being logged, I have no idea what could cause this. We have a lot of tracing and logging in the critical sections and so forth, but as I said, nothing shows up when the web server starts eating all the memory.

I haven't exactly found a pattern that lets me reproduce the problem, but basically if we start clicking on the pages for 10 minutes (load level ~2.5), then the problem shows up. But that doesn't tell us much, because there might be a specific section that needs to be hit in order to trigger the memory problem. Last night I tried some of our Admin pages, which touch the database heavily and use a lot of TCL; the free memory dropped 30MB (which might be normal), and now, after 12 hours or so, it is still at the same usage, so I think it has something to do with the load and amount of traffic.

Would using -z (zippy memory allocator switch) help to do more tracing/monitoring ?


We use ns_share massively, could that be the cause ?


Thanks,
Seena


P.S. On the subject of memory leaks, should I ignore the discussion I found, which I thought was similar to my problem? Could you access the messages? (I think the links I provided were broken, sorry about that.)

Here is what Kris said about the solution, which seemed to work; I have attached a couple of emails that describe the same issue.

---
On the subject of memory leaks, there is a known symptom of nsd8x
where it can grow without bound in certain circumstances. We do not
yet know the cause, but it appears to be endemic to Tcl 8.3.0. If you
use nsd76 the problem completely disappears.


Kris


---


The next release of AOLserver (which we'll be releasing very soon) has Tcl
8.3.1 which appears to have cleared up the memory leak. It does/will have a
range-checking memory allocator, too. If you have CVS access, you can use it
right now (as of 8/8/2000, in fact).


As far as an official comment, AOLserver is an open-source product.
Anyone with the means and the skill can help debug the server. I fail to
understand how a suggestion to move to nsd76 to solve an evident memory leak
in Tcl 8.3.0 equates to moving to IIS, as one writer on this mailing list
so eloquently put it.


Now, as for nsd76 growing without bound: that is news to AOL Digital City.
They run nsd76 in production on some of the busiest systems in the world and
we have yet to see a memory leak in the core AOLserver 3.0 (it's always been
in various C modules we load for our applications).


It's also important to understand the difference between RSS and SZ. The
RSS, or resident set size, is the amount of core memory being used by a
process. The SZ is the total amount of core memory plus virtual memory being
used. As any Unix administrator or developer can tell you, it is perfectly
normal and acceptable for a process to have a bigger SZ than RSS due to the
simple fact that not all data in a process' address space is used all the
time. This is very dependent on the flavor of Unix -- different systems have
different algorithms that decide when to write pages to swap. If you'd like
to read a fairly simple explanation of this, visit
http://www.freebsd.org/FAQ/misc.html, or see the books Operating System
Concepts, 3e (Silberschatz/Peterson/Galvin), Unix Internals (Vahalia), and
of course the Tanenbaum book.


Finally, about Purify. We have access to the very latest versions of
Purify. Unfortunately, Purify dumps core when encountering such innocuous
messages as UMR. We are working on getting this issue resolved and using
Purify on Irix in the meantime, and haven't found much to suggest a problem
exists in nsd76 (though we deferred testing nsd8x until Tcl 8.3.1 is put
in).


I hope this message finds understanding readers.


Regards,


Kris


---





Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Seena Kasmai



I found some old messages talking about a "Memory Leak" in AOLserver 3 (I think I'm running into the same problem, given the memory and slowness issues we have right now).

According to the answers, the source of the problem is TCL 8.0 and the solution is to upgrade the TCL library to 8.3.1: http://listserv.aol.com/cgi-bin/wa?A2=ind0008L=aolserverD=0I=-3X=67CBE07276211DD16C[EMAIL PROTECTED]P=1878 (and the solution from Kriston: http://listserv.aol.com/cgi-bin/wa?A2=ind0008L=aolserverD=0I=-3X=0C65431B8EB0007184[EMAIL PROTECTED]P=3300)

Would someone please be kind enough to tell me how to upgrade only my TCL to 8.3.1 from my AOLserver/3.3.1+ad13 w/TCL 8.3?

Thank you,
Seena

  -Original Message-
  From: Seena Kasmai [mailto:[EMAIL PROTECTED]]
  Sent: Thursday, January 30, 2003 11:09 AM
  To: [EMAIL PROTECTED]
  Subject: Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memeory problem
  
  Hello again,

  I finally put exception handling (catch) after the ns_mutex lock everywhere in the application, to make sure we always unlock the mutex. But again, after running some traffic against the web server, requests to the page that actually calls ns_mutex started getting stuck, and eventually the server locked up.

  Then I suspected that maybe we were locking that mutex simultaneously (between the 2 procs) and that this somehow creates a conflict. After removing the lock from the proc that increments the array, I could never lock up the server!! So it looks like we have some sort of conflict when locking the same mutex, although I assume the lock requests should go into some sort of queue and be released in order. I wrote a test page that only locks a mutex (and does not unlock it). I ran this page 10 times, and all of the requests got stuck in the queue; then I ran an unlock on the mutex, and every time I ran it, the first request in the queue was released. So the functionality seems to be working, but I still don't know why the server gets into trouble in that case.
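The pattern described above (take the lock, run the protected code under catch, always release) can be sketched as a small helper. This is only an illustration, assuming the stock `ns_mutex lock` and `ns_mutex unlock` commands; the proc name `with_mutex` and the names in the usage comment are hypothetical and are not taken from the application in question.

```tcl
# Hypothetical helper: run a script while holding an AOLserver mutex and
# release the mutex even if the script raises an error.
proc with_mutex {mutex script} {
    ns_mutex lock $mutex
    # catch traps errors (and break/continue/return) so that the unlock
    # below always runs.
    set code [catch {uplevel 1 $script} result]
    ns_mutex unlock $mutex
    # Hand the original outcome (normal result or error) back to the caller.
    return -code $code $result
}

# Usage inside the server (illustrative names only). The mutex handle is
# assumed to be created once at startup and shared, not created per request:
#   with_mutex $counter_lock { incr hits($page) }
```

One thing a wrapper like this does not protect against: if code that already holds the mutex calls another proc that locks the same mutex, the thread may end up waiting on itself. Whether that applies here depends on ns_mutex not being re-entrant on the platform, which is an assumption worth checking given that two procs share one mutex.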
  
  Another issue that might be related (or maybe not) is that I have noticed that, while AOLserver is running, free memory keeps shrinking, and eventually the system runs out of memory and the web server dies. Initially, when AOLserver comes up, the system has about 840MB of free memory. So far, in about every 24-hour period, free memory drops below 16MB and eventually the server crashes (and memory gets back to 875MB). Here is a snapshot of top when the server starts up:

  CPU states: 100% idle, 0.0% user, 0.0% kernel, 0.0% iowait, 0.0% swap
  Memory: 1024M real, 829M free, 58M swap in use, 4809M swap free

    PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME   CPU COMMAND
  27834 nsadmin    8  59    0   52M   47M sleep   0:45 0.02% nsd8x

  The only thing that can use a lot of memory while traffic is running on the site is that our application uses Memoize heavily, which caches the results of database queries in a list-of-lists format. But I saw the server eating 1MB of memory per second (according to "top") even when nothing was going on on the server!

  Again, please note that the same code/application and setup is running fine with AOL version 2.3.3 / TCL 7, so I can't think of any nasty bug or infinite loop that could exist. I've been looking closely at the error logs and there is no error. Any comment or idea that anyone may have about why the new version is acting differently in this situation is greatly appreciated.
  
  BTW, here is the configuration file (should I have attached it!?):

  ## Translated on Thu Jan 16 02:58:05 EST 2003
  # from .ini format with translate-ini
  ## config file for a Netra farm box

  ns_section ns/db/pools
  ns_param main main
  ns_param subquery subquery
  ns_param secondary secondary
  ns_param secondary_subquery secondary_subquery
  ns_param log log
  ns_param clickstream clickstream
  ns_param search search

  ns_section ns/db/drivers
  #ora8=ora8.2.0.1-816-.so
  ns_param ora8 /home/nsadmin/bin/ora8.so

  ns_section ns/db/pool/main
  ns_param Driver ora8
  ns_param Connections 6
  ns_param DataSource ora8_tcp
  ns_param User
  ns_param Password
  ns_param Verbose On
  ns_param ExtendedTableInfo On
  ns_param LogSQLErrors On

  ns_section ns/db/pool/subquery
  ns_param Driver ora8
  ns_param Connections 6
  ns_param DataSource ora8_tcp
  ns_param User
  ns_param Password
  ns_param Verbose On
  ns_param ExtendedTableInfo On
  ns_param LogSQLErrors On

  ns_section ns/db/pool/secondary
  ns_param Driver ora8
  ns_param Connections 6
  ns_param DataSource testds
  #DataSource=ora8_tcp
  ns_param User
  ns_param Password
  ns_param Verbose on
  ns_param ExtendedTableInfo On
  ns_param LogSQLErrors On

  ns_section ns/db/pool/secondary_subquery
  ns_param Driver ora8
  ns_param Connections 6
  ns_param DataSource testds
  #DataSource=ora8_tcp
  ns_param User
  ns_param Password
  ns_param Verbose on
  ns_param ExtendedTableInfo On
  ns_param LogSQLErrors On

  ns_section

Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Peter M. Jansson
On Thursday, January 30, 2003, at 07:58 PM, Seena Kasmai wrote:


Would someone please be kind enough to tell me how to upgrade only my TCL
to 8.3.1 from my AOLserver/3.3.1+ad13 w/TCL 8.3?


For versions of AOLserver prior to 3.5, the Tcl implementation was tightly
tied to the AOLserver, and the only way to change the version of Tcl was
to use a different AOLserver version.  Given that you're using the
3.3.1+ad13 version of AOLserver, you're probably using OpenACS (or ACS
itself), and switching to AOLserver 3.5.2 is not possible.


Another issue that might be related (or maybe not) is that I have
noticed that, while AOLserver is running, free memory keeps shrinking,
and eventually the system runs out of memory and the web server dies.
Initially, when AOLserver comes up, the system has about 840MB of free
memory. So far, in about every 24-hour period, free memory drops below
16MB and eventually the server crashes (and memory gets back to 875MB).
Here is a snapshot of top when the server starts up:


Seena, this behavior is not caused by a memory leak.  There is no leak
that serious in AOLserver.  Plenty of folks have had 3.3.1 systems that
take fair amounts of traffic and don't consume 800 MB of memory in 24
hours.  There is something in your application that is grabbing memory and
making it unavailable to the rest of the system.  Even though Tcl uses
garbage collection, Tcl can't GC memory that's being referenced (such as
in a Memoize cache).

Can you put some logging around your memoization to try to see what the
size of the memoize cache is?  Perhaps you could register a pre-auth trace
that captures the size of the memoize cache, and then register a trace
that computes the size again (after the request has run, because it's a
trace) and logs the difference?  That might give you a handle on whether one
request is particularly demanding on memory.
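The measurement half of that idea can be sketched in plain Tcl. This is a rough illustration only: the proc name `array_chars` and the array name `memo` are hypothetical, the real memoize cache may be stored under a different name or structure, and summed string lengths are just a proxy for the memory actually held. Calling it once from a pre-auth filter and once from a trace, then logging the difference, would be one way to wire it up.

```tcl
# Hypothetical helper: approximate the footprint of a Tcl array by summing
# the string lengths of all of its keys and values.
proc array_chars {arrayName} {
    upvar 1 $arrayName arr
    set total 0
    foreach {key value} [array get arr] {
        incr total [string length $key]
        incr total [string length $value]
    }
    return $total
}

# Stand-in for the memoize cache: query results stored as lists of lists.
set memo(q1) {{1 alice} {2 bob}}
set before [array_chars memo]

# Simulate a request that caches one more (large) result.
set memo(q2) [string repeat x 1000]
set after [array_chars memo]

puts "cache grew by [expr {$after - $before}] characters"
```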

Even if you were able to update your Tcl, I think that, given the
magnitude of your memory issue, you would not see a meaningful improvement.

Pete.



Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Seena Kasmai





Well, the strange thing is we never see such behavior on 2.3.3 w/TCL 7.0, and we run 4 web servers with the same code/application. That's why I can't think of any code-related issue.

I did check the size of the cache array we use for Memoizing stuff, and it's not that big at the time the server is eating the memory. We were able to re-create the problem in 20 minutes just by clicking on various pages (including TCL pages), and after we stopped clicking, memory kept getting eaten at about 2-3MB per second; then it stops for a while and then starts again (with no activity), until free memory gets down to 16MB, and then it uses the maximum swap allowed until it dies.

Anyhow, would you recommend to upgrade to 3.4.2 or 3.5.1 w/ TCL 8.3.1 ?


Thanks Pete for your follow up,
Seena







Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Nathan Folkman

In a message dated 1/30/03 9:27:18 PM, [EMAIL PROTECTED] writes:


Anyhow, would you recommend to upgrade to 3.4.2 or 3.5.1 w/ TCL 8.3.1 ?


3.5.x is Tcl 8.4.x only. I'd recommend upgrading to 3.5 if you're going to try and upgrade. It will put you in a good position to move to 4.0 once it gets released.

- Nathan


Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Peter M. Jansson
On Thursday, January 30, 2003, at 09:19 PM, Seena Kasmai wrote:


Well, the strange thing is we never see such behavior on 2.3.3 w/TCL 7.0,
and we run 4 web servers with the same code/application. That's why I
can't think of any code-related issue.


It's been a long time since I've used 2.3.3, but I can't help but think
that there are some functions in 2.3.3 that are not compatible with 3.x,
so I don't think it's possible to pick up a 2.3.3 app (which was Tcl 7.6,
not Tcl 7.0) and run it directly on 3.x without some modifications.  (Well,
no significant application, anyway.  OK, I'm sure there's a
counterexample out there somewhere.)


I did check the size of the cache array we use for Memoizing stuff, and
it's not that big at the time the server is eating the memory. We were able
to re-create the problem in 20 minutes just by clicking on various pages
(including TCL pages), and after we stopped clicking, memory kept
getting eaten at about 2-3MB per second; then it stops for a while and
then starts again (with no activity), until free memory gets down to 16MB,
and then it uses the maximum swap allowed until it dies.


That memory is going somewhere.  Perhaps not into the memoize cache; I
only pointed out that one because you identified it in your message.  I
would start generously sprinkling ns_log statements through one of the
execution paths taken by one of the pages you've identified, including
filters and traces.  One possibility is that some function call you made
under 2.3.3 is now failing, and the application is retrying the operation,
which could cause a lot of activity, since the retries will fail as well.
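If the failing-and-retrying hypothesis is right, logging every attempt would make it show up in the server log very quickly. A minimal sketch, assuming the standard `ns_log` command; the proc name `retry_logged`, its arguments, and the idea of capping the attempts are illustrative and not something the application is known to do:

```tcl
# Hypothetical helper: run a script up to max_attempts times, logging each
# failed attempt through ns_log so that runaway retries become visible.
proc retry_logged {label max_attempts script} {
    set result ""
    for {set attempt 1} {$attempt <= $max_attempts} {incr attempt} {
        # Any non-zero catch code is treated as a failed attempt.
        if {![catch {uplevel 1 $script} result]} {
            return $result
        }
        ns_log Warning "$label: attempt $attempt of $max_attempts failed: $result"
    }
    error "$label: giving up after $max_attempts attempts: $result"
}

# Usage inside the server (illustrative names only):
#   retry_logged "load user row" 3 { fetch_user_row $db $user_id }
```

A burst of identical Warning lines for one label would point at the call that behaves differently under 3.x.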

Is there database activity going on?  Perhaps if you turn on verbose SQL
logging, you'll see a pattern of queries that could point you to the
problem.


Anyhow, would you recommend to upgrade to 3.4.2 or 3.5.1 w/ TCL 8.3.1 ?


If you are using ACS and Oracle, or OpenACS, you must use a version of
AOLserver with arsDigita patches.  If you can upgrade, meaning that you
don't use any ACS stuff nor Oracle, then you want to use 3.5.1, and not
3.4.2.  The 3.5.1 release will allow you to use Tcl 8.4, which is faster,
among other things, but the main thing is that with 3.5.1, if there's a
Tcl update, you can update Tcl without updating AOLserver.  So, if you do
not use ACS or OpenACS, nor Oracle, I suggest upgrading to AOLserver 3.5.1.

Again, given the pathological behavior you're reporting, I strongly doubt
the problem is something as subtle as a bug in Tcl.  I think such a bug
would not manifest itself so dramatically, unless it segfaulted
immediately.

Pete.



Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Seena Kasmai





With 2.3.3 we use ACS and we use Oracle. Everything in the application seems to be working fine and we heavily tested all parts of the site, we don't see any Error or failure when the server starts acting strange. We fixed a few syntax changes which were not compatible with the new version, but if anything major needed to be changed, we should see some errors at least.

We sort of have our own version of ACS (we have added/modified it), given it's functioning with 3.3.1, is it possible to upgrade to 3.5.1 w/ TCL 8.4 ?







Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Andrew Piskorski
On Thu, Jan 30, 2003 at 07:58:55PM -0500, Seena Kasmai wrote:

 Would someone please be kind enough to tell me how to upgrade only my TCL to
 8.3.1 from my AOLserver/3.3.1+ad13 w/TCL 8.3?

3.3+ad13 ships with Tcl 8.3.2.  You can verify this.  If you compiled
from source, look for the directory aolserver/tcl8.3.2/.  More
conclusively, just display the results of running info tclversion
and info patchlevel in a Tcl page
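For reference, the two commands answer slightly different questions: `info tclversion` reports only the major.minor version, while `info patchlevel` includes the patch number. So "8.3" from one and "8.3.2" from the other describe the same interpreter, and a library directory named lib/tcl8.3 is consistent with both. A minimal check, written in plain Tcl (the variable names are arbitrary):

```tcl
# Both commands are core Tcl; no AOLserver API is involved.
# info tclversion -> major.minor only (for example 8.3)
# info patchlevel -> major.minor plus patch number (for example 8.3.2)
set version    [info tclversion]
set patchlevel [info patchlevel]

set report "tclversion=$version patchlevel=$patchlevel"
puts $report
```

In a Tcl page you would probably send `$report` back to the browser or write it to the server log instead of using `puts`.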

--
Andrew Piskorski [EMAIL PROTECTED]
http://www.piskorski.com



Re: [AOLSERVER] ns_mutex is likely causing our AOL web server to hung - Memory problem

2003-01-30 Thread Andrew Piskorski
On Thu, Jan 30, 2003 at 09:41:27PM -0500, Seena Kasmai wrote:
 With 2.3.3 we use ACS and we use Oracle. Everything in the application seems

 We sort of have our own version of ACS (we have added/modified it), given
 it's functioning with 3.3.1, is it possible to upgrade to 3.5.1 w/ TCL 8.4 ?

Seena, since your email address is @away.com, I figured you must be
using some flavor of ACS.  But, exactly which version of the ACS was
your software based on originally?  3.4, 3.2, maybe even 2.x?  And
have you ever upgraded to or backported from newer ACS versions?

I don't recall when the internationalization stuff went into ACS.  The
safe bet is to stick to the same versions of AOLserver that are ok
for OpenACS.  However, the fact that you were using AOLserver 2.3.3
until recently probably means that your ACS version is compatible with
ANY AOLserver 3.x version, as long as you have your Oracle driver and
any other loadable modules you need compiled for it.

The other people here are right though, there's no way the massive
memory usage problems you're seeing are due to an AOLserver or Tcl bug.
It's been a long time now, but I don't think any of the leak problems
fixed over time in 3.x were EVER that big, not even with 3.0 before
Rob Mayoff made any of his fixes at all.  Instead, sounds like
something in your application is tripping over some AOLserver 2.3
vs. 3.3 difference.

--
Andrew Piskorski [EMAIL PROTECTED]
http://www.piskorski.com