The short story:

The SDS server seems to be working better now. We haven't totally fixed the 
problem, but it seems to be much better.


To summarize the problem again:

When Rails is run under lighttpd, several processes are created to handle 
incoming requests. These processes grow in size as the Ruby code run by 
Rails creates objects. Ruby has a garbage collector like Java, so the 
memory in the processes should be reused.

For a while now the processes running the SDS have been using too much 
memory, some as large as 1 GB each. They start small and then get 
larger and larger. The SDS uses several processes, so when each one is a 
gigabyte that really slows down the server.
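As a side note, you can watch MRI's collector from inside a process with the standard `GC` module. Its counters describe the Ruby heap, not the OS-level process size (RSS) we see ballooning, which is part of why the two numbers can disagree. A minimal illustration, nothing SDS-specific:

```ruby
# Illustrative only: force a collection and read MRI's collector counters.
# GC.stat describes the Ruby heap; it says nothing about whether the
# process has handed memory back to the operating system.
before = GC.count
GC.start                     # force a full garbage collection
stats  = GC.stat             # Hash of collector counters
puts "collections so far: #{stats[:count]}"
puts "ran #{GC.count - before} collection(s) just now"
```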

Recent events:

Aaron made some changes to the latest codebase which we hoped would 
improve the situation. When things were getting particularly bad on 
Tuesday, Stephen decided we might as well upgrade the production SDS to 
this latest codebase; it was having so many problems that there 
wouldn't be anything to lose. So Tuesday night Stephen did the upgrade. 
Then on Wednesday morning we got calls from teachers saying things 
weren't working, and after looking at the SDS server it seemed that 
things were worse, not better.

Our theory has been that the XML processing is causing the large process 
size. So when Stephen did the upgrade on Tuesday night, he used a pure-Ruby 
XML processing library called REXML. Then on Wednesday afternoon, 
since things hadn't gotten better, he decided to try a new version 
of the native XML processing library, libxml. And voila, things got 
better. We could not track this precisely, but it appears the processes 
get large and then small again pretty quickly. My theory is that this 
new libxml frees its memory after it is used, and that this happens 
outside of the Ruby garbage collector. It seems the Ruby garbage 
collector (like Java's) never gives memory back once it has been allocated. 
So REXML was creating lots of objects, which made the process size go up, 
and then it never went down again.
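To illustrate the REXML side of this theory: since REXML is pure Ruby, every element in a parsed document becomes a full Ruby object that the garbage collector has to track. A small demonstration with a made-up document (standing in for a bundle post):

```ruby
require 'rexml/document'

# Build a made-up 1000-element document, roughly standing in for a bundle
# post. Every element REXML parses becomes a Ruby object on the heap.
xml = '<bundle>' + (1..1000).map { |i| %(<item id="#{i}">x</item>) }.join + '</bundle>'
doc = REXML::Document.new(xml)

# Count the live REXML element objects the parse created.
count = ObjectSpace.each_object(REXML::Element).count
puts "REXML element objects on the heap: #{count}"   # at least 1001
```

A native parser builds its tree in C-allocated memory instead, which is exactly the memory that can be freed outside the Ruby collector.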

Next steps:

Better logging of process size. Currently the change in process size is 
recorded on certain suspicious requests, by checking the size before the 
request is handled and again after it is handled. Because the size now 
goes up and down quickly, this approach isn't picking up the ballooning 
processes. It could be improved by recording the process size every 50 ms 
during the request, and then reporting the sequence of sizes at the end 
of the request.
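A sketch of what that sampler could look like. All names here are hypothetical, and reading RSS from /proc is Linux-specific (with a `ps` fallback for other Unixes):

```ruby
# Hypothetical sketch: sample this process's resident size every 50 ms in a
# background thread while a block (the request) runs, and return the
# sequence of sizes along with the block's result.
def rss_kb
  if File.readable?('/proc/self/status')
    # VmRSS line from /proc/self/status, reported in kB (Linux)
    File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i
  else
    `ps -o rss= -p #{Process.pid}`.to_i   # portable-ish fallback
  end
end

def with_memory_samples(interval = 0.05)
  samples = []
  done    = false
  sampler = Thread.new do
    until done
      samples << rss_kb
      sleep interval
    end
  end
  result = yield
  done   = true
  sampler.join
  [result, samples]
end

# Stand-in for handling a request: allocate some strings, then finish.
result, sizes = with_memory_samples do
  junk = Array.new(50_000) { 'x' * 100 }
  sleep 0.2
  junk.length
end
puts "request result: #{result}, #{sizes.length} size samples"
```

In the real server the block would wrap the request handler, and the sequence of sizes would go to the log at the end of the request.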

Verify that certain bundle posts cause this problem. This is our theory. 
Aaron has a bundle which used to cause the size to go very high, so we 
need to get this bundle and see what happens now.

If it really is the bundle posting code, then:
- the bundle posting code would be split into two parts: bundle receiving 
and bundle processing
- bundle receiving would be done by a non-Rails application which could 
do it more efficiently
- bundle processing would be done through a queue so it could be throttled 
down; this way we can control the number of processes doing the bundle 
processing
- try once again to reduce the memory usage of the bundle processing code
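The queue idea can be sketched in plain Ruby with `Queue` and a fixed worker pool. Everything below is hypothetical, a stand-in for the real receiving/processing code, and single-process where the real design would use separate processes:

```ruby
# Hypothetical sketch of the proposed split: receiving is a cheap enqueue,
# and a fixed-size pool of workers drains the queue, so the number of
# threads doing the memory-heavy processing stays bounded and throttleable.
WORKER_COUNT = 2

def process_bundle(bundle)
  bundle.upcase   # stand-in for the real (XML-heavy) bundle processing
end

jobs    = Queue.new
results = Queue.new

workers = WORKER_COUNT.times.map do
  Thread.new do
    while (bundle = jobs.pop)   # nil is the shutdown sentinel
      results << process_bundle(bundle)
    end
  end
end

# "Bundle receiving": just push the raw bundle and return immediately.
%w[bundle-a bundle-b bundle-c].each { |b| jobs << b }

WORKER_COUNT.times { jobs << nil }  # one sentinel per worker
workers.each(&:join)
puts "processed #{results.size} bundles"   # processed 3 bundles
```

Throttling then becomes a one-line knob: WORKER_COUNT caps how much bundle processing can happen at once, no matter how fast bundles arrive.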

Scott

You received this message because you are subscribed to the Google Groups 
"SAIL-Dev" group.
http://groups.google.com/group/SAIL-Dev?hl=en