The short story: the SDS server seems to be working better now. We haven't totally fixed the problem, but things have improved considerably.
To summarize the problem again: When Rails runs under lighttpd, several processes are created to handle incoming requests. These processes grow as the Ruby code run by Rails creates objects. Ruby has a garbage collector, like Java, so memory within a process should be reused. For a while now the processes running the SDS have been using too much memory, some as large as 1 GB each. They start small and then grow larger and larger. The SDS uses several processes, so when each one is 1 GB that really slows down the server.

Recent events: Aaron made some changes to the latest codebase which we hoped would improve the situation. When things were getting particularly bad on Tuesday, Stephen decided we might as well upgrade the production SDS to this latest codebase; it was having so many problems that there wasn't anything to lose. So Tuesday night Stephen did the upgrade. Then on Wednesday morning we got calls from teachers saying things weren't working, and after looking at the SDS server it seemed that things were worse, not better.

Our theory has been that the XML processing is causing the large process sizes. When Stephen did the upgrade on Tuesday night, he used a pure-Ruby XML processing library called REXML. On Wednesday afternoon, since things hadn't improved, he decided to try a new version of the native XML processing library, libxml. And voila, things got better. We could not track this precisely, but it appears the processes now get large and then small again fairly quickly. My theory is that this new libxml frees its memory after use, and that this happens outside the Ruby garbage collector. The Ruby garbage collector (like Java's) seems never to give memory back once it has been allocated. So REXML was creating lots of Ruby objects, which made the process size go up, and then it never went down again.

Next steps:

Better logging of process size. Currently the change in process size is recorded on certain suspicious requests.
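The theory above, that REXML builds its whole tree out of Ruby objects while a C parser like libxml keeps most of the tree in C memory it can free directly, can be observed with Ruby's own allocation counter. A minimal sketch, not from our codebase (the XML string is a made-up stand-in for a bundle post):

```ruby
require 'rexml/document'

# Hypothetical stand-in for a bundle post: a moderately large XML document.
xml = "<bundle>" + (1..2_000).map { |i| "<item id='#{i}'>data</item>" }.join + "</bundle>"

before = GC.stat[:total_allocated_objects]
doc = REXML::Document.new(xml)   # REXML builds the tree entirely from Ruby objects
after = GC.stat[:total_allocated_objects]

puts "parsed #{doc.root.elements.size} elements"
puts "Ruby objects allocated during parse: roughly #{after - before}"
# Those objects live on the Ruby heap; even after they are garbage
# collected, the interpreter tends to keep the heap pages it grew into.
```

Running this shows tens of thousands of allocations for a couple of thousand elements, which is consistent with REXML inflating the process and the heap never shrinking afterwards.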
This is done by checking the size before the request is handled and again after. Because the size now goes up and down quickly, this approach isn't catching the ballooning processes. It could be improved by recording the process size every 50ms during the request, and then reporting the sequence of sizes at the end of the request.

Verify that certain bundle posts cause this problem. This is our theory. Aaron has a bundle which used to drive the process size very high, so we need to get this bundle and see what happens now. If it really is the bundle-posting code, then:
- the bundle posting code would be split into 2 parts: bundle receiving and bundle processing
- bundle receiving would be done by a non-Rails application which could do it more efficiently
- bundle processing would be done using a queue so it could be throttled down; this way we can control the number of processes doing the bundle processing
- try once again to reduce the memory usage of the bundle processing code

Scott

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "SAIL-Dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/SAIL-Dev?hl=en
-~----------~----~----~----~------~----~------~--~---
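The 50ms sampling idea above could be a background thread that reads the process's resident set size while the request runs, then hands back the whole sequence for logging. A rough sketch, assuming a Linux host (it reads /proc; the names `rss_kb` and `sample_rss_during` are made up, not from our code):

```ruby
# Read this process's resident set size in KB from /proc (Linux-specific).
def rss_kb
  File.foreach("/proc/#{Process.pid}/status") do |line|
    return line.split[1].to_i if line.start_with?("VmRSS:")
  end
  0
end

# Run the given block while sampling RSS every `interval` seconds,
# and return the sequence of samples for the request log.
def sample_rss_during(interval: 0.05)
  samples = [rss_kb]
  sampler = Thread.new do
    loop do
      sleep interval
      samples << rss_kb
    end
  end
  yield
  samples
ensure
  sampler&.kill
end

sizes = sample_rss_during { sleep 0.2 }  # stand-in for handling one request
puts "RSS during request (KB): #{sizes.inspect}"
```

Reporting the full sequence rather than a before/after pair should make a balloon-and-shrink within a single request visible.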

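The queue-based throttling for bundle processing might look roughly like this. A minimal in-process sketch: `process_bundle` and `BUNDLE_WORKERS` are made-up names, and a real deployment would likely need a persistent queue shared between processes rather than an in-memory one.

```ruby
BUNDLE_WORKERS = 2  # throttle: at most this many bundles processed concurrently

bundle_queue = Queue.new   # Ruby's thread-safe queue (core class)
processed = Queue.new      # collect results, just for this demo

# Hypothetical stand-in for the real (memory-hungry) processing step.
def process_bundle(bundle)
  "processed #{bundle}"
end

workers = BUNDLE_WORKERS.times.map do
  Thread.new do
    while (bundle = bundle_queue.pop)   # a nil sentinel shuts the worker down
      processed << process_bundle(bundle)
    end
  end
end

# Bundle receiving just enqueues and returns immediately; processing
# happens at whatever rate the worker count allows.
5.times { |i| bundle_queue << "bundle-#{i}" }
BUNDLE_WORKERS.times { bundle_queue << nil }
workers.each(&:join)

puts "#{processed.size} bundles processed by #{BUNDLE_WORKERS} workers"
```

The point of the design is that the worker count, not the incoming request rate, bounds how many large bundle-processing jobs run at once.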