Hi Sergio,

On Feb 20, 2010, at 13:57 , Sergio Ballestrero wrote:

> The desktop systems show a very slow (but not uniformly slow) memory leak in 
> Xorg, growing up to 3GB, sometimes even 5GB, and finally bringing the systems 
> to some kind of crash - usually just the GUI freezes, but sometimes OOM 
> Killer gets badly in the way and the whole system is left in a bad state and 
> needs to be rebooted.

You should be able to avoid the OOM Killer (and the other weird effects typical 
for true OOM situations) if you "echo 2 > /proc/sys/vm/overcommit_memory". See 
/usr/share/doc/kernel-doc-2.6.18/Documentation/vm/overcommit-accounting. This 
requires sufficient swap space though, and "sufficient" means "at least as much 
as the sum of VSZ for all processes on the system". Which can be a lot. But you 
should get a meaningful error message instead of a machine that needs to be 
rebooted the next time you actually run out of (virtual) memory.
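
To see how close a desk is to that limit, /proc/meminfo has CommitLimit and 
Committed_AS. In mode 2 the kernel derives CommitLimit from swap plus 
vm.overcommit_ratio percent of RAM (50 by default). A minimal Python sketch, 
untested on your systems; the function names are just mine:

```python
def parse_meminfo(text):
    """Return the numeric fields of /proc/meminfo as a dict (values in kB)."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, rest = line.split(":", 1)
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def commit_headroom_kb(fields):
    """kB that can still be committed before allocations start to fail."""
    return fields["CommitLimit"] - fields["Committed_AS"]


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file; call this on the machine itself."""
    f = open(path)
    try:
        return parse_meminfo(f.read())
    finally:
        f.close()
```

Something like "print(commit_headroom_kb(read_meminfo()))" from cron would 
tell you how much is left before malloc starts returning errors.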

> Sometimes we can see the problem before it becomes critical and request the 
> users to restart X11, but this is not a welcome procedure.

If it's a true leak (memory being allocated and later simply forgotten and 
never accessed again), simply adding more swap space may allow normal operation 
for periods long enough to not make this an actual problem. You could even do 
this on the fly without interruption if you have some LVM capacity left 
(swapfiles work too). And then routinely reboot the PCs during machine 
interventions or on some other occasion when shift crews are bored anyway ;-)
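
For the swapfile variant, the usual sequence is dd, chmod, mkswap, swapon, 
all as root. A small sketch that only prints the commands so they can be 
reviewed first; the path and size are made-up examples, not a recommendation:

```python
def swapfile_commands(path, size_mb):
    """Return the commands (as argv lists) to create and enable a swapfile."""
    if size_mb <= 0:
        raise ValueError("size_mb must be positive")
    return [
        # Allocate the file by writing zeros; swapfiles must not be sparse.
        ["dd", "if=/dev/zero", "of=%s" % path, "bs=1M", "count=%d" % size_mb],
        # Swap contents should not be readable by ordinary users.
        ["chmod", "600", path],
        ["mkswap", path],
        ["swapon", path],
    ]


# Hypothetical example: 4 GB of extra swap in /var.
for argv in swapfile_commands("/var/swapfile-extra", 4096):
    print(" ".join(argv))
```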

> Simply closing the applications (either gently or by killing) does not let 
> Xorg release the occupied memory. Even logging out (without restarting X11) 
> does not free the memory allocated by Xorg. 
> 
> It takes anything between one week and more than 4 weeks for this to happen 
> (depending on how heavily the specific desk is used and which applications 
> are run on it), so it's very hard to correlate to a specific application or 
> usage pattern, and we are not finding a way to reproduce it in a shorter 
> time, to be able to debug it. 
> 
> xrestop only shows <20 entries with 10~20 MB pixmaps allocated, nowhere near 
> to justifying the 3GB or more. The memory map from /proc/<Xorg pid>/smaps 
> does show a heap of >650MB (not very different from a freshly started Xorg) 
> and many allocated memory blocks, some as large as 800MB, but these are 
> unlabeled and I don't see a way to correlate them with something useful. As 
> you can imagine, running Xorg under Valgrind on a production system is 
> basically out of the question, and doing it on a test system without knowing what 
> to try and test seems quite pointless.

I have never used MEMWATCH ( http://www.linkdata.se/sourcecode/memwatch ) 
myself, but it's high on my list of things to try should I ever get into a 
situation like the one you're in. This will require tinkering with the xorg 
source and rebuilding the packages though.
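
Short of rebuilding Xorg, it might already help to snapshot the smaps file 
you mention once a day and compare which unlabeled blocks grow. This won't 
say who allocated them, but the timing may point at a usage pattern. A rough 
Python sketch, assuming the usual smaps layout (a mapping line followed by 
"Size:" and "Rss:" lines); the names and the "[anon]" label are mine:

```python
def is_mapping_line(parts):
    """True for lines like '2aaa...-2aab... rw-p 00000000 00:00 0 [name]'."""
    if len(parts) < 5 or "-" not in parts[0]:
        return False
    start, end = parts[0].split("-", 1)
    try:
        int(start, 16)
        int(end, 16)
    except ValueError:
        return False
    return True


def parse_smaps(text):
    """Return a list of (label, size_kb, rss_kb), one entry per mapping."""
    mappings = []
    current = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if is_mapping_line(parts):
            # Mappings without a path or [heap]/[stack] tag are the unlabeled ones.
            label = " ".join(parts[5:]) or "[anon]"
            current = {"label": label, "Size": 0, "Rss": 0}
            mappings.append(current)
        elif current is not None and len(parts) >= 2 and parts[0] in ("Size:", "Rss:"):
            current[parts[0][:-1]] = int(parts[1])
    return [(m["label"], m["Size"], m["Rss"]) for m in mappings]


def largest_unlabeled(mappings, count):
    """Return the unlabeled mappings with the largest resident size first."""
    anon = [m for m in mappings if m[0] == "[anon]"]
    return sorted(anon, key=lambda m: m[2], reverse=True)[:count]
```

Feed it the contents of /proc/<Xorg pid>/smaps (readable as root) and log 
the top few entries with a timestamp.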

> The systems are dual quad-core Xeon systems, with 8 to 12GB RAM, 4GB swap, 
> dual nVidia cards (NVS285 or FX370), quad screens, from 4 to 12 virtual 
> desktops, and they now run KDE 3.5.10 (from kde-redhat.sourceforge.net) on 
> SLC 5.4, x86_64, kernel 2.6.18-164.11.1, with nVidia drivers packaged by CERN 
> IT (kernel-module-nvidia-2.6.18-164.11.1.el5-185.18.36-1.slc5.x86_64) . We 
> had been seeing the same behavior with SLC 5.3 and standard KDE 3.5.6. The 
> most used applications are Konqueror, PVSS (detector control system), plus a 
> variety of CERN or ATLAS specific applications, mostly Java or Python. The 
> issue appears also on desks where no 3D/OpenGL app is used.

You could try to run some stations without KDE, or with a much older nvidia 
driver, just to narrow down the problem space. Sadly, using the vesa or nv 
drivers is not an option with quad screens. Another option could be to separate 
the applications and the X servers they render to from the actual display 
devices by running the apps in a VNC session on some server and the VNC client 
on the desks (this may be awkward to do with your quad screen setup though).

> While this must be, at the bottom, a bug in Xorg, we could already be happy 
> with identifying one or more specific applications which trigger this, and 
> try to add workarounds / mitigations in the applications, if the Xorg bug 
> can't be pinned down or is untreatable.
> 
> Any help or suggestion of tools or procedures that may help us debug this 
> issue would be most welcome.


Good luck. Let us know how it goes.

Cheers,
        Stephan

-- 
Stephan Wiesand
DESY -DV-
Platanenenallee 6
15738 Zeuthen, Germany
