Dear Linuxers,
we are using SL CERN 5.4 for the ATLAS Control Room at CERN, and we are 
experiencing a problem with the Xorg server that is proving very hard to track 
down. I'm hoping someone in the SL community will have the patience to read all 
this and offer some suggestions.

 The desktop systems show a very slow (but not uniformly slow) memory leak in 
Xorg, which grows to 3GB, sometimes even 5GB, and finally brings the systems 
to some kind of crash: usually just the GUI freezes, but sometimes the OOM 
killer steps in, the whole system is left in a bad state and needs to be 
rebooted. Sometimes we can see the problem before it becomes critical and 
ask the users to restart X11, but this is not a welcome procedure.
 Simply closing the applications (either gently or by killing) does not let 
Xorg release the occupied memory. Even logging out (without restarting X11) 
does not free the memory allocated by Xorg. 

 It takes anything between one week and more than 4 weeks for this to happen 
(depending on how heavily the specific desk is used and which applications are 
run on it), so it's very hard to correlate to a specific application or usage 
pattern, and we have not found a way to reproduce it in a shorter time, to be 
able to debug it. 
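
 A minimal sketch of how the growth could be logged for later correlation with 
what was running on a desk, assuming a Python 3 interpreter is available and 
the script is run periodically (for example from cron) with the Xorg pid as 
its argument. The function names and the log format are only illustrative; 
the VmRSS field is read from /proc/<Xorg pid>/status.

```python
import re
import sys
import time


def vm_rss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def log_line(pid, status_text, now=None):
    """Format one timestamped sample; 'now' is seconds since the epoch (default: current time)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s pid=%d VmRSS_kB=%s" % (stamp, pid, vm_rss_kb(status_text))


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 xorg_rss.py <Xorg pid> >> xorg_rss.log
    pid = int(sys.argv[1])
    with open("/proc/%d/status" % pid) as f:
        print(log_line(pid, f.read()))
```

 Appending one such line every few minutes gives a time series that can be 
compared against the times at which applications were started or stopped.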

 xrestop only shows <20 entries with 10-20 MB of pixmaps allocated, nowhere 
near enough to justify the 3GB or more. The memory map from 
/proc/<Xorg pid>/smaps does show a heap of >650MB (not very different from a 
freshly started Xorg) and many allocated memory blocks, some as large as 
800MB, but these are unlabeled and I don't see a way to correlate them with 
something useful. As you can imagine, running Xorg under Valgrind on a 
production system is basically out of the question, and doing it on a test 
system without knowing what to try and test seems quite pointless.
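
 For reference, a minimal sketch of how the smaps output can be summarised per 
mapping name, so that the unlabeled blocks stand out against the heap and the 
mapped libraries. It assumes Python 3 and the usual smaps layout (a header 
line per mapping followed by "Field: N kB" lines); the "[anon]" label and the 
function names are my own choices for illustration.

```python
import re
import sys
from collections import defaultdict

# Header line of a mapping, e.g.
# "2aaab000-5caab000 rw-p 00000000 00:00 0    [heap]"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\d+\s*(.*)$")
# Per-mapping counters, e.g. "Rss:     800000 kB"
FIELD = re.compile(r"^(\w+):\s+(\d+)\s+kB$")


def parse_smaps(text):
    """Parse the text of /proc/<pid>/smaps into a list of mapping dicts."""
    mappings = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            # Mappings without a pathname are labeled "[anon]" here.
            name = m.group(4).strip() or "[anon]"
            mappings.append({"start": m.group(1), "perms": m.group(3),
                             "name": name, "kb": {}})
            continue
        f = FIELD.match(line)
        if f and mappings:
            mappings[-1]["kb"][f.group(1)] = int(f.group(2))
    return mappings


def totals_by_name(mappings, field="Rss"):
    """Sum one counter (in kB) per mapping name, largest first."""
    totals = defaultdict(int)
    for m in mappings:
        totals[m["name"]] += m["kb"].get(field, 0)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Usage (hypothetical script name): python3 smaps_summary.py <Xorg pid>
    with open("/proc/%d/smaps" % int(sys.argv[1])) as f:
        for name, kb in totals_by_name(parse_smaps(f.read()))[:20]:
            print("%10d kB  %s" % (kb, name))
```

 This only shows where the resident memory sits; it does not say which client 
caused the anonymous blocks to be allocated.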

 The systems are dual quad-core Xeon systems, with 8 to 12GB RAM, 4GB swap, 
dual nVidia cards (NVS285 or FX370), quad screens, from 4 to 12 virtual 
desktops, and they now run KDE 3.5.10 (from kde-redhat.sourceforge.net) on SLC 
5.4, x86_64, kernel 2.6.18-164.11.1, with nVidia drivers packaged by CERN IT 
(kernel-module-nvidia-2.6.18-164.11.1.el5-185.18.36-1.slc5.x86_64). We had 
been seeing the same behavior with SLC 5.3 and standard KDE 3.5.6. The most 
used applications are Konqueror, PVSS (detector control system), plus a variety 
of CERN- or ATLAS-specific applications, mostly Java or Python. The issue 
also appears on desks where no 3D/OpenGL app is used.

 While this must be, at bottom, a bug in Xorg, we would already be happy 
to identify one or more specific applications that trigger it, and to try 
adding workarounds / mitigations in those applications, if the Xorg bug can't 
be pinned down or can't be fixed.

 Any help or suggestion of tools or procedures that may help us debug this 
issue would be most welcome.

 Thanks, and cheers,
   Sergio

-- 
 Sergio Ballestrero  - http://physics.uj.ac.za/psiwiki/Ballestrero
 University of Johannesburg, Physics Department
 ATLAS TDAQ sysadmin group 
