Hi Marco,

Do you have any large (and growing) files in /tmp, /etc/svc/volatile, or /var/run? How long was the system up when you did the reboot -d? 15k bytes may be a leak, but it is not so excessive that it should be causing problems (unless you are leaking 15k every few minutes...).
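For example, a quick check along these lines (just a sketch; the ~1 MB threshold is arbitrary, and the paths are simply the usual tmpfs suspects):

  # per-directory totals, in kB
  du -sk /tmp /etc/svc/volatile /var/run 2>/dev/null

  # regular files larger than ~1 MB (2048 512-byte blocks), biggest first;
  # re-run it after a few hours to see whether anything keeps growing
  find /tmp /etc/svc/volatile /var/run -type f -size +2048 2>/dev/null \
      | xargs ls -l 2>/dev/null | sort -nr -k5 | head -20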
max

Marco Marongiu wrote:
> Hello there
>
> This mail is a follow-up to a same-named thread on the SAGE Members'
> mailing list.
>
> THE SHORT VERSION:
>
> I've probably found a kernel memory leak in OpenSolaris 2009.06. I ran
> the ::findleaks command in mdb on the crash dumps, which yielded:
>
>>> bronto at brabham:/var/crash/brabham# echo '::findleaks' | mdb unix.0 vmcore.0 | tee findleaks.out
>>> CACHE             LEAKED  BUFCTL            CALLER
>>> ffffff0142828860       1  ffffff014c022a38  AcpiOsAllocate+0x1c
>>> ffffff01428265a0       2  ffffff014c1f5148  AcpiOsAllocate+0x1c
>>> ffffff0142828860       1  ffffff014c022be8  AcpiOsAllocate+0x1c
>>> ffffff0149b85020       1  ffffff015b00fe98  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00e038  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00e110  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b0055e8  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00e2c0  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00fdc0  rootnex_coredma_allochdl+0x84
>>> ------------------------------------------------------------------------
>>>            Total      10  buffers, 15424 bytes
>
> Any hint?
>
> THE LONG STORY
>
> I am using OpenSolaris 2009.06; my workstation is an HP Compaq dc7800
> with 2 GB of RAM and a SATA disk of about 200 GB; the video card is an
> ATI (I know, I know...); the swap space is 2 GB.
>
> When the problem first showed up months ago, that was my main
> workstation. Now I am using another one, mounting my home filesystem
> from the OpenSolaris machine. When used as a workstation, it also ran
> an Ubuntu Linux virtual machine on VirtualBox 3.0.4 for all those
> applications I couldn't find on OpenSolaris (e.g. Skype).
>
> When the system is freshly booted, it works like a charm: it's quick,
> responsive, even enjoyable to use. 24 hours later, it is already
> slower and swaps a lot. 24 more hours, and it's barely usable. It was
> quite clear to me that the machine was running low on memory.
>
> The first step was to close the greedier applications when I left for
> the day and restart them the next morning (e.g. Thunderbird, Firefox,
> the virtual machine...), but that didn't change anything. I did spot
> a number of interrupts that was a bit too high, even while the system
> was doing almost nothing; that improved when I disabled the VT-x/AMD-V
> setting for the Linux virtual machine.
>
> I also tried restarting my X session to get a fresh X server, and then
> disabled GDM altogether so that I would be sure of getting a fresh X
> server every time by running "startx" from the command line.
>
> Nonetheless, the memory problem was still there.
>
> At that point I saved the output of "ps -o pid,ppid,vsz,args -e | sort
> -nr -k3", restarted the system, re-ran the same ps and compared the
> two outputs: I got no evidence of ever-growing processes, just slight
> changes in size. I also tried prstat -s size: no success.
>
> Then I kept a vmstat running. When I left, I closed all applications
> except the terminal window where vmstat was running: I had 2080888 kB
> of swap and 426932 kB of RAM free. The morning after, the numbers were
> 1709776 kB of swap and 55460 kB of RAM free.
>
> At that point I thought that the problem might be something that ps
> doesn't show. Maybe the drivers. Unfortunately, modinfo didn't shed
> any light. The output of "modinfo | perl -alne '$size = $F[2] =~
> m{[0-9a-f]}i ? hex($F[2]) : qq{>>>}; print
> qq{$size\t$F[2]\t$F[0]\t$F[5]}' | sort -nr" showed identical size
> values for almost all the entries.
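Going back to the ::findleaks output quoted at the top, one possible next step (a sketch only, run against the same dump in /var/crash/brabham) is to ask mdb for per-leak detail and to inspect one of the reported bufctls; note that a full allocation stack is only recorded if kmem auditing (kmem_flags) was enabled when that kernel booted:

  # same dump as above, but with per-leak detail
  echo '::findleaks -d' | mdb unix.0 vmcore.0 > findleaks-detail.out

  # inspect a single leaked buffer; the address is the BUFCTL of the
  # first AcpiOsAllocate entry reported by ::findleaks above
  echo 'ffffff014c022a38::bufctl_audit' | mdb unix.0 vmcore.0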
>
> Following the suggestions of my colleagues at SAGE, I tried mdb's
> ::memstat. With a fresh system it said:
>
>>> Page Summary                Pages                MB  %Tot
>>> ------------     ----------------  ----------------  ----
>>> Kernel                     123331               481   24%
>>> ZFS File Data               65351               255   13%
>>> Anon                       264183              1031   51%
>>> Exec and libs                4623                18    1%
>>> Page cache                  38954               152    8%
>>> Free (cachelist)             8410                32    2%
>>> Free (freelist)              8249                32    2%
>>>
>>> Total                      513101              2004
>>> Physical                   513100              2004
>
> Later:
>
>>> Page Summary                Pages                MB  %Tot
>>> ------------     ----------------  ----------------  ----
>>> Kernel                     205125               801   40%
>>> ZFS File Data                1536                 6    0%
>>> Anon                       281519              1099   55%
>>> Exec and libs                1714                 6    0%
>>> Page cache                  11927                46    2%
>>> Free (cachelist)             6212                24    1%
>>> Free (freelist)              5068                19    1%
>>>
>>> Total                      513101              2004
>>> Physical                   513100              2004
>
> and it didn't change a lot when I closed VirtualBox:
>
>>> Page Summary                Pages                MB  %Tot
>>> ------------     ----------------  ----------------  ----
>>> Kernel                     201160               785   39%
>>> ZFS File Data               35228               137    7%
>>> Anon                       143764               561   28%
>>> Exec and libs                1860                 7    0%
>>> Page cache                  12347                48    2%
>>> Free (cachelist)            20978                81    4%
>>> Free (freelist)             97764               381   19%
>>>
>>> Total                      513101              2004
>>> Physical                   513100              2004
>
> Following Jason King's suggestion I also ran some dtrace scripts, but
> that still didn't take us much further. This one:
>
>>> bronto at brabham:~$ pfexec dtrace -n 'fbt::kmem_alloc:entry { @a[execname,
>>> stack()] = sum(args[0]); } END { trunc(@a, 20) }'
>
> showed a lot of ZFS-related information, of which Jason said:
>
>>> That definitely doesn't look right to me... One other thing just to
>>> eliminate: I think the ARC (ZFS's cache) should show up in the ZFS
>>> File Data, but to be sure, http://cuddletech.com/arc_summary/ is a
>>> handy perl script you can run which will (among other things) tell
>>> you how much cache ZFS is using (though it should be releasing that
>>> as other apps require it, which wouldn't explain the behavior you're
>>> seeing, but just to cross it off the list, it might be worthwhile to
>>> try)
>
> and
>
>>> I'd still go for generating the crash dump (reboot -d) and shoot an
>>> email to mdb-discuss (maybe with the ::findleaks output attached -- it
>>> can be a bit long) to see if anyone can give you some pointers on
>>> tracking down the source.
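Since ::memstat shows the Kernel bucket growing from 481 MB to 801 MB, another angle (again only a sketch, inspecting the live kernel rather than a dump) is to snapshot the kmem caches twice and compare them; the cache whose "memory in use" keeps climbing between a fresh and an aged system is the prime suspect:

  # snapshot per-cache kernel memory usage on a freshly booted system...
  echo '::kmastat' | pfexec mdb -k > /var/tmp/kmastat.fresh
  # ...and again a day later, once the machine has slowed down
  echo '::kmastat' | pfexec mdb -k > /var/tmp/kmastat.aged
  diff /var/tmp/kmastat.fresh /var/tmp/kmastat.aged | less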
>
> arc_summary yielded these results:
>
>>> bronto at brabham:~/bin$ ./arc_summary.pl
>>> System Memory:
>>>         Physical RAM:  2004 MB
>>>         Free Memory :  54 MB
>>>         LotsFree:      31 MB
>>>
>>> ZFS Tunables (/etc/system):
>>>
>>> ARC Size:
>>>         Current Size:             351 MB (arcsize)
>>>         Target Size (Adaptive):   336 MB (c)
>>>         Min Size (Hard Limit):    187 MB (zfs_arc_min)
>>>         Max Size (Hard Limit):    1503 MB (zfs_arc_max)
>>>
>>> ARC Size Breakdown:
>>>         Most Recently Used Cache Size:    6%   21 MB (p)
>>>         Most Frequently Used Cache Size:  93%  315 MB (c-p)
>>>
>>> ARC Efficency:
>>>         Cache Access Total:        51330312
>>>         Cache Hit Ratio:      99%  51130451  [Defined State for buffer]
>>>         Cache Miss Ratio:      0%  199861    [Undefined State for Buffer]
>>>         REAL Hit Ratio:       99%  50874745  [MRU/MFU Hits Only]
>>>
>>>         Data Demand Efficiency:    94%
>>>         Data Prefetch Efficiency:  33%
>>>
>>>         CACHE HITS BY CACHE LIST:
>>>           Anon:                        0%  209685      [ New Customer, First Cache Hit ]
>>>           Most Recently Used:          1%  579174      (mru)        [ Return Customer ]
>>>           Most Frequently Used:       98%  50295571    (mfu)        [ Frequent Customer ]
>>>           Most Recently Used Ghost:    0%  23712       (mru_ghost)  [ Return Customer Evicted, Now Back ]
>>>           Most Frequently Used Ghost:  0%  22309       (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
>>>         CACHE HITS BY DATA TYPE:
>>>           Demand Data:                 1%  923119
>>>           Prefetch Data:               0%  6709
>>>           Demand Metadata:            96%  49210017
>>>           Prefetch Metadata:           1%  990606
>>>         CACHE MISSES BY DATA TYPE:
>>>           Demand Data:                28%  56627
>>>           Prefetch Data:               6%  13460
>>>           Demand Metadata:            51%  102980
>>>           Prefetch Metadata:          13%  26794
>>> ---------------------------------------------
>
> In a moment of low load at $WORK I finally generated the crash dumps
> and ran ::findleaks on them, and the result is at the top of this
> email.
>
> Any hint?
>
> Ciao
> --bronto
>
> _______________________________________________
> mdb-discuss mailing list
> mdb-discuss at opensolaris.org
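Finally, to keep an eye on the ARC and the kernel at the same time between crash dumps, something along these lines might help (a sketch; the 10-minute interval and the log path are arbitrary). If zfs:0:arcstats:size shrinks while unix:0:system_pages:pp_kernel keeps climbing, the ARC is behaving and the growth is elsewhere in the kernel:

  # sample ARC size, free memory, and kernel page count every 10 minutes
  while sleep 600; do
      date '+%Y-%m-%d %H:%M'
      kstat -p zfs:0:arcstats:size unix:0:system_pages:freemem \
          unix:0:system_pages:pp_kernel
  done >> /var/tmp/memwatch.log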