Hi Marco,

Do you have any large (and growing) files in /tmp, /etc/svc/volatile, or /var/run? How long was the system up when you did the reboot -d? 15k bytes may be a leak, but it is not so excessive that it should be causing problems (unless you are leaking 15k every few minutes...).
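For example, a quick check along these lines (just a sketch; the ~1 MB threshold is arbitrary, and the paths are simply the usual tmpfs suspects):

  # per-directory totals, in kB
  du -sk /tmp /etc/svc/volatile /var/run 2>/dev/null

  # regular files larger than ~1 MB (2048 512-byte blocks), biggest first;
  # re-run it after a few hours to see whether anything keeps growing
  find /tmp /etc/svc/volatile /var/run -type f -size +2048 2>/dev/null \
      | xargs ls -l 2>/dev/null | sort -nr -k5 | head -20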
max

Marco Marongiu wrote:
> Hello there
>
> This mail is a follow-up to a same-named thread on the SAGE Members'
> mailing list.
>
> THE SHORT VERSION:
>
> I've probably found a kernel memory leak in OpenSolaris 2009.06. I ran
> the ::findleaks command in mdb on the crash dumps, which yielded:
>
>>> bronto at brabham:/var/crash/brabham# echo '::findleaks' | mdb unix.0 vmcore.0 | tee findleaks.out
>>> CACHE             LEAKED  BUFCTL            CALLER
>>> ffffff0142828860       1  ffffff014c022a38  AcpiOsAllocate+0x1c
>>> ffffff01428265a0       2  ffffff014c1f5148  AcpiOsAllocate+0x1c
>>> ffffff0142828860       1  ffffff014c022be8  AcpiOsAllocate+0x1c
>>> ffffff0149b85020       1  ffffff015b00fe98  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00e038  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00e110  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b0055e8  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00e2c0  rootnex_coredma_allochdl+0x84
>>> ffffff0149b85020       1  ffffff015b00fdc0  rootnex_coredma_allochdl+0x84
>>> ------------------------------------------------------------------------
>>>            Total      10  buffers, 15424 bytes
>
> Any hint?
>
> THE LONG STORY
>
> I am using OpenSolaris 2009.06; my workstation is an HP Compaq dc7800
> with 2 GB of RAM and a SATA disk of about 200 GB; the video card is an
> ATI (I know, I know...); the swap space is 2 GB.
>
> When the problem first showed up months ago, that was my main
> workstation. Now I am using another one, mounting my home filesystem
> from the OpenSolaris machine. When used as a workstation, it also ran
> an Ubuntu Linux virtual machine on VirtualBox 3.0.4 for all those
> applications I couldn't find on OpenSolaris (e.g. Skype).
>
> When the system is freshly booted, it works like a charm: it's quick,
> responsive, even enjoyable to use. 24 hours later, it is already
> slower and swaps a lot. 24 more hours, and it's barely usable. It was
> quite clear to me that the machine was running low on memory.
>
> The first step was to close the greedier applications when I left for
> the day and restart them the next morning (e.g. Thunderbird, Firefox,
> the virtual machine...), but that didn't change anything. I did spot
> a number of interrupts that was a bit too high, even while the system
> was doing almost nothing; that improved when I disabled the VT-x/AMD-V
> setting for the Linux virtual machine.
>
> I also tried restarting my X session to get a fresh X server, and then
> disabled GDM altogether so that I would be sure of getting a fresh X
> server every time by running "startx" from the command line.
>
> Nonetheless, the memory problem was still there.
>
> At that point I saved the output of "ps -o pid,ppid,vsz,args -e | sort
> -nr -k3", restarted the system, re-ran the same ps and compared the
> two outputs: I got no evidence of ever-growing processes, just slight
> changes in size. I also tried prstat -s size: no success.
>
> Then I kept a vmstat running. When I left, I closed all applications
> except the terminal window where vmstat was running: I had 2080888 kB
> of swap and 426932 kB of RAM free. The morning after, the numbers were
> 1709776 kB of swap and 55460 kB of RAM free.
>
> At that point I thought that the problem might be something that ps
> doesn't show. Maybe the drivers. Unfortunately, modinfo didn't shed
> any light. The output of "modinfo | perl -alne '$size = $F[2] =~
> m{[0-9a-f]}i ? hex($F[2]) : qq{>>>}; print
> qq{$size\t$F[2]\t$F[0]\t$F[5]}' | sort -nr" showed identical size
> values for almost all the entries.
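Going back to the ::findleaks output quoted at the top, one possible next step (a sketch only, run against the same dump in /var/crash/brabham) is to ask mdb for per-leak detail and to inspect one of the reported bufctls; note that a full allocation stack is only recorded if kmem auditing (kmem_flags) was enabled when that kernel booted:

  # same dump as above, but with per-leak detail
  echo '::findleaks -d' | mdb unix.0 vmcore.0 > findleaks-detail.out

  # inspect a single leaked buffer; the address is the BUFCTL of the
  # first AcpiOsAllocate entry reported by ::findleaks above
  echo 'ffffff014c022a38::bufctl_audit' | mdb unix.0 vmcore.0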
>
> Following the suggestions of my colleagues at SAGE, I tried mdb's
> ::memstat. With a fresh system it said:
>
>>> Page Summary                Pages                MB  %Tot
>>> ------------     ----------------  ----------------  ----
>>> Kernel                     123331               481   24%
>>> ZFS File Data               65351               255   13%
>>> Anon                       264183              1031   51%
>>> Exec and libs                4623                18    1%
>>> Page cache                  38954               152    8%
>>> Free (cachelist)             8410                32    2%
>>> Free (freelist)              8249                32    2%
>>>
>>> Total                      513101              2004
>>> Physical                   513100              2004
>
> Later:
>
>>> Page Summary                Pages                MB  %Tot
>>> ------------     ----------------  ----------------  ----
>>> Kernel                     205125               801   40%
>>> ZFS File Data                1536                 6    0%
>>> Anon                       281519              1099   55%
>>> Exec and libs                1714                 6    0%
>>> Page cache                  11927                46    2%
>>> Free (cachelist)             6212                24    1%
>>> Free (freelist)              5068                19    1%
>>>
>>> Total                      513101              2004
>>> Physical                   513100              2004
>
> and it didn't change a lot when I closed VirtualBox:
>
>>> Page Summary                Pages                MB  %Tot
>>> ------------     ----------------  ----------------  ----
>>> Kernel                     201160               785   39%
>>> ZFS File Data               35228               137    7%
>>> Anon                       143764               561   28%
>>> Exec and libs                1860                 7    0%
>>> Page cache                  12347                48    2%
>>> Free (cachelist)            20978                81    4%
>>> Free (freelist)             97764               381   19%
>>>
>>> Total                      513101              2004
>>> Physical                   513100              2004
>
> Following Jason King's suggestion I also ran some dtrace scripts, but
> that still didn't take us much further. This one:
>
>>> bronto at brabham:~$ pfexec dtrace -n 'fbt::kmem_alloc:entry { @a[execname,
>>> stack()] = sum(args[0]); } END { trunc(@a, 20) }'
>
> showed a lot of ZFS-related information, of which Jason said:
>
>>> That definitely doesn't look right to me... One other thing just to
>>> eliminate: I think the ARC (ZFS's cache) should show up in the ZFS
>>> File Data, but to be sure, http://cuddletech.com/arc_summary/ is a
>>> handy perl script you can run which will (among other things) tell
>>> you how much cache ZFS is using (though it should be releasing that
>>> as other apps require it, which wouldn't explain the behavior you're
>>> seeing, but just to cross it off the list, it might be worthwhile to
>>> try)
>
> and
>
>>> I'd still go for generating the crash dump (reboot -d) and shoot an
>>> email to mdb-discuss (maybe with the ::findleaks output attached -- it
>>> can be a bit long) to see if anyone can give you some pointers on
>>> tracking down the source.
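Since ::memstat shows the Kernel bucket growing from 481 MB to 801 MB, another angle (again only a sketch, inspecting the live kernel rather than a dump) is to snapshot the kmem caches twice and compare them; the cache whose "memory in use" keeps climbing between a fresh and an aged system is the prime suspect:

  # snapshot per-cache kernel memory usage on a freshly booted system...
  echo '::kmastat' | pfexec mdb -k > /var/tmp/kmastat.fresh
  # ...and again a day later, once the machine has slowed down
  echo '::kmastat' | pfexec mdb -k > /var/tmp/kmastat.aged
  diff /var/tmp/kmastat.fresh /var/tmp/kmastat.aged | less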
>
> arc_summary yielded these results:
>
>>> bronto at brabham:~/bin$ ./arc_summary.pl
>>> System Memory:
>>>         Physical RAM:  2004 MB
>>>         Free Memory :  54 MB
>>>         LotsFree:      31 MB
>>>
>>> ZFS Tunables (/etc/system):
>>>
>>> ARC Size:
>>>         Current Size:             351 MB (arcsize)
>>>         Target Size (Adaptive):   336 MB (c)
>>>         Min Size (Hard Limit):    187 MB (zfs_arc_min)
>>>         Max Size (Hard Limit):    1503 MB (zfs_arc_max)
>>>
>>> ARC Size Breakdown:
>>>         Most Recently Used Cache Size:    6%   21 MB (p)
>>>         Most Frequently Used Cache Size:  93%  315 MB (c-p)
>>>
>>> ARC Efficency:
>>>         Cache Access Total:        51330312
>>>         Cache Hit Ratio:      99%  51130451  [Defined State for buffer]
>>>         Cache Miss Ratio:      0%  199861    [Undefined State for Buffer]
>>>         REAL Hit Ratio:       99%  50874745  [MRU/MFU Hits Only]
>>>
>>>         Data Demand Efficiency:    94%
>>>         Data Prefetch Efficiency:  33%
>>>
>>>         CACHE HITS BY CACHE LIST:
>>>           Anon:                        0%  209685      [ New Customer, First Cache Hit ]
>>>           Most Recently Used:          1%  579174      (mru)        [ Return Customer ]
>>>           Most Frequently Used:       98%  50295571    (mfu)        [ Frequent Customer ]
>>>           Most Recently Used Ghost:    0%  23712       (mru_ghost)  [ Return Customer Evicted, Now Back ]
>>>           Most Frequently Used Ghost:  0%  22309       (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]
>>>         CACHE HITS BY DATA TYPE:
>>>           Demand Data:                 1%  923119
>>>           Prefetch Data:               0%  6709
>>>           Demand Metadata:            96%  49210017
>>>           Prefetch Metadata:           1%  990606
>>>         CACHE MISSES BY DATA TYPE:
>>>           Demand Data:                28%  56627
>>>           Prefetch Data:               6%  13460
>>>           Demand Metadata:            51%  102980
>>>           Prefetch Metadata:          13%  26794
>>> ---------------------------------------------
>
> In a moment of low load at $WORK I finally generated the crash dumps
> and ran ::findleaks on them, and the result is at the top of this
> email.
>
> Any hint?
>
> Ciao
> --bronto
>
> _______________________________________________
> mdb-discuss mailing list
> mdb-discuss at opensolaris.org
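Finally, to keep an eye on the ARC and the kernel at the same time between crash dumps, something along these lines might help (a sketch; the 10-minute interval and the log path are arbitrary). If zfs:0:arcstats:size shrinks while unix:0:system_pages:pp_kernel keeps climbing, the ARC is behaving and the growth is elsewhere in the kernel:

  # sample ARC size, free memory, and kernel page count every 10 minutes
  while sleep 600; do
      date '+%Y-%m-%d %H:%M'
      kstat -p zfs:0:arcstats:size unix:0:system_pages:freemem \
          unix:0:system_pages:pp_kernel
  done >> /var/tmp/memwatch.log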