I guess he runs Argonaut as well. Any more suggestions about this problem?
Thanks!
--
Regards,
Sébastien Han.

On Mon, Jan 7, 2013 at 8:09 PM, Samuel Just <sam.j...@inktank.com> wrote:
>
> Awesome! What version are you running (ceph-osd -v, include the hash)?
> -Sam
>
> On Mon, Jan 7, 2013 at 11:03 AM, Dave Spano <dsp...@optogenics.com> wrote:
> > This failed the first time I sent it, so I'm resending in plain text.
> >
> > Dave Spano
> > Optogenics
> > Systems Administrator
> >
> > ----- Original Message -----
> >
> > From: "Dave Spano" <dsp...@optogenics.com>
> > To: "Sébastien Han" <han.sebast...@gmail.com>
> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>, "Samuel Just" <sam.j...@inktank.com>
> > Sent: Monday, January 7, 2013 12:40:06 PM
> > Subject: Re: OSD memory leaks?
> >
> > Sam,
> >
> > Attached are some heaps that I collected today. 001 and 003 are from just
> > after I started the profiler; 011 is the most recent. If you need more, or
> > anything different, let me know. The OSD in question is already at 38%
> > memory usage. As mentioned by Sébastien, restarting ceph-osd keeps things
> > going.
> >
> > Not sure if this is helpful information, but of the two OSDs that I have
> > running, the first one (osd.0) is the one that develops this problem the
> > quickest. osd.1 has the same issue; it just takes much longer. Do the
> > monitors hit the first OSD in the list first when there's activity?
> >
> > Dave Spano
> > Optogenics
> > Systems Administrator
> >
> > ----- Original Message -----
> >
> > From: "Sébastien Han" <han.sebast...@gmail.com>
> > To: "Samuel Just" <sam.j...@inktank.com>
> > Cc: "ceph-devel" <ceph-devel@vger.kernel.org>
> > Sent: Friday, January 4, 2013 10:20:58 AM
> > Subject: Re: OSD memory leaks?
> >
> > Hi Sam,
> >
> > Thanks for your answer, and sorry for the late reply.
> >
> > Unfortunately I can't get anything useful out of the profiler -- actually I
> > do get output, but I suspect it doesn't show what it's supposed to show...
> > I will keep trying. Yesterday I started to wonder whether the problem might
> > be due to overuse of some OSDs. I thought the distribution of primary OSDs
> > might be uneven, which could have explained why the memory leaks are worse
> > on some servers. In the end the distribution turned out to be even, but
> > while looking at the pg dump I found something interesting in the scrub
> > column: the timestamps of the last scrubbing operations matched the times
> > shown on the graph.
> >
> > After this, I did some calculations: for each node I compared the total
> > number of scrubbing operations with the number performed during the time
> > range where the memory leaks occurred.
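[For reference, the kind of scrub-count extraction described above can be
scripted; the sketch below counts scrub stamps per day as a starting point
for the correlation. It is illustrative, not from the thread: the
"last_scrub_stamp" column name is an assumption and varies between Ceph
versions, so check the header row of your own "ceph pg dump" output first.]

    ceph pg dump 2>/dev/null | awk '
        # find the scrub-stamp column by name in the header row
        /pg_stat/ { for (i = 1; i <= NF; i++) if ($i == "last_scrub_stamp") c = i; next }
        # PG rows look like "2.1f ..."; the stamp splits into date and time
        # fields, so $c alone yields the date part
        c && $1 ~ /^[0-9]+\.[0-9a-f]+$/ { print $c }
    ' | sort | uniq -c    # scrubs per day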
> > First of all, here is my setup:
> >
> > root@c2-ceph-01 ~ # ceph osd tree
> > dumped osdmap tree epoch 859
> > # id  weight  type name        up/down  reweight
> > -1    12      pool default
> > -3    12      rack lc2_rack33
> > -2    3       host c2-ceph-01
> > 0     1       osd.0            up       1
> > 1     1       osd.1            up       1
> > 2     1       osd.2            up       1
> > -4    3       host c2-ceph-04
> > 10    1       osd.10           up       1
> > 11    1       osd.11           up       1
> > 9     1       osd.9            up       1
> > -5    3       host c2-ceph-02
> > 3     1       osd.3            up       1
> > 4     1       osd.4            up       1
> > 5     1       osd.5            up       1
> > -6    3       host c2-ceph-03
> > 6     1       osd.6            up       1
> > 7     1       osd.7            up       1
> > 8     1       osd.8            up       1
> >
> > And here are the results:
> >
> > * Ceph node 1, which has the largest memory leak, performed 1608 scrubs
> >   in total, 1059 of them during the time range where the memory leaks
> >   occurred
> > * Ceph node 2: 1168 in total, 776 during the time range where the memory
> >   leaks occurred
> > * Ceph node 3: 940 in total, 94 during the time range where the memory
> >   leaks occurred
> > * Ceph node 4: 899 in total, 191 during the time range where the memory
> >   leaks occurred
> >
> > I'm still not entirely sure that the scrub operation causes the leak, but
> > it's the only relevant correlation that I have found...
> >
> > Could it be that the scrubbing process doesn't release memory? By the way,
> > I was wondering how Ceph decides at what time it should run the scrubbing
> > operation. I know that it runs once a day and is controlled by the
> > following options:
> >
> > OPTION(osd_scrub_min_interval, OPT_FLOAT, 300)
> > OPTION(osd_scrub_max_interval, OPT_FLOAT, 60*60*24)
> >
> > But how does Ceph determine the time at which the operation starts --
> > during cluster creation, probably?
> >
> > I also checked the options that control OSD scrubbing and found that by
> > default:
> >
> > OPTION(osd_max_scrubs, OPT_INT, 1)
> >
> > So that might explain why only one OSD uses a lot of memory.
> >
> > My dirty workaround at the moment is to check the memory usage of every
> > OSD and restart the daemon if it uses more than 25% of the total memory.
> > Also note that on Ceph nodes 1, 3 and 4 it's always a single OSD that uses
> > a lot of memory; on Ceph node 2 the memory usage is also high, but nearly
> > the same across all the OSD processes.
> >
> > Thank you in advance.
> >
> > --
> > Regards,
> > Sébastien Han.
> >
> > On Wed, Dec 19, 2012 at 10:43 PM, Samuel Just <sam.j...@inktank.com> wrote:
> >>
> >> Sorry, it's been very busy. The next step would be to try to get a heap
> >> dump. You can start a heap profile on osd N by:
> >>
> >> ceph osd tell N heap start_profiler
> >>
> >> and you can get it to dump the collected profile using:
> >>
> >> ceph osd tell N heap dump
> >>
> >> The dumps should show up in the osd log directory.
> >>
> >> Assuming the heap profiler is working correctly, you can look at the
> >> dump using pprof from google-perftools.
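[For example, a profiling session along those lines might look like the
sketch below. The profile file names and log directory are assumptions --
tcmalloc numbers the dumps, but the exact naming depends on your
configuration; --text and --base are standard google-perftools pprof
options.]

    # start the profiler on osd.0 and take dumps as memory grows
    ceph osd tell 0 heap start_profiler
    ceph osd tell 0 heap dump
    # ... wait while memory climbs, then dump again ...
    ceph osd tell 0 heap dump

    # top allocation sites in one dump
    pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0001.heap | head -20

    # diff two dumps to see only what grew in between
    pprof --text --base=/var/log/ceph/osd.0.profile.0001.heap \
          /usr/bin/ceph-osd /var/log/ceph/osd.0.profile.0002.heap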
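[And a minimal sketch of the restart workaround Sébastien describes above,
assuming sysvinit-style "service ceph restart osd.N" as on Ubuntu 12.04 and
a ceph-osd command line carrying "-i <id>"; the threshold and the restart
command are the parts to adapt.]

    #!/bin/sh
    # Restart any ceph-osd daemon using more than THRESHOLD percent of memory.
    THRESHOLD=25

    for pid in $(pgrep -x ceph-osd); do
        mem=$(ps -o pmem= -p "$pid" | tr -d ' ')
        id=$(ps -o args= -p "$pid" | sed -n 's/.*-i \([0-9][0-9]*\).*/\1/p')
        [ -n "$mem" ] && [ -n "$id" ] || continue
        if [ "${mem%.*}" -ge "$THRESHOLD" ]; then
            echo "osd.$id is at ${mem}% memory, restarting"
            service ceph restart "osd.$id"
        fi
    done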
> >> On Wed, Dec 19, 2012 at 8:37 AM, Sébastien Han <han.sebast...@gmail.com>
> >> wrote:
> >> > No more suggestions? :(
> >> > --
> >> > Regards,
> >> > Sébastien Han.
> >> >
> >> > On Tue, Dec 18, 2012 at 6:21 PM, Sébastien Han <han.sebast...@gmail.com>
> >> > wrote:
> >> >> Nothing terrific...
> >> >>
> >> >> Kernel logs from my clients are full of "libceph: osd4
> >> >> 172.20.11.32:6801 socket closed"
> >> >>
> >> >> I saw this somewhere on the tracker.
> >> >>
> >> >> Is this harmful?
> >> >>
> >> >> Thanks.
> >> >>
> >> >> --
> >> >> Regards,
> >> >> Sébastien Han.
> >> >>
> >> >> On Mon, Dec 17, 2012 at 11:55 PM, Samuel Just <sam.j...@inktank.com>
> >> >> wrote:
> >> >>>
> >> >>> What is the workload like?
> >> >>> -Sam
> >> >>>
> >> >>> On Mon, Dec 17, 2012 at 2:41 PM, Sébastien Han
> >> >>> <han.sebast...@gmail.com> wrote:
> >> >>> > Hi,
> >> >>> >
> >> >>> > No, I don't see anything abnormal in the network stats, and I don't
> >> >>> > see anything in the logs... :(
> >> >>> > The weird thing is that one node out of 4 seems to take way more
> >> >>> > memory than the others...
> >> >>> >
> >> >>> > --
> >> >>> > Regards,
> >> >>> > Sébastien Han.
> >> >>> >
> >> >>> > On Mon, Dec 17, 2012 at 11:31 PM, Sébastien Han
> >> >>> > <han.sebast...@gmail.com> wrote:
> >> >>> >>
> >> >>> >> Hi,
> >> >>> >>
> >> >>> >> No, I don't see anything abnormal in the network stats, and I don't
> >> >>> >> see anything in the logs... :(
> >> >>> >> The weird thing is that one node out of 4 seems to take way more
> >> >>> >> memory than the others...
> >> >>> >>
> >> >>> >> --
> >> >>> >> Regards,
> >> >>> >> Sébastien Han.
> >> >>> >>
> >> >>> >> On Mon, Dec 17, 2012 at 7:12 PM, Samuel Just <sam.j...@inktank.com>
> >> >>> >> wrote:
> >> >>> >>>
> >> >>> >>> Are you having network hiccups? There was a bug noticed recently
> >> >>> >>> that could cause a memory leak if nodes are being marked up and
> >> >>> >>> down.
> >> >>> >>> -Sam
> >> >>> >>>
> >> >>> >>> On Mon, Dec 17, 2012 at 12:28 AM, Sébastien Han
> >> >>> >>> <han.sebast...@gmail.com> wrote:
> >> >>> >>> > Hi guys,
> >> >>> >>> >
> >> >>> >>> > Looking at my graphs today, I noticed that one of my 4 Ceph
> >> >>> >>> > nodes uses a lot of memory. It keeps growing and growing.
> >> >>> >>> > See the graph attached to this mail.
> >> >>> >>> > I run 0.48.2 on Ubuntu 12.04.
> >> >>> >>> >
> >> >>> >>> > The other nodes also grow, but more slowly than the first one.
> >> >>> >>> >
> >> >>> >>> > I'm not quite sure what information I need to provide, so let
> >> >>> >>> > me know. The only thing I can say is that the load hasn't
> >> >>> >>> > increased that much this week. The node seems to be consuming
> >> >>> >>> > memory and not giving it back.
> >> >>> >>> >
> >> >>> >>> > Thank you in advance.
> >> >>> >>> >
> >> >>> >>> > --
> >> >>> >>> > Regards,
> >> >>> >>> > Sébastien Han.
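[On Sam's up/down question: a rough way to check for that kind of flapping,
assuming default log paths -- adjust them to your setup, and note that the
exact log message wording can differ between versions.]

    # how often each OSD thought it was wrongly marked down
    grep -c "wrongly marked me down" /var/log/ceph/*osd*.log

    # the osd lines of "ceph osd dump" carry up_from/up_thru/down_at epochs;
    # rapidly advancing values suggest OSDs are being remarked up and down
    ceph osd dump | grep "^osd"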