Yes, there is a bug that causes huge memory usage. It is triggered when an OSD goes down or is added to the cluster and recovery/backfilling runs.
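To check whether the spike lines up with recovery/backfilling, one option is to sample the RSS of each OSD process from /proc while it runs. This is only a minimal sketch: it assumes Linux and that the daemons show up under the process name ceph-osd, and the helper names (parse_vm_rss_kb, parse_name, osd_rss_kb) are made up for illustration, not part of Ceph.

```python
import os
import re


def parse_vm_rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_name(status_text):
    """Return the process name from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^Name:\s+(.+)$", status_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def osd_rss_kb(proc_root="/proc", process_name="ceph-osd"):
    """Map pid -> RSS in kB for every process whose name equals process_name.

    process_name="ceph-osd" is an assumption about how the daemon appears
    in /proc; adjust it if your deployment names the process differently.
    """
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                text = f.read()
        except OSError:
            continue  # process exited or status file is not readable
        if parse_name(text) == process_name:
            rss = parse_vm_rss_kb(text)
            if rss is not None:
                result[int(entry)] = rss
    return result
```

Calling osd_rss_kb() every few seconds and printing the values gives a per-OSD memory timeline that can be compared with the time an OSD was marked down or added.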
The patches https://github.com/ceph/ceph/pull/5656 and https://github.com/ceph/ceph/pull/5451, merged into master, should fix this bug, and they will be backported. I think Ceph v0.93 and newer versions may hit this bug.

2015-09-07 20:42 GMT+08:00 Shinobu Kinjo <[email protected]>:

> How heavy was the network traffic?
>
> Have you tried to capture the traffic between the cluster and public
> networks to see where such a bunch of traffic came from?
>
> Shinobu
>
> ----- Original Message -----
> From: "Jan Schermer" <[email protected]>
> To: "Mariusz Gronczewski" <[email protected]>
> Cc: [email protected]
> Sent: Monday, September 7, 2015 9:17:04 PM
> Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant
>
> Hmm, even network traffic went up.
> Nothing in logs on the mons which started 9/4 ~6 AM?
>
> Jan
>
> > On 07 Sep 2015, at 14:11, Mariusz Gronczewski <[email protected]> wrote:
> >
> > On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <[email protected]> wrote:
> >
> >> Maybe some configuration change occurred that now takes effect when
> >> you start the OSD?
> >> Not sure what could affect memory usage though - some ulimit values
> >> maybe (stack size), number of OSD threads (compare the number from
> >> this OSD to the rest of OSDs), fd cache size. Look in /proc and
> >> compare everything.
> >> Also look in "ceph osd tree" - didn't someone touch it while you
> >> were gone?
> >>
> >> Jan
> >>
> >
> >> number of OSD threads (compare the number from this OSD to the rest
> >> of OSDs),
> >
> > it occurred on all OSDs, and it looked like that
> > http://imgur.com/IIMIyRG
> >
> > sadly I was on vacation so I didn't manage to catch it before ;/ but I'm
> > sure there was no config change
> >
> >
> >>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <[email protected]> wrote:
> >>>
> >>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <[email protected]> wrote:
> >>>
> >>>> Apart from a bug causing this, this could be caused by failure of
> >>>> other OSDs (even temporary) that starts backfills.
> >>>>
> >>>> 1) something fails
> >>>> 2) some PGs move to this OSD
> >>>> 3) this OSD has to allocate memory for all the PGs
> >>>> 4) whatever fails gets back up
> >>>> 5) the memory is never released.
> >>>>
> >>>> A similar scenario is possible if for example someone confuses
> >>>> "ceph osd crush reweight" with "ceph osd reweight" (yes, this
> >>>> happened to me :-)).
> >>>>
> >>>> Did you try just restarting the OSD before you upgraded it?
> >>>
> >>> stopped, upgraded, started. it helped a bit (<3GB per OSD) but beside
> >>> that nothing changed. I've tried to wait till it stops eating CPU then
> >>> restart it but it still eats >2GB of memory which means I can't start
> >>> all 4 OSDs at same time ;/
> >>>
> >>> I've also added noin,nobackfill,norecover flags but that didn't help
> >>>
> >>> it is surprising for me because before all 4 OSDs total ate less than
> >>> 2GB of memory so I thought I had enough headroom, and we did restart
> >>> machines and removed/added OSDs to test if recovery/rebalance goes fine
> >>>
> >>> it also does not have any external traffic at the moment
> >>>
> >>>
> >>>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <[email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> over a weekend (was on vacation so I didn't get exactly what happened)
> >>>>> our OSDs started eating in excess of 6GB of RAM (well RSS), which was a
> >>>>> problem considering that we had only 8GB of RAM for 4 OSDs (about 700
> >>>>> pgs per osd and about 70GB space used). So spam of coredumps and OOMs
> >>>>> blocked the OSDs down to unusability.
> >>>>>
> >>>>> I then upgraded one of the OSDs to hammer which made it a bit better
> >>>>> (~2GB per osd) but still much higher usage than before.
> >>>>>
> >>>>> any ideas what would be a reason for that? logs are mostly full of
> >>>>> OSDs trying to recover and timed out heartbeats
> >>>>>
> >>>>> --
> >>>>> Mariusz Gronczewski, Administrator
> >>>>>
> >>>>> Efigence S. A.
> >>>>> ul. Wołoska 9a, 02-583 Warszawa
> >>>>> T: [+48] 22 380 13 13
> >>>>> F: [+48] 22 380 13 14
> >>>>> E: [email protected]
> >>>>> <mailto:[email protected]>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> [email protected]
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>
> >>
> >
>

--
Regards,
xinze
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
