Maybe some configuration change occurred that now takes effect when you start the OSD? I'm not sure what could affect memory usage, though - some ulimit values maybe (stack size), the number of OSD threads (compare this OSD's count to the rest of the OSDs), fd cache size. Look in /proc and compare everything. Also look at "ceph osd tree" - did someone touch it while you were gone?
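A quick way to do the /proc comparison suggested above is to dump each ceph-osd process's resident set size side by side, so the odd one out stands out. This is only a sketch assuming a standard Linux /proc layout; the process name "ceph-osd" is the default daemon name:

```python
import os

def rss_kib(pid):
    """Return VmRSS in KiB for a PID (from /proc/<pid>/status), or None."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    # Line looks like: "VmRSS:    123456 kB"
                    return int(line.split()[1])
    except OSError:
        return None
    return None

def find_pids(name):
    """Yield PIDs whose comm (process name) matches exactly."""
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    yield int(entry)
        except OSError:
            continue  # process vanished between listdir and open

if __name__ == "__main__":
    for pid in find_pids("ceph-osd"):
        kib = rss_kib(pid)
        if kib is not None:
            print(f"osd pid {pid}: {kib / 1024:.0f} MiB RSS")
```

Run it on the affected host and on a healthy one and diff the output; comparing /proc/&lt;pid&gt;/limits the same way would catch ulimit differences.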
Jan

> On 07 Sep 2015, at 13:40, Mariusz Gronczewski
> <[email protected]> wrote:
>
> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <[email protected]> wrote:
>
>> Apart from a bug causing this, this could be caused by failure of other OSDs
>> (even temporary) that starts backfills.
>>
>> 1) something fails
>> 2) some PGs move to this OSD
>> 3) this OSD has to allocate memory for all the PGs
>> 4) whatever failed gets back up
>> 5) the memory is never released
>>
>> A similar scenario is possible if, for example, someone confuses "ceph osd
>> crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
>>
>> Did you try just restarting the OSD before you upgraded it?
>
> Stopped, upgraded, started. It helped a bit (<3 GB per OSD), but besides
> that nothing changed. I tried waiting until it stopped eating CPU and then
> restarting it, but it still eats >2 GB of memory, which means I can't start
> all 4 OSDs at the same time ;/
>
> I also added the noin, nobackfill and norecover flags, but that didn't help.
>
> It is surprising to me because before, all 4 OSDs together ate less than
> 2 GB of memory, so I thought I had enough headroom, and we did restart
> machines and remove/add OSDs to test whether recovery/rebalance goes fine.
>
> It also does not have any external traffic at the moment.
>
>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski
>>> <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Over the weekend (I was on vacation, so I don't know exactly what happened)
>>> our OSDs started eating in excess of 6 GB of RAM (well, RSS), which was a
>>> problem considering that we had only 8 GB of RAM for 4 OSDs (about 700
>>> PGs per OSD and about 70 GB of space used). So a spam of coredumps and
>>> OOMs brought the OSDs down to unusability.
>>>
>>> I then upgraded one of the OSDs to Hammer, which made it a bit better
>>> (~2 GB per OSD), but still much higher usage than before.
>>>
>>> Any ideas what the reason for that would be?
>>> Logs are mostly full of
>>> OSDs trying to recover and timed-out heartbeats.
>>>
>>> --
>>> Mariusz Gronczewski, Administrator
>>>
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: [email protected]
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
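The two reweight commands confused in the exchange above do very different things: "ceph osd crush reweight" changes the OSD's CRUSH weight (normally its capacity, in TiB) and reshuffles data across the tree, while "ceph osd reweight" sets an override in [0, 1] that only scales down how much of its CRUSH share that one OSD accepts. A toy model - not CRUSH itself, just the effective share approximated as crush weight times override - shows why mixing them up moves far more data than intended:

```python
def effective_weights(crush_weights, overrides):
    """Toy model: each OSD's share of data approximated as its CRUSH
    weight scaled by its 0..1 reweight override, normalized over all
    OSDs. Real placement is done by CRUSH; this only illustrates the
    relative effect of the two knobs."""
    eff = {osd: crush_weights[osd] * overrides.get(osd, 1.0)
           for osd in crush_weights}
    total = sum(eff.values())
    return {osd: w / total for osd, w in eff.items()}

# Four equal 1.0-weight OSDs; osd.3 gets "ceph osd reweight 3 0.5":
shares = effective_weights(
    {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0},
    {3: 0.5},
)
print(shares)  # osd.3 drops from 25% to ~14%; the rest absorb the difference
```

Either command can trigger the backfill scenario from the thread (PGs moving onto an OSD and inflating its memory), which is why checking "ceph osd tree" for unexpected weight changes is a reasonable first step.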
