Maybe some configuration change occurred that only takes effect when the OSD 
starts?
I'm not sure what could affect memory usage, though - maybe some ulimit values 
(stack size), the number of OSD threads (compare this OSD's count to the rest 
of the OSDs), or the fd cache size. Look in /proc and compare everything 
against a healthy OSD.
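For a quick side-by-side check, something along these lines should do:

    # per-process limits (stack size, max open files, ...)
    for p in $(pidof ceph-osd); do echo "== $p =="; cat /proc/$p/limits; done
    # thread count and resident memory per OSD process
    ps -o pid,nlwp,rss,cmd -C ceph-osd

Run the same on a node that behaves and diff the output.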
Also look at "ceph osd tree" - did someone touch it while you were gone?
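It shows both the CRUSH weight and the reweight column, so if either differs 
from what you remember, PGs have moved:

    ceph osd tree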

Jan

> On 07 Sep 2015, at 13:40, Mariusz Gronczewski 
> <[email protected]> wrote:
> 
> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <[email protected]> wrote:
> 
>> Apart from a bug causing this, it could be caused by the failure of other 
>> OSDs (even a temporary one), which starts backfills.
>> 
>> 1) something fails
>> 2) some PGs move to this OSD
>> 3) this OSD has to allocate memory for all the PGs
>> 4) whatever fails gets back up
>> 5) the memory is never released.
>> 
>> A similar scenario is possible if, for example, someone confuses "ceph osd 
>> crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
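>> 
>> For the record, they look alike but do different things (the weights here 
>> are just example values):
>> 
>>     ceph osd crush reweight osd.3 3.64   # permanent CRUSH weight (~disk size in TB)
>>     ceph osd reweight 3 0.8              # temporary 0..1 override
>> 
>> Mixing them up usually means a big unintended rebalance.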
>> 
>> Did you try just restarting the OSD before you upgraded it?
> 
> stopped, upgraded, started. it helped a bit (<3GB per OSD) but besides
> that nothing changed. I've tried waiting till it stops eating CPU and then
> restarting, but it still eats >2GB of memory, which means I can't start
> all 4 OSDs at the same time ;/
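> 
> (for anyone else hitting this: heap usage can be inspected and unused pages
> returned on a running OSD, assuming a tcmalloc build - the osd id is just an
> example:
> 
>     ceph tell osd.0 heap stats
>     ceph tell osd.0 heap release
> )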
> 
> I've also added the noin,nobackfill,norecover flags but that didn't help
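> (set the usual way:
> 
>     ceph osd set noin
>     ceph osd set nobackfill
>     ceph osd set norecover
> 
> - they're visible in the flags line of "ceph -s")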
> 
> it is surprising to me because before this all 4 OSDs together ate less
> than 2GB of memory, so I thought I had enough headroom - and we did restart
> machines and remove/add OSDs to test that recovery/rebalance goes fine
> 
> the cluster also doesn't get any external traffic at the moment
> 
> 
>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski 
>>> <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> over the weekend (I was on vacation, so I don't know exactly what happened)
>>> our OSDs started eating in excess of 6GB of RAM (RSS, that is), which was a
>>> problem considering we had only 8GB of RAM for 4 OSDs (about 700 PGs per
>>> OSD and about 70GB of space used). The resulting spam of coredumps and OOM
>>> kills ground the OSDs down to unusability.
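>>> 
>>> (RSS as reported by e.g.:
>>> 
>>>     ps -o pid,rss,cmd -C ceph-osd
>>> )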
>>> 
>>> I then upgraded one of the OSDs to hammer, which made things a bit better
>>> (~2GB per OSD), but usage is still much higher than before.
>>> 
>>> any ideas what could be the reason for that? the logs are mostly full of
>>> OSDs trying to recover and of timed-out heartbeats
>>> 
>>> -- 
>>> Mariusz Gronczewski, Administrator
>>> 
>>> Efigence S. A.
>>> ul. Wołoska 9a, 02-583 Warszawa
>>> T: [+48] 22 380 13 13
>>> F: [+48] 22 380 13 14
>>> E: [email protected]
>> 
> 
> 
> 
> -- 
> Mariusz Gronczewski, Administrator
> 
> Efigence S. A.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: [email protected]

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
