Yes, there is a bug that can use a huge amount of memory. It is triggered
when an OSD goes down or is added to the cluster and recovery/backfilling runs.

The patches https://github.com/ceph/ceph/pull/5656 and
https://github.com/ceph/ceph/pull/5451, merged into master, fix it, and
they will be backported.

I think Ceph v0.93 or newer may hit this bug.
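To check which version you are running, something like this works (a minimal
sketch, guarded so it does nothing on hosts without the ceph CLI installed):

```shell
# Print the locally installed Ceph version; v0.93 and newer may be affected.
# Guarded so this is a harmless no-op where the ceph CLI is absent.
if command -v ceph >/dev/null 2>&1; then
    version=$(ceph --version)
else
    version="ceph CLI not found on this host"
fi
echo "$version"
```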

2015-09-07 20:42 GMT+08:00 Shinobu Kinjo <[email protected]>:

> How heavy was the network traffic?
>
> Have you tried to capture the traffic on the cluster and public networks
> to see where it was coming from?
>
>  Shinobu
>
> ----- Original Message -----
> From: "Jan Schermer" <[email protected]>
> To: "Mariusz Gronczewski" <[email protected]>
> Cc: [email protected]
> Sent: Monday, September 7, 2015 9:17:04 PM
> Subject: Re: [ceph-users] Huge memory usage spike in OSD on hammer/giant
>
> Hmm, even network traffic went up.
> Nothing in the mon logs around when this started (9/4, ~6 AM)?
>
> Jan
>
> > On 07 Sep 2015, at 14:11, Mariusz Gronczewski <
> [email protected]> wrote:
> >
> > On Mon, 7 Sep 2015 13:44:55 +0200, Jan Schermer <[email protected]> wrote:
> >
> >> Maybe some configuration change occurred that now takes effect when you
> >> start the OSD?
> >> Not sure what could affect memory usage though - some ulimit values maybe
> >> (stack size), number of OSD threads (compare the number from this OSD to
> >> the rest of the OSDs), fd cache size. Look in /proc and compare everything.
> >> Also look in "ceph osd tree" - didn't someone touch it while you were
> >> gone?
> >>
> >> Jan
> >>
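Jan's /proc comparison can be sketched like this on any Linux box. It is
demonstrated against the shell's own PID; the "pgrep -f ceph-osd" target in
the comment is just an example, adjust it to your setup:

```shell
# Read thread count and resident memory for a PID from /proc/<pid>/status.
# Shown on this shell's own PID; for an OSD, substitute the PID from
# e.g. "pgrep -f ceph-osd" and compare across OSDs.
pid=$$
threads=$(awk '/^Threads:/ {print $2}' "/proc/$pid/status")
rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
echo "pid=$pid threads=$threads rss=${rss_kb}kB"
# ulimit values for the current process are also worth comparing:
ulimit -s   # stack size
ulimit -n   # open files (fd cache pressure)
```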
> >
> >> number of OSD threads (compare the number from this OSD to the rest of
> > OSDs),
> >
> > it occurred on all OSDs, and it looked like this:
> > http://imgur.com/IIMIyRG
> >
> > sadly I was on vacation so I didn't manage to catch it before ;/ but I'm
> > sure there was no config change
> >
> >
> >>> On 07 Sep 2015, at 13:40, Mariusz Gronczewski <
> [email protected]> wrote:
> >>>
> >>> On Mon, 7 Sep 2015 13:02:38 +0200, Jan Schermer <[email protected]>
> wrote:
> >>>
> >>>> Apart from a bug causing this, it could be caused by a failure of other
> >>>> OSDs (even temporary) that starts backfills.
> >>>>
> >>>> 1) something fails
> >>>> 2) some PGs move to this OSD
> >>>> 3) this OSD has to allocate memory for all the PGs
> >>>> 4) whatever fails gets back up
> >>>> 5) the memory is never released.
> >>>>
> >>>> A similar scenario is possible if, for example, someone confuses "ceph
> >>>> osd crush reweight" with "ceph osd reweight" (yes, this happened to me :-)).
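For anyone who hasn't hit this before: the two commands look alike but act on
different weights. A hedged sketch, with osd.3 as a made-up example and
guarded so it is a no-op without a running cluster:

```shell
# The two reweight commands are easy to confuse:
#   ceph osd reweight <osd-id> <0.0-1.0>    - temporary override weight
#   ceph osd crush reweight <name> <weight> - permanent CRUSH weight;
#                                             changing it moves data around
if command -v ceph >/dev/null 2>&1; then
    ceph osd reweight 3 0.8            # override weight on osd.3
    ceph osd crush reweight osd.3 1.0  # CRUSH weight for osd.3
else
    echo "ceph CLI not found; commands shown for reference only"
fi
```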
> >>>>
> >>>> Did you try just restarting the OSD before you upgraded it?
> >>>
> >>> stopped, upgraded, started. It helped a bit (<3GB per OSD), but besides
> >>> that nothing changed. I've tried to wait till it stops eating CPU and then
> >>> restart it, but it still eats >2GB of memory, which means I can't start
> >>> all 4 OSDs at the same time ;/
> >>>
> >>> I've also added the noin, nobackfill, and norecover flags, but that didn't help
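For reference, those flags are set and cleared with "ceph osd set" /
"ceph osd unset"; a sketch, guarded so it is a no-op without a cluster:

```shell
# Cluster-wide flags that pause data movement while OSDs are restarted.
if command -v ceph >/dev/null 2>&1; then
    ceph osd set noin         # returning OSDs are not automatically marked "in"
    ceph osd set nobackfill   # suspend backfill
    ceph osd set norecover    # suspend recovery
    # restart OSDs, watch memory, then re-enable:
    for flag in noin nobackfill norecover; do
        ceph osd unset "$flag"
    done
else
    echo "ceph CLI not found; commands shown for reference only"
fi
```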
> >>>
> >>> it is surprising to me because before, all 4 OSDs together ate less than
> >>> 2GB of memory, so I thought I had enough headroom, and we did restart
> >>> machines and removed/added OSDs to test that recovery/rebalance went fine
> >>>
> >>> it also does not have any external traffic at the moment
> >>>
> >>>
> >>>>> On 07 Sep 2015, at 12:58, Mariusz Gronczewski <
> [email protected]> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> over a weekend (I was on vacation so I didn't catch exactly what
> >>>>> happened) our OSDs started eating in excess of 6GB of RAM (well, RSS),
> >>>>> which was a problem considering that we had only 8GB of RAM for 4 OSDs
> >>>>> (about 700 PGs per OSD and about 70GB of space used). A spam of
> >>>>> coredumps and OOMs then made the OSDs unusable.
> >>>>>
> >>>>> I then upgraded one of the OSDs to hammer, which made it a bit better
> >>>>> (~2GB per OSD) but still with much higher usage than before.
> >>>>>
> >>>>> any ideas what could be the reason for that? The logs are mostly full
> >>>>> of OSDs trying to recover and timed-out heartbeats.
> >>>>>
> >>>>> --
> >>>>> Mariusz Gronczewski, Administrator
> >>>>>
> >>>>> Efigence S. A.
> >>>>> ul. Wołoska 9a, 02-583 Warszawa
> >>>>> T: [+48] 22 380 13 13
> >>>>> F: [+48] 22 380 13 14
> >>>>> E: [email protected]
> >>>>> <mailto:[email protected]>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> [email protected]
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>
>



-- 
Regards,
xinze
