Hi, it appears that my mails to the lists are only visible on lists.ceph.io
but are not sent as emails, so I'm resending from a different account.

This is kind of a follow-up to a two-year-old thread [0], and I wanted to
raise some awareness for the corresponding tracker [1].

Back then we managed to limit the impact on mon store performance with some
paxos config tweaks, but now the OSDs are affected as well whenever new
ones are created:

Each newly created OSD process grows to around 140 GB of RAM within a few
minutes, easily triggering OOM kills on hosts if multiple OSDs are created
at once. The residual RAM usage drops back to the memory target after the
OSD has successfully booted. The cause is the purged_snaps list that is
loaded during OSD boot (snap_mapper.record_purged_snaps purged_snaps); two
years ago the customer had more than 42 million purged_snap entries. I
don't know how many there are today, since I don't have access myself, but
I'll try to get a current number.
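For anyone who wants to check their own numbers: a rough way to count
purged_snap entries is to list the osd_snap keys in a (stopped) mon's
RocksDB store. This is just a sketch; the store path assumes a traditional
non-cephadm layout, and the exact key naming may differ between releases,
so inspect the listing before trusting the count.

```shell
# Stop one mon first; ceph-kvstore-tool needs exclusive access to the store.
# Path is an assumption for a non-cephadm deployment -- adjust as needed.
MON_STORE=/var/lib/ceph/mon/ceph-$(hostname -s)/store.db

# List keys under the osd_snap prefix and count the purged_snap entries.
# The "purged_snap" key substring is an assumption -- check the raw listing.
ceph-kvstore-tool rocksdb "$MON_STORE" list osd_snap | grep -c purged_snap
```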

Anyway, the only way to safely create OSDs is one by one, maybe two at
once depending on the host's RAM capacity. Automated (unattended) OSD
deployment is currently not possible.
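For reference, the staged deployment we're stuck with looks roughly like
this under cephadm (hostnames and devices are placeholders; a sketch, not
a recipe):

```shell
# Keep the OSD service unmanaged so cephadm doesn't create all OSDs at once.
ceph orch apply osd --all-available-devices --unmanaged=true

# Add OSDs one at a time, waiting for each one to finish booting before
# starting the next, so only one process at a time loads the purged_snaps.
ceph orch daemon add osd host1:/dev/sdb
ceph osd tree   # wait until the new OSD shows "up", then continue
ceph orch daemon add osd host1:/dev/sdc
```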

Unfortunately, the new command [2] doesn't seem to work as expected, at
least in my test cluster it didn't have any impact on the number of
purged_snaps in the mon store. That's why we haven't tried it on the
customer cluster(s) yet.

How do other operators/admins/users deal with this kind of scenario?
Having many snapshots can't be a corner case, but I can't remember having
read anything like this on the list(s). I'd appreciate any comments.

Thanks!
Eugen
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]