Hi Mark,

Thanks a lot for the pointers and info, it's really helpful.

Since the issue is happening in a live cluster (a homelab I can screw
around with to an extent, but can't take down for long or afford to
lose data in), and since I don't have a lot of spare hours in the
coming weeks, this will probably be an investigation on the longer
side. So don't expect fast follow-ups, but rest assured this is now on
my radar and I'll keep poking at it and see what I can figure out.

Re the workload between snapshots: that's a bit hard to pin down right
now, since I did some shuffling around and the snapshot deletions were
from different subtrees. At the time I saw this I had just wiped all
the snapshots in the cluster, ~2 weeks' worth of daily snapshots in the
largest subtree, adding up to just over a terabyte of raw cluster
storage freed going by the stats graphs (5+4 EC, 8MB-object CephFS,
probably mostly large files and full-size objects). I have since moved
the vast majority of that data out of the tree that gets snapshotted
(it really shouldn't have been there); if it turns out I can't repro
after that, I'm happy to move it back, though it would take another ~2
weeks to recreate the exact scenario if it comes to that.

I also just deleted a legacy pool, which at the time was involved in
the snaptrim but had only negligible data actually being
snapshotted/deleted as part of it. Again, if that's what it takes to
get a repro, I don't mind recreating it and trying to force the same
scenario.

FWIW, the setup here is 3 nodes:
- M1 Mac Mini (4p4e), 16GB RAM, OSDs + Mon
- M2 Mac Mini (4p4e), 24GB RAM, OSDs + Mon + MDS + monitoring (prom)
- M2 Pro Mac Mini (6p4e), 32GB RAM, OSDs + Mon + MDS + a bunch of other
stuff

Each node has 4x 16TB HDDs and 1x 1TB NVMe SSD, attached via USB
(10Gbps, except 5Gbps for the HDDs on one node, because reasons). The
HDD block.db volumes are on the Macs' internal NVMe storage (a recent
change, made just before the snapshot story; yes, I did force-migrate
the data over, so there should be no DB files left on the HDDs). The
SSDs don't have a separate block.db. All nodes are on 10GbE networking.
I have no reason to suspect any hardware/driver-related issues relevant
to the snaptrim story (I've had a few of those in the past, but I keep
an eye out for them, and the current setup should be stable).

One issue is that this may be hard to repro on demand. If I'm lucky I
can just create and delete a pile of snapshots in quick succession
(without many changes in between) and repro it, but I might not be
lucky. I'd usually debug this by increasing logging verbosity and/or
adding my own debug logs, but if this ends up requiring days or weeks
of real snapshots to repro, that's going to get old fast. My best idea
right now, if it comes to that, is to attach gdb to an OSD that is
currently experiencing the issue (letting the cluster mark it down in
the meantime) and hope I can step through the code and figure out what
the thrashing actually is before the OSD notices it was marked down and
starts doing something else.

As I mentioned on Slack, one curious thing about my case is that the
OSD seemed to be doing ~100MB reads from BlueFS (probably entire SST
files?), but they *were* being satisfied from the page cache, and
despite that the CPU time was almost all spent copying the data around
in RAM in the kernel. That suggests RocksDB wasn't actually *doing*
much of anything with the data (other than checksumming it, the only
other place I saw CPU usage), not even parsing through it once. It was
just reading it and throwing it away repeatedly, or at least that's
what it looks like to me. So I hope higher debug levels + gdb will
point at where this behavior comes from, and whether it even makes any
sense or is just a bug. Same with the block cache not working as
intended.
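
For reference, and assuming I'm reading the trace right, it bottoms out
in GetOrReadFilterBlock/ReadBlockContents, i.e. whole filter blocks
being re-read through BlueFS instead of being held in the block cache.
As far as I understand, these are the stock RocksDB knobs that control
that behavior; I haven't checked yet what bluestore_rocksdb_options
actually sets on my OSDs, so treat this as a note to self rather than a
conclusion:

  #include <rocksdb/table.h>

  // Note to self: knobs to compare against whatever the OSD's RocksDB
  // options string actually configures.
  rocksdb::BlockBasedTableOptions make_filter_caching_opts() {
    rocksdb::BlockBasedTableOptions t;
    // If false (the RocksDB default), index/filter blocks live in
    // per-table-reader memory bounded by max_open_files, not in the
    // block cache proper.
    t.cache_index_and_filter_blocks = true;
    // Keep them in the cache at high priority / pinned so data blocks
    // don't immediately evict them when memory is tight.
    t.cache_index_and_filter_blocks_with_high_priority = true;
    t.pin_l0_filter_and_index_blocks_in_cache = true;
    return t;
  }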

I do want to ask whether there are any debug/scaffolding tools that
could help reproduce the RocksDB workload in isolation. I can't easily
bring up a full OSD outside the cluster with all the data (it's not
impossible, but a 16TB HDD is kind of a pain to copy around), but I do
have block.db on a separate volume. If there is some way to bring up
that BlueFS stand-alone and then attempt to recreate the RocksDB
workload against it (e.g. by logging it from the real OSD), that might
lead to a really handy reproducer. I don't mind writing my own code; I
just want to know if there's a good starting place for that or any
existing tooling I could jumpstart from.
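
To make it concrete, if I can get at the DB outside the OSD (IIRC
ceph-bluestore-tool can export the BlueFS contents), the rough sketch
below is the kind of thing I have in mind: open the DB read-only with a
deliberately small block cache and statistics enabled, then replay a
key log captured from the real OSD. Completely untested, and the paths,
the keys.txt log and its format are just placeholders for whatever I
can actually capture:

  #include <fstream>
  #include <iostream>
  #include <string>

  #include <rocksdb/cache.h>
  #include <rocksdb/db.h>
  #include <rocksdb/options.h>
  #include <rocksdb/statistics.h>
  #include <rocksdb/table.h>

  int main() {
    rocksdb::Options opts;
    opts.statistics = rocksdb::CreateDBStatistics();

    rocksdb::BlockBasedTableOptions bbt;
    bbt.block_cache = rocksdb::NewLRUCache(64ull << 20);  // deliberately tiny
    bbt.cache_index_and_filter_blocks = true;             // see note above
    opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbt));

    // The OSD's DB likely uses column families (sharded layout), so this
    // would really need the column-family variant of OpenForReadOnly;
    // default CF only here to keep the sketch short.
    rocksdb::DB* db = nullptr;
    auto s = rocksdb::DB::OpenForReadOnly(opts, "/path/to/exported/db", &db);
    if (!s.ok()) {
      std::cerr << "open failed: " << s.ToString() << std::endl;
      return 1;
    }

    // Replay keys captured from the real OSD. One key per line here;
    // real keys are binary, so this would need escaping in practice.
    std::ifstream keys("keys.txt");
    std::string key, value;
    while (std::getline(keys, key)) {
      db->Get(rocksdb::ReadOptions(), key, &value);  // ignore NotFound etc.
    }

    // Dump block cache hit/miss and filter block counters.
    std::cout << opts.statistics->ToString() << std::endl;
    delete db;
    return 0;
  }

Even if the replayed key stream isn't perfectly representative, the
statistics dump should at least tell me whether the filter blocks are
getting cached at all.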

Cheers,
- Hector

On 2025/06/25 7:48, Mark Nelson wrote:
> Hi Hector,
> 
> Just as a follow-up,
> 
> Here are the comments I mentioned in the Ceph Slack channel, from the PR 
> where we re-enabled bluefs_buffered_io.  I recorded these notes back 
> when aclamk and I were digging into the RocksDB code to see if there was 
> anything we could do to improve the situation.  They may be valuable if 
> you want to dig into the RocksDB side of what is going on.
> 
> https://github.com/ceph/ceph/pull/38044#issuecomment-789192720
> 
> https://github.com/ceph/ceph/pull/38044#issuecomment-790157415
> 
> In the end we never solved the problem afaik.  We still rely on 
> bluefs_buffered_io and the Linux page cache to hide excessive disk 
> reads (especially the repeated prefetch reads mentioned in the PR!) when 
> the block cache misses.  In your case, with very limited memory and 
> potentially limited page cache, we are likely hitting RocksDB 
> exceptionally hard for metadata lookups and there's no guarantee that 
> the page cache itself isn't thrashing as well when the block cache 
> misses.  If that's the case, the bluefs_buffered_io "fix" would not 
> be able to mitigate the repeated prefetch reads we observed.  That may 
> be related to what you are observing in your setup as well.
> 
> 
> Mark
> 
> On 6/24/25 08:06, Mark Nelson wrote:
>> Hi Hector,
>>
>> Sorry I'm a bit late to the party on this one.  I wrote the OSD memory 
>> autotuning code and am probably one of the most recent people to 
>> really dig in and refactor bluestore's caches.  I'll respond inline 
>> below.
>>
>> On 6/22/25 05:51, Hector Martin wrote:
>>> Hi all,
>>>
>>> I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
>>> 19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
>>> usage and cluster performance plummets. Most of the CPU usage was
>>> accounted as "system", while actual disk I/O usage was low, so that
>>> didn't sound right.
>>>
>>> I perf traced the system, and found that most of the usage is in
>>> __arch_copy_to_user in the kernel, in read() syscalls. That sounds like
>>> the OSD is thrashing buffered block device reads which the kernel
>>> satisfies from the page cache (hence no real I/O load, most of the CPU
>>> usage is from data copies), so it's repeatedly reading the same disk
>>> blocks, which doesn't sound right.
>> I suspect that what you are seeing here are RocksDB reads, since we 
>> allow those to come from page cache.  We have a long-standing issue 
>> where RocksDB was repeatedly re-reading the same blocks for readahead 
>> when data isn't in the block cache and buffered IO is the crutch 
>> basically allowing it to perform well.  Tyler Stachecki mentioned the 
>> bluefs_buffered_io setting and that indeed is directly related to this 
>> issue and why we've flip-flopped on whether to leave it enabled.  
>> This is a bit of a distraction imho though, see below.
>>
>>
>>>
>>> I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
>>> config`, and CPU usage immediately dropped to near zero. However,
>>> after waiting a bit, the CPU thrashing eventually returns as memory
>>> usage increases. Restarting an OSD has a similar effect. I believe that
>>> something is wrong with the OSD bluestore cache allocation/flush policy,
>>> and when the cache becomes full it starts thrashing reads instead of
>>> evicting colder cached data (or perhaps some cache bucket is starving
>>> another cache bucket of space).
>>
>> In bluestore itself we have two primary caches:  The onode cache and 
>> the buffer cache.  Onode cache stores the metadata about objects while 
>> buffer cache stores object data.  A cache miss in either of these will 
>> result in data being read, but onode cache misses are far more 
>> expensive because they require extra work to go fetch the data from 
>> RocksDB (even if the data is stored in RocksDB's block cache!).  It may 
>> even result in extra reads of the SST files if the data can't easily 
>> be found in early RocksDB levels.
>>
>> To me, what you are seeing sounds like severe onode cache misses, 
>> potentially combined with additional read throughput overhead due to 
>> cache thrashing during compaction if snaptrimming is resulting in a 
>> lot of writes or deletes (combined with listings).
>>
>>
>>>
>>> I would appreciate some hints on how to debug this. Are there any cache
>>> stats I should be looking at, or info on how the cache is partitioned?
>>
>> Just to start, the onode cache hit and miss rates in the OSD 
>> performance counters are a good place to look.  As Anthony mentioned, 
>> things tend not to work very well once you drop below ~1.5-2GB of 
>> memory.  There are some tricks you can play that may help (reducing 
>> various buffer sizes, pglog lengths, RocksDB WAL buffer sizes and 
>> counts, etc).  The honest truth though is that Ceph is incredibly 
>> sensitive to onode cache misses, especially if the backend storage is 
>> not very fast.
>>
>>
>>>
>>> Here is a perf call trace of the thrashing:
>>>
>>>> - OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>> - 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>> - OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>> - 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>> - PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>> - ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>> - 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
>>>> - 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>> - BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
>>>> - 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
>>>> - 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>>>> - BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>>>> - 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
>>>> - 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
>>>> - 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
>>>> - 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
>>>> - 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
>>>> - rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
>>>> - 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
>>>> - 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
>>>> - rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
>>>> - 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
>>>> - rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
>>>> - 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
>>>> - rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
>>>> - rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
>>>> - rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
>>>> - 88.25% rocksdb::BlockFetcher::ReadBlockContents()
>>> [ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
>>>> - 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
>>> [ note: at this level the call trace splits into 4, but it all leads to the same place ]
>>>> - 35.99% 0xaaaac0992e40
>>>> BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
>>>> BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
>>>> - KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
>>>> - 35.98% __libc_pread
>>>> - el0_svc
>>>> - invoke_syscall
>>>> - 35.97% __arm64_sys_pread64
>>>> - 35.96% vfs_read
>>>> - blkdev_read_iter
>>>> - 35.93% filemap_read
>>>> - 35.28% copy_page_to_iter
>>>> 35.01% __arch_copy_to_user
>>
>> Note how much of the perf call trace is spent in RocksDB fetching 
>> blocks as a result of trying to load BlueStore shared_blobs via 
>> onodes.  Presumably your onode hit ratio is very low, and bluestore is 
>> trying to fetch a significant fraction of the onodes from RocksDB.  
>> RocksDB itself likely has very little block cache, so each onode read 
>> ends up fetching a block from disk and putting it in the block cache, 
>> only for it to potentially be discarded immediately if there is very 
>> little memory left for the next block.
>>
>> One thing that's missing here is a description of the workload that is 
>> happening between snapshots.  Are you doing a lot of IO that could 
>> result in severe fragmentation between snapshots?
>>
>>
>> Thanks,
>>
>> Mark
>>
>>
>>> - Hector
>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
