On 2025/08/12 1:00, Mark Nelson wrote:
> Congrats on figuring this out Hector! This is a huge find! Comments below.
>
> On 8/11/25 4:31 AM, Hector Martin wrote:
>> For those who have been following along, I figured it out. I left all the details with Mark on Slack, but TL;DR: the fix is *either one* (or both works too) of these:
>>
>> ceph config set osd rocksdb_cache_index_and_filter_blocks false
>> (Ceph default: true, RocksDB default: false)
>> ceph config set osd rocksdb_cache_shard_bits 0
>> (or 1, Ceph default: 4, RocksDB default: variable but probably 6)
>>
>> (Restart all OSDs after changing these settings for them to take effect; I don't think they apply live.)
>>
>> The side effect of the first option is that it might increase unaccounted heap OSD memory usage (not managed by the cache autosize code), as filter blocks are preloaded and cached outside the block cache. The side effect of the second option is that it reduces the parallelism/core scalability of the block cache. I suspect the impact of either will be generally small for most typical deployments, and the benefit of not having horrible snaptrim thrashing far outweighs it for those affected.
>>
>> Both options were changed from their RocksDB defaults in Ceph commits long ago without any explanation / relevant commit messages, so I suspect this is a case of "some people flipped some knobs that seemed harmless and nobody really tested their impact". Other things have changed since then too, so I can't say when the impact likely began, as there are a lot of other factors involved.
>
> I'll take ownership of that one since I was the one that flipped those knobs. ;)
>
> FWIW, a lot of the discussion around these issues was happening back in the bluestore standup in those days. I recall that the idea behind these changes was that we didn't want index and filter blocks to be able to consume memory outside the context of the cache. I.e., at the time we were concerned with runaway memory usage of the OSD, so having index and filter blocks cached with high (pinned!) priority in the block cache was preferable to having memory allocated for them without bound and with no oversight in RocksDB. The testing we did at the time didn't involve running in scenarios where the RocksDB cache size was this small. We didn't have the osd_memory_target autotuning back then, and the cache size was a static ratio of the overall bluestore cache size. Back in those days we did regularly see OSD memory usage swing depending on the workload.
>
> It seems to me this is really a combination of high memory pressure forcing the RocksDB block cache to shrink to the minimum 64MB, having 16 cache shards, and forcing index/filter blocks into the cache that leads to the bad behavior.
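(For reference, those three ingredients roughly correspond to the RocksDB options below. Ceph's RocksDBStore wires this up internally with its own cache implementation, so treat this purely as a sketch of the semantics, not a copy of what the code actually does.)

    #include <cstddef>
    #include <rocksdb/cache.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    // Sketch: roughly what the two Ceph knobs control on the RocksDB side.
    rocksdb::Options make_options(size_t cache_bytes, int shard_bits,
                                  bool cache_filters_in_block_cache) {
        rocksdb::BlockBasedTableOptions table_opts;
        // rocksdb_cache_shard_bits -> number of shards in the LRU block cache.
        // Each shard only gets cache_bytes >> shard_bits of capacity.
        table_opts.block_cache = rocksdb::NewLRUCache(cache_bytes, shard_bits);
        // rocksdb_cache_index_and_filter_blocks -> whether index/filter blocks
        // compete with data blocks inside that same sharded cache (Ceph: true),
        // or live outside it, unaccounted (RocksDB default: false).
        table_opts.cache_index_and_filter_blocks = cache_filters_in_block_cache;
        // The "high (pinned!) priority" part mentioned above:
        table_opts.cache_index_and_filter_blocks_with_high_priority = true;
        table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

        rocksdb::Options opts;
        opts.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_opts));
        return opts;
    }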
Yup. In particular, index/filter blocks are *way* larger than data blocks, so while 64MB and 16 shards is still entirely reasonable for everything else, it falls over badly here. Dare I say, the underlying issue is shoving index/filter blocks into the same cache as data blocks, which is arguably a RocksDB flaw. Ideally I'd say they should go into a dedicated cache, managed to share space with the main block cache, but with no sharding of its own.

>> The root cause is that the sharding defaults to 16 shards, and that filter blocks are cached in the block cache instead of separately (not the RocksDB default). Filter blocks can be up to 17MB or thereabouts, which means with 16 shards, the minimum viable KV block cache size is ~272MB to ensure non-pathological behavior (if a filter block does not fit into its cache shard, it is not cached at all, causing the problem). Since snaptrim uses a lot of memory, it squeezes out the OSD caches, and as soon as the RocksDB block cache size drops below roughly that size (depending on how large the filter blocks you ended up with on a particular OSD are), you get pathological thrashing, with potentially several gigabytes per second of kernel->user memory copies repeatedly rereading filter blocks from SSTs.
>
> My thought is that unless testing tells us otherwise, we probably should drop the shard count down to around 4 and bump up the minimum cache allocation a bit. We could disable storing index/filter blocks in the block cache, and the osd_memory_target code would now attempt to compensate for it (which we didn't have before), but it feels a bit like going backwards (ideally we would be accounting for all significant memory usage in the OSD).

Yeah, that works. Shard bits = 2 (4 shards) plus bumping the minimum cache size up to 96MB or so should do the trick, assuming filter blocks much larger than 16MB don't happen (I'm not sure what the distribution of those is or what limits their size; I only know what I see on my setup).

>> Fixing this *might* mean that bluefs_buffered_io can be flipped to false too.
>
> This is one of the most exciting aspects of this discovery imho!

>> As for who is likely affected, it depends on snaptrim memory usage vs. osd_memory_target. If you can guarantee 1GB or so (rough) for managed caches, then it might never affect you, though it still depends on what PriCache wants to do and it might still squeeze the KV cache under some other set of conditions. With the default of 4G for osd_memory_target, my guess is things are just about on the edge of being safe enough most of the time, which is probably why most people don't see things go horribly wrong. If you increase osd_memory_target above the default, you're probably safe. If you decrease it, you're in danger.
>
> Hence the bluefs_buffered_io crutch. Even if you hit this, if you have enough page cache to keep the SST files cached, it doesn't really affect you either.

Oh, it does. Remember, the issue I'm hitting is that merely *copying stuff from the page cache* eats all the CPU; there is zero disk I/O. The underlying complexity blowup here, I suspect, is that in order to read/interpret *a small part* of the filter block, RocksDB has to load *the whole block* every time. So if the block is already in the RocksDB cache, there is no copying involved and the complexity is effectively O(accesses). When the block is being repeatedly loaded from the kernel page cache, that becomes O(accesses * filter block size), and you get a complexity explosion.
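To make the blowup concrete, here's a toy model of that loop (the sizes and lookup count are made up; the mechanism is the point):

    #include <cstdint>
    #include <cstdio>
    #include <initializer_list>

    // Toy model: a block only stays cached if it fits in its shard
    // (cache_size >> shard_bits). If it doesn't, every lookup re-reads and
    // re-copies the whole block from the kernel page cache.
    int main() {
        const uint64_t MiB = 1 << 20;
        const uint64_t cache_size   = 64 * MiB;   // squeezed-down block cache
        const uint64_t filter_block = 17 * MiB;   // one big filter block
        const uint64_t lookups      = 1000000;    // lookups during a snaptrim burst (made up)

        for (int shard_bits : {4, 0}) {
            uint64_t shard_cap = cache_size >> shard_bits;
            bool fits = filter_block <= shard_cap;
            // O(1) block loads if it stays cached, O(lookups) loads if not.
            uint64_t copied = (fits ? 1 : lookups) * filter_block;
            printf("shard_bits=%d: shard capacity %llu MiB, block %s, %llu MiB copied\n",
                   shard_bits, (unsigned long long)(shard_cap / MiB),
                   fits ? "stays cached" : "is never cached",
                   (unsigned long long)(copied / MiB));
        }
        return 0;
    }

With 16 shards (shard bits = 4), a 64MB cache gives each shard only 4MB, so the 17MB filter block is never cached and the bytes copied scale with the number of lookups; with a single shard it gets copied once and stays put. That's the ~272MB minimum from above, just viewed per shard.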
I'm not sure if attachments are allowed here, but let's just say that when I repro'd this earlier today, I was watching the Grafana graphs, and the instant an OSD hit the runaway condition, the random_read I/O throughput graph shot up and rescaled to the point that all the other lines got compressed down to look like zero. *looks back* Yeah, it was a near-instantaneous jump from ~34MB/s to almost 8GB/s for the first affected OSD. If I look back at some of the painful days where all the OSDs spent a while thrashing, ~all of them end up hovering around 3GB/s. With 4 OSDs per host, that's 12GB/s of total kernel->user copy throughput happening on the system; even if we ignore the overhead of the kernel/user special case, that's the ~entire bandwidth of the fastest available grade of a single DDR4 DIMM. Good thing the little Apple boxes I'm running this on are famous for having oodles of memory bandwidth, otherwise it would have been worse! :^)

I'm pretty sure there are two levels of pain here. The level that inspired bluefs_buffered_io=true probably involves either a) some general cache thrashing, but not on every access loop within a thread, just overall across threads (general cache contention), or b) the same kind of pathological thrashing I saw, but with very limited duration (perhaps with a less aggressive workload than snaptrim, or something else has since changed to make it worse now / for me). When you have just a *little* bit of this corner case, you see it as direct I/O pain and the kernel page cache "fixes it". But when it's as horribly pathological as the case I'm hitting, that only partially helps. I didn't get around to testing bluefs_buffered_io=false in my repro environment, but I'm pretty sure the outcome would have been that snaptrim just never comes even close to completing/catching up, especially if I had done it before I'd moved the DBs to SSDs.

There's also the part where the bottleneck here is kernel->user copies, which is something that probably performs differently depending on a bunch of factors, including security mitigations, architecture, and all sorts of other stuff. It might be that on some systems these copies are fast enough that even if you hit the pathological case I did, it flies under the radar a bit more easily.

If I look at the Grafana graph, the explosion isn't truly instantaneous; there's just an inflection point. On snaptrim, the first affected OSD starts out reading around 1.5-2MB/s. This holds for around 5 minutes, as memory usage grows but the cache remains healthily sized. As the cache starts to get squeezed, the read rate grows to 34MB/s within the span of about 5 minutes. At that point it hits the "a single block doesn't fit" tipping point and explodes. So there is clearly a steady state of memory capacity where you're doing up to >10x the I/O you'd otherwise be doing, and that's where bluefs_buffered_io=true would fix things. But when you hit the point where a single block doesn't fit at all, even that doesn't save you.

So that does mean that the calculations I made above only avoid the worst case here. I think if you want to avoid thrashing to the point where you can flip bluefs_buffered_io=false, the cache/shard sizing might need to be more conservative...
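Put differently, there seem to be two thresholds, and only the first is something the page cache can paper over. A little sketch of how I think about it (the "working set" threshold is my mental model, not something I've measured):

    #include <cstdint>
    #include <cstdio>

    // Two distinct failure thresholds (a model, not measured constants):
    //  - SteadyStateThrash: the combined filter/index working set no longer
    //    fits in the block cache, so blocks keep getting evicted and re-read;
    //    bluefs_buffered_io=true hides most of the cost via the page cache.
    //  - Pathological: a single filter block no longer fits in its shard
    //    (cache_size >> shard_bits), so it is never cached and every lookup
    //    re-copies the whole block; the page cache can't save you anymore.
    enum class Regime { Healthy, SteadyStateThrash, Pathological };

    Regime classify(uint64_t cache_bytes, int shard_bits,
                    uint64_t filter_working_set, uint64_t largest_filter_block) {
        if (largest_filter_block > (cache_bytes >> shard_bits))
            return Regime::Pathological;
        if (filter_working_set > cache_bytes)
            return Regime::SteadyStateThrash;
        return Regime::Healthy;
    }

    int main() {
        const uint64_t MiB = 1 << 20;
        // e.g. a squeezed 64MB cache, 16 shards, ~200MB of filters, 17MB largest:
        printf("%d\n", (int)classify(64 * MiB, 4, 200 * MiB, 17 * MiB));  // 2 = Pathological
        return 0;
    }

Avoiding the second threshold is what the shard/minimum-size numbers above buy you; staying clear of the first one as well is what would be needed before bluefs_buffered_io=false becomes comfortable.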
>> Either of the above settings makes it impossible for this pathological situation to occur. rocksdb_cache_index_and_filter_blocks=false is safest; rocksdb_cache_shard_bits=0 could allow for some thrashing if multiple SSTs are involved; rocksdb_cache_shard_bits=1 is borderline; but neither should allow for the extreme pathological behavior where a single thread thrashes reads repeatedly at extreme speeds in any case (which is what I experienced).
>
> I'd slightly disagree with the conclusion that rocksdb_cache_index_and_filter_blocks=false is safest, especially when running in extremely memory-constrained scenarios. It's safest from the angle that this data will now be forced into memory, but it's not safe from the viewpoint that it will contribute to osd_memory_target overage without oversight. It takes control out of the prioritycache's hands and forces the memory to be used. That's why I'm leaning toward reducing the number of cache shards and ensuring the minimum memory allocation for the block cache is large enough that we can fit the index/filter blocks into the cache.

Right, what I meant is that it's safest in terms of never thrashing, not that it's safest in general. Sorry for being unclear.

> Having said that, there is a reasonable view as well that if you are *extremely* memory constrained, loading the index/filter blocks from disk is the correct behavior even if it's horribly slow.

Eh, given how bad what I've seen is, I don't think that is ever correct. If you're on a system that is so horribly memory constrained that you legitimately can't keep those blocks in memory (and honestly, you never should be; the problem here is just the sharding interaction, not that anyone literally doesn't have 16MB of RAM for a single filter block), you probably also have dinky small CPUs and low-bandwidth memory that completely fall over in this case.

The slightly smarter solution here would be to autotune the shard count based on some other metric. The only issue is that (with the current code) this cannot be done at runtime, so it would have to be something like a computation based on osd_memory_target at boot time (which would, somewhat counter-intuitively, turn that mostly-runtime option into one with boot-time impact as well). Or someone could add runtime shard merging/splitting to the cache code... should be easier than PG merging at least! :-)
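To be concrete, something along these lines, computed once at OSD start from whatever minimum the prioritycache would ever squeeze the block cache down to for a given osd_memory_target (the function name and the worst-case filter size are made up for illustration; this isn't existing Ceph code):

    #include <cstdint>

    // Hypothetical boot-time heuristic: pick as many shards as possible (for
    // lock parallelism) while still guaranteeing that even the squeezed-down
    // block cache fits one worst-case filter block per shard.
    int pick_rocksdb_cache_shard_bits(uint64_t min_block_cache_bytes,
                                      uint64_t worst_case_filter_block_bytes) {
        int bits = 0;
        while (bits < 6 &&
               (min_block_cache_bytes >> (bits + 1)) >= worst_case_filter_block_bytes)
            ++bits;
        return bits;
    }

With a 64MB floor and ~17MB filter blocks that comes out to 1 (2 shards of 32MB each); you'd only get back to 16 shards once the floor is around 272MB.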
- Hector

> Mark

>> - Hector
>>
>> On 2025/06/22 21:51, Hector Martin wrote:
>>> Hi all,
>>>
>>> I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version 19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU usage and cluster performance plummets. Most of the CPU usage was accounted as "system", while actual disk I/O usage was low, so that didn't sound right.
>>>
>>> I perf traced the system, and found that most of the usage is in __arch_copy_to_user in the kernel, in read() syscalls. That sounds like the OSD is thrashing buffered block device reads which the kernel satisfies from the page cache (hence no real I/O load; most of the CPU usage is from data copies), so it's repeatedly reading the same disk blocks, which doesn't sound right.
>>>
>>> I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph config`, and CPU usage immediately dropped to near zero. However, after waiting a bit, the CPU thrashing eventually returns as memory usage increases. Restarting an OSD has a similar effect. I believe that something is wrong with the OSD bluestore cache allocation/flush policy, and when the cache becomes full it starts thrashing reads instead of evicting colder cached data (or perhaps some cache bucket is starving another cache bucket of space).
>>>
>>> I would appreciate some hints on how to debug this. Are there any cache stats I should be looking at, or info on how the cache is partitioned?
>>>
>>> Here is a perf call trace of the thrashing:
>>>
>>>> - OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>>>> - 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>>>> - OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
>>>> - 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
>>>> - PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>>>> - ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>>>> - 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
>>>> - 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>>>> - BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
>>>> - 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
>>>> - 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>>>> - BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
>>>> - 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
>>>> - 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
>>>> - 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
>>>> - 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
>>>> - 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
>>>> - rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
>>>> - 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
>>>> - 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
>>>> - rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
>>>> - 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
>>>> - rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
>>>> - 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
>>>> - rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
>>>> - rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
>>>> - rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
>>>> - 88.25% rocksdb::BlockFetcher::ReadBlockContents()
>>> [ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
>>>> - 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
>>> [ note: at this level the call trace splits into 4, but it all leads to the same place ]
>>>> - 35.99% 0xaaaac0992e40
>>>> BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
>>>> BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
>>>> - KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
>>>> - 35.98% __libc_pread
>>>> - el0_svc
>>>> - invoke_syscall
>>>> - 35.97% __arm64_sys_pread64
>>>> - 35.96% vfs_read
>>>> - blkdev_read_iter
>>>> - 35.93% filemap_read
>>>> - 35.28% copy_page_to_iter
>>>> 35.01% __arch_copy_to_user
>>>
>>> - Hector
>>
>> - Hector

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io