For those who have been following along, I figured it out. I left all
the details with Mark on Slack, but TL;DR: the fix is *either one* of
these (both together works too):

ceph config set osd rocksdb_cache_index_and_filter_blocks false
  (Ceph default: true, RocksDB default: false)
ceph config set osd rocksdb_cache_shard_bits 0
  (or 1, Ceph default: 4, RocksDB default: variable but probably 6)

(Restart all OSDs after changing these settings for them to take
effect; I don't think they apply live.)
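
(To double-check that an OSD actually picked up the new values after
the restart, something like this should do it; osd.0 is just an
example ID, and `ceph config show` reports what the running daemon is
using:

  ceph config show osd.0 rocksdb_cache_index_and_filter_blocks
  ceph config show osd.0 rocksdb_cache_shard_bits
)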

The side effect of the first option is that it might increase
unaccounted OSD heap memory usage (memory not managed by the cache
autosizing code), as filter blocks are then preloaded and cached
outside the block cache. The side effect of the second option is that
it reduces the parallelism/core scalability of the block cache. I
suspect the impact of either will be small for most typical
deployments, and for those affected, the benefit of not having
horrible snaptrim thrashing far outweighs it.
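
If you want to keep an eye on the first side effect, comparing the
tcmalloc heap numbers against what the mempools account for should
give a rough idea of how much memory is unaccounted for. Something
along these lines should work (osd.0 as an example; the `daemon` form
has to be run on the host where that OSD lives):

  ceph tell osd.0 heap stats
  ceph daemon osd.0 dump_mempools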

Both options were changed from their RocksDB defaults in Ceph commits
long ago without any explanation / relevant commit messages, so I
suspect this is a case of "some people flipped some knobs that seemed
harmless and nobody really tested their impact". Other things have
changed since then too, so I can't say when the impact likely began, as
there are a lot of other factors involved.

The root cause is that the block cache sharding defaults to 16 shards,
and that filter blocks are cached in the block cache instead of
separately (not the RocksDB default). Filter blocks can be up to 17MB
or thereabouts, which means that with 16 shards the minimum viable KV
block cache size is ~272MB to ensure non-pathological behavior: if a
filter block does not fit into its cache shard, it is not cached at
all, which is what causes the problem. Since snaptrim uses a lot of
memory, it squeezes out the OSD caches, and as soon as the RocksDB
block cache size drops below roughly that threshold (depending on how
large the filter blocks on a particular OSD ended up being), you get
pathological thrashing, with potentially several gigabytes per second
of kernel->user memory copies repeatedly rereading filter blocks from
SSTs.
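
To spell out the arithmetic behind the ~272MB figure (taking ~17MB as
the worst-case filter block size, which is roughly what I saw; your
SSTs may differ):

  shards             = 2^rocksdb_cache_shard_bits = 2^4 = 16
  per-shard capacity = block_cache_size / 16
  largest filter     ~ 17MB
  => block_cache_size must be >= 16 * 17MB ~ 272MB, or some shard will
     be unable to hold a worst-case filter block at all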

Fixing this *might* mean that bluefs_buffered_io can be flipped to false
too.
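For anyone who wants to try that, the knob works like the ones above
(that part is untested speculation on my end, so treat it as an
experiment):

  ceph config set osd bluefs_buffered_io false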

As for who is likely affected, it depends on snaptrim memory usage vs.
osd_memory_target. If you can guarantee roughly 1GB for the managed
caches, then it might never affect you, though it still depends on
what PriCache wants to do, and it might still squeeze the KV cache
under some other set of conditions. With the default osd_memory_target
of 4G, my guess is that things are just about on the edge of being
safe most of the time, which is probably why most people don't see
things go horribly wrong. If you increase osd_memory_target above the
default, you're probably safe. If you decrease it, you're in danger.
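
If you're not sure what your OSDs are actually running with (it can be
set globally, per device class, or per OSD), checking the effective
value on a running daemon is probably the quickest way to tell which
side of this you're on, e.g.:

  ceph config show osd.0 osd_memory_target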

Either of the above settings prevents this pathological situation from
occurring. rocksdb_cache_index_and_filter_blocks=false is the safest;
rocksdb_cache_shard_bits=0 could still allow some thrashing if
multiple SSTs are involved, and rocksdb_cache_shard_bits=1 is
borderline, but none of these should allow the extreme pathological
behavior where a single thread thrashes reads repeatedly at extreme
speeds (which is what I experienced).

- Hector


On 2025/06/22 21:51, Hector Martin wrote:
> Hi all,
> 
> I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version
> 19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU
> usage and cluster performance plummets. Most of the CPU usage was
> accounted as "system", while actual disk I/O usage was low, so that
> didn't sound right.
> 
> I perf traced the system, and found that most of the usage is in
> __arch_copy_to_user in the kernel, in read() syscalls. That sounds like
> the OSD is thrashing buffered block device reads which the kernel
> satisfies from the page cache (hence no real I/O load, most of the CPU
> usage is from data copies), so it's repeatedly reading the same disk
> blocks, which doesn't sound right.
> 
> I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph
> config`, and CPU usage immediately dropped to near zero. However,
> after waiting a bit, the CPU thrashing eventually returns as memory
> usage increases. Restarting an OSD has a similar effect. I believe that
> something is wrong with the OSD bluestore cache allocation/flush policy,
> and when the cache becomes full it starts thrashing reads instead of
> evicting colder cached data (or perhaps some cache bucket is starving
> another cache bucket of space).
> 
> I would appreciate some hints on how to debug this. Are there any cache
> stats I should be looking at, or info on how the cache is partitioned?
> 
> Here is a perf call trace of the thrashing:
> 
>> - OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
>> - 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, 
>> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
>> - OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, 
>> ThreadPool::TPHandle&)
>> - 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, 
>> ThreadPool::TPHandle&)
>> - PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
>> - ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
>> - 88.86% ECBackend::handle_sub_write(pg_shard_t, 
>> boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, 
>> ECListener&)
>> - 88.78% non-virtual thunk to 
>> PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, 
>> std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
>> - 
>> BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&,
>>  std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> 
>> >&, boost::intrusiv
>> - 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
>> ceph::os::Transaction*)
>> - 88.71% BlueStore::_remove(BlueStore::TransContext*, 
>> boost::intrusive_ptr<BlueStore::Collection>&, 
>> boost::intrusive_ptr<BlueStore::Onode>&)
>> - BlueStore::_do_remove(BlueStore::TransContext*, 
>> boost::intrusive_ptr<BlueStore::Collection>&, 
>> boost::intrusive_ptr<BlueStore::Onode>&)
>> - 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, 
>> boost::intrusive_ptr<BlueStore::Collection>&, 
>> boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
>> - 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, 
>> boost::intrusive_ptr<BlueStore::Collection>&, 
>> boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
>> - 88.65% 
>> BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
>> - 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, 
>> std::char_traits<char>, std::allocator<char> > const&, 
>> std::__cxx11::basic_string<char, std::ch
>> - 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, 
>> rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
>> - rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice 
>> const&, rocksdb::DBImpl::GetImplOptions&)
>> - 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, 
>> rocksdb::LookupKey const&, rocksdb::PinnableSlice*, 
>> rocksdb::PinnableWideColumns*, std
>> - 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, 
>> rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, 
>> rocksdb::Sl
>> - rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice 
>> const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
>> - 88.32% 
>> rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, 
>> rocksdb::Slice const&, bool, rocksdb::SliceTransfor
>> - rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, 
>> rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
>> - 88.29% 
>> rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool,
>>  rocksdb::GetContext*, rocksdb
>> - 
>> rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable
>>  const*, rocksdb::Fi
>> - rocksdb::Status 
>> rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*,
>>  rocks
>> - rocksdb::Status 
>> rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
>> - 88.25% rocksdb::BlockFetcher::ReadBlockContents()
> [ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum]
>> - 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, 
>> unsigned long, unsigned long, rocksdb::Sli
> [ note: at this level the call trace splits into 4, but it all leads to
> the same place ]
>> - 35.99% 0xaaaac0992e40
>> BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, 
>> rocksdb::Slice*, char*) const
>> BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, 
>> char*)
>> - KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
>> - 35.98% __libc_pread
>> - el0_svc
>> - invoke_syscall
>> - 35.97% __arm64_sys_pread64
>> - 35.96% vfs_read
>> - blkdev_read_iter
>> - 35.93% filemap_read
>> - 35.28% copy_page_to_iter
>> 35.01% __arch_copy_to_user
> 
> - Hector

- Hector
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
