Hi Nick
On 10/19/2018 10:14 AM, Nick Fisk wrote:
-----Original Message-----
From: Igor Fedotov [mailto:[email protected]]
Sent: 19 October 2018 01:03
To: [email protected]; [email protected]
Subject: Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of
space free in BlockDB on SSD?
On 10/18/2018 7:49 PM, Nick Fisk wrote:
Hi,
Ceph Version = 12.2.8
8TB spinner with 20G SSD partition
Perf dump shows the following:
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 21472731136,
"db_used_bytes": 3467640832,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 320063143936,
"slow_used_bytes": 4546625536,
"num_files": 124,
"log_bytes": 11833344,
"log_compactions": 4,
"logged_bytes": 316227584,
"files_written_wal": 2,
"files_written_sst": 4375,
"bytes_written_wal": 204427489105,
"bytes_written_sst": 248223463173
Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 4.5GB
of DB is stored on the spinning disk?
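For reference, the figures in the question can be derived from the counters above. This is a minimal sketch that assumes decimal gigabytes (10^9 bytes) and uses values copied by hand from the dump; the `summarize` helper is hypothetical and not part of Ceph.

```python
# Values copied from the "bluefs" section of the perf dump above.
bluefs = {
    "db_total_bytes": 21472731136,
    "db_used_bytes": 3467640832,
    "slow_total_bytes": 320063143936,
    "slow_used_bytes": 4546625536,
}

GB = 10**9  # assumption: decimal gigabytes, matching the rounded figures in the thread


def summarize(counters):
    """Return (db_used_gb, db_total_gb, slow_used_gb, db_fill_fraction)."""
    db_used = counters["db_used_bytes"] / GB
    db_total = counters["db_total_bytes"] / GB
    slow_used = counters["slow_used_bytes"] / GB
    return db_used, db_total, slow_used, db_used / db_total


db_used, db_total, slow_used, fill = summarize(bluefs)
print(f"DB volume: {db_used:.2f} GB used of {db_total:.2f} GB ({fill:.0%} full)")
print(f"Spilled to slow device: {slow_used:.2f} GB")
```

A non-zero slow_used_bytes while the DB volume is mostly empty is the pattern being asked about here.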
Correct. Most probably the rationale for this is the layered scheme RocksDB
uses to keep its SST files. For each level it has a maximum
threshold (determined by level no, some base value and corresponding multiplier -
see max_bytes_for_level_base &
max_bytes_for_level_multiplier at
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
If the next level (at its max size) doesn't fit into the space available on the DB
volume, it is spilled over entirely to the slow device.
IIRC level_base is about 250MB and the multiplier is 10, so the third level needs
25GB and hence doesn't fit into your DB volume.
In fact a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the slow
one. AFAIR the current recommendation is about 4%.
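The spillover rule described above can be sketched roughly as follows. This is a simplified model and not how BlueFS or RocksDB actually place files: it assumes a level_base of 250MB and a multiplier of 10 (the values recalled in this thread, not verified defaults), treats L0 as the same size as L1, and both function names are hypothetical.

```python
MB = 10**6  # assumption: decimal megabytes


def level_max_sizes_mb(base_mb=250, multiplier=10, levels=4):
    """Maximum size of each level in MB; L0 is assumed to match L1."""
    sizes = {0: base_mb}
    for n in range(1, levels + 1):
        sizes[n] = base_mb * multiplier ** (n - 1)
    return sizes


def levels_that_fit(db_bytes, sizes_mb):
    """Walk the levels in order and stop at the first one whose maximum
    size no longer fits into the space left on the DB volume."""
    remaining = db_bytes
    fitted = []
    for level in sorted(sizes_mb):
        need = sizes_mb[level] * MB
        if need > remaining:
            break
        remaining -= need
        fitted.append(level)
    return fitted


sizes = level_max_sizes_mb()
# db_total_bytes from the perf dump of the 8TB OSD in this thread.
print(levels_that_fit(21472731136, sizes))
```

Under these assumptions only L0 to L2 (about 3GB in total) fit on this DB volume, and the 25GB level goes to the slow device.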
Thanks Igor, these nodes were designed back in the filestore days, when small
10DWPD SSDs were all the rage. I might be able to shrink the OS/swap partition
and get each DB partition up to 25/26GB; they are not going to get any bigger
than that, as that's the NVMe completely filled. But I'm then going to have to
effectively wipe all the disks I've done so far and re-backfill. ☹ Are there
any tunables to change this behaviour post OSD deployment to move data back
onto SSD?
None I'm aware of.
However, I've just completed development of an offline BlueFS volume
migration feature within ceph-bluestore-tool. It allows DB/WAL volume
allocation and resizing as well as moving BlueFS data between volumes
(with some limitations unrelated to your case). Hence one doesn't need
slow backfilling to adjust the BlueFS volume configuration.
Here is the PR (Nautilus only for now):
https://github.com/ceph/ceph/pull/23103
On a related note, does frequently accessed data move into the SSD, or is the
overspill a one way ticket? I would assume writes would cause data in rocksdb
to be written back into L0 and work its way down, but I'm not sure about reads?
AFAIK reads don't trigger any data layout changes.
This is from a similar slightly newer node with 10TB spinners and 40G partition
"bluefs": {
"gift_bytes": 0,
"reclaim_bytes": 0,
"db_total_bytes": 53684985856,
"db_used_bytes": 10380902400,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 400033841152,
"slow_used_bytes": 0,
"num_files": 165,
"log_bytes": 15683584,
"log_compactions": 8,
"logged_bytes": 384712704,
"files_written_wal": 2,
"files_written_sst": 11317,
"bytes_written_wal": 564218701044,
"bytes_written_sst": 618268958848
So I see your point about the 25G level size making it overspill the partition,
as it is obvious in this case that the 10G of DB used is completely stored on the
SSD. These OSDs are about 70% full, so I'm not expecting a massive increase
in usage. Albeit if I move to EC pools, I should expect maybe a doubling in
objects, so maybe that db_used might double, but it should still be within the
40G hopefully.
The 4% rule would not be workable in my case: there are 12x10TB disks in these
nodes, so I would need nearly 5TB worth of SSD, which would likely cost a similar
amount to the whole node+disks. I get the fact that any recommendations need to
take the worst case into account, but I would imagine for a lot of simple RBD
only use cases, this number is quite inflated.
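The arithmetic behind the "nearly 5TB" figure, as a quick sketch. It uses the nominal sizes quoted in this thread (not the byte counts from the perf dumps), and both helper names are hypothetical.

```python
def db_space_needed_tb(num_disks, disk_tb, ratio=0.04):
    """Total DB space in TB for a node if every OSD follows the given ratio."""
    return num_disks * disk_tb * ratio


def db_ratio(db_gb, disk_tb):
    """DB partition size as a fraction of the data disk (1TB = 1000GB assumed)."""
    return db_gb / (disk_tb * 1000)


print(f"4% rule for 12x10TB: {db_space_needed_tb(12, 10):.1f} TB of SSD")
print(f"20GB DB on an 8TB disk: {db_ratio(20, 8):.2%}")
print(f"40GB DB on a 10TB disk: {db_ratio(40, 10):.2%}")
```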
So I think the lesson from this is that despite whatever DB usage you may think
you may end up with, always make sure your SSD partition is bigger than 26GB
(L0+L1)?
In fact that's
L0+L1 (2x250MB), L2 (2500MB), L3 (25000MB), which is about 28GB.
One more observation from my side - RocksDB might additionally use up to
100% of the level maximum size during compaction - hence it might make
sense to have up to 25GB of additional spare space. Surely this spare
space wouldn't be fully used most of the time. And actually I don't have
any instructions or clear knowledge base for this aspect. Just some
warning.
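As a rough illustration of the numbers above (a sketch only, using the level sizes quoted in this thread and the assumption that compaction may temporarily need up to 100% of the largest level on top of the steady-state total):

```python
# Level maximum sizes in MB, as quoted in the thread (assumed, not measured).
levels_mb = {"L0": 250, "L1": 250, "L2": 2500, "L3": 25000}

steady_state_mb = sum(levels_mb.values())      # space for all levels at max size
headroom_mb = max(levels_mb.values())          # up to 100% of the largest level
worst_case_mb = steady_state_mb + headroom_mb  # steady state plus compaction spare

print(f"steady state: {steady_state_mb / 1000:.0f} GB")
print(f"with compaction headroom: {worst_case_mb / 1000:.0f} GB")
```

This is an upper bound under the stated assumption, not a sizing recommendation.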
To track such an excess I used additional perf counters, see commit
2763c4de41ea55a97ed7400f54a2b2d841894bf5 in
https://github.com/ceph/ceph/pull/23208
Perhaps it makes sense to have a separate PR for this stuff and even
backport it...
Am I also understanding correctly that BlueFS has reserved 300G of space on the
spinning disk?
Right.
Found a previous tracker issue for something which looks like exactly the same
case, but it should be fixed now:
https://tracker.ceph.com/issues/22264
Thanks,
Nick
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com