Thank you all for the views and information! I will answer by top-posting
just to keep my answer in one place.

So, the hardware restrictions were given by the budget.
The task was something like: have an object storage system, _reliable_
(tolerant to 1 component failure),
100 TB usable space, with capabilities to also run VMs/containers, that
can be expanded to at least 500 TB and more over the next 10-15 years
(warmish/cold usage, used for download/upload at the start/end of jobs),
with no predictable future budget, for about 75k EUR.
So, answers of the type "throw more money at the problem" are not usable in
this particular case.

> This all but limits you to having only 3 mons.  It also means that when one
> node is down, that’s fully 33% of your IOPS.  Are you planning to use a
> replicated buckets.data pool?  RGW deployments often use EC to maximize
> usable space at the expense of performance; here to use EC safely you’d want
> to use a brand-new MSR rule, with lessened space efficiency.

Well, yes... the plan is to have the pool with size 3, min_size 2, so RGW
should have the same attributes..?
(if I set the same placement pool / use the default placement? I'm not sure
about this, I have not used RGW until now)

About the RAID stuff: I was too fixated on individual reliability when in fact
Ceph handles this globally.
So, thank you for the advice! I will just use individual NVMes as DB devices
for the HDDs.

I will go with 5 OSDs per NVMe (6.4 TB NVMe -> 1280 GB/OSD, which for a 26 TB
HDD -> 4.92% DB size)
and hope for the best... if there are problems during commissioning I will try
with 4 OSDs/NVMe.
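
Per node the provisioning would then be roughly this (just a sketch; device
names are placeholders and I still need to verify the exact ceph-volume
options):

# 5 HDD OSDs sharing one NVMe for block.db, ~1280 GB of DB per OSD
ceph-volume lvm batch --bluestore \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde \
    --db-devices /dev/nvme0n1 --block-db-size 1280G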

Thank you!
Adrian



-------- Original Message --------
Subject: [ceph-users] Re: advice needed for sizing of wal+db device for hdd
From: Anthony D'Atri
To: ceph-users
Date: 9/16/2025, 5:24:05 PM


>> Hi! I'm trying to estimate the size requirement for a wal+db volume for a
>> 26 TB HDD

What is your workload?  Consider that this is bottlenecking 10x the data behind
the same interface that was limiting for 2-3 TB HDDs years ago.


>> while trying to minimize the administrative hassle

SSDs with no WAL+DB offload.

>> for configuration
>> (meaning that I would prefer a wal+db volume, not separate db from wal)

I need to revisit the docs, almost nobody has cause to have a dedicated WAL 
device.


>> (RGW usage, 3 node cluster)

Danger, Will Robinson!

This all but limits you to having only 3 mons.  It also means that when one 
node is down, that’s fully 33% of your IOPS.  Are you planning to use a 
replicated buckets.data pool?  RGW deployments often use EC to maximize usable 
space at the expense of performance; here to use EC safely you’d want to use a 
brand-new MSR rule, with lessened space efficiency.
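
If I recall correctly, recent releases (Squid) can generate such an MSR rule
from the erasure-code profile, something along these lines (syntax from memory,
check the documentation before relying on it):

# 4+2 across only three hosts, two shards per host via an MSR rule
ceph osd erasure-code-profile set ec42msr k=4 m=2 \
    crush-failure-domain=host \
    crush-num-failure-domains=3 \
    crush-osds-per-failure-domain=2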


> I have a similar setup - servers with 2xNVMe and some HDDs.
> I would recommend aiming for more smaller servers instead of three large
> ones, though.

Absolutely.  Ultra-dense nodes are prone to bottlenecks:
* Backplane / expander throughput
* HBA throughput / congestion
* NIC saturation

When you lose one or bring it back, you have a thundering herd of 
recovery/backfill that will impact your clients.  And take weeks to complete, 
during which you have increased risk of data being unavailable or lost.

>> so, for a jbod that can go up to 44 drives (not all drives present, at most
>> up to 12)

Why have a chassis like that and only populate it 27% full?  Is this 
hand-me-down hardware?

Also note that with only three nodes you’ll want each to have the same 
aggregate OSD capacity (CRUSH weight).
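
A quick sanity check once the OSDs are in place:

# per-host CRUSH weights and utilization at a glance
ceph osd df tree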

I recommend at least five nodes so you can safely have five mons.  And with EC 
there are advantages to having at least k + m + 1 nodes.  4+2 is a reasonable 
EC profile if one is new to the tradeoffs of EC, which would mean at least 
seven nodes.
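
With seven or more hosts a plain host-failure-domain profile is enough, e.g.
(bucket data pool name assumed for the default zone):

ceph osd erasure-code-profile set rgw-ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create default.rgw.buckets.data 256 256 erasure rgw-ec-4-2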



>> I have 2x 6.4 TB NVMe; AFAIU for the metadata (db) device, I need raid

Ceph does that for you.

>> as losing metadata means losing the associated OSD .. did I get this right?

> Yes. But I think the cluster setup _should_ be planned for losing an OSD
> or even an entire server. So I would not bother using RAID-1 for metadata.

Agreed. RAID on top of RAID is rarely a great strategy.


>> So, what would be the best practice to map db+wal to an hdd OSD?
>> should I do an mdraid from the 2 NVMes and split that into 12 partitions?

If you’re dead-set on using this gear, map six OSDs to each unmirrored NVMe 
SSD.  You will burn their endurance at half the rate that way, and at such a 
time that one fails, it won’t take out the entire node.


> What I did is to put the system on both NVMes, on RAID-1 partitions:
>
> /boot/efi - 128 MB or something like that
> / - I used 200 GB, but it is probably overkill

Congrats on eschewing the antiquated strategy of partitioning the boot volume 
to death.

> swap - I used 32 GB

If you need any swap, what you really need is more physmem.  Don’t provision 
any swap at all.  This isn’t 1985.


> Then I created a partition covering the rest of the free space on each NVMe
> and used both of them as physical volumes for a single LVM volume group:

Sharing the boot volume with data is not an ideal strategy.  I have a customer 
who got themselves into an outage doing that.  You have a zillion SAS/SATA 
slots empty, put a pair of SSDs into each system for boot/OS, mirror them with 
MD, and don’t use them for data.
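
Something along these lines for the boot mirror (device names are
placeholders):

# mirror two small SATA/SAS SSDs for the OS; keep data off the boot devices
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdy /dev/sdz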



> vgcreate nvme_vg /dev/nvme0n1p4 /dev/nvme1n1p4

> Now I can create LVM-based mirrored logical volumes for local applications,
> should I ever need them

Nononono.  See above.


> and non-mirrored LVs for Ceph metadata.
> Something like this:

> for i in `seq -w 01 06`; do lvcreate -n ceph_$i -L 100G nvme_vg /dev/nvme0n1p4; done
> for i in `seq -w 11 16`; do lvcreate -n ceph_$i -L 100G nvme_vg /dev/nvme1n1p4; done
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_01 --data /dev/sda
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_11 --data /dev/sdb
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_02 --data /dev/sdc
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_12 --data /dev/sdd
> ...
> ceph-volume lvm activate --all

> The alternative would be to have a mirrored /boot only, and put everything
> else on an NVMe-based LVM VG, using mirrored LVs for root and swap. This
> way the root FS would be easily resizable.

Put everything else on different media, use the entire boot volume for /.


> I don't have experience with this (I am not even sure whether Anaconda can
> install AlmaLinux onto mirrored LVs),

I don’t know about Alma but I’ve done this lots with EL.

> so I went for more traditional md-raid
> instead of LVM mirror for root, swap, and /boot/efi.

My sense is that LVM mirroring is mostly for temporary use while migrating 
devices, though it may actually use MD under the hood.  I tend to create an MD 
metadevice and create LVs on top of that.


>> should I split the NVMe SSD into 12 namespaces, make an individual mdraid
>> for each, and map that to OSDs?

> I don't think it would make a measurable difference to use NVMe namespaces.

I’ve yet to see a reason to use namespaces over traditional partitions.  There 
may be, but I’ve yet to discover it.


>> For a 6.4 TB NVMe SSD

You don’t need high-endurance mixed-use SSDs.  1DWPD read-intensive are fine.  
They’re the same hardware, with less overprovisioning, and less markup.  You 
can change one into the other with software, the mfgs do this all the time at 
the factory depending on what they need to ship.

>> (divided by 12) and a 26 TB OSD, the db/wal would be ~533
>> GB, so around 2.05% .. how terrible is this number for RGW usage?
>> (I get that the recommended is 4%)

That 4% figure is fairly arbitrary, but for RGW usage you usually want more 
than for, say, RBD.  With RocksDB compression enabled in the latest releases, 
some advocate as little as 2.5%.  This is another reason to not mirror your 
offload devices:  being able to have larger partitions, which will help with 
the higher RocksDB levels and with compaction, avoiding spillover.

>> At what size is the data in danger if the db is too small, and when does it
>> become safe but only with performance degradation?

> I think block.db can safely overflow to the main data area (with a
> performance degradation, of course).

Yes, with recent releases there are no magic thresholds.  Back before column 
family sharding, there were discrete amounts of space that *could* be used with 
the rest ignored.  Like with a 55 GB partition, only ~33 GB would actually be 
used.
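
You can also watch for spillover on a live OSD; something like the below should
show whether RocksDB has overflowed onto the HDD (counter names from memory,
they may vary slightly between releases):

# non-zero slow_used_bytes means BlueFS has spilled onto the slow (HDD) device
ceph daemon osd.0 perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
ceph health detail | grep -i spillover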


    We all agree on the necessity of compromise. We just can't agree on
    when it's necessary to compromise.                     --Larry Wall

We demand rigidly defined areas of doubt and uncertainty.



_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
