Thanks for the input, everyone, and sorry for the long delay; I wanted to post my conclusions.
Regarding the old pool: no, I don't think there is any way to retroactively fix it; the only way to set it up properly is to copy the data over. And I'm quite sure the past setup was a mistake; it should have used file layouts for the EC pool from the start. We always wanted the EC HDD pool for bulk storage, but it shouldn't have been the root pool.

I ran some larger benchmarks: full-blast mdtest from 15 nodes, 16 tasks per node:

#SBATCH -N 15
#SBATCH --ntasks-per-node=16

NFILES=2000
NITER=10
FILESIZE=4k

srun mdtest \
    -d $TESTPATH \
    -n $NFILES \
    -F \
    -w $FILESIZE \
    -e $FILESIZE \
    -i $NITER \
    -u \
    -P \
    -N 1

So, the main question: does the old EC-HDD default pool noticeably impact the performance of a new NVMe pool?

New, with 3x flash for the root volume (EC flash layout for the test dir):
File creation: 8873 ± 356
File read:     9720 ± 208
File removal:  5024 ± 348

Old, with EC HDD as the root pool (EC flash layout for the test dir):
File creation: 8498 ± 312
File read:    10305 ± 554
File removal:  5460 ± 553

240 tasks, 480000 files, 10 iterations.

I can't detect any meaningful difference; everything is within the variance of the tests. Maybe the small NVMe DB/WAL partitions used for the HDD OSDs eliminate the difference here, but my conclusion is that this was in fact not important. The filesystem won't see a load anywhere close to this. We decided not to deal with the hassle of moving to a new CephFS, so I'm going to continue using the old setup.

On Fri, Sep 12, 2025 at 9:28 PM Frédéric Nass <[email protected]> wrote:

> Hi Mikael,
>
> This might be a long shot, but I have to ask: have you checked the average
> file size on the current CephFS filesystem? Apart from extreme cost
> efficiency or a bad design, the EC choice on HDD could have had a
> legitimate reason in the past. It's probably not enough to make it the
> default data pool, but it might help explain the current design.
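As an aside on the layout setup mentioned above: pointing a test directory at an EC flash pool via a file layout boils down to roughly the following. The pool name, filesystem name, and mount path below are placeholders, not the actual names from this cluster:

```shell
# Assumption: an EC pool named "ec-nvme" already exists and the CephFS
# "cephfs" is mounted at /mnt/cephfs; all names here are placeholders.

# EC pools need overwrites enabled before CephFS can use them for data
ceph osd pool set ec-nvme allow_ec_overwrites true

# Make the pool available to the filesystem as an additional data pool
ceph fs add_data_pool cephfs ec-nvme

# Direct new files under this directory to the EC pool via a file layout
setfattr -n ceph.dir.layout.pool -v ec-nvme /mnt/cephfs/testdir

# Verify the layout took effect
getfattr -n ceph.dir.layout /mnt/cephfs/testdir
```

Note that only files created after the layout is set land in the new pool; existing files stay in whatever pool they were written to.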
> I don't know the exact amount of data or how many users this filesystem
> has, but the safest long-term approach in your situation is probably to
> create a new filesystem and then migrate the data over, as you imagined.
>
> Best regards,
> Frédéric.
>
> --
> Frédéric Nass
> Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
> Try our Ceph Analyzer -- https://analyzer.clyso.com/
> https://clyso.com | [email protected]
>
>
> On Wed, Sep 10, 2025 at 4:09 PM Mikael Öhman <[email protected]> wrote:
>
>> The recommendation for CephFS is to make the default data pool replicated
>> and to add any EC data pools using layouts:
>> https://docs.ceph.com/en/latest/cephfs/createfs/
>>
>> > If erasure-coded pools are planned for file system data, it is best to
>> configure the default as a replicated pool to improve small-object write
>> and read performance when updating backtraces.
>>
>> I have a CephFS that unfortunately wasn't set up like this: they just made
>> an EC pool on the slow HDDs the default, which sounds like the worst-case
>> scenario to me. I would like to add an NVMe data pool to this CephFS, but
>> the recommendation gives me pause as to whether I should instead go
>> through the hassle of creating a new CephFS and migrating all the users.
>>
>> I've tried running some mdtest with small 1k files to see if I could
>> measure this difference, but speed is about the same in my relatively
>> small tests so far. I'm also not sure what impact I should realistically
>> expect here. I don't even know whether creating files counts as "updating
>> backtraces", so my testing might just be pointless.
>>
>> I guess my core question is: just how important is this suggestion to keep
>> the default data pool on replicated NVMe?
>>
>> Setup:
>> - 14 hosts x 42 HDD + 3 NVMe for db/wal, 2*2x25 GbitE bonds
>> - 12 hosts x 10 NVMe, 2*2x100 GbitE bonds
>>
>> Old CephFS setup:
>> - metadata: replicated NVMe
>> - data pools: EC 10+2 on HDD (I plan to add an EC NVMe pool here via
>>   layouts)
>>
>> New CephFS setup as recommended:
>> - metadata: replicated NVMe
>> - data pools: replicated NVMe (default), EC 8+2 on HDD via layout,
>>   EC 8+2 on NVMe via layout.
>>
>> Ceph 18.2.7
>>
>>
>> Best regards, Mikael
>> _______________________________________________
>> ceph-users mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
