[ceph-users] Re: Recomand number of k and m erasure code

2024-01-15 Thread Phong Tran Thanh
Thanks Anthony for your knowledge.

I am very happy

On Sat, Jan 13, 2024 at 23:36 Anthony D'Atri <anthony.da...@gmail.com> wrote:

> There are nuances, but in general the higher the sum of m+k, the lower the
> performance, because *every* operation has to hit that many drives, which
> is especially impactful with HDDs.  So there’s a tradeoff between storage
> efficiency and performance.  And as you’ve seen, larger parity groups
> especially mean slower recovery/backfill.
>
> There’s also a modest benefit to choosing values of m and k that have
> small prime factors, but I wouldn’t worry too much about that.
>
>
> You can find EC efficiency tables on the net:
>
>
>
> https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html
>
>
> I should really add a table to the docs, making a note to do that.
>
> There’s a nice calculator at the OSNEXUS site:
>
> https://www.osnexus.com/ceph-designer
>
>
> The overhead factor is (k+m) / k
>
> So for a 4,2 profile, that’s 6 / 4 == 1.5
>
> For 6,2, 8 / 6 = 1.33
>
> For 10,2, 12 / 10 = 1.2
>
> and so forth.  As k increases, the incremental efficiency gain sees
> diminishing returns, but performance continues to decrease.
>
> Think of m as the number of copies you can lose without losing data, and
> m-1 as the number you can lose / have down and still have data *available*.
>
> I also suggest that the number of failure domains — in your case this
> means OSD nodes — be *at least* k+m+1, so in your case you want k+m to be
> at most 9.
>
> With RBD and many CephFS implementations, we mostly have relatively large
> RADOS objects that are striped over many OSDs.
>
> When using RGW especially, one should attend to average and median S3
> object size.  There’s an analysis of the potential for space amplification
> in the docs so I won’t repeat it here in detail. This sheet
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
>  visually
> demonstrates this.
>
> Basically, for an RGW bucket pool — or for a CephFS data pool storing
> unusually small objects — if you have a lot of S3 objects in the multiples
> of KB size, you waste a significant fraction of underlying storage.  This
> is exacerbated by EC, and the larger the sum of k+m, the more waste.
>
> When people ask me about replication vs EC and EC profile, the first
> question I ask is what they’re storing.  When EC isn’t a non-starter, I
> tend to recommend 4,2 as a profile until / unless someone has specific
> needs and can understand the tradeoffs. This lets you store ~~ 2x the data
> of 3x replication while not going overboard on the performance hit.
>
> If you care about your data, do not set m=1.
>
> If you need to survive the loss of many drives, say if your cluster is
> across multiple buildings or sites, choose a larger value of k.  There are
> people running profiles like 4,6 because they have unusual and specific
> needs.
>
>
>
>
> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh 
> wrote:
>
> Hi ceph user!
>
> I need to determine which erasure code values (k and m) to choose for a
> Ceph cluster with 10 nodes.
>
> I am using the reef version with rbd. Furthermore, when using a larger k,
> for example, ec6+2 and ec4+2, which erasure coding performance is better,
> and what are the criteria for choosing the appropriate erasure coding?
> Please help me
>
> Email: tranphong...@gmail.com
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
>

-- 
Best regards,


*Tran Thanh Phong*

Email: tranphong...@gmail.com
Skype: tranphong079


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recomand number of k and m erasure code

2024-01-15 Thread Phong Tran Thanh
Dear Frank,

"For production systems I would recommend to use EC profiles with at least
m=3" -> can i set min_size with min_size=4 for ec4+2 it's ok for
productions? My data is video from the camera system, it's hot data, write
and delete in some day, 10-15 day ex... Read and write availability is more
important than data loss
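
For reference, a minimal sketch of what that change would look like, assuming a
pool named video_ec created from a 4+2 profile (the name is only a placeholder):

```bash
# Default min_size for a k=4,m=2 pool is k+1 = 5: PGs stop serving I/O once 2 shards are down.
ceph osd pool get video_ec min_size
# Dropping it to k keeps PGs active with 2 shards down, but writes made in that window have
# no parity margin left, so one more failure means data loss - exactly the risk Frank describes.
ceph osd pool set video_ec min_size 4
```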

Thanks Frank

On Mon, Jan 15, 2024 at 16:46 Frank Schilder wrote:

> I would like to add here a detail that is often overlooked:
> maintainability under degraded conditions.
>
> For production systems I would recommend using EC profiles with at least
> m=3. The reason is that if you have a longer problem with a node that is
> down and m=2, it is not possible to do any maintenance on the system without
> losing write access. Don't trust what users claim they are willing to
> tolerate - at least get it in writing. Once a problem occurs they will be
> at your doorstep no matter what they said before.
>
> Similarly, when doing a longer maintenance task with m=2, any disk failure
> during maintenance means losing write access.
>
> Having m=3 or larger allows two (or more) hosts/OSDs to be unavailable
> simultaneously while the service remains fully operational. That can be
> a life saver in many situations.
>
> An additional reason for larger m is systematic failures of drives if your
> vendor doesn't mix drives from different batches and factories. If a batch
> has a systematic production error, failures are no longer statistically
> independent. In such a situation, if one drive fails, the likelihood that
> more drives fail at the same time is very high. Having a larger number of
> parity shards increases the chances of recovering from such events.
>
> For similar reasons I would recommend to deploy 5 MONs instead of 3. My
> life got so much better after having the extra redundancy.
>
> As some background, in our situation we experience(d) somewhat heavy
> maintenance operations including modifying/updating ceph nodes (hardware,
> not software), exchanging Racks, switches, cooling and power etc. This
> required longer downtime and/or moving of servers and moving the ceph
> hardware was the easiest compared with other systems due to the extra
> redundancy bits in it. We had no service outages during such operations.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Anthony D'Atri 
> Sent: Saturday, January 13, 2024 5:36 PM
> To: Phong Tran Thanh
> Cc: ceph-users@ceph.io
> Subject: [ceph-users] Re: Recomand number of k and m erasure code
>
> There are nuances, but in general the higher the sum of m+k, the lower the
> performance, because *every* operation has to hit that many drives, which
> is especially impactful with HDDs.  So there’s a tradeoff between storage
> efficiency and performance.  And as you’ve seen, larger parity groups
> especially mean slower recovery/backfill.
>
> There’s also a modest benefit to choosing values of m and k that have
> small prime factors, but I wouldn’t worry too much about that.
>
>
> You can find EC efficiency tables on the net:
>
>
>
> https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html
>
>
> I should really add a table to the docs, making a note to do that.
>
> There’s a nice calculator at the OSNEXUS site:
>
> https://www.osnexus.com/ceph-designer
>
>
> The overhead factor is (k+m) / k
>
> So for a 4,2 profile, that’s 6 / 4 == 1.5
>
> For 6,2, 8 / 6 = 1.33
>
> For 10,2, 12 / 10 = 1.2
>
> and so forth.  As k increases, the incremental efficiency gain sees
> diminishing returns, but performance continues to decrease.
>
> Think of m as the number of copies you can lose without losing data, and
> m-1 as the number you can lose / have down and still have data *available*.
>
> I also suggest that the number of failure domains — in your case this
> means OSD nodes — be *at least* k+m+1, so in your case you want k+m to be
> at most 9.
>
> With RBD and many CephFS implementations, we mostly have relatively large
> RADOS objects that are striped over many OSDs.
>
> When using RGW especially, one should attend to average and median S3
> object size.  There’s an analysis of the potential for space amplification
> in the docs so I won’t repeat it here in detail. This sheet
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
> visually demonstrates this.
>
> Basically, for an RGW bucket pool — or for a CephFS data pool storing
> unusually small objects — if you have a lot of S3 objects in the multiples
> of KB size, you waste a significant fraction of underlying storage.  This
> is exacerbated by EC, and the larger the sum of k+m, the more waste.
>
> When people ask me about replication vs EC and EC profile, the first
> question I ask is what they’re storing.  When EC isn’t a non-starter, I
> tend to recommend 4,2 as a profile until / unless someone 

[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Anthony D'Atri
by “RBD for cloud”, do you mean VM / container general-purpose volumes on 
which a filesystem is usually built?  Or large archive / backup volumes that 
are read and written sequentially without much concern for latency or 
throughput?

How many of those ultra-dense chassis in a cluster?  Are all 60 off a single 
HBA?

I’ve experienced RGW clusters built from 4x 90 slot ultra-dense chassis, each 
of which had 2x server trays, so effectively 2x 45 slot chassis bound together. 
 The bucket pool was EC 3,2 or 4,2.  The motherboard was …. odd, as a certain 
chassis vendor had a thing for them at a certain point in time.  With only 12 DIMM 
slots each, they were chronically short on RAM and the single HBA was a 
bottleneck.  Performance was acceptable for the use-case …. at first.  As the 
cluster filled up and got busier, that was no longer the case.  And these were 
8TB capped drives.  Not all slots were filled, at least initially.

The index pool was on separate 1U servers with SATA SSDs.

There were hotspots, usually relatively small objects that clients hammered on. 
 A single OSD restarting and recovering would tank the API; we found it better 
to destroy and redeploy it.   Expanding faster than data was coming in was a 
challenge, as we had to throttle the heck out of the backfill to avoid rampant 
slow requests and API impact.

QLC with a larger number of OSD node failure domains was a net win in that RAS 
was dramatically increased, and expensive engineer-hours weren’t soaked up 
fighting performance and availability issues.  

ymmv, especially if one’s organization has unreasonably restrictive purchasing 
policies, row after row of empty DC racks, etc.  I’ve suffered LFF spinners — 
just 3 / 4 TB — misused for  OpenStack Cinder and Glance.  Filestore with 
(wince) colocated journals * with 3R pools — EC for RBD was not yet a thing, 
else we would have been forced to make it even worse.  The stated goal of the 
person who specked the hardware was for every instance to have the performance 
of its own 5400 RPM HDD.  Three fallacies there:  1) that anyone would consider 
that acceptable 2) that it would be sustainable during heavy usage or 
backfill/recovery and especially 3) that 450 / 3 = 2000.  It was just 
miserable.  I suspect that your use-case is different.  If spinners work for 
your purposes and you don’t need IOPs or the ability to provision SSDs down the 
road, more power to you.




* Which tickled a certain HDD mfg’s design flaws in a manner that substantially 
risked data availability and durability, in turn directly costing the 
organization substantial user dissatisfaction and hundreds of thousands of 
dollars.

> 
> These kinds of statements make me at least ask questions. Dozens of 14TB HDDs 
> have worked reasonably well for us for four years of RBD for cloud, and 
> hundreds of 16TB HDDs have satisfied our requirements for two years of RGW 
> operations, such that we are deploying 22TB HDDs in the next batch. It 
> remains to be seen how well 60 disk SAS-attached JBOD chassis work, but we 
> believe we have an effective use case.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Gregory Orange

On 12/1/24 22:32, Drew Weaver wrote:

So we were going to replace a Ceph cluster with some hardware we had
laying around using SATA HBAs but I was told that the only right way to
build Ceph in 2023 is with direct attach NVMe.


These kinds of statements make me at least ask questions. Dozens of 14TB 
HDDs have worked reasonably well for us for four years of RBD for cloud, 
and hundreds of 16TB HDDs have satisfied our requirements for two years 
of RGW operations, such that we are deploying 22TB HDDs in the next 
batch. It remains to be seen how well 60 disk SAS-attached JBOD chassis 
work, but we believe we have an effective use case.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Anthony D'Atri


> 
> Now that you say it's just backups/archival, QLC might be excessive for
> you (or a great fit if the backups are churned often).

PLC isn’t out yet, though, and probably won’t have a conventional block 
interface.

> USD70/TB is the best public large-NVME pricing I'm aware of presently; for QLC
> 30TB drives. Smaller capacity drives do get down to USD50/TB.
> 2.5" SATA spinning disk is USD20-30/TB.

2.5” spinners top out at 5TB last I checked, and a certain chassis vendor only 
resells half that capacity.

But as I’ve written, *drive* unit economics are myopic.  We don’t run 
palletloads of drives, we run *servers* with drive bays, admin overhead, switch 
ports, etc., that take up RUs, eat amps, and fart out watts.

> PCIe bandwidth: this goes for NVME as well as SATA/SAS.
> I won't name the vendor, but I saw a weird NVME server with 50+ drive
> slots.  Each drive slot was x4 lane width but had a number of PCIe
> expanders in the path from the motherboard, so it you were trying to max
> it out, simultaneously using all the drives, each drive only only got
> ~1.7x usable PCIe4.0 lanes.

I’ve seen a 2U server with … 102 IIRC E1.L bays, but it was only Gen3.

> Compare that to the Supermicro servers I suggested: The AMD variants use
> a H13SSF motherboard, which provides 64x PCIe5.0 lanes, split into 32x
> E3.S drive slots, and each drive slot has 4x PCIe 4.0, no
> over-subscription.

Having the lanes and filling them are two different things though.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph pg mark_unfound_lost delete results in confused ceph

2024-01-15 Thread Oliver Dzombic

Hi,

after osd.15 died in the wrong moment there is:

#ceph health detail

[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg stale
    pg 10.17 is stuck stale for 3d, current state stale+active+undersized+degraded, last acting [15]
[WRN] PG_DEGRADED: Degraded data redundancy: 172/57063399 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
    pg 10.17 is stuck undersized for 3d, current state stale+active+undersized+degraded, last acting [15]


which will never resolve, as there is no osd.15 anymore.

So a

ceph pg 10.17 mark_unfound_lost delete

was executed.


ceph seems to be a bit confused about pg 10.17 now:

While this worked before, it's not working anymore:
# ceph pg 10.17 query
Error ENOENT: i don't have pgid 10.17


And while this previously pointed to 15, the map now shows 5 and 6 (which
is correct):

# ceph pg map 10.17
osdmap e14425 pg 10.17 (10.17) -> up [5,6] acting [5,6]



According to ceph health, ceph assumes that osd.15 is still somehow in 
charge.


The pg map seems to think that 10.17 is on osd.5 and osd.6

But pg 10.17 does not seem to actually exist, as a query fails.
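
For reference, a few read-only checks (a rough sketch, reusing the IDs from the 
output above) that may show what the cluster still records:

```bash
ceph osd find 15                 # errors with ENOENT if osd.15 is really gone from the osdmap
ceph osd tree | grep -w osd.15   # whether the id still lingers in the CRUSH map
ceph pg dump_stuck stale         # which PGs the mgr still reports as stale
ceph pg map 10.17                # current up/acting set, as shown above: [5,6]
```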

Any idea what's going wrong and how to fix this?

Thank you!

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
Layer7 Networks

mailto:i...@layer7.net

Address:

Layer7 Networks GmbH
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 96293, Hanau District Court
Managing Director: Oliver Dzombic
VAT ID: DE259845632


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Peter Grandi
>> So we were going to replace a Ceph cluster with some hardware we had
>> laying around using SATA HBAs but I was told that the only right way
>> to build Ceph in 2023 is with direct attach NVMe.

My impression are somewhat different:

* Nowadays it is rather more difficult to find 2.5in SAS or SATA
  "Enterprise" SSDs than most NVMe types. NVMe as a host bus
  also has much greater bandwidth than SAS or SATA, but Ceph is
  mostly about IOPS rather than single-device bandwidth. So in
  general willing or less willing one has got to move to NVMe.

* Ceph was designed (and most people have forgotten it) for many
  small capacity 1-OSD cheap servers, and lots of them, but
  unfortunately it is not easy to find small cheap "enterprise"
  SSD servers. In part because many people rather unwisely use
  capacity per server-price as a figure of merit, most NVMe
  servers have many slots, which means either RAID-ing devices
  into a small number of large OSDs, which goes against all Ceph
  stands for, or running many OSD daemons on one system, which
  work-ish but is not best.

>> Does anyone have any recommendation for a 1U barebones server
>> (we just drop in ram disks and cpus) with 8-10 2.5" NVMe bays
>> that are direct attached to the motherboard without a bridge
>> or HBA for Ceph specifically?

> If you're buying new, Supermicro would be my first choice for
> vendor based on experience.
> https://www.supermicro.com/en/products/nvme

Indeed, SuperMicro does them fairly well, and there are also
GigaByte and Tyan I think; I have not yet seen Intel-based models.

> You said 2.5" bays, which makes me think you have existing
> drives. There are models to fit that, but if you're also
> considering new drives, you can get further density in E1/E3

BTW "NVMe" is a bus specification (something not too different
from SCSI-over-PCIe), and there are several different physical
specifications, like 2.5in U.2 (SFF-8639), 2.5in U.3
(SFF-TA-1001), and various types of EDSFF (SFF-TA-1006,7,8). U.3
is still difficult to find but its connector supports SATA, SAS
and NVMe U.2; I have not yet seen EDSFF boxes actually available
retail without enormous delivery times, I guess the big internet
companies buy all the available production.

https://nvmexpress.org/wp-content/uploads/Session-4-NVMe-Form-Factors-Developer-Day-SSD-Form-Factors-v8.pdf
https://media.kingston.com/kingston/content/ktc-content-nvme-general-ssd-form-factors-graph-en-3.jpg
https://media.kingston.com/kingston/pdf/ktc-article-understanding-ssd-technology-en.pdf
https://www.snia.org/sites/default/files/SSSI/OCP%20EDSFF%20JM%20Hands.pdf

> The only caveat is that you will absolutely want to put a
> better NIC in these systems, because 2x10G is easy to saturate
> with a pile of NVME.

That's one reason why Ceph was designed for many small 1-OSD
servers (ideally distributed across several racks) :-). Note: to
maximize chances of many-to-many traffic instead of many-to-one.
Anyhow Ceph again is all about lots of IOPS more than
bandwidth, but if you need bandwidth nowadays many 10Gb NICs
support 25Gb/s too, and 40Gb/s and 100Gb/s are no longer that
expensive (but the cables are horrible).
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Robin H. Johnson
On Mon, Jan 15, 2024 at 03:21:11PM +, Drew Weaver wrote:
> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s 
> because we don't really care about performance as this is just used as a copy 
> point for backups/archival but the current Ceph cluster we have [Which is 
> based on HDDs attached to Dell RAID controllers with each disk in RAID-0 and 
> works just fine for us] is on EL7 and that is going to be EOL soon. So I 
> thought it might be better on the new cluster to use HBAs instead of having 
> the OSDs just be single disk RAID-0 volumes because I am pretty sure that's 
> the least good scenario whether or not it has been working for us for like 8 
> years now.
> 
> So I asked on the list for recommendations and also read on the website and 
> it really sounds like the only "right way" to run Ceph is by directly 
> attaching disks to a motherboard. I had thought that HBAs were okay before 
> but I am probably confusing that with ZFS/BSD or some other equally 
> hyperspecific requirement. The other note was about how using NVMe seems to 
> be the only right way now too.
> 
> I would've rather just stuck to SATA but I figured if I was going to have to 
> buy all new servers that direct attach the SATA ports right off the 
> motherboards to a backplane I may as well do it with NVMe (even though the 
> price of the media will be a lot higher).
> 
> It would be cool if someone made NVMe drives that were cost competitive and 
> had similar performance to hard drives (meaning, not super expensive but not 
> lightning fast either) because the $/GB on datacenter NVMe drives like 
> Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

I think as a collective, the mailing list didn't do enough to ask about
your use case for the Ceph cluster earlier in the thread.

Now that you say it's just backups/archival, QLC might be excessive for
you (or a great fit if the backups are churned often).

USD70/TB is the best public large-NVME pricing I'm aware of presently; for QLC
30TB drives. Smaller capacity drives do get down to USD50/TB.
2.5" SATA spinning disk is USD20-30/TB.
All of those are much higher than the USD15-20/TB for 3.5" spinning disk
made for 24/7 operation.

Maybe it would also help as a community to explain "why" on the
perceptions of "right way".

It's a tradeoff in what you're doing, you don't want to
bottleneck/saturate critical parts of the system.

PCIe bandwidth: this goes for NVME as well as SATA/SAS.
I won't name the vendor, but I saw a weird NVME server with 50+ drive
slots.  Each drive slot was x4 lane width but had a number of PCIe
expanders in the path from the motherboard, so if you were trying to max
it out, simultaneously using all the drives, each drive only got
~1.7x usable PCIe4.0 lanes.

Compare that to the Supermicro servers I suggested: The AMD variants use
a H13SSF motherboard, which provides 64x PCIe5.0 lanes, split into 32x
E3.S drive slots, and each drive slot has 4x PCIe 4.0, no
over-subscription.

On that same Supermicro system, how do you get the data out? There are
two PCIe 5.0 x16 slots for your network cards, so you only need to
saturate at most HALF the drives to saturate the network.

Taking this back to the SATA/SAS servers: if you had a 16-port HBA,
with only PCIe 2.0 x8, theoretical max 4GB/sec. Say you filled it with
Samsung QVO drives, and efficiently used them for 560MB/sec.
The drives can collectively get almost 9GB/sec.
=> probably worthwhile to buy a better HBA.
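
For anyone checking the math, those figures pencil out roughly as follows (a
trivial sketch; the per-lane and per-drive numbers are the round figures used above):

```bash
echo '8 * 0.5'   | bc   # PCIe 2.0 x8, ~0.5 GB/s usable per lane: ~4 GB/s total
echo '16 * 0.56' | bc   # 16 drives at ~560 MB/s each: ~8.96 GB/s aggregate
```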

On the HBA side, some of the controllers, in any RAID mode (including
single-disk RAID0), cannot handle saturating every port at the same
time: the little CPU is just doing too much work. Those same controllers
in a passthrough/IT mode are fine because the CPU doesn't do work
anymore.

This turned out more rambling than I intended, but how can we capture
the 'why' of the recommendations into something usable by the community,
and have everybody be able to read that (esp. for those that don't want
to engage on a mailing list).

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Anthony D'Atri


> Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s 
> because we don't really care about performance

That is important context.

> as this is just used as a copy point for backups/archival but the current 
> Ceph cluster we have [Which is based on HDDs attached to Dell RAID 
> controllers with each disk in RAID-0 and works just fine for us]

The H330?  You can set passthrough / JBOD / HBA personality and avoid the RAID0 
dance.

> is on EL7 and that is going to be EOL soon. So I thought it might be better 
> on the new cluster to use HBAs instead of having the OSDs just be single disk 
> RAID-0 volumes because I am pretty sure that's the least good scenario 
> whether or not it has been working for us for like 8 years now.

See above.

> So I asked on the list for recommendations and also read on the website and 
> it really sounds like the only "right way" to run Ceph is by directly 
> attaching disks to a motherboard

That isn’t quite what I meant.

If one is specking out *new* hardware:

* HDDs are a false economy
* SATA / SAS SSDs hobble performance for little or no cost savings over NVMe
* RAID HBAs are fussy and a waste of money in 2023


>  I had thought that HBAs were okay before

By HBA I suspect you mean a non-RAID HBA?

> but I am probably confusing that with ZFS/BSD or some other equally 
> hyperspecific requirement.

ZFS indeed prefers as little as possible between it and the drives.  The 
benefits for Ceph are not identical but very congruent.

> The other note was about how using NVMe seems to be the only right way now 
> too.

If we predicate that HDDs are a dead end, then that leaves us with SAS/SATA SSD 
vs NVMe SSD.

SAS is all but dead, and carries a price penalty.
SATA SSDs are steadily declining in the market.  5-10 years from now I suspect 
that no more than one manufacturer of enterprise-class SATA SSDs will remain.  
The future is PCI. SATA SSDs don’t save any money over NVMe SSDs, and 
additionally require some sort of HBA, be it an add-in card or on the 
motherboard.  SATA and NVMe SSDs use the same NAND, just with a different 
interface.


> I would've rather just stuck to SATA but I figured if I was going to have to 
> buy all new servers that direct attach the SATA ports right off the 
> motherboards to a backplane

On-board SATA chips may be relatively weak but I don’t know much about current 
implementations.

> I may as well do it with NVMe (even though the price of the media will be a 
> lot higher).

NVMe SSDs shouldn’t cost significantly more than SATA SSDs.  Hint:  certain 
tier-one chassis manufacturers mark both the fsck up.  You can get a better 
warranty and pricing by buying drives from a VAR.

> It would be cool if someone made NVMe drives that were cost competitive and 
> had similar performance to hard drives (meaning, not super expensive but not 
> lightning fast either) because the $/GB on datacenter NVMe drives like 
> Kioxia, etc is still pretty far away from what it is for HDDs (obviously).

It’s a trap!  Which is to say, that the $/GB really isn’t far away, and in fact 
once you step back to TCO from the unit economics of the drive in insolation, 
the HDDs often turn out to be *more* expensive.

Pore through this:  https://www.snia.org/forums/cmsi/programs/TCOcalc

* $/IOPS are higher for any HDD compared to NAND
* HDDs are available up to what, 22TB these days?  With the same tired SATA 
interface as when they were 2TB.  That’s rather a bottleneck.  We see HDD 
clusters limiting themselves to 8-10TB HDDs all the time; in fact AIUI RHCS 
stipulates no larger than 10TB.  Feed that into the equation and the TCO 
changes a bunch
* HDDs not only hobble steady-state performance, but under duress — expansion, 
component failure, etc., the impact to client operations will be higher and 
recovery to desired redundancy will be much longer.  I’ve seen a cluster — 
especially when using EC — take *4 weeks* to weight an 8TB HDD OSD up or down.  
Consider the operational cost and risk of that.  The SNIA calc has a 
performance multiplier that accounts for this.
* A SATA chassis is stuck with SATA, 5-10 years from now that will be 
increasingly limiting, especially if you go with LFF drives
* RUs cost money.  A 1U LFF server can hold what, at most 88TB raw when using 
HDDs?  With 60TB SSDs (*) one can fit 600TB of raw space into the same RU.






* If they meet your needs



> 
> Anyway thanks.
> -Drew
> 
> 
> 
> 
> 
> -Original Message-
> From: Robin H. Johnson  
> Sent: Sunday, January 14, 2024 5:00 PM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: recommendation for barebones server with 8-12 
> direct attach NVMe?
> 
> On Fri, Jan 12, 2024 at 02:32:12PM +, Drew Weaver wrote:
>> Hello,
>> 
>> So we were going to replace a Ceph cluster with some hardware we had 
>> laying around using SATA HBAs but I was told that the only right way 
>> to build Ceph in 2023 is with direct attach NVMe.
>> 
>> Does anyone have any 

[ceph-users] Re: Debian 12 (bookworm) / Reef 18.2.1 problems

2024-01-15 Thread Chris Palmer

Updates on both problems:

Problem 1
--

The bookworm/reef cephadm package needs updating to accommodate the last 
change in /usr/share/doc/adduser/NEWS.Debian.gz:


  System user home defaults to /nonexistent if --home is not specified.
  Packages that call adduser to create system accounts should explicitly
  specify a location for /home (see Lintian check
  maintainer-script-lacks-home-in-adduser).

i.e. when creating the cephadm user as a system user it needs to 
explicitly specify the expected home directory of /home/cephadm.


A workaround is to manually create the user+directory before installing 
ceph.
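
A rough sketch of that workaround (hedged: the exact flags should be checked 
against what the cephadm postinst expects, and the shell may differ):

```bash
# Create the system user with a real home directory before installing, so the
# package's "mkdir /home/cephadm/.ssh" step no longer fails.
adduser --system --home /home/cephadm --shell /bin/bash cephadm
apt install ceph
```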



Problem 2
--

This is a complex set of interactions that prevent many mgr modules 
(including dashboard) from running. It is NOT debian-specific and will 
eventually bite other distributions as well. At the moment Ceph PR54710 
looks the most promising fix (full or partial). Detail is spread across 
the following:


https://github.com/pyca/cryptography/issues/9016
https://github.com/ceph/ceph/pull/54710
https://tracker.ceph.com/issues/63529
https://forum.proxmox.com/threads/ceph-warning-post-upgrade-to-v8.129371/page-5
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1055212
https://github.com/pyca/bcrypt/issues/694



On 12/01/2024 14:29, Chris Palmer wrote:

More info on problem 2:

When starting the dashboard, the mgr seems to try to initialise 
cephadm, which in turn uses python crypto libraries that lead to the 
python error:


$ ceph crash info 
2024-01-12T11:10:03.938478Z_2263d2c8-8120-417e-84bc-bb01f5d81e52

{
    "backtrace": [
    "  File \"/usr/share/ceph/mgr/cephadm/__init__.py\", line 1, 
in \n    from .module import CephadmOrchestrator",
    "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 15, in 
\n    from cephadm.service_discovery import ServiceDiscovery",
    "  File \"/usr/share/ceph/mgr/cephadm/service_discovery.py\", 
line 20, in \n    from cephadm.ssl_cert_utils import SSLCerts",
    "  File \"/usr/share/ceph/mgr/cephadm/ssl_cert_utils.py\", 
line 8, in \n    from cryptography import x509",
    "  File 
\"/lib/python3/dist-packages/cryptography/x509/__init__.py\", line 6, 
in \n    from cryptography.x509 import certificate_transparency",
    "  File 
\"/lib/python3/dist-packages/cryptography/x509/certificate_transparency.py\", 
line 10, in \n    from cryptography.hazmat.bindings._rust 
import x509 as rust_x509",
    "ImportError: PyO3 modules may only be initialized once per 
interpreter process"

    ],
    "ceph_version": "18.2.1",
    "crash_id": 
"2024-01-12T11:10:03.938478Z_2263d2c8-8120-417e-84bc-bb01f5d81e52",

    "entity_name": "mgr.x01",
    "mgr_module": "cephadm",
    "mgr_module_caller": "PyModule::load_subclass_of",
    "mgr_python_exception": "ImportError",
    "os_id": "12",
    "os_name": "Debian GNU/Linux 12 (bookworm)",
    "os_version": "12 (bookworm)",
    "os_version_id": "12",
    "process_name": "ceph-mgr",
    "stack_sig": 
"7815ad73ced094695056319d1241bf7847da19b4b0dfee7a216407b59a7e3d84",

    "timestamp": "2024-01-12T11:10:03.938478Z",
    "utsname_hostname": "x01.xxx.xxx",
    "utsname_machine": "x86_64",
    "utsname_release": "6.1.0-17-amd64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 
(2023-12-30)"

}


On 12/01/2024 12:39, Chris Palmer wrote:
I was delighted to see the native Debian 12 (bookworm) packages turn 
up in Reef 18.2.1.


We currently run a number of ceph clusters on Debian11 (bullseye) / 
Quincy 17.2.7. These are not cephadm-managed.


I have attempted to upgrade a test cluster, and it is not going well. 
Since Quincy only supports bullseye and Reef only supports bookworm, we 
are reinstalling from bare metal. However I don't think either of 
these two problems are related to that.


Problem 1
--

A simple "apt install ceph" goes most of the way, then errors with

Setting up cephadm (18.2.1-1~bpo12+1) ...
usermod: unlocking the user's password would result in a passwordless 
account.
You should set a password with usermod -p to unlock this user's 
password.
mkdir: cannot create directory ‘/home/cephadm/.ssh’: No such file or 
directory

dpkg: error processing package cephadm (--configure):
 installed cephadm package post-installation script subprocess 
returned error exit status 1

dpkg: dependency problems prevent configuration of ceph-mgr-cephadm:
 ceph-mgr-cephadm depends on cephadm; however:
  Package cephadm is not configured yet.

dpkg: error processing package ceph-mgr-cephadm (--configure):
 dependency problems - leaving unconfigured


The two cephadm-related packages are then left in an error state, 
which apt tries to continue each time it is run.


The cephadm user has a login directory of /nonexistent, however the 
cephadm --configure script is trying to use /home/cephadm (as it was 
on Quincy/bullseye).


So, we aren't using cephadm, and decided to keep going as the other 
packages were 

[ceph-users] Re: recommendation for barebones server with 8-12 direct attach NVMe?

2024-01-15 Thread Drew Weaver
Oh, well what I was going to do was just use SATA HBAs on PowerEdge R740s 
because we don't really care about performance as this is just used as a copy 
point for backups/archival but the current Ceph cluster we have [Which is based 
on HDDs attached to Dell RAID controllers with each disk in RAID-0 and works 
just fine for us] is on EL7 and that is going to be EOL soon. So I thought it 
might be better on the new cluster to use HBAs instead of having the OSDs just 
be single disk RAID-0 volumes because I am pretty sure that's the least good 
scenario whether or not it has been working for us for like 8 years now.

So I asked on the list for recommendations and also read on the website and it 
really sounds like the only "right way" to run Ceph is by directly attaching 
disks to a motherboard. I had thought that HBAs were okay before but I am 
probably confusing that with ZFS/BSD or some other equally hyperspecific 
requirement. The other note was about how using NVMe seems to be the only right 
way now too.

I would've rather just stuck to SATA but I figured if I was going to have to 
buy all new servers that direct attach the SATA ports right off the 
motherboards to a backplane I may as well do it with NVMe (even though the 
price of the media will be a lot higher).

It would be cool if someone made NVMe drives that were cost competitive and had 
similar performance to hard drives (meaning, not super expensive but not 
lightning fast either) because the $/GB on datacenter NVMe drives like Kioxia, 
etc is still pretty far away from what it is for HDDs (obviously).

Anyway thanks.
-Drew





-Original Message-
From: Robin H. Johnson  
Sent: Sunday, January 14, 2024 5:00 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: recommendation for barebones server with 8-12 direct 
attach NVMe?

On Fri, Jan 12, 2024 at 02:32:12PM +, Drew Weaver wrote:
> Hello,
> 
> So we were going to replace a Ceph cluster with some hardware we had 
> laying around using SATA HBAs but I was told that the only right way 
> to build Ceph in 2023 is with direct attach NVMe.
> 
> Does anyone have any recommendation for a 1U barebones server (we just 
> drop in ram disks and cpus) with 8-10 2.5" NVMe bays that are direct 
> attached to the motherboard without a bridge or HBA for Ceph 
> specifically?
If you're buying new, Supermicro would be my first choice for vendor based on 
experience.
https://www.supermicro.com/en/products/nvme

You said 2.5" bays, which makes me think you have existing drives.
There are models to fit that, but if you're also considering new drives, you 
can get further density in E1/E3

The only caveat is that you will absolutely want to put a better NIC in these 
systems, because 2x10G is easy to saturate with a pile of NVME.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85 GnuPG FP : 7D0B3CEB 
E9B85B1F 825BCECF EE05E6F6 A48F6136
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] erasure-code-lrc Questions regarding repair

2024-01-15 Thread Ansgar Jazdzewski
hi folks,

I currently test erasure-code-lrc (1) in a multi-room multi-rack setup.
The idea is to be able to repair disk failures within the rack
itself to lower bandwidth usage.

```bash
ceph osd erasure-code-profile set lrc_hdd \
plugin=lrc \
crush-root=default \
crush-locality=rack \
crush-failure-domain=host \
crush-device-class=hdd \
mapping=__D__D__D__D \
layers='
[
[ "_cD_cD_cD_cD", "" ],
[ "cDD_", "" ],
[ "___cDD__", "" ],
[ "__cDD___", "" ],
[ "_cDD", "" ],
]' \
crush-steps='[
[ "choose", "room", 4 ],
[ "choose", "rack", 1 ],
[ "chooseleaf", "host", 7 ],
]'
```

The rule picks 4 out of 5 rooms and keeps the PG in one rack, as expected!

However it looks like the PG will not move to another Room if the PG
is undersized or the entire Room or Rack is down!

Questions:
* Am I missing something to allow LRC PGs to move across racks/rooms for repair?
* Is it even possible to build such a 'multi-stage' crushmap?
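
One way to investigate, as a hedged sketch (the pool is assumed to have been
created from the lrc_hdd profile above; the rule id comes from ceph osd crush
rule dump):

```bash
ceph osd crush rule ls                      # find the rule generated for the LRC pool
ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule <rule-id> --num-rep 12 \
    --show-mappings --show-bad-mappings
# Re-running the test on a copy of the map with one room's hosts removed
# (crushtool --remove-item ...) should show whether these crush-steps can
# still place all 12 shards when that room is down.
```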

Thanks for your help,
Ansgar

1) https://docs.ceph.com/en/quincy/rados/operations/erasure-code-jerasure/
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recomand number of k and m erasure code

2024-01-15 Thread Frank Schilder
I would like to add here a detail that is often overlooked: maintainability 
under degraded conditions.

For production systems I would recommend using EC profiles with at least m=3. 
The reason is that if you have a longer problem with a node that is down and 
m=2, it is not possible to do any maintenance on the system without losing 
write access. Don't trust what users claim they are willing to tolerate - at 
least get it in writing. Once a problem occurs they will be at your doorstep 
no matter what they said before.

Similarly, when doing a longer maintenance task with m=2, any disk failure 
during maintenance means losing write access.

Having m=3 or larger allows two (or more) hosts/OSDs to be unavailable 
simultaneously while the service remains fully operational. That can be a 
life saver in many situations.

An additional reason for larger m is systematic failures of drives if your 
vendor doesn't mix drives from different batches and factories. If a batch has 
a systematic production error, failures are no longer statistically 
independent. In such a situation, if one drive fails, the likelihood that more 
drives fail at the same time is very high. Having a larger number of parity 
shards increases the chances of recovering from such events.

For similar reasons I would recommend to deploy 5 MONs instead of 3. My life 
got so much better after having the extra redundancy.

As some background, in our situation we experience(d) somewhat heavy 
maintenance operations including modifying/updating ceph nodes (hardware, not 
software), exchanging Racks, switches, cooling and power etc. This required 
longer downtime and/or moving of servers and moving the ceph hardware was the 
easiest compared with other systems due to the extra redundancy bits in it. We 
had no service outages during such operations.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Anthony D'Atri 
Sent: Saturday, January 13, 2024 5:36 PM
To: Phong Tran Thanh
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Recomand number of k and m erasure code

There are nuances, but in general the higher the sum of m+k, the lower the 
performance, because *every* operation has to hit that many drives, which is 
especially impactful with HDDs.  So there’s a tradeoff between storage 
efficiency and performance.  And as you’ve seen, larger parity groups 
especially mean slower recovery/backfill.

There’s also a modest benefit to choosing values of m and k that have small 
prime factors, but I wouldn’t worry too much about that.


You can find EC efficiency tables on the net:


https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html


I should really add a table to the docs, making a note to do that.

There’s a nice calculator at the OSNEXUS site:

https://www.osnexus.com/ceph-designer


The overhead factor is (k+m) / k

So for a 4,2 profile, that’s 6 / 4 == 1.5

For 6,2, 8 / 6 = 1.33

For 10,2, 12 / 10 = 1.2

and so forth.  As k increases, the incremental efficiency gain sees diminishing 
returns, but performance continues to decrease.

Think of m as the number of copies you can lose without losing data, and m-1 as 
the number you can lose / have down and still have data *available*.

I also suggest that the number of failure domains — in your case this means 
OSD nodes — be *at least* k+m+1, so in your case you want k+m to be at most 9.

With RBD and many CephFS implementations, we mostly have relatively large RADOS 
objects that are striped over many OSDs.

When using RGW especially, one should attend to average and median S3 object 
size.  There’s an analysis of the potential for space amplification in the docs 
so I won’t repeat it here in detail. This sheet 
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253
 visually demonstrates this.

Basically, for an RGW bucket pool — or for a CephFS data pool storing unusually 
small objects — if you have a lot of S3 objects in the multiples of KB size, 
you waste a significant fraction of underlying storage.  This is exacerbated by 
EC, and the larger the sum of k+m, the more waste.
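
As a rough worked example (assuming a 4 KiB BlueStore allocation unit, which is an 
assumption about the deployment): a 16 KiB S3 object on a 4+2 pool splits into 4 KiB 
data chunks, so 6 chunks of 4 KiB hit the disks, 24 KiB raw, the expected 1.5x. The 
same object on a 10+2 pool splits into 1.6 KiB chunks that each round up to the 4 KiB 
allocation unit, so 12 chunks of 4 KiB, 48 KiB raw: 3x instead of the nominal 1.2x.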

When people ask me about replication vs EC and EC profile, the first question I 
ask is what they’re storing.  When EC isn’t a non-starter, I tend to recommend 
4,2 as a profile until / unless someone has specific needs and can understand 
the tradeoffs. This lets you store ~~ 2x the data of 3x replication while not 
going overboard on the performance hit.
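
As a concrete starting point, a minimal sketch of such a 4,2 setup (pool names, PG 
counts, and the replicated rbd pool are placeholders, not tuning advice):

```bash
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ec42_data 128 128 erasure ec42
ceph osd pool set ec42_data allow_ec_overwrites true   # needed for RBD/CephFS data on EC pools
# RBD keeps image metadata in a replicated pool and places the data on the EC pool:
rbd create --size 100G --data-pool ec42_data rbd/test-image
```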

If you care about your data, do not set m=1.

If you need to survive the loss of many drives, say if your cluster is across 
multiple buildings or sites, choose a larger value of k.  There are people 
running profiles like 4,6 because they have unusual and specific needs.




> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh  wrote:
>
> Hi ceph user!
>
> I need to determine which erasure code