[ceph-users] Re: Full list of metrics provided by ceph exporter daemon

2024-06-20 Thread Anthony D'Atri


> 
> That's not the answer I expected to see :( Are you suggesting enabling all
> possible daemons of Ceph just to obtain a full list of metrics?

Depends on whether what you're after is potential metrics or the ones currently surfaced; I 
assumed you meant the latter.

> Isn't there any list in ceph repo or in documentation with all metrics 
> exposed?

Use the source, Luke?  I don't know of a documented list.
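
If you want to enumerate what a running ceph-exporter currently surfaces, something 
like this works (assuming the default ceph-exporter port of 9926 -- adjust to your 
deployment) to list just the unique metric names:

  curl -s http://host:9926/metrics | grep -v '^#' | sed 's/[{ ].*//' | sort -u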

> Another question: the ceph_rgw_qactive metric changed its instance_id from
> numeric format to a letter format. Is there any pull request for that in
> Ceph, or is it a Rook initiative? Example:

I believe recently there was a restructuring of some of the RGW metrics.

> 
> metrics from prometheus module:
> ceph_rgw_qactive{instance_id="4134960"} 0.0
> ceph_rgw_qactive{instance_id="4493203"} 0.0
> 
> metrics from ceph-exporter:
> ceph_rgw_qactive{instance_id="a"} 0
> ceph_rgw_qactive{instance_id="a"} 0
> 
> 
> Thu, 20 Jun 2024 at 20:09, Anthony D'Atri :
> 
>> curl http://endpoint:port/metrics
>> 
>>> On Jun 20, 2024, at 10:15, Peter Razumovsky 
>> wrote:
>>> 
>>> Hello!
>>> 
>>> I'm using Ceph Reef with Rook v1.13 and want to find a full list
>>> of metrics exported by the brand-new ceph-exporter daemon. We found that some
>>> metrics have been changed after moving from the prometheus module metrics to a
>>> separate daemon.
>>> 
>>> We used the method described in [1] to give us time to mitigate all risks
>>> during the transfer to the new daemon. Now we want to obtain a full list of
>>> ceph-exporter metrics to compare old and new metrics and to mitigate
>>> potential risks of losing monitoring data and breaking alerting systems. We
>>> found some examples of metrics [2], but that list is not complete, so we would
>>> appreciate it if someone could point us to the full one.
>>> 
>>> [1]
>>> 
>> https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
>>> 
>>> [2] https://docs.ceph.com/en/latest/monitoring/
>>> 
>>> --
>>> Best regards,
>>> Peter Razumovsky
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> 
> 
> -- 
> Best regards,
> Peter Razumovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Full list of metrics provided by ceph exporter daemon

2024-06-20 Thread Anthony D'Atri
curl http://endpoint:port/metrics

> On Jun 20, 2024, at 10:15, Peter Razumovsky  wrote:
> 
> Hello!
> 
> I'm using Ceph Reef with Rook v1.13 and want to find a full list
> of metrics exported by the brand-new ceph-exporter daemon. We found that some
> metrics have been changed after moving from the prometheus module metrics to a
> separate daemon.
> 
> We used the method described in [1] to give us time to mitigate all risks
> during the transfer to the new daemon. Now we want to obtain a full list of
> ceph-exporter metrics to compare old and new metrics and to mitigate
> potential risks of losing monitoring data and breaking alerting systems. We
> found some examples of metrics [2], but that list is not complete, so we would
> appreciate it if someone could point us to the full one.
> 
> [1]
> https://docs.ceph.com/en/latest/mgr/prometheus/#ceph-daemon-performance-counters-metrics
> 
> [2] https://docs.ceph.com/en/latest/monitoring/
> 
> -- 
> Best regards,
> Peter Razumovsky
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to change default osd reweight from 1.0 to 0.5

2024-06-19 Thread Anthony D'Atri
I’ve thought about this strategy in the past.   I think you might enter a cron 
job to reset any OSDs at 1.0 to 0.5, but really the balancer module or JJ 
balancer is a better idea than old-style reweight.  
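
If you really want the cron route, a rough sketch (untested; assumes jq is 
available, and be aware of the data movement it will trigger):

  ceph osd df -f json | jq -r '.nodes[] | select(.reweight == 1.0) | .id' | \
      while read id; do ceph osd reweight "$id" 0.5; done

But again, `ceph balancer mode upmap` + `ceph balancer on` will almost always 
serve you better than reweights.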

> On Jun 19, 2024, at 2:22 AM, 서민우  wrote:
> 
> Hello~
> 
> Our ceph cluster uses 260 osds.
> The highest OSD usage is 87%, but the lowest is under 40%.
> We are considering lowering the highest OSDs' reweight and raising the lowest
> OSDs' reweight.
> 
> We worked around this with the following procedure:
> 1. Set all OSDs' reweight to 0.5
> 2. Raise the lowest OSDs' reweight to 0.6
> 3. Lower the highest OSDs' reweight to 0.4
> 
> ** What I'm really curious about is this. **
> I want to set the default osd reweight to 0.5 for new OSDs that will be
> added to the cluster.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring

2024-06-18 Thread Anthony D'Atri
Easier to ignore any node_exporter that Ceph (or k8s) deploys and just deploy 
your own on a different port across your whole fleet.
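For example, the fleet-wide instance can run with something like 
node_exporter --web.listen-address=':9101' so it doesn't collide with the one 
cephadm deploys on 9100.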

> On Jun 18, 2024, at 13:56, Alex  wrote:
> 
> But how do you combine it with Prometheus node exporter built into Ceph?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring

2024-06-18 Thread Anthony D'Atri
I don't; I have the fleet-wide monitoring / observability systems query 
ceph_exporter and a fleet-wide node_exporter instance on 9101.  YMMV.


> On Jun 18, 2024, at 09:25, Alex  wrote:
> 
> Good morning.
> 
> Our RH Ceph comes with Prometheus monitoring "built in". How does everyone
> integrate that into their existing monitoring infrastructure so Ceph and
> other servers are all under one dashboard?
> 
> Thanks,
> Alex.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread Anthony D'Atri
Ohhh, so multiple OSD failure domains on a single SAN node?  I suspected as 
much.

I've experienced a Ceph cluster built on SanDisk InfiniFlash, which arguably was somewhere 
between SAN and DAS.  Each of 4 IF chassis drove 4x OSD nodes via SAS, but it was zoned 
such that the chassis was the failure domain in the CRUSH tree.

> On Jun 17, 2024, at 16:52, David C.  wrote:
> 
> In Pablo's unfortunate incident, it was because of a SAN incident, so it's 
> possible that Replica 3 didn't save him.
> In this scenario, the architecture is more the origin of the incident than 
> the number of replicas.


> It seems to me that replica 3 has been the default since firefly => making 
> replica 2 is intentional.


The default EC profile though is 2,1 and that makes it too easy for someone to 
understandably assume that the default is suitable for production.  I have an 
action item to update docs and code to default to, say, 2,2 so that it still 
works on smaller configurations like sandboxes but is less dangerous.


> However, I'd rather see a full flash Replica 2 platform with solid backups 
> than Replica 3 without backups (well obviously, Replica 3, or E/C + backup 
> are much better).
> 

Tangent, but yeah RAID or replication != backups.  SolidFire was RF2 flash, 
their assertion was that resilvering was fast enough that it was safe.  With 
Ceph we know there's more to it than that, but I'm not sure if they had special 
provisions to address the sequences of events that can cause problems with Ceph 
RF2.  They did have limited scalability, though.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread Anthony D'Atri


>> 
>> * We use replicated pools
>> * Replica 2, min replicas 1.

Note to self:   Change the docs and default to discourage this.  This is rarely 
appropriate in production.  

You had multiple overlapping drive failures?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: why not block gmail?

2024-06-17 Thread Anthony D'Atri
Yes.  I have admin juice on some other Ceph lists, I've asked for it here as 
well so that I can manage with alacrity.


> On Jun 17, 2024, at 09:31, Robert W. Eckert  wrote:
> 
> Is there any way to have a subscription request validated? 
> 
> -Original Message-
> From: Marc  
> Sent: Monday, June 17, 2024 7:56 AM
> To: ceph-users 
> Subject: [ceph-users] Re: why not block gmail?
> 
> I am putting ceph-users@ceph.io on the blacklist for now. Let me know via a 
> different email address when it is resolved.
> 
>> 
>> Could we at least stop approving requests from obvious spammers?
>> 
>> Best regards,
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>> 
>> 
>> From: Eneko Lacunza 
>> Sent: Monday, June 17, 2024 9:18 AM
>> To: ceph-users@ceph.io
>> Subject: [ceph-users] Re: why not block gmail?
>> 
>> Hi,
>> 
>> On 15/06/24 at 11:49, Marc wrote:
>>> If you don't block gmail, gmail/google will never make an effort to
>>> clean up their shit. I don't think people with a gmail.com address will mind,
>>> because it is free and they can get a free account somewhere else.
>>> 
>>> tip: google does not really know what part of their infrastructure is
>>> sending email, so they use SPF ~all. If you process gmail.com and force
>>> the -all manually, you mostly block spam.
>> In May, of 111 list messages 41 (no-spam) came from gmail.com
>> 
>> I think banning gmail.com will be an issue for the list, at least 
>> short-term.
>> 
>> Applying SPF -all seems better, but not sure about how easy that would 
>> be to implement... :)
>> 
>> Cheers
>> 
>> Eneko Lacunza
>> Zuzendari teknikoa | Director técnico
>> Binovo IT Human Project
>> 
>> Tel. +34 943 569 206 | https://www.binovo.es Astigarragako Bidea, 2 - 
>> 2º izda. Oficina 10-11, 20180 Oiartzun
>> 
>> https://www.youtube.com/user/CANALBINOVO
>> https://www.linkedin.com/company/37269706/
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
>> email to ceph-users-le...@ceph.io 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
>> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
> ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-13 Thread Anthony D'Atri
There you go.

Tiny objects are the hardest thing for any object storage service:  you can 
have space amplification and metadata operations become a very high portion of 
the overall workload.

With 500KB objects, you may waste a significant fraction of underlying space -- 
especially if you have large-IU QLC OSDs, or OSDs made with an older Ceph 
release where the min_alloc_size was 64KB vs the current 4KB.  This is 
exacerbated by EC if you're using it, as many do for buckets pools.

Bluestore Space Amplification Cheat Sheet:
https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?gid=358760253#gid=358760253


Things to do:  Disable Nagle  https://docs.ceph.com/en/quincy/radosgw/frontends/
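
With the beast frontend that's the tcp_nodelay option, e.g. something along these 
lines (adjust the section name and port to your deployment):

  ceph config set client.rgw rgw_frontends "beast port=8080 tcp_nodelay=1"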

Putting your index pool on as many SSDs as you can would also help, I don't 
recall if it's on HDD now.   Index doesn't use all that much data, but benefits 
from a generous pg_num and multiple OSDs so that it isn't bottlenecked.


> On Jun 13, 2024, at 15:13, Sinan Polat  wrote:
> 
> 500K object size
> 
>> Op 13 jun 2024 om 21:11 heeft Anthony D'Atri  het 
>> volgende geschreven:
>> 
>> How large are the objects you tested with?  
>> 
>>> On Jun 13, 2024, at 14:46, si...@turka.nl wrote:
>>> 
>>> I have been doing some further testing.
>>> 
>>> My RGW pool is placed on spinning disks.
>>> I created a 2nd RGW data pool, placed on flash disks.
>>> 
>>> Benchmarking on HDD pool:
>>> Client 1 -> 1 RGW Node: 150 obj/s
>>> Client 1-5 -> 1 RGW Node: 150 obj/s (30 obj/s each client)
>>> Client 1 -> HAProxy -> 3 RGW Nodes: 150 obj/s
>>> Client 1-5 -> HAProxy -> 3 RGW Nodes: 150 obj/s (30 obj/s each client)
>>> 
>>> I did the same tests towards the RGW pool on flash disks: same results
>>> 
>>> So, it doesn't matter if my pool is hosted on HDD or SSD.
>>> It doesn't matter if I am using 1 RGW or 3 RGW nodes.
>>> It doesn't matter if I am using 1 client or 5 clients.
>>> 
>>> I am constantly limited at around 140-160 objects/s.
>>> 
>>> I see some TCP retransmissions on the RGW node, but maybe that's 'normal'.
>>> 
>>> Any ideas/suggestions?
>>> 
>>> On 2024-06-11 22:08, Anthony D'Atri wrote:
>>>>> I am not sure adding more RGW's will increase the performance.
>>>> That was a tangent.
>>>>> To be clear, that means whatever.rgw.buckets.index ?
>>>>>>> No, sorry my bad. .index is 32 and .data is 256.
>>>>>> Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG 
>>>>>> replicas on each OSD?  You want (IMHO) to end up with 100-200, keeping 
>>>>>> each pool's pg_num to a power of 2 ideally.
>>>>> No, my RBD pool is larger. My average PG per OSD is round 60-70.
>>>> Ah.  Aim for 100-200 with spinners.
>>>>>> Assuming all your pools span all OSDs, I suggest at a minimum 256 for 
>>>>>> .index and 8192 for .data, assuming you have only RGW pools.  And would 
>>>>>> be inclined to try 512 / 8192.  Assuming your other minor pools are at 
>>>>>> 32, I'd bump .log and .non-ec to 128 or 256 as well.
>>>>>> If you have RBD or other pools colocated, those numbers would change.
>>>>>> ^ above assume disabling the autoscaler
>>>>> I bumped my .data pool from 256 to 1024 and .index from 32 to 128.
>>>> Your index pool still only benefits from half of your OSDs with a value of 
>>>> 128.
>>>>> Also doubled the .non-ec and .log pools. Performance-wise I don't see any 
>>>>> improvement. If I would see 10-20% improvement, I definitely would 
>>>>> increase it to 512 / 8192.
>>>>> With 0.5MB object size I am still limited at about 150 up to 250 
>>>>> objects/s.
>>>>> The disks aren't saturated. The wr await is mostly around 1ms and does 
>>>>> not get higher when benchmarking with S3.
>>>> Trust iostat about as far as you can throw it.
>>>>> Other suggestions, or does anyone else has suggestions?
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> 
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-13 Thread Anthony D'Atri
How large are the objects you tested with?  

> On Jun 13, 2024, at 14:46, si...@turka.nl wrote:
> 
> I have been doing some further testing.
> 
> My RGW pool is placed on spinning disks.
> I created a 2nd RGW data pool, placed on flash disks.
> 
> Benchmarking on HDD pool:
> Client 1 -> 1 RGW Node: 150 obj/s
> Client 1-5 -> 1 RGW Node: 150 obj/s (30 obj/s each client)
> Client 1 -> HAProxy -> 3 RGW Nodes: 150 obj/s
> Client 1-5 -> HAProxy -> 3 RGW Nodes: 150 obj/s (30 obj/s each client)
> 
> I did the same tests towards the RGW pool on flash disks: same results
> 
> So, it doesn't matter if my pool is hosted on HDD or SSD.
> It doesn't matter if I am using 1 RGW or 3 RGW nodes.
> It doesn't matter if I am using 1 client or 5 clients.
> 
> I am constantly limited at around 140-160 objects/s.
> 
> I see some TCP retransmissions on the RGW node, but maybe that's 'normal'.
> 
> Any ideas/suggestions?
> 
> On 2024-06-11 22:08, Anthony D'Atri wrote:
>>> I am not sure adding more RGW's will increase the performance.
>> That was a tangent.
>>> To be clear, that means whatever.rgw.buckets.index ?
>>>>> No, sorry my bad. .index is 32 and .data is 256.
>>>> Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG 
>>>> replicas on each OSD?  You want (IMHO) to end up with 100-200, keeping 
>>>> each pool's pg_num to a power of 2 ideally.
>>> No, my RBD pool is larger. My average PG per OSD is round 60-70.
>> Ah.  Aim for 100-200 with spinners.
>>>> Assuming all your pools span all OSDs, I suggest at a minimum 256 for 
>>>> .index and 8192 for .data, assuming you have only RGW pools.  And would be 
>>>> inclined to try 512 / 8192.  Assuming your other minor pools are at 32, 
>>>> I'd bump .log and .non-ec to 128 or 256 as well.
>>>> If you have RBD or other pools colocated, those numbers would change.
>>>> ^ above assume disabling the autoscaler
>>> I bumped my .data pool from 256 to 1024 and .index from 32 to 128.
>> Your index pool still only benefits from half of your OSDs with a value of 
>> 128.
>>> Also doubled the .non-ec and .log pools. Performance-wise I don't see any 
>>> improvement. If I would see 10-20% improvement, I definitely would increase 
>>> it to 512 / 8192.
>>> With 0.5MB object size I am still limited at about 150 up to 250 objects/s.
>>> The disks aren't saturated. The wr await is mostly around 1ms and does not 
>>> get higher when benchmarking with S3.
>> Trust iostat about as far as you can throw it.
>>> Other suggestions, or does anyone else has suggestions?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Patching Ceph cluster

2024-06-12 Thread Anthony D'Atri
That's just setting noout, norebalance, etc.
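
With cephadm that's wrapped up in roughly:

  ceph orch host maintenance enter <hostname>
  # patch and reboot the host here
  ceph orch host maintenance exit <hostname>

or you can set/unset the flags yourself with `ceph osd set noout` before and 
`ceph osd unset noout` after.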

> On Jun 12, 2024, at 11:28, Michael Worsham  
> wrote:
> 
> Interesting. How do you set this "maintenance mode"? If you have a series of 
> documented steps that you have to do and could provide as an example, that 
> would be beneficial for my efforts.
> 
> We are in the process of standing up both a dev-test environment consisting 
> of 3 Ceph servers (strictly for testing purposes) and a new production 
> environment consisting of 20+ Ceph servers.
> 
> We are using Ubuntu 22.04.
> 
> -- Michael
> 
> 
> From: Daniel Brown 
> Sent: Wednesday, June 12, 2024 9:18 AM
> To: Anthony D'Atri 
> Cc: Michael Worsham ; ceph-users@ceph.io 
> 
> Subject: Re: [ceph-users] Patching Ceph cluster
> 
> This is an external email. Please take care when clicking links or opening 
> attachments. When in doubt, check with the Help Desk or Security.
> 
> 
> There’s also a Maintenance mode that you can set for each server, as you’re 
> doing updates, so that the cluster doesn’t try to move data from affected 
> OSD’s, while the server being updated is offline or down. I’ve worked some on 
> automating this with Ansible, but have found my process (and/or my cluster) 
> still requires some manual intervention while it’s running to get things done 
> cleanly.
> 
> 
> 
>> On Jun 12, 2024, at 8:49 AM, Anthony D'Atri  wrote:
>> 
>> Do you mean patching the OS?
>> 
>> If so, easy -- one node at a time, then after it comes back up, wait until 
>> all PGs are active+clean and the mon quorum is complete before proceeding.
>> 
>> 
>> 
>>> On Jun 12, 2024, at 07:56, Michael Worsham  
>>> wrote:
>>> 
>>> What is the proper way to patch a Ceph cluster and reboot the servers in 
>>> said cluster if a reboot is necessary for said updates? And is it possible 
>>> to automate it via Ansible?
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS metadata pool size

2024-06-12 Thread Anthony D'Atri
If you have:

* pg_num too low (defaults are too low)
* pg_num not a power of 2
* pg_num != number of OSDs in the pool
* balancer not enabled

any of those might result in imbalance.
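
Quick things to check, roughly:

  ceph osd df tree           # PGS column at far right, plus per-OSD utilization
  ceph osd pool ls detail    # pg_num / pgp_num per pool
  ceph balancer status
  ceph balancer mode upmap ; ceph balancer on    # if it isn't already enabled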

> On Jun 12, 2024, at 07:33, Eugen Block  wrote:
> 
> I don't have any good explanation at this point. Can you share some more 
> information like:
> 
> ceph pg ls-by-pool <pool>
> ceph osd df (for the relevant OSDs)
> ceph df
> 
> Thanks,
> Eugen
> 
> Quoting Lars Köppel:
> 
>> Since my last update the size of the largest OSD increased by 0.4 TiB while
>> the smallest one only increased by 0.1 TiB. How is this possible?
>> 
>> Because the metadata pool reported to have only 900MB space left, I stopped
>> the hot-standby MDS. This gave me 8GB back but these filled up in the last
>> 2h.
>> I think I have to zap the next OSD because the filesystem is getting read
>> only...
>> 
>> How is it possible that an OSD has over 1 TiB less data on it after a
>> rebuild? And how is it possible to have so different sizes of OSDs?
>> 
>> 
>> [image: ariadne.ai Logo] Lars Köppel
>> Developer
>> Email: lars.koep...@ariadne.ai
>> Phone: +49 6221 5993580 <+4962215993580>
>> ariadne.ai (Germany) GmbH
>> Häusserstraße 3, 69115 Heidelberg
>> Amtsgericht Mannheim, HRB 744040
>> Geschäftsführer: Dr. Fabian Svara
>> https://ariadne.ai
>> 
>> 
>> On Tue, Jun 11, 2024 at 3:47 PM Lars Köppel  wrote:
>> 
>>> Only in warning mode. And there were no PG splits or merges in the last 2
>>> month.
>>> 
>>> 
>>> [image: ariadne.ai Logo] Lars Köppel
>>> Developer
>>> Email: lars.koep...@ariadne.ai
>>> Phone: +49 6221 5993580 <+4962215993580>
>>> ariadne.ai (Germany) GmbH
>>> Häusserstraße 3, 69115 Heidelberg
>>> Amtsgericht Mannheim, HRB 744040
>>> Geschäftsführer: Dr. Fabian Svara
>>> https://ariadne.ai
>>> 
>>> 
>>> On Tue, Jun 11, 2024 at 3:32 PM Eugen Block  wrote:
>>> 
 I don't think scrubs can cause this. Do you have autoscaler enabled?
 
 Quoting Lars Köppel:
 
 > Hi,
 >
 > thank you for your response.
 >
 > I don't think this thread covers my problem, because the OSDs for the
 > metadata pool fill up at different rates. So I would think this is no
 > direct problem with the journal.
 > Because we had earlier problems with the journal I changed some
 > settings(see below). I already restarted all MDS multiple times but no
 > change here.
 >
 > The health warnings regarding cache pressure resolve normally after a
 > short period of time, when the heavy load on the client ends. Sometimes it
 > stays a bit longer because an rsync is running and copying data on the
 > cluster (rsync is not good at releasing the caps).
 >
 > Could it be a problem if scrubs run most of the time in the background? Can
 > this block any other tasks or generate new data itself?
 >
 > Best regards,
 > Lars
 >
 >
 > global  basic     mds_cache_memory_limit                 17179869184
 > global  advanced  mds_max_caps_per_client                16384
 > global  advanced  mds_recall_global_max_decay_threshold  262144
 > global  advanced  mds_recall_max_decay_rate              1.00
 > global  advanced  mds_recall_max_decay_threshold         262144
 > mds     advanced  mds_cache_trim_threshold               131072
 > mds     advanced  mds_heartbeat_grace                    120.00
 > mds     advanced  mds_heartbeat_reset_grace              7400
 > mds     advanced  mds_tick_interval                      3.00
 >
 >
 > [image: ariadne.ai Logo] Lars Köppel
 > Developer
 > Email: lars.koep...@ariadne.ai
 > Phone: +49 6221 5993580 <+4962215993580>
 > ariadne.ai (Germany) GmbH
 > Häusserstraße 3, 69115 Heidelberg
 > Amtsgericht Mannheim, HRB 744040
 > Geschäftsführer: Dr. Fabian Svara
 > https://ariadne.ai
 >
 >
 > On Tue, Jun 11, 2024 at 2:05 PM Eugen Block  wrote:
 >
 >> Hi,
 >>
 >> can you check if this thread [1] applies to your situation? You don't
 >> have multi-active MDS enabled, but maybe it's still some journal
 >> trimming, or maybe misbehaving clients? In your first post there were
 >> health warnings regarding cache pressure and cache size. Are those
 >> resolved?
 >>
 >> [1]
 >>
 >>
 https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7U27L27FHHPDYGA6VNNVWGLTXCGP7X23/#VOOV235D4TP5TEOJUWHF4AVXIOTHYQQE
 >>
 >> Quoting Lars Köppel:
 >>
 >> > Hello everyone,
 >> >
 >> > short update to this problem.
 > The zapped OSD is rebuilt and it now has 1.9 TiB (the expected size
 > ~50%).
 > The other 2 OSDs are now at 2.8 and 3.2 TiB respectively. They jumped up and
 > down a lot but the higher one has 

[ceph-users] Re: Patching Ceph cluster

2024-06-12 Thread Anthony D'Atri
Do you mean patching the OS?

If so, easy -- one node at a time, then after it comes back up, wait until all 
PGs are active+clean and the mon quorum is complete before proceeding.
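
A rough sketch of that loop, if you want to script it or wrap it in Ansible 
(untested; hostnames and the package manager are placeholders):

  for host in ceph01 ceph02 ceph03; do
      ceph osd set noout
      ssh "$host" 'apt-get update && apt-get -y upgrade && reboot' || true
      sleep 180   # let the node reboot and its OSDs rejoin
      # wait until every PG is active+clean again
      while ceph pg stat | grep -Eq 'degraded|undersized|peering|activating|remapped'; do
          sleep 30
      done
      ceph quorum_status > /dev/null   # sanity-check that the mons have quorum
      ceph osd unset noout
  done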



> On Jun 12, 2024, at 07:56, Michael Worsham  
> wrote:
> 
> What is the proper way to patch a Ceph cluster and reboot the servers in said 
> cluster if a reboot is necessary for said updates? And is it possible to 
> automate it via Ansible?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-11 Thread Anthony D'Atri


> 
> I am not sure adding more RGW's will increase the performance.

That was a tangent.

> To be clear, that means whatever.rgw.buckets.index ?
>>> No, sorry my bad. .index is 32 and .data is 256.
>> Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG replicas 
>> on each OSD?  You want (IMHO) to end up with 100-200, keeping each pool's 
>> pg_num to a power of 2 ideally.
> 
> No, my RBD pool is larger. My average PG per OSD is round 60-70.

Ah.  Aim for 100-200 with spinners.

> 
>> Assuming all your pools span all OSDs, I suggest at a minimum 256 for .index 
>> and 8192 for .data, assuming you have only RGW pools.  And would be inclined 
>> to try 512 / 8192.  Assuming your  other minor pools are at 32, I'd bump 
>> .log and .non-ec to 128 or 256 as well.
>> If you have RBD or other pools colocated, those numbers would change.
>> ^ above assume disabling the autoscaler
> 
> I bumped my .data pool from 256 to 1024 and .index from 32 to 128.

Your index pool still only benefits from half of your OSDs with a value of 128.


> Also doubled the .non-ec and .log pools. Performance-wise I don't see any 
> improvement. If I would see 10-20% improvement, I definitely would increase 
> it to 512 / 8192.
> With 0.5MB object size I am still limited at about 150 up to 250 objects/s.
> 
> The disks aren't saturated. The wr await is mostly around 1ms and does not 
> get higher when benchmarking with S3.

Trust iostat about as far as you can throw it.


> 
> Other suggestions, or does anyone else has suggestions?
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Attention: Documentation - mon states and names

2024-06-11 Thread Anthony D'Atri
Custom names were never really 100% implemented, and I would not be surprised 
if they don't work in Reef.

> On Jun 11, 2024, at 14:02, Joel Davidow  wrote:
> 
> Zac,
> 
> Thanks for your super-fast response and action on this. Those four items
> are great and the corresponding email as reformatted looks good.
> 
> Jana's point about cluster names is a good one. The deprecation of custom
> cluster names, which appears to have started in octopus per
> https://docs.ceph.com/en/octopus/rados/configuration/common/, alleviates
> that confusion going forward but does not help with clusters already
> deployed with custom names.
> 
> Thanks again,
> Joel
> 
> On Tue, Jun 11, 2024 at 2:26 AM Janne Johansson  wrote:
> 
>>> Note the difference of convention in ceph command presentation. In
>>> 
>> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#understanding-mon-status
>> ,
>>> mon.X uses X to represent the portion of the command to be replaced by
>> the
>>> operator with a specific value. However, that may not be clear to all
>>> readers, some of whom may read that as a literal X. I recommend switching
>>> convention to something that makes visually explicit any portion of a
>>> command that an operator has to replace with a specific value. One such
>>> convention is to use <> as delimiters marking the portion of a command
>> that
>>> an operator has to replace with a specific value, minus the delimiters
>>> themselves. I'm sure there are other conventions that would accomplish
>> the
>>> same goal and provide the <> convention as an example only.
>> 
>> Yes, this is one of my main gripes. Many of the doc parts should more
>> visibly point out which words or parts of names are the ones that you
>> chose (by selecting a hostname for instance), it gets weird when you
>> see "mon-1" or "client.rgw.rgw1" and you don't know which of those are
>> to be changed to suit your environment and which are not. Sometimes
>> the "ceph" word sneaks into paths because it is the name of the
>> software (duh) but sometimes because it is the clustername. Now I
>> don't hope many people change their clustername, but if you did, docs
>> would be hard to follow in order to figure out where to replace "ceph"
>> with your cluster name.
>> 
>>> Also, the actual name of a mon is not clear due to the variety of mon
>> name
>>> formats. The value of the NAME column returned by ceph orch ps
>>> --daemon-type mon and the return from ceph mon dump follow the format of
>>> mon. whereas the value of name returned by ceph tell 
>>> mon_status, the mon line returned by ceph -s, and the return from ceph
>> mon
>>> stat follow the format of . Unifying the return for the mon name
>>> value of all those commands could be helpful in establishing the format
>> of
>>> a mon name, though that is probably easier said than done.
>>> 
>>> In addition, in
>>> 
>> https://docs.ceph.com/en/latest/rados/configuration/mon-config-ref/#configuring-monitors
>> ,
>>> mon names are stated to use alpha notation by convention, but that
>>> convention is not followed by cephadm in the clusters that I've deployed.
>>> Cephadm also uses a minimal ceph.conf file with configs in the mon
>>> database. I recommend this section be updated to mention those changes.
>> If
>>> there is a way to explain what a mon name is or how it is formatted,
>>> perhaps adding that to that same section would be good.
>> 
>> 
>> 
>> --
>> May the most significant bit of your life be positive.
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: About disk disk iops and ultil peak

2024-06-10 Thread Anthony D'Atri
What specifically are your OSD devices?

> On Jun 10, 2024, at 22:23, Phong Tran Thanh  wrote:
> 
> Hi ceph user!
> 
> I am encountering a problem with the IOPS and utilization of my OSD disks. Sometimes 
> a disk's IOPS and utilization peak too high, which affects my 
> cluster and causes slow ops to appear in the logs.
> 
> 6/6/24 9:51:46 AM[WRN]Health check update: 0 slow ops, oldest one blocked for 
> 36 sec, osd.268 has slow ops (SLOW_OPS)
> 
> 6/6/24 9:51:37 AM[WRN]Health check update: 0 slow ops, oldest one blocked for 
> 31 sec, osd.268 has slow ops (SLOW_OPS)
> 
> 
> This is the config to reduce it, but it does not resolve my problem:
> global  advanced  osd_mclock_profile                                 custom
> global  advanced  osd_mclock_scheduler_background_best_effort_lim    0.10
> global  advanced  osd_mclock_scheduler_background_best_effort_res    0.10
> global  advanced  osd_mclock_scheduler_background_best_effort_wgt    1
> global  advanced  osd_mclock_scheduler_background_recovery_lim       0.10
> global  advanced  osd_mclock_scheduler_background_recovery_res       0.10
> global  advanced  osd_mclock_scheduler_background_recovery_wgt       1
> global  advanced  osd_mclock_scheduler_client_lim                    0.40
> global  advanced  osd_mclock_scheduler_client_res                    0.40
> global  advanced  osd_mclock_scheduler_client_wgt                    4
> 
> Hope someone can help me
> 
> Thanks so much!
> --
> 
> Email: tranphong...@gmail.com 
> Skype: tranphong079
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri


>> To be clear, you don't need more nodes.  You can add RGWs to the ones you 
>> already have.  You have 12 OSD nodes - why not put an RGW on each?

> Might be an option, just don't like the idea to host multiple components on 
> nodes. But I'll consider it.

I really don't like mixing mon/mgr with other components because of coupled 
failure domains, and past experience with mon misbehavior, but many people do 
that.  YMMV.  With a bunch of RGWs, none of them needs to grow to consume 
significant resources, and it can be difficult to get a single RGW daemon to 
really use all of a dedicated node.

> 
 There are still serializations in the OSD and PG code.  You have 240 OSDs, 
 does your index pool have *at least* 256 PGs?
>>> Index as the data pool has 256 PG's.
>> To be clear, that means whatever.rgw.buckets.index ?
> 
> No, sorry my bad. .index is 32 and .data is 256.

Oh, yeah. Does `ceph osd df` show you at the far right like 4-5 PG replicas on 
each OSD?  You want (IMHO) to end up with 100-200, keeping each pool's pg_num 
to a power of 2 ideally.

Assuming all your pools span all OSDs, I suggest at a minimum 256 for .index 
and 8192 for .data, assuming you have only RGW pools.  And would be inclined to 
try 512 / 8192.  Assuming your  other minor pools are at 32, I'd bump .log and 
.non-ec to 128 or 256 as well.

If you have RBD or other pools colocated, those numbers would change.



^ above assume disabling the autoscaler
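
i.e. roughly, if you go that route (pool names assume the default zone naming -- 
adjust to yours):

  ceph osd pool set <pool> pg_autoscale_mode off    # per pool, first
  ceph osd pool set default.rgw.buckets.index pg_num 256
  ceph osd pool set default.rgw.buckets.data pg_num 8192
  ceph osd pool set default.rgw.log pg_num 128
  ceph osd pool set default.rgw.buckets.non-ec pg_num 128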
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri



 
>>> You are right here, but we use Ceph mainly for RBD. It performs 'good 
>>> enough' for our RBD load.
>> You use RBD for archival?
> 
> No, storage for (light-weight) virtual machines.

I'm surprised that it's enough, I've seen HDDs fail miserably in that role.

> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted 
> on the OSD nodes and are running on modern hardware.
>> You didn't list additional nodes so I assumed.  You might still do well to 
>> have a larger number of RGWs, wherever they run.  RGWs often scale better 
>> horizontally than vertically.
> 
> Good to know. I'll check if adding more RGW nodes is possible.

To be clear, you don't need more nodes.  You can add RGWs to the ones you 
already have.  You have 12 OSD nodes - why not put an RGW on each?

> 
>> There are still serializations in the OSD and PG code.  You have 240 OSDs, 
>> does your index pool have *at least* 256 PGs?
> 
> Index as the data pool has 256 PG's.

To be clear, that means whatever.rgw.buckets.index ?

> 
 You might also disable Nagle on the RGW nodes.
>>> I need to lookup what that exactly is and does.
> It depends on the concurrency setting of Warp.
> It look likes the objects/s is the bottleneck, not the throughput.
> Max memory usage is about 80-90GB per node. CPU's are quite idling.
> Is it reasonable to expect more IOps / objects/s for RGW with my setup? 
> At this moment I am not able to find the bottleneck what is causing the 
> low obj/s.
 HDDs are a false economy.
>>> Got it :)
> Ceph version is 15.2.
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-10 Thread Anthony D'Atri
Eh?  cf. Mark and Dan's 1TB/s presentation.

> On Jun 10, 2024, at 13:58, Mark Lehrer  wrote:
> 
>  It
> seems like Ceph still hasn't adjusted to SSD performance.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Testing CEPH scrubbing / self-healing capabilities

2024-06-10 Thread Anthony D'Atri
Scrubs are of PGs, not OSDs; the lead OSD for a PG orchestrates subops to the secondary 
OSDs.  If you can point me to where this is in docs/src I'll clarify it; ideally, put in 
a tracker ticket and send me a link.

Scrubbing all PGs on an OSD at once or even in sequence would be impactful.
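
If you want to target things deliberately: `ceph pg ls-by-primary osd.N` lists the 
PGs for which a given OSD is primary, and `ceph pg deep-scrub <pgid>` scrubs a 
single PG.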

> On Jun 10, 2024, at 10:51, Petr Bena  wrote:
> 
> Most likely it wasn't, the ceph help or documentation is not very clear about 
> this:
> 
> osd deep-scrub <who>   
> initiate deep scrub on osd <who>, or 
> use <all|any> to deep scrub all
> 
> It doesn't say anything like "initiate deep scrub of primary PGs on osd"
> 
> I assumed it just runs a scrub of everything on given OSD.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri



>>> - 20 spinning SAS disks per node.
>> Don't use legacy HDDs if you care about performance.
> 
> You are right here, but we use Ceph mainly for RBD. It performs 'good enough' 
> for our RBD load.

You use RBD for archival?


>>> - Some nodes have 256GB RAM, some nodes 128GB.
>> 128GB is on the low side for 20 OSDs.
> 
> Agreed, but with 20 OSD's x osd_memory_target 4GB (80GB) it is enough. We 
> haven't had any server that OOM'ed yet.

Remember that's a *target* not a *limit*.  Say one or more of your failure 
domains goes offline or you have some other large topology change.  Your OSDs 
might want up to 2x osd_memory_target, then you OOM and it cascades.  I've been 
there, had to do an emergency upgrade of 24xOSD nodes from 128GB to 192GB.

>>> - CPU varies between Intel E5-2650 and Intel Gold 5317.
>> E5-2650 is underpowered for 20 OSDs.  5317 isn't the ideal fit, it'd make a 
>> decent MDS system but assuming a dual socket system, you have ~2 threads per 
>> OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw 
>> on some of them too.
> 
> The (CPU) load on the OSD nodes is quite low. Our MON/MGR/RGW aren't hosted 
> on the OSD nodes and are running on modern hardware.

You didn't list additional nodes so I assumed.  You might still do well to have 
a larger number of RGWs, wherever they run.  RGWs often scale better 
horizontally than vertically.

> 
>> rados bench is useful for smoke testing but not always a reflection of E2E 
>> experience.
>>> Unfortunately not getting the same performance with Rados Gateway (S3).
>>> - 1x HAProxy with 3 backend RGW's.
>> Run an RGW on every node.
> 
> On every OSD node?

Yep, why not?


>>> I am using Minio Warp for benchmarking (PUT). I am 1 Warp server and 5 Warp 
>>> clients. Benchmarking towards the HAProxy.
>>> Results:
>>> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy 
>>> server. Thats good.
>>> - Using 500K object size, I am getting a throughput of 70 up to 150 MB/s 
>>> with 140 up to 300 obj/s.
>> Tiny objects are the devil of any object storage deployment.  The HDDs are 
>> killing you here, especially for the index pool.  You might get a bit better 
>> by upping pg_num from the party line.
> 
> I would expect high write await times, but all OSD/disks have write await 
> times of 1 ms up to 3 ms.

There are still serializations in the OSD and PG code.  You have 240 OSDs, does 
your index pool have *at least* 256 PGs?


> 
>> You might also disable Nagle on the RGW nodes.
> 
> I need to lookup what that exactly is and does.
> 
>>> It depends on the concurrency setting of Warp.
>>> It look likes the objects/s is the bottleneck, not the throughput.
>>> Max memory usage is about 80-90GB per node. CPU's are quite idling.
>>> Is it reasonable to expect more IOps / objects/s for RGW with my setup? At 
>>> this moment I am not able to find the bottleneck what is causing the low 
>>> obj/s.
>> HDDs are a false economy.
> 
> Got it :)
> 
>>> Ceph version is 15.2.
>>> Thanks!
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance issues RGW (S3)

2024-06-10 Thread Anthony D'Atri



> Hi all,
> 
> My Ceph setup:
> - 12 OSD nodes, 4 OSD nodes per rack. Replication of 3, 1 replica per rack.
> - 20 spinning SAS disks per node.

Don't use legacy HDDs if you care about performance.

> - Some nodes have 256GB RAM, some nodes 128GB.

128GB is on the low side for 20 OSDs.

> - CPU varies between Intel E5-2650 and Intel Gold 5317.

E5-2650 is underpowered for 20 OSDs.  5317 isn't the ideal fit, it'd make a 
decent MDS system but assuming a dual socket system, you have ~2 threads per 
OSD, which is maybe acceptable for HDDs, but I assume you have mon/mgr/rgw on 
some of them too.


> - Each node has 10Gbit/s network.
> 
> Using rados bench I am getting decent results (depending on block size):
> - 1000 MB/s throughput, 1000 IOps with 1MB block size
> - 30 MB/s throughput, 7500 IOps with 4K block size

rados bench is useful for smoke testing but not always a reflection of E2E 
experience.

> 
> Unfortunately not getting the same performance with Rados Gateway (S3).
> 
> - 1x HAProxy with 3 backend RGW's.

Run an RGW on every node.


> I am using Minio Warp for benchmarking (PUT). I am 1 Warp server and 5 Warp 
> clients. Benchmarking towards the HAProxy.
> 
> Results:
> - Using 10MB object size, I am hitting the 10Gbit/s link of the HAProxy 
> server. Thats good.
> - Using 500K object size, I am getting a throughput of 70 up to 150 MB/s with 
> 140 up to 300 obj/s.

Tiny objects are the devil of any object storage deployment.  The HDDs are 
killing you here, especially for the index pool.  You might get a bit better by 
upping pg_num from the party line.

You might also disable Nagle on the RGW nodes.


> It depends on the concurrency setting of Warp.
> 
> It look likes the objects/s is the bottleneck, not the throughput.
> 
> Max memory usage is about 80-90GB per node. CPU's are quite idling.
> 
> Is it reasonable to expect more IOps / objects/s for RGW with my setup? At 
> this moment I am not able to find the bottleneck what is causing the low 
> obj/s.

HDDs are a false economy.

> 
> Ceph version is 15.2.
> 
> Thanks!
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-09 Thread Anthony D'Atri

> 
> 6) There are advanced tuning like numa pinning, but you should get decent 
> speeds without doing fancy stuff.

This is why I’d asked the OP for the CPU in use.  Mark and Dan’s recent and 
superlative presentation about 1TB/s with Ceph underscored how tunings can make 
a very real difference.  On EPYC, for example:

* Experiment with NPS = 4, 2, 1, 0
* Disabling IOMMU in the kernel may make a huge difference
* Single-socket vs dual-socket systems make a difference.  To be clear, that 
means like a higher-core -P CPU in a chassis designed for single-socket, not a 
dual socket unit left half empty with half the cores.  CPU interconnects can 
matter, as when chasing critical latency things like how many chiplets comprise 
the CPU (compare Sapphire Rapids to Emerald Rapids) and how many cores per IO 
die are present.
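
Re the IOMMU bullet above: that's typically a kernel command line change, e.g. 
iommu=pt or intel_iommu=off / amd_iommu=off, then regenerate the bootloader config 
and reboot; measure before and after, since the effect varies by platform.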

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-09 Thread Anthony D'Atri
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Anthony D'Atri


> On Jun 7, 2024, at 13:20, Mark Lehrer  wrote:
> 
>> server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
> 
> Thanks for the reply.  The servers have dual Xeon Gold 6154 CPUs with
> 384 GB

So roughly 7 vcores / HTs per OSD?  Your Ceph is a recent release?

> The drives are older, first gen NVMe - WDC SN620.

Those appear to be a former SanDisk product, lower performers than more recent 
drives, how much a factor that is I can't say.
Which specific SKU?  There appear to be low and standard endurance SKUs, 3.84 
or 1.92 T, 3.2T or 1.6T respectively.

What is the lifetime used like on them?  Less than 80%?  If you really want to 
eliminate uncertainties:

* Ensure they're updated to the latest firmware
* In rolling fashion, destroy the OSDs, secure-erase each OSD, redeploy the OSDs
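
If the cluster is cephadm-managed, that sequence is roughly the following (OSD id, 
host, and device are placeholders -- check against your own runbook first):

  ceph orch osd rm 42 --replace            # drain and mark the OSD destroyed
  nvme format /dev/nvme0n1 --ses=1         # secure erase once it's drained
  ceph orch daemon add osd host1:/dev/nvme0n1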

> osd_memory_target is at the default.  Mellanox CX5 and SN2700
> hardware.  The test client is a similar machine with no drives.

This is via RBD?  Do you have the client RBD cache on or off?

> The CPUs are 80% idle during the test.

Do you have the server BMC/BIOS profile set to performance?  Deep C-states 
disabled via TuneD or other means?
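e.g. `tuned-adm profile latency-performance` (or network-latency), and/or 
processor.max_cstate=1 intel_idle.max_cstate=0 on the kernel command line; worth 
checking before chasing anything else.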

> The OSDs (according to iostat)

Careful, iostat's metrics are of limited utility on SSDs, especially NVMe.

> I did find it interesting that the wareq-sz option in iostat is around
> 5 during the test - I was expecting 16.  Is there a way to tweak this
> in bluestore?

Not my area of expertise, but I once tried to make OSDs with a >4KB BlueStore 
block size, they crashed at startup.   4096 is hardcoded in various places.

Quality SSD firmware will coalesce writes to NAND.  If your firmware surfaces 
host vs NAND writes, you might capture deltas over, say, a week of workload and 
calculate the WAF.

> These drives are terrible at under 8K I/O.  Not that it really matters since 
> we're not I/O bound at all.

I/O bound can be tricky, be careful with that assumption, there are multiple 
facets.  I can't find anything specific, but that makes me suspect that 
internally the IU isn't the usual 4KB, perhaps to save a few bucks on DRAM.  

> I can also increase threads from 8 to 32 and the iops are roughly
> quadruple so that's good at least.  Single thread writes are about 250
> iops and like 3.7MB/sec.  So sad.

Assuming that the pool you're writing to spans all 60 OSDs, what is your PG 
count on that pool?  Are there multiple pools in the cluster?  As reported by 
`ceph osd df`, on average how many PG replicas are on each OSD?

> The rados bench process is also under 50% CPU utilization of a single
> core.  This seems like a thread/semaphore kind of issue if I had to
> guess.  It's tricky to debug when there is no obvious bottleneck.

rados bench is a good smoke test, but fio may better represent the E2E 
experience.

> 
> Thanks,
> Mark
> 
> 
> 
> 
> On Fri, Jun 7, 2024 at 9:47 AM Anthony D'Atri  wrote:
>> 
>> Please describe:
>> 
>> * server RAM and CPU
>> * osd_memory_target
>> * OSD drive model
>> 
>>> On Jun 7, 2024, at 11:32, Mark Lehrer  wrote:
>>> 
>>> I've been using MySQL on Ceph forever, and have been down this road
>>> before but it's been a couple of years so I wanted to see if there is
>>> anything new here.
>>> 
>>> So the TL:DR version of this email - is there a good way to improve
>>> 16K write IOPs with a small number of threads?  The OSDs themselves
>>> are idle so is this just a weakness in the algorithms or do ceph
>>> clients need some profiling?  Or "other"?
>>> 
>>> Basically, this is one of the worst possible Ceph workloads so it is
>>> fun to try to push the limits.  I also happen have a MySQL instance
>>> that is reaching the write IOPs limit so this is also a last-ditch
>>> effort to keep it on Ceph.
>>> 
>>> This cluster is as straightforward as it gets... 6 servers with 10
>>> SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
>>> the OSDs are more or less idle so I don't suspect any hardware
>>> limitations.
>>> 
>>> MySQL has no parallelism so the number of threads and effective queue
>>> depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
>>> bench with 16K writes and 8 threads.  The RBD actually gets about 2x
>>> this level - still not so great.
>>> 
>>> I get about 2000 IOPs with this test:
>>> 
>>> # rados bench -p volumes 10 write -t 8 -b 16K
>>> hints = 1
>>> Maintaining 8 concurrent writes of 16384 bytes to objects of size
>>> 16384 for up to 10 seconds or 0 objects
>>> Object prefix: benchmark_data_fstosinfra-5_3652583
>>> sec Cur ops   sta

[ceph-users] Re: Ceph RBD, MySQL write IOPs - what is possible?

2024-06-07 Thread Anthony D'Atri
Please describe:

* server RAM and CPU
* osd_memory_target
* OSD drive model

> On Jun 7, 2024, at 11:32, Mark Lehrer  wrote:
> 
> I've been using MySQL on Ceph forever, and have been down this road
> before but it's been a couple of years so I wanted to see if there is
> anything new here.
> 
> So the TL:DR version of this email - is there a good way to improve
> 16K write IOPs with a small number of threads?  The OSDs themselves
> are idle so is this just a weakness in the algorithms or do ceph
> clients need some profiling?  Or "other"?
> 
> Basically, this is one of the worst possible Ceph workloads so it is
> fun to try to push the limits.  I also happen have a MySQL instance
> that is reaching the write IOPs limit so this is also a last-ditch
> effort to keep it on Ceph.
> 
> This cluster is as straightforward as it gets... 6 servers with 10
> SSDs each, 100 Gb networking.  I'm using size=3.  During operations,
> the OSDs are more or less idle so I don't suspect any hardware
> limitations.
> 
> MySQL has no parallelism so the number of threads and effective queue
> depth stay pretty low.  Therefore, as a proxy for MySQL I use rados
> bench with 16K writes and 8 threads.  The RBD actually gets about 2x
> this level - still not so great.
> 
> I get about 2000 IOPs with this test:
> 
> # rados bench -p volumes 10 write -t 8 -b 16K
> hints = 1
> Maintaining 8 concurrent writes of 16384 bytes to objects of size
> 16384 for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_fstosinfra-5_3652583
>  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>0   0 0 0 0 0   -   0
>1   8  2050  2042   31.9004   31.9062  0.00247633  0.00390848
>2   8  4306  4298   33.5728 35.25  0.00278488  0.00371784
>3   8  6607  6599   34.3645   35.9531  0.00277546  0.00363139
>4   7  8951  8944   34.9323   36.6406  0.00414908  0.00357249
>5   8 11292 1128435.257   36.5625  0.00291434  0.00353997
>6   8 13588 13580   35.358835.875  0.00306094  0.00353084
>7   7 15933 15926   35.5432   36.6562  0.00308388   0.0035123
>8   8 18361 18353   35.8399   37.9219  0.00314996  0.00348327
>9   8 20629 20621   35.7947   35.4375  0.00352998   0.0034877
>   10   5 23010 23005   35.9397 37.25  0.00395566  0.00347376
> Total time run: 10.003
> Total writes made:  23010
> Write size: 16384
> Object size:16384
> Bandwidth (MB/sec): 35.9423
> Stddev Bandwidth:   1.63433
> Max bandwidth (MB/sec): 37.9219
> Min bandwidth (MB/sec): 31.9062
> Average IOPS:   2300
> Stddev IOPS:104.597
> Max IOPS:   2427
> Min IOPS:   2042
> Average Latency(s): 0.0034737
> Stddev Latency(s):  0.00163661
> Max latency(s): 0.115932
> Min latency(s): 0.00179735
> Cleaning up (deleting benchmark objects)
> Removed 23010 objects
> Clean up completed and total clean up time :7.44664
> 
> 
> Are there any good options to improve this?  It seems like the client
> side is the bottleneck since the OSD servers are at like 15%
> utilization.
> 
> Thanks,
> Mark
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-06-04 Thread Anthony D'Atri
Or partition, or use LVM.

I've wondered for years what the practical differences are between using a 
namespace and a conventional partition.


> On Jun 4, 2024, at 07:59, Robert Sander  wrote:
> 
> On 6/4/24 12:47, Lukasz Borek wrote:
> 
>> Using cephadm, is it possible to cut part of the NVME drive for OSD and
>> leave rest space for RocksDB/WALL?
> 
> Not out of the box.
> 
> You could check if your devices support NVMe namespaces and create more than 
> one namespace on the device. The kernel then sees multiple block devices and 
> for the orchestrator they are completely separate.
> 
> Regards
> -- 
> Robert Sander
> Heinlein Consulting GmbH
> Schwedter Str. 8/9b, 10119 Berlin
> 
> https://www.heinlein-support.de
> 
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
> 
> Amtsgericht Berlin-Charlottenburg - HRB 220009 B
> Geschäftsführer: Peer Heinlein - Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-06-03 Thread Anthony D'Atri


The OP's numbers suggest IIRC like 120GB-ish for WAL+DB, though depending on 
workload, spillover could of course still be a thing.
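If it does spill, it shows up as a BLUEFS_SPILLOVER health warning; something like 
`ceph health detail | grep -i spillover`, or `ceph daemon osd.<id> perf dump bluefs` 
(look at slow_used_bytes), will show it.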

> 
> I have certainly seen cases where the OMAPS have not stayed within the 
> RocksDB/WAL NVME space and have been going down to disk.
> 
> This was on a large cluster with a lot of objects, but the disks that were 
> being used for the non-ec pool were seeing a lot more actual disk activity 
> than the other disks in the system.
> 
> Moving the non-ec pool onto NVME helped with a lot of operations that needed 
> to be done to cleanup a lot of orphaned objects.
> 
> Yes, this was a large cluster with a lot of ingress data, admittedly.
> 
> Darren Soothill
> 
> Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich 
> CEO: Martin Verges - VAT-ID: DE310638492 
> Com. register: Amtsgericht Munich HRB 231263 
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> 
> 
> 
> 
>> On 29 May 2024, at 21:24, Anthony D'Atri  wrote:
>> 
>> 
>> 
>>> You also have the metadata pools used by RGW that ideally need to be on 
>>> NVME.
>> 
>> The OP seems to intend shared NVMe for WAL+DB, so that the omaps are on NVMe 
>> that way.
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-05-29 Thread Anthony D'Atri



> You also have the metadata pools used by RGW that ideally need to be on NVME.

The OP seems to intend shared NVMe for WAL+DB, so that the omaps are on NVMe 
that way.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-29 Thread Anthony D'Atri


> 
> Each simultaneously.  These are SATA3 drives with a 128MB cache.

Turn off the cache.
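Something along these lines, assuming SATA drives at /dev/sdX (a placeholder); add a udev rule or rc script if you want it to survive a power cycle:

hdparm -W 0 /dev/sdX          # disable the volatile write cache
smartctl -g wcache /dev/sdX   # confirm it took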

>  The bus is 6 Gb/s.  I expect usage to be in the 90+% range, not the 50% range.

"usage" as measured how?

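If you're going by iostat, for example (a sketch):

iostat -x 5
# %util near 100% plus r/s + w/s in the low hundreds means the spindles
# themselves are the bottleneck, regardless of the 6 Gb/s link speed.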
> 
> On Mon, May 27, 2024 at 5:37 PM Anthony D'Atri  wrote:
> 
>> 
>> 
>> 
>>  hdd iops on the three discs hover around 80 +/- 5.
>> 
>> 
>> Each or total?  I wouldn’t expect much more than 80 per drive.
>> 
>> 
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Lousy recovery for mclock and reef

2024-05-27 Thread Anthony D'Atri


> 
>   hdd iops on the three discs hover around 80 +/- 5.  

Each or total?  I wouldn’t expect much more than 80 per drive.


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-05-27 Thread Anthony D'Atri
ve enough streams running in
>> parallel? Have you tried a benchmarking tool such as minio warp to see what
>> it can achieve.
>> 
>> You haven’t mentioned the number of PG’s you have for each of the pools in
>> question. You need to ensure that every pool that is being used has more
>> PG’s that the number of disks. If that’s not the case then individual disks
>> could be slowing things down.
>> 
>> You also have the metadata pools used by RGW that ideally need to be on
>> NVME.
>> 
>> Because you are using EC then there is the buckets.non-ec pool which is
>> used to manage the OMAPS for the multipart uploads this is usually down at
>> 8 PG’s and that will be limiting things as well.
>> 
>> 
>> 
>> Darren Soothill
>> 
>> Want a meeting with me: https://calendar.app.google/MUdgrLEa7jSba3du9
>> 
>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>> 
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>> 
>> 
>> 
>> 
>> On 25 May 2024, at 14:56, Anthony D'Atri  wrote:
>> 
>> 
>> 
>> Hi Everyone,
>> 
>> I'm putting together a HDD cluster with an ECC pool dedicated to the backup
>> environment. Traffic via s3. Version 18.2,  7 OSD nodes, 12 * 12TB HDD +
>> 1NVME each,
>> 
>> 
>> QLC, man.  QLC.  That said, I hope you're going to use that single NVMe
>> SSD for at least the index pool.  Is this a chassis with universal slots,
>> or is that NVMe device maybe M.2 or rear-cage?
>> 
>> Wondering if there is some general guidance for startup setup/tuning in
>> regards to s3 object size.
>> 
>> 
>> Small objects are the devil of any object storage system.
>> 
>> 
>> Files are read from fast storage (SSD/NVME) and
>> written to s3. Files sizes are 10MB-1TB, so it's not standard s3. traffic.
>> 
>> 
>> Nothing nonstandard about that, though your 1TB objects presumably are
>> going to be MPU.  Having the .buckets.non-ec pool on HDD with objects that
>> large might be really slow to assemble them, you might need to increase
>> timeouts but I'm speculating.
>> 
>> 
>> Backup for big files took hours to complete.
>> 
>> 
>> Spinners gotta spin.  They're a false economy.
>> 
>> My first shot would be to increase default bluestore_min_alloc_size_hdd, to
>> reduce the number of stored objects, but I'm not sure if it's a
>> good direccion?
>> 
>> 
>> With that workload you *could* increase that to like 64KB, but I don't
>> think it'd gain you much.
>> 
>> 
>> Any other parameters worth checking to support such a
>> traffic pattern?
>> 
>> 
>> `ceph df`
>> `ceph osd dump | grep pool`
>> 
>> So we can see what's going on HDD and what's on NVMe.
>> 
>> 
>> Thanks!
>> 
>> --
>> Łukasz
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
>> 
>> 
> 
> -- 
> Łukasz Borek
> luk...@borek.org.pl
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: tuning for backup target cluster

2024-05-25 Thread Anthony D'Atri


> Hi Everyone,
> 
> I'm putting together a HDD cluster with an ECC pool dedicated to the backup
> environment. Traffic via s3. Version 18.2,  7 OSD nodes, 12 * 12TB HDD +
> 1NVME each,

QLC, man.  QLC.  That said, I hope you're going to use that single NVMe SSD for 
at least the index pool.  Is this a chassis with universal slots, or is that 
NVMe device maybe M.2 or rear-cage?
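Pinning the index pool to the NVMe OSDs is roughly two commands -- a sketch assuming the default RGW pool name and that the NVMe OSDs carry a distinct device class:

ceph osd crush rule create-replicated index-on-nvme default host ssd   # or a custom "nvme" class
ceph osd pool set default.rgw.buckets.index crush_rule index-on-nvme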

> Wondering if there is some general guidance for startup setup/tuning in
> regards to s3 object size.

Small objects are the devil of any object storage system.


> Files are read from fast storage (SSD/NVME) and
> written to s3. Files sizes are 10MB-1TB, so it's not standard s3. traffic.

Nothing nonstandard about that, though your 1TB objects presumably are going to 
be MPU.  Having the .buckets.non-ec pool on HDD with objects that large might 
be really slow to assemble them, you might need to increase timeouts but I'm 
speculating.


> Backup for big files took hours to complete.

Spinners gotta spin.  They're a false economy.

> My first shot would be to increase default bluestore_min_alloc_size_hdd, to
> reduce the number of stored objects, but I'm not sure if it's a
> good direccion?  

With that workload you *could* increase that to like 64KB, but I don't think 
it'd gain you much.  
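If you do try it, keep in mind it only applies to OSDs created after the change -- a sketch, one OSD at a time:

ceph config set osd bluestore_min_alloc_size_hdd 65536
ceph orch osd rm <osd-id> --replace --zap    # then let the orchestrator redeploy it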


> Any other parameters worth checking to support such a
> traffic pattern?

`ceph df`
`ceph osd dump | grep pool`

So we can see what's going on HDD and what's on NVMe.

> 
> Thanks!
> 
> -- 
> Łukasz
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph ARM providing storage for x86

2024-05-25 Thread Anthony D'Atri
Why not?  The hwarch doesn't matter.  

> On May 25, 2024, at 07:35, filip Mutterer  wrote:
> 
> Is this known to be working:
> 
> Setting up the Ceph Cluster with ARM and then use the storage with X86 
> Machines for example LXC, Docker and KVM?
> 
> Is this possible?
> 
> Greetings
> 
> filip
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice regarding rgw scaling

2024-05-23 Thread Anthony D'Atri
I'm interested in these responses.  Early this year a certain someone related 
having good results from deploying an RGW on every cluster node.  This was when 
we were experiencing ballooning memory usage that conflicted with K8s limits while 
running 3, so on the cluster in question we now run 25.

I've read in the past of people running multiple RGW instances on a given host 
vs trying to get a single instance to scale past a certain point.  I'm told 
that in our Rook/K8s case we couldn't do that.
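Outside of Rook, a cephadm sketch of both levers; the service id, label, and numbers are assumptions, not recommendations:

ceph config set client.rgw rgw_thread_pool_size 1024

cat > rgw-spec.yaml <<'EOF'
service_type: rgw
service_id: main
placement:
  label: rgw
  count_per_host: 2
spec:
  rgw_frontend_port: 8080
EOF
ceph orch apply -i rgw-spec.yaml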

> On May 23, 2024, at 11:49, Szabo, Istvan (Agoda)  
> wrote:
> 
> Hi,
> 
> Wonder what is the best practice to scale RGW, increase the thread numbers or 
> spin up more gateways?
> 
> 
>  *
> Let's say I have 21000 connections on my haproxy
>  *
> I have 3 physical gateway servers, so let's say each of them needs to serve 
> 7000 connections
> 
> This means with a 512 thread pool size each of them needs 13 gateways, 
> altogether 39 in the cluster.
> or
> 3 gateway and each 8192 rgw thread?
> 
> Thank you
> 
> 
> This message is confidential and is for the sole use of the intended 
> recipient(s). It may also be privileged or otherwise protected by copyright 
> or other legal rules. If you have received it by mistake please let us know 
> by reply email and delete it from your system. It is prohibited to copy this 
> message or disclose its content to anyone. Any confidentiality or privilege 
> is not waived or lost by any mistaken delivery or unauthorized disclosure of 
> the message. All messages sent to and from Agoda may be monitored to ensure 
> compliance with company policies, to protect the company's interests and to 
> remove potential malware. Electronic messages may be intercepted, amended, 
> lost or deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS as Offline Storage

2024-05-21 Thread Anthony D'Atri



> I think it is his lab so maybe it is a test setup for production.

Home production?

> 
> I don't think it matters to much with scrubbing, it is not like it is related 
> to how long you were offline. It will scrub just as much being 1 month 
> offline as being 6 months offline.
> 
>> 
>> If you have a single node arguably ZFS would be a better choice.
>> 
>>> On May 21, 2024, at 14:53, adam.ther  wrote:
>>> 
>>> Hello,
>>> 
>>> To save on power in my home lab can I have a single node CEPH cluster
>> sit idle and powered off for 3 months at a time then boot only to refresh
>> backups? Or will this cause issues I'm unaware of? I'm aware deep-
>> scrubbing will not happen; it would be done during the boot-up period
>> every 3 months.
>>> 
>>> What do you think?
>>> 
>>> Regards,
>>> 
>>> Adam
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS as Offline Storage

2024-05-21 Thread Anthony D'Atri
If you have a single node arguably ZFS would be a better choice.

> On May 21, 2024, at 14:53, adam.ther  wrote:
> 
> Hello,
> 
> To save on power in my home lab can I have a single node CEPH cluster sit 
> idle and powered off for 3 months at a time then boot only to refresh 
> backups? Or will this cause issues I'm unaware of? I'm aware deep-scrubbing 
> will not happen; it would be done during the boot-up period every 3 months.
> 
> What do you think?
> 
> Regards,
> 
> Adam
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Please discuss about Slow Peering

2024-05-21 Thread Anthony D'Atri
> 
> 
> I have additional questions,
> We use 13 disks (3.2TB NVMe) per server and allocate one OSD to each disk. In 
> other words, one node has 13 OSDs.
> Do you think this is inefficient?
> Is it better to create more OSDs by creating LVs on each disk?

Not with the most recent Ceph releases.  I suspect your PGs are too few though.
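A quick sanity check -- the numbers are rough targets, not gospel: the PGS column of `ceph osd df tree` should land somewhere around 150-300 per NVMe OSD.

ceph osd df tree
ceph osd pool autoscale-status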

>> 
>> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

2024-05-20 Thread Anthony D'Atri


> On May 20, 2024, at 2:24 PM, Matthew Vernon  wrote:
> 
> Hi,
> 
> Thanks for your help!
> 
> On 20/05/2024 18:13, Anthony D'Atri wrote:
> 
>> You do that with the CRUSH rule, not with osd_crush_chooseleaf_type.  Set 
>> that back to the default value of `1`.  This option is marked `dev` for a 
>> reason ;)
> 
> OK [though not obviously at 
> https://docs.ceph.com/en/reef/rados/configuration/pool-pg-config-ref/#confval-osd_crush_chooseleaf_type
>  ]

Aye that description is pretty oblique.  I’d update it if I fully understood 
it.  But one might argue that if you don’t understand something, leave it alone 
;)

> 
>> but I think you’d also need to revert `osd_crush_chooseleaf_type` too.  
>> Might be better to wipe and redeploy so you know that down the road when you 
>> add / replace hardware this behavior doesn’t resurface.
> 
> Yep, I'm still at the destroy-and-recreate point here, trying to make sure I 
> can do this repeatably.
> 
>>>>> Once the cluster was up I used an osd spec file that looked like:
>>>>> service_type: osd
>>>>> service_id: rrd_single_NVMe
>>>>> placement:
>>>>>  label: "NVMe"
>>>>> spec:
>>>>>  data_devices:
>>>>>rotational: 1
>>>>>  db_devices:
>>>>>model: "NVMe"
>>>> Is it your intent to use spinners for payload data and SSD for metadata?
>>> 
>>> Yes.
>> You might want to set `db_slots` accordingly, by default I think it’ll be 
>> 1:1 which probably isn’t what you intend.
> 
> Is there an easy way to check this? The docs suggested it would work, and 
> vgdisplay on the vg that pvs tells me the nvme device is in shows 24 LVs...

If you create the OSDs and their DB/WAL devices show NVMe partitions then 
you’re good.  How many NVMe devices do you have on the HDD nodes?  
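One way to confirm where each OSD's DB actually landed (a sketch; run on an OSD host, osd-id is a placeholder):

cephadm shell -- ceph-volume lvm list
ceph osd metadata <osd-id> | grep -E '"devices"|bluefs_db'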

> 
> Thanks,
> 
> Matthew
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

2024-05-20 Thread Anthony D'Atri

> 
>>> This has left me with a single sad pg:
>>> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>>>pg 1.0 is stuck inactive for 33m, current state unknown, last acting []
>>> 
>> .mgr pool perhaps.
> 
> I think so
> 
>>> ceph osd tree shows that CRUSH picked up my racks OK, eg.
>>> -3  45.11993  rack B4
>>> -2  45.11993  host moss-be1001
>>> 1hdd3.75999  osd.1 up   1.0  1.0
>> Please send the entire first 10 lines or so of `ceph osd tree`
> 
> root@moss-be1001:/# ceph osd tree
> ID  CLASS  WEIGHT TYPE NAME STATUS  REWEIGHT  PRI-AFF
> -7 176.11194  rack F3
> -6 176.11194  host moss-be1003
> 2hdd7.33800  osd.2 up   1.0  1.0
> 3hdd7.33800  osd.3 up   1.0  1.0
> 6hdd7.33800  osd.6 up   1.0  1.0
> 9hdd7.33800  osd.9 up   1.0  1.0
> 12hdd7.33800  osd.12up   1.0  1.0
> 13hdd7.33800  osd.13up   1.0  1.0
> 16hdd7.33800  osd.16up   1.0  1.0
> 19hdd7.33800  osd.19up   1.0  1.0

Yep.  Your racks and thus hosts and OSDs aren’t under the `default` or any 
other root, so they won’t get picked by any CRUSH rule. 

> 
>>> 
>>> I passed this config to bootstrap with --config:
>>> 
>>> [global]
>>>  osd_crush_chooseleaf_type = 3
>> Why did you set that?  3 is an unusual value.  AIUI most of the time the 
>> only reason to change this option is if one is setting up a single-node 
>> sandbox - and perhaps localpools create a rule using it.  I suspect this is 
>> at least part of your problem.
> 
> I wanted to have rack as failure domain rather than host i.e. to ensure that 
> each replica goes in a different rack (academic at the moment as I have 3 
> hosts, one in each rack, but for future expansion important).

You do that with the CRUSH rule, not with osd_crush_chooseleaf_type.  Set that 
back to the default value of `1`.  This option is marked `dev` for a reason ;)

And the replication rule:
rule replicated_rule {
   id 0
   type replicated
   step take default
    step chooseleaf firstn 0 type rack   # `rack` here is what selects the failure domain
   step emit
}


> I could presumably fix this up by editing the crushmap (to put the racks into 
> the default bucket)

That would probably help 

`ceph osd crush move F3 root=default`

but I think you’d also need to revert `osd_crush_chooseleaf_type` too.  Might 
be better to wipe and redeploy so you know that down the road when you add / 
replace hardware this behavior doesn’t resurface.


> 
>>> Once the cluster was up I used an osd spec file that looked like:
>>> service_type: osd
>>> service_id: rrd_single_NVMe
>>> placement:
>>>  label: "NVMe"
>>> spec:
>>>  data_devices:
>>>rotational: 1
>>>  db_devices:
>>>model: "NVMe"
>> Is it your intent to use spinners for payload data and SSD for metadata?
> 
> Yes.

You might want to set `db_slots` accordingly, by default I think it’ll be 1:1 
which probably isn’t what you intend.
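For example, the same spec with db_slots added; the value 12 is an assumption based on a 1 NVMe : 12 HDD layout:

service_type: osd
service_id: rrd_single_NVMe
placement:
  label: "NVMe"
spec:
  data_devices:
    rotational: 1
  db_devices:
    model: "NVMe"
  db_slots: 12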

> 
> Regards,
> 
> Matthew
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm bootstraps cluster with bad CRUSH map(?)

2024-05-20 Thread Anthony D'Atri


> On May 20, 2024, at 12:21 PM, Matthew Vernon  wrote:
> 
> Hi,
> 
> I'm probably Doing It Wrong here, but. My hosts are in racks, and I wanted 
> ceph to use that information from the get-go, so I tried to achieve this 
> during bootstrap.
> 
> This has left me with a single sad pg:
> [WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
>pg 1.0 is stuck inactive for 33m, current state unknown, last acting []
> 

.mgr pool perhaps.

> ceph osd tree shows that CRUSH picked up my racks OK, eg.
> -3  45.11993  rack B4
> -2  45.11993  host moss-be1001
> 1hdd3.75999  osd.1 up   1.0  1.0


Please send the entire first 10 lines or so of `ceph osd tree`

> 
> I passed this config to bootstrap with --config:
> 
> [global]
>  osd_crush_chooseleaf_type = 3

Why did you set that?  3 is an unusual value.  AIUI most of the time the only 
reason to change this option is if one is setting up a single-node sandbox - 
and perhaps localpools create a rule using it.  I suspect this is at least part 
of your problem.
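For context, the number maps to the default CRUSH bucket type table (0=osd, 1=host, 2=chassis, 3=rack, ...), which is presumably why 3 looked attractive for a rack failure domain.  You can see the table with:

ceph osd crush dump | grep -A 15 '"types"'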

> 
> 
> Once the cluster was up I used an osd spec file that looked like:
> service_type: osd
> service_id: rrd_single_NVMe
> placement:
>  label: "NVMe"
> spec:
>  data_devices:
>rotational: 1
>  db_devices:
>model: "NVMe"

Is it your intent to use spinners for payload data and SSD for metadata?

> 
> I could presumably fix this up by editing the crushmap (to put the racks into 
> the default bucket), but what did I do wrong? Was this not a reasonable thing 
> to want to do with cephadm?
> 
> I'm running
> ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
> 
> Thanks,
> 
> Matthew
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Please discuss about Slow Peering

2024-05-16 Thread Anthony D'Atri
If using jumbo frames, also ensure that they're consistently enabled on all OS 
instances and network devices. 
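A quick end-to-end check, as a sketch -- 9000 is an assumed MTU, and 8972 is 9000 minus 28 bytes of IP/ICMP headers:

ip link show | grep -o 'mtu [0-9]*'
ping -M do -s 8972 <peer-ip>    # must succeed, node to node, without fragmentation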

> On May 16, 2024, at 09:30, Frank Schilder  wrote:
> 
> This is a long shot: if you are using octopus, you might be hit by this 
> pglog-dup problem: 
> https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't 
> mention slow peering explicitly in the blog, but its also a consequence 
> because the up+acting OSDs need to go through the PG_log during peering.
> 
> We are also using octopus and I'm not sure if we have ever seen slow ops 
> caused by peering alone. It usually happens when a disk cannot handle load 
> under peering. We have, unfortunately, disks that show random latency spikes 
> (firmware update pending). You can try to monitor OPS latencies for your 
> drives when peering and look for something that sticks out. People on this 
> list were reporting quite bad results for certain infamous NVMe brands. If 
> you state your model numbers, someone else might recognize it.
> 
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> 
> From: 서민우 
> Sent: Thursday, May 16, 2024 7:39 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Please discuss about Slow Peering
> 
> Env:
> - OS: Ubuntu 20.04
> - Ceph Version: Octopus 15.0.0.1
> - OSD Disk: 2.9TB NVMe
> - BlockStorage (Replication 3)
> 
> Symptom:
> - Peering when OSD's node up is very slow. Peering speed varies from PG to
> PG, and some PG may even take 10 seconds. But, there is no log for 10
> seconds.
> - I checked the effect of client VM's. Actually, Slow queries of mysql
> occur at the same time.
> 
> There are Ceph OSD logs of both Best and Worst.
> 
> Best Peering Case (0.5 Seconds)
> 2024-04-11T15:32:44.693+0900 7f108b522700  1 osd.7 pg_epoch: 27368 pg[6.8]
> state: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> state: Peering, affected_by_map, going to Reset
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27371 pg[6.8]
> start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11],
> acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
> state: transitioning to Primary
> 2024-04-11T15:32:45.165+0900 7f108f52a700  1 osd.7 pg_epoch: 27377 pg[6.8]
> start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11],
> acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting
> 
> Worst Peering Case (11.6 Seconds)
> 2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
> state: transitioning to Stray
> 2024-04-11T15:32:45.169+0900 7f108b522700  1 osd.7 pg_epoch: 27377 pg[30.20]
> start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
> 2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
> state: transitioning to Stray
> 2024-04-11T15:32:46.173+0900 7f108b522700  1 osd.7 pg_epoch: 27378 pg[30.20]
> start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting
> 2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
> state: transitioning to Stray
> 2024-04-11T15:32:57.794+0900 7f108b522700  1 osd.7 pg_epoch: 27390 pg[30.20]
> start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1],
> acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
> 
> *I wish to know about*
> - Why some PG's take 10 seconds until Peering finishes.
> - Why Ceph log is quiet during peering.
> - Is this symptom intended in Ceph.
> 
> *And please give some advice,*
> - Is there any way to improve peering speed?
> - Or, Is there a way to not affect the client when peering occurs?
> 
> P.S
> - I checked the symptoms in the following environments.
> -> Octopus Version, Reef Version, Cephadm, Ceph-Ansible
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading Ceph Cluster OS

2024-05-14 Thread Anthony D'Atri
https://docs.ceph.com/en/latest/start/os-recommendations/#platforms


You might want to go to 20.04, then to Reef, then to 22.04

> On May 13, 2024, at 12:22, Nima AbolhassanBeigi 
>  wrote:
> 
> The ceph version is 16.2.13 pacific.
> It's deployed using ceph-ansible. (release branch stable-6.0)
> We are trying to upgrade from ubuntu 18.04 to ubuntu 22.04.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph reef and (slow) backfilling - how to speed it up

2024-05-12 Thread Anthony D'Atri
I halfway suspect that something akin to the speculation in 
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/7MWAHAY7NCJK2DHEGO6MO4SWTLPTXQMD/
 is going on.

Below are reservations reported by a random OSD that serves (mostly) an EC RGW 
bucket pool.  This is with the mClock override on and the usual three 
backfill/recovery tunables set to 7 (bumped to get more OSDs backfilling after 
I changed to a rack failure domain; having 50+% of objects remapped makes me 
nervous and I want convergence).
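For reference, the knobs referred to above are most likely these -- the exact names are my assumption, and mClock ignores the last three unless the override is on:

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 7
ceph config set osd osd_recovery_max_active_hdd 7
ceph config set osd osd_recovery_max_active_ssd 7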

3 happens to be the value of osd_recovery_max_active_hdd, so maybe there is 
some interaction between EC and how osd_recovery_max_active is derived and used?

Complete wild-ass speculation.

Just for grins, after `ceph osd down 313`

* local_reservations incremented
* remote_reservations decreased somewhat
* cluster aggregate recovery speed increased for at least the short term


[root@rook-ceph-osd-313-6f84bc5bd5-hr825 ceph]# ceph daemon osd.313 
dump_recovery_reservations
{
"local_reservations": {
"max_allowed": 7,
"min_priority": 0,
"queues": [],
"in_progress": [
{
"item": "21.161es0",
"prio": 110,
"can_preempt": true
},
{
"item": "21.180bs0",
"prio": 110,
"can_preempt": true
},
{
"item": "21.1e0as0",
"prio": 110,
"can_preempt": true
}
]
},
"remote_reservations": {
"max_allowed": 7,
"min_priority": 0,
"queues": [
{
"priority": 110,
"items": [
{
"item": "21.1d18s5",
"prio": 110,
"can_preempt": true
},
{
"item": "21.7d0s2",
"prio": 110,
"can_preempt": true
},
{
"item": "21.766s5",
"prio": 110,
"can_preempt": true
},
{
"item": "21.373s1",
"prio": 110,
"can_preempt": true
},
{
"item": "21.1a8es1",
"prio": 110,
"can_preempt": true
},
{
"item": "21.2das2",
"prio": 110,
"can_preempt": true
},
{
"item": "21.14a0s2",
"prio": 110,
"can_preempt": true
},
{
"item": "21.c7fs5",
"prio": 110,
"can_preempt": true
},
{
"item": "21.18e5s5",
"prio": 110,
"can_preempt": true
},
{
"item": "21.54ds2",
"prio": 110,
"can_preempt": true
},
{
"item": "21.79bs4",
"prio": 110,
"can_preempt": true
},
{
"item": "21.15c3s2",
"prio": 110,
"can_preempt": true
},
{
"item": "21.e15s4",
"prio": 110,
"can_preempt": true
},
{
"item": "21.226s3",
"prio": 110,
"can_preempt": true
},
{
"item": "21.adfs2",
"prio": 110,
"can_preempt": true
},
{
"item": "21.184bs4",
"prio": 110,
"can_preempt": true
},
{
"item": "21.f43s3",
"prio": 110,
"can_preempt": true
},
{
"item": "21.f5cs4",
"prio": 110,
"can_preempt": true
},
{
"item": "21.1300s3",
"prio": 110,
"can_preempt": true
},
{

[ceph-users] Re: Ceph reef and (slow) backfilling - how to speed it up

2024-05-02 Thread Anthony D'Atri



>> For our customers we are still disabling mclock and using wpq. Might be
>> worth trying.
>> 
>> 
> Could you please elaborate a bit on the issue(s) preventing the
> use of mClock. Is this specific to only the slow backfill rate and/or other
> issue?
> 
> This feedback would help prioritize the improvements in those areas.

Multiple people -- including me -- have also observed backfill/recovery stop 
completely for no apparent reason.

In some cases poking the lead OSD for a PG with `ceph osd down` restores progress; 
in other cases it doesn't.

Anecdotally this *may* only happen for EC pools on HDDs but that sample size is 
small.
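For anyone who wants to try wpq, the switch is a config change plus OSD restarts -- a sketch; restart gradually, of course:

ceph config set osd osd_op_queue wpq
ceph orch daemon restart osd.<id>    # repeat per OSD, or go host by host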
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recoveries without any misplaced objects?

2024-04-24 Thread Anthony D'Atri
Do you see *keys* aka omap traffic?  Especially if you have RGW set up?
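Omap recovery shows up as keys/s rather than objects/s, e.g. in the io section of a plain status (a sketch):

ceph status | grep -A 3 'io:'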

> On Apr 24, 2024, at 15:37, David Orman  wrote:
> 
> Did you ever figure out what was happening here?
> 
> David
> 
> On Mon, May 29, 2023, at 07:16, Hector Martin wrote:
>> On 29/05/2023 20.55, Anthony D'Atri wrote:
>>> Check the uptime for the OSDs in question
>> 
>> I restarted all my OSDs within the past 10 days or so. Maybe OSD
>> restarts are somehow breaking these stats?
>> 
>>> 
>>>> On May 29, 2023, at 6:44 AM, Hector Martin  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I'm watching a cluster finish a bunch of backfilling, and I noticed that
>>>> quite often PGs end up with zero misplaced objects, even though they are
>>>> still backfilling.
>>>> 
>>>> Right now the cluster is down to 6 backfilling PGs:
>>>> 
>>>> data:
>>>>   volumes: 1/1 healthy
>>>>   pools:   6 pools, 268 pgs
>>>>   objects: 18.79M objects, 29 TiB
>>>>   usage:   49 TiB used, 25 TiB / 75 TiB avail
>>>>   pgs: 262 active+clean
>>>>6   active+remapped+backfilling
>>>> 
>>>> But there are no misplaced objects, and the misplaced column in `ceph pg
>>>> dump` is zero for all PGs.
>>>> 
>>>> If I do a `ceph pg dump_json`, I can see `num_objects_recovered`
>>>> increasing for these PGs... but the misplaced count is still 0.
>>>> 
>>>> Is there something else that would cause recoveries/backfills other than
>>>> misplaced objects? Or perhaps there is a bug somewhere causing the
>>>> misplaced object count to be misreported as 0 sometimes?
>>>> 
>>>> # ceph -v
>>>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>>>> (stable)
>>>> 
>>>> - Hector
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> 
>>> 
>> 
>> - Hector
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-23 Thread Anthony D'Atri



> On Apr 23, 2024, at 12:24, Maged Mokhtar  wrote:
> 
> For nvme:HDD ratio, yes you can go for 1:10, or if you have extra slots you 
> can use 1:5 using smaller capacity/cheaper nvmes, this will reduce the impact 
> of nvme failures.

On occasion I've seen a suggestion to mirror the fast devices and use LVM 
slices for each OSD.  This might result in increased wear.
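That layout would look something like this sketch; device names and sizes are assumptions:

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
pvcreate /dev/md0 && vgcreate ceph-db /dev/md0
lvcreate -L 120G -n db-osd0 ceph-db    # one LV slice per OSD's DB/WAL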

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Status of IPv4 / IPv6 dual stack?

2024-04-23 Thread Anthony D'Atri
Sounds like an opportunity for you to submit an expansive code PR to implement 
it.

> On Apr 23, 2024, at 04:28, Marc  wrote:
> 
>> I have removed dual-stack-mode-related information from the documentation
>> on the assumption that dual-stack mode was planned but never fully
>> implemented.
>> 
>> See https://tracker.ceph.com/issues/65631.
>> 
>> See https://github.com/ceph/ceph/pull/57051.
>> 
>> Hat-tip to Dan van der Ster, who bumped this thread for me.
> 
> "I will remove references to dual-stack mode in the documentation because i"
> 
> I prefer if it would state that dual stack is not supported (and maybe why). 
> By default I would assume such a thing is supported. I would not even suspect 
> it could be an issue.
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Anthony D'Atri
>Do you have any data on the reliability of QLC NVMe drives? 

They were my job for a year, so yes, I do.  The published specs are accurate.  
A QLC drive built from the same NAND as a TLC drive will have more capacity, 
but less endurance.  Depending on the model, you may wish to enable 
`bluestore_use_optimal_io_size_for_min_alloc_size` when creating your OSDs.  
The Intel / Soldigim P5316, for example, has a 64KB IU size, so performance and 
endurance will benefit from aligning OSD `min_alloc_size` to that value.  Note 
that this is baked in at creation, you cannot change it on a given OSD after 
the fact, but you can redeploy the OSD and let it recover.
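In practice that's one setting before the OSDs are created -- a sketch; the explicit value is only needed if you'd rather pin it than trust the device's reported optimal IO size:

ceph config set osd bluestore_use_optimal_io_size_for_min_alloc_size true
# or pin it to the IU size explicitly, e.g. for a 64KB-IU drive:
ceph config set osd bluestore_min_alloc_size_ssd 65536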

Other SKUs have 8KB or 16KB IU sizes, some have 4KB which requires no specific 
min_alloc_size.  Note that QLC is a good fit for workloads where writes tend to 
be sequential and reasonably large on average and infrequent.  I know of 
successful QLC RGW clusters that see 0.01 DWPD.  Yes, that decimal point is in 
the correct place.  Millions of 1KB files overwritten once an hour aren't a 
good workload for QLC. Backups, archives, even something like an OpenStack 
Glance pool are good fits.  I'm about to trial QLC as Prometheus LTS as well. 
Read-mostly workloads are good fits, as the read performance is in the ballpark 
of TLC.  Write performance is still going to be way better than any HDD, and 
you aren't stuck with legacy SATA slots.  You also don't have to buy or manage 
a fussy HBA.

> How old is your deep archive cluster, how many NVMes it has, and how many did 
> you
> have to replace?

I don't personally have one at the moment.

Even with TLC, endurance is, dare I say, overrated.  99% of enterprise SSDs 
never burn more than 15% of their rated endurance.  SSDs from at least some 
manufacturers have a timed workload feature in firmware that will estimate 
drive lifetime when presented with a real-world workload -- this is based on 
observed PE cycles.

Pretty much any SSD will report lifetime used or remaining, so TLC, QLC, even 
MLC or SLC you should collect those metrics in your time-series DB and watch 
both for drives nearing EOL and their burn rates.  
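For NVMe that's the Percentage Used field, e.g. (a sketch):

nvme smart-log /dev/nvme0 | grep -i percentage_used
smartctl -a /dev/nvme0 | grep -i 'percentage used'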

> 
> On Sun, Apr 21, 2024 at 11:06 PM Anthony D'Atri  
> wrote:
>> 
>> A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB 
>> in size, 32 of those in one RU makes for a cluster that doesn’t take up the 
>> whole DC.
>> 
>>> On Apr 21, 2024, at 5:42 AM, Darren Soothill  
>>> wrote:
>>> 
>>> Hi Niklaus,
>>> 
>>> Lots of questions here but let me tray and get through some of them.
>>> 
>>> Personally unless a cluster is for deep archive then I would never suggest 
>>> configuring or deploying a cluster without Rocks DB and WAL on NVME.
>>> There are a number of benefits to this in terms of performance and 
>>> recovery. Small writes go to the NVME first before being written to the HDD 
>>> and it makes many recovery operations far more efficient.
>>> 
>>> As to how much faster it makes things that very much depends on the type of 
>>> workload you have on the system. Lots of small writes will make a 
>>> significant difference. Very large writes not as much of a difference.
>>> Things like compactions of the RocksDB database are a lot faster as they 
>>> are now running from NVME and not from the HDD.
>>> 
>>> We normally work with  a upto 1:12 ratio so 1 NVME for every 12 HDD’s. This 
>>> is assuming the NVME’s being used are good mixed use enterprise NVME’s with 
>>> power loss protection.
>>> 
>>> As to failures yes a failure of the NVME would mean a loss of 12 OSD’s but 
>>> this is no worse than a failure of an entire node. This is something Ceph 
>>> is designed to handle.
>>> 
>>> I certainly wouldn’t be thinking about putting the NVME’s into raid sets as 
>>> that will degrade the performance of them when you are trying to get better 
>>> performance.
>>> 
>>> 
>>> 
>>> Darren Soothill
>>> 
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> 
> 
> 
> -- 
> Alexander E. Patrakov
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Why CEPH is better than other storage solutions?

2024-04-21 Thread Anthony D'Atri
Vendor lock-in only benefits vendors.  You'll pay outrageously for support and 
maintenance, then your gear goes EOL and you're trolling eBay for parts.

With Ceph you use commodity servers, you can swap 100% of the hardware without 
taking downtime with servers and drives of your choice.  And you get the source 
code so worst case you can fix or customize.  Ask me sometime about my 
experience with a certain proprietary HW vendor.  

Longhorn and OpenEBS I don't know much about.  I suspect that they don't offer 
the richness of Ceph and that their communities are much smaller.

Of course we’re biased here;)

> On Apr 21, 2024, at 5:21 AM, sebci...@o2.pl wrote:
> 
> Hi,
> I have a problem answering this question:
> Why CEPH is better than other storage solutions?
> 
> I know this high level texts about
> - scalability,
> - flexibility,
> - distributed,
> - cost-Effectiveness
> 
> What convinces me, though it could also be argued against, is that Ceph as a 
> product has everything I need, meaning:
> block storage (RBD),
> file storage (CephFS),
> object storage (S3, Swift)
> and "plugins" to run NFS, NVMe over Fabric, NFS on object storage.
> 
> Also many other features which are usually sold as an option (mirroring, geo 
> replication, etc.) in paid solutions.
> I have a problem writing it down piece by piece.
> I want to convince my managers that we are going in the right direction.
> 
> Why not something from robin.io or purestorage, netapp, dell/EMC. From 
> opensource longhorn or openEBS.
> 
> If you have ideas please write it.
> 
> Thanks,
> S.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-21 Thread Anthony D'Atri
A deep archive cluster benefits from NVMe too.  You can use QLC up to 60TB in 
size, 32 of those in one RU makes for a cluster that doesn’t take up the whole 
DC.  

> On Apr 21, 2024, at 5:42 AM, Darren Soothill  wrote:
> 
> Hi Niklaus,
> 
> Lots of questions here but let me tray and get through some of them.
> 
> Personally unless a cluster is for deep archive then I would never suggest 
> configuring or deploying a cluster without Rocks DB and WAL on NVME.
> There are a number of benefits to this in terms of performance and recovery. 
> Small writes go to the NVME first before being written to the HDD and it 
> makes many recovery operations far more efficient.
> 
> As to how much faster it makes things that very much depends on the type of 
> workload you have on the system. Lots of small writes will make a significant 
> difference. Very large writes not as much of a difference.
> Things like compactions of the RocksDB database are a lot faster as they are 
> now running from NVME and not from the HDD.
> 
> We normally work with  a upto 1:12 ratio so 1 NVME for every 12 HDD’s. This 
> is assuming the NVME’s being used are good mixed use enterprise NVME’s with 
> power loss protection.
> 
> As to failures yes a failure of the NVME would mean a loss of 12 OSD’s but 
> this is no worse than a failure of an entire node. This is something Ceph is 
> designed to handle.
> 
> I certainly wouldn’t be thinking about putting the NVME’s into raid sets as 
> that will degrade the performance of them when you are trying to get better 
> performance.
> 
> 
> 
> Darren Soothill
> 
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading Ceph 15 to 18

2024-04-21 Thread Anthony D'Atri
It should.  

> On Apr 21, 2024, at 5:48 AM, Malte Stroem  wrote:
> 
> Thank you, Anthony.
> 
> But does it work to upgrade from the latest 15 to the latest 16, too?
> 
> We'd like to be careful.
> 
> And then from the latest 16 to the latest 18?
> 
> Best,
> Malte
> 
>> On 21.04.24 04:14, Anthony D'Atri wrote:
>> The party line is to jump no more than 2 major releases at once.
>> So that would be Octopus (15) to Quincy (17) to Reef (18).
>> Squid (19) is due out soon, so you may want to pause at Quincy until Squid 
>> is released and has some runtime and maybe 19.2.1, then go straight to Squid 
>> from Quincy to save a step.
>> If you can test the upgrades on a lab cluster first, so much the better.  Be 
>> sure to read the release notes for every release in case there are specific 
>> additional actions or NBs.
>>>> On Apr 20, 2024, at 18:42, Malte Stroem  wrote:
>>> 
>>> Hello,
>>> 
>>> we'd like to upgrade our cluster from the latest Ceph 15 to Ceph 18.
>>> 
>>> It's running with cephadm.
>>> 
>>> What's the right way to do it?
>>> 
>>> Latest Ceph 15 to latest 16 and then to the latest 17 and then the latest 
>>> 18?
>>> 
>>> Does that work?
>>> 
>>> Or is it possible to jump from the latest Ceph 16 to the latest Ceph 18?
>>> 
>>> Latest Ceph 15 -> latest Ceph 16 -> latest Ceph 18.
>>> 
>>> Best,
>>> Malte
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Upgrading Ceph 15 to 18

2024-04-20 Thread Anthony D'Atri
The party line is to jump no more than 2 major releases at once.

So that would be Octopus (15) to Quincy (17) to Reef (18).

Squid (19) is due out soon, so you may want to pause at Quincy until Squid is 
released and has some runtime and maybe 19.2.1, then go straight to Squid from 
Quincy to save a step.

If you can test the upgrades on a lab cluster first, so much the better.  Be 
sure to read the release notes for every release in case there are specific 
additional actions or NBs.
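With cephadm each staged jump is one command -- a sketch; the point releases shown are examples, use the latest of each series:

ceph orch upgrade start --ceph-version 17.2.7
# wait for `ceph orch upgrade status` to finish and the cluster to return to HEALTH_OK, then:
ceph orch upgrade start --ceph-version 18.2.2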

> On Apr 20, 2024, at 18:42, Malte Stroem  wrote:
> 
> Hello,
> 
> we'd like to upgrade our cluster from the latest Ceph 15 to Ceph 18.
> 
> It's running with cephadm.
> 
> What's the right way to do it?
> 
> Latest Ceph 15 to latest 16 and then to the latest 17 and then the latest 18?
> 
> Does that work?
> 
> Or is it possible to jump from the latest Ceph 16 to the latest Ceph 18?
> 
> Latest Ceph 15 -> latest Ceph 16 -> latest Ceph 18.
> 
> Best,
> Malte
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Mysterious Space-Eating Monster

2024-04-19 Thread Anthony D'Atri
Look for unlinked but open files, it may not be Ceph at fault.  Suboptimal 
logrotate rules can cause this.  lsof, fsck -n, etc.
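For example (a sketch):

lsof +L1                                          # open-but-deleted files still holding space
du -xh --max-depth=3 /var | sort -h | tail -20    # where the space actually went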

> On Apr 19, 2024, at 05:54, Sake Ceph  wrote:
> 
> Hi Matthew, 
> 
> Cephadm doesn't cleanup old container images, at least with Quincy. After a 
> upgrade we run the following commands:
> sudo podman system prune -a -f
> sudo podman volume prune -f
> 
> But if someone has a better advice, please tell us. 
> 
> Kind regards, 
> Sake
>> Op 19-04-2024 10:24 CEST schreef duluxoz :
>> 
>> 
>> Hi All,
>> 
>> *Something* is chewing up a lot of space on our `\var` partition to the 
>> point where we're getting warnings about the Ceph monitor running out of 
>> space (ie > 70% full).
>> 
>> I've been looking, but I can't find anything significant (ie log files 
>> aren't too big, etc) BUT there seem to be a hell of a lot (15) of 
>> sub-directories (with GUIDs for names) under the 
>> `/var/lib/containers/storage/overlay/` folder, all ending with `merged` 
>> - ie `/var/lib/containers/storage/overlay/{{GUID}}/`merged`.
>> 
>> Is this normal, or is something going wrong somewhere, or am I looking 
>> in the wrong place?
>> 
>> Also, if this is the issue, can I delete these folders?
>> 
>> Sorry for asking such a noob Q, but the Cephadm/Podman stuff is 
>> extremely new to me  :-)
>> 
>> Thanks in advance
>> 
>> Cheers
>> 
>> Dulux-Oz
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Best practice and expected benefits of using separate WAL and DB devices with Bluestore

2024-04-19 Thread Anthony D'Atri
This is a ymmv thing, it depends on one's workload.


> 
> However, we have some questions about this and are looking for some guidance 
> and advice.
> 
> The first one is about the expected benefits. Before we undergo the efforts 
> involved in the transition, we are wondering if it is even worth it.

My personal sense is that it often isn't.  It adds a lot of complexity to the OSD 
lifecycle.  I suspect there are many sites where, if an OSD drive is replaced, 
it gets rebuilt without the external WAL/DB.

This is one of the reasons that HDDs are a false economy.

> How much of a performance boost one can expect when adding NVMe SSDs for 
> WAL-devices to an HDD cluster? Plus, how much faster than that does it get 
> with the DB also being on SSD. Are there rule-of-thumb number of that? Or 
> maybe someone has done benchmarks in the past?

Very workload-dependent.  My limited understanding is that the WAL on faster 
storage only gets used for small writes, ones smaller than the deferred 
setting, which I think defaults to min_alloc_size.  So maybe for a DB workload 
that does very small random overwrites?
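The settings in question, if you want to check your own defaults (a sketch):

ceph config get osd bluestore_prefer_deferred_size_hdd
ceph config get osd bluestore_min_alloc_size_hdd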

> 
> The second question is of more practical nature. Are there any best-practices 
> on how to implement this? I was thinking we won't do one SSD per HDD - surely 
> an NVMe SSD is plenty fast to handle the traffic from multiple OSDs. But what 
> is a good ratio? Do I have one NVMe SSD per 4 HDDs? Per 6 or even 8?

There was once the rule of thumb of 4 SATA HDDs per SATA SSD, 10 SATA HDDs per 
NVMe SSD.  There isn't a hard line in the sand.  The ratio is enforced in part 
by the chassis in use.

> Also, how should I chop-up the SSD, using partitions or using LVM? Last but 
> not least, if I have one SSD handle WAL and DB for multiple OSDs, losing that 
> SSD means losing multiple OSDs. How do people deal with this risk? Is it 
> generally deemed acceptable or is this something people tend to mitigate and 
> if so how? Do I run multiple SSDs in RAID?
> 
> I do realize that for some of these, there might not be the one perfect 
> answer that fits all use cases. I am looking for best practices and in 
> general just trying to avoid any obvious mistakes.

This is one of the reasons why I counsel people to consider TCO -- and source 
hardware effectively -- so they don't get stuck with HDDs.

> 
> Any advice is much appreciated.
> 
> Sincerely
> 
> Niklaus Hofer
> -- 
> stepping stone AG
> Wasserwerkgasse 7
> CH-3011 Bern
> 
> Telefon: +41 31 332 53 63
> www.stepping-stone.ch
> niklaus.ho...@stepping-stone.ch
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Performance of volume size, not a block size

2024-04-15 Thread Anthony D'Atri
If you're using SATA/SAS SSDs I would aim for 150-200 PGs per OSD as shown by 
`ceph osd df`.
If NVMe, 200-300 unless you're starved for RAM.


> On Apr 15, 2024, at 07:07, Mitsumasa KONDO  wrote:
> 
> Hi Menguy-san,
> 
> Thank you for your reply. Users who use large IO with tiny volumes are a
> nuisance to cloud providers.
> 
> I confirmed my ceph cluster with 40 SSDs. Each OSD on 1TB SSD has about 50
> placement groups in my cluster. Therefore, each PG has approximately 20GB
> of space.
> If we create a small 8GB volume, I had a feeling it wouldn't be distributed
> well, but in fact it is distributed well.
> 
> Regards,
> --
> Mitsumasa KONDO
> 
> 2024年4月15日(月) 15:29 Etienne Menguy :
> 
>> Hi,
>> 
>> Volume size doesn't affect performance, cloud providers apply a limit to
>> ensure they can deliver expected performances to all their customers.
>> 
>> Étienne
>> --
>> *From:* Mitsumasa KONDO 
>> *Sent:* Monday, 15 April 2024 06:06
>> *To:* ceph-users@ceph.io 
>> *Subject:* [ceph-users] Performance of volume size, not a block size
>> 
>> [Some people who received this message don't often get email from
>> kondo.mitsum...@gmail.com. Learn why this is important at
>> https://aka.ms/LearnAboutSenderIdentification ]
>> 
>> Hi,
>> 
>> In AWS EBS gp3, AWS says that small volume size cannot achieve best
>> performance. I think it's a feature or tendency of general
>> distributed storages including Ceph. Is that right in Ceph block storage? I
>> read many docs on ceph community. I never heard of Ceph storage.
>> 
>> 
>> https://docs.aws.amazon.com/ebs/latest/userguide/general-purpose.html
>> 
>> 
>> Regard,
>> --
>> Mitsumasa KONDO
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG inconsistent

2024-04-12 Thread Anthony D'Atri
If you're using an Icinga active check that just looks for 

SMART overall-health self-assessment test result: PASSED

then it's not doing much for you.  That bivalue status can be shown for a drive 
that is decidedly an ex-parrot.  Gotta look at specific attributes, which is 
thorny since they aren't consistently implemented.  drivedb.h is a downright 
mess, which doesn't help.
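For example -- a sketch; attribute names differ by vendor and drive type:

smartctl -A /dev/sdX | grep -Ei 'realloc|pending|uncorrect|crc'
smartctl -l error /dev/sdX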

> 
> 
> 
> 
> - Le 12 Avr 24, à 15:17, Albert Shih albert.s...@obspm.fr a écrit :
> 
>> Le 12/04/2024 à 12:56:12+0200, Frédéric Nass a écrit
>>> 
>> Hi,
>> 
>>> 
>>> Have you check the hardware status of the involved drives other than with
>>> smartctl? Like with the manufacturer's tools / WebUI (iDrac / perccli for 
>>> DELL
>>> hardware for example).
>> 
>> Yes, all my disk are «under» periodic check with smartctl + icinga.
> 
> Actually, I meant lower level tools (drive / server vendor tools).
> 
>> 
>>> If these tools don't report any media error (that is bad blocs on disks) 
>>> then
>>> you might just be facing the bit rot phenomenon. But this is very rare and
>>> should happen in a sysadmin's lifetime as often as a Royal Flush hand in a
>>> professional poker player's lifetime. ;-)
>>> 
>>> If no media error is reported, then you might want to check and update the
>>> firmware of all drives.
>> 
>> You're perfectly right.
>> 
>> It's just a newbie error, I check on the «main» osd of the PG (meaning the
>> first in the list) but forget to check on other.
>> 
> 
> Ok.
> 
>> On when server I indeed get some error on a disk.
>> 
>> But strangely smartctl report nothing. I will add a check with dmesg.
> 
> That's why I pointed you to the drive / server vendor tools earlier as 
> sometimes smartctl is missing the information you want.
> 
>> 
>>> 
>>> Once you figured it out, you may enable osd_scrub_auto_repair=true to have 
>>> these
>>> inconsistencies repaired automatically on deep-scrubbing, but make sure 
>>> you're
>>> using the alert module [1] so to at least get informed about the scrub 
>>> errors.
>> 
>> Thanks. I will look into because we got already icinga2 on site so I use
>> icinga2 to check the cluster.
>> 
>> Is they are a list of what the alert module going to check ?
> 
> Basically the module checks for ceph status (ceph -s) changes.
> 
> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/alerts/module.py
> 
> Regards,
> Frédéric.
> 
>> 
>> 
>> Regards
>> 
>> JAS
>> --
>> Albert SHIH 嶺 
>> France
>> Heure locale/Local time:
>> ven. 12 avril 2024 15:13:13 CEST
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of large PG splits

2024-04-12 Thread Anthony D'Atri
One can up the ratios temporarily but it's all too easy to forget to reduce 
them later, or think that it's okay to run all the time with reduced headroom.

Until a host blows up and you don't have enough space to recover into.
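For reference, the ratios in question -- the values shown are illustrative bumps over the defaults, and should be reverted afterwards:

ceph osd set-nearfull-ratio 0.87
ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.96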

> On Apr 12, 2024, at 05:01, Frédéric Nass  
> wrote:
> 
> 
> Oh, and yeah, considering "The fullest OSD is already at 85% usage" best move 
> for now would be to add new hardware/OSDs (to avoid reaching the backfill too 
> full limit), prior to start the splitting PGs before or after enabling upmap 
> balancer depending on how the PGs got rebalanced (well enough or not) after 
> adding new OSDs.
> 
> BTW, what ceph version is this? You should make sure you're running v16.2.11+ 
> or v17.2.4+ before splitting PGs to avoid this nasty bug: 
> https://tracker.ceph.com/issues/53729
> 
> Cheers,
> Frédéric.
> 
> - Le 12 Avr 24, à 10:41, Frédéric Nass frederic.n...@univ-lorraine.fr a 
> écrit :
> 
>> Hello Eugen,
>> 
>> Is this cluster using WPQ or mClock scheduler? (cephadm shell ceph daemon 
>> osd.0
>> config show | grep osd_op_queue)
>> 
>> If WPQ, you might want to tune osd_recovery_sleep* values as they do have a 
>> real
>> impact on the recovery/backfilling speed. Just lower osd_max_backfills to 1
>> before doing that.
>> If mClock scheduler then you might want to use a specific mClock profile as
>> suggested by Gregory (as osd_recovery_sleep* are not considered when using
>> mClock).
>> 
>> Since each PG involves reads/writes from/to apparently 18 OSDs (!) and this
>> cluster only has 240, increasing osd_max_backfills to any values higher than
>> 2-3 will not help much with the recovery/backfilling speed.
>> 
>> All the way, you'll have to be patient. :-)
>> 
>> Cheers,
>> Frédéric.
>> 
>> - Le 10 Avr 24, à 12:54, Eugen Block ebl...@nde.ag a écrit :
>> 
>>> Thank you for input!
>>> We started the split with max_backfills = 1 and watched for a few
>>> minutes, then gradually increased it to 8. Now it's backfilling with
>>> around 180 MB/s, not really much but since client impact has to be
>>> avoided if possible, we decided to let that run for a couple of hours.
>>> Then reevaluate the situation and maybe increase the backfills a bit
>>> more.
>>> 
>>> Thanks!
>>> 
>>> Zitat von Gregory Orange :
>>> 
 We are in the middle of splitting 16k EC 8+3 PGs on 2600x 16TB OSDs
 with NVME RocksDB, used exclusively for RGWs, holding about 60b
 objects. We are splitting for the same reason as you - improved
 balance. We also thought long and hard before we began, concerned
 about impact, stability etc.
 
 We set target_max_misplaced_ratio to 0.1% initially, so we could
 retain some control and stop it again fairly quickly if we weren't
 happy with the behaviour. It also serves to limit the performance
 impact on the cluster, but unfortunately it also makes the whole
 process slower.
 
 We now have the setting up to 1.5%, seeing recovery up to 10GB/s. No
 issues with the cluster. We could go higher, but are not in a rush
 at this point. Sometimes nearfull osd warnings get high and MAX
 AVAIL on the data pool in `ceph df` gets low enough that we want to
 interrupt it. So, we set pg_num to whatever the current value is
 (ceph osd pool ls detail), and let it stabilise. Then the balancer
 gets to work once the misplaced objects drop below the ratio, and
 things balance out. Nearfull osds drop usually to zero, and MAX
 AVAIL goes up again.
 
 The above behaviour is because while they share the same threshold
 setting, the autoscaler only runs every minute, and it won't run
 when misplaced are over the threshold. Meanwhile, checks for the
 next PG to split happen much more frequently, so the balancer never
 wins that race.
 
 
 We didn't know how long to expect it all to take, but decided that
 any improvement in PG size was worth starting. We now estimate it
 will take another 2-3 weeks to complete, for a total of 4-5 weeks.
 
 We have lost a drive or two during the process, and of course
 degraded objects went up, and more backfilling work got going. We
 paused splits for at least one of those, to make sure the degraded
 objects were sorted out as quick as possible. We can't be sure it
 went any faster though - there's always a long tail on that sort of
 thing.
 
 Inconsistent objects are found at least a couple of times a week,
 and to get them repairing we disable scrubs, wait until they're
 stopped, then set the repair going and reenable scrubs. I don't know
 if this is special to the current higher splitting load, but we
 haven't noticed it before.
 
 HTH,
 Greg.
 
 
 On 10/4/24 14:42, Eugen Block wrote:
> Thank you, Janne.
> I believe the default 5% target_max_misplaced_ratio would work as
> well, we've had good experience with that in the past, without the

[ceph-users] Re: DB/WALL and RGW index on the same NVME

2024-04-08 Thread Anthony D'Atri
My understanding is that omap and EC are incompatible, though.

> On Apr 8, 2024, at 09:46, David Orman  wrote:
> 
> I would suggest that you might consider EC vs. replication for index data, 
> and the latency implications. There's more than just the nvme vs. rotational 
> discussion to entertain, especially if using the more widely spread EC modes 
> like 8+3. It would be worth testing for your particular workload.
> 
> Also make sure to factor in storage utilization if you expect to see 
> versioning/object lock in use. This can be the source of a significant amount 
> of additional consumption that isn't planned for initially.
> 
> On Mon, Apr 8, 2024, at 01:42, Daniel Parkes wrote:
>> Hi Lukasz,
>> 
>> RGW uses Omap objects for the index pool; Omaps are stored in Rocksdb
>> database of each osd, not on the actual index pool, so by putting DB/WALL
>> on an NVMe as you mentioned, you are already configuring the index pool on
>> a non-rotational drive, you don't need to do anything else.
>> 
>> You just need to size your DB/WALL partition accordingly. For RGW/object
>> storage, a good starting point for the DB/Wall sizing is 4%.
>> 
>> Example of Omap entries in the index pool using 0 bytes, as they are stored
>> in Rocksdb:
>> 
>> # rados -p default.rgw.buckets.index listomapkeys
>> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> file1
>> file2
>> file4
>> file10
>> 
>> rados df -p default.rgw.buckets.index
>> POOL_NAME  USED  OBJECTS  CLONES  COPIES
>> MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS   RD  WR_OPS  WR
>> USED COMPR  UNDER COMPR
>> default.rgw.buckets.index   0 B   11   0  33
>>00 0 208  207 KiB  41  20 KiB 0 B
>>0 B
>> 
>> # rados -p default.rgw.buckets.index stat
>> .dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> default.rgw.buckets.index/.dir.7fb0a3df-9553-4a76-938d-d23711e67677.34162.1.2
>> mtime 2022-12-20T07:32:11.00-0500, size 0
>> 
>> 
>> On Sun, Apr 7, 2024 at 10:06 PM Lukasz Borek  wrote:
>> 
>>> Hi!
>>> 
>>> I'm working on a POC cluster setup dedicated to backup app writing objects
>>> via s3 (large objects, up to 1TB transferred via multipart upload process).
>>> 
>>> Initial setup is 18 storage nodes (12HDDs + 1 NVME card for DB/WALL) + EC
>>> pool.  Plan is to use cephadm.
>>> 
>>> I'd like to follow good practice and put the RGW index pool on a
>>> no-rotation drive. Question is how to do it?
>>> 
>>>   - replace a few HDDs (1 per node) with a SSD (how many? 4-6-8?)
>>>   - reserve space on NVME drive on each node, create lv based OSD and let
>>>   rgb index use the same NVME drive as DB/WALL
>>> 
>>> Thoughts?
>>> 
>>> --
>>> Lukasz
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> 
>>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impact of Slow OPS?

2024-04-06 Thread Anthony D'Atri
ISTR that the Ceph slow op threshold defaults to 30 or 32 seconds.   Naturally 
an op over the threshold often means there are more below the reporting 
threshold.  

120s I think is the default Linux op timeout.  
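
If you want to dig in, something along these lines shows both the threshold and the ops that actually got stuck (osd.14 is just an example taken from the health message, and ceph daemon has to be run on the host carrying that OSD):

   ceph config show osd.14 osd_op_complaint_time    # complaint threshold, 30s by default
   ceph daemon osd.14 dump_ops_in_flight
   ceph daemon osd.14 dump_historic_slow_ops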

> On Apr 6, 2024, at 10:53 AM, David C.  wrote:
> 
> Hi,
> 
> Do slow ops impact data integrity => No
> Can I generally ignore it => No :)
> 
> This means that some client transactions are blocked for 120 sec (that's a
> lot).
> This could be a lock on the client side (CephFS, essentially), an incident
> on the infrastructure side (a disk about to fail, network instability,
> etc.), ...
> 
> When this happens, you need to look at the blocked requests.
> If you systematically see an osd ID, then look at dmesg and the SMART of
> the disk.
> 
> This can also be an architectural problem (for example, high IOPS load with
> osdmap on HDD, all multiplied by the erasure code)
> 
> *David*
> 
> 
>> On Fri, Apr 5, 2024 at 19:42, adam.ther  wrote:
>> 
>> Hello,
>> 
>> Do slow ops impact data integrity or can I generally ignore them? I'm
>> loading 3 hosts with a 10GB link and it is saturating the disks or the OSDs.
>> 
>>2024-04-05T15:33:10.625922+ mon.CEPHADM-1 [WRN] Health check
>>update: 3 slow ops, oldest one blocked for 117 sec, daemons
>>[osd.0,osd.13,osd.14,osd.17,osd.3,osd.4,osd.9] have slow ops.
>> (SLOW_OPS)
>> 
>>2024-04-05T15:33:15.628271+ mon.CEPHADM-1 [WRN] Health check
>>update: 2 slow ops, oldest one blocked for 123 sec, daemons
>>[osd.0,osd.1,osd.14,osd.17,osd.3,osd.4,osd.9] have slow ops. (SLOW_OPS)
>> 
>> I guess more to the point, what's the impact here?
>> 
>> Thanks,
>> 
>> Adam
>> 
>> 
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Bucket usage per storage classes

2024-04-04 Thread Anthony D'Atri
A bucket may contain objects spread across multiple storage classes, and AIUI 
the head object is always in the default storage class, so I'm not sure 
*exactly* what you're after here.
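
FWIW, the per-bucket usage block quoted below is what bucket stats returns, and as far as I know it is broken down by rgw.main / rgw.multimeta categories rather than by storage class (the bucket name here is only an example):

   radosgw-admin bucket stats --bucket=mybucket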

> On Apr 4, 2024, at 17:09, Ondřej Kukla  wrote:
> 
> Hello, 
> 
> I’m playing around with Storage classes in rgw and I’m looking for ways to 
> see per-bucket statistics for the different storage classes (for billing 
> purposes etc.).
> 
> I thought that I would add another object to the bucket usage response, like 
> rgw.multimeta for multiparts, but it's counted under rgw.main.
> 
> Is there some option to get this info?
> 
> Ondrej
> 
> 
> Bucket usage I’m referring to
> 
> "usage": {
>"rgw.main": {
>"size": 1333179575,
>"size_actual": 1333190656,
>"size_utilized": 1333179575,
>"size_kb": 1301934,
>"size_kb_actual": 1301944,
>"size_kb_utilized": 1301934,
>"num_objects": 4
>},
>"rgw.multimeta": {
>"size": 0,
>"size_actual": 0,
>"size_utilized": 0,
>"size_kb": 0,
>"size_kb_actual": 0,
>"size_kb_utilized": 0,
>"num_objects": 0
>}
> }
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD image metric

2024-04-04 Thread Anthony D'Atri
Not a requirement but it makes it a LOT faster.
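
Enabling them on an existing image looks roughly like this; the pool/image names are examples, and the object map needs one rebuild to cover data written before the feature was turned on:

   rbd feature enable rbd/myimage object-map fast-diff
   rbd object-map rebuild rbd/myimage
   rbd du rbd/myimage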

> On Apr 4, 2024, at 03:54, Szabo, Istvan (Agoda)  
> wrote:
> 
> Hi,
> 
> Let's say thin provisioned and no, no fast-diff and object map enabled. As I 
> see this is a requirement to be able to use "du".
> 
> 
> Istvan Szabo
> Staff Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
> ---
> 
> 
> 
> 
> From: Anthony D'Atri 
> Sent: Thursday, April 4, 2024 3:19 AM
> To: Szabo, Istvan (Agoda) 
> Cc: Ceph Users 
> Subject: [ceph-users] Re: RBD image metric
> 
> Email received from the internet. If in doubt, don't click any link nor open 
> any attachment !
> 
> 
> Depending on your Ceph release you might need to enable rbdstats.
> 
> Are you after provisioned, allocated, or both sizes?  Do you have object-map 
> and fast-diff enabled?  They speed up `rbd du` massively.
> 
>> On Apr 3, 2024, at 00:26, Szabo, Istvan (Agoda)  
>> wrote:
>> 
>> Hi,
>> 
>> Trying to pull out some metrics from ceph about the rbd images sizes but 
>> haven't found anything only pool related metrics.
>> 
>> Wonder is there any metric about images or I need to create by myself to 
>> collect it with some third party tool?
>> 
>> Thank you
>> 
>> 
>> This message is confidential and is for the sole use of the intended 
>> recipient(s). It may also be privileged or otherwise protected by copyright 
>> or other legal rules. If you have received it by mistake please let us know 
>> by reply email and delete it from your system. It is prohibited to copy this 
>> message or disclose its content to anyone. Any confidentiality or privilege 
>> is not waived or lost by any mistaken delivery or unauthorized disclosure of 
>> the message. All messages sent to and from Agoda may be monitored to ensure 
>> compliance with company policies, to protect the company's interests and to 
>> remove potential malware. Electronic messages may be intercepted, amended, 
>> lost or deleted, or contain viruses.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: question about rbd_read_from_replica_policy

2024-04-04 Thread Anthony D'Atri
Network RTT?
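
That is, latency to the candidate OSDs. If it is CRUSH-based in your release, my understanding is that the client advertises its own position via the crush_location option and librbd matches that against the OSDs' CRUSH locations; the client itself is never added to the map. A rough ceph.conf sketch on the client side (the names and exact syntax are assumptions, please check the docs for your release):

   [client]
   rbd_read_from_replica_policy = localize
   crush_location = rack=rack1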

> On Apr 4, 2024, at 03:44, Noah Elias Feldt  wrote:
> 
> Hello,
> I have a question about a setting for RBD.
> How exactly does "rbd_read_from_replica_policy" with the value "localize" 
> work?
> According to the RBD documentation, read operations will be sent to the 
> closest OSD as determined by the CRUSH map. How does the client know exactly 
> which OSD I am closest to?
> The client is not in the CRUSH map. I can't find much more information about 
> it. How does this work?
> 
> Thanks
> 
> 
> noah feldt
> infrastrutur
> _
> 
> 
> Mittwald CM Service GmbH & Co. KG
> Königsberger Straße 4-6
> 32339 Espelkamp
>  
> Tel.: 05772 / 293-100
> Fax: 05772 / 293-333
>  
> supp...@mittwald.de 
> https://www.mittwald.de 
>  
> Managing directors: Robert Meyer, Florian Jürgens
>  
> VAT ID: DE814773217, HRA 6640, AG Bad Oeynhausen
> General partner: Robert Meyer Verwaltungs GmbH, HRB 13260, AG Bad Oeynhausen
> 
> Information about data processing in the course of our business activities 
> pursuant to Art. 13-14 GDPR is available at www.mittwald.de/ds.
> 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io 
> To unsubscribe send an email to ceph-users-le...@ceph.io 
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops during recovery for RGW index pool only when degraded OSD is primary

2024-04-03 Thread Anthony D'Atri
Thanks.  I'll PR up some doc updates reflecting this and run them by the RGW / 
RADOS folks.

> On Apr 3, 2024, at 16:34, Joshua Baergen  wrote:
> 
> Hey Anthony,
> 
> Like with many other options in Ceph, I think what's missing is the
> user-visible effect of what's being altered. I believe the reason why
> synchronous recovery is still used is that, assuming that per-object
> recovery is quick, it's faster to complete than asynchronous recovery,
> which has extra steps on either end of the recovery process. Of
> course, as you know, synchronous recovery blocks I/O, so when
> per-object recovery isn't quick, as in RGW index omap shards,
> particularly large shards, IMO we're better off always doing async
> recovery.
> 
> I don't know enough about the overheads involved here to evaluate
> whether it's worth keeping synchronous recovery at all, but IMO RGW
> index/usage(/log/gc?) pools are always better off using asynchronous
> recovery.
> 
> Josh
> 
> On Wed, Apr 3, 2024 at 1:48 PM Anthony D'Atri  wrote:
>> 
>> We currently have in  src/common/options/global.yaml.in
>> 
>> - name: osd_async_recovery_min_cost
>>  type: uint
>>  level: advanced
>>  desc: A mixture measure of number of current log entries difference and 
>> historical
>>missing objects,  above which we switch to use asynchronous recovery when 
>> appropriate
>>  default: 100
>>  flags:
>>  - runtime
>> 
>> I'd like to rephrase the description there in a PR, might you be able to 
>> share your insight into the dynamics so I can craft a better description?  
>> And do you have any thoughts on the default value?  Might appropriate values 
>> vary by pool type and/or media?
>> 
>> 
>> 
>>> On Apr 3, 2024, at 13:38, Joshua Baergen  wrote:
>>> 
>>> We've had success using osd_async_recovery_min_cost=0 to drastically
>>> reduce slow ops during index recovery.
>>> 
>>> Josh
>>> 
>>> On Wed, Apr 3, 2024 at 11:29 AM Wesley Dillingham  
>>> wrote:
>>>> 
>>>> I am fighting an issue on an 18.2.0 cluster where a restart of an OSD which
>>>> supports the RGW index pool causes crippling slow ops. If the OSD is marked
>>>> with primary-affinity of 0 prior to the OSD restart no slow ops are
>>>> observed. If the OSD has a primary affinity of 1 slow ops occur. The slow
>>>> ops only occur during the recovery period of the OMAP data and further only
>>>> occur when client activity is allowed to pass to the cluster. Luckily I am
>>>> able to test this during periods when I can disable all client activity at
>>>> the upstream proxy.
>>>> 
>>>> Given the behavior of the primary affinity changes preventing the slow ops
>>>> I think this may be a case of recovery being more detrimental than
>>>> backfill. I am thinking that causing an pg_temp acting set by forcing
>>>> backfill may be the right method to mitigate the issue. [1]
>>>> 
>>>> I believe that reducing the PG log entries for these OSDs would accomplish
>>>> that but I am also thinking a tuning of osd_async_recovery_min_cost [2] may
>>>> also accomplish something similar. Not sure the appropriate tuning for that
>>>> config at this point or if there may be a better approach. Seeking any
>>>> input here.
>>>> 
>>>> Further if this issue sounds familiar or sounds like another condition
>>>> within the OSD may be at hand I would be interested in hearing your input
>>>> or thoughts. Thanks!
>>>> 
>>>> [1] https://docs.ceph.com/en/latest/dev/peering/#concepts
>>>> [2]
>>>> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost
>>>> 
>>>> Respectfully,
>>>> 
>>>> *Wes Dillingham*
>>>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>>> w...@wesdillingham.com
>>>> ___
>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RBD image metric

2024-04-03 Thread Anthony D'Atri
Depending on your Ceph release you might need to enable rbdstats.

Are you after provisioned, allocated, or both sizes?  Do you have object-map 
and fast-diff enabled?  They speed up `rbd du` massively.
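
The rbdstats bit refers to the prometheus mgr module's per-image counters, which stay off until you list the pools to watch; enabling it looks roughly like this (the pool name is an example):

   ceph config set mgr mgr/prometheus/rbd_stats_pools "rbd"
   ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 300   # rescan interval in seconds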

> On Apr 3, 2024, at 00:26, Szabo, Istvan (Agoda)  
> wrote:
> 
> Hi,
> 
> Trying to pull out some metrics from ceph about the rbd images sizes but 
> haven't found anything only pool related metrics.
> 
> Wonder is there any metric about images or I need to create by myself to 
> collect it with some third party tool?
> 
> Thank you
> 
> 
> This message is confidential and is for the sole use of the intended 
> recipient(s). It may also be privileged or otherwise protected by copyright 
> or other legal rules. If you have received it by mistake please let us know 
> by reply email and delete it from your system. It is prohibited to copy this 
> message or disclose its content to anyone. Any confidentiality or privilege 
> is not waived or lost by any mistaken delivery or unauthorized disclosure of 
> the message. All messages sent to and from Agoda may be monitored to ensure 
> compliance with company policies, to protect the company's interests and to 
> remove potential malware. Electronic messages may be intercepted, amended, 
> lost or deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow ops during recovery for RGW index pool only when degraded OSD is primary

2024-04-03 Thread Anthony D'Atri
We currently have in  src/common/options/global.yaml.in

- name: osd_async_recovery_min_cost
  type: uint
  level: advanced
  desc: A mixture measure of number of current log entries difference and 
historical
missing objects,  above which we switch to use asynchronous recovery when 
appropriate
  default: 100
  flags:
  - runtime

I'd like to rephrase the description there in a PR, might you be able to share 
your insight into the dynamics so I can craft a better description?  And do you 
have any thoughts on the default value?  Might appropriate values vary by pool 
type and/or media?
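
For context, the way operators in this thread have been overriding it at runtime is roughly the following; osd.0 is just an example target for the check, and 0 effectively prefers asynchronous recovery whenever it is applicable:

   ceph config set osd osd_async_recovery_min_cost 0
   ceph config show osd.0 osd_async_recovery_min_cost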



> On Apr 3, 2024, at 13:38, Joshua Baergen  wrote:
> 
> We've had success using osd_async_recovery_min_cost=0 to drastically
> reduce slow ops during index recovery.
> 
> Josh
> 
> On Wed, Apr 3, 2024 at 11:29 AM Wesley Dillingham  
> wrote:
>> 
>> I am fighting an issue on an 18.2.0 cluster where a restart of an OSD which
>> supports the RGW index pool causes crippling slow ops. If the OSD is marked
>> with primary-affinity of 0 prior to the OSD restart no slow ops are
>> observed. If the OSD has a primary affinity of 1 slow ops occur. The slow
>> ops only occur during the recovery period of the OMAP data and further only
>> occur when client activity is allowed to pass to the cluster. Luckily I am
>> able to test this during periods when I can disable all client activity at
>> the upstream proxy.
>> 
>> Given the behavior of the primary affinity changes preventing the slow ops
>> I think this may be a case of recovery being more detrimental than
>> backfill. I am thinking that causing an pg_temp acting set by forcing
>> backfill may be the right method to mitigate the issue. [1]
>> 
>> I believe that reducing the PG log entries for these OSDs would accomplish
>> that but I am also thinking a tuning of osd_async_recovery_min_cost [2] may
>> also accomplish something similar. Not sure the appropriate tuning for that
>> config at this point or if there may be a better approach. Seeking any
>> input here.
>> 
>> Further if this issue sounds familiar or sounds like another condition
>> within the OSD may be at hand I would be interested in hearing your input
>> or thoughts. Thanks!
>> 
>> [1] https://docs.ceph.com/en/latest/dev/peering/#concepts
>> [2]
>> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_async_recovery_min_cost
>> 
>> Respectfully,
>> 
>> *Wes Dillingham*
>> LinkedIn 
>> w...@wesdillingham.com
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions about rbd flatten command

2024-04-02 Thread Anthony D'Atri
Do these RBD volumes have a full feature set?  I would think that fast-diff and 
objectmap would speed this.
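
As far as I know flatten is driven by the client: librbd reads the parent objects and writes them into the clone, which is why the traffic shows up on the client NIC. The only knob I'm aware of is the management-op concurrency rather than a true QoS limit; something like this (the value is an example, and I believe the default is 10):

   ceph config set client rbd_concurrent_management_ops 2
   # or set rbd_concurrent_management_ops = 2 in the [client] section of ceph.conf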

> On Apr 2, 2024, at 00:36, Henry lol  wrote:
> 
> I'm not sure, but it seems that read and write operations are
> performed for all objects in the RBD image.
> If so, is there any method to apply QoS to the flatten operation?
> 
> On Mon, Apr 1, 2024 at 11:59 PM, Henry lol wrote:
>> 
>> Hello,
>> 
>> I executed multiple 'rbd flatten' commands simultaneously on a client.
>> The elapsed time of each flatten job increased as the number of jobs
>> increased, and network I/O was nearly full.
>> 
>> so, I have two questions.
>> 1. isn’t the flatten job running within the ceph cluster? Why is
>> client-side network I/O so high?
>> 2. How can I apply qos for each flatten job to reduce network I/O?
>> 
>> Sincerely,
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph status not showing correct monitor services

2024-04-01 Thread Anthony D'Atri



>  a001s017.bpygfm(active, since 13M), standbys: a001s016.ctmoay

Looks like you just had an mgr failover?  Could be that the secondary mgr 
hasn't caught up with current events.
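
If it doesn't catch up on its own, failing over the active mgr is usually harmless and refreshes the service map; on older releases you may need to name the active daemon explicitly:

   ceph mgr fail        # standby takes over
   ceph -s              # re-check the mon/mgr lines afterwards
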
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: stretch mode item not defined

2024-03-26 Thread Anthony D'Atri
Yes, you will need to create datacenter buckets and move your host buckets 
under them.
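
Roughly like this, using the names from your osd tree (dc1/dc2 are examples; repeat the move for each host):

   ceph osd crush add-bucket dc1 datacenter
   ceph osd crush add-bucket dc2 datacenter
   ceph osd crush move dc1 root=default
   ceph osd crush move dc2 root=default
   ceph osd crush move pve-test01-01 datacenter=dc1
   ceph osd crush move pve-test01-02 datacenter=dc1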


> On Mar 26, 2024, at 09:18, ronny.lippold  wrote:
> 
> hi there, need some help please.
> 
> we are planning to replace our rbd-mirror setup and go to stretch mode.
> the goal is to have the cluster in 2 fire-compartment server rooms.
> 
> start was a default proxmox/ceph setup.
> now, i followed the howto from: 
> https://docs.ceph.com/en/latest/rados/operations/stretch-mode/
> 
> with the crushmap i ended in an error:
> in rule 'stretch_rule' item 'dc1' not defined (dc1,2,3 is like site1,2,3 in 
> the docu).
> 
> my osd tree looks like
> pve-test02-01:~# ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME   STATUS  REWEIGHT  PRI-AFF
> -1 3.80951  root default
> -15 0.06349  host pve-test01-01
>  0ssd  0.06349  osd.0   up   1.0  1.0
> -19 0.06349  host pve-test01-02
>  1ssd  0.06349  osd.1   up   1.0  1.0
> -17 0.06349  host pve-test01-03
>  2ssd  0.06349  osd.2   up   1.0  1.0
> ...
> 
> i think there is something missing.
> should i add the buckets manually?
> ...
> 
> for disaster failover, any ideas are welcome.
> 
> 
> many thanks for the help and kind regards,
> ronny
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Large number of misplaced PGs but little backfill going on

2024-03-25 Thread Anthony D'Atri
First try "ceph osd down 89"

> On Mar 25, 2024, at 15:37, Alexander E. Patrakov  wrote:
> 
> On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard  wrote:
>> 
>> 
>> 
>> On 24/03/2024 01:14, Torkil Svensgaard wrote:
>>> On 24-03-2024 00:31, Alexander E. Patrakov wrote:
 Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
 Thanks for the update. Even though the improvement is small, it is
 still an improvement, consistent with the osd_max_backfills value, and
 it proves that there are still unsolved peering issues.
 
 I have looked at both the old and the new state of the PG, but could
 not find anything else interesting.
 
 I also looked again at the state of PG 37.1. It is known what blocks
 the backfill of this PG; please search for "blocked_by." However, this
 is just one data point, which is insufficient for any conclusions. Try
 looking at other PGs. Is there anything too common in the non-empty
 "blocked_by" blocks?
>>> 
>>> I'll take a look at that tomorrow, perhaps we can script something
>>> meaningful.
>> 
>> Hi Alexander
>> 
>> While working on a script querying all PGs and making a list of all OSDs
>> found in a blocked_by list, and how many times for each, I discovered
>> something odd about pool 38:
>> 
>> "
>> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38
>> OSDs blocking other OSDs:
> 
> 
>> All PGs in the pool are active+clean so why are there any blocked_by at
>> all? One example attached.
> 
> I don't know. In any case, it doesn't match the "one OSD blocks them
> all" scenario that I was looking for. I think this is something bogus
> that can probably be cleared in your example by restarting osd.89
> (i.e, the one being blocked).
> 
>> 
>> Mvh.
>> 
>> Torkil
>> 
 I think we have to look for patterns in other ways, too. One tool that
 produces good visualizations is TheJJ balancer. Although it is called
 a "balancer," it can also visualize the ongoing backfills.
 
 The tool is available at
 https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py
 
 Run it as follows:
 
 ./placementoptimizer.py showremapped --by-osd | tee remapped.txt
>>> 
>>> Output attached.
>>> 
>>> Thanks again.
>>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
 On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard 
 wrote:
> 
> Hi Alex
> 
> New query output attached after restarting both OSDs. OSD 237 is no
> longer mentioned but it unfortunately made no difference for the number
> of backfills which went 59->62->62.
> 
> Mvh.
> 
> Torkil
> 
> On 23-03-2024 22:26, Alexander E. Patrakov wrote:
>> Hi Torkil,
>> 
>> I have looked at the files that you attached. They were helpful: pool
>> 11 is problematic, it complains about degraded objects for no obvious
>> reason. I think that is the blocker.
>> 
>> I also noted that you mentioned peering problems, and I suspect that
>> they are not completely resolved. As a somewhat-irrational move, to
>> confirm this theory, you can restart osd.237 (it is mentioned at the
>> end of query.11.fff.txt, although I don't understand why it is there)
>> and then osd.298 (it is the primary for that pg) and see if any
>> additional backfills are unblocked after that. Also, please re-query
>> that PG again after the OSD restart.
>> 
>> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard 
>> wrote:
>>> 
>>> 
>>> 
>>> On 23-03-2024 21:19, Alexander E. Patrakov wrote:
 Hi Torkil,
>>> 
>>> Hi Alexander
>>> 
 I have looked at the CRUSH rules, and the equivalent rules work on my
 test cluster. So this cannot be the cause of the blockage.
>>> 
>>> Thank you for taking the time =)
>>> 
 What happens if you increase the osd_max_backfills setting
 temporarily?
>>> 
>>> We already had the mclock override option in place and I re-enabled
>>> our
>>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending
>>> on how full they are. Active backfills went from 16 to 53 which is
>>> probably because default osd_max_backfills for mclock is 1.
>>> 
>>> I think 53 is still a low number of active backfills given the large
>>> percentage misplaced.
>>> 
 It may be a good idea to investigate a few of the stalled PGs. Please
 run commands similar to this one:
 
 ceph pg 37.0 query > query.37.0.txt
 ceph pg 37.1 query > query.37.1.txt
 ...
 and the same for the other affected pools.
>>> 
>>> A few samples attached.
>>> 
 Still, I must say that some of your rules are actually unsafe.
 
 The 4+2 rule as used by rbd_ec_data will not survive a
 datacenter-offline incident. Namely, for each PG, it chooses OSDs
 from
 two hosts in each datacenter, so 6 OSDs total. When 

[ceph-users] Re: Are we logging IRC channels?

2024-03-23 Thread Anthony D'Atri
I fear this will raise controversy, but in 2024 what’s the value in 
perpetuating an interface from early 1980s BITnet batch operating systems?

> On Mar 23, 2024, at 5:45 AM, Janne Johansson  wrote:
> 
>> Sure!  I think Wido just did it all unofficially, but afaik we've lost
>> all of those records now.  I don't know if Wido still reads the mailing
>> list but he might be able to chime in.  There was a ton of knowledge in
>> the irc channel back in the day.  With slack, it feels like a lot of
>> discussions have migrated into different channels, though #ceph still
>> gets some community traffic (and a lot of hardware design discussion).
> 
> It's also a bit cumbersome to be on IRC when someone pastes 655 lines
> of text on slack, then edits a whitespace or comma that ended up wrong
> and we get a total dump of 655 lines again from the gateway.
> 
> -- 
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-22 Thread Anthony D'Atri
Perhaps emitting an extremely low value could have value for identifying a 
compromised drive?
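
To see which OSDs ended up with an override, re-measure one, or clear a bogus value, something like this (osd.12 is an example ID):

   ceph config dump | grep osd_mclock_max_capacity_iops
   ceph tell osd.12 bench
   ceph config rm osd.12 osd_mclock_max_capacity_iops_hdd   # fall back to the default / a fresh benchmark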

> On Mar 22, 2024, at 12:49, Michel Jouvin  
> wrote:
> 
> Frédéric,
> 
> We arrived at the same conclusions! I agree that ruling out insanely low values would be 
> a good addition: the idea would be that the benchmark emits a warning about 
> the value but does not set a value lower than a defined minimum. I 
> don't have a precise idea of the possible bad side effects of such an 
> approach...
> 
> Thanks for your help.
> 
> Michel
> 
> On 22/03/2024 at 16:29, Frédéric Nass wrote:
>> Michel,
>> Glad to know that was it.
>> I was wondering when the per-OSD osd_mclock_max_capacity_iops_hdd value would be 
>> set in the cluster's config database, since I don't have any set in my lab.
>> Turns out the per OSD osd_mclock_max_capacity_iops_hdd is only set when the 
>> calculated value is below osd_mclock_iops_capacity_threshold_hdd, otherwise 
>> the OSD uses the default value of 315.
>> Probably to rule out any insanely high calculated values. Would have been 
>> nice to also rule out any insanely low measured values. :-)
>> Now either:
> A/ these incredibly low values were calculated a while back with an immature 
>> version of the code or under some specific hardware conditions and you can 
>> hope this won't happen again
>> OR
> B/ you don't want to rely on hope too much and you'll prefer to disable 
>> automatic calculation (osd_mclock_skip_benchmark = true) and set 
>> osd_mclock_max_capacity_iops_[hdd,ssd] by yourself (globally or using a 
>> rack/host mask) after a precise evaluation of the performance of your OSDs.
>> B/ would be more deterministic :-)
>> Cheers,
>> Frédéric.
>> 
>>
>>*From: *Michel 
>>*To: *Frédéric 
>>*Cc: *Pierre ; ceph-users 
>>*Sent: *Friday, 22 March 2024 14:44 CET
>>*Subject: *Re: [ceph-users] Re: Reef (18.2): Some PG not
>>scrubbed/deep scrubbed for 1 month
>> 
>>Hi Frédéric,
>> 
>>I think you raise the right point, sorry if I misunderstood Pierre's
>>suggestion to look at OSD performances. Just before reading your
>>email,
>>I was implementing Pierre's suggestion for osd_max_scrubs and I
>>saw the
>>osd_mclock_max_capacity_iops_hdd for a few OSDs (I guess those with a
>>value different from the default). For the suspect OSD, the value is
>>very low, 0.145327, and I suspect it is the cause of the problem.
>>A few
>>others have a value ~5 which I find also very low (all OSDs are using
>>the same recent HW/HDD).
>> 
>>Thanks for this information. I'll follow your suggestions to
>>rerun the
>>benchmark and report if it improved the situation.
>> 
>>Best regards,
>> 
>>Michel
>> 
>>On 22/03/2024 at 12:18, Frédéric Nass wrote:
>>> Hello Michel,
>>>
>>> Pierre also suggested checking the performance of this OSD's
>>device(s) which can be done by running a ceph tell osd.x bench.
>>>
>>> One thing I can think of is how the scrubbing speed of this very
>>OSD could be influenced by mclock scheduling, would the max iops
>>capacity calculated by this OSD during its initialization be
>>significantly lower than other OSDs's.
>>>
>>> What I would do is check (from this OSD's log) the calculated
>>value for max iops capacity and compare it to other OSDs.
>>Eventually force a recalculation by setting 'ceph config set osd.x
>>osd_mclock_force_run_benchmark_on_init true' and restart this OSD.
>>>
>>> Also I would:
>>>
>>> - compare running OSD's mclock values (cephadm shell ceph daemon
>>osd.x config show | grep mclock) to other OSDs's.
>>> - compare ceph tell osd.x bench to other OSDs's benchmarks.
>>> - compare the rotational status of this OSD's db and data
>>devices to other OSDs, to make sure things are in order.
>>>
>>> Bests,
>>> Frédéric.
>>>
>>> PS: If mclock is the culprit here, then setting osd_op_queue
>>back to wpq for only this OSD would probably reveal it. Not sure
>>about the implications of having a single OSD running a different
>>scheduler in the cluster though.
>>>
>>>
>>> - On 22 Mar 24, at 10:11, Michel Jouvin
>>michel.jou...@ijclab.in2p3.fr wrote:
>>>
>>>> Pierre,
>>>>
>>>> Yes, as mentioned in my initial email, I checked the OSD state
>>and found
>>>> nothing wrong either in the OSD logs or in the system logs
>>(SMART errors).
>>>>
>>>> Thanks for the advice of increasing osd_max_scrubs, I may try
>>it, but I
>>>> doubt it is a contention problem because it really only affects
>>a fixed
>>>> set of PGs (no new PGS have a "stucked scrub") and there is a
>>>> significant scrubbing activity going on continuously (~10K PGs
>>in the
>>>> cluster).
>>>>
>>>> Again, it is not a problem for me to try to kick out 

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Anthony D'Atri
Maybe because the Crucial units are detected as client drives?  But also look 
at the device paths and the output of whatever "disklist" is.  Your boot drives 
are SATA and the others are SAS which seems even more likely to be a factor.
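
If it really is the kernel/HBA deciding on provisioning_mode at probe time, the usual workaround is a udev rule rather than echoing into sysfs after every boot. An untested sketch, matching only those Crucial models (the file path and match string are assumptions):

   # /etc/udev/rules.d/99-mx500-unmap.rules
   ACTION=="add|change", SUBSYSTEM=="scsi_disk", ATTRS{model}=="CT4000MX500SSD*", ATTR{provisioning_mode}="unmap"

   # then reload and re-trigger udev:
   udevadm control --reload
   udevadm trigger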

> On Mar 22, 2024, at 10:42, Özkan Göksu  wrote:
> 
> Hello Anthony, thank you for the answer. 
> 
> While researching I also found this type of issue, but the thing I did 
> not understand is that in the same server the OS drives ("SAMSUNG MZ7WD480") are 
> all fine.
> 
> root@sd-01:~# lsblk -D
> NAME   DISC-ALN DISC-GRAN DISC-MAX 
> DISC-ZERO
> sda   0  512B   2G
>  0
> ├─sda10  512B   2G
>  0
> ├─sda20  512B   2G
>  0
> └─sda30  512B   2G
>  0
>   └─md0   0  512B   2G
>  0
> └─md0p1   0  512B   2G
>  0
> sdb   0  512B   2G
>  0
> ├─sdb10  512B   2G
>  0
> ├─sdb20  512B   2G
>  0
> └─sdb30  512B   2G
>  0
>   └─md0   0  512B   2G
>  0
> └─md0p1   0  512B   2G
>  0
> 
> root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
> /sys/devices/pci:00/:00:11.4/ata1/host1/target1:0:0/1:0:0:0/scsi_disk/1:0:0:0/provisioning_mode:writesame_16
> /sys/devices/pci:00/:00:11.4/ata2/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:1/end_device-0:1/target0:0:1/0:0:1:0/scsi_disk/0:0:1:0/provisioning_mode:full
> 
> root@sd-01:~# disklist
> HCTL   NAME   SIZE  REV TRAN   WWNSERIAL  MODEL
> 1:0:0:0/dev/sda 447.1G 203Q sata   0x5002538500231d05 S1G1NYAF923 SAMSUNG 
> MZ7WD4
> 2:0:0:0/dev/sdb 447.1G 203Q sata   0x5002538500231a41 S1G1NYAF922 SAMSUNG 
> MZ7WD4
> 0:0:0:0/dev/sdc   3.6T 046  sas0x500a0751e6bd969b 2312E6BD969 
> CT4000MX500SSD
> 0:0:1:0/dev/sdd   3.6T 046  sas0x500a0751e6bd97ee 2312E6BD97E 
> CT4000MX500SSD
> 0:0:2:0/dev/sde   3.6T 046  sas0x500a0751e6bd9805 2312E6BD980 
> CT4000MX500SSD
> 0:0:3:0/dev/sdf   3.6T 046  sas0x500a0751e6bd9681 2312E6BD968 
> CT4000MX500SSD
> 0:0:4:0/dev/sdg   3.6T 045  sas0x500a0751e6b5d30a 2309E6B5D30 
> CT4000MX500SSD
> 0:0:5:0/dev/sdh   3.6T 046  sas0x500a0751e6bd967e 2312E6BD967 
> CT4000MX500SSD
> 0:0:6:0/dev/sdi   3.6T 046  sas0x500a0751e6bd97e4 2312E6BD97E 
> CT4000MX500SSD
> 0:0:7:0/dev/sdj   3.6T 046  sas0x500a0751e6bd96a0 2312E6BD96A 
> CT4000MX500SSD
> 
> So my question is: why does it only happen to the CT4000MX500SSD drives, why did it 
> just start now, and why don't I see it on other servers? 
> Maybe it is related to the firmware version ("M3CR046" vs "M3CR045"). 
> I checked the Crucial website and "M3CR046" does not exist: 
> https://www.crucial.com/support/ssd-support/mx500-support
> In this forum people recommend upgrading to "M3CR046": 
> https://forums.unraid.net/topic/134954-warning-crucial-mx500-ssds-world-of-pain-stay-away-from-these/
> But actually in my ud cluster all the drives are "M3CR045" and have lower 
> latency. I'm really confused.
> 
> 
> Instead of writing udev rules only for the CT4000MX500SSD, is there any 
> recommended udev rule for Ceph and all types of SATA drives? 
> 
> 
> 
> Anthony D'Atri a...@dreamsnake.net wrote on Fri, 22 Mar 2024 at 17:00:
>> 
>> How to stop sys from changing USB SSD provisioning_mode from unmap to full 
>> in Ubuntu 22.04?
>> https://askubuntu.com/questions/1454997/how-to-stop-sys-from-changing-usb-ssd-provisioning-mode-from-unmap-to-full-in-ub

[ceph-users] Re: High OSD commit_latency after kernel upgrade

2024-03-22 Thread Anthony D'Atri
https://askubuntu.com/questions/1454997/how-to-stop-sys-from-changing-usb-ssd-provisioning-mode-from-unmap-to-full-in-ub
How to stop sys from changing USB SSD provisioning_mode from unmap to full in 
Ubuntu 22.04?


> On Mar 22, 2024, at 09:36, Özkan Göksu  wrote:
> 
> Hello!
> 
> After upgrading "5.15.0-84-generic" to "5.15.0-100-generic" (Ubuntu 22.04.2
> LTS), commit latency started acting weird with "CT4000MX500SSD" drives.
> 
> osd  commit_latency(ms)  apply_latency(ms)
> 36 867867
> 373045   3045
> 38  15 15
> 39  18 18
> 421409   1409
> 431224   1224
> 
> I downgraded the kernel but the result did not change.
> I have a similar build and it didn't get upgraded and it is just fine.
> While I was digging I realised a difference.
> 
> This is high latency cluster and as you can see the "DISC-GRAN=0B",
> "DISC-MAX=0B"
> root@sd-01:~# lsblk -D
> NAME   DISC-ALN DISC-GRAN DISC-MAX
> DISC-ZERO
> sdc   00B   0B
>0
> ├─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--201d5050--db0c--41b4--85c4--6416ee989d6c
> │ 00B   0B
>0
> └─ceph--76b7d255--2a01--4bd4--8d3e--880190181183-osd--block--5a376133--47de--4e29--9b75--2314665c2862
> 
> root@sd-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
> /sys/devices/pci:80/:80:03.0/:81:00.0/host0/port-0:0/end_device-0:0/target0:0:0/0:0:0:0/scsi_disk/0:0:0:0/provisioning_mode:full
> 
> --
> 
> This is low latency cluster and as you can see the "DISC-GRAN=4K",
> "DISC-MAX=2G"
> root@ud-01:~# lsblk -D
> NAME  DISC-ALN
> DISC-GRAN DISC-MAX DISC-ZERO
> sdc  0
>   4K   2G 0
> ├─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--ec86a029--23f7--4328--9600--a24a290e3003
> │0
>   4K   2G 0
> └─ceph--7496095f--18c7--41fd--90f2--d9b3e382bc8e-osd--block--5b69b748--d899--4f55--afc3--2ea3c8a05ca1
> 
> root@ud-01:~# find /sys/ -name provisioning_mode -exec grep -H . {} + | sort
> /sys/devices/pci:00/:00:11.4/ata3/host2/target2:0:0/2:0:0:0/scsi_disk/2:0:0:0/provisioning_mode:writesame_16
> 
> I think the problem is related to provisioning_mode but I really did not
> understand the reason.
> I booted with a live ISO and the drive was still "provisioning_mode:full", so
> it means this is not related to my OS at all.
> 
> With the upgrade something changed, and I think during the boot sequence the
> negotiation between the LSI controller, the drives and the kernel started assigning
> "provisioning_mode:full", but I'm not sure.
> 
> What should I do ?
> 
> Best regards.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Need easy way to calculate Ceph cluster space for SolarWinds

2024-03-20 Thread Anthony D'Atri
0.99  1.01  184  up
> 8ssd  18.19040   1.0   18 TiB   13 TiB   13 TiB   20 GiB   59 GiB  
> 5.5 TiB  69.95  1.16  184  up
> 12ssd  18.19040   1.0   18 TiB   11 TiB   11 TiB  6.2 GiB   39 GiB  
> 7.4 TiB  59.22  0.98  180  up
> 16ssd  18.19040   1.0   18 TiB  9.8 TiB  9.7 TiB   14 GiB   37 GiB  
> 8.4 TiB  53.63  0.89  187  up
> 20ssd  18.19040   1.0   18 TiB  9.7 TiB  9.6 TiB   21 GiB   33 GiB  
> 8.5 TiB  53.14  0.88  181  up
> 24ssd  18.19040   1.0   18 TiB   12 TiB   12 TiB   10 GiB   46 GiB  
> 6.3 TiB  65.24  1.08  180  up
> 28ssd  18.19040   1.0   18 TiB   10 TiB   10 TiB   13 GiB   45 GiB  
> 8.1 TiB  55.41  0.92  192  up
> 31ssd  18.19040   1.0   18 TiB   12 TiB   12 TiB   22 GiB   48 GiB  
> 6.2 TiB  65.92  1.09  186  up
> 35ssd  18.19040   1.0   18 TiB   10 TiB   10 TiB   15 GiB   33 GiB  
> 8.0 TiB  56.11  0.93  175  up
> 37ssd  18.19040   1.0   18 TiB   13 TiB   13 TiB   13 GiB   53 GiB  
> 5.0 TiB  72.78  1.21  179  up
> 43ssd  18.19040   1.0   18 TiB  8.9 TiB  8.8 TiB   17 GiB   23 GiB  
> 9.3 TiB  48.71  0.81  178  up
>TOTAL  873 TiB  527 TiB  525 TiB  704 GiB  2.0 TiB  
> 346 TiB  60.40
> MIN/MAX VAR: 0.76/1.22  STDDEV: 6.98
> 
> -Original Message-
> From: Anthony D'Atri 
> Sent: Wednesday, March 20, 2024 5:09 PM
> To: Michael Worsham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Need easy way to calculate Ceph cluster space for 
> SolarWinds
> 
> This is an external email. Please take care when clicking links or opening 
> attachments. When in doubt, check with the Help Desk or Security.
> 
> 
> Looks like you have one device class and the same replication on all pools, 
> which makes that simpler.
> 
> Your MAX AVAIL figures are lower than I would expect if you're using size=3, 
> so I'd check if you have the balancer enabled, if it's working properly.
> 
> Run
> 
> ceph osd df
> 
> and look at the VAR column,
> 
> [rook@rook-ceph-tools-5ff8d58445-p9npl /]$ ceph osd df | head
> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE   DATA  OMAP META 
> AVAIL%USE   VAR   PGS  STATUS
> 
> Ideally the numbers should all be close to 1.00 + / -
> 
>> On Mar 20, 2024, at 16:55, Michael Worsham  
>> wrote:
>> 
>> I had a request from upper management wanting to use SolarWinds to be 
>> able to extract what I am looking at and have SolarWinds track it in terms 
>> of total available space, remaining space of the overall cluster, and I 
>> guess also the current RGW pools/buckets we have, their allocated 
>> sizes, and the space remaining in them as well. I am sort of in the dark when it 
>> comes to trying to break things down to make it readable/understandable for 
>> those that are non-technical.
>> 
>> I was told that when it comes to pools and buckets, you sort of have to see 
>> it this way:
>> - Bucket is like a folder
>> - Pool is like a hard drive.
>> - You can create many folders in a hard drive and you can add quota to each 
>> folder.
>> - But if you want to know the remaining space, you need to check the hard 
>> drive.
>> 
>> I did the "ceph df" command on the ceph monitor and we have something that 
>> looks like this:
>> 
>>>> sudo ceph df
>> --- RAW STORAGE ---
>> CLASS SIZEAVAIL USED  RAW USED  %RAW USED
>> ssd873 TiB  346 TiB  527 TiB   527 TiB  60.40
>> TOTAL  873 TiB  346 TiB  527 TiB   527 TiB  60.40
>> 
>> --- POOLS ---
>> POOL ID   PGS   STORED  OBJECTS USED  %USED  
>> MAX AVAIL
>> .mgr  1 1  449 KiB2  1.3 MiB  0  
>>61 TiB
>> default.rgw.buckets.data  2  2048  123 TiB   41.86M  371 TiB  66.76  
>>61 TiB
>> default.rgw.control   3 2  0 B8  0 B  0  
>>61 TiB
>> default.rgw.data.root 4 2  0 B0  0 B  0  
>>61 TiB
>> default.rgw.gc5 2  0 B0  0 B  0  
>>61 TiB
>> default.rgw.log   6 2   41 KiB  209  732 KiB  0  
>>61 TiB
>> default.rgw.intent-log7 2  0 B0  0 B  0  
>>61 TiB
>> default.rgw.meta  8 2   20 KiB   96  972 KiB  0  
>>61 TiB
>> default.rgw.otp   9 2  0 B0  0 B  0  
>>61 TiB
>> default.rgw.usage10 2  0 B   

[ceph-users] Re: CephFS space usage

2024-03-20 Thread Anthony D'Atri
Grep through the ls output for ‘rados bench’ leftovers, it’s easy to leave them 
behind.  
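
They are easy to spot by name; something like this finds and removes them (the pool name is an example):

   rados -p cephfs_data ls | grep '^benchmark_data' | head
   rados -p cephfs_data cleanup --prefix benchmark_data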

> On Mar 20, 2024, at 5:28 PM, Igor Fedotov  wrote:
> 
> Hi Thorne,
> 
> unfortunately I'm unaware of any tools high level enough to easily map files 
> to rados objects without deep undestanding how this works. You might want to 
> try "rados ls" command to get the list of all the objects in the cephfs data 
> pool. And then  learn how that mapping is performed and parse your listing.
> 
> 
> Thanks,
> 
> Igor
> 
>> On 3/20/2024 1:30 AM, Thorne Lawler wrote:
>> 
>> Igor,
>> 
>> Those files are VM disk images, and they're under constant heavy use, so 
>> yes - there *is* constant severe write load against this disk.
>> 
>> Apart from writing more test files into the filesystems, there must be Ceph 
>> diagnostic tools to describe what those objects are being used for, surely?
>> 
>> We're talking about an extra 10TB of space. How hard can it be to determine 
>> which file those objects are associated with?
>> 
>>> On 19/03/2024 8:39 pm, Igor Fedotov wrote:
>>> 
>>> Hi Thorn,
>>> 
>>> given the amount of files at CephFS volume I presume you don't have severe 
>>> write load against it. Is that correct?
>>> 
>>> If so we can assume that the numbers you're sharing mostly refer to 
>>> your experiment. At peak I can see a bytes_used increase of 629,461,893,120 
>>> bytes (45978612027392  - 45349150134272). With replica factor = 3 this 
>>> roughly matches your written data (200GB I presume?).
>>> 
>>> 
>>> More interesting is that after the file's removal we can see 419,450,880 
>>> bytes delta (=45349569585152 - 45349150134272). I could see two options 
>>> (apart that someone else wrote additional stuff to CephFS during the 
>>> experiment) to explain this:
>>> 
>>> 1. File removal wasn't completed at the last probe half an hour after 
>>> file's removal. Did you see stale object counter when making that probe?
>>> 
>>> 2. Some space is leaking. If that's the case this could be a reason for 
>>> your issue if huge(?) files at CephFS are created/removed periodically. So 
>>> if we're certain that the leak really occurred (and option 1. above isn't 
>>> the case) it makes sense to run more experiments with writing/removing a 
>>> bunch of huge files to the volume to confirm space leakage.
>>> 
>>> On 3/18/2024 3:12 AM, Thorne Lawler wrote:
 
 Thanks Igor,
 
 I have tried that, and the number of objects and bytes_used took a long 
 time to drop, but they seem to have dropped back to almost the original 
 level:
 
  * Before creating the file:
  o 3885835 objects
  o 45349150134272 bytes_used
  * After creating the file:
  o 3931663 objects
  o 45924147249152 bytes_used
  * Immediately after deleting the file:
  o 3935995 objects
  o 45978612027392 bytes_used
  * Half an hour after deleting the file:
  o 3886013 objects
  o 45349569585152 bytes_used
 
 Unfortunately, this is all production infrastructure, so there is always 
 other activity taking place.
 
 What tools are there to visually inspect the object map and see how it 
 relates to the filesystem?
 
>>> Not sure if there is anything like that at CephFS level but you can use 
>>> rados tool to view objects in cephfs data pool and try to build some 
>>> mapping between them and CephFS file list. Could be a bit tricky though.
 
 On 15/03/2024 7:18 pm, Igor Fedotov wrote:
> ceph df detail --format json-pretty
 --
 
 Regards,
 
 Thorne Lawler - Senior System Administrator
 *DDNS* | ABN 76 088 607 265
 First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
 P +61 499 449 170
 
 _DDNS
 
 Please note: The information contained in this email message and any 
 attached files may be confidential information, and may also be the 
 subject of legal professional privilege. If you are not the intended 
 recipient any use, disclosure or copying of this email is unauthorised. 
 If you received this email in error, please notify Discount Domain Name 
 Services Pty Ltd on 03 9815 6868 to report this matter and delete all 
 copies of this transmission together with any attachments.
 
>>> --
>>> Igor Fedotov
>>> Ceph Lead Developer
>>> 
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>> 
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>> --
>> 
>> Regards,
>> 
>> Thorne Lawler - Senior System Administrator
>> *DDNS* | ABN 76 088 607 265
>> First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
>> P +61 499 449 170
>> 
>> _DDNS
>> 
>> /_*Please note:* The information contained in this email message and any 
>> attached files may be 

[ceph-users] Re: Need easy way to calculate Ceph cluster space for SolarWinds

2024-03-20 Thread Anthony D'Atri
ault.rgw.jv-comm-pool.index   6332   83 GiB  811  248 GiB   0.13   
>   61 TiB
> default.rgw.jv-comm-pool.non-ec  6432  0 B0  0 B  0   
>   61 TiB
> default.rgw.jv-va-pool.data  6532  4.8 TiB   22.17M   14 TiB   7.28   
>   61 TiB
> default.rgw.jv-va-pool.index 6632   38 GiB  401  113 GiB   0.06   
>   61 TiB
> default.rgw.jv-va-pool.non-ec6732  0 B0  0 B  0   
>   61 TiB
> jv-edi-pool  6832  0 B0  0 B  0   
>   61 TiB
> 
> -- Michael
> 
> -Original Message-
> From: Anthony D'Atri 
> Sent: Wednesday, March 20, 2024 2:48 PM
> To: Michael Worsham 
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Need easy way to calculate Ceph cluster space for 
> SolarWinds
> 
> This is an external email. Please take care when clicking links or opening 
> attachments. When in doubt, check with the Help Desk or Security.
> 
> 
>> On Mar 20, 2024, at 14:42, Michael Worsham  
>> wrote:
>> 
>> Is there an easy way to poll a Ceph cluster to see how much space is
>> available
> 
> `ceph df`
> 
> The exporter has percentages per pool as well.
> 
> 
>> and how much space is available per bucket?
> 
> Are you using RGW quotas?
> 
>> 
>> Looking for a way to use SolarWinds to monitor the entire Ceph cluster space 
>> utilization and then also be able to break down each RGW bucket to see how 
>> much space it was provisioned for and how much is available.
> 
> RGW buckets do not provision space.  Optionally there may be some RGW quotas 
> but they're a different thing than you're implying.
> 
> 
>> 
>> -- Michael
>> 
>> 
>> Get Outlook for Android<https://aka.ms/AAb9ysg> This message and its
>> attachments are from Data Dimensions and are intended only for the use of 
>> the individual or entity to which it is addressed, and may contain 
>> information that is privileged, confidential, and exempt from disclosure 
>> under applicable law. If the reader of this message is not the intended 
>> recipient, or the employee or agent responsible for delivering the message 
>> to the intended recipient, you are hereby notified that any dissemination, 
>> distribution, or copying of this communication is strictly prohibited. If 
>> you have received this communication in error, please notify the sender 
>> immediately and permanently delete the original email and destroy any copies 
>> or printouts of this email as well as any attachments.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
>> email to ceph-users-le...@ceph.io
> 
> This message and its attachments are from Data Dimensions and are intended 
> only for the use of the individual or entity to which it is addressed, and 
> may contain information that is privileged, confidential, and exempt from 
> disclosure under applicable law. If the reader of this message is not the 
> intended recipient, or the employee or agent responsible for delivering the 
> message to the intended recipient, you are hereby notified that any 
> dissemination, distribution, or copying of this communication is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender immediately and permanently delete the original email and destroy 
> any copies or printouts of this email as well as any attachments.
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Need easy way to calculate Ceph cluster space for SolarWinds

2024-03-20 Thread Anthony D'Atri



> On Mar 20, 2024, at 14:42, Michael Worsham  
> wrote:
> 
> Is there an easy way to poll a Ceph cluster to see how much space is available

`ceph df`

The exporter has percentages per pool as well.


> and how much space is available per bucket?

Are you using RGW quotas?

> 
> Looking for a way to use SolarWinds to monitor the entire Ceph cluster space 
> utilization and then also be able to break down each RGW bucket to see how 
> much space it was provisioned for and how much is available.

RGW buckets do not provision space.  Optionally there may be some RGW quotas 
but they're a different thing than you're implying.
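
For the SolarWinds side, a rough sketch of what a poller could scrape, assuming jq is available; the exact JSON field names can vary a bit by release, so verify against your cluster first:

   # cluster-wide totals
   ceph df --format json | jq '.stats | {total_bytes, total_used_raw_bytes, total_avail_bytes}'
   # per-pool stored bytes and percent used
   ceph df --format json | jq '.pools[] | {name, stored: .stats.stored, pct: .stats.percent_used}'
   # per-bucket usage (bucket name is an example)
   radosgw-admin bucket stats --bucket=mybucket | jq '.usage."rgw.main" | {size_actual, num_objects}'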


> 
> -- Michael
> 
> 
> Get Outlook for Android
> This message and its attachments are from Data Dimensions and are intended 
> only for the use of the individual or entity to which it is addressed, and 
> may contain information that is privileged, confidential, and exempt from 
> disclosure under applicable law. If the reader of this message is not the 
> intended recipient, or the employee or agent responsible for delivering the 
> message to the intended recipient, you are hereby notified that any 
> dissemination, distribution, or copying of this communication is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender immediately and permanently delete the original email and destroy 
> any copies or printouts of this email as well as any attachments.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

2024-03-20 Thread Anthony D'Atri
Suggest issuing an explicit deep scrub against one of the subject PGs, see if 
it takes.
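
Something like this, where the PG id is an example taken from ceph health detail:

   ceph health detail | grep 'not deep-scrubbed'
   ceph pg deep-scrub 6.4f
   ceph pg 6.4f query | grep -i scrub_stamp     # check whether the timestamp actually moves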

> On Mar 20, 2024, at 8:20 AM, Michel Jouvin  
> wrote:
> 
> Hi,
> 
> We have a Reef cluster that started to complain a couple of weeks ago about 
> ~20 PGs (out of 10K) not scrubbed/deep-scrubbed in time. Looking at it over the 
> last few days, I saw this affects only those PGs that have not been scrubbed since 
> mid-February. All the other PGs are regularly scrubbed.
> 
> I decided to look if one OSD was present in all these PGs and found one! I 
> restarted this OSD but it had no effect. Looking at the logs for the suspect 
> OSD, I found nothing related to abnormal behaviour (but the log is very 
> verbose at restart time so easy to miss something...). And there is no error 
> associated with the OSD disk.
> 
> Any advice about where to look for some useful information would be 
> appreciated! Should I try to destroy the OSD and re-add it? I'd be more 
> comfortable if I were able to find some diagnostics first...
> 
> Best regards,
> 
> Michel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS space usage

2024-03-19 Thread Anthony D'Atri



> Those files are VM disk images, and they're under constant heavy use, so yes- 
> there/is/ constant severe write load against this disk.

Why are you using CephFS for an RBD application?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph osd different size to create a cluster for Openstack : asking for advice

2024-03-13 Thread Anthony D'Atri
NVMe drives (hopefully enterprise, not client-class) aren't likely to suffer the 
same bottlenecks as HDDs or even SATA SSDs.  And a 2:1 size ratio isn't the 
largest I've seen.

So I would just use all 108 OSDs as a single device class and spread the pools 
across all of them.  That way you won't run out of space in one before the 
other.

When running OSDs of multiple sizes I do recommend setting mon_max_pg_per_osd 
to 1000 so that the larger ones won't run afoul of the default value, 
especially when there's a failure.
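
A rough sketch of what I mean (values are examples, not prescriptions):

ceph config set global mon_max_pg_per_osd 1000
ceph osd crush tree --show-shadow     # confirm everything lands in a single device class
ceph osd df tree                      # watch PGs and %USE per OSD afterwards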


> 
> Hi,
> 
> I need some guidance from you folks...
> 
> I am going to deploy a ceph cluster in HCI mode for an openstack platform.
> My hardware will be :
> - 03 control nodes  :
> - 27 osd nodes : each node has 03x3.8To nvme + 01x1.9To nvme disks (those
> disks will all be used as OSDs)
> 
> In my Openstack I will be creating all sorts of pools : RBD, Cephfs and RGW.
> 
> I am planning to create two crush rules using the disk size as a parameter.
> Then divide my pools between the two rules.
> - RBD to use the 3.8To disks since I need more space here.
> - Cephfs and RGW to use 1.9To disks.
> 
> Is this a good configuration?
> 
> Regards
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Remove cluster_network without routing

2024-03-07 Thread Anthony D'Atri
I think heartbeats will fail over to the public network if the private one doesn't 
work -- that may not always have been the case.
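
For the record, the simple path described below boils down to something like this 
(cephadm assumed; daemon name is illustrative -- restart OSDs gradually, one failure 
domain at a time):

ceph config rm global cluster_network
ceph orch daemon restart osd.0     # repeat per OSD / per failure domain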



>> Hi
>> Cephadm Reef 18.2.0.
>> We would like to remove our cluster_network without stopping the cluster and 
>> without having to route between the networks.
>> global  advanced  cluster_network  192.168.100.0/24  *
>> global  advanced  public_network   172.21.12.0/22    *
>> The documentation[1] states:
>> "
>> You may specifically assign static IP addresses or override cluster_network 
>> settings using the cluster_addr setting for specific OSD daemons.
>> "
>> So for one OSD at a time I could set cluster_addr to override the 
>> cluster_network IP and use the public_network IP instead? As the containers 
>> are using host networking they have access to both IPs and will just layer 2 
>> the traffic, avoiding routing?
>> When all OSDs are running with a public_network IP set via cluster_addr we 
>> can just delete the cluster_network setting and then remove all the 
>> ceph_addr settings, as with no cluster_network setting the public_network 
>> setting will be used?
>> We tried with one OSD and it seems to work. Anyone see a problem with this 
>> approach?
> 
> Turned out to be even simpler for this setup where the OSD containers have 
> access til both host networks:
> 
> 1) ceph config rm global cluster_network -> nothing happened, no automatic 
> redeploy or restart
> 
> 2) Restart OSDs
> 
> Mvh.
> 
> Torkil
> 
>> Thanks
>> Torkil
>> [1] 
>> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/#id3
> 
> -- 
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: bluestore_min_alloc_size and bluefs_shared_alloc_size

2024-03-06 Thread Anthony D'Atri



> On Feb 28, 2024, at 17:55, Joel Davidow  wrote:
> 
> Current situation
> -
> We have three Ceph clusters that were originally built via cephadm on octopus 
> and later upgraded to pacific. All osds are HDD (will be moving to wal+db on 
> SSD) and were resharded after the upgrade to enable rocksdb sharding. 
> 
> The value for bluefs_shared_alloc_size has remained unchanged at 65535. 
> 
> The value for bluestore_min_alloc_size_hdd was 65535 in octopus but is 
> reported as 4096 by `ceph daemon osd.<id> config show` in pacific.

min_alloc_size is baked into a given OSD when it is created.  The central 
config / runtime value does not affect behavior for existing OSDs.  The only 
way to change it is to destroy / redeploy the OSD.

There was a succession of PRs in the Octopus / Pacific timeframe around default 
min_alloc_size for HDD and SSD device classes, including IIRC one temporary 
reversion.  

> However, the osd label after upgrading to pacific retains the value of 65535 
> for bfm_bytes_per_block.

OSD label?

I'm not sure if your Pacific release has the backport, but not that long ago 
`ceph osd metadata` was amended to report the min_alloc_size that a given OSD 
was built with.  If you don't have that, the OSD's startup log should report it.
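
For example (the OSD id is hypothetical, and the exact key is only present if the 
backport is there):

ceph osd metadata 0 | grep -i alloc                  # min_alloc_size the OSD was built with
ceph config get osd bluestore_min_alloc_size_hdd     # what newly created HDD OSDs would use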

-- aad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph is constantly scrubbing 1/4 of all PGs and still have pigs not scrubbed in time

2024-03-06 Thread Anthony D'Atri
I don't see these in the config dump.

I think you might have to apply them to `global` for them to take effect, not 
just `osd`, FWIW.
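
e.g. something like this, then confirm what a given daemon actually sees (values 
and the OSD id are illustrative):

ceph config set global osd_deep_scrub_interval 1209600
ceph config set global osd_max_scrubs 2
ceph config show osd.0 | grep scrub     # shows the effective value and where it came from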

> I have tried various settings, like osd_deep_scrub_interval, osd_max_scrubs, 
> mds_max_scrub_ops_in_progress etc.
> All those get ignored.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Number of pgs

2024-03-05 Thread Anthony D'Atri
If you only have one pool of significant size, then your PG ratio is around 40, 
which IMHO is too low.

If you're using HDDs I personally might set pg_num to 8192; if using NVMe SSDs, 
arguably 16384 -- assuming that your OSD sizes are more or less close to each 
other.


`ceph osd df` will show toward the right how many PG replicas are on each OSD.
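
As a rough sketch (the pool name and target are illustrative; keep pg_num a power 
of two, and the mgr will step it up gradually):

ceph osd pool get mypool pg_num
ceph osd pool set mypool pg_num 8192
ceph osd df                          # watch the PGS column settle afterwards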

> On Mar 5, 2024, at 14:50, Nikolaos Dandoulakis  wrote:
> 
> Hi Anthony,
> 
> I should have said, it’s replicated (3)
> 
> Best,
> Nick
> 
> Sent from my phone, apologies for any typos!
> From: Anthony D'Atri 
> Sent: Tuesday, March 5, 2024 7:22:42 PM
> To: Nikolaos Dandoulakis 
> Cc: ceph-users@ceph.io 
> Subject: Re: [ceph-users] Number of pgs
>  
> This email was sent to you by someone outside the University.
> You should only click on links or attachments if you are certain that the 
> email is genuine and the content is safe.
> 
> Replicated or EC?
> 
> > On Mar 5, 2024, at 14:09, Nikolaos Dandoulakis  wrote:
> >
> > Hi all,
> >
> > Pretty sure not the first time you see a thread like this.
> >
> > Our cluster consists of 12 nodes/153 OSDs/1.2 PiB used, 708 TiB /1.9 PiB 
> > avail
> >
> > The data pool is 2048 pgs big exactly the same number as when the cluster 
> > started. We have no issues with the cluster, everything runs as expected 
> > and very efficiently. We support about 1000 clients. The question is should 
> > we increase the number of pgs? If you think so, what is the sensible number 
> > to go to? 4096? More?
> >
> > I will eagerly await for your response.
> >
> > Best,
> > Nick
> >
> > P.S. Yes, autoscaler is off :)
> > The University of Edinburgh is a charitable body, registered in Scotland, 
> > with registration number SC005336. Is e buidheann carthannais a th' ann an 
> > Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Number of pgs

2024-03-05 Thread Anthony D'Atri
Replicated or EC?

> On Mar 5, 2024, at 14:09, Nikolaos Dandoulakis  wrote:
> 
> Hi all,
> 
> Pretty sure not the first time you see a thread like this.
> 
> Our cluster consists of 12 nodes/153 OSDs/1.2 PiB used, 708 TiB /1.9 PiB avail
> 
> The data pool is 2048 pgs big exactly the same number as when the cluster 
> started. We have no issues with the cluster, everything runs as expected and 
> very efficiently. We support about 1000 clients. The question is should we 
> increase the number of pgs? If you think so, what is the sensible number to 
> go to? 4096? More?
> 
> I will eagerly await for your response.
> 
> Best,
> Nick
> 
> P.S. Yes, autoscaler is off :)
> The University of Edinburgh is a charitable body, registered in Scotland, 
> with registration number SC005336. Is e buidheann carthannais a th' ann an 
> Oilthigh Dhùn Èideann, clàraichte an Alba, àireamh clàraidh SC005336.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help with deep scrub warnings

2024-03-05 Thread Anthony D'Atri
* Try applying the settings to global so that mons/mgrs get them.

* Set your shallow scrub settings back to the default.  Shallow scrubs take 
very few resources

* Set your randomize_ratio back to the default; you’re just bunching them up

* Set the load threshold back to the default; I can’t imagine any OSD node ever 
having a load < 0.3, so you’re basically keeping scrubs from ever running

* osd_deep_scrub_interval is the only thing you should need to change.
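
Concretely, that could look something like this, assuming the values were set at 
the osd level as shown below (two weeks = 1209600 seconds):

ceph config rm osd osd_scrub_sleep
ceph config rm osd osd_scrub_load_threshold
ceph config rm osd osd_deep_scrub_randomize_ratio
ceph config rm osd osd_scrub_min_interval
ceph config rm osd osd_scrub_max_interval
ceph config set global osd_deep_scrub_interval 1209600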

> On Mar 5, 2024, at 2:42 AM, Nicola Mori  wrote:
> 
> Dear Ceph users,
> 
> in order to reduce the deep scrub load on my cluster I set the deep scrub 
> interval to 2 weeks, and tuned other parameters as follows:
> 
> # ceph config get osd osd_deep_scrub_interval
> 1209600.00
> # ceph config get osd osd_scrub_sleep
> 0.10
> # ceph config get osd osd_scrub_load_threshold
> 0.30
> # ceph config get osd osd_deep_scrub_randomize_ratio
> 0.10
> # ceph config get osd osd_scrub_min_interval
> 259200.00
> # ceph config get osd osd_scrub_max_interval
> 1209600.00
> 
> In my admittedly poor knowledge of Ceph's deep scrub procedures, these 
> settings should spread the deep scrub operations over two weeks instead of the 
> default one week, lowering the scrub frequency and the related load. But I'm 
> currently getting warnings like:
> 
> [WRN] PG_NOT_DEEP_SCRUBBED: 56 pgs not deep-scrubbed in time
>pg 3.1e1 not deep-scrubbed since 2024-02-22T00:22:55.296213+
>pg 3.1d9 not deep-scrubbed since 2024-02-20T03:41:25.461002+
>pg 3.1d5 not deep-scrubbed since 2024-02-20T09:52:57.334058+
>pg 3.1cb not deep-scrubbed since 2024-02-20T03:30:40.510979+
>. . .
> 
> I don't understand the first one, since the deep scrub interval should be two 
> weeks so I don''t expect warnings for PGs which have been deep-scrubbed less 
> than 14 days ago (at the moment I'm writing it's Tue Mar  5 07:39:07 UTC 
> 2024).
> 
> Moreover, I don't understand why the deep scrub for so many PGs is lagging 
> behind. Is there something wrong in my settings?
> 
> Thanks in advance for any help,
> 
> Nicola
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs not balanced

2024-03-04 Thread Anthony D'Atri


> I think the short answer is "because you have so wildly varying sizes
> both for drives and hosts".

Arguably OP's OSDs *are* balanced in that their PGs are roughly in line with 
their sizes, but indeed the size disparity is problematic in some ways.

Notably, the 500GB OSD should just be removed.  I think balancing doesn't 
account for WAL/DB/other overhead, so it won't be accurately accounted for and 
can't hold much data anyway.

This cluster shows evidence of reweight-by-utilization having been run, but 
only on two of the hosts.  If the balancer module is active, those override 
weights will confound it.
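
To check for and clear those override weights (the OSD id is hypothetical):

ceph osd tree                  # a REWEIGHT value other than 1.00000 is an override
ceph osd reweight osd.7 1.0    # clear it; repeat for each affected OSD
ceph balancer status
ceph balancer on               # if it isn't already enabled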


> 
> If your drive sizes span from 0.5 to 9.5, there will naturally be
> skewed data, and it is not a huge surprise that the automation has
> some troubles getting it "good". When the balancer places a PG on a
> 0.5-sized drive compared to a 9.5-sized one, it eats up 19x more of
> the "free space" on the smaller one, so there are very few good
> options when the sizes are so different. Even if you placed all PGs
> correctly due to size, the 9.5-sized disk would end up getting 19x
> more IO than the small drive and for hdd, it seldom is possible to
> gracefully handle a 19-fold increase in IO, most of the time will
> probably be spent on seeks.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about erasure coding on cephfs

2024-03-02 Thread Anthony D'Atri


> On Mar 2, 2024, at 10:37 AM, Erich Weiler  wrote:
> 
> Hi Y'all,
> 
> We have a new ceph cluster online that looks like this:
> 
> md-01 : monitor, manager, mds
> md-02 : monitor, manager, mds
> md-03 : monitor, manager
> store-01 : twenty 30TB NVMe OSDs
> store-02 : twenty 30TB NVMe OSDs
> 
> The cephfs storage is using erasure coding at 4:2.  The crush domain is set 
> to "osd".
> 
> (I know that's not optimal but let me get to that in a minute)
> 
> We have a current regular single NFS server (nfs-01) with the same storage as 
> the OSD servers above (twenty 30TB NVME disks).  We want to wipe the NFS 
> server and integrate it into the above ceph cluster as "store-03".  When we 
> do that, we would then have three OSD servers.  We would then switch the 
> crush domain to "host".
> 
> My question is this:  Given that we have 4:2 erasure coding, would the data 
> rebalance evenly across the three OSD servers after we add store-03 such that 
> if a single OSD server went down, the other two would be enough to keep the 
> system online?  Like, with 4:2 erasure coding, would 2 shards go on store-01, 
> then 2 shards on store-02, and then 2 shards on store-03?  Is that how I 
> understand it?

Nope.  If the failure domain is *host*, without a carefully-crafted special 
CRUSH rule, CRUSH will want to spread the 6 shards over 6 failure domains, and 
you will only have 3.  I don’t remember for sure if the PGs would be stuck 
remapped or stuck unable to activate, but either way you would have a very bad 
day.

Say you craft a CRUSH rule that places two shards on each host.  One host goes 
down, and you have at most K shards up.  IIRC the PGs will be `inactive` but 
you won’t lose existing data. 
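
For completeness, a minimal sketch of that kind of rule -- assuming the default 
CRUSH root, an EC 4+2 profile, and three OSD hosts; illustrative only, not a 
drop-in:

rule ec42_two_per_host {
    id 99
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}

With a rule like that, each PG gets two shards per host, so losing one host leaves 
K=4 shards: existing data survives, but with the default min_size of K+1 the PGs 
will still go inactive, as noted above.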

Here are multiple reasons why, for small deployments, I favor 1U servers:

* Having enough servers so that if one is down, service can proceed
* Having enough failure domains to do EC — or at least replication — safely
* Is your networking a bottleneck?


Assuming these are QLC SSDs, do you have their min_alloc_size set to match the 
IU?  Ideally you would mix in a couple of TLC OSDs on each server — in this 
case including the control plane — for the CephFS metadata pool.
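
For example, if the drives have a 16K indirection unit (purely illustrative -- 
check the vendor spec for the actual IU), the value has to be in place before the 
OSDs are created, since it's baked in at OSD creation time:

ceph config set osd bluestore_min_alloc_size_ssd 16384   # only affects OSDs created afterwards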

I’m curious which SSDs you’re using, please write me privately as I have 
history with QLC.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-01 Thread Anthony D'Atri
I have a number of drives in my fleet with old firmware that seems to have 
discard / TRIM bugs, as in the drives get bricked.

Much worse is that since they're on legacy RAID HBAs, many of them can't be 
updated.

ymmv.
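
If you do want to experiment despite that, the two options Igor mentions below 
can be set centrally and are picked up when the OSDs restart -- a sketch, not an 
endorsement:

ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true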

> On Mar 1, 2024, at 13:15, Igor Fedotov  wrote:
> 
> I played with this feature a while ago and recall it had a visible negative 
> impact on user operations due to the need to submit tons of discard 
> operations - effectively each data overwrite triggers one or more 
> discard submissions to disk.
> 
> And I doubt this has been widely used if any.
> 
> Nevertheless recently we've got a PR to rework some aspects of thread 
> management for this stuff, see https://github.com/ceph/ceph/pull/55469
> 
> The author claimed they needed this feature for their cluster so you might 
> want to ask him about their user experience.
> 
> 
> W.r.t documentation - actually there are just two options
> 
> - bdev_enable_discard - enables issuing discard to disk
> 
> - bdev_async_discard - instructs whether discard requests are issued 
> synchronously (along with disk extents release) or asynchronously (using a 
> background thread).
> 
> Thanks,
> 
> Igor
> 
> On 01/03/2024 13:06, jst...@proxforge.de wrote:
>> Is there any update on this? Did someone test the option and has performance 
>> values before and after?
>> Is there any good documentation regarding this option?
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seperate metadata pool in 3x MDS node

2024-02-24 Thread Anthony D'Atri



> 
> I'm designing a new Ceph storage from scratch and I want to increase CephFS
> speed and decrease latency.
> I usually build with WAL+DB on NVMe in front of SAS/SATA SSDs.

Just go with pure-NVMe servers.  NVMe SSDs shouldn't cost much if anything more 
than the few remaining SATA or especially SAS SSDs, and you don't have to pay 
for an anachronistic HBA.  Your OSD lifecycle is way simpler too.  And five 
years from now you won't have to search eBay for additional drives.

> I have 5 racks

Nice.

> and the 3rd "middle" rack is my storage and management rack.

Consider striping all of your services over all 5 racks for fault tolerance.  
That also balances out your power and network usage.


> - At RACK-3 I'm gonna locate 8x 1u OSD server (Spec: 2x E5-2690V4, 256GB,
> 4x 25G, 2x 1.6TB PCI-E NVME "MZ-PLK3T20", 8x 4TB SATA SSD)

You're limited to older / used servers?  4x 25GE is dramatic overkill with 
these resources.  Put your available dollars toward better drives and servers 
if you can.  Are these your bulk data OSDs?

Is this all pre-existing gear?  The MZ-PLK3T20 is a high-durability AIC from 7 
years ago -- and AIC's are all but history in the NVMe world.  A U.2 / E1.S / 
E3.S system would be more futureproof if you can swing it.

> 
> - My Cephfs kernel clients are 40x GPU nodes located at RACK1,2,4,5
> 
> With my current workflow, all the clients;
> 1- visit the rack data switch
> 2- jump to main VPC switch via 2x100G,
> 3- talk with MDS servers,
> 4- Go back to the client with the answer,
> 5- To access data follow the same HOP's and visit the OSD's everytime.

> 
> If I deploy separate metadata pool by using 4x MDS server at top of 
> RACK-1,2,4,5 (Spec: 2x E5-2690V4, 128GB, 2x 10G(Public), 2x 25G (cluster),
> 2x 960GB U.2 NVME "MZ-PLK3T20")

Do you mean MZ-WLK3T20 ?  You don't need a separate cluster / replication 
network, especially with small numbers of slow OSDs.

> Then all the clients will make the request directly in-rack 1 HOP away MDS

Are your clients and ranks pinned such that each rack of clients will 
necessarily talk to the MDS in that rack?  Don't assume that Ceph will do that 
without action on your part.
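
If you do go multi-rank, explicit export pins are one way to get that locality -- 
a sketch, assuming the filesystem name, mount path, and per-rack directory layout 
shown here (all hypothetical):

ceph fs set cephfs max_mds 4                       # e.g. one active rank per client rack
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/rack1    # pin a directory tree to rank 0
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/rack2    # ...and another to rank 1

Even then you'd still need to make sure the MDS daemon holding each rank actually 
runs in the matching rack.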

> servers and if the request is only metadata, then the MDS node doesn't need 
> to redirect the request to OSD nodes.

Others with more CephFS experience may weigh in, but you might do well to have 
a larger number of SSDs for the CephFS metadata pool.  Definitely make sure that 
you have a decent number of PGs in that pool, i.e. more than the autoscaler 
will assign by default.
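
e.g. (the pool name and counts are illustrative):

ceph osd pool set cephfs.cephfs.meta pg_num 256
ceph osd pool set cephfs.cephfs.meta pg_num_min 128   # or leave the autoscaler on but give it a floor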

> Also by locating MDS servers with seperated metadata pool across all the
> racks will reduce the high load on main VPC switch at RACK-3
> 
> If I'm not missing anything then only Recovery workload will suffer with
> this topology.

I wouldn't worry so much about network hops.

> 
> What do you think?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
>   Low space hindering backfill (add storage if this doesn't resolve 
> itself): 21 pgs backfill_toofull

^^^ Ceph even told you what you need to do ;)


If you have recovery taking place and the numbers of misplaced objects and 
*full PGs/pools keep decreasing, then yes, wait.

As for getting your MDS / CephFS going after that, I’ll defer to others.



> On Feb 24, 2024, at 11:07 AM, nguyenvand...@baoviet.com.vn wrote:
> 
> Hi Mr Anthony,
> 
> Forget it, the osd is UP and recovery speed is x10times
> 
> Amazing
> 
> And now we just wait, right ?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
Your recovery is stuck because there are no OSDs that have enough space to 
accept data.

Your second OSD host appears to only have 9 OSDs currently, so you should be 
able to add a 10TB OSD there without removing anything.

That will enable data to move to all three of your 10TB OSDs.

> On Feb 24, 2024, at 10:41 AM, nguyenvand...@baoviet.com.vn wrote:
> 
> HolySh*** 
> 
> First, we change the mon_max_pg_per_osd to 1000
> 
> About adding a disk to cephosd02 -- for more detail, what is "TO", sir? I'll 
> discuss it with my boss. To be honest, I'm thinking that the volume 
> recovery progress will run into problems...
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
You aren’t going to be able to finish recovery without having somewhere to 
recover TO.

> On Feb 24, 2024, at 10:33 AM, nguyenvand...@baoviet.com.vn wrote:
> 
> Thank you, sir. But I think I'll wait for the PG BACKFILLFULL recovery to finish; 
> my boss is very angry now and will not allow me to add one more disk (he thinks 
> that would make Ceph take more time for recovering and rebalancing). We 
> want to wait for the volume recovery progress to finish.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
You also might want to increase mon_max_pg_per_osd since you have a wide spread 
of OSD sizes.

Default is 250.  Set it to 1000.

> On Feb 24, 2024, at 10:30 AM, Anthony D'Atri  wrote:
> 
> Add a 10tb HDD to the third node as I suggested, that will help your cluster.
> 
> 
>> On Feb 24, 2024, at 10:29 AM, nguyenvand...@baoviet.com.vn wrote:
>> 
>> I will correct some small things:
>> 
>> we have 6 nodes: 3 OSD nodes and 3 gateway nodes (which run the RGW, MDS and NFS 
>> services)
>> you're correct, 2 of the 3 OSD nodes have one new 10 TiB disk
>> 
>> About your suggestion to add another OSD host, we will. But we need to end 
>> this nightmare; my NFS folder, which has 10 TiB of data, is down :(
>> 
>> My ratio
>> ceph osd dump | grep ratio
>> full_ratio 0.95
>> backfillfull_ratio 0.92
>> nearfull_ratio 0.85
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
Add a 10tb HDD to the third node as I suggested, that will help your cluster.


> On Feb 24, 2024, at 10:29 AM, nguyenvand...@baoviet.com.vn wrote:
> 
> I will correct some small things:
> 
> we have 6 nodes: 3 OSD nodes and 3 gateway nodes (which run the RGW, MDS and NFS 
> services)
> you're correct, 2 of the 3 OSD nodes have one new 10 TiB disk
> 
> About your suggestion to add another OSD host, we will. But we need to end this 
> nightmare; my NFS folder, which has 10 TiB of data, is down :(
> 
> My ratio
> ceph osd dump | grep ratio
> full_ratio 0.95
> backfillfull_ratio 0.92
> nearfull_ratio 0.85
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85


Read the four sections here:  
https://docs.ceph.com/en/quincy/rados/operations/health-checks/#osd-out-of-order-full
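
If you do need short-term headroom to let recovery proceed, the knobs look like 
this -- raise them only slightly, and revert once usage drops (values are 
illustrative):

ceph osd set-backfillfull-ratio 0.93
ceph osd set-full-ratio 0.96
# revert to your previous values once backfill completes and usage drops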

> On Feb 24, 2024, at 10:12 AM, nguyenvand...@baoviet.com.vn wrote:
> 
> Hi Mr Anthony, Could you tell me more details about raising the full and 
> backfullfull threshold 
> 
> is it
> ceph tell 'osd.*' injectargs --osd-max-backfills=2 --osd-recovery-max-active=6
> ??
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri
There ya go.

You have 4 hosts, one of which appears to be down and has a single OSD that is 
so small as to not be useful.  Whatever cephgw03 is, it looks like a mistake.  
OSDs much smaller than, say, 1TB often aren’t very useful.

Your pools appear to be replicated, size=3.

So each of your cephosd* hosts stores one replica of each RADOS object.

You added the 10TB spinners to only two of your hosts, which means that they’re 
only being used as though they were 4TB OSDs.  That’s part of what’s going on.

You want to add a 10TB spinner to cephosd02.  That will help your situation 
significantly.

After that, consider adding a cephosd04 host.  Having at least one more failure 
domain than replicas lets you better use uneven host capacities.




> On Feb 24, 2024, at 10:06 AM, nguyenvand...@baoviet.com.vn wrote:
> 
> Hi Mr Anthony,
> 
> pls check the output 
> 
> https://anotepad.com/notes/s7nykdmc
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread Anthony D'Atri

> 
> 2) It looks like you might have an interesting crush map. Allegedly you have 
> 41TiB of space but you can’t finish recovering because you have lots of PGs stuck, as 
> their destination is too full. Are you running homogeneous hardware or do you 
> have different drive sizes? Are all the weights set correctly?

I suspect that the OP might not have balancing enabled, or that it isn’t 
functioning well.  Which a complex CRUSH topology could exacerbate.

*Temporarily* raising the full and backfillfull thresholds might help, but they 
would want to be reverted after sorting out the balancing.

Please supply

`ceph osd tree`
`ceph osd df`
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: High IO utilization for bstore_kv_sync

2024-02-22 Thread Anthony D'Atri



>  You can sometimes find really good older drives like Intel P4510s on eBay 
> for reasonable prices.  Just watch out for how much write wear they have on 
> them.

Also be sure to update to the latest firmware before use, then issue a Secure 
Erase.
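
With nvme-cli that's roughly the following (device names are hypothetical, and the 
format step destroys all data on the namespace):

nvme fw-log /dev/nvme0              # confirm the firmware slot / revision in use
nvme format /dev/nvme0n1 --ses=1    # user-data secure erase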
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

