[ceph-users] Transmit rate metric based per bucket
Hello, I'd like to know whether there is a way to query metrics or logs in Octopus (or in a newer version, which I'm also interested in for the future) about the bandwidth used per bucket for put/get operations? Thank you

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
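(Not part of the original thread: one way to get per-bucket byte counts in Octopus is the RGW usage log, enabled with `rgw_enable_usage_log = true` and queried with `radosgw-admin usage show`. The JSON field names below are my reading of that output and should be verified against your release. A minimal sketch aggregating per-bucket bytes from such output:)

```python
import json
from collections import defaultdict

# Sample of the JSON shape that `radosgw-admin usage show` emits -- the field
# names ("entries", "buckets", "categories", "bytes_sent", "bytes_received")
# are my reading of the usage log output; verify against your release.
usage_json = """
{
  "entries": [
    {
      "user": "alice",
      "buckets": [
        {
          "bucket": "photos",
          "categories": [
            {"category": "get_obj", "ops": 10, "bytes_sent": 1048576, "bytes_received": 0},
            {"category": "put_obj", "ops": 5, "bytes_sent": 0, "bytes_received": 524288}
          ]
        }
      ]
    }
  ]
}
"""

def per_bucket_bandwidth(usage):
    """Sum bytes sent (GET traffic) and received (PUT traffic) per bucket."""
    totals = defaultdict(lambda: {"bytes_sent": 0, "bytes_received": 0})
    for entry in usage.get("entries", []):
        for bucket in entry.get("buckets", []):
            name = bucket["bucket"]
            for cat in bucket.get("categories", []):
                totals[name]["bytes_sent"] += cat.get("bytes_sent", 0)
                totals[name]["bytes_received"] += cat.get("bytes_received", 0)
    return dict(totals)

totals = per_bucket_bandwidth(json.loads(usage_json))
print(totals)
```

Note that the usage log only covers S3/Swift operation totals, not a live transmit rate; for rates you would sample these counters periodically and take deltas.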
[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)
As a side note: there's the Windows RBD driver, which will get you way more performance. It's labeled beta, but it seems to work fine for a lot of people. If you have a test lab you could try that.

Angelo.

On 19/06/2023 18:16, Work Ceph wrote:
> I see, thanks for the feedback guys! It is interesting that Ceph Manager does not allow us to export iSCSI blocks without selecting 2 or more iSCSI portals; therefore, we will always use at least two, and as a consequence that feature is not going to be supported. Can I export an RBD image via the iSCSI gateway using only one portal via gwcli?
>
> @Maged Mokhtar, I am not sure I follow. Do you have an iSCSI implementation that we can use to somehow replace the default iSCSI server in the default Ceph iSCSI Gateway? I didn't quite understand what the PetaSAN project is, and whether it is an open-source solution from which we can just pick/select/use one of its modules (e.g. just the iSCSI implementation).
>
> On Mon, Jun 19, 2023 at 10:07 AM Maged Mokhtar wrote:
> > Windows Cluster Shared Volumes and Failover Clustering require the block device to support clustered persistent reservations to coordinate access by multiple hosts. The default iSCSI implementation in Ceph does not support this; you can use the iSCSI implementation in the PetaSAN project (www.petasan.org), which supports this feature and provides a high-performance implementation. We currently use Ceph 17.2.5.
> >
> > On 19/06/2023 14:47, Work Ceph wrote:
> > > Hello guys, we have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows clients. Recently, we needed to add some VMware clusters as clients for the iSCSI GW, and also Windows systems using Cluster Shared Volumes (CSV), and we are facing a weird situation. In Windows, for instance, the iSCSI block can be mounted, formatted and consumed by all nodes, but when we add it to the CSV it fails with a generic exception. The same happens in VMware when we try to use it with VMFS. We cannot find the root cause of these errors; however, they seem to be linked to multiple nodes consuming the same block via shared file systems. Have you seen this before? Are we missing some basic configuration in the iSCSI GW?

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)
I see, thanks for the feedback guys! It is interesting that Ceph Manager does not allow us to export iSCSI blocks without selecting 2 or more iSCSI portals; therefore, we will always use at least two, and as a consequence that feature is not going to be supported. Can I export an RBD image via the iSCSI gateway using only one portal via gwcli?

@Maged Mokhtar, I am not sure I follow. Do you have an iSCSI implementation that we can use to somehow replace the default iSCSI server in the default Ceph iSCSI Gateway? I didn't quite understand what the PetaSAN project is, and whether it is an open-source solution from which we can just pick/select/use one of its modules (e.g. just the iSCSI implementation).

On Mon, Jun 19, 2023 at 10:07 AM Maged Mokhtar wrote:
> Windows Cluster Shared Volumes and Failover Clustering require the block device to support clustered persistent reservations to coordinate access by multiple hosts. The default iSCSI implementation in Ceph does not support this; you can use the iSCSI implementation in the PetaSAN project (www.petasan.org), which supports this feature and provides a high-performance implementation. We currently use Ceph 17.2.5.
>
> On 19/06/2023 14:47, Work Ceph wrote:
> > Hello guys,
> > We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows clients.
> > Recently, we needed to add some VMware clusters as clients for the iSCSI GW, and also Windows systems using Cluster Shared Volumes (CSV), and we are facing a weird situation. In Windows, for instance, the iSCSI block can be mounted, formatted and consumed by all nodes, but when we add it to the CSV it fails with a generic exception. The same happens in VMware when we try to use it with VMFS.
> > We cannot find the root cause of these errors; however, they seem to be linked to multiple nodes consuming the same block via shared file systems. Have you seen this before?
> > Are we missing some basic configuration in the iSCSI GW?

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
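(Not part of the original thread: on the single-portal question, ceph-iscsi enforces a minimum gateway count before a target can be created. As far as I know this is controlled by a setting in /etc/ceph/iscsi-gateway.cfg; the setting name and defaults below are my understanding of the ceph-iscsi configuration and should be verified for your version. A hedged sketch:)

```ini
; /etc/ceph/iscsi-gateway.cfg on each gateway node -- hedged sketch,
; the minimum_gateways name is per my reading of the ceph-iscsi settings
[config]
cluster_name = ceph
gateway_keyring = ceph.client.admin.keyring
api_secure = false
; allow target creation with a single gateway (default is 2):
minimum_gateways = 1
```

Note that dropping to a single gateway also drops the multipath redundancy the two-portal requirement exists to provide.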
[ceph-users] Re: Help needed to configure erasure coding LRC plugin
Hi, adding the dev mailing list, hopefully someone there can chime in. But apparently the LRC code hasn't been maintained for a few years (https://github.com/ceph/ceph/tree/main/src/erasure-code/lrc). Let's see...

Quoting Michel Jouvin:

Hi Eugen,

Thank you very much for these detailed tests, which match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers, as it seems they are not following this (very high volume) list? Or could somebody pass the email thread on to one of them? Help would be really appreciated.

Cheers, Michel

On 19/06/2023 at 14:09, Eugen Block wrote:

Hi, I have a real hardware cluster available for testing now. I'm not sure whether I'm completely misunderstanding how it's supposed to work or whether it's a bug in the LRC plugin. This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use 15 nodes to be able to recover if one node fails. Given that I need one additional locality chunk per DC, I need a profile with k + m = 12, so I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:

ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 crush-failure-domain=host crush-locality=room crush-device-class=hdd

I created a pool with only one PG to make the output more readable. This profile should allow the cluster to sustain the loss of three chunks, and the results are interesting. This is what I tested:

1. I stopped all OSDs on one host and the PG was still active with one missing chunk; everything's good.
2. Stopping a second host in the same DC resulted in the PG being marked as "down". That was unexpected, since with m=3 I expected the PG to still be active but degraded. Before test #3 I started all OSDs to have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing, and the PG was still active.

Apparently, this profile is able to sustain the loss of m chunks, but not of an entire DC. I get the impression (and I also discussed this with a colleague) that LRC as implemented is designed to lose only single OSDs, which can then be recovered more quickly with fewer surviving OSDs, saving bandwidth. Or this is a bug, because according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (not reflecting my k, m, l requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 also fails, eventually step 1 contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1 and therefore the PG stays down although there are enough surviving chunks. I'm not sure whether my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.

Thanks, Eugen

[1] https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration

Quoting Michel Jouvin:

Hi, I realize that the crushmap I attached to one of my emails, probably required to understand this discussion, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested. Best regards, Michel

On 21/05/2023 at 16:07, Michel Jouvin wrote:

Hi Eugen, my LRC pool is also somewhat experimental, so nothing really urgent. If you manage to do some tests that help me understand the problem, I remain interested. I propose to keep this thread for that. As quoted, I shared my crush map in the email you answered, if the attachment was not suppressed by mailman. Cheers, Michel

Sent from my mobile

On 18 May 2023 at 11:19:35, Eugen Block wrote:

Hi, I don't have a good explanation for this yet, but I'll soon get the opportunity to play around with a decommissioned cluster. I'll try to get a better understanding of the LRC plugin, but it might take some time, especially since my vacation is coming up. :-) I have some thoughts about the down PGs with failure domain OSD, but I don't have anything to confirm it yet.

Quoting Curt:

Hi, I've been following this thread with interest, as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What is your current crush steps rule? I know you made
[ceph-users] Re: CephFS metadata pool grows by two orders of magnitude while trimming (?) snapshots
Hi Patrick,

>> The event log size of 3/5 MDS is also very high, still. mds.1, mds.3, and mds.4 report between 4 and 5 million events, mds.0 around 1.4 million, and mds.2 between 0 and 200,000. The numbers have been constant since my last MDS restart four days ago. I ran your ceph-gather.sh script a couple of times, but it dumps only mds.0. Should I modify it to dump mds.3 instead so you can have a look?
>
> Yes, please.

The session load on mds.3 had already resolved itself after a few days, so I cannot reproduce it any more. Right now, mds.0 has the highest load and a steadily growing event log, but it's not crazy (yet). Nonetheless, I've sent you my dumps with upload ID b95ee882-21e1-4ea1-a419-639a86acc785. The older dumps are from when mds.3 was under load, but they are all from mds.0. I also attached a newer batch, which I created just a few minutes ago.

Janek

--
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany
Phone: +49 3643 58 3577
www.webis.de

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
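(Not part of the original thread: the per-MDS event counts discussed above can, as far as I know, be read from the "mds_log" perf counter section of each MDS; the section and counter names are my understanding and may vary between releases. A hedged sketch, with <name> standing for your daemon name:)

```shell
# Hedged: dump the journal counters of one MDS ("ev" = events in the
# event log, "seg" = segments); verify the names against your release.
ceph tell mds.<name> perf dump mds_log
```

Sampling this periodically would show whether the event log is actually growing or merely large.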
[ceph-users] Re: Help needed to configure erasure coding LRC plugin
Hi Eugen,

Thank you very much for these detailed tests, which match what I observed and reported earlier. I'm happy to see that we have the same understanding of how it should work (based on the documentation). Is there any other way than this list to get in contact with the plugin developers, as it seems they are not following this (very high volume) list? Or could somebody pass the email thread on to one of them? Help would be really appreciated.

Cheers, Michel

On 19/06/2023 at 14:09, Eugen Block wrote:

Hi, I have a real hardware cluster available for testing now. I'm not sure whether I'm completely misunderstanding how it's supposed to work or whether it's a bug in the LRC plugin. This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use 15 nodes to be able to recover if one node fails. Given that I need one additional locality chunk per DC, I need a profile with k + m = 12, so I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host. I checked the chunk placement and it is correct. This is the profile I created:

ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 crush-failure-domain=host crush-locality=room crush-device-class=hdd

I created a pool with only one PG to make the output more readable. This profile should allow the cluster to sustain the loss of three chunks, and the results are interesting. This is what I tested:

1. I stopped all OSDs on one host and the PG was still active with one missing chunk; everything's good.
2. Stopping a second host in the same DC resulted in the PG being marked as "down". That was unexpected, since with m=3 I expected the PG to still be active but degraded. Before test #3 I started all OSDs to have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing, and the PG was still active.

Apparently, this profile is able to sustain the loss of m chunks, but not of an entire DC. I get the impression (and I also discussed this with a colleague) that LRC as implemented is designed to lose only single OSDs, which can then be recovered more quickly with fewer surviving OSDs, saving bandwidth. Or this is a bug, because according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (not reflecting my k, m, l requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 also fails, eventually step 1 contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1 and therefore the PG stays down although there are enough surviving chunks. I'm not sure whether my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.

Thanks, Eugen

[1] https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration

Quoting Michel Jouvin:

Hi, I realize that the crushmap I attached to one of my emails, probably required to understand this discussion, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested. Best regards, Michel

On 21/05/2023 at 16:07, Michel Jouvin wrote:

Hi Eugen, my LRC pool is also somewhat experimental, so nothing really urgent. If you manage to do some tests that help me understand the problem, I remain interested. I propose to keep this thread for that. As quoted, I shared my crush map in the email you answered, if the attachment was not suppressed by mailman. Cheers, Michel

Sent from my mobile

On 18 May 2023 at 11:19:35, Eugen Block wrote:

Hi, I don't have a good explanation for this yet, but I'll soon get the opportunity to play around with a decommissioned cluster. I'll try to get a better understanding of the LRC plugin, but it might take some time, especially since my vacation is coming up. :-) I have some thoughts about the down PGs with failure domain OSD, but I don't have anything to confirm it yet.

Quoting Curt:

Hi, I've been following this thread with interest, as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What is your current crush steps rule? I know you made changes since your first post and had some thoughts I wanted to share, but wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers, I'm thinking 6 or 9 per data center min, but that could be my lack of
[ceph-users] Re: header_limit in AsioFrontend class
On Sat, Jun 17, 2023 at 8:37 AM Vahideh Alinouri wrote:
> Dear Ceph Users,
>
> I am writing to request the backporting of changes related to the AsioFrontend class, specifically regarding the header_limit value.
>
> In the Pacific release of Ceph, the header_limit value in the AsioFrontend class was set to 4096. Since the Quincy release, there has been a configurable option to set the header_limit value, and its default is 16384.
>
> I would greatly appreciate it if someone from the Ceph development team could backport this change to the older version.
>
> Best regards,
> Vahideh Alinouri

Hi Vahideh, I've prepared that Pacific backport. You can follow its progress in https://tracker.ceph.com/issues/61728

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Starting v17.2.5 RGW SSE with default key (likely others) no longer works
On Sat, Jun 17, 2023 at 1:11 PM Jayanth Reddy wrote:
> Hello Folks,
>
> I've been experimenting with RGW encryption and found this out. Focusing on Quincy and Reef dev: for SSE (any method) to work, transit has to be encrypted end to end; however, if there is a proxy, then [1] can be used to tell RGW that SSL is being terminated. As per the docs, RGW can still accept SSE if rgw_crypt_require_ssl is set to false, overriding the requirement for encryption in transit. Below are my observations.
>
> Until v17.2.3 (quay.io/ceph/ceph@sha256:43f6e905f3e34abe4adbc9042b9d6f6b625dee8fa8d93c2bae53fa9b61c3df1a), setting the same key as in [2] would show the object as unreadable when copied using
> # rados -p default.rgw.buckets.data get 03c2ef32-b7c8-4e18-8e0c-ebac10a42f10.17254.1_file.plain file.enc
> The object would be unreadable, while the original object is in plain text. Of course, this is with rgw_crypt_require_ssl set to false, or [1].
>
> However, starting with v17.2.4, and even in my recent testing with reef-dev (18.0.0-4353-g1e3835ab 1e3835abb2d19ce6ac4149c260ef804f1041d751), when I try getting the same object onto disk using the rados command, the object (containing plain text) is still readable.
>
> Has something changed since v17.2.4? I'll also test with Pacific and let you know. Not sure if it affects other SSE mechanisms as well.
>
> [1] https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_trust_forwarded_https
> [2] https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only
>
> Thanks,
> Jayanth Reddy

Hi Jayanth, 17.2.4 coincides with backports of the SSE-S3 and PutBucketEncryption features. Those changes include a regression where the rgw_crypt_default_encryption_key configurable no longer applies. You can track the fix for this in https://tracker.ceph.com/issues/61473

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
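(Not part of the original thread: a hedged sketch of the at-rest check described above. The pool name and object marker are taken from the thread; the openssl-generated key is an arbitrary 256-bit value, since the default-key mechanism is documented as testing-only and must never be used in production:)

```shell
# Testing only -- never use a static default key in production.
KEY=$(openssl rand 32 | base64 -w0)          # arbitrary 256-bit key, base64
ceph config set client.rgw rgw_crypt_default_encryption_key "$KEY"

# Upload an object through S3, then fetch the raw RADOS object behind it.
# With the default key applied, the raw object must differ from the plaintext.
rados -p default.rgw.buckets.data get \
    03c2ef32-b7c8-4e18-8e0c-ebac10a42f10.17254.1_file.plain file.enc
cmp -s file.enc file.plain && echo "NOT encrypted at rest" || echo "encrypted at rest"
```

On the affected releases the regression means the comparison reports "NOT encrypted at rest" even with the key set.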
[ceph-users] Re: same OSD in multiple CRUSH hierarchies
Hi,

Actually, I've learned that a rule does not need to start with a root bucket, so I can have rules that only consider a subtree of my total resources and achieve what I was trying to do with the different disjunct hierarchies.

BTW: it is possible to have different trees with different roots, with some OSDs being part of multiple such trees, and to create different rules that start with one root or the other. But I was told that this could mess up the calculations of the PG autoscaler and other housekeeping functions. So it seems the better option is to have each OSD in one single tree and to use rules that only consider subtrees...

Regards, Laszlo

Date: Mon, 19 Jun 2023 07:41:35 + From: Eugen Block Subject: [ceph-users] Re: same OSD in multiple CRUSH hierarchies To: ceph-users@ceph.io Message-ID: <20230619074135.horde.gs8nakqgzhlbv0hpymj-...@webmail.nde.ag>

Hi,

I don't think this is going to work. Each OSD belongs to a specific host, and you can't have multiple buckets (e.g. of bucket type "host") with the same name in the crush tree. But if I understand your requirement correctly, there should be no need to do it this way. If you structure your crush tree according to your separation requirements and the critical pools use designated rules, you can still have a rule that doesn't care about the data separation but distributes the replicas across the available hosts (given your failure domain is "host"), which is already the default for the replicated_rule. Did I misunderstand something?

Regards, Eugen

Quoting Budai Laszlo:

Hi there, I'm curious whether there is anything against configuring an OSD to be part of multiple CRUSH hierarchies. I'm thinking of the following scenario: I want to create pools that use distinct sets of OSDs, to make sure that a piece of data which is replicated at the application level will not end up on the same OSD. So I would create multiple CRUSH hierarchies (root - host - osd) using different OSDs in each, and different rules that use those hierarchies. Then I would create pools with the different rules and use those pools for storing the data of the different application instances. But I would also like to use the OSDs in the "default hierarchy" set up by Ceph, where all the hosts are in the same root bucket with the default replicated rule, so my generic data volumes would be able to spread across all available OSDs. Is there something against this setup? Thank you for any advice!

Laszlo

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
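(Not part of the original thread: a hedged sketch of the subtree approach described above, in crushmap text syntax. The bucket name "rack1" and the rule id are illustrative placeholders; compile with crushtool and verify placement before using on a real cluster:)

```
# Hypothetical rule that only considers the subtree rooted at the
# non-root bucket "rack1" (name is an example, not from the thread):
rule rack1_only {
    id 10
    type replicated
    step take rack1
    step chooseleaf firstn 0 type host
    step emit
}
```

A pool created with this rule will place replicas only on hosts below "rack1", while pools using the default replicated_rule continue to spread across the whole default root.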
[ceph-users] How does a "ceph orch restart SERVICE" affect availability?
The documentation very briefly explains a few core commands for restarting things (https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons), but I feel I'm lacking quite some details about what is safe to do. I have a system in production, clusters connected via CephFS and some shared block devices. We would like to restart some things due to some new network configurations. Going daemon by daemon would take forever, so I'm curious what happens if one tries the command:

ceph orch restart osd

Will it try to be smart and restart just a few at a time to keep things up and available, or will it just trigger a restart everywhere simultaneously? I guess in my current scenario, restarting one host at a time makes the most sense, with a

systemctl restart ceph-{fsid}.target

and then checking that "ceph -s" says OK before proceeding to the next host. But I'm still curious what the "ceph orch restart xxx" command would do (though not curious enough to try it out in production).

Best regards, Mikael
Chalmers University of Technology

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
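(Not part of the original message: a hedged sketch of the host-by-host procedure described above. Host names are placeholders, {fsid} is your cluster fsid as in the text, and setting noout is an assumption about wanting to avoid rebalancing while OSDs bounce:)

```shell
# Optional: avoid data movement while daemons restart briefly.
ceph osd set noout

for host in host1 host2 host3; do            # placeholder host names
    ssh "$host" 'systemctl restart ceph-{fsid}.target'
    # Wait until the cluster settles before touching the next host.
    until ceph health | grep -q HEALTH_OK; do
        sleep 10
    done
done

ceph osd unset noout
```

With noout set the health check will report a warning rather than HEALTH_OK, so in practice you would either grep for the expected warning only or check that all PGs are active+clean instead.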
[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)
Windows Cluster Shared Volumes and Failover Clustering require the block device to support clustered persistent reservations to coordinate access by multiple hosts. The default iSCSI implementation in Ceph does not support this; you can use the iSCSI implementation in the PetaSAN project (www.petasan.org), which supports this feature and provides a high-performance implementation. We currently use Ceph 17.2.5.

On 19/06/2023 14:47, Work Ceph wrote:
> Hello guys,
> We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows clients. Recently, we needed to add some VMware clusters as clients for the iSCSI GW, and also Windows systems using Cluster Shared Volumes (CSV), and we are facing a weird situation. In Windows, for instance, the iSCSI block can be mounted, formatted and consumed by all nodes, but when we add it to the CSV it fails with a generic exception. The same happens in VMware when we try to use it with VMFS. We cannot find the root cause of these errors; however, they seem to be linked to multiple nodes consuming the same block via shared file systems. Have you seen this before? Are we missing some basic configuration in the iSCSI GW?

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)
On 19.06.23 13:47, Work Ceph wrote:
> Recently, we had the need to add some VMware clusters as clients for the iSCSI GW and also Windows systems with the use of Cluster Shared Volumes (CSV), and we are facing a weird situation. In Windows, for instance, the iSCSI block can be mounted, formatted and consumed by all nodes, but when we add it to the CSV it fails with some generic exception. The same happens in VMware when we try to use it with VMFS.

The iSCSI target used does not support SCSI persistent group reservations when in multipath mode: https://docs.ceph.com/en/quincy/rbd/iscsi-initiators/ AFAIK VMware uses these in VMFS.

Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de
Tel: 030-405051-43 Fax: 030-405051-19
Zwangsangaben lt. §35a GmbHG: HRB 93818 B / Amtsgericht Berlin-Charlottenburg, Geschäftsführer: Peer Heinlein -- Sitz: Berlin

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Grafana service fails to start due to bad directory name after Quincy upgrade
Hi,

so Grafana is starting successfully now? What did you change? Regarding the container images: yes, there are defaults in cephadm which can be overridden with ceph config. Can you share this output?

ceph config dump | grep container_image

I tend to always use a specific image as described here [2]. I also haven't deployed Grafana via the dashboard yet, so I can't really comment on that, nor on the warnings you report.

Regards, Eugen

[2] https://docs.ceph.com/en/latest/cephadm/services/monitoring/#using-custom-images

Quoting "Adiga, Anantha":

Hi Eugen,

Thank you for your response; here is the update. The upgrade to Quincy was done following the cephadm orch upgrade procedure:

ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6

The upgrade completed without errors. After the upgrade, upon creating the Grafana service from the Ceph dashboard, it deployed Grafana 6.7.4. The version is hardcoded in the code; should it not be 8.3.5, as listed below in the Quincy documentation? See below [Grafana service started from Ceph dashboard].

The Quincy documentation (https://docs.ceph.com/en/latest/releases/quincy/) states:

"Monitoring and alerting: 43 new alerts have been added (totalling 68) improving observability of events affecting: cluster health, monitors, storage devices, PGs and CephFS. Alerts can now be sent externally as SNMP traps via the new SNMP gateway service (the MIB is provided). Improved integrated full/nearfull event notifications. Grafana Dashboards now use grafonnet format (though they're still available in JSON format). Stack update: images for monitoring containers have been updated. Grafana 8.3.5, Prometheus 2.33.4, Alertmanager 0.23.0 and Node Exporter 1.3.1. This reduced exposure to several Grafana vulnerabilities (CVE-2021-43798, CVE-2021-39226, CVE-2021-43798, CVE-2020-29510, CVE-2020-29511)."

I notice that the versions of the rest of the stack that the Ceph dashboard deploys are also older than what is documented: Prometheus 2.7.2, Alertmanager 0.16.2 and Node Exporter 0.17.0, in addition to Grafana 6.7.4.

The Grafana service reports a few warnings (highlighted below):

root@fl31ca104ja0201:/home/general# systemctl status ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service
● ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service - Ceph grafana.fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
   Loaded: loaded (/etc/systemd/system/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2023-06-13 03:37:58 UTC; 11h ago
 Main PID: 391896 (bash)
    Tasks: 53 (limit: 618607)
   Memory: 17.9M
   CGroup: /system.slice/system-ceph\x2dd0a3b6e0\x2dd2c3\x2d11ed\x2dbe05\x2da7a3a1d7a87e.slice/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104j>
           ├─391896 /bin/bash /var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana.fl31ca104ja0201/unit.run
           └─391969 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --init --name ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-grafana-fl>

-- Logs begin at Sun 2023-06-11 20:41:51 UTC, end at Tue 2023-06-13 15:35:12 UTC. --
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="alter user_auth.auth_id to length 190"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="Add OAuth access token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="Add OAuth refresh token to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="Add OAuth token type to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="Add OAuth expiry to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="Add index to user_id column in user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="create server_lock table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="add index server_lock.operation_uid"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="create user auth token table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ lvl=info msg="Executing migration" logger=migrator id="add unique index user_auth_token.auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]:
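(Not part of the original thread: a hedged sketch of pinning the monitoring stack to the images the Quincy release notes mention, via the cephadm overrides referenced in [2]. The config keys are per the cephadm monitoring docs; the image tags are assumptions taken from the release notes, so check what is current for your release:)

```shell
ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:8.3.5
ceph config set mgr mgr/cephadm/container_image_prometheus quay.io/prometheus/prometheus:v2.33.4
ceph config set mgr mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager:v0.23.0
ceph config set mgr mgr/cephadm/container_image_node_exporter quay.io/prometheus/node-exporter:v1.3.1

# Redeploy so the services pick up the new images:
ceph orch redeploy grafana
ceph orch redeploy prometheus
ceph orch redeploy alertmanager
ceph orch redeploy node-exporter
```

This avoids depending on whichever defaults are hardcoded in the deployed cephadm/dashboard version.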
[ceph-users] Re: Help needed to configure erasure coding LRC plugin
Hi, I have a real hardware cluster available for testing now. I'm not sure whether I'm completely misunderstanding how it's supposed to work or whether it's a bug in the LRC plugin. This cluster has 18 HDD nodes available across 3 rooms (or DCs); I intend to use 15 nodes so that I can recover if one node fails. Given that I need one additional locality chunk per DC, I need a profile with k + m = 12. So I chose k=9, m=3, l=4, which creates 15 chunks in total across those 3 DCs, one chunk per host; I checked the chunk placement and it is correct. This is the profile I created:

ceph osd erasure-code-profile set lrc1 plugin=lrc k=9 m=3 l=4 crush-failure-domain=host crush-locality=room crush-device-class=hdd

I created a pool with only one PG to make the output more readable. This profile should allow the cluster to sustain the loss of three chunks, and the results are interesting. This is what I tested:

1. I stopped all OSDs on one host and the PG was still active with one missing chunk; everything's good.
2. Stopping a second host in the same DC resulted in the PG being marked as "down". That was unexpected, since with m=3 I expected the PG to still be active but degraded. Before test #3 I started all OSDs to have the PG active+clean again.
3. I stopped one host per DC, so in total 3 chunks were missing, and the PG was still active.

Apparently, this profile is able to sustain the loss of m chunks, but not an entire DC. I get the impression (and I also discussed this with a colleague) that LRC with this implementation is either designed to lose only single OSDs, which can be recovered quicker with fewer surviving OSDs while saving bandwidth.
Or this is a bug, because according to the low-level description [1] the algorithm works its way up in reverse order through the configured layers, as in this example (not matching my k, m, l requirements, just for reference):

chunk nr    01234567
step 1      _cDD_cDD
step 2      cDDD____
step 3      ____cDDD

So if a whole DC fails and the chunks from step 3 cannot be recovered, and maybe step 2 also fails, eventually step 1 contains the actual k and m chunks, which should sustain the loss of an entire DC. My impression is that the algorithm somehow doesn't arrive at step 1 and therefore the PG stays down although there are enough surviving chunks. I'm not sure if my observations and conclusion are correct; I'd love to have a comment from the developers on this topic. But in this state I would not recommend using the LRC plugin when the resiliency requirement is to sustain the loss of an entire DC.

Thanks,
Eugen

[1] https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/#low-level-plugin-configuration

Zitat von Michel Jouvin:

Hi, I realize that the crushmap I attached to one of my emails, probably required to understand the discussion here, has been stripped by mailman. To avoid polluting the thread with a long output, I put it at https://box.in2p3.fr/index.php/s/J4fcm7orfNE87CX. Download it if you are interested. Best regards, Michel

Le 21/05/2023 à 16:07, Michel Jouvin a écrit :

Hi Eugen, My LRC pool is also somewhat experimental, so nothing really urgent. If you manage to do some tests that help me understand the problem, I remain interested. I propose to keep this thread for that. I shared my crush map in the email you answered, if the attachment was not suppressed by mailman. Cheers, Michel

Sent from my mobile

Le 18 mai 2023 11:19:35 Eugen Block a écrit :

Hi, I don’t have a good explanation for this yet, but I’ll soon get the opportunity to play around with a decommissioned cluster.
I’ll try to get a better understanding of the LRC plugin, but it might take some time, especially since my vacation is coming up. :-) I have some thoughts about the down PGs with failure domain OSD, but I don’t have anything to confirm it yet.

Zitat von Curt:

Hi, I've been following this thread with interest, as it seems like a unique use case to expand my knowledge. I don't use LRC or anything outside basic erasure coding. What is your current crush steps rule? I know you made changes since your first post, and I had some thoughts I wanted to share, but wanted to see your rule first so I could try to visualize the distribution better. The only way I can currently visualize it working is with more servers; I'm thinking 6 or 9 per data center min, but that could be my lack of knowledge of some of the step rules.

Thanks,
Curt

On Tue, May 16, 2023 at 11:09 AM Michel Jouvin < michel.jou...@ijclab.in2p3.fr> wrote:

Hi Eugen, Yes, sure, no problem to share it. I attach it to this email (as it may clutter the discussion if inline). If somebody on the list has some clue about the LRC plugin, I'm still interested in understanding what I'm doing wrong! Cheers, Michel

Le 04/05/2023 à 15:07, Eugen Block a écrit :

Hi, I don't think you've shared your osd tree
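[Editor's note] To make the layer discussion in this thread concrete, here is a toy model of when a set of lost chunks can be repaired. This is an assumption-laden sketch of the idea only, not the plugin's actual decoding logic; the layer strings are the ones from the low-level documentation example cited above.

```python
# Toy model of LRC layer repair (a simplification, not the plugin's logic).
# In a layer string, 'D' = data chunk, 'c' = coding chunk, '_' = unused.
LAYERS = ["_cDD_cDD", "cDDD____", "____cDDD"]  # steps 1..3 from the docs

def layer_can_repair(layer, lost):
    """A layer can rebuild its lost members if the number of lost chunks
    among its positions does not exceed its coding ('c') chunks."""
    positions = {i for i, ch in enumerate(layer) if ch != "_"}
    coding = layer.count("c")
    return len(positions & lost) <= coding

def recoverable(layers, lost):
    """Every lost position must be covered by some layer that can repair."""
    for i in lost:
        covering = [l for l in layers if l[i] != "_"]
        if not any(layer_can_repair(l, lost) for l in covering):
            return False
    return True

# One lost chunk inside a local group: repairable within that group.
print(recoverable(LAYERS, {2}))      # True
# Two losses still within the top layer's tolerance (it has two 'c').
print(recoverable(LAYERS, {5, 6}))   # True
```

Under this naive model the losses Eugen describes would be recoverable via the top layer; whether the real plugin walks back up to step 1 is exactly the open question in the thread.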
[ceph-users] Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)
Hello guys, We have a Ceph cluster that runs just fine with Ceph Octopus; we use RBD for some workloads, RadosGW (via S3) for others, and iSCSI for some Windows clients. Recently, we had the need to add some VMware clusters as clients of the iSCSI GW, and also Windows systems that use Clustered Storage Volumes (CSV), and we are facing a weird situation. In Windows, for instance, the iSCSI block can be mounted, formatted and consumed by all nodes, but when we add it to a CSV it fails with some generic exception. The same happens in VMware: when we try to use it with VMFS, it fails. We have not been able to find the root cause of these errors. However, the errors seem to be linked to multiple nodes consuming the same block via shared file systems. Have you guys seen this before? Are we missing some basic configuration in the iSCSI GW?

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete
Hello Weiwen, Thank you for the response. I've attached the output for all PGs in state incomplete and remapped+incomplete. Thank you!

Thanks,
Jayanth Reddy

On Sat, Jun 17, 2023 at 11:00 PM 胡 玮文 wrote:
> Hi Jayanth,
>
> Can you post the complete output of “ceph pg query”? So that we can
> understand the situation better.
>
> Can you get OSD 3 or 4 back into the cluster? If you are sure they cannot
> rejoin, you may try “ceph osd lost ” (doc says this may result in
> permanent data loss. I didn’t have a chance to try this myself).
>
> Weiwen Hu
>
> > 在 2023年6月18日,00:26,Jayanth Reddy 写道:
> >
> > Hello Nino / Users,
> >
> > After some initial analysis, I had increased max_pg_per_osd to 480, but
> > we're out of luck. Also tried force-backfill and force-repair as well.
> > On querying PG using *# ceph pg ** query* the output says blocked_by
> > 3 to 4 OSDs which are out of the cluster already. Guessing if these have to
> > do something with the recovery.
> >
> > Thanks,
> > Jayanth Reddy
> >
> >> On Sat, Jun 17, 2023 at 12:31 PM Jayanth Reddy < jayanthreddy5...@gmail.com>
> >> wrote:
> >>
> >> Thanks, Nino.
> >>
> >> Would give these initial suggestions a try and let you know at the
> >> earliest.
> >>
> >> Regards,
> >> Jayanth Reddy
> >> --
> >> *From:* Nino Kotur
> >> *Sent:* Saturday, June 17, 2023 12:16:09 PM
> >> *To:* Jayanth Reddy
> >> *Cc:* ceph-users@ceph.io
> >> *Subject:* Re: [ceph-users] EC 8+3 Pool PGs stuck in remapped+incomplete
> >>
> >> The problem is just that some of your OSDs have too many PGs and the pool cannot
> >> recover as it cannot create more PGs:
> >>
> >> [osd.214,osd.223,osd.548,osd.584] have slow ops.
> >> too many PGs per OSD (330 > max 250)
> >>
> >> I'd have to guess that the safest thing would be permanently or
> >> temporarily adding more storage so that PGs drop below 250; another option
> >> is just dropping down the total number of PGs, but I don't know if I would
> >> perform that action before my pool was healthy!
> >> In case there is only one OSD that has this many PGs but all
> >> other OSDs have less than 100-150, then you can just reweight the problematic
> >> OSD so it rebalances those "too many PGs".
> >>
> >> But it looks to me that you have way too many PGs, which is also super
> >> negatively impacting performance.
> >>
> >> Another option is to increase the max allowed PGs per OSD to, say, 350; this
> >> should also allow the cluster to rebuild. Honestly, even though this may be the easiest
> >> option, I'd never do it: the performance cost of having over 150 PGs per OSD
> >> is great.
> >>
> >> kind regards,
> >> Nino
> >>
> >> On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy < jayanthreddy5...@gmail.com>
> >> wrote:
> >>
> >> Hello Users,
> >> Greetings. We've a Ceph Cluster with the version
> >> *ceph version 14.2.5-382-g8881d33957
> >> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
> >>
> >> 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
> >> incomplete+remapped states.
> >> Below are the PGs,
> >>
> >> # ceph pg dump_stuck inactive
> >> ok
> >> PG_STAT  STATE                UP                                             UP_PRIMARY  ACTING                                                                           ACTING_PRIMARY
> >> 15.251e  incomplete           [151,464,146,503,166,41,555,542,9,565,268]     151         [151,464,146,503,166,41,555,542,9,565,268]                                       151
> >> 15.3f3   incomplete           [584,281,672,699,199,224,239,430,355,504,196]  584         [584,281,672,699,199,224,239,430,355,504,196]                                    584
> >> 15.985   remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]   396         [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]  214
> >> 15.39d3  remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]   404         [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]         223
> >> 15.d46   remapped+incomplete  [297,646,212,254,110,169,500,372,623,470,678]  297         [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]        548
> >>
> >> Some of the OSDs had gone down on the cluster. Below is the # ceph status
> >>
> >> # ceph -s
> >>   cluster:
> >>     id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
> >>     health: HEALTH_WARN
> >>             noscrub,nodeep-scrub flag(s) set
> >>             1 pools have many more objects per pg than average
> >>             Reduced data availability: 5 pgs inactive, 5 pgs incomplete
> >>             Degraded data redundancy: 44798/8718528059 objects degraded
> >>             (0.001%), 1 pg degraded, 1 pg undersized
> >>             22726 pgs not deep-scrubbed in time
> >>             23552 pgs not scrubbed in time
> >>             77 slow ops, oldest one blocked for 56400 sec, daemons
> >>             [osd.214,osd.223,osd.548,osd.584] have slow ops.
> >>             too many PGs per OSD (330 > max 250)
> >>
> >>   services:
> >>     mon: 3 daemons, quorum
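[Editor's note] For readers puzzled by the 2147483647 entries in the acting sets above: that value (0x7fffffff) is the CRUSH "no OSD" placeholder, meaning the shard currently has no OSD assigned. A minimal hedged sketch of counting those missing shards against the pool's m=3 tolerance:

```python
# 2147483647 = CRUSH_ITEM_NONE, i.e. "no OSD mapped for this EC shard".
NONE = 2147483647

def missing_shards(acting):
    """Count EC shards with no OSD assigned in an acting set."""
    return sum(1 for osd in acting if osd == NONE)

# Acting set of pg 15.985 from the "ceph pg dump_stuck" output above:
acting_15_985 = [NONE, NONE, NONE, 214, 319, NONE, 546, 91, 599, NONE, 352]
k, m = 8, 3  # the thread's 8+3 EC profile

print(missing_shards(acting_15_985))       # 5
print(missing_shards(acting_15_985) <= m)  # False -> too few shards to go active
```

With 5 of 11 shards unmapped, fewer than k=8 shards survive, which is consistent with the PG staying incomplete until the blocking OSDs return or are marked lost.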
[ceph-users] Critical Information: DELL/Toshiba SSDs dying after 70,000 hours of operation
Hello, This message does not concern Ceph itself, but a hardware vulnerability which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.

The DELL/Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a vulnerability which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity. This topic has been discussed here: https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-communication-on-the-same-date/td-p/8353438

The risk is all the greater since these disks may die at the same time in the same server, leading to the loss of all data in the server. To date, DELL has not provided any firmware fixing this vulnerability, the latest firmware version being "A3B3" released on Sept. 12, 2016: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k

If you have servers running these drives, check their uptime. If they are close to the 70,000 hour limit, replace them immediately. The smartctl tool does not report the uptime for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their uptime, which should be about the same as the SSDs'. The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device ID on the MegaRAID controller).

We have informed DELL about this but have no information yet on the arrival of a fix. We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive a full shutdown and restart of the machine (power off then power on in iDRAC), but they may also die during a single reboot (init 6) or even while the machine is running.
Fujitsu released a corrective firmware in June 2021, but this firmware is most certainly not applicable to DELL drives: https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf

Regards,
Frederic

Sous-direction Infrastructure and Services
Direction du Numérique
Université de Lorraine
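[Editor's note] If you want to script the uptime check Frederic describes, a hedged sketch follows. The Power_On_Hours attribute name and the sample line are assumptions about typical "smartctl -a" output for an ATA HDD in the same chassis; as noted above, the affected SAS SSDs themselves may not report it.

```python
# Hedged helper: estimate how close a drive is to the reported ~70,000
# power-on-hour failure point, from a smartctl text dump.
import re

FAILURE_HOURS = 70_000  # the hour count at which the drives reportedly die

def power_on_hours(smartctl_output):
    """Pull the raw Power_On_Hours value out of 'smartctl -a' text,
    or return None if the attribute is absent."""
    m = re.search(r"Power_On_Hours.*?(\d+)\s*$", smartctl_output, re.MULTILINE)
    return int(m.group(1)) if m else None

# Illustrative SMART attribute line (layout varies between drive models):
sample = ("  9 Power_On_Hours          0x0032   001   001   000    "
          "Old_age   Always       -       68200")

hours = power_on_hours(sample)
print(hours, FAILURE_HOURS - hours)  # 68200 1800 -> roughly 75 days left
```

A drive within a few thousand hours of the limit would be a candidate for immediate replacement per the advice above.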
[ceph-users] Re: OpenStack (cinder) volumes retyping on Ceph back-end
Hi, I don't quite understand the issue yet, maybe you can clarify.

> If I perform a "change volume type" from OpenStack on volumes attached to the VMs the system successfully migrates the volume from the source pool to the destination pool and at the end of the process the volume is visible in the new pool and is removed from the old pool.

Are these volumes root disks or just additional volumes? But apparently, the retype works.

> The problem encountered is that when reconfiguring the VM, to specify the new pool associated with the volumes (performed through a resize of the VM, I haven't found any other method to change the information on the nova/cinder db automatically.

If the retype already works, then what is your goal with "reconfiguring the VM"? What information is wrong in the DB? This part needs some clarification for me. Can you give some examples?

> The VM after the retype continues to work perfectly in RW but the "new" volume created in the new pool is not used to write data and consequently when the VM is shut down all the changes are lost.

Just wondering, did you shut down the VM before retyping the volume? I'll try to reproduce this in a test cluster.

Regards,
Eugen

Zitat von andrea.mar...@oscct.it:

Hello, I configured different back-end storage on OpenStack (Yoga release) using Ceph (ceph version 17.2.4) with different pools (volumes, cloud-basic, shared-hosting-os, shared-hosting-homes, ...) for the RBD application. I created different volume types towards each of the backends and everything works perfectly. If I perform a "change volume type" from OpenStack on volumes attached to the VMs, the system successfully migrates the volume from the source pool to the destination pool, and at the end of the process the volume is visible in the new pool and is removed from the old pool.
The problem encountered is that when reconfiguring the VM to specify the new pool associated with the volumes (performed through a resize of the VM; I haven't found any other method to change the information in the nova/cinder DB automatically. I also did some tests with a shut-off of the VM, modification of the XML through virsh edit, and startup of the VM), the volume presented to the VM is exactly the version and content as of the retype date of the volume itself. All data written and modified after the retype is lost. The VM after the retype continues to work perfectly in RW, but the "new" volume created in the new pool is not used to write data, and consequently when the VM is shut down all the changes are lost. Do you have any idea how to carry out a check, and possibly how to proceed in order not to lose the data of the VMs whose volumes I have retyped? The data is written somewhere because the VMs work perfectly. Thank you
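[Editor's note] One way to check which image the running VM is actually writing to is to compare the libvirt domain XML against cinder's view of the volume. A hedged sketch that extracts the pool/image from a <disk> element as produced by "virsh dumpxml"; the sample XML below is illustrative, not taken from this thread:

```python
# Parse the RBD pool and image name out of a libvirt <disk> element.
import xml.etree.ElementTree as ET

disk_xml = """
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='volumes/volume-1234'>
    <host name='192.0.2.10' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>
"""

def rbd_pool_and_image(xml_text):
    """Return (pool, image) from the <source name='pool/image'> attribute."""
    src = ET.fromstring(xml_text).find("source")
    pool, image = src.get("name").split("/", 1)
    return pool, image

print(rbd_pool_and_image(disk_xml))  # ('volumes', 'volume-1234')
```

If the pool returned here still names the old pool after the retype, the guest is writing to the old image, which would explain the "lost" changes described above.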
[ceph-users] Re: same OSD in multiple CRUSH hierarchies
Hi, I don't think this is going to work. Each OSD belongs to a specific host, and you can't have multiple buckets (e.g. bucket type "host") with the same name in the crush tree. But if I understand your requirement correctly, there should be no need to do it this way. If you structure your crush tree according to your separation requirements and the critical pools use designated rules, you can still have a rule that doesn't care about the data separation but distributes the replicas across the available hosts (given your failure domain would be "host"), which is already the default for the replicated_rule. Did I misunderstand something?

Regards,
Eugen

Zitat von Budai Laszlo:

Hi there, I'm curious if there is anything against configuring an OSD to be part of multiple CRUSH hierarchies. I'm thinking of the following scenario: I want to create pools that use distinct sets of OSDs, to make sure that a piece of data which is replicated at application level will not end up on the same OSD. So I would create multiple CRUSH hierarchies (root - host - osd), each using different OSDs, and different rules that use those hierarchies. Then I would create pools with the different rules, and use those different pools for storing the data of the different application instances. But I would also like to use the OSDs in the "default hierarchy" set up by Ceph, where all the hosts are in the same root bucket, with the default replicated rule, so my generic data volumes would be able to spread across all the OSDs available. Is there something against this setup? Thank you for any advice!

Laszlo
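[Editor's note] The "designated rules" approach Eugen suggests can be sketched in decompiled-crushmap text form. This is a hedged illustration only: the root and rule names (root-app-a, app_a_rule) are made up, and the real map must define the corresponding root and host buckets.

```
# Hypothetical crushmap excerpt: a dedicated root holding a subset of
# hosts, and a rule that places replicas only under that root.
rule app_a_rule {
    id 10
    type replicated
    step take root-app-a
    step chooseleaf firstn 0 type host
    step emit
}

# The stock default rule keeps spreading generic pools across all hosts:
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
```

A pool would then be pinned with something like "ceph osd pool set app-a-pool crush_rule app_a_rule". Eugen's caveat still applies: each OSD can sit under only one host bucket, so the separation comes from which hosts you place under each root, not from duplicating OSD entries.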
[ceph-users] autoscaling not working and active+remapped+backfilling
Hi, I have a problem with Ceph 17.2.6 (CephFS with MDS daemons) and see unusual behavior. I created a data pool with the default CRUSH rule, but data is stored on only 3 specific OSDs while the other OSDs stay empty. PG auto-scaling is also active, but the PG count does not change as the pool grows. I changed it manually, but the problem was not solved and I got the warning that PGs are not balanced across OSDs. How do I solve this problem? Is this a bug? I did not have this problem in previous versions.

Update: I partly solved this problem. There were several identical CRUSH rules, all with "step chooseleaf firstn 0 type host". I think this confused the balancer and the autoscaler, and the output of "ceph osd pool autoscale-status" was empty. After removing the extra CRUSH rules the autoscaler runs, but moving data from the full OSDs to the empty ones is slow. To prioritize the use of the other OSDs I tried reducing the weight of the filled OSDs with "ceph osd reweight-by-utilization"; I hope this works. Is there a way to make autoscaling and the cleanup of placement groups faster?
---
[root@opcsdfpsbpp0201 ~]# ceph osd crush rule dump
[
    {
        "rule_id": 0,
        "rule_name": "replicated_rule",
        "type": 1,
        "steps": [
            { "op": "take", "item": -1, "item_name": "default" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "r3-host",
        "type": 1,
        "steps": [
            { "op": "take", "item": -2, "item_name": "default~hdd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    },
    {
        "rule_id": 2,
        "rule_name": "r3",
        "type": 1,
        "steps": [
            { "op": "take", "item": -2, "item_name": "default~hdd" },
            { "op": "chooseleaf_firstn", "num": 0, "type": "host" },
            { "op": "emit" }
        ]
    }
]

# ceph osd status | grep back
23  opcsdfpsbpp0211  1900G  147G  0 0  0 0  backfillfull,exists,up
48  opcsdfpsbpp0201  1900G  147G  0 0  0 0  backfillfull,exists,up
61  opcsdfpsbpp0205  1900G  147G  0 0  0 0  backfillfull,exists,up

--
Every 2.0s: ceph -s    opcsdfpsbpp0201: Sun Jun 18 11:44:29 2023

  cluster:
    id: 79a2627c-0821-11ee-a494-00505695c58c
    health: HEALTH_WARN
            3 backfillfull osd(s)
            6 pool(s) backfillfull

  services:
    mon: 3 daemons, quorum opcsdfpsbpp0201,opcsdfpsbpp0205,opcsdfpsbpp0203 (age 6d)
    mgr: opcsdfpsbpp0201.vttwxa(active, since 5d), standbys: opcsdfpsbpp0205.tpodbs, opcsdfpsbpp0203.jwjkcl
    mds: 1/1 daemons up, 2 standby
    osd: 74 osds: 74 up (since 7d), 74 in (since 7d); 107 remapped pgs

  data:
    volumes: 1/1 healthy
    pools: 6 pools, 359 pgs
    objects: 599.64k objects, 2.2 TiB
    usage: 8.1 TiB used, 140 TiB / 148 TiB avail
    pgs: 923085/1798926 objects misplaced (51.313%)
         252 active+clean
         87 active+remapped+backfill_wait
         20 active+remapped+backfilling

  io:
    client: 255 B/s rd, 0 op/s rd, 0 op/s wr
    recovery: 33 MiB/s, 8 objects/s

  progress:
    Global Recovery Event (5h)
      [===.] (remaining: 2h)
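[Editor's note] On the duplicate-rule point above: in the "ceph osd crush rule dump" output, "r3-host" and "r3" have byte-for-byte identical steps. A small hedged sketch that flags such duplicates; the JSON below is abridged from the output above, and in practice you would feed it the full dump.

```python
# Group CRUSH rules by their serialized steps to spot identical rules.
import json
from collections import defaultdict

rule_dump = json.loads("""
[
 {"rule_id": 1, "rule_name": "r3-host", "type": 1,
  "steps": [{"op": "take", "item": -2, "item_name": "default~hdd"},
            {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
            {"op": "emit"}]},
 {"rule_id": 2, "rule_name": "r3", "type": 1,
  "steps": [{"op": "take", "item": -2, "item_name": "default~hdd"},
            {"op": "chooseleaf_firstn", "num": 0, "type": "host"},
            {"op": "emit"}]}
]
""")

def duplicate_rules(rules):
    """Return lists of rule names whose steps are identical."""
    groups = defaultdict(list)
    for r in rules:
        groups[json.dumps(r["steps"], sort_keys=True)].append(r["rule_name"])
    return [names for names in groups.values() if len(names) > 1]

print(duplicate_rules(rule_dump))  # [['r3-host', 'r3']]
```

Whether such duplicates actually break the autoscaler is the poster's hypothesis, not an established fact; the script only makes them visible.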
[ceph-users] cephfs mount with kernel driver
I noticed that in my scenario, when I mount CephFS via the kernel module, writes go directly to only one or three of the OSDs, and the client's write speed is higher than the speed of replication and auto-scaling. This causes write operations to stop with a "no free space" error as soon as those OSDs are full. What should be done to solve this problem? Is there a way to increase the speed of scaling or moving objects between OSDs? Or is there a way to mount CephFS that does not have these problems?