[ceph-users] Re: Clients failing to advance oldest client?
You can use the "ceph health detail" command to see which clients are not responding. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Clients failing to advance oldest client?
Ok! Thank you. Is there a way to tell which client is slow?

> On Mar 25, 2024, at 9:06 PM, David Yang wrote:
>
> It is recommended to disconnect the client first and then observe
> whether the cluster's slow requests recover.
>
> Erich Weiler wrote on Tue, 26 Mar 2024 at 05:02:
>>
>> Hi Y'all,
>>
>> I'm seeing this warning via 'ceph -s' (this is on Reef):
>>
>> # ceph -s
>>   cluster:
>>     id:     58bde08a-d7ed-11ee-9098-506b4b4da440
>>     health: HEALTH_WARN
>>             3 clients failing to advance oldest client/flush tid
>>             1 MDSs report slow requests
>>             1 MDSs behind on trimming
>>
>>   services:
>>     mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)
>>     mgr: pr-md-01.jemmdf (active, since 3w), standbys: pr-md-02.emffhz
>>     mds: 1/1 daemons up, 1 standby
>>     osd: 46 osds: 46 up (since 3d), 46 in (since 2w)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   4 pools, 1313 pgs
>>     objects: 258.13M objects, 454 TiB
>>     usage:   688 TiB used, 441 TiB / 1.1 PiB avail
>>     pgs:     1303 active+clean
>>              8    active+clean+scrubbing
>>              2    active+clean+scrubbing+deep
>>
>>   io:
>>     client: 131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr
>>
>> I googled around and looked at the docs and it seems like this isn't a
>> critical problem, but I couldn't find a clear path to resolution. Does
>> anyone have any advice on what I can do to resolve the health issues up top?
>>
>> My CephFS filesystem is incredibly busy so I have a feeling that has
>> some impact here, but not 100% sure...
>>
>> Thanks as always for the help!
>>
>> cheers,
>> erich
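To go one step further than `ceph health detail`, the MDS sessions can be dumped as JSON and filtered. A minimal sketch, assuming `ceph tell mds.0 session ls` output carries a `num_completed_requests`-style counter per session (the field names and the 100000 default for mds_max_completed_requests are assumptions to verify against your cluster):

```python
import json

# Sample data in the shape of `ceph tell mds.0 session ls` output.
# The field names here are assumptions for illustration; check the
# actual JSON your MDS version emits.
SESSIONS = json.loads("""
[
  {"id": 4305, "client_metadata": {"hostname": "node-a"},
   "num_completed_requests": 100001, "num_completed_flushes": 2},
  {"id": 4410, "client_metadata": {"hostname": "node-b"},
   "num_completed_requests": 12, "num_completed_flushes": 1}
]
""")

# mds_max_completed_requests defaults to 100000 (assumed); a client
# holding more completed requests than this is the one the MDS flags
# as failing to advance its oldest client/flush tid.
THRESHOLD = 100000

def laggy_clients(sessions, threshold=THRESHOLD):
    """Return the session ids whose completed-request list is over the cap."""
    return [s["id"] for s in sessions
            if s.get("num_completed_requests", 0) > threshold]

print(laggy_clients(SESSIONS))  # -> [4305]
```

Once a session id is known, `client_metadata.hostname` in the same dump tells you which machine to look at.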
[ceph-users] Re: Clients failing to advance oldest client?
It is recommended to disconnect the client first and then observe whether the cluster's slow requests recover.

Erich Weiler wrote on Tue, 26 Mar 2024 at 05:02:
>
> Hi Y'all,
>
> I'm seeing this warning via 'ceph -s' (this is on Reef):
>
> # ceph -s
>   cluster:
>     id:     58bde08a-d7ed-11ee-9098-506b4b4da440
>     health: HEALTH_WARN
>             3 clients failing to advance oldest client/flush tid
>             1 MDSs report slow requests
>             1 MDSs behind on trimming
>
>   services:
>     mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d)
>     mgr: pr-md-01.jemmdf (active, since 3w), standbys: pr-md-02.emffhz
>     mds: 1/1 daemons up, 1 standby
>     osd: 46 osds: 46 up (since 3d), 46 in (since 2w)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   4 pools, 1313 pgs
>     objects: 258.13M objects, 454 TiB
>     usage:   688 TiB used, 441 TiB / 1.1 PiB avail
>     pgs:     1303 active+clean
>              8    active+clean+scrubbing
>              2    active+clean+scrubbing+deep
>
>   io:
>     client: 131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr
>
> I googled around and looked at the docs and it seems like this isn't a
> critical problem, but I couldn't find a clear path to resolution. Does
> anyone have any advice on what I can do to resolve the health issues up top?
>
> My CephFS filesystem is incredibly busy so I have a feeling that has
> some impact here, but not 100% sure...
>
> Thanks as always for the help!
>
> cheers,
> erich
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
> "complexity, OMG!!!111!!!" is not enough of a statement. You have to explain
> what complexity you gain and what complexity you reduce.
> Installing SeaweedFS consists of the following: `cd seaweedfs/weed && make install`
> This is the type of problem that Ceph is trying to solve, and starting a
> discussion by saying that everything is bad, without providing any helpful
> message is useless FUD.
>
> Max

Max stop eating seaweed
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On 25-03-2024 23:07, Kai Stian Olstad wrote:
> On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:
>> On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
>>> My tally came to 412 out of 539 OSDs showing up in a blocked_by list
>>> and that is about every OSD with data prior to adding ~100 empty OSDs.
>>>
>>> How 400 read targets and 100 write targets can only equal ~60 backfills
>>> with osd_max_backfill set at 3 just makes no sense to me but alas. It
>>> seems I can just increase osd_max_backfill even further to get the
>>> numbers I want so that will do. Thank you all for taking the time to
>>> look at this.
>>
>> It's a huge change and 42% of your data needs to be moved. And this move
>> is not only to the new OSDs but also between the existing OSDs, but they
>> are busy with backfilling so they have no free backfill reservation.
>>
>> I do recommend this document by Joshua Baergen at Digital Ocean that
>> explains backfilling and the problem with it and their solution, a tool
>> called pgremapper.
>
> Forgot the link
> https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf

Thanks again, seems the explanation for the low number of concurrent backfills is then simply that backfill_wait can hold partial reservations.

Mvh.
Torkil

-- 
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance
DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
[ceph-users] put bucket notification configuration - access denied
Hello everyone, we are facing a problem with the S3 operation "put bucket notification configuration". We are using Ceph version 17.2.6. We are trying to configure buckets in our cluster so that a notification message is sent via the amqps protocol when the content of a bucket changes. To do so, we created a local RGW user with "special" capabilities and wrote ad hoc policies for this user (list all buckets, read access to all buckets, and the ability to add, list and delete bucket notification configurations). The problem concerns the configuration of all buckets except the one this user owns: when doing this put bucket notification configuration as a cross-account operation we get an access denied error. I suspect this problem is related to the version we are using, because when we were doing tests on another cluster running version 18.2.1 we did not face this problem. Can you confirm my hypothesis? Thanks, GM.
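For reference, this is roughly the call we are making - a hedged boto3 sketch, where the endpoint, bucket name and topic ARN are placeholders (on RGW the topic ARN comes from a prior SNS CreateTopic call):

```python
# Hedged sketch of PutBucketNotificationConfiguration via boto3.
# All names/URLs below are placeholders, not values from our cluster.
notification = {
    "TopicConfigurations": [{
        "Id": "notify-on-change",
        # Hypothetical ARN returned by an earlier SNS CreateTopic on RGW:
        "TopicArn": "arn:aws:sns:default::my-amqp-topic",
        "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
    }]
}

def put_notification(bucket, config):
    """PUT the notification config; fails with AccessDenied cross-account on 17.2.6."""
    import boto3  # requires boto3 and RGW credentials for the notification user
    s3 = boto3.client("s3", endpoint_url="https://rgw.example.com")
    s3.put_bucket_notification_configuration(
        Bucket=bucket, NotificationConfiguration=config)

if __name__ == "__main__":
    put_notification("some-other-users-bucket", notification)
```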
[ceph-users] ceph RGW reply "ERROR: S3 error: 404 (NoSuchKey)" but rgw object metadata exist
Hi,

My Ceph cluster has 9 nodes for the Ceph Object Store. Recently I experienced data loss: `s3cmd get xxx` replies 404 (NoSuchKey), yet I can still get the metadata with `s3cmd ls xxx`. The RGW object is above 1 GB and consists of many multipart objects. Running `rados -p default.rgw.buckets.data stat <object>` shows that only the head object remains; all of the multipart and shadow parts are gone. The bucket only sees write and read operations, no deletes, and it has no lifecycle policy. I found a similar problem in https://tracker.ceph.com/issues/47866 that was fixed in v16.0.0. Maybe this is a new data loss problem, which is very serious for us.

ceph version: 16.2.5

# command info:
s3cmd ls s3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz
2024-03-13 09:27  1208269953  s3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz

s3cmd get s3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz
download: 's3://solr-scrapy.commoncrawl-warc/batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz' -> './CC-MAIN-20200118052321-20200118080321-00547.warc.gz' [1 of 1]
ERROR: Download of './CC-MAIN-20200118052321-20200118080321-00547.warc.gz' failed (Reason: 404 (NoSuchKey))
ERROR: S3 error: 404 (NoSuchKey)

# head exists and size is 0; multipart and shadow parts are lost
rados -p default.rgw.buckets.data stat df8c0fe6-01c8-4c07-b310-2d102356c004.76248.1__multipart_batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz.2~C2M72EJLHrNe_fnHnifS4N7pw70hVmE.1
error stat-ing eck6m2.rgw.buckets.data/df8c0fe6-01c8-4c07-b310-2d102356c004.76248.1__multipart_batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz.2~C2M72EJLHrNe_fnHnifS4N7pw70hVmE.1: (2) No such file or directory

thanks.
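For anyone who wants to probe the same way: the per-part RADOS names can be generated from the pattern visible in the failing stat command above (marker__multipart_key.uploadId.partNum). Treat the pattern as an illustration read off this output, not a specification:

```python
# Generate candidate RADOS object names for an RGW multipart upload so each
# part can be checked with `rados -p default.rgw.buckets.data stat <name>`.
# Marker, key and upload id are taken verbatim from the output above.
MARKER = "df8c0fe6-01c8-4c07-b310-2d102356c004.76248.1"   # bucket marker id
KEY = "batch_2024031314/Scrapy/main/CC-MAIN-20200118052321-20200118080321-00547.warc.gz"
UPLOAD_ID = "2~C2M72EJLHrNe_fnHnifS4N7pw70hVmE"

def multipart_name(marker, key, upload_id, part):
    """Name of part <part> following the pattern seen in the stat output."""
    return f"{marker}__multipart_{key}.{upload_id}.{part}"

# Probe the first few parts; the real part count follows from the object size.
for part in range(1, 4):
    print(multipart_name(MARKER, KEY, UPLOAD_ID, part))
```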
[ceph-users] Quincy/Dashboard: Object Gateway not accessible after applying self-signed cert to rgw service
Hi, I am running a Ceph cluster and configured RGW for S3, initially without SSL. The service worked nicely, and I then updated the service to use SSL certs signed by our own CA, just as I had already done for the dashboard itself. However, as soon as I applied the new config, the dashboard was no longer able to access and display the service, while the service itself still works, now using the supplied SSL certificate. The error shown is: "Error 500 - The server encountered an unexpected condition which prevented it from fulfilling the request." My guess is that the dashboard for some reason doesn't like the certificate the RGW service is providing, despite the fact that the dashboard itself is using one from the same CA. Any hints on how to make the dashboard display the Object Gateway again?
[ceph-users] Re: Mounting A RBD Via Kernal Modules
Hi,

March 24, 2024 at 8:19 AM, "duluxoz" wrote:
> Hi,
>
> Yeah, I've been testing various configurations since I sent my last
> email - all to no avail.
>
> So I'm back to the start with a brand new 4T image which is rbdmapped
> to /dev/rbd0.
>
> It's not formatted (yet) and so not mounted.
>
> Every time I attempt a mkfs.xfs /dev/rbd0 (or mkfs.xfs
> /dev/rbd/my_pool/my_image) I get the errors I previously mentioned and
> the resulting image then becomes unusable (in every sense of the word).
>
> If I run fdisk -l (before trying the mkfs.xfs) the rbd image shows up
> in the list - no, I don't actually do a full fdisk on the image.
>
> An rbd info on my_pool/my_image shows the same expected values on both
> the host and the ceph cluster.
>
> I've tried this with a whole bunch of different sized images from 100G
> to 4T and all fail in exactly the same way. (My previous successful
> 100G test I haven't been able to reproduce.)
>
> I've also tried all of the above using an "admin" CephX account - I can
> always connect via rbdmap, but as soon as I try an mkfs.xfs it fails.
> The failure also occurs with mkfs.ext4 as well (all image sizes).
>
> The Ceph cluster is good (self reported, and there are other hosts
> happily connected via CephFS) and this host also has a CephFS mapping
> which is working.
>
> Between running experiments I've gone over the Ceph doco (again) and I
> can't work out what's going wrong.
>
> There's also nothing obvious/helpful jumping out at me from the
> logs/journal (sample below):
>
> ~~~
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 524773 0~65536 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 524772 65536~4128768 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: blk_print_req_error: 119 callbacks suppressed
> Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector 4298932352 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 524774 0~65536 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write at objno 524773 65536~4128768 result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: rbd: rbd0: write result -1
> Mar 24 17:38:29 my_host.my_net.local kernel: I/O error, dev rbd0, sector 4298940544 op 0x1:(WRITE) flags 0x4000 phys_seg 1024 prio class 2
> ~~~
>
> Any ideas what I should be looking at?

Could you please share the command you've used to create the RBD?

Cheers,
Alwin
[ceph-users] Linux Laptop Losing CephFS mounts on Sleep/Hibernate
Hi All,

So I've got a Ceph Reef cluster (latest version) with a CephFS filesystem set up with a number of directories on it. On a laptop (running Rocky Linux, latest version) I've used fstab to mount a number of those directories - all good, everything works, happy happy joy joy! :-)

However, when the laptop goes into sleep or hibernate mode (ie when I close the lid) and then comes back out of sleep/hibernate (ie I open the lid), the CephFS mounts are "not present". The only way to get them back is to run `mount -a` as either root or via sudo. This, as I'm sure you'll agree, is less than ideal - especially as this is a pilot project for non-admin users (ie they won't have access to the root account or sudo on their own (corporate) laptops).

So, my question to the combined wisdom of the community is: what's the best way to resolve this issue? I've looked at autofs, and even tried (half-heartedly - it was late, and I wanted to go home :-) ) to get it running, but I'm not sure if this is the best way to resolve things.

All help and advice on this greatly appreciated - thanks in advance.

Cheers
Dulux-Oz
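One autofs-free option worth trying is systemd's automount support directly in fstab: with `noauto,x-systemd.automount` the mount point is re-triggered on first access after resume, so no root or sudo is needed after a sleep/wake cycle. A hedged sketch (monitor names, CephX user, paths and the secret file are placeholders to adapt):

```
# /etc/fstab - sketch only; mon addresses, user and paths are placeholders.
# noauto + x-systemd.automount: systemd mounts on first access instead of at
# boot, and idle-timeout unmounts it again after 60s of inactivity, so a
# stale post-resume session is simply replaced on the next access.
mon1,mon2,mon3:/shared  /mnt/shared  ceph  name=laptopuser,secretfile=/etc/ceph/laptopuser.secret,noauto,x-systemd.automount,x-systemd.idle-timeout=60,_netdev  0  0
```

After editing fstab, `systemctl daemon-reload` picks up the generated automount unit.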
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On 25-03-2024 22:58, Kai Stian Olstad wrote:
> On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
>> My tally came to 412 out of 539 OSDs showing up in a blocked_by list
>> and that is about every OSD with data prior to adding ~100 empty OSDs.
>>
>> How 400 read targets and 100 write targets can only equal ~60 backfills
>> with osd_max_backfill set at 3 just makes no sense to me but alas. It
>> seems I can just increase osd_max_backfill even further to get the
>> numbers I want so that will do. Thank you all for taking the time to
>> look at this.
>
> It's a huge change and 42% of your data needs to be moved. And this move
> is not only to the new OSDs but also between the existing OSDs, but they
> are busy with backfilling so they have no free backfill reservation.

If I have 60 backfills going on that would be 60 read reservations and 60 write reservations, if I understand it correctly. The only way I can see that getting stuck at 60 backfills with osd_max_backfill = 3 is for those 60 reservations to be tied up on 20 OSDs being the only ones either read from or written to, and all other OSDs waiting on those.

> I do recommend this document by Joshua Baergen at Digital Ocean that
> explains backfilling and the problem with it and their solution, a tool
> called pgremapper.

Thanks, I'll take a look at that =)

Mvh.
Torkil

-- 
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance
DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark
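The reasoning above can be sanity-checked with trivial arithmetic: if concurrency is capped at 60 while each OSD grants at most 3 reservations, then at least 20 distinct OSDs must be fully booked on one side (all reads or all writes):

```python
import math

# Per-OSD reservation cap: each running backfill holds one local (primary)
# and one remote (target) reservation, each counted against osd_max_backfill.
osd_max_backfill = 3
observed_backfills = 60

# Minimum number of distinct OSDs that must be saturated on one side
# to cap concurrency at the observed level:
bottleneck_osds = math.ceil(observed_backfills / osd_max_backfill)
print(bottleneck_osds)  # -> 20
```

Which matches the "tied up on 20 OSDs" conclusion: everything else in the blocked_by list is queued behind those.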
[ceph-users] Mounting A RBD Image via Kernal Modules
Hi All,

I'm looking for a bit of advice on the subject of this post. I've been "staring at the trees so long I can't see the forest any more". :-)

Rocky Linux client, latest version. Ceph Reef, latest version. I have read *all* the doco on the Ceph website.

I have created a pool (my_pool) and an image (my_image). I have activated the pool for RBD. I can run the `rbdmap map` command on the client and the image shows up as /dev/rbd0 (and also /dev/rbd/my_pool/my_image).

But here's where I'm running into issues - and I'm pretty sure it's a 'Level 8' issue, so it'll be something simple that I'm just not "getting":

Do I need to run `mkfs` on /dev/rbd0 before I try `mount /dev/rbd/my_pool/my_image /mnt/rbd_image`? The reason I ask is that I've tried to mount the image before running 'mkfs' and I get back `mount: /mnt/rbd_image: wrong fs type, bad option, bad superblock on /dev/rbd0, missing codepage or helper program, or other error`. I've also tried to mount the image after running 'mkfs' and I get back `mount: /mnt/rbd_image: can't read superblock on /dev/rbd0`.

Basically, as I've said, I'm missing or don't understand *something* about this process - which is why I'm now seeking the collective wisdom of the community.

All help and advice greatly appreciated - thanks in advance.

Cheers
Dulux-Oz
[ceph-users] Re: Why a lot of pgs are degraded after host(+osd) restarted?
I understood the mechanism better through your answer. I'm using erasure coding and the backfilling step took quite a long time :( If there was just a lot of PG peering, I think that would be reasonable, but I was curious why there was a lot of backfill_wait instead of peering (e.g. "pg 9.5a is stuck undersized for 39h, current state active+undersized+degraded+remapped+backfill_wait"). Let me know if you have tips to increase the performance of backfill or to prevent unnecessary backfill. Thank you for your answer.

Joshua Baergen wrote:
> Hi Jaemin,
>
> It is normal for PGs to become degraded during a host reboot, since a
> copy of the data was taken offline and needs to be resynchronized
> after the host comes back. Normally this is quick, as the recovery
> mechanism only needs to modify those objects that have changed while
> the host is down.
>
> However, if you have backfills ongoing and reboot a host that contains
> OSDs involved in those backfills, then those backfills become
> degraded, and you will need to wait for them to complete for
> degradation to clear. Do you know if you had backfills at the time the
> host was rebooted? If so, the way to avoid this is to wait for
> backfill to complete before taking any OSDs/hosts down for
> maintenance.
>
> Josh
[ceph-users] Cephadm host keeps trying to set osd_memory_target to less than minimum
I have a virtual ceph cluster running 17.2.6 with 4 Ubuntu 22.04 hosts in it, each with 4 OSDs attached. The first 2 servers hosting mgrs have 32 GB of RAM each, and the remaining have 24 GB. For some reason I am unable to identify, the first host in the cluster appears to constantly be trying to set the osd_memory_target variable to roughly half of the calculated minimum for the cluster. I see the following spamming the logs constantly:

Unable to set osd_memory_target on my-ceph01 to 480485376: error parsing value: Value '480485376' is below minimum 939524096

The default is set to 4294967296. I did double check, and osd_memory_base (805306368) + osd_memory_cache_min (134217728) adds up to the minimum exactly. osd_memory_target_autotune is currently enabled, but I cannot for the life of me figure out how it is arriving at 480485376 as a value for that particular host, which even has the most RAM. Neither the cluster nor the host is approaching max memory utilization, so it's not like there are processes competing for resources.
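The floor quoted in the error message can be reproduced from the two options mentioned above; a quick check, with all values taken from the log line itself:

```python
# The minimum allowed osd_memory_target is osd_memory_base plus
# osd_memory_cache_min; the values below are from the error message.
osd_memory_base = 805306368        # 768 MiB
osd_memory_cache_min = 134217728   # 128 MiB
minimum = osd_memory_base + osd_memory_cache_min

attempted = 480485376              # what the autotuner keeps trying to set

print(minimum)              # -> 939524096, matching "below minimum 939524096"
print(attempted < minimum)  # -> True: the tuned value is about half the floor
```

So the rejection itself is self-consistent; the open question is only why the autotuner computes such a low per-OSD share on that host.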
[ceph-users] #1359 (update) Ceph filesystem failure | Ceph filesystem probleem
## Update 2024-03-19

Somewhat good news; we are now copying the RDR data out of CephFS. We have been able to mount the filesystem again with help from 42on, our support party. Copying to temporary storage is going OK, but we can run into problematic metadata at any time, so fingers crossed. Once the data has been copied out of CephFS onto temporary storage, we can devise and implement further solutions for the future.

NB: contact postmaster (mailto:postmas...@science.ru.nl) if you have an urgent request for a *small* set of files in a specific location, so we can restore it with priority. A petabyte of data takes weeks/months to copy, but a small dataset (< 1 TB) can be retrieved relatively fast.

source (Dutch): https://cncz.science.ru.nl/nl/cpk/1359
source (English): https://cncz.science.ru.nl/en/cpk/1359

--
Postmaster: Simon Oosthoek
Postmaster Phone: +31 24 365 3535
Personal Phone: +31 24 365 2097
[ceph-users] S3 Partial Reads from Erasure Pool
I am dealing with a cluster that is having terrible performance with partial reads from an erasure-coded pool. Warp tests and s3bench tests show acceptable performance, but when the application hits the data, performance plummets. Can anyone clear this up for me: when radosgw gets a partial read, does it have to assemble all the rados objects that make up the S3 object before returning the range? With a replicated pool I am seeing 6 to 7 GiB/s of read performance and only 1 GiB/s from the erasure-coded pool, which leads me to believe that the replicated pool is returning just the rados objects needed for the partial S3 object and the erasure-coded pool is not.
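Whether RGW must assemble whole rados objects for a range GET depends on version and layout, but part of the EC read penalty is easy to picture: inside a single rados object, a byte range fans out across multiple data shards as soon as it crosses stripe units. A hedged sketch (k and the stripe unit here are made-up illustration values, not taken from this cluster, and real EC reads add reconstruction work on top):

```python
# Map a byte range of a logical EC-pool object to the data shards that
# hold it, assuming k data chunks striped round-robin in fixed units.
K = 4               # data chunks in a hypothetical k=4 EC profile
STRIPE_UNIT = 4096  # bytes per chunk per stripe (illustrative)

def shards_for_range(offset, length, k=K, unit=STRIPE_UNIT):
    """Return the set of data-shard indices a contiguous range touches."""
    first = offset // unit
    last = (offset + length - 1) // unit
    if last - first + 1 >= k:
        return set(range(k))            # range spans a full stripe: all shards
    return {i % k for i in range(first, last + 1)}

print(shards_for_range(0, 100))       # tiny read stays on one shard
print(shards_for_range(0, 5 * 4096))  # a larger read touches all 4 shards
```

So even a modest range ends up as reads on several OSDs, whereas on a replicated pool the same range is a single contiguous read from one copy.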
[ceph-users] Ceph Dashboard Clear Cache
Hello Ceph members, How do I clear the Ceph dashboard cache? Kindly guide me on how to do this. Thanks
[ceph-users] Re: Upgrading from Pacific to Quincy fails with "Unexpected error"
We were having this same error; after some troubleshooting it turned out that the 17.2.7 cephadm orchestrator's SSH client was choking on the keyboard-interactive AuthenticationMethod (which is really PAM). Our sshd configuration was:

AuthenticationMethods keyboard-interactive publickey,keyboard-interactive gssapi-with-mic,keyboard-interactive

Thus, cephadm was trying to use "publickey,keyboard-interactive"; publickey would succeed, but the cephadm SSH client would close the connection as soon as a follow-up keyboard-interactive method was attempted. Adding this to sshd_config for each orchestrator seemed to fix it, by using only the publickey AuthenticationMethod for just the cephadm orchestrators while keeping the standard config for everybody else:

Match Address 
    # For some reason the 17.2.7 cephadm orchestrator
    # chokes on keyboard-interactive (PAM)
    # AuthenticationMethod; thus, exclude it.
    AuthenticationMethods publickey
    PermitRootLogin yes

Match Address 
    # For some reason the 17.2.7 cephadm orchestrator
    # chokes on keyboard-interactive (PAM)
    # AuthenticationMethod; thus, exclude it.
    AuthenticationMethods publickey
    PermitRootLogin yes

Cheers.
[ceph-users] Ceph-Cluster integration with Ovirt-Cluster
Hi Guys,

I have a running ovirt-4.3 cluster with 1 manager and 4 hypervisor nodes, using traditional SAN storage connected via iSCSI, where I can create VMs and assign storage from the SAN. This has been running fine for a decade, but now I want to move from traditional SAN storage to a Ceph storage cluster due to slow speed and scalability issues with the storage. I am very much a newbie with Ceph, but I was able to install a Ceph cluster (Reef v18) in my lab environment with 5 VMs (2 mons with managers and 3 OSDs) on Debian 12. I have also installed ovirt 4.5 with 1 hypervisor on VMs on CentOS Stream 9 and want to integrate it with the Ceph cluster, so that I can use Ceph storage the way I am using the SAN with ovirt today. I have tried some googling and read the Ceph documentation but did not find anything about this. How can I do it?

If you need any other information, please let me know!

Many Thanks!
PJ111288
[ceph-users] Re: MANY_OBJECT_PER_PG on 1 pool which is cephfs_metadata
Dear Eugen, Sorry, I forgot to update the case. I have upgraded to the latest pacific release, 16.2.15, and I have done the necessary changes for pg_num :) Thanks for the follow-up on this. Topic can be closed.
[ceph-users] Adding new OSD's - slow_ops and other issues.
Hi.

We have a cluster that has been working very nicely since it was put up more than a year ago. Now we needed to add more NVMe drives to expand.

After setting all the "no" flags, we added them using $ ceph orch osd add

The twist is that we have managed to get the default weights set to 1 for all disks, not 7.68 (the default for the ceph orch command). Thus we did a subsequent reweight to change the weight, and then removed the "no" flags. As a consequence we had a bunch of OSDs delivering slow_ops, and after manually restarting OSDs to get rid of them, the system returned to normal.

... second try ...

Same drill, but somehow the ceph orch command failed to bring the new OSD online before we ran the reweight command ... and it worked flawlessly.

... third try ...

Same drill, but now ceph orch brought the new OSD into the system, and we saw exactly the same problem again. Being a bit wiser, we forcefully restarted the new OSD and everything went back into normal mode again.

Thus it seems like the "reweight" command on online OSDs has a bad effect on our setup, causing major service disruption.

1) Is it possible to "bulk" change default weights on all OSDs without a huge data movement going on?
2) Is it possible to instruct "ceph orch osd add" to set the default weight before it puts the new OSD into the system?

I would not expect the above to be expected behaviour - if someone has ideas about what is going on beyond the above, please share.

Setup:
# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
43 7.68 TB NVMe's over 12 OSD hosts - all connected using 2x 100GbitE

Thanks
Jesper
[ceph-users] Re: PG damaged "failed_repair"
Hi,

Sorry for the broken formatting. Here are the outputs again.

ceph osd df:

ID  CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS
 3  hdd    1.81879       0       0 B      0 B       0 B      0 B      0 B      0 B       0     0    0  down
12  hdd    1.81879     1.0   1.8 TiB  385 GiB   383 GiB  6.7 MiB  1.4 GiB  1.4 TiB   20.66  1.73   18  up
13  hdd    1.81879     1.0   1.8 TiB  422 GiB   421 GiB  5.8 MiB  1.3 GiB  1.4 TiB   22.67  1.90   17  up
15  hdd    1.81879     1.0   1.8 TiB  264 GiB   263 GiB  4.6 MiB  1.1 GiB  1.6 TiB   14.17  1.19   14  up
16  hdd    9.09520     1.0   9.1 TiB  1.0 TiB  1023 GiB  8.8 MiB  2.6 GiB  8.1 TiB   11.01  0.92   65  up
17  hdd    1.81879     1.0   1.8 TiB  319 GiB   318 GiB  6.1 MiB  1.0 GiB  1.5 TiB   17.13  1.43   15  up
 1  hdd    5.45749     1.0   5.5 TiB  546 GiB   544 GiB  7.8 MiB  1.4 GiB  4.9 TiB    9.76  0.82   29  up
 4  hdd    5.45749     1.0   5.5 TiB  801 GiB   799 GiB  8.3 MiB  2.4 GiB  4.7 TiB   14.34  1.20   44  up
 8  hdd    5.45749     1.0   5.5 TiB  708 GiB   706 GiB  9.7 MiB  2.1 GiB  4.8 TiB   12.67  1.06   36  up
11  hdd    5.45749       0       0 B      0 B       0 B      0 B      0 B      0 B       0     0    0  down
14  hdd    1.81879     1.0   1.8 TiB  200 GiB   198 GiB  3.8 MiB  1.3 GiB  1.6 TiB   10.71  0.90   10  up
 0  hdd    9.09520       0       0 B      0 B       0 B      0 B      0 B      0 B       0     0    0  down
 5  hdd    9.09520     1.0   9.1 TiB  859 GiB   857 GiB   17 MiB  2.1 GiB  8.3 TiB    9.23  0.77   46  up
 9  hdd    9.09520     1.0   9.1 TiB  924 GiB   922 GiB   11 MiB  2.3 GiB  8.2 TiB    9.92  0.83   55  up
                      TOTAL    53 TiB  6.3 TiB   6.3 TiB   90 MiB   19 GiB   46 TiB   11.95
MIN/MAX VAR: 0.77/1.90  STDDEV: 4.74

ceph osd pool ls detail:

pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 32 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9327 lfor 0/0/104 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9018 lfor 0/0/104 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 9149 lfor 0/0/106 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 5 'polyphoto_backup' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 372 lfor 0/0/362 flags hashpspool,selfmanaged_snaps stripe_width 0 compression_algorithm snappy compression_mode aggressive application rbd

The error seems to come from a software error in Ceph. I see this error in the logs: "FAILED ceph_assert(clone_overlap.count(clone))"

Thanks,
Romain Lebbadi-Breteau
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
Dear Nico,

do you think it is sensible, and is it a precise statement, to say that "we can't reduce complexity by adding a layer of complexity"? Containers always add a so-called layer, but people keep using them, and in some cases they offload complexity from another side. Claiming "complexity", without examining the details, is pure FUD. "complexity, OMG!!!111!!!" is not enough of a statement. You have to explain what complexity you gain and what complexity you reduce.

Installing SeaweedFS consists of the following: `cd seaweedfs/weed && make install`

This is the type of problem that Ceph is trying to solve, and starting a discussion by saying that everything is bad, without providing any helpful message, is useless FUD.

Max
[ceph-users] Re: Upgarde from 16.2.1 to 16.2.2 pacific stuck
Dear Eugen, Thanks again for the help. We managed to upgrade to a minor release, 16.2.3; next week we will upgrade to the latest 16.2.15. You were right about the number of managers, which was blocking the update. Thanks again for the help. Topic solved. Best Regards.
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On Mon, Mar 25, 2024 at 10:58:24PM +0100, Kai Stian Olstad wrote:
> On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote:
>> My tally came to 412 out of 539 OSDs showing up in a blocked_by list
>> and that is about every OSD with data prior to adding ~100 empty OSDs.
>>
>> How 400 read targets and 100 write targets can only equal ~60 backfills
>> with osd_max_backfill set at 3 just makes no sense to me but alas. It
>> seems I can just increase osd_max_backfill even further to get the
>> numbers I want so that will do. Thank you all for taking the time to
>> look at this.
>
> It's a huge change and 42% of your data needs to be moved. And this move
> is not only to the new OSDs but also between the existing OSDs, but they
> are busy with backfilling so they have no free backfill reservation.
>
> I do recommend this document by Joshua Baergen at Digital Ocean that
> explains backfilling and the problem with it and their solution, a tool
> called pgremapper.

Forgot the link
https://ceph.io/assets/pdfs/user_dev_meeting_2023_10_19_joshua_baergen.pdf

-- 
Kai Stian Olstad
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On Mon, Mar 25, 2024 at 09:28:01PM +0100, Torkil Svensgaard wrote: My tally came to 412 out of 539 OSDs showing up in a blocked_by list, and that is about every OSD with data prior to adding ~100 empty OSDs. How 400 read targets and 100 write targets can only equal ~60 backfills with osd_max_backfill set at 3 just makes no sense to me, but alas. It seems I can just increase osd_max_backfill even further to get the numbers I want, so that will do. Thank you all for taking the time to look at this. It's a huge change and 42% of your data needs to be moved. And this move is not only to the new OSDs but also between the existing OSDs, but they are busy with backfilling so they have no free backfill reservation. I do recommend this document by Joshua Baergen at DigitalOcean that explains backfilling, the problems with it, and their solution, a tool called pgremapper. -- Kai Stian Olstad
[ceph-users] quincy-> reef upgrade non-cephadm
Hi, I am upgrading my test cluster from 17.2.6 (quincy) to 18.2.2 (reef). As it was an rpm install, I am following the directions here: Reef — Ceph Documentation. The upgrade worked, but I have some observations and questions before I move to my production cluster: 1. I see no systemd units with the fsid in them, as described in the document above. Both before and after the upgrade, my mon and other units are: ceph-mon@.service, ceph-osd@[N].service, etc. Should I be concerned? 2. Does order matter? Based on past upgrades, I do not think so, but I wanted to be sure. For example, can I update mon/mds/radosgw/mgrs first, then afterwards update the osds? This is what I have done in previous updates and all was well. 3. Again on order: if a server serves, say, a mon and an mds, I can't really update one without the other, based on shared libraries and such. It appears that is ok, based on my test cluster, but I wanted to be sure. Again, if an mds is one of the servers to update, I know I have to update the remaining one after max_mds is set to 1 and the others are stopped, first. 4. After the upgrade of my mgr node I get: "Module [several module names] has missing NOTIFY_TYPES member" in ceph-mgr..log, but the mgr starts up eventually. The system is Rocky Linux 8.9. Thanks for any thoughts -Chris
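For reference, a hedged outline of the order the non-cephadm upgrade docs describe (mon → mgr → osd → mds → rgw). The fs name "cephfs", the `dnf` invocation, and the max_mds values are placeholders; this is a sketch to adapt, not a tested procedure:

```shell
# Run per host, one host at a time, waiting for HEALTH_OK in between.
ceph osd set noout                      # optional: avoid rebalancing churn

dnf update -y ceph                      # pull the 18.2.x packages on this host
systemctl restart ceph-mon.target       # on mon hosts first
systemctl restart ceph-mgr.target       # then mgr hosts
systemctl restart ceph-osd.target       # then OSD hosts

# MDS hosts: drop to a single active rank before restarting any MDS
ceph fs set cephfs max_mds 1
systemctl restart ceph-mds.target       # upgrade standbys first, active last
ceph fs set cephfs max_mds 2            # restore your previous value

systemctl restart ceph-radosgw.target   # rgw last
ceph osd unset noout
ceph versions                           # confirm all daemons report 18.2.x
```

On a host running both a mon and an mds (question 3), the packages update together but the units can still be restarted in the order above.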
[ceph-users] Clients failing to advance oldest client?
Hi Y'all, I'm seeing this warning via 'ceph -s' (this is on Reef): # ceph -s cluster: id: 58bde08a-d7ed-11ee-9098-506b4b4da440 health: HEALTH_WARN 3 clients failing to advance oldest client/flush tid 1 MDSs report slow requests 1 MDSs behind on trimming services: mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 3d) mgr: pr-md-01.jemmdf (active, since 3w), standbys: pr-md-02.emffhz mds: 1/1 daemons up, 1 standby osd: 46 osds: 46 up (since 3d), 46 in (since 2w) data: volumes: 1/1 healthy pools: 4 pools, 1313 pgs objects: 258.13M objects, 454 TiB usage: 688 TiB used, 441 TiB / 1.1 PiB avail pgs: 1303 active+clean 8 active+clean+scrubbing 2 active+clean+scrubbing+deep io: client: 131 MiB/s rd, 111 MiB/s wr, 41 op/s rd, 613 op/s wr I googled around and looked at the docs and it seems like this isn't a critical problem, but I couldn't find a clear path to resolution. Does anyone have any advice on what I can do to resolve the health issues up top? My CephFS filesystem is incredibly busy so I have a feeling that has some impact here, but not 100% sure... Thanks as always for the help! cheers, erich
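As the replies in this thread note, "ceph health detail" names the offending sessions; the per-session detail comes from `ceph tell mds.<name> session ls -f json`. A hedged sketch of triaging that JSON (the field names and the 100000 threshold, from the mds_max_completed_requests default, are assumptions to verify against your release):

```python
# Sketch: flag sessions whose completed-request/flush tables have grown
# past the MDS warning threshold. Field names are assumptions based on a
# typical MDS session dump; check your own `session ls` output first.
import json

sample = json.loads("""
[
  {"id": 4324, "client_metadata": {"hostname": "node-a"},
   "num_completed_requests": 100001, "num_completed_flushes": 12},
  {"id": 9871, "client_metadata": {"hostname": "node-b"},
   "num_completed_requests": 35, "num_completed_flushes": 7}
]
""")

THRESHOLD = 100000  # assumed mds_max_completed_requests default

def laggy_sessions(sessions, threshold=THRESHOLD):
    """Return (session id, hostname) for clients that look like they are
    failing to advance the oldest client/flush tid."""
    return [(s["id"], s["client_metadata"].get("hostname", "?"))
            for s in sessions
            if s.get("num_completed_requests", 0) > threshold
            or s.get("num_completed_flushes", 0) > threshold]

print(laggy_sessions(sample))  # [(4324, 'node-a')]
```

Once identified, the usual next step (per the later replies) is to evict or remount the laggy client and watch whether the MDS slow requests and trimming warnings clear.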
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
Neither downing nor restarting the OSD cleared the bogus blocked_by. I guess it makes no sense to look further at blocked_by as the cause when the data can't be trusted and there is no obvious smoking gun like a few OSDs blocking everything. My tally came to 412 out of 539 OSDs showing up in a blocked_by list, and that is about every OSD with data prior to adding ~100 empty OSDs. How 400 read targets and 100 write targets can only equal ~60 backfills with osd_max_backfill set at 3 just makes no sense to me, but alas. It seems I can just increase osd_max_backfill even further to get the numbers I want, so that will do. Thank you all for taking the time to look at this. Mvh. Torkil On 25-03-2024 20:44, Anthony D'Atri wrote: First try "ceph osd down 89" On Mar 25, 2024, at 15:37, Alexander E. Patrakov wrote: On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard wrote: On 24/03/2024 01:14, Torkil Svensgaard wrote: On 24-03-2024 00:31, Alexander E. Patrakov wrote: Hi Torkil, Hi Alexander Thanks for the update. Even though the improvement is small, it is still an improvement, consistent with the osd_max_backfills value, and it proves that there are still unsolved peering issues. I have looked at both the old and the new state of the PG, but could not find anything else interesting. I also looked again at the state of PG 37.1. It is known what blocks the backfill of this PG; please search for "blocked_by." However, this is just one data point, which is insufficient for any conclusions. Try looking at other PGs. Is there anything too common in the non-empty "blocked_by" blocks? I'll take a look at that tomorrow, perhaps we can script something meaningful. 
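The kind of tally Torkil describes can be sketched as follows (the dump layout here is an assumption; check the actual structure of `ceph pg dump --format json` on your cluster before relying on it):

```python
# Sketch: count how often each OSD appears in a PG's blocked_by list,
# given a pg dump in JSON form (here: an embedded sample standing in
# for `ceph pg dump --format json` output).
import json
from collections import Counter

sample = json.loads("""
{"pg_map": {"pg_stats": [
  {"pgid": "37.0", "blocked_by": [89, 237]},
  {"pgid": "37.1", "blocked_by": [89]},
  {"pgid": "38.a", "blocked_by": []}
]}}
""")

def blocked_by_tally(dump):
    """Return a Counter mapping OSD id -> number of PGs it blocks."""
    tally = Counter()
    for pg in dump["pg_map"]["pg_stats"]:
        tally.update(pg.get("blocked_by", []))
    return tally

print(blocked_by_tally(sample).most_common())  # [(89, 2), (237, 1)]
```

A tally dominated by a handful of OSDs would support the "one OSD blocks them all" theory; a tally spread across ~412 OSDs, as reported here, points instead at the reservation bottleneck discussed earlier in the thread.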
Hi Alexander While working on a script querying all PGs and making a list of all OSDs found in a blocked_by list, and how many times for each, I discovered something odd about pool 38: " [root@lazy blocked_by]# sh blocked_by.sh 38 | tee pool38 OSDs blocking other OSDs: All PGs in the pool are active+clean so why are there any blocked_by at all? One example attached. I don't know. In any case, it doesn't match the "one OSD blocks them all" scenario that I was looking for. I think this is something bogus that can probably be cleared in your example by restarting osd.89 (i.e., the one being blocked). Mvh. Torkil I think we have to look for patterns in other ways, too. One tool that produces good visualizations is TheJJ balancer. Although it is called a "balancer," it can also visualize the ongoing backfills. The tool is available at https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py Run it as follows: ./placementoptimizer.py showremapped --by-osd | tee remapped.txt Output attached. Thanks again. Mvh. Torkil On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard wrote: Hi Alex New query output attached after restarting both OSDs. OSD 237 is no longer mentioned but it unfortunately made no difference for the number of backfills which went 59->62->62. Mvh. Torkil On 23-03-2024 22:26, Alexander E. Patrakov wrote: Hi Torkil, I have looked at the files that you attached. They were helpful: pool 11 is problematic, it complains about degraded objects for no obvious reason. I think that is the blocker. I also noted that you mentioned peering problems, and I suspect that they are not completely resolved. As a somewhat-irrational move, to confirm this theory, you can restart osd.237 (it is mentioned at the end of query.11.fff.txt, although I don't understand why it is there) and then osd.298 (it is the primary for that pg) and see if any additional backfills are unblocked after that. Also, please re-query that PG again after the OSD restart. 
On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard wrote: On 23-03-2024 21:19, Alexander E. Patrakov wrote: Hi Torkil, Hi Alexander I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage. Thank you for taking the time =) What happens if you increase the osd_max_backfills setting temporarily? We already had the mclock override option in place and I re-enabled our babysitter script which sets osd_max_backfills per OSD to 1-3 depending on how full they are. Active backfills went from 16 to 53, which is probably because the default osd_max_backfills for mclock is 1. I think 53 is still a low number of active backfills given the large percentage misplaced. It may be a good idea to investigate a few of the stalled PGs. Please run commands similar to this one: ceph pg 37.0 query > query.37.0.txt ceph pg 37.1 query > query.37.1.txt ... and the same for the other affected pools. A few samples attached. Still, I must say that some of your rules are actually unsafe. The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When a datacenter is offline, you
[ceph-users] mark direct Zabbix support deprecated? Re: Ceph versus Zabbix: failure: no data sent
Well, at least on my RHEL Ceph cluster, turns out zabbix-sender, zabbix-agent, etc. aren't in the container image. Doesn't explain why it didn't work with the Debian/proxmox version, but *shrug*. It appears there is no interest in adding them back in, per: https://github.com/ceph/ceph-container/issues/1651 As such, may I recommend updating the Ceph documentation to this effect? Possibly referring to Zabbix instructions with Agent 2? On Fri, Mar 22, 2024 at 7:04 PM John Jasen wrote: > If the documentation is to be believed, it's just install the zabbix > sender, then; > > ceph mgr module enable zabbix > > ceph zabbix config-set zabbix_host my-zabbix-server > > (Optional) Set the identifier to the fsid. > > And poof. I should now have a discovered entity on my zabbix server to add > templates to. > > However, this has not worked yet on either of my ceph clusters (one RHEL, > one proxmox). > > Reference: https://docs.ceph.com/en/latest/mgr/zabbix/ > > On Reddit advice, I installed the Ceph templates for Zabbix. > https://raw.githubusercontent.com/ceph/ceph/master/src/pybind/mgr/zabbix/zabbix_template.xml > > Still no dice. No traffic at all seems to be generated, that I've seen > from packet traces, > > ... OK. > > I su'ed to the ceph user on both clusters, and ran zabbix_send: > > zabbix_sender -v -z 10.0.0.1 -s "$my_fsid" -k ceph.osd_avg_pgs -o 1 > > Response from "10.0.0.1:10051": "processed: 1; failed: 0; total: 1; > seconds spent: 0.42" > > sent: 1; skipped: 0; total: 1 > > As the ceph user, ceph zabbix send/discovery still fail. > > I am officially stumped. > > Any ideas as to which tree I should be barking up? > > Thanks in advance! > > > >
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
First try "ceph osd down 89" > On Mar 25, 2024, at 15:37, Alexander E. Patrakov wrote: > > On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard wrote: >> >> >> >> On 24/03/2024 01:14, Torkil Svensgaard wrote: >>> On 24-03-2024 00:31, Alexander E. Patrakov wrote: Hi Torkil, >>> >>> Hi Alexander >>> Thanks for the update. Even though the improvement is small, it is still an improvement, consistent with the osd_max_backfills value, and it proves that there are still unsolved peering issues. I have looked at both the old and the new state of the PG, but could not find anything else interesting. I also looked again at the state of PG 37.1. It is known what blocks the backfill of this PG; please search for "blocked_by." However, this is just one data point, which is insufficient for any conclusions. Try looking at other PGs. Is there anything too common in the non-empty "blocked_by" blocks? >>> >>> I'll take a look at that tomorrow, perhaps we can script something >>> meaningful. >> >> Hi Alexander >> >> While working on a script querying all PGs and making a list of all OSDs >> found in a blocked_by list, and how many times for each, I discovered >> something odd about pool 38: >> >> " >> [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38 >> OSDs blocking other OSDs: > > >> All PGs in the pool are active+clean so why are there any blocked_by at >> all? One example attached. > > I don't know. In any case, it doesn't match the "one OSD blocks them > all" scenario that I was looking for. I think this is something bogus > that can probably be cleared in your example by restarting osd.89 > (i.e, the one being blocked). > >> >> Mvh. >> >> Torkil >> I think we have to look for patterns in other ways, too. One tool that produces good visualizations is TheJJ balancer. Although it is called a "balancer," it can also visualize the ongoing backfills. 
The tool is available at https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py Run it as follows: ./placementoptimizer.py showremapped --by-osd | tee remapped.txt >>> >>> Output attached. >>> >>> Thanks again. >>> >>> Mvh. >>> >>> Torkil >>> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard wrote: > > Hi Alex > > New query output attached after restarting both OSDs. OSD 237 is no > longer mentioned but it unfortunately made no difference for the number > of backfills which went 59->62->62. > > Mvh. > > Torkil > > On 23-03-2024 22:26, Alexander E. Patrakov wrote: >> Hi Torkil, >> >> I have looked at the files that you attached. They were helpful: pool >> 11 is problematic, it complains about degraded objects for no obvious >> reason. I think that is the blocker. >> >> I also noted that you mentioned peering problems, and I suspect that >> they are not completely resolved. As a somewhat-irrational move, to >> confirm this theory, you can restart osd.237 (it is mentioned at the >> end of query.11.fff.txt, although I don't understand why it is there) >> and then osd.298 (it is the primary for that pg) and see if any >> additional backfills are unblocked after that. Also, please re-query >> that PG again after the OSD restart. >> >> On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard >> wrote: >>> >>> >>> >>> On 23-03-2024 21:19, Alexander E. Patrakov wrote: Hi Torkil, >>> >>> Hi Alexander >>> I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage. >>> >>> Thank you for taking the time =) >>> What happens if you increase the osd_max_backfills setting temporarily? >>> >>> We already had the mclock override option in place and I re-enabled >>> our >>> babysitter script which sets osd_max_backfills pr OSD to 1-3 depending >>> on how full they are. Active backfills went from 16 to 53 which is >>> probably because default osd_max_backfills for mclock is 1. 
>>> >>> I think 53 is still a low number of active backfills given the large >>> percentage misplaced. >>> It may be a good idea to investigate a few of the stalled PGs. Please run commands similar to this one: ceph pg 37.0 query > query.37.0.txt ceph pg 37.1 query > query.37.1.txt ... and the same for the other affected pools. >>> >>> A few samples attached. >>> Still, I must say that some of your rules are actually unsafe. The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When
[ceph-users] Re: Call for Interest: Managed SMB Protocol Support
On Monday, March 25, 2024 3:22:26 PM EDT Alexander E. Patrakov wrote: > On Mon, Mar 25, 2024 at 11:01 PM John Mulligan > > wrote: > > On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote: > > > Hi John, > > > > > > > A few major features we have planned include: > > > > * Standalone servers (internally defined users/groups) > > > > > > No concerns here > > > > > > > * Active Directory Domain Member Servers > > > > > > In the second case, what is the plan regarding UID mapping? Is NFS > > > coexistence planned, or a concurrent mount of the same directory using > > > CephFS directly? > > > > In the immediate future the plan is to have a very simple, fairly > > "opinionated" idmapping scheme based on the autorid backend. > > OK, the docs for clustered SAMBA do mention the autorid backend in > examples. It's a shame that the manual page does not explicitly list > it as compatible with clustered setups. > > However, please consider that the majority of Linux distributions > (tested: CentOS, Fedora, Alt Linux, Ubuntu, OpenSUSE) use "realmd" to > join AD domains by default (where "default" means a pointy-clicky way > in a workstation setup), which uses SSSD, and therefore, by this > opinionated choice of the autorid backend, you create mappings that > disagree with the supposed majority and the default. This will create > problems in the future when you do consider NFS coexistence. > Thanks, I'll keep that in mind. > Well, it's a different topic that most organizations that I have seen > seem to ignore this default. Maybe those that don't have any problems > don't have any reason to talk to me? I think that more research is > needed here on whether RedHat's and GNOME's push of SSSD is something > not-ready or indeed the de-facto standard setup. > I think it's a bit of a mix, but am not sure either. 
> Even if you don't want to use SSSD, providing an option to provision a > few domains with idmap rid backend with statically configured ranges > (as an override to autorid) would be a good step forward, as this can > be made compatible with the default RedHat setup. That's reasonable. Thanks for the suggestion. > > > Sharing the same directories over both NFS and SMB at the same time, also > > known as "multi-protocol", is not planned for now, however we're all aware > > that there's often a demand for this feature and we're aware of the > > complexity it brings. I expect we'll work on that at some point but not > > initially. Similarly, sharing the same directories over a SMB share and > > directly on a cephfs mount won't be blocked but we won't recommend it. > > OK. Feature request: in the case if there are several CephFS > filesystems, support configuration of which one to serve. > Putting it on the list. > > > In fact, I am quite skeptical, because, at least in my experience, > > > every customer's SAMBA configuration as a domain member is a unique > > > snowflake, and cephadm would need an ability to specify arbitrary UID > > > mapping configuration to match what the customer uses elsewhere - and > > > the match must be precise. > > > > I agree - our initial use case is something along the lines: > > Users of a Ceph Cluster that have Windows systems, Mac systems, or > > appliances that are joined to an existing AD > > but are not currently interoperating with the Ceph cluster. > > > > I expect to add some idpapping configuration and agility down the line, > > especially supporting some form of rfc2307 idmapping (where unix IDs are > > stored in AD). > > Yes, for whatever reason, people do this, even though it is cumbersome > to manage. > > > But those who already have idmapping schemes and samba accessing ceph will > > probably need to just continue using the existing setups as we don't have > > an immediate plan for migrating those users. 
> > > > > Here is what I have seen or was told about: > > > > > > 1. We don't care about interoperability with NFS or CephFS, so we just > > > let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2" > > > idmap backend. It's completely OK that workstations get different UIDs > > > and GIDs, as only SIDs traverse the wire. > > > > This is pretty close to our initial plan but I'm not clear why you'd think > > that "workstations get different UIDs and GIDs". For all systems acessing > > the (same) ceph cluster the id mapping should be consistent. > > You did make me consider multi-cluster use cases with something like > > cephfs > > volume mirroring - that's something that I hadn't thought of before *but* > > using an algorithmic mapping backend like autorid (and testing) I think > > we're mostly OK there. > > The tdb2 backend (used in my example) is not algorithmic, it is > allocating. That is, it sequentially allocates IDs on the > first-seen-first-allocated basis. Yet this is what this customer uses, > presumably because it is the only backend that explicitly specifies > clustering operation in its manual page. > >
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On Mon, Mar 25, 2024 at 7:37 PM Torkil Svensgaard wrote: > > > > On 24/03/2024 01:14, Torkil Svensgaard wrote: > > On 24-03-2024 00:31, Alexander E. Patrakov wrote: > >> Hi Torkil, > > > > Hi Alexander > > > >> Thanks for the update. Even though the improvement is small, it is > >> still an improvement, consistent with the osd_max_backfills value, and > >> it proves that there are still unsolved peering issues. > >> > >> I have looked at both the old and the new state of the PG, but could > >> not find anything else interesting. > >> > >> I also looked again at the state of PG 37.1. It is known what blocks > >> the backfill of this PG; please search for "blocked_by." However, this > >> is just one data point, which is insufficient for any conclusions. Try > >> looking at other PGs. Is there anything too common in the non-empty > >> "blocked_by" blocks? > > > > I'll take a look at that tomorrow, perhaps we can script something > > meaningful. > > Hi Alexander > > While working on a script querying all PGs and making a list of all OSDs > found in a blocked_by list, and how many times for each, I discovered > something odd about pool 38: > > " > [root@lazy blocked_by]# sh blocked_by.sh 38 |tee pool38 > OSDs blocking other OSDs: > All PGs in the pool are active+clean so why are there any blocked_by at > all? One example attached. I don't know. In any case, it doesn't match the "one OSD blocks them all" scenario that I was looking for. I think this is something bogus that can probably be cleared in your example by restarting osd.89 (i.e, the one being blocked). > > Mvh. > > Torkil > > >> I think we have to look for patterns in other ways, too. One tool that > >> produces good visualizations is TheJJ balancer. Although it is called > >> a "balancer," it can also visualize the ongoing backfills. 
> >> > >> The tool is available at > >> https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py > >> > >> Run it as follows: > >> > >> ./placementoptimizer.py showremapped --by-osd | tee remapped.txt > > > > Output attached. > > > > Thanks again. > > > > Mvh. > > > > Torkil > > > >> On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard > >> wrote: > >>> > >>> Hi Alex > >>> > >>> New query output attached after restarting both OSDs. OSD 237 is no > >>> longer mentioned but it unfortunately made no difference for the number > >>> of backfills which went 59->62->62. > >>> > >>> Mvh. > >>> > >>> Torkil > >>> > >>> On 23-03-2024 22:26, Alexander E. Patrakov wrote: > Hi Torkil, > > I have looked at the files that you attached. They were helpful: pool > 11 is problematic, it complains about degraded objects for no obvious > reason. I think that is the blocker. > > I also noted that you mentioned peering problems, and I suspect that > they are not completely resolved. As a somewhat-irrational move, to > confirm this theory, you can restart osd.237 (it is mentioned at the > end of query.11.fff.txt, although I don't understand why it is there) > and then osd.298 (it is the primary for that pg) and see if any > additional backfills are unblocked after that. Also, please re-query > that PG again after the OSD restart. > > On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard > wrote: > > > > > > > > On 23-03-2024 21:19, Alexander E. Patrakov wrote: > >> Hi Torkil, > > > > Hi Alexander > > > >> I have looked at the CRUSH rules, and the equivalent rules work on my > >> test cluster. So this cannot be the cause of the blockage. > > > > Thank you for taking the time =) > > > >> What happens if you increase the osd_max_backfills setting > >> temporarily? > > > > We already had the mclock override option in place and I re-enabled > > our > > babysitter script which sets osd_max_backfills pr OSD to 1-3 depending > > on how full they are. 
Active backfills went from 16 to 53 which is > > probably because default osd_max_backfills for mclock is 1. > > > > I think 53 is still a low number of active backfills given the large > > percentage misplaced. > > > >> It may be a good idea to investigate a few of the stalled PGs. Please > >> run commands similar to this one: > >> > >> ceph pg 37.0 query > query.37.0.txt > >> ceph pg 37.1 query > query.37.1.txt > >> ... > >> and the same for the other affected pools. > > > > A few samples attached. > > > >> Still, I must say that some of your rules are actually unsafe. > >> > >> The 4+2 rule as used by rbd_ec_data will not survive a > >> datacenter-offline incident. Namely, for each PG, it chooses OSDs > >> from > >> two hosts in each datacenter, so 6 OSDs total. When a datacenter is > >> offline, you will, therefore, have only 4 OSDs up, which is exactly > >> the number of data chunks. However, the pool requires min_size 5, so >
[ceph-users] Re: Call for Interest: Managed SMB Protocol Support
On Mon, Mar 25, 2024 at 11:01 PM John Mulligan wrote: > > On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote: > > Hi John, > > > > > A few major features we have planned include: > > > * Standalone servers (internally defined users/groups) > > > > No concerns here > > > > > * Active Directory Domain Member Servers > > > > In the second case, what is the plan regarding UID mapping? Is NFS > > coexistence planned, or a concurrent mount of the same directory using > > CephFS directly? > > In the immediate future the plan is to have a very simple, fairly > "opinionated" idmapping scheme based on the autorid backend. OK, the docs for clustered SAMBA do mention the autorid backend in examples. It's a shame that the manual page does not explicitly list it as compatible with clustered setups. However, please consider that the majority of Linux distributions (tested: CentOS, Fedora, Alt Linux, Ubuntu, OpenSUSE) use "realmd" to join AD domains by default (where "default" means a pointy-clicky way in a workstation setup), which uses SSSD, and therefore, by this opinionated choice of the autorid backend, you create mappings that disagree with the supposed majority and the default. This will create problems in the future when you do consider NFS coexistence. Well, it's a different topic that most organizations that I have seen seem to ignore this default. Maybe those that don't have any problems don't have any reason to talk to me? I think that more research is needed here on whether RedHat's and GNOME's push of SSSD is something not-ready or indeed the de-facto standard setup. Even if you don't want to use SSSD, providing an option to provision a few domains with idmap rid backend with statically configured ranges (as an override to autorid) would be a good step forward, as this can be made compatible with the default RedHat setup. 
> Sharing the same directories over both NFS and SMB at the same time, also > known as "multi-protocol", is not planned for now, however we're all aware > that there's often a demand for this feature and we're aware of the complexity > it brings. I expect we'll work on that at some point but not initially. > Similarly, sharing the same directories over a SMB share and directly on a > cephfs mount won't be blocked but we won't recommend it. OK. Feature request: in the case if there are several CephFS filesystems, support configuration of which one to serve. > > > > > In fact, I am quite skeptical, because, at least in my experience, > > every customer's SAMBA configuration as a domain member is a unique > > snowflake, and cephadm would need an ability to specify arbitrary UID > > mapping configuration to match what the customer uses elsewhere - and > > the match must be precise. > > > > I agree - our initial use case is something along the lines: > Users of a Ceph Cluster that have Windows systems, Mac systems, or appliances > that are joined to an existing AD > but are not currently interoperating with the Ceph cluster. > > I expect to add some idpapping configuration and agility down the line, > especially supporting some form of rfc2307 idmapping (where unix IDs are > stored in AD). Yes, for whatever reason, people do this, even though it is cumbersome to manage. > > But those who already have idmapping schemes and samba accessing ceph will > probably need to just continue using the existing setups as we don't have an > immediate plan for migrating those users. > > > Here is what I have seen or was told about: > > > > 1. We don't care about interoperability with NFS or CephFS, so we just > > let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2" > > idmap backend. It's completely OK that workstations get different UIDs > > and GIDs, as only SIDs traverse the wire. 
> > This is pretty close to our initial plan but I'm not clear why you'd think > that "workstations get different UIDs and GIDs". For all systems acessing the > (same) ceph cluster the id mapping should be consistent. > You did make me consider multi-cluster use cases with something like cephfs > volume mirroring - that's something that I hadn't thought of before *but* > using an algorithmic mapping backend like autorid (and testing) I think we're > mostly OK there. The tdb2 backend (used in my example) is not algorithmic, it is allocating. That is, it sequentially allocates IDs on the first-seen-first-allocated basis. Yet this is what this customer uses, presumably because it is the only backend that explicitly specifies clustering operation in its manual page. And the "autorid" backend is also not fully algorithmic, it allocates ranges to domains on the same sequential basis (see https://github.com/samba-team/samba/blob/6fb98f70c6274e172787c8d5f73aa93920171e7c/source3/winbindd/idmap_autorid_tdb.c#L82), and therefore can create mismatching mappings if two workstations or servers have seen the users DOMA\usera and DOMB\userb in a different order. It is even mentioned in the manual page.
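The autorid-vs-rid tradeoff Alexander describes can be made concrete with a hypothetical smb.conf fragment (the domain name DOMA and the ranges are placeholders, not recommendations; a rid setup also still needs a default `*` backend for well-known SIDs):

```ini
[global]
    security = ads
    workgroup = DOMA
    # Opinionated default under discussion: autorid hands out a sub-range
    # per domain on a first-seen basis, so allocation order matters and
    # two clusters can disagree if they meet domains in a different order.
    idmap config * : backend = autorid
    idmap config * : range = 100000-999999

    # Alternative that can match an existing SSSD/realmd deployment:
    # a per-domain rid backend with statically chosen ranges, which is
    # deterministic given the same configuration everywhere.
    #idmap config * : backend = tdb
    #idmap config * : range = 90000-99999
    #idmap config DOMA : backend = rid
    #idmap config DOMA : range = 1000000-1999999
```

With rid, the mapping is purely algorithmic (UID = range base + RID), which is what makes it reproducible across independently configured hosts, at the cost of having to plan ranges per domain up front.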
[ceph-users] Re: Call for Interest: Managed SMB Protocol Support
On Monday, March 25, 2024 1:46:26 PM EDT Ralph Boehme wrote: > Hi John, > > On 3/21/24 20:12, John Mulligan wrote: > > > I'd like to formally let the wider community know of some work I've been > > involved with for a while now: adding Managed SMB Protocol Support to > > Ceph. SMB being the well known network file protocol native to Windows > > systems and supported by MacOS (and Linux). The other key word "managed" > > meaning integrating with Ceph management tooling - in this particular > > case cephadm for orchestration and eventually a new MGR module for > > managing SMB shares. > > The effort is still in its very early stages. We have a PR adding > > initial > > support for Samba Containers to cephadm [1] and a prototype for an smb > > MGR > > module [2]. We plan on using container images based on the > > samba-container > > project [3] - a team I am already part of. What we're aiming for is a > > feature set similar to the current NFS integration in Ceph, but with a > > focus on bridging non-Linux/Unix clients to CephFS using a protocol built > > into those systems. > > > > A few major features we have planned include: > > * Standalone servers (internally defined users/groups) > > * Active Directory Domain Member Servers > > * Clustered Samba support > > * Exporting Samba stats via Prometheus metrics > > * A `ceph` cli workflow loosely based on the nfs mgr module > > > > I wanted to share this information in case there's wider community > > interest in this effort. > > > certainly! :) > > If it makes sense, you may want to pull in samba-technical where it > makes sense. Absolutely. I'm currently focusing on the basics and those are mostly good to go for our needs in current samba releases. In the future, I'm sure we'll run into times where technical help or changes will be needed. > If there's a need, you can also pull me in directly into > meetings or other channels to discuss things. > Thanks! I appreciate it! 
> Looking forward to seeing you at SambaXP, at least virtually.

You too. :-)

> Any plans to attend SDC from you or others from your team?

I'm unsure. I'll ask around.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Call for Interest: Managed SMB Protocol Support
Hi John,

On 3/21/24 20:12, John Mulligan wrote:
I'd like to formally let the wider community know of some work I've been involved with for a while now: adding Managed SMB Protocol Support to Ceph. SMB being the well-known network file protocol native to Windows systems and supported by macOS (and Linux). The other key word "managed" meaning integrating with Ceph management tooling - in this particular case cephadm for orchestration and eventually a new MGR module for managing SMB shares. The effort is still in its very early stages. We have a PR adding initial support for Samba Containers to cephadm [1] and a prototype for an smb MGR module [2]. We plan on using container images based on the samba-container project [3] - a team I am already part of. What we're aiming for is a feature set similar to the current NFS integration in Ceph, but with a focus on bridging non-Linux/Unix clients to CephFS using a protocol built into those systems.

A few major features we have planned include:
* Standalone servers (internally defined users/groups)
* Active Directory Domain Member Servers
* Clustered Samba support
* Exporting Samba stats via Prometheus metrics
* A `ceph` cli workflow loosely based on the nfs mgr module

I wanted to share this information in case there's wider community interest in this effort.

certainly! :)

If it makes sense, you may want to pull in samba-technical where it makes sense. If there's a need, you can also pull me in directly into meetings or other channels to discuss things.

Looking forward to seeing you at SambaXP, at least virtually. Any plans to attend SDC from you or others from your team?

Cheers!
-slow
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Spam in log file
Nope.

On Mon, Mar 25, 2024 at 8:33 AM Albert Shih wrote:
>
> Le 25/03/2024 à 08:28:54-0400, Patrick Donnelly a écrit
> Hi,
>
> > The fix is in one of the next releases. Check the tracker ticket:
> > https://tracker.ceph.com/issues/63166
>
> Oh thanks. Didn't find it with Google.
>
> Are there any risks/impacts for the cluster?
>
> Regards.
> --
> Albert SHIH 嶺
> France
> Heure locale/Local time:
> lun. 25 mars 2024 13:31:27 CET
>

--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Call for Interest: Managed SMB Protocol Support
On Friday, March 22, 2024 2:56:22 PM EDT Alexander E. Patrakov wrote:
> Hi John,
>
> > A few major features we have planned include:
> > * Standalone servers (internally defined users/groups)
>
> No concerns here
>
> > * Active Directory Domain Member Servers
>
> In the second case, what is the plan regarding UID mapping? Is NFS
> coexistence planned, or a concurrent mount of the same directory using
> CephFS directly?

In the immediate future the plan is to have a very simple, fairly "opinionated" idmapping scheme based on the autorid backend. Sharing the same directories over both NFS and SMB at the same time, also known as "multi-protocol", is not planned for now; however, we're all aware that there's often a demand for this feature and we're aware of the complexity it brings. I expect we'll work on that at some point, but not initially. Similarly, sharing the same directories over an SMB share and directly on a CephFS mount won't be blocked, but we won't recommend it.

> In fact, I am quite skeptical, because, at least in my experience,
> every customer's SAMBA configuration as a domain member is a unique
> snowflake, and cephadm would need an ability to specify arbitrary UID
> mapping configuration to match what the customer uses elsewhere - and
> the match must be precise.

I agree - our initial use case is something along these lines: users of a Ceph cluster that have Windows systems, Mac systems, or appliances that are joined to an existing AD but are not currently interoperating with the Ceph cluster. I expect to add some idmapping configuration and agility down the line, especially supporting some form of rfc2307 idmapping (where Unix IDs are stored in AD). But those who already have idmapping schemes and Samba accessing Ceph will probably need to just continue using their existing setups, as we don't have an immediate plan for migrating those users.

> Here is what I have seen or was told about:
>
> 1.
> We don't care about interoperability with NFS or CephFS, so we just
> let SAMBA invent whatever UIDs and GIDs it needs using the "tdb2"
> idmap backend. It's completely OK that workstations get different UIDs
> and GIDs, as only SIDs traverse the wire.

This is pretty close to our initial plan, but I'm not clear why you'd think that "workstations get different UIDs and GIDs". For all systems accessing the (same) Ceph cluster the ID mapping should be consistent. You did make me consider multi-cluster use cases with something like CephFS volume mirroring - that's something that I hadn't thought of before *but* using an algorithmic mapping backend like autorid (and testing) I think we're mostly OK there.

> 2. [not seen in the wild, the customer did not actually implement it,
> it's a product of internal miscommunication, and I am not sure if it
> is valid at all] We don't care about interoperability with CephFS,
> and, while we have NFS, security guys would not allow running NFS
> non-kerberized. Therefore, no UIDs or GIDs traverse the wire, only
> SIDs and names. Therefore, all we need is to allow both SAMBA and NFS
> to use shared UID mapping allocated on an as-needed basis using the
> "tdb2" idmap module, and it doesn't matter that these UIDs and GIDs
> are inconsistent with what clients choose.

Unfortunately, I don't really understand this item. Fortunately, you say it was only considered, not implemented. :-)

> 3. We don't care about ACLs at all, and don't care about CephFS
> interoperability. We set ownership of all new files to root:root 0666
> using whatever options are available [well, I would rather use a
> dedicated nobody-style uid/gid here]. All we care about is that only
> authorized workstations or authorized users can connect to each NFS or
> SMB share, and we absolutely don't want them to be able to set custom
> ownership or ACLs.

Sometimes known as the "drop-box" use case, I think (not to be confused with the cloud app of a similar name).
We could probably implement something like that as an option, but I had not considered it before.

> 4. We care about NFS and CephFS file ownership being consistent with
> what Windows clients see. We store all UIDs and GIDs in Active
> Directory using the rfc2307 schema, and it's mandatory that all
> servers (especially SAMBA - thanks to the "ad" idmap backend) respect
> that and don't try to invent anything [well, they do - BUILTIN/Users
> gets its GID through tdb2]. Oh, and by the way, we have this strangely
> low-numbered group that everybody gets wrong unless they set "idmap
> config CORP : range = 500-99".

This is oh so similar to a project I worked on prior to working with Ceph. I think we'll need to do this one eventually, but maybe not this year. One nice side effect of running in containers is that the low ID number is less of an issue, because the IDs only matter within the container context (and only then if using the kernel file system access methods). We have much more flexibility with IDs in a container.
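The consistency argument above can be illustrated with a toy sketch of deterministic RID-based ID mapping, in the spirit of Samba's rid/autorid idmap backends (the real autorid backend additionally allocates a per-domain range slot in a tdb; the range values and function below are purely illustrative assumptions, not Samba code or defaults):

```python
# Toy sketch of deterministic RID -> UID mapping (NOT Samba's actual code).
# Given the same range configuration, every host computes the same UID for
# the same (domain slot, RID) pair, with no per-host allocation state.

RANGE_LOW = 100000   # illustrative base of the configured idmap range
RANGE_SIZE = 100000  # illustrative size of each per-domain slot

def sid_to_uid(domain_slot: int, rid: int) -> int:
    """Deterministically map a (domain slot, RID) pair into a UID."""
    if rid >= RANGE_SIZE:
        raise ValueError("RID falls outside the configured range size")
    return RANGE_LOW + domain_slot * RANGE_SIZE + rid

# Same inputs -> same UID on any host accessing the same cluster:
print(sid_to_uid(0, 1103))  # -> 101103
```

Because the mapping is a pure function of the configuration and the RID, hosts that share the same idmap settings agree on UIDs without coordinating, which is why "workstations get different UIDs" should not happen within one cluster.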
[ceph-users] March Ceph Science Virtual User Group
Hey All,

We will be having a Ceph science/research/big cluster call on Wednesday March 27th. If anyone wants to discuss something specific they can add it to the pad linked below. If you have questions or comments you can contact me. This is an informal open call of community members mostly from hpc/htc/research/big cluster environments (though anyone is welcome) where we discuss whatever is on our minds regarding Ceph: updates, outages, features, maintenance, etc. There is no set presenter but I do attempt to keep the conversation lively.

Pad URL: https://pad.ceph.com/p/Ceph_Science_User_Group_20240327

Virtual event details:
March 27, 2024
14:00 UTC
3pm Central European
9am Central US

Main pad for discussions: https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph YouTube channel.

To join the meeting on a computer or mobile phone: https://meet.jit.si/ceph-science-wg

Kevin
--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS/TROPICS
Space Science & Engineering Center
University of Wisconsin-Madison
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Spam in log file
Le 25/03/2024 à 08:28:54-0400, Patrick Donnelly a écrit
Hi,

> The fix is in one of the next releases. Check the tracker ticket:
> https://tracker.ceph.com/issues/63166

Oh thanks. Didn't find it with Google.

Are there any risks/impacts for the cluster?

Regards.
--
Albert SHIH 嶺
France
Heure locale/Local time:
lun. 25 mars 2024 13:31:27 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Spam in log file
Hi Albert, The fix is in one of the next releases. Check the tracker ticket: https://tracker.ceph.com/issues/63166 On Mon, Mar 25, 2024 at 8:23 AM Albert Shih wrote: > > Hi everyone. > > On my cluster I got spam by my cluster with message like > > Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return > metadata for mds.cephfs.cthulhu2.dqahyt: (2) No such file or directory > Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return > metadata for mds.cephfs.cthulhu3.xvboir: (2) No such file or directory > Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return > metadata for mds.cephfs.cthulhu5.kwmyyg: (2) No such file or directory > > I got 5 server for the service (cthulhu 1->5) and indeed when from cthulhu1 > (or 2) I try : > > something: > > root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu2.dqahyt > {} > Error ENOENT: > root@cthulhu2: > > but that works on 1 or 4 > > root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu1.sikvjf > { > "addr": > "[v2:145.238.187.184:6800/1315478297,v1:145.238.187.184:6801/1315478297]", > "arch": "x86_64", > "ceph_release": "quincy", > "ceph_version": "ceph version 17.2.7 > (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)", > "ceph_version_short": "17.2.7", > "container_hostname": "cthulhu1", > "container_image": > "quay.io/ceph/ceph@sha256:62465e744a80832bde6a57120d3ba076613e8a19884b274f9cc82580e249f6e1", > "cpu": "Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz", > "distro": "centos", > "distro_description": "CentOS Stream 8", > "distro_version": "8", > "hostname": "cthulhu1", > "kernel_description": "#1 SMP Debian 5.10.209-2 (2024-01-31)", > "kernel_version": "5.10.0-28-amd64", > "mem_swap_kb": "16777212", > "mem_total_kb": "263803496", > "os": "Linux" > } > root@cthulhu2:/etc/ceph# > > I check the caps and don't see anything special. 
> > I got also (I don't know if it's related) those message : > > Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open > from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready > for session (expect reconnect) > Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open > from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready > for session (expect reconnect) > Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open > from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready > for session (expect reconnect) > Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open > from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready > for session (expect reconnect) > Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open > from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready > for session (expect reconnect) > Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open > from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready > for session (expect reconnect) > > Regards. > -- > Albert SHIH 嶺 > France > Heure locale/Local time: > lun. 25 mars 2024 13:08:33 CET > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- Patrick Donnelly, Ph.D. He / Him / His Red Hat Partner Engineer IBM, Inc. GPG: 19F28A586F808C2402351B93C3301A3E258DD79D ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Spam in log file
Hi everyone.

My cluster is spamming the logs with messages like:

Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return metadata for mds.cephfs.cthulhu2.dqahyt: (2) No such file or directory
Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return metadata for mds.cephfs.cthulhu3.xvboir: (2) No such file or directory
Mar 25 13:10:13 cthulhu2 ceph-mgr[2843]: mgr finish mon failed to return metadata for mds.cephfs.cthulhu5.kwmyyg: (2) No such file or directory

I have 5 servers for the service (cthulhu1->5), and indeed when I try from cthulhu1 (or 2):

root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu2.dqahyt
{}
Error ENOENT:
root@cthulhu2:

but it works for 1 or 4:

root@cthulhu2:/etc/ceph# ceph mds metadata cephfs.cthulhu1.sikvjf
{
    "addr": "[v2:145.238.187.184:6800/1315478297,v1:145.238.187.184:6801/1315478297]",
    "arch": "x86_64",
    "ceph_release": "quincy",
    "ceph_version": "ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)",
    "ceph_version_short": "17.2.7",
    "container_hostname": "cthulhu1",
    "container_image": "quay.io/ceph/ceph@sha256:62465e744a80832bde6a57120d3ba076613e8a19884b274f9cc82580e249f6e1",
    "cpu": "Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz",
    "distro": "centos",
    "distro_description": "CentOS Stream 8",
    "distro_version": "8",
    "hostname": "cthulhu1",
    "kernel_description": "#1 SMP Debian 5.10.209-2 (2024-01-31)",
    "kernel_version": "5.10.0-28-amd64",
    "mem_swap_kb": "16777212",
    "mem_total_kb": "263803496",
    "os": "Linux"
}
root@cthulhu2:/etc/ceph#

I checked the caps and don't see anything special.
I also got (I don't know if it's related) these messages:

Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready for session (expect reconnect)
Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready for session (expect reconnect)
Mar 25 13:18:38 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready for session (expect reconnect)
Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open from mds.cephfs.cthulhu3.xvboir v2:145.238.187.186:6800/1297104944; not ready for session (expect reconnect)
Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open from mds.cephfs.cthulhu2.dqahyt v2:145.238.187.185:6800/2763465960; not ready for session (expect reconnect)
Mar 25 13:18:39 cthulhu2 ceph-mgr[2843]: mgr.server handle_open ignoring open from mds.cephfs.cthulhu5.kwmyyg v2:145.238.187.188:6800/449122091; not ready for session (expect reconnect)

Regards.
--
Albert SHIH 嶺
France
Heure locale/Local time:
lun. 25 mars 2024 13:08:33 CET
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
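To see which daemons the mgr is complaining about (and how often), one rough approach is to tally the daemon names out of the journal lines. A minimal sketch, using a few inline sample lines in the same shape as the messages above (in practice you would feed in real `journalctl` output instead):

```python
import re
from collections import Counter

# Sample lines shaped like the mgr log messages above.
log_lines = [
    "mgr finish mon failed to return metadata for mds.cephfs.cthulhu2.dqahyt: (2) No such file or directory",
    "mgr finish mon failed to return metadata for mds.cephfs.cthulhu3.xvboir: (2) No such file or directory",
    "mgr finish mon failed to return metadata for mds.cephfs.cthulhu2.dqahyt: (2) No such file or directory",
]

# Capture the daemon name between "metadata for " and the trailing colon.
pattern = re.compile(r"failed to return metadata for (mds\.\S+):")
counts = Counter(
    m.group(1) for m in (pattern.search(line) for line in log_lines) if m
)

for daemon, n in counts.most_common():
    print(f"{daemon}: {n}")
```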
[ceph-users] Re: Large number of misplaced PGs but little backfill going on
On 24/03/2024 01:14, Torkil Svensgaard wrote:
On 24-03-2024 00:31, Alexander E. Patrakov wrote:
Hi Torkil,

Hi Alexander

Thanks for the update. Even though the improvement is small, it is still an improvement, consistent with the osd_max_backfills value, and it proves that there are still unsolved peering issues. I have looked at both the old and the new state of the PG, but could not find anything else interesting. I also looked again at the state of PG 37.1. It is known what blocks the backfill of this PG; please search for "blocked_by." However, this is just one data point, which is insufficient for any conclusions. Try looking at other PGs. Is there anything too common in the non-empty "blocked_by" blocks?

I'll take a look at that tomorrow, perhaps we can script something meaningful.

Hi Alexander

While working on a script querying all PGs and making a list of all OSDs found in a blocked_by list, and how many times for each, I discovered something odd about pool 38:

"
[root@lazy blocked_by]# sh blocked_by.sh 38 | tee pool38
OSDs blocking other OSDs:
OSD 425: 5 instance(s)
OSD 426: 6 instance(s)
OSD 34: 7 instance(s)
OSD 36: 5 instance(s)
OSD 146: 3 instance(s)
OSD 6: 2 instance(s)
OSD 5: 8 instance(s)
OSD 131: 7 instance(s)
OSD 4: 9 instance(s)
OSD 3: 5 instance(s)
OSD 2: 5 instance(s)
OSD 1: 2 instance(s)
OSD 0: 4 instance(s)
OSD 167: 1 instance(s)
OSD 168: 3 instance(s)
OSD 450: 2 instance(s)
OSD 46: 6 instance(s)
OSD 154: 3 instance(s)
OSD 156: 2 instance(s)
OSD 90: 2 instance(s)
OSD 227: 4 instance(s)
OSD 10: 4 instance(s)
OSD 15: 6 instance(s)
OSD 449: 4 instance(s)
OSD 192: 2 instance(s)
OSD 67: 3 instance(s)
"

All PGs in the pool are active+clean, so why are there any blocked_by at all? One example attached.

Mvh. Torkil

I think we have to look for patterns in other ways, too. One tool that produces good visualizations is TheJJ balancer. Although it is called a "balancer," it can also visualize the ongoing backfills.
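A script of the kind described above can be sketched along these lines. This assumes `ceph pg dump pgs --format json` output containing a `pg_stats` array with per-PG `blocked_by` lists (the exact JSON layout can vary between Ceph releases), and uses a small inline sample rather than a live cluster:

```python
import json
from collections import Counter

# Inline sample in the assumed shape of `ceph pg dump pgs --format json`;
# on a real cluster you would parse the command's actual output instead.
sample = json.loads("""
{"pg_stats": [
    {"pgid": "38.0", "state": "active+clean", "blocked_by": [425, 426]},
    {"pgid": "38.1", "state": "active+clean", "blocked_by": [425]},
    {"pgid": "38.2", "state": "active+clean", "blocked_by": []}
]}
""")

# Count how many PGs list each OSD in their blocked_by field.
blockers = Counter()
for pg in sample["pg_stats"]:
    blockers.update(pg.get("blocked_by", []))

print("OSDs blocking other OSDs:")
for osd, n in blockers.most_common():
    print(f"OSD {osd}: {n} instance(s)")
```

Sorting by count (via `most_common()`) surfaces the OSDs that appear most often, which is the pattern the thread is looking for.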
The tool is available at https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py Run it as follows: ./placementoptimizer.py showremapped --by-osd | tee remapped.txt Output attached. Thanks again. Mvh. Torkil On Sun, Mar 24, 2024 at 5:50 AM Torkil Svensgaard wrote: Hi Alex New query output attached after restarting both OSDs. OSD 237 is no longer mentioned but it unfortunately made no difference for the number of backfills which went 59->62->62. Mvh. Torkil On 23-03-2024 22:26, Alexander E. Patrakov wrote: Hi Torkil, I have looked at the files that you attached. They were helpful: pool 11 is problematic, it complains about degraded objects for no obvious reason. I think that is the blocker. I also noted that you mentioned peering problems, and I suspect that they are not completely resolved. As a somewhat-irrational move, to confirm this theory, you can restart osd.237 (it is mentioned at the end of query.11.fff.txt, although I don't understand why it is there) and then osd.298 (it is the primary for that pg) and see if any additional backfills are unblocked after that. Also, please re-query that PG again after the OSD restart. On Sun, Mar 24, 2024 at 4:56 AM Torkil Svensgaard wrote: On 23-03-2024 21:19, Alexander E. Patrakov wrote: Hi Torkil, Hi Alexander I have looked at the CRUSH rules, and the equivalent rules work on my test cluster. So this cannot be the cause of the blockage. Thank you for taking the time =) What happens if you increase the osd_max_backfills setting temporarily? We already had the mclock override option in place and I re-enabled our babysitter script which sets osd_max_backfills pr OSD to 1-3 depending on how full they are. Active backfills went from 16 to 53 which is probably because default osd_max_backfills for mclock is 1. I think 53 is still a low number of active backfills given the large percentage misplaced. It may be a good idea to investigate a few of the stalled PGs. 
Please run commands similar to this one: ceph pg 37.0 query > query.37.0.txt ceph pg 37.1 query > query.37.1.txt ... and the same for the other affected pools. A few samples attached. Still, I must say that some of your rules are actually unsafe. The 4+2 rule as used by rbd_ec_data will not survive a datacenter-offline incident. Namely, for each PG, it chooses OSDs from two hosts in each datacenter, so 6 OSDs total. When a datacenter is offline, you will, therefore, have only 4 OSDs up, which is exactly the number of data chunks. However, the pool requires min_size 5, so all PGs will be inactive (to prevent data corruption) and will stay inactive until the datacenter comes up again. However, please don't set min_size to 4 - then, any additional incident (like a defective disk) will lead to data loss, and the shards in the datacenter which went offline would be useless because they do not correspond to the updated shards written by the clients. Thanks for the explanation. This is an
[ceph-users] Re: ceph cluster extremely unbalanced
Hi Denis, As the vast majority of OSDs have bluestore_min_alloc_size = 65536, I think you can safely ignore https://tracker.ceph.com/issues/64715. The only consequence will be that 58 OSDs will be less full than others. In other words, please use either the hybrid approach or the built-in balancer right away. As for migrating to the modern defaults for bluestore_min_alloc_size, yes, recreating OSDs host-by-host (once you have the cluster balanced) is the only way. You can keep using the built-in balancer while doing that. On Mon, Mar 25, 2024 at 5:04 PM Denis Polom wrote: > > Hi Alexander, > > that sounds pretty promising to me. > > I've checked bluestore_min_alloc_size and most 1370 OSDs have value 65536. > > You mentioned: "You will have to do that weekly until you redeploy all > OSDs that were created with 64K bluestore_min_alloc_size" > > Is it the only way to approach this, that each OSD has to be recreated? > > Thank you for reply > > dp > > On 3/24/24 12:44 PM, Alexander E. Patrakov wrote: > > Hi Denis, > > > > My approach would be: > > > > 1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K > > bluestore_min_alloc_size. If so, you cannot really use the built-in > > balancer, as it would result in a bimodal distribution instead of a > > proper balance, see https://tracker.ceph.com/issues/64715, but let's > > ignore this little issue if you have enough free space. > > 2. Change the weights as appropriate. Make absolutely sure that there > > are no reweights other than 1.0. Delete all dead or destroyed OSDs > > from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL > > warnings that appear, they will be gone during the next step. > > 3. Run this little script from Cern to stop the data movement that was > > just initiated: > > https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py, > > pipe its output to bash. 
This should cancel most of the data movement, > > but not all - the script cannot stop the situation when two OSDs want > > to exchange their erasure-coded shards, like this: [1,2,3,4] -> > > [1,3,2,4]. > > 4. Set the "target max misplaced ratio" option for MGR to what you > > think is appropriate. The default is 0.05, and this means that the > > balancer will enable at most 5% of the PGs to participate in the data > > movement. I suggest starting with 0.01 and increasing if there is no > > visible impact of the balancing on the client traffic. > > 5. Enable the balancer. > > > > If you think that https://tracker.ceph.com/issues/64715 is a problem > > that would prevent you from using the built-in balancer: > > > > 4. Download this script: > > https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py > > 5. Run it as follows: ./placementoptimizer.py -v balance --osdsize > > device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash > > > > This will move at most 500 PGs to better places, starting with the > > fullest OSDs. All weights are ignored, and the switches take care of > > the bluestore_min_alloc_size overhead mismatch. You will have to do > > that weekly until you redeploy all OSDs that were created with 64K > > bluestore_min_alloc_size. > > > > A hybrid approach (initial round of balancing with TheJJ, then switch > > to the built-in balancer) may also be viable. > > > > On Sun, Mar 24, 2024 at 7:09 PM Denis Polom wrote: > >> Hi guys, > >> > >> recently I took over a care of Ceph cluster that is extremely > >> unbalanced. Cluster is running on Quincy 17.2.7 (upgraded Nautilus -> > >> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it. > >> > >> Crush failure domain is datacenter (there are 3), data pool is EC 3+3. > >> > >> This cluster had and has balancer disabled for years. And was "balanced" > >> manually by changing OSDs crush weights. 
So now it is complete mess and > >> I would like to change it to have OSDs crush weight same (3.63898) and > >> to enable balancer with upmap. > >> > >> From `ceph osd df ` sorted from the least used to most used OSDs: > >> > >> IDCLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META > >> AVAIL %USE VAR PGS STATUS > >> MIN/MAX VAR: 0.76/1.16 STDDEV: 5.97 > >>TOTAL 5.1 PiB 3.7 PiB 3.7 PiB 2.9 MiB 8.5 > >> TiB 1.5 PiB 71.50 > >>428hdd 3.63898 1.0 3.6 TiB 2.0 TiB 2.0 TiB1 KiB 5.6 > >> GiB 1.7 TiB 54.55 0.76 96 up > >>223hdd 3.63898 1.0 3.6 TiB 2.0 TiB 2.0 TiB3 KiB 5.6 > >> GiB 1.7 TiB 54.58 0.76 95 up > >> ... > >> > >> ... > >> > >> ... > >> > >>591hdd 3.53999 1.0 3.6 TiB 3.0 TiB 3.0 TiB1 KiB 7.0 > >> GiB 680 GiB 81.74 1.14 125 up > >>832hdd 3.5 1.0 3.6 TiB 3.0 TiB 3.0 TiB4 KiB 6.9 > >> GiB 680 GiB 81.75 1.14 114 up > >>248hdd 3.63898 1.0 3.6 TiB 3.0 TiB 3.0 TiB3 KiB 7.2 > >> GiB 646 GiB 82.67 1.16 121
[ceph-users] Re: ceph cluster extremely unbalanced
Hi Alexander, that sounds pretty promising to me. I've checked bluestore_min_alloc_size and most 1370 OSDs have value 65536. You mentioned: "You will have to do that weekly until you redeploy all OSDs that were created with 64K bluestore_min_alloc_size" Is it the only way to approach this, that each OSD has to be recreated? Thank you for reply dp On 3/24/24 12:44 PM, Alexander E. Patrakov wrote: Hi Denis, My approach would be: 1. Run "ceph osd metadata" and see if you have a mix of 64K and 4K bluestore_min_alloc_size. If so, you cannot really use the built-in balancer, as it would result in a bimodal distribution instead of a proper balance, see https://tracker.ceph.com/issues/64715, but let's ignore this little issue if you have enough free space. 2. Change the weights as appropriate. Make absolutely sure that there are no reweights other than 1.0. Delete all dead or destroyed OSDs from the CRUSH map by purging them. Ignore any PG_BACKFILL_FULL warnings that appear, they will be gone during the next step. 3. Run this little script from Cern to stop the data movement that was just initiated: https://raw.githubusercontent.com/cernceph/ceph-scripts/master/tools/upmap/upmap-remapped.py, pipe its output to bash. This should cancel most of the data movement, but not all - the script cannot stop the situation when two OSDs want to exchange their erasure-coded shards, like this: [1,2,3,4] -> [1,3,2,4]. 4. Set the "target max misplaced ratio" option for MGR to what you think is appropriate. The default is 0.05, and this means that the balancer will enable at most 5% of the PGs to participate in the data movement. I suggest starting with 0.01 and increasing if there is no visible impact of the balancing on the client traffic. 5. Enable the balancer. If you think that https://tracker.ceph.com/issues/64715 is a problem that would prevent you from using the built-in balancer: 4. 
Download this script: https://raw.githubusercontent.com/TheJJ/ceph-balancer/master/placementoptimizer.py 5. Run it as follows: ./placementoptimizer.py -v balance --osdsize device --osdused delta --max-pg-moves 500 --osdfrom fullest | bash This will move at most 500 PGs to better places, starting with the fullest OSDs. All weights are ignored, and the switches take care of the bluestore_min_alloc_size overhead mismatch. You will have to do that weekly until you redeploy all OSDs that were created with 64K bluestore_min_alloc_size. A hybrid approach (initial round of balancing with TheJJ, then switch to the built-in balancer) may also be viable. On Sun, Mar 24, 2024 at 7:09 PM Denis Polom wrote: Hi guys, recently I took over a care of Ceph cluster that is extremely unbalanced. Cluster is running on Quincy 17.2.7 (upgraded Nautilus -> Octopus -> Quincy) and has 1428 OSDs (HDDs). We are running CephFS on it. Crush failure domain is datacenter (there are 3), data pool is EC 3+3. This cluster had and has balancer disabled for years. And was "balanced" manually by changing OSDs crush weights. So now it is complete mess and I would like to change it to have OSDs crush weight same (3.63898) and to enable balancer with upmap. From `ceph osd df ` sorted from the least used to most used OSDs: IDCLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS MIN/MAX VAR: 0.76/1.16 STDDEV: 5.97 TOTAL 5.1 PiB 3.7 PiB 3.7 PiB 2.9 MiB 8.5 TiB 1.5 PiB 71.50 428hdd 3.63898 1.0 3.6 TiB 2.0 TiB 2.0 TiB1 KiB 5.6 GiB 1.7 TiB 54.55 0.76 96 up 223hdd 3.63898 1.0 3.6 TiB 2.0 TiB 2.0 TiB3 KiB 5.6 GiB 1.7 TiB 54.58 0.76 95 up ... ... ... 
591 hdd 3.53999 1.0 3.6 TiB 3.0 TiB 3.0 TiB 1 KiB 7.0 GiB 680 GiB 81.74 1.14 125 up
832 hdd 3.5 1.0 3.6 TiB 3.0 TiB 3.0 TiB 4 KiB 6.9 GiB 680 GiB 81.75 1.14 114 up
248 hdd 3.63898 1.0 3.6 TiB 3.0 TiB 3.0 TiB 3 KiB 7.2 GiB 646 GiB 82.67 1.16 121 up
559 hdd 3.63799 1.0 3.6 TiB 3.0 TiB 3.0 TiB 0 B 7.0 GiB 644 GiB 82.70 1.16 123 up
TOTAL 5.1 PiB 3.7 PiB 3.6 PiB 2.9 MiB 8.5 TiB 1.5 PiB 71.50
MIN/MAX VAR: 0.76/1.16 STDDEV: 5.97

crush rule:
{
    "rule_id": 10,
    "rule_name": "ec33hdd_rule",
    "type": 3,
    "steps": [
        { "op": "set_chooseleaf_tries", "num": 5 },
        { "op": "set_choose_tries", "num": 100 },
        { "op": "take", "item": -2, "item_name": "default~hdd" },
        { "op": "choose_indep", "num": 3, "type": "datacenter" },
        { "op": "choose_indep", "num": 2, "type": "osd" },
        { "op": "emit" }
    ]
[ceph-users] Re: Call for Interest: Managed SMB Protocol Support
Hi,

On 3/22/24 19:56, Alexander E. Patrakov wrote:
In fact, I am quite skeptical, because, at least in my experience, every customer's SAMBA configuration as a domain member is a unique snowflake, and cephadm would need an ability to specify arbitrary UID mapping configuration to match what the customer uses elsewhere - and the match must be precise.

Yes, there has to be great flexibility possible in the configuration of the SMB service.

BTW: It would be great if the orchestrator could configure Ganesha to export NFS shares with Kerberos security, but this is off-topic in this thread.

Oh, and by the way, we have this strangely low-numbered group that everybody gets wrong unless they set "idmap config CORP : range = 500-99".

This is because Debian changed the standard minimum uid/gid somewhere in the 2000s. And if you have an "old" company running Debian since before then, you have user IDs and group IDs in the range 500 - 1000.

Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io