[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, so it is partly "unexpectedly expected" and partly buggy. I really wish the crush implementation honoured a few obvious invariants. It is extremely counter-intuitive that mappings taken from a sub-set change even if both the sub-set and the mapping instructions themselves don't.

> - Use different root names

That's what we are doing and it works like a charm, also for draining OSDs.

> more specific crush rules.

I guess you mean using something like "step take DCA class hdd" instead of "step take default class hdd", as in:

rule rule-ec-k7m11 {
    id 1
    type erasure
    min_size 3
    max_size 18
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take DCA class hdd
    step chooseleaf indep 9 type host
    step take DCB class hdd
    step chooseleaf indep 9 type host
    step emit
}

According to the documentation, this should actually work and be almost equivalent to your crush rule. The difference is that it will make sure that the first 9 shards are from DCA and the second 9 shards from DCB (it's an ordering). A side effect is that all primary OSDs will be in DCA if both DCs are up. I remember people asking for that as a feature in multi-DC set-ups, to have the primary OSDs in the DC with the lowest latency by default.

Can you give this crush rule a try and report back whether or not the behaviour when adding hosts changes? In case you have time, it would be great if you could collect information on (reproducing) the fatal peering problem. While remappings might be "unexpectedly expected", it is clearly a serious bug that incomplete and unknown PGs show up in the process of adding hosts at the root.
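To make the ordering property concrete, here is a toy sketch in plain Python (made-up host names; this mimics only the *ordering* implied by the two "step take" blocks, not the real CRUSH placement algorithm): the first 9 shards are drawn from DCA and the next 9 from DCB, so the primary, i.e. the first entry of the mapping, is always in DCA while both DCs are up.

```python
import random

# Hypothetical host names, purely for illustration.
DCA_HOSTS = [f"dca-host{i}" for i in range(1, 10)]
DCB_HOSTS = [f"dcb-host{i}" for i in range(1, 10)]

def map_pg(pg_seed: int) -> list:
    """Emulate: take DCA -> chooseleaf indep 9, then take DCB -> chooseleaf indep 9."""
    rng = random.Random(pg_seed)
    return rng.sample(DCA_HOSTS, 9) + rng.sample(DCB_HOSTS, 9)

for pg in range(100):
    mapping = map_pg(pg)
    assert len(set(mapping)) == 18   # 9 distinct hosts per DC
    assert mapping[0] in DCA_HOSTS   # primary always lands in DCA
```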
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

I start to think that the root cause of the remapping is just the fact that the crush rule(s) contain(s) the "step take default" line:

step take default class hdd

My interpretation is that crush simply tries to honor the rule: consider everything underneath the "default" root, so PGs get remapped if new hosts are added there (but not in their designated subtree buckets). The effect (unknown PGs) is bad, but there are a couple of options to avoid that:

- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything is where it's supposed to be, then reweight the OSDs.

Zitat von Eugen Block:

> Hi Frank,
>
> thanks for looking up those trackers. I haven't looked into them
> yet, I'll read your response in detail later, but I wanted to add
> some new observation:
>
> I added another root bucket (custom) to the osd tree:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> -12         0        root custom
>  -1         0.27698  root default
>  -8         0.09399      room room1
>  -3         0.04700          host host1
>   7    hdd  0.02299              osd.7    up       1.0      1.0
>  10    hdd  0.02299              osd.10   up       1.0      1.0
> ...
>
> Then I tried this approach to add a new host directly to the
> non-default root:
>
> # cat host5.yaml
> service_type: host
> hostname: host5
> addr: 192.168.168.54
> location:
>   root: custom
> labels:
> - osd
>
> # ceph orch apply -i host5.yaml
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> -12         0.04678  root custom
> -23         0.04678      host host5
>   1    hdd  0.02339          osd.1    up       1.0      1.0
>  13    hdd  0.02339          osd.13   up       1.0      1.0
>  -1         0.27698  root default
>  -8         0.09399      room room1
>  -3         0.04700          host host1
>   7    hdd  0.02299              osd.7    up       1.0      1.0
>  10    hdd  0.02299              osd.10   up       1.0      1.0
> ...
>
> host5 is placed directly underneath the new custom root correctly,
> but not a single PG is marked "remapped"! So this is actually what I
> (or we) expected. I'm not sure yet what to make of it, but I'm
> leaning towards using this approach in the future: add hosts
> underneath a different root first, and then move them to their
> designated location.
>
> Just to validate again, I added host6 without a location spec, so
> it's placed underneath the default root again:
>
> # ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REW
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, just to add another strange observation from long ago: https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't see any reweights in your trees, so it's something else. However, there seem to be multiple issues with EC pools and peering.

I also want to clarify:

> If this is the case, it is possible that this is partly intentional and
> partly buggy.

"Partly intentional" here means the code behaviour changes when you add OSDs to the root outside the rooms and this change is not considered a bug. It is clearly *not* expected, as it means you cannot do maintenance on a pool living on a tree A without affecting pools on the same device class living on an unmodified subtree of A.

From a ceph user's point of view everything you observe looks buggy. I would really like to see a good explanation why the mappings in the subtree *should* change when adding OSDs above that subtree as in your case, when the expectation, for good reasons, is that they don't. This would help devising clean procedures for adding hosts when you (and I) want to add OSDs first without any peering and then move OSDs into place, so that peering happens separately from adding and not as a total mess with everything in parallel.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder
Sent: Thursday, May 23, 2024 6:32 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi Eugen, I'm at home now. Could you please check all the remapped PGs that they have no shards on the new OSDs, i.e. that it's just shuffling around mappings within the same set of OSDs under the rooms? If this is the case, it is possible that this is partly intentional and partly buggy.
The remapping is then probably intentional, and the method I use with a disjoint tree for new hosts prevents such remappings initially (the crush code sees the new OSDs in the root and doesn't use them, but their presence does change choice orders, resulting in remapped PGs). However, the unknown PGs should clearly not occur. I'm afraid that the peering code has quite a few bugs; I reported something at least similarly weird a long time ago: https://tracker.ceph.com/issues/56995 and https://tracker.ceph.com/issues/46847. Might even be related. It looks like peering can lose track of PG members in certain situations (specifically, after adding OSDs until rebalancing completed). In my cases, I get degraded objects even though everything is obviously still around. Flipping between the crush-maps before/after the change re-discovers everything again. Issue 46847 is long-standing and still unresolved. In case you need to file a tracker, please consider referring to the two above as "might be related" if you deem them related.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
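The check asked for above (do the remapped PGs really avoid the new OSDs?) is easy to script. A minimal sketch with made-up PG-to-OSD maps (in practice you would feed it two snapshots of the up/acting sets from `ceph pg dump`; the PG IDs and OSD numbers here are hypothetical):

```python
# Compare a before/after snapshot of PG mappings: which PGs were
# remapped, and did any of them pick up a shard on the new OSDs?
before = {"6.8": [7, 6, 11], "6.9": [3, 5, 9], "6.a": [2, 8, 4]}
after  = {"6.8": [6, 7, 11], "6.9": [3, 5, 9], "6.a": [2, 8, 4]}
new_osds = {20, 21}  # OSDs added at the root

remapped = sorted(pg for pg in before if before[pg] != after[pg])
on_new = sorted(pg for pg in remapped if set(after[pg]) & new_osds)

print(remapped)  # ['6.8'] -> mapping changed
print(on_new)    # []      -> but only reshuffled within the old OSD set
```

An empty second list means a pure reshuffle within the old set of OSDs, i.e. the "partly intentional" case described in the mail.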
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, I'm at home now. Could you please check all the remapped PGs that they have no shards on the new OSDs, i.e. that it's just shuffling around mappings within the same set of OSDs under the rooms? If this is the case, it is possible that this is partly intentional and partly buggy. The remapping is then probably intentional, and the method I use with a disjoint tree for new hosts prevents such remappings initially (the crush code sees the new OSDs in the root and doesn't use them, but their presence does change choice orders, resulting in remapped PGs). However, the unknown PGs should clearly not occur. I'm afraid that the peering code has quite a few bugs; I reported something at least similarly weird a long time ago: https://tracker.ceph.com/issues/56995 and https://tracker.ceph.com/issues/46847. Might even be related. It looks like peering can lose track of PG members in certain situations (specifically, after adding OSDs until rebalancing completed). In my cases, I get degraded objects even though everything is obviously still around. Flipping between the crush-maps before/after the change re-discovers everything again. Issue 46847 is long-standing and still unresolved. In case you need to file a tracker, please consider referring to the two above as "might be related" if you deem them related.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: does the RBD client block write when the Watcher times out?
Hi, we ran into the same issue, and there is actually another use case: live-migration of VMs. This requires an RBD image being mapped to two clients simultaneously, so this is intentional. If multiple clients map an image in RW mode, the ceph back-end will cycle the write lock between the clients to allow each of them to flush writes; this is intentional too. Coordinating access is the job of the orchestrator. In the live-migration case specifically, it's explicitly managing a write lock such that writes occur in the correct order. It's not a ceph job, it's an orchestration job. The rbd interface just provides the tools to do it; for example, you can attach information that helps you hunt down dead-looking clients and kill them properly before mapping an image somewhere else.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Ilya Dryomov
Sent: Thursday, May 23, 2024 2:05 PM
To: Yuma Ogami
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: does the RBD client block write when the Watcher times out?

On Thu, May 23, 2024 at 4:48 AM Yuma Ogami wrote:
>
> Hello.
>
> I'm currently verifying the behavior of RBD on failure. I'm wondering
> about the consistency of RBD images after network failures. As a
> result of my investigation, I found that RBD sets a Watcher to RBD
> image if a client mounts this volume to prevent multiple mounts. In

Hi Yuma,

The watcher is created to watch for updates (technically, to listen to notifications) on the RBD image, not to prevent multiple mounts. RBD allows the same image to be mapped multiple times on the same node or on different nodes.

> addition, I found that if the client is isolated from the network for
> a long time, the Watcher is released. However, the client still mounts
> this image. In this situation, if another client can also mount this
> image and the image is writable from both clients, data corruption
> occurs. Could you tell me whether this is a realistic scenario?
Yes, this is a realistic scenario which can occur even if the client isn't isolated from the network. If the user does this, it's up to the user to ensure that everything remains consistent. One use case for mapping the same image on multiple nodes is a clustered (also referred to as a shared disk) filesystem, such as OCFS2.

Thanks,
Ilya

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, thanks for this clarification. Yes, with the observations you describe for transition 1->2, something is very wrong. Nothing should happen. Unfortunately, I'm going to be on holidays and, generally, don't have too much time. If they can afford to share the osdmap (ceph osd getmap -o file), I could also take a look at some point.

I don't think it has to do with set_choose_tries; there is likely something else screwed up badly. There should simply not be any remapping going on at this stage. Just for fun, you should be able to produce a clean crushmap from scratch with a similar or the same tree and check if you see the same problems. Using the full osdmap with osdmaptool allows one to reproduce the exact mappings as used in the cluster, and it encodes other important information as well. That's why I'm asking for this instead of just the crush map.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: Thursday, May 23, 2024 1:26 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi Frank,

thanks for chiming in here.

> Please correct if this is wrong. Assuming it's correct, I conclude
> the following.

You assume correctly.

> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.

The unknown PGs were observed during/after 1 -> 2. All or almost all PGs were reported as "remapped"; I don't remember the exact number, but it was more than 4k, and the largest pool has 4096 PGs. We didn't see down OSDs at all. Only after moving the hosts into their designated location (the DCs) did the unknown PGs clear and the application resume its operation. I don't want to overload this thread, but I asked for a copy of their crushmap to play around a bit.
I moved the new hosts out of the DCs into the default root via 'crushtool --move ...'; then running the crushtool --test command

# crushtool -i crushmap --test --rule 1 --num-rep 18 --show-choose-tries [--show-bad-mappings] --show-utilization

results in a couple of issues:

- There are lots of bad mappings, no matter how high the number for set_choose_tries is set.
- The --show-utilization output shows 240 OSDs in use (there were 240 OSDs before the expansion), but plenty of PGs have only 9 chunks assigned:

rule 1 (rule-ec-k7m11), x = 0..1023, numrep = 18..18
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0: 55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9: 488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18: 481/1024

And this reminds me of the inactive PGs we saw before I failed the mgr; those inactive PGs showed only 9 chunks in the acting set. With k=7 (and min_size=8) that should still be enough, and we have successfully tested disaster recovery with one entire DC down multiple times.

- With --show-mappings some lines contain an empty set like this:

CRUSH rule 1 x 22 []

And one more observation: with the currently active crushmap there are no bad mappings at all when the hosts are in their designated location. So there's definitely something wrong here, I just can't tell what it is yet. I'll play a bit more with that crushmap...

Thanks!
Eugen

Zitat von Frank Schilder:

> Hi Eugen,
>
> I'm afraid the description of your observation breaks a bit with
> causality and this might be the reason for the few replies. To
> produce a bit more structure for when exactly what happened, let's
> look at what I did and didn't get:
>
> Before adding the hosts you have situation
>
> 1)
> default
>   DCA
>     host A1 ... AN
>   DCB
>     host B1 ... BM
>
> Now you add K+L hosts, they go into the default root and we have situation
>
> 2)
> default
>   host C1 ... CK, D1 ... DL
>   DCA
>     host A1 ... AN
>   DCB
>     host B1 ...
BM
>
> As a last step, you move the hosts to their final locations and we
> arrive at situation
>
> 3)
> default
>   DCA
>     host A1 ... AN, C1 ... CK
>   DCB
>     host B1 ... BM, D1 ... DL
>
> Please correct if this is wrong. Assuming it's correct, I conclude
> the following.
>
> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.
>
> We use a somewhat similar procedure except that we have a second
> root (separate disjoint tree) for new hosts/OSDs. However, in terms
> of peering it is the same, and if everything is configured correctly
> I would expect this to happen (this is what happens when we add
> OSDs/hosts):
>
> transition 1->2: hosts get added: no peering, no remapped objects,
> nothing, just new OSDs doing nothing
> transition 2->3:
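The result-size tallies quoted in this message can be checked mechanically. A small sketch (assuming the crushtool --show-utilization text format shown above) that parses the tally lines and reports what fraction of the sampled PGs got a complete 18-shard mapping:

```python
import re

# The three tally lines quoted above, verbatim.
OUTPUT = """\
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0: 55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9: 488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18: 481/1024
"""

sizes = {}
total = 0
for m in re.finditer(r"result size == (\d+):\s*(\d+)/(\d+)", OUTPUT):
    size, count, total = (int(g) for g in m.groups())
    sizes[size] = count

assert sum(sizes.values()) == total == 1024
print(f"complete mappings: {sizes[18] / total:.1%}")      # 47.0%
print(f"half mappings (one DC): {sizes[9] / total:.1%}")  # 47.7%
```

Less than half of the sampled PGs getting a full mapping, and 55 getting none at all, is consistent with the "badly broken rule" reading in the mail.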
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen,

I'm afraid the description of your observation breaks a bit with causality, and this might be the reason for the few replies. To produce a bit more structure for when exactly what happened, let's look at what I did and didn't get:

Before adding the hosts you have situation

1)
default
  DCA
    host A1 ... AN
  DCB
    host B1 ... BM

Now you add K+L hosts, they go into the default root, and we have situation

2)
default
  host C1 ... CK, D1 ... DL
  DCA
    host A1 ... AN
  DCB
    host B1 ... BM

As a last step, you move the hosts to their final locations and we arrive at situation

3)
default
  DCA
    host A1 ... AN, C1 ... CK
  DCB
    host B1 ... BM, D1 ... DL

Please correct if this is wrong. Assuming it's correct, I conclude the following.

Now, from your description it is not clear to me on which of the transitions 1->2 or 2->3 you observe
- peering and/or
- unknown PGs.

We use a somewhat similar procedure except that we have a second root (separate disjoint tree) for new hosts/OSDs. However, in terms of peering it is the same, and if everything is configured correctly I would expect this to happen (this is what happens when we add OSDs/hosts):

transition 1->2: hosts get added: no peering, no remapped objects, nothing, just new OSDs doing nothing
transition 2->3: hosts get moved: peering starts and remapped objects appear, all PGs active+clean

Unknown PGs should not occur (maybe only temporarily when the primary changes or the PG is slow to respond/report status?). The crush bug with too few set_choose_tries is observed if one has *just enough hosts* for the EC profile, and should not be observed if all PGs are active+clean and one *adds hosts*. Persistent unknown PGs can (to my understanding; does unknown mean "has no primary"?) only occur if the number of PGs changes (autoscaler messing around?), because all PGs were active+clean before. The crush bug leads to incomplete PGs, so PGs can go incomplete, but they should always have an acting primary.
This is assuming no OSDs went down/out during the process. Can you please check if my interpretation is correct and describe at which step exactly things start diverging from my expectations.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: Thursday, May 23, 2024 12:05 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi again,

I'm still wondering if I misunderstand some of the ceph concepts. Let's assume the choose_tries value is too low and ceph can't find enough OSDs for the remapping. I would expect that there are some PG chunks in remapping state or unknown or whatever, but why would it affect the otherwise healthy cluster in such a way? Even if ceph doesn't know where to put some of the chunks, I wouldn't expect inactive PGs and a service interruption. What am I missing here?

Thanks,
Eugen

Zitat von Eugen Block:

> Thanks, Konstantin.
> It's been a while since I was last bitten by the choose_tries being
> too low... Unfortunately, I won't be able to verify that... But I'll
> definitely keep that in mind, or at least I'll try to. :-D
>
> Thanks!
>
> Zitat von Konstantin Shalygin:
>
>> Hi Eugen
>>
>>> On 21 May 2024, at 15:26, Eugen Block wrote:
>>>
>>> step set_choose_tries 100
>>
>> I think you should try to increase set_choose_tries to 200.
>> Last year we had a Pacific EC 8+2 deployment of 10 racks. And even
>> with 50 hosts, the value of 100 did not work for us.
>>
>> k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
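The "too few set_choose_tries with just enough hosts" failure mode discussed in this thread can be illustrated with a toy rejection-sampling model (an illustration only, not the real CRUSH code): each try draws a candidate host and rejects duplicates, so with barely enough hosts for the rule's numrep, a small retry budget frequently runs out before enough distinct hosts are found — which shows up as bad/incomplete mappings.

```python
import random

def choose_hosts(hosts, want, tries, rng):
    """Draw candidates until `want` distinct hosts are found or the
    retry budget `tries` is exhausted (a stand-in for set_choose_tries)."""
    chosen = []
    for _ in range(tries):
        if len(chosen) == want:
            break
        candidate = rng.choice(hosts)
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen

rng = random.Random(1)
hosts = [f"host{i}" for i in range(10)]   # barely enough for numrep=9

def bad_mappings(tries):
    return sum(len(choose_hosts(hosts, 9, tries, rng)) < 9
               for _ in range(1024))

print(bad_mappings(15))    # small budget: many incomplete mappings
print(bad_mappings(100))   # larger budget: failures vanish
```

The point of the model: raising the budget makes the incomplete mappings disappear, which is why bumping set_choose_tries helps in the "just enough hosts" case but should not be needed when merely adding hosts to a healthy tree.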
[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?
Hi Stefan, ahh OK, misunderstood your e-mail. It sounded like it was a custom profile, not a standard one shipped with tuned. Thanks for the clarification!

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Stefan Bauer
Sent: Wednesday, May 22, 2024 12:44 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

Hi Frank,

it's pretty straightforward. Just follow the steps:

apt install tuned
tuned-adm profile network-latency

According to [1]:

network-latency
A server profile focused on lowering network latency. This profile favors performance over power savings by setting intel_pstate and min_perf_pct=100. It disables transparent huge pages and automatic NUMA balancing. It also uses cpupower to set the performance cpufreq governor, and requests a cpu_dma_latency value of 1. It also sets busy_read and busy_poll times to 50 μs, and tcp_fastopen to 3.

[1] https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-tuned_adm

Cheers,
Stefan

Am 22.05.24 um 12:18 schrieb Frank Schilder:
> Hi Stefan,
>
> can you provide a link to or copy of the contents of the tuned profile so
> others can also profit from it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Stefan Bauer
> Sent: Wednesday, May 22, 2024 10:51 AM
> To: Anthony D'Atri; ceph-users@ceph.io
> Subject: [ceph-users] Re: How network latency affects ceph performance really
> with NVME only storage?
>
> Hi Anthony and others,
>
> thank you for your reply. To be honest, I'm not even looking for a
> solution, i just wanted to ask if latency affects the performance at all
> in my case and how others handle this ;)
>
> One of our partners delivered a solution with a latency-optimized
> profile for the tuned daemon.
> Now the latency is much better:
>
> apt install tuned
> tuned-adm profile network-latency
>
> # ping 10.1.4.13
> PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
> 64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
> 64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
> 64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
> 64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
> 64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
> 64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
> 64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
> 64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
> ^C
> --- 10.1.4.13 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9001ms
> rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms
>
> Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
>> Check the netmask on your interfaces, is it possible that you're sending
>> inter-node traffic up and back down needlessly?
>>
>>> On May 21, 2024, at 06:02, Stefan Bauer wrote:
>>>
>>> Dear Users,
>>>
>>> i recently setup a new ceph 3 node cluster. Network is meshed between all
>>> nodes (2 x 25G with DAC).
>>> Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20)
>>>
>>> The latency with ping tests between the nodes shows:
>>>
>>> # ping 10.1.3.13
>>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>>> --- 10.1.3.13 ping statistics ---
>>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>>
>>> On another cluster i have much better values, with 10G SFP+ and
>>> fibre-cables:
>>>
>>> 64 bytes fr
[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?
Hi Stefan, can you provide a link to or copy of the contents of the tuned profile so others can also profit from it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Stefan Bauer
Sent: Wednesday, May 22, 2024 10:51 AM
To: Anthony D'Atri; ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

Hi Anthony and others,

thank you for your reply. To be honest, I'm not even looking for a solution, i just wanted to ask if latency affects the performance at all in my case and how others handle this ;)

One of our partners delivered a solution with a latency-optimized profile for the tuned daemon. Now the latency is much better:

apt install tuned
tuned-adm profile network-latency

# ping 10.1.4.13
PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
^C
--- 10.1.4.13 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms

Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
> Check the netmask on your interfaces, is it possible that you're sending
> inter-node traffic up and back down needlessly?
>
>> On May 21, 2024, at 06:02, Stefan Bauer wrote:
>>
>> Dear Users,
>>
>> i recently setup a new ceph 3 node cluster. Network is meshed between all
>> nodes (2 x 25G with DAC).
>> Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20)
>>
>> The latency with ping tests between the nodes shows:
>>
>> # ping 10.1.3.13
>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>> --- 10.1.3.13 ping statistics ---
>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>
>> On another cluster i have much better values, with 10G SFP+ and fibre-cables:
>>
>> 64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
>> ^C
>> --- long-ipv6-ip ping statistics ---
>> 53 packets transmitted, 53 received, 0% packet loss,
time 53260ms
>> rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms
>>
>> If i want best performance, does the latency difference matter at all?
>> Should i change DAC to SFP transceivers with fibre cables to improve
>> overall ceph performance or is this nitpicking?
>>
>> Thanks a lot.
>>
>> Stefan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Mit freundlichen Grüßen
Stefan Bauer
Schulstraße 5
83308 Trostberg
0179-1194767

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
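To put the two RTT figures from this thread into perspective, here is back-of-envelope arithmetic (a sketch; the 2x-RTT cost model and the 0.02 ms device-commit latency are my assumptions, not measurements from these clusters): a replicated sync write pays roughly one client-to-primary round trip plus one primary-to-replicas round trip (replicas written in parallel), so at queue depth 1 the RTT sets a hard ceiling on IOPS.

```python
def qd1_write_latency_ms(rtt_ms, device_ms=0.02):
    # client -> primary RTT, plus primary -> replicas RTT (in parallel),
    # plus an assumed device/commit latency
    return 2 * rtt_ms + device_ms

for label, rtt in [("DAC, untuned (avg 0.137 ms)", 0.137),
                   ("DAC, tuned profile (avg 0.027 ms)", 0.027)]:
    lat = qd1_write_latency_ms(rtt)
    print(f"{label}: ~{lat:.3f} ms per write -> ~{1000 / lat:.0f} IOPS at qd=1")
```

The gap only matters for latency-sensitive, low-queue-depth workloads; at higher queue depths the per-write RTT largely overlaps and throughput is governed by other bottlenecks.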
[ceph-users] Re: dkim on this mailing list
Hi Marc, in case you are working on the list server: at least for me the situation seems to have improved no more than 2-3 hours ago. My own e-mails to the list now pass.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Marc
Sent: Tuesday, May 21, 2024 5:34 PM
To: ceph-users@ceph.io
Subject: [ceph-users] dkim on this mailing list

Just to confirm if I am messing up my mailserver configs. But currently all messages from this mailing list should generate a dkim pass status?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Please discuss about Slow Peering
> Not with the most recent Ceph releases.

Actually, this depends. For SSDs whose IOPS profit from higher iodepth, it is very likely to improve performance, because to this day each OSD has only one kv_sync_thread, and this is typically the bottleneck under heavy IOPS load. Having 2-4 kv_sync_threads per SSD, meaning 2-4 OSDs per disk, will help a lot if this thread is saturating. For NVMes this is usually not required.

The question still remains: do you have enough CPU? If you have 13 disks with 4 OSDs each, you will need a core count of at least 50-ish per host. Newer OSDs might be able to utilize even more on fast disks. You will also need 4 times the RAM.

> I suspect your PGs are too few though.

In addition, on these drives you should aim for 150-200 PGs per OSD (another reason to go x4 OSDs - x4 PGs per drive). We have 198 PGs/OSD on average and this helps a lot with IO, recovery, everything.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Anthony D'Atri
Sent: Tuesday, May 21, 2024 3:06 PM
To: 서민우
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Please discuss about Slow Peering

> I have additional questions,
> We use 13 disks (3.2TB NVMe) per server and allocate one OSD to each disk. In other words 1 node has 13 OSDs.
> Do you think this is inefficient?
> Is it better to create more OSDs by creating LVs on the disk?

Not with the most recent Ceph releases. I suspect your PGs are too few though.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
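For concreteness, the sizing advice above works out as follows for the 13-disk example (my arithmetic; the 6 GB-per-OSD memory target is the figure quoted elsewhere in this thread, and 175 PGs/OSD is simply the middle of the suggested 150-200 range):

```python
disks = 13
osds_per_disk = 4
ram_per_osd_gb = 6     # memory limit per OSD used elsewhere in the thread
pgs_per_osd = 175      # middle of the suggested 150-200 range

osds = disks * osds_per_disk
print(osds)                   # OSD daemons per host -> ~1 core each
print(osds * ram_per_osd_gb)  # GB of RAM for the OSD daemons alone
print(osds * pgs_per_osd)     # PG replicas hosted per node at 175/OSD
```

That is 52 OSDs, so the "core count of at least 50-ish" above is just one core per OSD, and the RAM budget quadruples along with the OSD count.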
[ceph-users] Re: Please discuss about Slow Peering
We are using the read-intensive Kioxia drives (octopus cluster) in RBD pools and are very happy with them. I don't think it's the drives. The last possibility I could think of is CPU. We run 4 OSDs per 1.92TB Kioxia drive to utilize their performance (a single OSD per disk doesn't cut it at all) and have 2x 16-core Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz per server. During normal operations the CPU is only lightly loaded. During peering this load peaks at or above 100%. If not enough CPU power is available, peering will be hit very badly.

What you could check is:

- Number of cores: at least 1 HT per OSD, better 1 core per OSD.
- C-states disabled: we run with the virtualization-performance profile and the CPU is basically always at all-core boost (3.2GHz).
- Sufficient RAM: we run these OSDs with a 6G memory limit, that's 24G per disk! Still, the servers have 50% OSD RAM utilisation and 50% buffers, so there is enough for fast peak allocations during peering.
- Check vm.min_free_kbytes: the default is way too low for OSD hosts; we use vm.min_free_kbytes=4194304 (4G). This can have latency impact for network connections.
- Swap disabled: disable swap on OSD hosts.
- Sysctl network tuning: check that your network parameters are appropriate for your network cards; the kernel defaults are still for 1G connections. There are great tuning guides on-line, here some of our settings for 10G NICs:

# Increase autotuning TCP buffer limits
# 10G fiber/64MB buffers (67108864)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 22500 218450 67108864
net.ipv4.tcp_wmem = 22500 81920 67108864

- Last check: are you using WPQ or mclock? The mclock scheduler still has serious issues and switching to WPQ might help.

If none of these help, I'm out of ideas.
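As a sanity check on the fragment above, the buffer values can be parsed and verified to be the 64 MiB the comment claims (a throwaway sketch, not part of any tool; note that tcp_rmem/tcp_wmem are min/default/max triples, so only their last field is the 64 MiB cap):

```python
# A few of the sysctl lines from above, verbatim.
FRAGMENT = """\
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 22500 218450 67108864
net.ipv4.tcp_wmem = 22500 81920 67108864
"""

def parse_sysctl(text):
    """Parse sysctl-style 'key = value [value ...]' lines into int lists."""
    cfg = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        cfg[key.strip()] = [int(v) for v in value.split()]
    return cfg

cfg = parse_sysctl(FRAGMENT)
assert cfg["net.core.rmem_max"] == [64 * 1024 * 1024]
assert cfg["net.ipv4.tcp_rmem"][2] == 64 * 1024 * 1024  # max of the triple
```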
For us the Kioxia drives work like a charm; it's the pool that is easiest to manage and maintain, with super-fast recovery and really good sustained performance.

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: 서민우 Sent: Tuesday, May 21, 2024 11:25 AM To: Anthony D'Atri Cc: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Please discuss about Slow Peering

We used the "kioxia kcd6xvul3t20" model. Any infamous information about this model?

On Fri, May 17, 2024 at 2:58 AM, Anthony D'Atri <anthony.da...@gmail.com> wrote: If using jumbo frames, also ensure that they're consistently enabled on all OS instances and network devices.

> On May 16, 2024, at 09:30, Frank Schilder <fr...@dtu.dk> > wrote: > > This is a long shot: if you are using octopus, you might be hit by this > pglog-dup problem: > https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't > mention slow peering explicitly in the blog, but it's also a consequence > because the up+acting OSDs need to go through the PG_log during peering. > > We are also using octopus and I'm not sure if we have ever seen slow ops > caused by peering alone. It usually happens when a disk cannot handle load > under peering. We have, unfortunately, disks that show random latency spikes > (firmware update pending). You can try to monitor OPS latencies for your > drives when peering and look for something that sticks out. People on this > list were reporting quite bad results for certain infamous NVMe brands. If > you state your model numbers, someone else might recognize it.
> > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: 서민우 <smw940...@gmail.com> > Sent: Thursday, May 16, 2024 7:39 AM > To: ceph-users@ceph.io > Subject: [ceph-users] Please discuss about Slow Peering > > Env: > - OS: Ubuntu 20.04 > - Ceph Version: Octopus 15.0.0.1 > - OSD Disk: 2.9TB NVMe > - BlockStorage (Replication 3) > > Symptom: > - Peering when an OSD's node comes up is very slow. Peering speed varies from PG to > PG, and some PGs may even take 10 seconds. But there is no log for those 10 > seconds. > - I checked the effect on client VMs. Actually, slow queries of mysql > occur at the same time. > > There are Ceph OSD logs of both the best and worst cases. > > Best Peering Case (0.5 Seconds) > 2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8] > state: transitioning to Primary > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] > state: Peering, affected_by_map, going to Reset > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] > start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11], > acting_primary 7 -> 6, up_p
[ceph-users] Re: Please discuss about Slow Peering
This is a long shot: if you are using octopus, you might be hit by this pglog-dup problem: https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't mention slow peering explicitly in the blog, but it's also a consequence because the up+acting OSDs need to go through the PG_log during peering.

We are also using octopus and I'm not sure if we have ever seen slow ops caused by peering alone. It usually happens when a disk cannot handle load under peering. We have, unfortunately, disks that show random latency spikes (firmware update pending). You can try to monitor OPS latencies for your drives when peering and look for something that sticks out. People on this list were reporting quite bad results for certain infamous NVMe brands. If you state your model numbers, someone else might recognize it.

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: 서민우 Sent: Thursday, May 16, 2024 7:39 AM To: ceph-users@ceph.io Subject: [ceph-users] Please discuss about Slow Peering

Env: - OS: Ubuntu 20.04 - Ceph Version: Octopus 15.0.0.1 - OSD Disk: 2.9TB NVMe - BlockStorage (Replication 3)

Symptom: - Peering when an OSD's node comes up is very slow. Peering speed varies from PG to PG, and some PGs may even take 10 seconds. But there is no log for those 10 seconds. - I checked the effect on client VMs. Actually, slow queries of mysql occur at the same time.

There are Ceph OSD logs of both the best and worst cases.
Best Peering Case (0.5 Seconds) 2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8] state: transitioning to Primary 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] state: Peering, affected_by_map, going to Reset 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11], acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] state: transitioning to Primary 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11], acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting Worst Peering Case (11.6 Seconds) 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1], acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting *I wish to know about* - Why some PG's take 10 seconds until Peering finishes. - Why Ceph log is quiet during peering. - Is this symptom intended in Ceph. 
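For what it's worth, the "worst case" can be quantified directly from the log timestamps: the largest gap between consecutive events for pg[30.20] reproduces the quoted 11.6 seconds. A small sketch:

```python
# Measure a peering stall as the largest gap between consecutive pg log
# events. The three timestamps below are taken from the worst-case log
# excerpt above (timezone offset dropped for simplicity).
from datetime import datetime

def max_gap_seconds(timestamps):
    ts = [datetime.fromisoformat(t) for t in timestamps]
    return max((b - a).total_seconds() for a, b in zip(ts, ts[1:]))

worst = [
    "2024-04-11T15:32:45.169",  # start_peering_interval
    "2024-04-11T15:32:46.173",  # next epoch, transitioning to Stray
    "2024-04-11T15:32:57.794",  # quiet for ~11.6 s before this event
]
print(max_gap_seconds(worst))  # 11.621
```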
*And please give some advice,* - Is there any way to improve peering speed? - Or is there a way to not affect clients when peering occurs? P.S. - I checked the symptoms in the following environments: -> Octopus version, Reef version, cephadm, ceph-ansible ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503
I think you are panicking way too much. Chances are that you will never need that command, so don't get fussed out by an old post. Just follow what I wrote and, in the extremely rare case that recovery does not complete due to missing information, send an e-mail to this list and state that you still have the disk of the down OSD. Someone will send you the export/import commands within a short time. So stop worrying and just administer your cluster with common storage-admin sense.

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Mary Zhang Sent: Tuesday, April 30, 2024 5:00 PM To: Frank Schilder Cc: Eugen Block; ceph-users@ceph.io; Wesley Dillingham Subject: Re: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

Thank you Frank for sharing such valuable experience! I really appreciate it. We observe similar timelines: it took more than 1 week to drain our OSD. Regarding exporting PGs from a failed disk and injecting them back into the cluster, do you have any documentation? I found this online: Ceph.io — Incomplete PGs -- OH MY! <https://ceph.io/en/news/blog/2015/incomplete-pgs-oh-my/>, but I'm not sure whether it's the standard process. Thanks, Mary

On Tue, Apr 30, 2024 at 3:27 AM Frank Schilder <fr...@dtu.dk> wrote: Hi all, I second Eugen's recommendation. We have a cluster with large HDD OSDs where the following timings are found: - drain an OSD: 2 weeks. - down an OSD and let the cluster recover: 6 hours. The drain-OSD procedure is - in my experience - a complete waste of time; it actually puts your cluster at higher risk of a second failure (it's not guaranteed that the bad PG(s) is/are drained first) and also screws up all sorts of internal operations like scrubs etc. for an unnecessarily long time. The recovery procedure is much faster, because it uses all-to-all recovery while drain is limited to no more than max_backfills PGs at a time, and your broken disk sits much longer in the cluster.
On SSDs the "down OSD" method shows a similar speed-up factor. As a safety measure, don't destroy the OSD right away; wait for recovery to complete and only then destroy the OSD and throw away the disk. In case an error occurs during recovery, you can almost always still export PGs from a failed disk and inject them back into the cluster. This, however, requires taking disks out as soon as they show problems and before they fail hard. Keep a little bit of lifetime to have a chance to recover data. Look at the manual of ddrescue for why it is important to stop IO on a failing disk as soon as possible.

Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Eugen Block <ebl...@nde.ag> Sent: Saturday, April 27, 2024 10:29 AM To: Mary Zhang Cc: ceph-users@ceph.io; Wesley Dillingham Subject: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

If the rest of the cluster is healthy and your resiliency is configured properly, for example to sustain the loss of one or more hosts at a time, you don't need to worry about a single disk. Just take it out and remove it (forcefully) so it doesn't have any clients anymore. Ceph will immediately assign different primary OSDs and your clients will be happy again. ;-)

Zitat von Mary Zhang <maryzhang0...@gmail.com>: > Thank you Wesley for the clear explanation between the 2 methods! > The tracker issue you mentioned https://tracker.ceph.com/issues/44400 talks > about primary-affinity. Could primary-affinity help remove an OSD with > hardware issue from the cluster gracefully? > > Thanks, > Mary > > > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham > <w...@wesdillingham.com> > wrote: > >> What you want to do is to stop the OSD (and all its copies of data it >> contains) by stopping the OSD service immediately. The downside of this >> approach is it causes the PGs on that OSD to be degraded.
But the upside is >> the OSD which has bad hardware is immediately no longer participating in >> any client IO (the source of your RGW 503s). In this situation the PGs go >> into degraded+backfilling >> >> The alternative method is to keep the failing OSD up and in the cluster >> but slowly migrate the data off of it; this would be a long drawn-out >> period of time in which the failing disk would continue to serve client >> reads and also facilitate backfill, but you wouldn't take a copy of the data >> out of the cluster and cause degraded PGs. In this scenario the PGs would >> be remapped+backfilling >> >> I tried to find a way to have your cake and eat it too in relation to this >> "predicament" in this tracker issue: https://tracker.ceph.com/issues/4440
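As plain arithmetic, the speed-up reported in this thread (roughly 2 weeks to drain vs. 6 hours to down-and-recover on the same HDD cluster) works out as:

```python
# Drain is throttled by max_backfills on the draining OSD, while
# "down + recover" uses all-to-all recovery. With the timings quoted
# in this thread, the recovery path is ~56x faster.
drain_hours = 14 * 24   # ~2 weeks to drain an OSD
recover_hours = 6       # down the OSD and let the cluster recover
print(drain_hours / recover_hours)  # 56.0
```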
[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503
Hi all, I second Eugen's recommendation. We have a cluster with large HDD OSDs where the following timings are found:

- drain an OSD: 2 weeks.
- down an OSD and let the cluster recover: 6 hours.

The drain-OSD procedure is - in my experience - a complete waste of time; it actually puts your cluster at higher risk of a second failure (it's not guaranteed that the bad PG(s) is/are drained first) and also screws up all sorts of internal operations like scrubs etc. for an unnecessarily long time. The recovery procedure is much faster, because it uses all-to-all recovery while drain is limited to no more than max_backfills PGs at a time, and your broken disk sits much longer in the cluster. On SSDs the "down OSD" method shows a similar speed-up factor. As a safety measure, don't destroy the OSD right away; wait for recovery to complete and only then destroy the OSD and throw away the disk. In case an error occurs during recovery, you can almost always still export PGs from a failed disk and inject them back into the cluster. This, however, requires taking disks out as soon as they show problems and before they fail hard. Keep a little bit of lifetime to have a chance to recover data. Look at the manual of ddrescue for why it is important to stop IO on a failing disk as soon as possible.

Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Eugen Block Sent: Saturday, April 27, 2024 10:29 AM To: Mary Zhang Cc: ceph-users@ceph.io; Wesley Dillingham Subject: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

If the rest of the cluster is healthy and your resiliency is configured properly, for example to sustain the loss of one or more hosts at a time, you don't need to worry about a single disk. Just take it out and remove it (forcefully) so it doesn't have any clients anymore. Ceph will immediately assign different primary OSDs and your clients will be happy again.
;-) Zitat von Mary Zhang : > Thank you Wesley for the clear explanation between the 2 methods! > The tracker issue you mentioned https://tracker.ceph.com/issues/44400 talks > about primary-affinity. Could primary-affinity help remove an OSD with > hardware issue from the cluster gracefully? > > Thanks, > Mary > > > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham > wrote: > >> What you want to do is to stop the OSD (and all its copies of data it >> contains) by stopping the OSD service immediately. The downside of this >> approach is it causes the PGs on that OSD to be degraded. But the upside is >> the OSD which has bad hardware is immediately no longer participating in >> any client IO (the source of your RGW 503s). In this situation the PGs go >> into degraded+backfilling >> >> The alternative method is to keep the failing OSD up and in the cluster >> but slowly migrate the data off of it; this would be a long drawn-out >> period of time in which the failing disk would continue to serve client >> reads and also facilitate backfill, but you wouldn't take a copy of the data >> out of the cluster and cause degraded PGs. In this scenario the PGs would >> be remapped+backfilling >> >> I tried to find a way to have your cake and eat it too in relation to this >> "predicament" in this tracker issue: https://tracker.ceph.com/issues/44400 >> but it was deemed "won't fix". >> >> Respectfully, >> >> *Wes Dillingham* >> LinkedIn <http://www.linkedin.com/in/wesleydillingham> >> w...@wesdillingham.com >> >> >> >> >> On Fri, Apr 26, 2024 at 11:25 AM Mary Zhang >> wrote: >> >>> Thank you Eugen for your warm help! >>> >>> I'm trying to understand the difference between the 2 methods. >>> For method 1, or "ceph orch osd rm osd_id", OSD Service — Ceph >>> Documentation >>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd> >>> says >>> it involves 2 steps: >>> >>>1. >>> >>>evacuating all placement groups (PGs) from the OSD >>>2.
>>> >>>removing the PG-free OSD from the cluster >>> >>> For method 2, or the procedure you recommended, Adding/Removing OSDs — >>> Ceph >>> Documentation >>> < >>> https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual >>> > >>> says >>> "After the OSD has been taken out of the cluster, Ceph begins rebalancing >>> the cluster by migrating placement groups out of the OSD that was removed. >>> " >>> >>> What's the difference between "evacuatin
[ceph-users] Re: Latest Doco Out Of Date?
Hi Eugen, I would ask for a slight change here:

> If a client already has a capability for file-system name a and path > dir1, running fs authorize again for FS name a but path dir2, > instead of modifying the capabilities client already holds, a new > cap for dir2 will be granted

The formulation "a new cap for dir2 will be granted" is very misleading. I would also read it as saying that the new cap is in addition to the already existing cap. I tried to modify caps with fs authorize as well in the past, because it will set caps using pool tags and the docs sounded like it would allow modifying caps. In my case, I got the same error, thought that its implementation was buggy, and did it with the authtool. To be honest, when I look at the command sequence

ceph fs authorize a client.x /dir1 rw
ceph fs authorize a client.x /dir2 rw

and it goes through without error, I would expect the client to have both permissions as a result - no matter what the documentation says. There is no "revoke caps" instruction anywhere. Revoking caps in this way is a really dangerous side effect, and telling people to read the documentation about a command that should follow how other Linux tools manage permissions is not the best answer. There is something called parallelism in software engineering, and this command-line syntax violates it in a highly un-intuitive way. The intuition of the syntax clearly is that it *adds* capabilities; it's incremental. A command like this should follow how existing Linux tools work so that context switching will be easier for admins. Here, the choice of the term "authorize" seems to be unlucky.
A more explicit command that follows setfacl a bit could be ceph fs caps set a client.x /dir1 rw ceph fs caps modify a client.x /dir2 rw or even more parallel ceph fs setcaps a client.x /dir1 rw ceph fs setcaps -m a client.x /dir2 rw Such parallel syntax will not only avoid the reported confusion but also make it possible to implement a modify operation in the future without breaking stuff. And you can save time on the documentation, because it works like other stuff. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, April 24, 2024 9:02 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Latest Doco Out Of Date? Hi, I believe the docs [2] are okay, running 'ceph fs authorize' will overwrite the existing caps, it will not add more caps to the client: > Capabilities can be modified by running fs authorize only in the > case when read/write permissions must be changed. > If a client already has a capability for file-system name a and path > dir1, running fs authorize again for FS name a but path dir2, > instead of modifying the capabilities client already holds, a new > cap for dir2 will be granted To add more caps you'll need to use the 'ceph auth caps' command, for example: quincy-1:~ # ceph fs authorize cephfs client.usera /dir1 rw [client.usera] key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw== quincy-1:~ # ceph auth get client.usera [client.usera] key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw== caps mds = "allow rw fsname=cephfs path=/dir1" caps mon = "allow r fsname=cephfs" caps osd = "allow rw tag cephfs data=cephfs" quincy-1:~ # ceph auth caps client.usera mds 'allow rw fsname=cephfs path=/dir1, allow rw fsname=cephfs path=/dir2' mon 'allow r fsname=cephfs' osd 'allow rw tag cephfs data=cephfs' updated caps for client.usera quincy-1:~ # ceph auth get client.usera [client.usera] key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw== caps mds = "allow rw fsname=cephfs path=/dir1, allow rw fsname=cephfs path=/dir2" caps 
mon = "allow r fsname=cephfs" caps osd = "allow rw tag cephfs data=cephfs" Note that I don't actually have these directories in that cephfs, it's just to demonstrate, so you'll need to make sure your caps actually work. Thanks, Eugen [2] https://docs.ceph.com/en/latest/cephfs/client-auth/#changing-rw-permissions-in-caps Zitat von Zac Dover : > It's in my list of ongoing initiatives. I'll stay up late tonight > and ask Venky directly what's going on in this instance. > > Sometime later today, I'll create an issue tracking bug and I'll > send it to you for review. Make sure that I haven't misrepresented > this issue. > > Zac > > On Wednesday, April 24th, 2024 at 2:10 PM, duluxoz wrote: > >> Hi Zac, >> >> Any movement on this? We really need to come up with an >> answer/solution - thanks >> >> Dulux-Oz >> >> On 19/04/2024 18:03, duluxoz wrote: >> >>> Cool! >>> >>> Thanks for that :-) >>> >>> On 19/04/2024 18:01, Zac Dover wrote:
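To illustrate the distinction this thread is arguing about - overwrite-style "authorize" semantics vs. an incremental, setfacl-style update - here is a toy model. Plain dicts stand in for the mon's cap store; this is not Ceph code, and the function names are invented:

```python
# Toy model of the two semantics. "authorize" here models the
# surprising behaviour discussed above (existing path caps do not
# survive); "caps_modify" models the proposed incremental command.

def authorize(caps, path, perm):
    return {path: perm}             # overwrite: previous path caps are gone

def caps_modify(caps, path, perm):
    merged = dict(caps)
    merged[path] = perm             # incremental: existing path caps survive
    return merged

caps = {"/dir1": "rw"}
print(authorize(caps, "/dir2", "rw"))    # {'/dir2': 'rw'}
print(caps_modify(caps, "/dir2", "rw"))  # {'/dir1': 'rw', '/dir2': 'rw'}
```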
[ceph-users] (deep-)scrubs blocked by backfill
Hi all, I have a technical question about scrub scheduling. I replaced a disk and it is back-filling slowly. We have set osd_scrub_during_recovery = true and still observe that scrub times continuously increase (the number of PGs not scrubbed in time is continuously increasing).

Investigating the situation, it looks like any OSD that has a PG in states "backfill_wait" or "backfilling" is preventing scrubs from being scheduled on PGs it is a member of. However, it seems it is not quite like that. On the one hand, I have never seen a PG in a state like "active+scrubbing+remapped+backfilling", so backfilling PGs at least never seem to scrub. On the other hand, it seems like more PGs are scrubbed than would be eligible if *all* OSDs with a remapped PG on them would refuse scrubs. It looks like something in between "only OSDs with a backfilling PG block requests for scrub reservations" and "all OSDs with a PG in states backfilling or backfill_wait block requests for scrub reservations". Does the position in the backfill reservation queue play a role?

If anyone has insight into how scrub reservations are granted (and when not) in the situation of an OSD backfilling, that would be great. My naive interpretation of "osd_scrub_during_recovery = true" was that scrubs proceed as if no backfill was going on. This, however, is clearly not the case. Having an answer to my question above would help me a lot to get an idea when things will go back to normal.

Thanks a lot and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker
Question about HA here: I understood the documentation of the fuse NFS client such that the connection state of all NFS clients is stored on ceph in rados objects and, if using a floating IP, the NFS clients should just recover from a short network timeout. Not sure if this is what should happen with this specific HA set-up in the original request, but a fail-over of the NFS server ought to be handled gracefully by starting a new one up with the IP of the down one. Or not? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Tuesday, April 16, 2024 11:24 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker Ah, okay, thanks for the hint. In that case what I see is expected. Zitat von Robert Sander : > Hi, > > On 16.04.24 10:49, Eugen Block wrote: >> >> I believe I can confirm your suspicion, I have a test cluster on >> Reef 18.2.1 and deployed nfs without HAProxy but with keepalived [1]. >> Stopping the active NFS daemon doesn't trigger anything, the MGR >> notices that it's stopped at some point, but nothing else seems to >> happen. > > There is currently no failover for NFS. > > The ingress service (haproxy + keepalived) that cephadm deploys for > an NFS cluster does not have a health check configured. Haproxy does > not notice if a backend NFS server dies. This does not matter as > there is no failover and the NFS client cannot be "load balanced" to > another backend NFS server. > > There is no use to configure an ingress service currently without failover. > > The NFS clients have to remount the NFS share in case of their > current NFS server dies anyway. > > Regards > -- > Robert Sander > Heinlein Consulting GmbH > Schwedter Str. 8/9b, 10119 Berlin > > http://www.heinlein-support.de > > Tel: 030 / 405051-43 > Fax: 030 / 405051-19 > > Zwangsangaben lt. 
§35a GmbHG: > HRB 220009 B / Amtsgericht Berlin-Charlottenburg, > Geschäftsführer: Peer Heinlein -- Sitz: Berlin > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Performance improvement suggestion
>>> Fast write enabled would mean that the primary OSD sends #size copies to the >>> entire active set (including itself) in parallel and sends an ACK to the >>> client as soon as min_size ACKs have been received from the peers (including >>> itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow >>> for whatever reason) without suffering performance penalties immediately >>> (only after too many requests started piling up, which will show as a slow >>> requests warning). >>> >> What happens if an error occurs on the slowest OSD after the min_size >> ACK has already been sent to the client? >> >This should be no different from what exists today - unless, of course, the >error happens on the local/primary OSD

Can this be addressed with reasonable effort? I don't expect this to be a quick fix and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies: OSDs become randomly slow for whatever reason for short time intervals and then return to normal. A reason for this could be DB compaction; I think latency tends to spike during compaction. A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
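A toy model of the latency argument quoted above: with "fast write", the client ack follows the min_size-th fastest replica commit, so up to (size-min_size) stragglers add no client latency. The millisecond delays below are invented per-OSD commit times, with one spike standing in for e.g. a DB compaction stall:

```python
# Fast-write ack latency: the client is acked when the min_size-th
# fastest replica (including the primary) has committed. Today's
# behaviour waits for the whole active set, i.e. the slowest replica.

def ack_latency_ms(replica_delays_ms, min_size):
    return sorted(replica_delays_ms)[min_size - 1]

delays = [3, 4, 250, 5]                    # size=4 active set, one stalled OSD
print(ack_latency_ms(delays, min_size=2))  # fast write acks at 4 ms
print(max(delays))                         # waiting for all acks: 250 ms
```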
[ceph-users] Re: Performance improvement suggestion
Hi all, coming late to the party but want to ship in as well with some experience. The problem of tail latencies of individual OSDs is a real pain for any redundant storage system. However, there is a way to deal with this in an elegant way when using large replication factors. The idea is to use the counterpart of the "fast read" option that exists for EC pools and: 1) make this option available to replicated pools as well (is on the road map as far as I know), but also 2) implement an option "fast write" for all pool types. Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning). I have fast read enabled on all EC pools. This does increase the cluster-internal network traffic, which is nowadays absolutely no problem (in the good old 1G times it potentially would be). In return, the read latencies on the client side are lower and much more predictable. In effect, the user experience improved dramatically. I would really wish that such an option gets added as we use wide replication profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs) and exploiting large replication factors (more precisely, large (size-min_size)) to mitigate the impact of slow OSDs would be awesome. It would also add some incentive to stop the ridiculous size=2 min_size=1 habit, because one gets an extra gain from replication on top of redundancy. 
In the long run, the ceph write path should try to deal with a-priori known different-latency connections (fast local ACK with async remote completion, was asked for a couple of times), for example, for stretched clusters where one has an internal connection for the local part and external connections for the remote parts. It would be great to have similar ways of mitigating some penalties of the slow write paths to remote sites. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Peter Grandi Sent: Wednesday, February 21, 2024 1:10 PM To: list Linux fs Ceph Subject: [ceph-users] Re: Performance improvement suggestion > 1. Write object A from client. > 2. Fsync to primary device completes. > 3. Ack to client. > 4. Writes sent to replicas. [...] As mentioned in the discussion this proposal is the opposite of what the current policy, is, which is to wait for all replicas to be written before writes are acknowledged to the client: https://github.com/ceph/ceph/blob/main/doc/architecture.rst "After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully." A more revolutionary option would be for 'librados' to write in parallel to all the "active set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs. > So I think that to maintain any semblance of reliability, > you'd need to at least wait for a commit ack from the first > replica (i.e. min_size=2). Perhaps it could be similar to 'k'+'m' for EC, that is 'k' synchronous (write completes to the client only when all at least 'k' replicas, including primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2. 
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 6 pgs not deep-scrubbed in time
You will have to look at the output of "ceph df" and make a decision to balance "objects per PG" and "GB per PG". Increase the PG count most for the pools with the worst of these two numbers, such that it balances out as much as possible. If you have pools that see significantly more user IO than others, prioritise these. You will have to find out for your specific cluster; we can only give general guidelines. Make changes, run benchmarks, re-evaluate. Take the time for it. The better you know your cluster and your users, the better the end result will be.

Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Michel Niyoyita Sent: Monday, January 29, 2024 2:04 PM To: Janne Johansson Cc: Frank Schilder; E Taka; ceph-users Subject: Re: [ceph-users] Re: 6 pgs not deep-scrubbed in time

This is how it is set; if you suggest making some changes, please advise. Thank you.

ceph osd pool ls detail pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1407 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1393 flags hashpspool stripe_width 0 application rgw pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1394 flags hashpspool stripe_width 0 application rgw pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1395 flags hashpspool stripe_width 0 application rgw pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1396 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw pool 6 'volumes' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 108802 lfor 0/0/14812 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd removed_snaps_queue [22d7~3,11561~2,11571~1,11573~1c,11594~6,1159b~f,115b0~1,115b3~1,115c3~1,115f3~1,115f5~e,11613~6,1161f~c,11637~1b,11660~1,11663~2,11673~1,116d1~c,116f5~10,11721~c]
pool 7 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94609 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1399 flags hashpspool stripe_width 0 application rbd
pool 9 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 108783 lfor 0/561/559 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd removed_snaps_queue [3fa~1,3fc~3,400~1,402~1]
pool 10 'testbench' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 20931 lfor 0/20931/20929 flags hashpspool stripe_width 0
On Mon, Jan 29, 2024 at 2:09 PM Michel Niyoyita <mico...@gmail.com> wrote: Thank you Janne, no need to set some flags like "ceph osd set nodeep-scrub"? Thank you On Mon, Jan 29, 2024 at 2:04 PM Janne Johansson <icepic...@gmail.com> wrote: On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita <mico...@gmail.com> wrote: > > Thank you Frank , > > All disks are HDDs . Would like to know if I can increase the number of PGs > live in production without a negative impact to the cluster. if yes which > commands to use . Yes. "ceph osd pool set <pool> pg_num <value>" where the number usually should be a power of two that leads to a number of PGs per OSD between 100-200. -- May the most significant bit of your life be positive.
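Janne's rule of thumb -- a power-of-two pg_num that lands between 100 and 200 PGs per OSD -- can be sketched as a quick shell calculation. The numbers below (48 OSDs, replica 3, target of 150 PGs per OSD) are illustrative assumptions, not taken from Michel's cluster:

```shell
# PG budget across all pools: osds * target_pgs_per_osd / replication factor,
# then rounded down to a power of two for pg_num.
osds=48
replica=3
target_pgs_per_osd=150   # aim for the middle of the 100-200 range

budget=$(( osds * target_pgs_per_osd / replica ))

pg_num=1
while [ $(( pg_num * 2 )) -le "$budget" ]; do
  pg_num=$(( pg_num * 2 ))
done

echo "PG budget: $budget, suggested pg_num (power of two): $pg_num"
```

With these inputs the suggested pg_num is 2048, which after replication works out to about 128 PGs per OSD -- inside Janne's 100-200 window.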
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 6 pgs not deep-scrubbed in time
Setting osd_max_scrubs = 2 for HDD OSDs was a mistake I made. The result was that PGs needed a bit more than twice as long to deep-scrub. Net effect: high scrub load, much less user IO and, last but not least, the "not deep-scrubbed in time" problem got worse, because (2+eps)/2 > 1. For spinners you need to look at the actually available drive performance, plus a few more things like PG count, distribution etc. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Wesley Dillingham Sent: Monday, January 29, 2024 7:14 PM To: Michel Niyoyita Cc: Josh Baergen; E Taka; ceph-users Subject: [ceph-users] Re: 6 pgs not deep-scrubbed in time Respond back with "ceph versions" output. If your sole goal is to eliminate the "not scrubbed in time" errors you can increase the aggressiveness of scrubbing by setting: osd_max_scrubs = 2 The default in pacific is 1. If you are going to start tinkering manually with the pg_num you will want to turn off the pg autoscaler on the pools you are touching. Reducing the size of your PGs may make sense and help with scrubbing, but if the pool has a lot of data it will take a long, long time to finish. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Mon, Jan 29, 2024 at 10:08 AM Michel Niyoyita wrote: > I am running ceph pacific , version 16 , ubuntu 20 OS , deployed using > ceph-ansible. > > Michel > > On Mon, Jan 29, 2024 at 4:47 PM Josh Baergen > wrote: > > > Make sure you're on a fairly recent version of Ceph before doing this, > > though. > > > > Josh > > > > On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson > > wrote: > > > > > > On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita > wrote: > > > > > > > > Thank you Frank , > > > > > > > > All disks are HDDs . Would like to know if I can increase the number > > of PGs > > > > live in production without a negative impact to the cluster. if yes > > which > > > > commands to use . > > > > > > Yes.
"ceph osd pool set pg_num " > > > where the number usually should be a power of two that leads to a > > > number of PGs per OSD between 100-200. > > > > > > -- > > > May the most significant bit of your life be positive. > > > ___ > > > ceph-users mailing list -- ceph-users@ceph.io > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 6 pgs not deep-scrubbed in time
Hi Michel, are your OSDs HDD or SSD? If they are HDD, it's possible that they can't handle the deep-scrub load with default settings. In that case, have a look at this post https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/ for some basic tuning info and a script to check your scrub stamp distribution. You should also get rid of slow/failing disks. Look at smartctl output and throw out disks with remapped sectors, uncorrectable r/w errors or unusually many corrected read-write errors (assuming you have disks with ECC). A basic calculation for deep-scrub is as follows: the max number of PGs that can be scrubbed at the same time is A = #OSDs / replication factor (rounded down). Take B = the deep-scrub time per PG from the OSD logs (grep for deep-scrub) in minutes. Your pool can then deep-scrub at most A*24*(60/B) PGs per day. For reasonable operations you should not do more than 50% of that. With that you can calculate how many days it takes to deep-scrub all your PGs. A usual reason for slow deep-scrub progress is too few PGs. With replication factor 3 and 48 OSDs you have a PG budget of ca. 3200 (ca. 200/OSD) but use only 385. You should consider increasing the PG count for pools with lots of data. This should already relax the situation somewhat. Then do the calculation above and tune deep-scrub times per pool such that they match the disk performance. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michel Niyoyita Sent: Monday, January 29, 2024 7:42 AM To: E Taka Cc: ceph-users Subject: [ceph-users] Re: 6 pgs not deep-scrubbed in time Now they are increasing. Friday I tried deep-scrubbing manually and they were successfully done, but Monday morning I found that they have increased to 37. Is it best to deep-scrub manually while we are using the cluster? If not, what is the best thing to do in order to address that. Best Regards.
Michel

ceph -s
  cluster:
    id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
    health: HEALTH_WARN
            37 pgs not deep-scrubbed in time
  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
    mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
    osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
    rgw: 6 daemons active (6 hosts, 1 zones)
  data:
    pools:   10 pools, 385 pgs
    objects: 6.00M objects, 23 TiB
    usage:   151 TiB used, 282 TiB / 433 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep
  io:
    client: 265 MiB/s rd, 786 MiB/s wr, 3.87k op/s rd, 699 op/s wr

On Sun, Jan 28, 2024 at 6:14 PM E Taka <0eta...@gmail.com> wrote: > 22 is more often there than the others. Other operations may be blocked > because a deep-scrub is not finished yet. I would remove OSD 22, just to > be sure about this: ceph orch osd rm osd.22 > > If this does not help, just add it again. > > On Fri, Jan 26, 2024 at 08:05, Michel Niyoyita < > mico...@gmail.com> wrote: >> These seem to be different OSDs as shown here. How have you managed to >> sort this out?
>> >> ceph pg dump | grep -F 6.78 >> dumped all >> 6.78 44268 0 0 00 >> 1786796401180 0 10099 10099 >> active+clean 2024-01-26T03:51:26.781438+0200 107547'115445304 >> 107547:225274427 [12,36,37] 12 [12,36,37] 12 >> 106977'114532385 2024-01-24T08:37:53.597331+0200 101161'109078277 >> 2024-01-11T16:07:54.875746+0200 0 >> root@ceph-osd3:~# ceph pg dump | grep -F 6.60 >> dumped all >> 6.60 9 0 0 00 >> 179484338742 716 36 10097 10097 >> active+clean 2024-01-26T03:50:44.579831+0200 107547'153238805 >> 107547:287193139 [32,5,29] 32 [32,5,29] 32 >> 107231'152689835 2024-01-25T02:34:01.849966+0200 102171'147920798 >> 2024-01-13T19:44:26.922000+0200 0 >> 6.3a 44807 0 0 00 >> 1809690056940 0 10093 10093 >> active+clean 2024-01-26T03:53:28.837685+0200 107547'114765984 >> 107547:238170093 [22,13,11] 22 [22,13,11] 22 >> 106945'113739877 2024-01-24T04:10:17.224982+0200 102863'109559444 >> 2024-01-15T05:31:36.606478+0200 0 >> root@ceph-osd3:~# ceph pg dump | grep -F 6.5c >> 6.5c 44277 0 0 00 >> 1787649782300 0 10051 10051 >> active+clean 2024-01-26T03:55:23.339584+0200 107547'126480090 >> 107547:264432655 [22,37,30] 22 [22,37,30] 22 >> 107205'12
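Frank's deep-scrub budget from earlier in this thread can be sketched as below, using Michel's 48 OSDs and replication factor 3. The 30-minute deep-scrub time per PG is an assumed value -- take B from your own OSD logs (grep for deep-scrub):

```shell
# A = PGs that can deep-scrub in parallel; B = minutes per PG deep-scrub.
# Max throughput is A*24*(60/B) PGs per day; stay below 50% of that.
osds=48
replica=3
scrub_minutes=30   # B (assumed; read it from your OSD logs)

A=$(( osds / replica ))
per_day=$(( A * 24 * 60 / scrub_minutes ))
safe_per_day=$(( per_day / 2 ))

echo "parallel deep-scrubs: $A"
echo "max PGs/day: $per_day, recommended max: $safe_per_day"
```

With these assumptions: A=16, 768 PGs/day maximum, 384 at the recommended 50% -- so a cluster with 385 PGs would need roughly a day to cycle through all of them.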
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hi Özkan, > ... The client is actually at idle mode and there is no reason to fail at > all. ... If you re-read my message, you will notice that I wrote that it's not the client failing, it's a false-positive error flag that is not cleared for idle clients. You seem to encounter exactly this situation, and a simple echo 3 > /proc/sys/vm/drop_caches would probably have cleared the warning. There is nothing wrong with your client; it's an issue with the client-MDS communication protocol that is probably still under review. You will encounter these warnings every now and then until it's fixed. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hi, this message is one of those that are often spurious. I don't recall in which thread/PR/tracker I read it, but the story was something like this: if an MDS gets under memory pressure it will request dentry items back from *all* clients, not just the active ones or the ones holding many of them. If you have a client that's below the min-threshold for dentries (it's one of the client/MDS tuning options), it will not respond. This client will be flagged as not responding, which is a false positive. I believe the devs are working on a fix to get rid of these spurious warnings. There is a "bug/feature" in the MDS that does not clear this warning flag for inactive clients. Hence, the message hangs around and never disappears. I usually clear it with an "echo 3 > /proc/sys/vm/drop_caches" on the client. However, except for being annoying in the dashboard, it has no performance or other negative impact. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Friday, January 26, 2024 10:05 AM To: Özkan Göksu Cc: ceph-users@ceph.io Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6) Performance for small files is more about IOPS than throughput, and the IOPS in your fio tests look okay to me. What you could try is to split the PGs to get around 150 or 200 PGs per OSD. You're currently at around 60 according to the ceph osd df output. Before you do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data | head'? I don't need the whole output, just to see how many objects each PG has. We had a case once where that helped, but it was an older cluster and the pool was backed by HDDs and separate rocksDB on SSDs. So this might not be the solution here, but it could improve things as well. Zitat von Özkan Göksu : > Every user has a 1x subvolume and I only have 1 pool. > At the beginning we were using each subvolume for ldap home directory + > user data.
> When a user logged in to any docker on any host, it was using the cluster for > home, and for user-related data we had a second directory in the > same subvolume. > From time to time users experienced a very slow home environment, and after a > month it became almost impossible to use home. VNC sessions became > unresponsive and slow etc. > > 2 weeks ago, I had to migrate home to ZFS storage and now the overall > performance is better for user_data alone, without home. > But the performance is still not as good as I expected because of the > problems related to the MDS. > The usage is low but allocation is high, and CPU usage is high. You saw the > IO op/s, it's nothing, but allocation is high. > > I developed a fio benchmark script and ran it on 4x test servers at > the same time; the results are below: > Script: > https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh > > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt > > While running the benchmark, I took sample values for each type of iobench run.
> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr > client: 60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr > client: 13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr > > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr > client: 370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr > > Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr > client: 14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr > client: 6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr > > Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr > client: 2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr > client: 4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr > client: 2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr > > It seems I only have problems with the 4K,8K,16K other sector sizes. > > > > > Eugen Block , 25 Oca 2024 Per, 19:06 tarihinde şunu yazdı: > >> I understand that your MDS shows a high CPU usage, but other than that >> what is your performance issue? Do users
[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out
Hi, Hector also claims that he observed an incomplete acting set after *adding* an OSD. Assuming that the cluster was HEALTH_OK before that, this should not happen in theory. In practice it has been observed with certain definitions of crush maps. There is, for example, the issue with "choose" and "chooseleaf" not doing the same thing in situations where they should. Another one was that spurious (temporary) allocations of PGs could exceed hard limits without being obvious or reported at all. Without seeing the crush maps it's hard to tell what is going on. With just 3 hosts and 4 OSDs per host the cluster might be hitting corner cases with such a wide EC profile. Having the osdmap of the cluster in normal conditions would allow simulating OSD downs and ups off-line, and one might gain insight into why crush fails to compute a complete acting set (yes, I'm not talking about the up set, I was always talking about the acting set). There might also be an issue with the PG-/OSD-map logs tracking the full history of the PGs in question. A possible way to test is to issue a re-peer command after all peering has finished on a PG with an incomplete acting set, to see if this resolves the PG. If so, there is a temporary condition that prevents the PGs from becoming clean when going through the standard peering procedure. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, January 24, 2024 9:45 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Degraded PGs on EC pool when marking an OSD out Hi, this topic pops up every now and then, and although I don't have definitive proof for my assumptions I still stand with them. ;-) As the docs [2] already state, it's expected that PGs become degraded after some sort of failure (setting an OSD "out" falls into that category IMO): > It is normal for placement groups to enter “degraded” or “peering” > states after a component failure.
Normally, these states reflect the > expected progression through the failure recovery process. However, > a placement group that stays in one of these states for a long time > might be an indication of a larger problem. And you report that your PGs do not stay in that state but eventually recover. My understanding is as follows: PGs have to be recreated on different hosts/OSDs after setting an OSD "out". During this transition (peering) the PGs are degraded until the newly assigned OSDs have noticed their new responsibility (I'm not familiar with the actual data flow). The degraded state then clears as long as the out OSD is up (its PGs are active). If you stop that OSD ("down"), the PGs become and stay degraded until they have been fully recreated on different hosts/OSDs. I'm not sure what impacts the duration until the degraded state clears, but in my small test cluster (similar osd tree to yours) the degraded state clears after a few seconds only, but I only have a few (almost empty) PGs in the EC test pool. I guess a comment from the devs couldn't hurt to clear this up. [2] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups Zitat von Hector Martin : > On 2024/01/22 19:06, Frank Schilder wrote: >> You seem to have a problem with your crush rule(s): >> >> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12] >> >> If you really just took out 1 OSD, having 2xNONE in the acting set >> indicates that your crush rule can't find valid mappings. You might >> need to tune crush tunables: >> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs > > Look closely: that's the *acting* (second column) OSD set, not the *up* > (first column) OSD set. It's supposed to be the *previous* set of OSDs > assigned to that PG, but inexplicably some OSDs just "fall off" when the > PGs get remapped around. > > Simply waiting lets the data recover.
At no point are any of my PGs > actually missing OSDs according to the current cluster state, and CRUSH > always finds a valid mapping. Rather the problem is that the *previous* > set of OSDs just loses some entries some for some reason. > > The same problem happens when I *add* an OSD to the cluster. For > example, right now, osd.15 is out. This is the state of one pg: > > 14.3d 1044 0 0 00 > 157307567310 0 1630 0 1630 > active+clean 2024-01-22T20:15:46.684066+0900 15550'1630 > 15550:16184 [18,17,16,3,1,0,11,14,12] 18 > [18,17,16,3,1,0,11,14,12] 18 15550'1629 > 2024-01-22T20:15:46.683491+0900 0'0 > 2024-01-08T15:18:21.654679+0900 02 > periodic scrub scheduled @ 2024-01-31T
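Frank's re-peer test from earlier in this thread could look roughly like the sketch below. The PG id 14.3d is taken from Hector's output and would need adjusting; this requires a running cluster, so treat it as a CLI sketch rather than something to copy blindly:

```shell
# After peering has settled, force one affected PG to re-peer, then check
# whether its acting set is complete (no NONE entries).
ceph pg repeer 14.3d
ceph pg map 14.3d
```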
[ceph-users] List contents of stray buckets with octopus
Hi all, I need to list the contents of the stray buckets on one of our MDSes. The MDS reports 772674 stray entries. However, if I dump its cache and grep for stray I get only 216 hits. How can I get to the contents of the stray buckets? Please note that Octopus is still hit by https://tracker.ceph.com/issues/57059 so a "dump tree" will not work. In addition, I clearly don't just need the entries in cache, I need a listing of everything. How can I get that? I'm willing to run rados commands and pipe through ceph-dencoder if necessary. Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out
You seem to have a problem with your crush rule(s): 14.3d ... [18,17,16,3,1,0,NONE,NONE,12] If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune crush tunables: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs It is possible that your low OSD count causes the "crush gives up too soon" issue. You might also consider using a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise, it is not guaranteed that "... data remains available if a whole host goes down ..." because you might have 4 chunks on one of the hosts and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD). To test whether your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it without any risk of destroying anything. It allows you to try different crush rules and failure scenarios off-line on real cluster meta-data. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Hector Martin Sent: Friday, January 19, 2024 10:12 AM To: ceph-users@ceph.io Subject: [ceph-users] Degraded PGs on EC pool when marking an OSD out I'm having a bit of a weird issue with cluster rebalances with a new EC pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD). Until now I've been using an erasure coded k=5 m=3 pool for most of my data. I've recently started to migrate to a k=5 m=4 pool, so I can configure the CRUSH rule to guarantee that data remains available if a whole host goes down (3 chunks per host, 9 total). I also moved the 5,3 pool to this setup, although by nature I know its PGs will become inactive if a host goes down (need at least k+1 OSDs to be up).
I've only just started migrating data to the 5,4 pool, but I've noticed that any time I trigger any kind of backfilling (e.g. take one OSD out), a bunch of PGs in the 5,4 pool become degraded (instead of just misplaced/backfilling). This always seems to happen on that pool only, and the object count is a significant fraction of the total pool object count (it's not just "a few recently written objects while PGs were repeering" or anything like that, I know about that effect). Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6 crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14133 lfor 0/11307/11305 flags hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6 crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14509 lfor 0/0/14234 flags hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8

They both use the same CRUSH rule, which is designed to select 9 OSDs balanced across the hosts (of which only 8 slots get used for the older 5,3 pool):

rule hdd-ec-x3 {
    id 7
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 3 type host
    step choose indep 3 type osd
    step emit
}

If I take out an OSD (14), I get something like this:

health: HEALTH_WARN
    Degraded data redundancy: 37631/120155160 objects degraded (0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and the total object count is around 50k, so this
is *most* of the data in the pool becoming degraded just because I marked an OSD out (without stopping it). If I mark the OSD in again, the degraded state goes away. Example degraded PGs: # ceph pg dump | grep degraded dumped all 14.3c812 0 838 00 119250277580 0 1088 0 1088 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.786745+0900 15440'1088 15486:10772 [18,17,16,1,3,2,11,13,12] 18[18,17,16,1,3,2,11,NONE,12] 18 14537'432 2024-01-12T11:25:54.168048+0900 0'0 2024-01-08T15:18:21.654679+0900 02 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900 2410 14.3d772 0 1602 00 113032802230 0 1283 0 1283 active+recovery_wait+unders
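To try Frank's osdmaptool suggestion from the first message against this setup, a session could look roughly like the sketch below. The file names and pool id 14 are placeholders; everything operates on an offline copy, nothing here writes to the cluster:

```shell
# Grab the current osdmap and test CRUSH mappings offline.
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-pgs --pool 14

# Export the crush map, edit it, and re-test with the modified rule.
osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# ... edit /tmp/crush.txt ...
crushtool -c /tmp/crush.txt -o /tmp/crush.new
osdmaptool /tmp/osdmap --import-crush /tmp/crush.new --test-map-pgs --pool 14
```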
[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's
Hi, maybe this is related. On a system with many disks I also had aio problems causing OSDs to hang. Here it was the kernel parameter fs.aio-max-nr that was way too low by default. I bumped it to fs.aio-max-nr = 1048576 (sysctl/tuned) and OSDs came up right away. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Thursday, January 18, 2024 9:46 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Adding OSD's results in slow ops, inactive PG's I'm glad to hear (or read) that it worked for you as well. :-) Zitat von Torkil Svensgaard : > On 18/01/2024 09:30, Eugen Block wrote: >> Hi, >> >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.00 >> >> I don't think this is the right tool, it says: >> >>> --show-config-value Print the corresponding ceph.conf value >>> that matches the specified key. >>> Also searches >>> global defaults. >> >> I suggest to query the daemon directly: >> >> storage01:~ # ceph config set osd osd_max_pg_per_osd_hard_ratio 5 >> >> storage01:~ # ceph tell osd.0 config get osd_max_pg_per_osd_hard_ratio >> { >> "osd_max_pg_per_osd_hard_ratio": "5.00" >> } > > Copy that, verified to be 5 now. > >>> Daemons are running but those last OSDs won't come online. >>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to >>> make a difference. >> >> I don't have any good idea for that right now except what you >> already tried. Which values for bdev_aio_max_queue_depth have you >> tried? > > The previous value was 1024, I bumped it to 4096. > > A couple of the OSDs seemingly stuck on the aio thing has now come > to, so I went ahead and added the rest. Some of them came in right > away, some are stuck on the aio thing. Hopefully they will recover > eventually. > > Thanks you again for the osd_max_pg_per_osd_hard_ratio suggestion, > that seems to have solved the core issue =) > > Mvh. 
> > Torkil > >> >> Zitat von Torkil Svensgaard : >> >>> On 18/01/2024 07:48, Eugen Block wrote: >>>> Hi, >>>> >>>>> -3281> 2024-01-17T14:57:54.611+ 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs opened 750 pgs <--- >>>> >>>> I'd say that's close enough to what I suspected. ;-) Not sure why >>>> the "maybe_wait_for_max_pg" message isn't there but I'd give it a >>>> try with a higher osd_max_pg_per_osd_hard_ratio. >>> >>> Might have helped, not quite sure. >>> >>> I've set these since I wasn't sure which one was the right one?: >>> >>> " >>> ceph config dump | grep osd_max_pg_per_osd_hard_ratio >>> globaladvanced osd_max_pg_per_osd_hard_ratio >>> 5.00 osd advanced >>> osd_max_pg_per_osd_hard_ratio5.00 >>> " >>> >>> Restarted MONs and MGRs. Still getting this with ceph-conf though: >>> >>> " >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.00 >>> " >>> >>> I re-added a couple small SSD OSDs and they came in just fine. I >>> then added a couple HDD OSDs and they also came in after a bit of >>> aio_submit spam. I added a couple more and have now been looking >>> at this for 40 minutes: >>> >>> >>> " >>> ... >>> >>> 2024-01-18T07:42:01.789+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 10 >>> 2024-01-18T07:42:01.808+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 4 >>> 2024-01-18T07:42:01.819+ 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 82 >>> 2024-01-18T07:42:07.499+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 6 >>> 2024-01-18T07:42:07.542+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 8 >>> 2024-01-18T07:42:07.554+ 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 108 >>> ... >>> " >>> >&g
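Frank's fs.aio-max-nr bump from the start of this message, written out as a persistent sysctl fragment (the file name is just a suggestion):

```
# /etc/sysctl.d/90-ceph-aio.conf -- raise the kernel async-IO request limit
# so hosts with many OSDs don't run out of aio contexts at OSD start.
fs.aio-max-nr = 1048576
```

Apply with "sysctl --system", or at runtime with "sysctl -w fs.aio-max-nr=1048576".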
[ceph-users] Re: Performance impact of Heterogeneous environment
For the multi- vs. single-OSD per flash drive decision, the following might be useful: we found dramatic improvements using multiple OSDs per flash drive with octopus *if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one, and this thread effectively serializes otherwise async IO when saturated. There was a dev discussion about having more kv_sync_threads per OSD daemon by splitting up the rocks-dbs for PGs, but I don't know if this ever materialized. My guess is that for good NVMe drives it is possible that a single kv_sync_thread can saturate the device, in which case there will be no advantage to having more OSDs per device. On not so good drives (SATA/SAS flash) multi-OSD deployments are usually better, because the on-disk controller requires concurrency to saturate the drive. It's not possible to saturate usual SAS/SATA SSDs with iodepth=1. With good NVMe drives I have seen fio tests with direct IO saturate the drive with 4K random IO and iodepth=1. You need enough PCI lanes per drive for that, and I could imagine that here 1 OSD/drive is sufficient. For such drives, storage access quickly becomes CPU bound, so some benchmarking taking all system properties into account is required. If you are already CPU bound (too many NVMe drives per core; many standard servers with 24+ NVMe drives have that property) there is no point adding extra CPU load with more OSD daemons. Don't just look at single disks, look at the whole system. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Bailey Allison Sent: Thursday, January 18, 2024 12:36 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Performance impact of Heterogeneous environment +1 to this, great article and great research. Something we've been keeping a very close eye on ourselves. Overall we've mostly settled on the old keep-it-simple-stupid methodology with good results.
Especially as the benefits have become smaller the more recent your ceph version is. We have been rocking single OSD/NVMe, but as always everything is workload dependent and there is sometimes a need for doubling up. Regards, Bailey > -Original Message- > From: Maged Mokhtar > Sent: January 17, 2024 4:59 PM > To: Mark Nelson ; ceph-users@ceph.io > Subject: [ceph-users] Re: Performance impact of Heterogeneous > environment > > Very informative article you did Mark. > > IMHO if you find yourself with very high per-OSD core count, it may be logical > to just pack/add more nvmes per host, you'd be getting the best price per > performance and capacity. > > /Maged > > > On 17/01/2024 22:00, Mark Nelson wrote: > > It's a little tricky. In the upstream lab we don't strictly see an > > IOPS or average latency advantage with heavy parallelism by running > > multiple OSDs per NVMe drive until per-OSD core counts get very high. > > There does seem to be a fairly consistent tail latency advantage even > > at moderately low core counts however. Results are here: > > > > https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/ > > > > Specifically for jitter, there is probably an advantage to using 2 > > cores per OSD unless you are very CPU starved, but how much that > > actually helps in practice for a typical production workload is > > questionable imho. You do pay some overhead for running 2 OSDs per > > NVMe as well. > > > > > > Mark > > > > > > On 1/17/24 12:24, Anthony D'Atri wrote: > >> Conventional wisdom is that with recent Ceph releases there is no > >> longer a clear advantage to this. > >> > >>> On Jan 17, 2024, at 11:56, Peter Sabaini wrote: > >>> > >>> One thing that I've heard people do but haven't done personally with > >>> fast NVMes (not familiar with the IronWolf so not sure if they > >>> qualify) is partition them up so that they run more than one OSD > >>> (say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth.
> >>> See
> >>> https://ceph.com/community/bluestore-default-vs-tuned-performance-comparison/
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> >> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
> to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
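As a footnote to the iodepth=1 discussion at the top of this thread: the saturation test Frank describes can be expressed as a fio job file. This is only a sketch; the device path is a placeholder and must point at a disposable test device on your own system.

```ini
; Sketch of the iodepth=1 4K random-read saturation test mentioned above.
; /dev/nvme0n1 is a placeholder test device.
[global]
ioengine=libaio
direct=1
time_based
runtime=60

[qd1-randread]
filename=/dev/nvme0n1
rw=randread
bs=4k
iodepth=1
```

Comparing this against a second run with a high iodepth gives a rough idea of how much concurrency the drive needs: if qd1 already saturates the device, extra OSDs per drive mainly add CPU load.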
[ceph-users] Re: Recomand number of k and m erasure code
I would like to add a detail here that is often overlooked: maintainability under degraded conditions. For production systems I would recommend using EC profiles with at least m=3. The reason is that if you have a longer-lasting problem with a node that is down and m=2, it is not possible to do any maintenance on the system without losing write access. Don't trust what users claim they are willing to tolerate - at least get it in writing. Once a problem occurs they will be at your doorstep no matter what they said before. Similarly, when doing a longer maintenance task with m=2, any disk failure during maintenance implies losing write access. Having m=3 or larger allows 2 (or more) hosts/OSDs to be unavailable simultaneously while the service remains fully operational. That can be a life saver in many situations.

An additional reason for larger m is systematic failures of drives if your vendor doesn't mix drives from different batches and factories. If a batch has a systematic production error, failures are no longer statistically independent. In such a situation, if one drive fails, the likelihood that more drives fail at the same time is very high. Having a larger number of parity shards increases the chances of recovering from such events. For similar reasons I would recommend deploying 5 MONs instead of 3. My life got so much better after having the extra redundancy.

As some background: in our situation we experience(d) somewhat heavy maintenance operations, including modifying/updating ceph nodes (hardware, not software), exchanging racks, switches, cooling, power etc. This required longer downtime and/or moving of servers, and moving the ceph hardware was the easiest part compared with other systems due to the extra redundancy bits in it. We had no service outages during such operations.
Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Anthony D'Atri
Sent: Saturday, January 13, 2024 5:36 PM
To: Phong Tran Thanh
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Recomand number of k and m erasure code

There are nuances, but in general the higher the sum of k+m, the lower the performance, because *every* operation has to hit that many drives, which is especially impactful with HDDs. So there's a tradeoff between storage efficiency and performance. And as you've seen, larger parity groups especially mean slower recovery/backfill.

There's also a modest benefit to choosing values of m and k that have small prime factors, but I wouldn't worry too much about that. You can find EC efficiency tables on the net: https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html I should really add a table to the docs, making a note to do that. There's a nice calculator at the OSNEXUS site: https://www.osnexus.com/ceph-designer

The overhead factor is (k+m) / k. So for a 4,2 profile, that's 6 / 4 == 1.5. For 6,2, 8 / 6 = 1.33. For 10,2, 12 / 10 = 1.2, and so forth. As k increases, the incremental efficiency gain sees diminishing returns, but performance continues to decrease.

Think of m as the number of copies you can lose without losing data, and m-1 as the number you can lose / have down and still have data *available*. I also suggest that the number of failure domains - in your case this means OSD nodes - be *at least* k+m+1, so in your case you want k+m to be at most 9.

With RBD and many CephFS implementations, we mostly have relatively large RADOS objects that are striped over many OSDs. When using RGW especially, one should attend to average and median S3 object size. There's an analysis of the potential for space amplification in the docs so I won't repeat it here in detail. This sheet https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253 visually demonstrates this.
Basically, for an RGW bucket pool - or for a CephFS data pool storing unusually small objects - if you have a lot of S3 objects in the multi-KB size range, you waste a significant fraction of the underlying storage. This is exacerbated by EC, and the larger the sum of k+m, the more waste.

When people ask me about replication vs EC and the EC profile, the first question I ask is what they're storing. When EC isn't a non-starter, I tend to recommend 4,2 as a profile until / unless someone has specific needs and can understand the tradeoffs. This lets you store roughly 2x the data of 3x replication while not going overboard on the performance hit.

If you care about your data, do not set m=1. If you need to survive the loss of many drives, say if your cluster is across multiple buildings or sites, choose a larger value of m. There are people running profiles like 4,6 because they have unusual and specific needs.

> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh wrote:
>
> Hi ceph user!
>
> I need to determine whi
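The arithmetic above condenses into a few lines. A small sketch, assuming the common EC default of min_size = k+1 (check your own pool's min_size), that reproduces both the overhead factors and the "m-1 failures with data still available" rule of thumb:

```python
# EC overhead factor and failure tolerance, per the rules of thumb above.
# Assumes min_size = k + 1, the usual default -- verify for your pool.

def overhead(k, m):
    """Raw-to-usable overhead factor (k+m)/k."""
    return (k + m) / k

def tolerable_failures(k, m, min_size=None):
    """Shards that can be down while the pool still serves IO."""
    if min_size is None:
        min_size = k + 1
    return (k + m) - min_size  # equals m - 1 for the default min_size

for k, m in [(4, 2), (6, 2), (10, 2), (4, 5)]:
    print(f"k={k} m={m}: overhead {overhead(k, m):.2f}, "
          f"{tolerable_failures(k, m)} shard(s) may be down with IO intact")
```

This matches the figures in the message: 4,2 gives 1.50, 6,2 gives 1.33, 10,2 gives 1.20, and the unusual 4,6 profile stays available through 4 concurrent failures.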
[ceph-users] Re: 3 DC with 4+5 EC not quite working
Is it maybe this here: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon I always have to tweak the num-tries parameters.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Torkil Svensgaard
Sent: Friday, January 12, 2024 10:17 AM
To: Frédéric Nass
Cc: ceph-users@ceph.io; Ruben Vestergaard
Subject: [ceph-users] Re: 3 DC with 4+5 EC not quite working

On 12-01-2024 09:35, Frédéric Nass wrote:
>
> Hello Torkil,

Hi Frédéric

> We're using the same ec scheme as yours, with k=5 and m=4 over 3 DCs, with
> the below rule:
>
> rule ec54 {
>         id 3
>         type erasure
>         min_size 3
>         max_size 9
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step choose indep 0 type datacenter
>         step chooseleaf indep 3 type host
>         step emit
> }
>
> Works fine. The only difference I see with your EC rule is the fact that we
> set min_size and max_size, but I doubt this has anything to do with your
> situation.

Great, thanks. I wonder if we might need to tweak the min_size. I think I tried lowering it to no avail and then set it back to 5 after editing the crush rule.

> Since the cluster still complains about "Pool cephfs.hdd.data has 1024
> placement groups, should have 2048", did you run "ceph osd pool set
> cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set
> cephfs.hdd.data pg_num 2048"? [1]
>
> Might be that the pool still has 1024 PGs.

Hmm, coming from RHCS we didn't do this, as:

"
RHCS 4.x and 5.x does not require the pgp_num value to be set. This will be done by ceph-mgr automatically. For RHCS 4.x and 5.x, only the pg_num is required to be incremented for the necessary pools.
"

So I only did "ceph osd pool set cephfs.hdd.data pg_num 2048" and let the mgr handle the rest. I had a watch running to see how it went and the pool was up to something like 1922 PGs when it got stuck.
As I read the documentation[1] this shouldn't get us stuck like we did, but we would have to set the pgp_num eventually to get it to rebalance?

Mvh.

Torkil

[1] https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#setting-the-number-of-pgs

>
> Regards,
> Frédéric.
>
> [1] https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups
>
> -----Original Message-----
> From: Torkil
> To: ceph-users
> Cc: Ruben
> Sent: Friday, January 12, 2024 09:00 CET
> Subject: [ceph-users] 3 DC with 4+5 EC not quite working
>
> We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
> quite get it to work. Ceph version 17.2.7. These are the hosts (there
> will eventually be 6 hdd hosts in each datacenter):
>
> -33  886.00842  datacenter 714
>  -7  209.93135      host ceph-hdd1
> -69   69.86389      host ceph-flash1
>  -6  188.09579      host ceph-hdd2
>  -3  233.57649      host ceph-hdd3
> -12  184.54091      host ceph-hdd4
> -34  824.47168  datacenter DCN
> -73   69.86389      host ceph-flash2
>  -2  201.78067      host ceph-hdd5
> -81  288.26501      host ceph-hdd6
> -31  264.56207      host ceph-hdd7
> -36 1284.48621  datacenter TBA
> -77   69.86389      host ceph-flash3
> -21  190.83224      host ceph-hdd8
> -29  199.08838      host ceph-hdd9
> -11  193.85382      host ceph-hdd10
>  -9  237.28154      host ceph-hdd11
> -26  187.19536      host ceph-hdd12
>  -4  206.37102      host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
> plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
> crush-failure-domain=datacenter crush-device-class=hdd
>
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
> incomplete
>     pg 33.0 is creating+incomplete, acting
> [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
> cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
> 'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step chooseleaf indep 0 type datacenter
>         step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_ch
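For reference, the "CRUSH gives up too soon" fix Frank points to earlier in this thread amounts to raising the tries in the rule itself. A sketch based on Frédéric's ec54 rule from above; 200 is an illustrative value, not something tested on this cluster:

```
rule ec54 {
        id 3
        type erasure
        min_size 3
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 200       # raised from 100; illustrative value
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}
```

After recompiling the map, `crushtool -i <map> --test --rule 3 --num-rep 9 --show-bad-mappings` shows whether the extra tries remove the NONE entries before the map is injected back into the cluster.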
[ceph-users] Re: Rack outage test failing when nodes get integrated again
Hi Steve, I also observed that setting mon_osd_reporter_subtree_level to anything other than host leads to incorrect behavior. In our case, I actually observed the opposite. I had mon_osd_reporter_subtree_level=datacenter (we have 3 DCs in the crush tree). After cutting off a single host with ifdown - also a network cut-off, albeit not via firewall rules - I observed that not all OSDs on that host were marked down (neither was the host), leading to blocked IO. I didn't wait for very long (only a few minutes, less than 5), because it's a production system. I also didn't find the time to file a tracker issue.

I observed this with Mimic, but since you report it for Pacific I'm pretty sure it affects all versions. My guess is that this is not part of the CI testing, at least not in a way that covers network cut-offs.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Steve Baker
Sent: Thursday, January 11, 2024 8:45 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Rack outage test failing when nodes get integrated again

Hi, we're currently testing a ceph (v 16.2.14) cluster, 3 mon nodes, 6 osd nodes à 8 nvme ssd osds, distributed over 3 racks. Daemons are deployed in containers with cephadm / podman. We got 2 pools on it, one with 3x replication and min_size=2, one with an EC (k3m3). With 1 mon node and 2 osd nodes in each rack, the crush rules are configured in a way (for the 3x pool chooseleaf_firstn rack, for the ec pool choose_indep 3 rack / chooseleaf_indep 2 host) so that a full rack can go down while the cluster stays accessible for client operations. Other options we have set are mon_osd_down_out_subtree_limit=host, so that in case of a host/rack outage the cluster does not automatically start to backfill but will continue to run in a degraded state until human interaction comes to fix it. Also we set mon_osd_reporter_subtree_level=rack.
We tested - while under (synthetic test-)client load - what happens if we take a full rack (one mon node and 2 osd nodes) out of the cluster. We did that using iptables to block the nodes of the rack from the other nodes of the cluster (global and cluster network), as well as from the clients. As expected, the remainder of the cluster continues to run in a degraded state without starting any backfilling or recovery processes. All client requests get served while the rack is out.

But then a strange thing happens when we take the rack (1 mon node, 2 osd nodes) back into the cluster again by deleting all firewall rules with iptables -F at once. Some osds get integrated into the cluster again immediately, but some others remain in state "down" for exactly 10 minutes. These osds that stay down for the 10 minutes are in a state where they still seem unable to reach other osd nodes (see heartbeat_check logs below). After these 10 minutes have passed, these osds come up as well, but at exactly that time many PGs get stuck in peering state, other osds that were in the cluster the whole time get slow requests, and the cluster blocks client traffic (I think it's just the PGs stuck in peering soaking up all the client threads). Then, exactly 45 minutes after the nodes of the rack were made reachable again by iptables -F, the situation recovers, peering succeeds and client load gets handled again.

We have repeated this test several times and it's always exactly the same 10 min "down interval" and 45 min of affected client requests. When we integrate the nodes into the cluster again one after another with a delay of some minutes in between, this does not happen at all. I wonder what's happening there. It must be some kind of split-brain situation having to do with blocking the nodes using iptables but not rebooting them completely. The 10 min and 45 min intervals I described occur every time. For the 10 minutes, some osds stay down after the hosts got integrated again.
It's not all of the 16 osds from the 2 osd hosts that got integrated again, but just some of them. Which ones varies randomly. Sometimes it's also just one. We also observed: the longer the hosts were out of the cluster, the more osds are affected. Then, even after they come up again after 10 minutes, it takes another 45 minutes until the stuck peering situation resolves. Also during these 45 minutes, we see slow ops on osds that remained in the cluster. See here some OSD logs that get written after the reintegration:

2024-01-04T08:25:03.856+ 7f369132b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2024-01-04T07:25:03.860426+)
2024-01-04T08:25:06.556+ 7f3682882700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.0 down, but it is still running
2024-01-04T08:25:06.556+ 7f3682882700 0 log_channel(clus
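Given Frank's observation above, a conservative sketch of the two mon settings discussed in this thread (host-level reporting as he suggests, plus Steve's down-out limit). The values are illustrative, not a tested recommendation:

```ini
# ceph.conf sketch -- illustrative values only
[mon]
# keep failure reporting at host granularity; higher subtree levels
# showed incorrect down-marking behaviour in both reports in this thread
mon_osd_reporter_subtree_level = host
# don't auto-backfill on host outages; wait for an operator
mon_osd_down_out_subtree_limit = host
```

Both options can also be changed at runtime via `ceph config set mon ...` on cephadm-managed clusters.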
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
Quick answers:

* ... osd_deep_scrub_randomize_ratio ... but not on Octopus: is it still a valid parameter?

Yes, this parameter exists and can be used to prevent premature deep-scrubs. The effect is dramatic.

* ... essentially by playing with osd_scrub_min_interval, ...

The main parameter is actually osd_deep_scrub_randomize_ratio; all other parameters have less effect in terms of scrub load. osd_scrub_min_interval is the second most important parameter and needs increasing for large SATA-/NL-SAS HDDs. For sufficiently fast drives the default of 24h is good (although it might be a bit aggressive/paranoid).

* Another small question: you opt for osd_max_scrubs=1 just to make sure your I/O is not adversely affected by scrubbing, or is there a more profound reason for that?

Well, not affecting user IO too much is a quite profound reason, and many admins try to avoid scrubbing at all when users are on the system. It makes IO somewhat unpredictable and can trigger user complaints. However, there is another profound reason: for HDDs it increases deep-scrub load (that is, interference with user IO) a lot while it actually slows down the deep-scrubbing. HDDs can't handle the implied random IO of concurrent deep-scrubs well. On my system I saw that with osd_max_scrubs=2 the scrub time for a PG slightly more than doubled. In other words: more scrub load, less scrub progress = useless; do not do this.

I plan to document the script a bit more and am waiting for some deep-scrub histograms to converge to equilibrium. This takes months for our large pools, but I would like to have the numbers for an example of how it should look.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Fulvio Galeazzi
Sent: Monday, January 8, 2024 4:21 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

Hallo Frank, just found this post, thank you!
I have also been puzzled/struggling with scrub/deep-scrub and found your post very useful: will give this a try, soon. One thing, first: I am using Octopus, too, but I cannot find any documentation about osd_deep_scrub_randomize_ratio. I do see that in past releases, but not on Octopus: is it still a valid parameter? Let me check whether I understood your procedure: you optimize scrub time distribution essentially by playing with osd_scrub_min_interval, thus "forcing" the automated algorithm to preferentially select older-scrubbed PGs, am I correct? Another small question: you opt for osd_max_scrubs=1 just to make sure your I/O is not adversely affected by scrubbing, or is there a more profound reason for that? Thanks! Fulvio On 12/13/23 13:36, Frank Schilder wrote: > Hi all, > > since there seems to be some interest, here some additional notes. > > 1) The script is tested on octopus. It seems that there was a change in the > output of ceph commands used and it might need some tweaking to get it to > work on other versions. > > 2) If you want to give my findings a shot, you can do so in a gradual way. > The most important change is setting osd_deep_scrub_randomize_ratio=0 (with > osd_max_scrubs=1), this will make osd_deep_scrub_interval work exactly as the > requested osd_deep_scrub_min_interval setting, PGs with a deep-scrub stamp > younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the > one change to test, all other settings have less impact. The script will not > report some numbers at the end, but the histogram will be correct. Let it run > a few deep-scrub-interval rounds until the histogram is evened out. > > If you start your test after using osd_max_scrubs>1 for a while -as I did - > you will need a lot of patience and might need to mute some scrub warnings > for a while. > > 3) The changes are mostly relevant for large HDDs that take a long time to > deep-scrub (many small objects). 
> The overall load reduction, however, is useful in general.
>
> Best regards,
> =====
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Fulvio Galeazzi
GARR-Net Department
tel.: +39-334-6533-250
skype: fgaleazzi70

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
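Pulling the settings from this thread together, the tuning might be sketched in ceph.conf form as below. The values mirror the example configuration Frank posted on this list; treat this as a starting point to adapt to your drive sizes and load, not a recommendation:

```ini
# ceph.conf sketch of the scrub tuning discussed in this thread
# (values from Frank's example; adjust to your hardware)
[osd]
osd_max_scrubs = 1                      # concurrent (deep-)scrubs per OSD
osd_deep_scrub_randomize_ratio = 0.0    # no premature deep-scrubs
osd_scrub_interval_randomize_ratio = 0.363636
osd_scrub_backoff_ratio = 0.9319
```

The per-pool intervals (scrub_min_interval, scrub_max_interval, deep_scrub_interval) are set separately with `ceph osd pool set <pool> <option> <seconds>`.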
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
Hi all, another quick update: please use this link to download the script: https://github.com/frans42/ceph-goodies/blob/main/scripts/pool-scrub-report The one I sent originally does not track the latest version.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
Hi all, since there seems to be some interest, here some additional notes.

1) The script is tested on Octopus. It seems that there was a change in the output of the ceph commands used, and it might need some tweaking to get it to work on other versions.

2) If you want to give my findings a shot, you can do so in a gradual way. The most important change is setting osd_deep_scrub_randomize_ratio=0 (with osd_max_scrubs=1); this will make osd_deep_scrub_interval work exactly like the requested osd_deep_scrub_min_interval setting: PGs with a deep-scrub stamp younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the one change to test; all other settings have less impact. The script will not report some numbers at the end, but the histogram will be correct. Let it run a few deep-scrub-interval rounds until the histogram has evened out.

If you start your test after using osd_max_scrubs>1 for a while - as I did - you will need a lot of patience and might need to mute some scrub warnings for a while.

3) The changes are mostly relevant for large HDDs that take a long time to deep-scrub (many small objects). The overall load reduction, however, is useful in general.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: increasing number of (deep) scrubs
Yes, octopus.

-- Frank
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Szabo, Istvan (Agoda)
Sent: Wednesday, December 13, 2023 6:13 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: increasing number of (deep) scrubs

Hi,

You are on octopus right?

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Frank Schilder
Sent: Tuesday, December 12, 2023 7:33 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: increasing number of (deep) scrubs

Email received from the internet. If in doubt, don't click any link nor open any attachment !

Hi all, if you follow this thread, please see the update in "How to configure something like osd_deep_scrub_min_interval?" (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/). I found out how to tune the scrub machine, and I posted a quick update in the other thread, because the solution was not to increase the number of scrubs but to tune parameters.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Frank Schilder
Sent: Monday, January 9, 2023 9:14 AM
To: Dan van der Ster
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Dan, thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer-grained way of controlling scrubbing than just doubling or tripling it right away.

Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used, and I already had to increase the warning interval to 2 weeks. With all the warning grace parameters this means that we manage to deep-scrub everything about every month.
I need to plan for 75% utilisation, and a 3-month period is a bit far on the risky side. Our data is to a large percentage cold data. Client reads will not do the check for us; we need to combat bit-rot pro-actively.

The reasons I'm interested in parameters initiating more scrubs, while also converting more scrubs into deep scrubs, are:

1) Scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing".

2) I suspect the low deep-scrub count is due to a low number of deep-scrubs scheduled, not due to conflicting per-OSD deep-scrub reservations. With the OSD count we have and the distribution over 12 servers, I would expect at least a peak of 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD, it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up.

For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit. If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a deep scrub with 75% full OSDs in about 30 days. This is the current tail-time with 25% utilisation. I believe currently a deep scrub of a PG in these pools takes 2-3 hours. It's just a gut feeling from some repair and deep-scrub commands; I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further, and not the only, option to push for more deep scrubbing. My expectation is that values of 2-3 are fine due to the increasingly higher percentage of cold data, for which no interference with client IO will happen.
Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it? With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11 OSDs, so at a minimum it could take 3-4x the amount of time to scrub the data than if those were replicated pools. If you want the scrub to complete in t
[ceph-users] Re: increasing number of (deep) scrubs
Hi all, if you follow this thread, please see the update in "How to configure something like osd_deep_scrub_min_interval?" (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/). I found out how to tune the scrub machine and I posted a quick update in the other thread, because the solution was not to increase the number of scrubs, but to tune parameters. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Frank Schilder Sent: Monday, January 9, 2023 9:14 AM To: Dan van der Ster Cc: ceph-users@ceph.io Subject: Re: [ceph-users] increasing number of (deep) scrubs Hi Dan, thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer grained way of controlling scrubbing than just doubling or tripling it right away. Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to increase the warning interval already to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation and a 3 months period is a bit far on the risky side. Our data is to a large percentage cold data. Client reads will not do the check for us, we need to combat bit-rot pro-actively. The reasons I'm interested in parameters initiating more scrubs while also converting more scrubs into deep scrubs are, that 1) scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing". 2) I suspect the low deep-scrub count is due to a low number of deep-scrubs scheduled and not due to conflicting per-OSD deep scrub reservations. 
With the OSD count we have and the distribution over 12 servers, I would expect at least a peak of 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD, it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up.

For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit. If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a deep scrub with 75% full OSDs in about 30 days. This is the current tail-time with 25% utilisation. I believe currently a deep scrub of a PG in these pools takes 2-3 hours. It's just a gut feeling from some repair and deep-scrub commands; I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further, and not the only, option to push for more deep scrubbing. My expectation is that values of 2-3 are fine due to the increasingly higher percentage of cold data, for which no interference with client IO will happen.

Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it? With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11 OSDs, so at a minimum it could take 3-4x the amount of time to scrub the data than if those were replicated pools.
If you want the scrub to complete in time, you need to increase the number of scrub slots accordingly. On the other hand, IMHO the 1-week deadline for deep scrubs is often much too ambitious for large clusters -- increasing the scrub intervals is one solution, or I find it simpler to increase mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio until you find a ratio that works for your cluster.

Of course, all of this can impact detection of bit-rot, which anyway can be covered by client reads if most data is accessed periodically. But if the cluster is mostly idle or objects are generally not read, then it would be preferable to increase the osd_max_scrubs slots.

Cheers, Dan

On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they
> are not scrubbed as often as I would like. It looks like too few scr
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
d since 4 intervals ( 96h) [1 scrubbing]
 31%  619 PGs not deep-scrubbed since  5 intervals (120h)
 39%  660 PGs not deep-scrubbed since  6 intervals (144h)
 47%  726 PGs not deep-scrubbed since  7 intervals (168h) [1 scrubbing]
 57%  743 PGs not deep-scrubbed since  8 intervals (192h)
 65%  727 PGs not deep-scrubbed since  9 intervals (216h) [1 scrubbing]
 73%  656 PGs not deep-scrubbed since 10 intervals (240h)
 75%  107 PGs not deep-scrubbed since 11 intervals (264h)
 82%  626 PGs not deep-scrubbed since 12 intervals (288h)
 90%  588 PGs not deep-scrubbed since 13 intervals (312h)
 94%  388 PGs not deep-scrubbed since 14 intervals (336h) [1 scrubbing]
 96%  129 PGs not deep-scrubbed since 15 intervals (360h) 2 scrubbing+deep
 98%  207 PGs not deep-scrubbed since 16 intervals (384h) 2 scrubbing+deep
 99%   79 PGs not deep-scrubbed since 17 intervals (408h) 1 scrubbing+deep
100%   10 PGs not deep-scrubbed since 18 intervals (432h) 2 scrubbing+deep
8192 PGs out of 8192 reported, 0 missing, 7 scrubbing+deep, 0 unclean.

con-fs2-data2 scrub_min_interval=66h (11i/84%/625PGs÷i)
con-fs2-data2 scrub_max_interval=168h (7d)
con-fs2-data2 deep_scrub_interval=336h (14d/~89%/~520PGs÷d)
osd.338 osd_scrub_interval_randomize_ratio=0.363636 scrubs start after: 66h..90h
osd.338 osd_deep_scrub_randomize_ratio=0.00
osd.338 osd_max_scrubs=1
osd.338 osd_scrub_backoff_ratio=0.931900 rec. this pool: .9319 (class hdd, size 11)
mon.ceph-01 mon_warn_pg_not_scrubbed_ratio=0.50 warn: 10.5d (42.0i)
mon.ceph-01 mon_warn_pg_not_deep_scrubbed_ratio=0.75 warn: 24.5d

Best regards, merry Christmas and a happy new year to everyone!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
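A quick sanity check of the listing above: with 8192 PGs and deep_scrub_interval=336h, the required average deep-scrub rate works out as below, and the pool's observed ~520 PGs/day is indeed roughly the quoted ~89% of it (this reading of the ÷d notation is my interpretation):

```python
# Deep-scrub rate sanity check for the pool listing above.
pgs = 8192
deep_scrub_interval_h = 336          # 14 days

required_per_day = pgs / (deep_scrub_interval_h / 24)
print(f"required: {required_per_day:.0f} PG deep-scrubs/day")   # ~585

observed_per_day = 520               # the ~520PGs/d figure in the listing
print(f"coverage: {observed_per_day / required_per_day:.0%}")   # ~89%
```

Falling short of the required rate is what makes the histogram's tail grow past the configured interval.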
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, I will update the case. I'm afraid this will have to wait a little bit though. I'm too occupied for a while and also don't have a test cluster that would help speed things up. I will update you, please keep the tracker open. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Tuesday, December 5, 2023 1:58 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent Frank, By using your script I still couldn't reproduce it. Locally my python version is 3.9.16, and I didn't have other VMs to test python other versions. Could you check the tracker to provide the debug logs ? Thanks - Xiubo On 12/1/23 21:08, Frank Schilder wrote: > Hi Xiubo, > > I uploaded a test script with session output showing the issue. When I look > at your scripts, I can't see the stat-check on the second host anywhere. > Hence, I don't really know what you are trying to compare. > > If you want me to run your test scripts on our system for comparison, please > include the part executed on the second host explicitly in an ssh-command. > Running your scripts alone in their current form will not reproduce the issue. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Xiubo Li > Sent: Monday, November 27, 2023 3:59 AM > To: Frank Schilder; Gregory Farnum > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent > > > On 11/24/23 21:37, Frank Schilder wrote: >> Hi Xiubo, >> >> thanks for the update. I will test your scripts in our system next week. >> Something important: running both scripts on a single client will not >> produce a difference. You need 2 clients. The inconsistency is between >> clients, not on the same client. For example: > Frank, > > Yeah, I did this with 2 different kclients. 
> > Thanks > >> Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs >> >> Test 1 >> - on host1: execute shutil.copy2 >> - execute ls -l /mnt/kcephfs/ on host1 and host2: same result >> >> Test 2 >> - on host1: shutil.copy >> - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 >> while correct on host 1 >> >> Your scripts only show output of one host, but the inconsistency requires >> two hosts for observation. The stat information is updated on host1, but not >> synchronized to host2 in the second test. In case you can't reproduce that, >> I will append results from our system to the case. >> >> Also it would be important to know the python and libc versions. We observe >> this only for newer versions of both. >> >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Xiubo Li >> Sent: Thursday, November 23, 2023 3:47 AM >> To: Frank Schilder; Gregory Farnum >> Cc: ceph-users@ceph.io >> Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent >> >> I just raised one tracker to follow this: >> https://tracker.ceph.com/issues/63510 >> >> Thanks >> >> - Xiubo >> >> > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: EC Profiles & DR
Hi, the post linked in the previous message is a good source for different approaches. To provide some first-hand experience, I was operating a pool with a 6+2 EC profile on 4 hosts for a while (until we got more hosts) and the "subdivide a physical host into 2 crush-buckets" approach is actually working best (I basically tried all the approaches described in the linked post and they all had pitfalls). The procedure is more or less:
- add a second (logical) host bucket for each physical host by suffixing the host name with "-B" (ceph osd crush add-bucket )
- move half the OSDs per host to this new host bucket (ceph osd crush move osd.ID host=HOSTNAME-B)
- make this location persist across OSD reboots (ceph config set osd.ID crush_location "host=HOSTNAME-B")
This will allow you to move OSDs back easily when you get more hosts and can afford the recommended 1 shard per host. It will also show which and where OSDs are moved to with a simple "ceph config dump | grep crush_location". Best of all, you don't have to fiddle around with crush maps and hope they do what you want. Just use failure domain host and you are good. No more than 2 host buckets per physical host means no more than 2 shards per physical host with default placement rules. I was operating this set-up with min_size=6 and feeling bad about it due to the reduced maintainability (risk of data loss during maintenance). It's not great really, but sometimes there is no way around it. I was happy when I got the extra hosts. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Curt Sent: Wednesday, December 6, 2023 3:56 PM To: Patrick Begou Cc: ceph-users@ceph.io Subject: [ceph-users] Re: EC Profiles & DR Hi Patrick, Yes K and M are chunks, but the default crush map is a chunk per host, which is probably the best way to do it, but I'm no expert. I'm not sure why you would want to do a crush map with 2 chunks per host and min size 4 as it's just asking for trouble at some point, in my opinion. 
Anyway, take a look at this post if you're interested in doing 2 chunks per host; it will give you an idea of the crushmap setup: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NB3M22GNAC7VNWW7YBVYTH6TBZOYLTWA/ . Regards, Curt ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
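The split-host procedure described above can be written out end to end. This is a sketch with placeholder names (node1, osd.1, root=default are assumptions; adjust to your topology and try on a test cluster first):

```shell
# Split physical host "node1" into two logical host buckets (sketch,
# placeholder names; verify each step with `ceph osd tree` before the next).
ceph osd crush add-bucket node1-B host               # second logical host bucket
ceph osd crush move node1-B root=default             # place it in the hierarchy
ceph osd crush move osd.1 host=node1-B               # move half the OSDs over
ceph config set osd.1 crush_location "host=node1-B"  # survive OSD restarts
ceph config dump | grep crush_location               # verify which OSDs moved
```

With failure domain host and default placement rules, this caps the pool at 2 shards per physical host, and moving the OSDs back later is just the reverse `crush move` plus removing the config entries.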
[ceph-users] ceph df reports incorrect stats
74.27673 host ceph-09
-23 1075.67920 host ceph-10
-15 1067.16492 host ceph-11
-25 1080.21912 host ceph-12
-83 1061.17480 host ceph-13
-85 1047.70276 host ceph-14
-87 1079.02820 host ceph-15
-136 1012.55048 host ceph-16
-139 1073.61475 host ceph-17
-261 1125.57202 host ceph-23
-262 1054.32227 host ceph-24
-148 885.49133 datacenter MultiSite
-65 86.16304 host ceph-04
-67 101.50623 host ceph-05
-69 104.85805 host ceph-06
-71 96.39923 host ceph-07
-81 97.54230 host ceph-18
-94 98.48271 host ceph-19
-4 97.20181 host ceph-20
-64 99.77657 host ceph-21
-66 103.56137 host ceph-22
-49 885.49133 datacenter ServerRoom
-55 885.49133 room SR-113
-65 86.16304 host ceph-04
-67 101.50623 host ceph-05
-69 104.85805 host ceph-06
-71 96.39923 host ceph-07
-81 97.54230 host ceph-18
-94 98.48271 host ceph-19
-4 97.20181 host ceph-20
-64 99.77657 host ceph-21
-66 103.56137 host ceph-22
-1 0 root default
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [ext] CephFS pool not releasing space after data deletion
Hi Mathias, have you made any progress on this? Did the capacity become available eventually? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Kuhring, Mathias Sent: Friday, October 27, 2023 3:52 PM To: ceph-users@ceph.io; Frank Schilder Subject: Re: [ext] [ceph-users] CephFS pool not releasing space after data deletion Dear ceph users, We are wondering if this might be the same issue as with this bug: https://tracker.ceph.com/issues/52581 Except that we seem to have snapshots dangling on the old pool, while the bug report has snapshots dangling on the new pool. But maybe it's both? I mean, once the global root layout was pointed to a new pool, the new pool became in charge of snapshotting at least the new data, right? What about data which is overwritten? Is there a conflict of responsibility? We do have similar listings of snaps with "ceph osd pool ls detail", I think:
0|0[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 803558 lfor 0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 1 application cephfs
removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 803558 lfor 0/87229/87229 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application cephfs
removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10 min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192 
autoscale_mode off last_change 803558 lfor 0/0/681917 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768 application cephfs
removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root layout. Pool hdd_ec is the one which was assigned before and which won't release space (at least as far as I know). Is this removed_snaps_queue the same as removed_snaps in the bug issue (i.e. the label was renamed)? And is it normal that all queues list the same info, or should this be different per pool? Might this be related to pools now sharing responsibility over some snaps due to layout changes? And for the big question: How can I actually trigger/speed up the removal of those snaps? I find the removed_snaps/removed_snaps_queue mentioned a few times in the user list, but never with a conclusive answer on how to deal with them. And the only mentions in the docs are just change logs. I also looked into and started cephfs stray scrubbing: https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub But according to the status output, no scrubbing is actually active. I would appreciate any further ideas. Thanks a lot. Best Wishes, Mathias On 10/23/2023 12:42 PM, Kuhring, Mathias wrote: > Dear Ceph users, > > Our CephFS is not releasing/freeing up space after deleting hundreds of > terabytes of data. > By now, this drives us in a "nearfull" osd/pool situation and thus > throttles IO. > > We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) > quincy (stable). > > Recently, we moved a bunch of data to a new pool with better EC. > This was done by adding a new EC pool to the FS. > Then assigning the FS root to the new EC pool via the directory layout xattr > (so all new data is written to the new pool). > And finally copying old data to new folders. 
> > I swapped the data as follows to retain the old directory structure. > I also made snapshots for validation purposes. > > So basically: > cp -r mymount/mydata/ mymount/new/ # this creates the copy on the new pool > mkdir mymount/mydata/.snap/tovalidate > mkdir mymount/new/mydata/.snap/tovalidate > mv mymount/mydata/ mymount/old/ > mv mymount/new/mydata mymount/ > > I could see the increase of data in the new pool as expected (ceph df). > I compared the snapshots with hashdeep to make sure the new data is alright. > > Then I went ahead deleting the old data, basically: > rmdir mymount/old/mydata/.snap/* # this also included a bunch of other > older snapshots > rm -r mymount/old/mydata > > At first we had a bunch of PGs with snaptrim/snaptrim_wait. > But they are done for quite
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, I uploaded a test script with session output showing the issue. When I look at your scripts, I can't see the stat-check on the second host anywhere. Hence, I don't really know what you are trying to compare. If you want me to run your test scripts on our system for comparison, please include the part executed on the second host explicitly in an ssh-command. Running your scripts alone in their current form will not reproduce the issue. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, November 27, 2023 3:59 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent On 11/24/23 21:37, Frank Schilder wrote: > Hi Xiubo, > > thanks for the update. I will test your scripts in our system next week. > Something important: running both scripts on a single client will not produce > a difference. You need 2 clients. The inconsistency is between clients, not > on the same client. For example: Frank, Yeah, I did this with 2 different kclients. Thanks > Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs > > Test 1 > - on host1: execute shutil.copy2 > - execute ls -l /mnt/kcephfs/ on host1 and host2: same result > > Test 2 > - on host1: shutil.copy > - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while > correct on host 1 > > Your scripts only show output of one host, but the inconsistency requires two > hosts for observation. The stat information is updated on host1, but not > synchronized to host2 in the second test. In case you can't reproduce that, I > will append results from our system to the case. > > Also it would be important to know the python and libc versions. We observe > this only for newer versions of both. 
> > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Xiubo Li > Sent: Thursday, November 23, 2023 3:47 AM > To: Frank Schilder; Gregory Farnum > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent > > I just raised one tracker to follow this: > https://tracker.ceph.com/issues/63510 > > Thanks > > - Xiubo > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, thanks for the update. I will test your scripts in our system next week. Something important: running both scripts on a single client will not produce a difference. You need 2 clients. The inconsistency is between clients, not on the same client. For example: Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs Test 1 - on host1: execute shutil.copy2 - execute ls -l /mnt/kcephfs/ on host1 and host2: same result Test 2 - on host1: shutil.copy - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while correct on host 1 Your scripts only show output of one host, but the inconsistency requires two hosts for observation. The stat information is updated on host1, but not synchronized to host2 in the second test. In case you can't reproduce that, I will append results from our system to the case. Also it would be important to know the python and libc versions. We observe this only for newer versions of both. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Thursday, November 23, 2023 3:47 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent I just raised one tracker to follow this: https://tracker.ceph.com/issues/63510 Thanks - Xiubo ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
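The cross-host part of the test above needs two kclients, but the mechanism behind the difference can be shown on a single machine: shutil.copy writes data and permission bits only, while shutil.copy2 also copies file metadata (mtime etc.), which results in a different metadata-update pattern that CephFS then has to propagate between clients. A self-contained local sketch (plain local filesystem, not CephFS):

```python
import os
import shutil
import tempfile
import time

def compare_copy_mtimes():
    """Return (mtime preserved by shutil.copy, mtime preserved by shutil.copy2)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.bin")
        with open(src, "wb") as f:
            f.write(b"x" * 4096)
        backdated = time.time() - 3600
        os.utime(src, (backdated, backdated))  # pretend the file is an hour old

        plain = os.path.join(d, "copy.bin")
        meta = os.path.join(d, "copy2.bin")
        shutil.copy(src, plain)   # data + permission bits only
        shutil.copy2(src, meta)   # data + metadata (mtime, etc.)

        return (abs(os.stat(plain).st_mtime - backdated) < 1,
                abs(os.stat(meta).st_mtime - backdated) < 1)

print(compare_copy_mtimes())  # (False, True): only copy2 preserves the mtime
```

The extra utime/setattr call made by copy2 is the plausible reason the two code paths behave differently across clients; the actual cross-host stat check still requires the two-host setup described in the message.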
[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered
Hi Denis, I have to ask a clarifying question. If I understand the intent of osd_fast_fail_on_connection_refused correctly, an OSD that receives a connection_refused should get marked down fast to avoid unnecessarily long wait times. And *only* OSDs that receive connection refused. In your case, did booting up the server actually create a network route for all other OSDs to the wrong network as well? In other words, did it act as a gateway so that all OSDs received connection refused messages and not just the ones on the critical host? If so, your observation would be expected. If not, then there is something wrong with the down reporting that should be looked at. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Friday, November 24, 2023 1:20 PM To: Denis Krienbühl; Burkhard Linke Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered Hi Denis. > The mon then propagates that failure, without taking any other reports into > consideration: Exactly. I cannot imagine that this change of behavior is intended. The configs on OSD down reporting ought to be honored in any failure situation. Since you already investigated the relevant code lines, please update/create the tracker with your findings. Hope a dev looks at this. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Denis Krienbühl Sent: Friday, November 24, 2023 12:04 PM To: Burkhard Linke Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered > On 24 Nov 2023, at 11:49, Burkhard Linke > wrote: > > This should not be case in the reported situation unless setting > osd_fast_fail_on_connection_refused<https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused>=true > changes this behaviour. In our tests it does change the behavior. 
Usually the mons take mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In our tests, this is the case if an OSD heartbeat is dropped and the OSD is still able to talk to the mons. However, if the OSD heartbeat is rejected, in our case because of an unrelated firewall change, the OSD sends an immediate failure to the mon: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434 The mon then propagates that failure, without taking any other reports into consideration: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367 This is fine when a single OSD goes down and everything else is okay. It then has the intended effect of getting rid of the OSD fast. The assumption presumably being: If a host can answer with a rejection to the OSD heartbeat, it is only the OSD that is affected. In our case however, a network change caused rejections from an entirely different host (a gateway), while a network path to the mons was still available. In this case, Ceph does not apply the safe-guards it usually does. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
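The safeguard being bypassed can be illustrated with a toy model. This is a simplified sketch of the behavior described above, not actual Ceph code: ordinary failure reports must come from at least mon_osd_min_down_reporters distinct reporters (deduplicated at mon_osd_reporter_subtree_level), while an immediate failure report, as generated on ECONNREFUSED when osd_fast_fail_on_connection_refused=true, bypasses the threshold entirely.

```python
# Toy model (not Ceph code) of the mon's down-marking logic described above.

def should_mark_down(reporter_subtrees, min_down_reporters=2, immediate=False):
    """reporter_subtrees: subtree names (e.g. hosts) that reported the OSD down."""
    if immediate:
        # osd_fast_fail_on_connection_refused path: a single report suffices
        return True
    # normal path: require enough *distinct* reporting subtrees
    return len(set(reporter_subtrees)) >= min_down_reporters

# Dropped packets: reports from one misconfigured host alone are not enough.
print(should_mark_down(["bad-host"]))                  # False
print(should_mark_down(["bad-host", "host-b"]))        # True
# Refused packets: the single bad host takes the OSD down immediately.
print(should_mark_down(["bad-host"], immediate=True))  # True
```

This captures why rejection (rather than a drop) from one misconfigured gateway host could take down every OSD: the immediate path never consults the reporter count.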
[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered
Hi Denis. > The mon then propagates that failure, without taking any other reports into > consideration: Exactly. I cannot imagine that this change of behavior is intended. The configs on OSD down reporting ought to be honored in any failure situation. Since you already investigated the relevant code lines, please update/create the tracker with your findings. Hope a dev looks at this. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Denis Krienbühl Sent: Friday, November 24, 2023 12:04 PM To: Burkhard Linke Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered > On 24 Nov 2023, at 11:49, Burkhard Linke > wrote: > > This should not be case in the reported situation unless setting > osd_fast_fail_on_connection_refused<https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused>=true > changes this behaviour. In our tests it does change the behavior. Usually the mons take mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In our tests, this is the case if an OSD heartbeat is dropped and the OSD is still able to talk to the mons. However, if the OSD heartbeat is rejected, in our case because of an unrelated firewall change, the OSD sends an immediate failure to the mon: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434 ceph/src/osd/OSD.cc at febfdd83a7838338033486826ef1fc9a5e8d588e · ceph/ceph github.com The mon then propagates that failure, without taking any other reports into consideration: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367 ceph/src/mon/OSDMonitor.cc at febfdd83a7838338033486826ef1fc9a5e8d588e · ceph/ceph github.com This is fine when a single OSD goes down and everything else is okay. It then has the intended effect of getting rid of the OSD fast. 
The assumption presumably being: If a host can answer with a rejection to the OSD heartbeat, it is only the OSD that is affected. In our case however, a network change caused rejections from an entirely different host (a gateway), while a network path to the mons was still available. In this case, Ceph does not apply the safe-guards it usually does. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered
Hi Denis, I would agree with you that a single misconfigured host should not take out healthy hosts under any circumstances. I'm not sure if your incident is actually covered by the devs comments, it is quite possible that you observed an unintended side effect that is a bug in handling the connection error. I think the intention is to shut down fast the OSDs with connection refused (where timeouts are not required) and not other OSDs. A bug report with tracker seems warranted. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Denis Krienbühl Sent: Friday, November 24, 2023 9:01 AM To: ceph-users Subject: [ceph-users] Full cluster outage when ECONNREFUSED is triggered Hi We’ve recently had a serious outage at work, after a host had a network problem: - We rebooted a single host in a cluster of fifteen hosts across three racks. - The single host had a bad network configuration after booting, causing it to send some packets to the wrong network. - One network still worked and offered a connection to the mons. - The other network connection was bad. Packets were refused, not dropped. - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the mons to take all other OSDs down (immediate failure). - Only after shutting down the faulty host, was it possible to start the shut down OSDs, to restore the cluster. We have since solved the problem by removing the default route that caused the packets to end up in the wrong network, where they were summarily rejected by a firewall. That is, we made sure that packets would be dropped in the future, not rejected. Still, I figured I’ll send this experience of ours to this mailing list, as this seems to be something others might encounter as well. 
In the following PR, which introduced osd_fast_fail_on_connection_refused, there’s this description: > This changeset adds additional handler (handle_refused()) to the dispatchers > and code that detects when connection attempt fails with ECONNREFUSED error > (connection refused) which is a clear indication that host is alive, but > daemon isn't, so daemons can instantly mark the other side as undoubtly > downed without the need for grace timer. And this comment: > As for flapping, we discussed it on ceph-devel ml > and came to conclusion that it requires either broken firewall or network > configuration to cause this, and these are more serious issues that should > be resolved first before worrying about OSDs flapping (either way, flapping > OSDs could be good for getting someone's attention). https://github.com/ceph/ceph/pull/8558 It has left us wondering if these are the right assumptions. An ECONNREFUSED condition can bring down a whole cluster, and I wonder if there should be some kind of safe-guard to ensure that this is avoided. One badly configured host should generally not be able to do that, and if the packets are dropped, instead of refused, the cluster notices that the OSD down reports come only from one host, and acts accordingly. What do you think? Does this warrant a change in Ceph? I’m happy to provide details and create a ticket. Cheers, Denis ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported"
There are some unhandled race conditions in the MDS cluster in rare circumstances. We had this issue with mimic and octopus and it went away after manually pinning sub-dirs to MDS ranks; see https://docs.ceph.com/en/nautilus/cephfs/multimds/?highlight=dir%20pin#manually-pinning-directory-trees-to-a-particular-rank. This has the added advantage that one can bypass the internal load-balancer, which was horrible for our work loads. I have a related post about ephemeral pinning on this list one-two years ago. You should be able to find it. Short story: after manually pinning all user directories to ranks, all our problems disappeared and performance improved a lot. MDS load dropped from 130% average to 10-20%. So did memory consumption and cache recycling. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, November 22, 2023 12:30 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported" Hi, we've seen this a year ago in a Nautilus cluster with multi-active MDS as well. It turned up only once within several years and we decided not to look too closely at that time. How often do you see it? Is it reproducable? In that case I'd recommend to create a tracker issue. Regards, Eugen Zitat von zxcs : > HI, Experts, > > we are using cephfs with 16.2.* with multi active mds, and > recently, we have two nodes mount with ceph-fuse due to the old os > system. > > and one nodes run a python script with `glob.glob(path)`, and > another client doing `cp` operation on the same path. > > then we see some log about `mds slow request`, and logs complain > “failed to authpin, subtree is being exported" > > then need to restart mds, > > > our question is, does there any dead lock? how can we avoid this > and how to fix it without restart mds(it will influence other users) ? > > > Thanks a ton! 
> > > xz > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
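For reference, the manual pinning described above is done with an extended attribute on the directory. A sketch with placeholder mount point, path, and rank:

```shell
# Pin a directory tree to MDS rank 0 (path and rank are placeholders).
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/users/alice

# Inspect the current pin; -v -1 removes the pin again.
getfattr -n ceph.dir.pin /mnt/cephfs/users/alice
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/users/alice
```

Pinning each top-level user directory to a fixed rank this way bypasses the internal balancer, which is what made the export/authpin races disappear in the setup described above.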
[ceph-users] Re: How to use hardware
Hi Simon, we are using something similar for ceph-fs. For a backup system your setup can work, depending on how you back up. While HDD pools have poor IOP/s performance, they are very good for streaming workloads. If you are using something like Borg backup that writes huge files sequentially, a HDD back-end should be OK. Here some things to consider and try out: 1. You really need to get a bunch of enterprise SSDs with power loss protection for the FS meta data pool (disable write cache if enabled, this will disable volatile write cache and switch to protected caching). We are using (formerly Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to raise performance. Place the meta-data pool and the primary data pool on these disks. Create a secondary data pool on the HDDs and assign it to the root *before* creating anything on the FS (see the recommended 3-pool layout for ceph file systems in the docs). I would not even consider running this without SSDs. 1 such SSD per host is the minimum, 2 is better. If Borg or whatever can make use of a small fast storage directory, assign a sub-dir of the root to the primary data pool. 2. Calculate with sufficient extra disk space. As long as utilization stays below 60-70% bluestore will try to make large object writes sequential, which is really important for HDDs. On our cluster we currently have 40% utilization and I get full HDD bandwidth out for large sequential reads/writes. Make sure your backup application makes large sequential IO requests. 3. As Anthony said, add RAM. You should go for 512G on 50 HDD-nodes. You can run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use ephemeral pinning. Depending on the CPUs you are using, 48 cores can be plenty. The latest generation Intel Xeon Scalable Processors is so efficient with ceph that 1HT per HDD is more than enough. 4. 3 MON+MGR nodes are sufficient. You can do something else with the remaining 2 nodes. 
Of course, you can use them as additional MON+MGR nodes. We also use 5 and it improves maintainability a lot. Something more exotic if you have time: 5. To improve sequential performance further, you can experiment with larger min_alloc_sizes for OSDs (at creation time; you will need to scrap and re-deploy the cluster to test different values). Every HDD has a preferred IO-size for which random IO achieves nearly the same bandwidth as sequential writes. (But see 7.) 6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is the preferred IO size (usually between 256K-1M). (But see 7.) 7. Important: large min_alloc_sizes are only good if your workload *never* modifies files, but only replaces them. A bit like a pool without EC overwrite enabled. The implementation of EC overwrites has a "feature" that can lead to massive allocation amplification. If your backup workload does modifications to files instead of adding new+deleting old, do *not* experiment with options 5.-7. Instead, use the default and make sure you have sufficient unused capacity to increase the chances for large bluestore writes (keep utilization below 60-70% and just buy extra disks). A workload with large min_alloc_sizes has to be S3-like, only upload, download and delete are allowed. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Anthony D'Atri Sent: Saturday, November 18, 2023 3:24 PM To: Simon Kepp Cc: Albert Shih; ceph-users@ceph.io Subject: [ceph-users] Re: How to use hardware Common motivations for this strategy include the lure of unit economics and RUs. Often ultra dense servers can’t fill racks anyway due to power and weight limits. Here the osd_memory_target would have to be severely reduced to avoid oomkilling. Assuming the OSDs are top load LFF HDDs with expanders, the HBA will be a bottleneck as well. I’ve suffered similar systems for RGW. 
All the clever juggling in the world could not override the math, and the solution was QLC. “We can lose 4 servers” Do you realize that your data would then be unavailable ? When you lose even one, you will not be able to restore redundancy and your OSDs likely will oomkill. If you’re running CephFS, how are you provisioning fast OSDs for the metadata pool? Are the CPUs high-clock for MDS responsiveness? Even given the caveats this seems like a recipe for at best disappointment. At the very least add RAM. 8GB per OSD plus ample for other daemons. Better would be 3x normal additional hosts for the others. > On Nov 17, 2023, at 8:33 PM, Simon Kepp wrote: > > I know that your question is regarding the service servers, but may I ask, > why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6) > (= 50 OSDs per node)? > This is possible to do, but sounds like the nodes were designe
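The chunk-size arithmetic behind point 6 above is simple: with a k+m EC profile, each object is split into k data chunks, so the per-OSD write size is object_size / k. A hedged sketch (the 6+2 comparison line is my addition for contrast):

```python
# Per-OSD chunk size for a k+m erasure-coded pool: each object is split
# into k data chunks, so larger k means smaller (and less HDD-friendly) IO.

def ec_chunk_size(object_size, k):
    return object_size // k

M = 1024 * 1024
print(ec_chunk_size(4 * M, 4) // 1024, "KiB")  # 4+2 profile -> 1024 KiB
print(ec_chunk_size(4 * M, 6) // 1024, "KiB")  # 6+2 profile -> 682 KiB
```

This is why a 4+2 pool with 4M objects lands exactly in the 256K-1M window many HDDs prefer, as the message notes.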
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
to the time interval for which about 70% of PGs were scrubbed (leaving 30% eligible), the allocation of deep-scrub states is much, much better. I expect both tails to get shorter and the overall deep-scrub load to go down as well. I hope to reach a state where I only need to issue a few deep-scrubs manually per day to get everything scrubbed within 1 week and deep-scrubbed within 3-4 weeks. For now I will wait and see what effect the global settings have on the SSD pools and what the HDD pool converges to. This will need 1-2 months of observation and I will report back when significant changes show up. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Wednesday, November 15, 2023 11:14 AM To: ceph-users@ceph.io Subject: [ceph-users] How to configure something like osd_deep_scrub_min_interval? Hi folks, I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely cause of why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see code) to find out how (deep-) scrub times are distributed. Output below. What I expected is along the lines that HDD-OSDs try to scrub every 1-3 days, while they try to deep-scrub every 7-14 days. In other words, OSDs that have been deep-scrubbed within the last 7 days would *never* be in scrubbing+deep state. However, what I see is completely different. There seems to be no distinction between scrub and deep-scrub start times. This is really unexpected, as nobody would try to deep-scrub HDDs every day. Weekly to bi-weekly is normal, specifically for large drives. Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?
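The interval bucketing the script computes boils down to ceil((now - last_stamp) / interval) — here is a minimal Python sketch of the same computation (made-up timestamps, equivalent to the jq `ceil` expression in the script further below):

```python
import math
from collections import Counter

SCRUB_INTERVAL = 6 * 3600  # the report counts 6h scrub intervals (24h for deep-scrub)

def intervals_since(last_stamp: float, now: float, interval: int = SCRUB_INTERVAL) -> int:
    """Whole (deep-)scrub intervals elapsed since the last stamp, rounded up."""
    return math.ceil((now - last_stamp) / interval)

now = 1_700_000_000  # hypothetical "now" (epoch seconds)
last_scrub_stamps = [now - 3 * 3600, now - 7 * 3600, now - 13 * 3600]  # made-up stamps
hist = Counter(intervals_since(s, now) for s in last_scrub_stamps)
print(sorted(hist.items()))  # [(1, 1), (2, 1), (3, 1)]
```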
The observed behavior is very unusual for RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and too high deep-scrub IO load when such a large percentage of OSDs is needlessly deep-scrubbed again after only 1-6 days. Sample output:

# scrub-report
dumped pgs

Scrub report:
   4121 PGs not scrubbed since  1 intervals (6h)
   3831 PGs not scrubbed since  2 intervals (6h)
   4012 PGs not scrubbed since  3 intervals (6h)
   3986 PGs not scrubbed since  4 intervals (6h)
   2998 PGs not scrubbed since  5 intervals (6h)
   1488 PGs not scrubbed since  6 intervals (6h)
    909 PGs not scrubbed since  7 intervals (6h)
    771 PGs not scrubbed since  8 intervals (6h)
    582 PGs not scrubbed since  9 intervals (6h) 2 scrubbing
    431 PGs not scrubbed since 10 intervals (6h)
    333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
    265 PGs not scrubbed since 12 intervals (6h)
    195 PGs not scrubbed since 13 intervals (6h)
    116 PGs not scrubbed since 14 intervals (6h)
     78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
     72 PGs not scrubbed since 16 intervals (6h)
     37 PGs not scrubbed since 17 intervals (6h)
      5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
     33 PGs not scrubbed since 20 intervals (6h)
     23 PGs not scrubbed since 21 intervals (6h)
     16 PGs not scrubbed since 22 intervals (6h)
     12 PGs not scrubbed since 23 intervals (6h)
      8 PGs not scrubbed since 24 intervals (6h)
      2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
      4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
      5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
      6 PGs not scrubbed since 28 intervals (6h)
      2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
      3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
      1 PGs not scrubbed since 32 intervals (6h) 19.133f*
      1 PGs not scrubbed since 33 intervals (6h) 19.1103*
      3 PGs not scrubbed since 36
intervals (6h) 19.19cc* 19.12f4* 19.248*
      1 PGs not scrubbed since 39 intervals (6h) 19.1984*
      1 PGs not scrubbed since 41 intervals (6h) 14.449*
      1 PGs not scrubbed since 44 intervals (6h) 19.179f*

Deep-scrub report:
   3723 PGs not deep-scrubbed since 1 intervals (24h)
   4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
   3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
   2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
   1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
   1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
   1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
   1304 PGs not deep-scrubbed since 8 interva
[ceph-users] How to configure something like osd_deep_scrub_min_interval?
ce 21 intervals (24h) 19.1322* 19.10f6*
      1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
      1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*

PGs marked with a * are on busy OSDs and not eligible for scrubbing.

The script (pasted here because attaching doesn't work):

# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.

ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""

T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '
    .pg_stats[] | [
        .pgid,
        (.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
        (.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
        .state,
        (.acting | join(" "))
    ] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"

#  1      2           3                4       5..NF
#  pg_id  scrub-ints  deep-scrub-ints  status  acting[]
awk <<<"$scrub_info" '{
    for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
    if($4 == "active+clean") {
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
    } else if($4 ~ /scrubbing\+deep/) {
        deep_scrubbing[$3]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else if($4 ~ /scrubbing/) {
        scrubbing[$2]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else {
        unclean[$2]++
        unclean_d[$3]++
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    }
}
END {
    print "Scrub report:"
    for(si=1; si<=si_mx; ++si) {
        if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
        printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
        if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
        if(unclean[si]) printf(" %d unclean", unclean[si])
        if(pg_sn[si]<=5) {
            split(pg_sn_ids[si], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "Deep-scrub report:"
    for(dsi=1; dsi<=dsi_mx; ++dsi) {
        if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
        printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
        if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
        if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
        if(pg_dsn[dsi]<=5) {
            split(pg_dsn_ids[dsi], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'

Don't forget the last "'" when copy-pasting. Thanks for any pointers. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph fs (meta) data inconsistent
>>> It looks like the cap update request was dropped to the ground in MDS. >>> [...] >>> If you can reproduce it, then please provide the mds logs by setting: >>> [...] >> I can do a test with MDS logs on high level. Before I do that, looking at >> the python >> findings above, is this something that should work on ceph or is it a python >> issue? > > Not sure yet. I need to understand what exactly shutil.copy does in kclient. Thanks! Will wait for further instructions. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Xiubo Li Sent: Friday, November 10, 2023 3:14 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent On 11/10/23 00:18, Frank Schilder wrote: > Hi Xiubo, > > I will try to answer questions from all your 3 e-mails here together with > some new information we have. > > New: The problem occurs in newer python versions when using the shutil.copy > function. There is also a function shutil.copy2 for which the problem does > not show up. Copy2 behaves a bit like "cp -p" while copy is like "cp". The > only code difference (linux) between these 2 functions is that copy calls > copyfile+copymode while copy2 calls copyfile+copystat. For now we asked our > users to use copy2 to avoid the issue. > > The copyfile function calls _fastcopy_sendfile on linux, which in turn calls > os.sendfile, which seems to be part of libc: > > #include > ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count); > > I'm wondering if using this function requires explicit meta-data updates or > should be safe on ceph-fs. I'm also not sure if a user-space client even > supports this function (seems to be meaningless). Should this function be > safe to use on ceph kclient? I didn't foresee any limit for this in kclient. The shutil.copy will only copy the contents of the file, while the shutil.copy2 will also copy the metadata. 
I need to know what exactly they do in kclient for shutil.copy and shutil.copy2. > Answers to questions: > >> BTW, have you test the ceph-fuse with the same test ? Is also the same ? > I don't have fuse clients available, so can't test right now. > >> Have you tried other ceph version ? > We are in the process of deploying a new test cluster, the old one is > scrapped already. I can't test this at the moment. > >> It looks like the cap update request was dropped to the ground in MDS. >> [...] >> If you can reproduce it, then please provide the mds logs by setting: >> [...] > I can do a test with MDS logs on high level. Before I do that, looking at the > python findings above, is this something that should work on ceph or is it a > python issue? Not sure yet. I need to understand what exactly shutil.copy does in kclient. Thanks - Xiubo > > Thanks for your help! > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
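The copy-vs-copy2 difference discussed in this thread can be demonstrated locally (a sketch on a plain temp directory, nothing Ceph-specific): both copy the file content, but only copy2 (copyfile+copystat) carries the mtime over, while copy (copyfile+copymode) leaves the destination with a fresh mtime.

```python
import os
import shutil
import tempfile
import time

d = tempfile.mkdtemp()
src = os.path.join(d, "myfile.txt")
with open(src, "w") as f:
    f.write("test\n")
old = time.time() - 3600                 # back-date the source mtime by one hour
os.utime(src, (old, old))

cp = shutil.copy(src, os.path.join(d, "copy.txt"))     # copyfile + copymode
cp2 = shutil.copy2(src, os.path.join(d, "copy2.txt"))  # copyfile + copystat

same_content = open(cp).read() == open(cp2).read() == "test\n"
copy2_keeps_mtime = abs(os.path.getmtime(cp2) - old) < 2
copy_resets_mtime = os.path.getmtime(cp) - old > 60
print(same_content, copy2_keeps_mtime, copy_resets_mtime)  # True True True
```

On a local filesystem both copies of course end up with the right size; the size inconsistency in this thread only shows up through the CephFS kclient.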
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, I will try to answer questions from all your 3 e-mails here together with some new information we have. New: The problem occurs in newer python versions when using the shutil.copy function. There is also a function shutil.copy2 for which the problem does not show up. Copy2 behaves a bit like "cp -p" while copy is like "cp". The only code difference (linux) between these 2 functions is that copy calls copyfile+copymode while copy2 calls copyfile+copystat. For now we asked our users to use copy2 to avoid the issue. The copyfile function calls _fastcopy_sendfile on linux, which in turn calls os.sendfile, which seems to be part of libc: #include <sys/sendfile.h> ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count); I'm wondering if using this function requires explicit meta-data updates or should be safe on ceph-fs. I'm also not sure if a user-space client even supports this function (seems to be meaningless). Should this function be safe to use on ceph kclient? Answers to questions: > BTW, have you test the ceph-fuse with the same test ? Is also the same ? I don't have fuse clients available, so can't test right now. > Have you tried other ceph version ? We are in the process of deploying a new test cluster, the old one is scrapped already. I can't test this at the moment. > It looks like the cap update request was dropped to the ground in MDS. > [...] > If you can reproduce it, then please provide the mds logs by setting: > [...] I can do a test with MDS logs on high level. Before I do that, looking at the python findings above, is this something that should work on ceph or is it a python issue? Thanks for your help! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
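To poke at the sendfile path discussed above directly from Python, the following Linux-only sketch copies a regular file with os.sendfile (the syscall the message says _fastcopy_sendfile uses). Note that it transfers file data only, never metadata, which matches the copy-vs-copy2 distinction in the thread.

```python
import os
import tempfile

# create a small source file
fd, src_path = tempfile.mkstemp()
os.write(fd, b"hello ceph")
os.close(fd)
dst_path = src_path + ".copy"

# copy it via sendfile(2); loop because sendfile may do short transfers
with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
    remaining = os.fstat(fin.fileno()).st_size
    offset = 0
    while remaining > 0:
        sent = os.sendfile(fout.fileno(), fin.fileno(), offset, remaining)
        if sent == 0:
            break
        offset += sent
        remaining -= sent

with open(dst_path, "rb") as f:
    data = f.read()
print(data)  # b'hello ceph'
```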
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, great! I'm not sure if we observed this particular issue, but we did have the "oldest_client_tid updates not advancing" message in a context that might be related. If this fix is not too large, it would be really great if it could be included in the last Pacific point release. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Wednesday, November 8, 2023 1:38 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin Hi Frank, Recently I found a new possible case that could cause this, please see https://github.com/ceph/ceph/pull/54259. This is just a ceph-side fix; after this we need to fix it in kclient too, which hasn't been done yet. Thanks - Xiubo On 8/8/23 17:44, Frank Schilder wrote: > Dear Xiubo, > > the nearfull pool is an RBD pool and has nothing to do with the file system. > All pools for the file system have plenty of capacity. > > I think we have an idea what kind of workload caused the issue. We had a user > run a computation that reads the same file over and over again. He started > 100 such jobs in parallel and our storage servers were at 400% load. I saw > 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. > 11K. Clearly, most of this was served from RAM. > > It is possible that this extreme load situation triggered a race that > remained undetected/unreported. There is literally no related message in any > logs near the time the warning started popping up. It shows up out of nowhere. > > We asked the user to change his workflow to use local RAM disk for the input > files. I don't think we can reproduce the problem anytime soon. > > About the bug fixes, I'm eagerly waiting for this and another one. Any idea > when they might show up in distro kernels?
> > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Xiubo Li > Sent: Tuesday, August 8, 2023 2:57 AM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: MDS stuck in rejoin > > > On 8/7/23 21:54, Frank Schilder wrote: >> Dear Xiubo, >> >> I managed to collect some information. It looks like there is nothing in the >> dmesg log around the time the client failed to advance its TID. I collected >> short snippets around the critical time below. I have full logs in case you >> are interested. Its large files, I will need to do an upload for that. >> >> I also have a dump of "mds session ls" output for clients that showed the >> same issue later. Unfortunately, no consistent log information for a single >> incident. >> >> Here the summary, please let me know if uploading the full package makes >> sense: >> >> - Status: >> >> On July 29, 2023 >> >> ceph status/df/pool stats/health detail at 01:05:14: >> cluster: >> health: HEALTH_WARN >> 1 pools nearfull >> >> ceph status/df/pool stats/health detail at 01:05:28: >> cluster: >> health: HEALTH_WARN >> 1 clients failing to advance oldest client/flush tid >> 1 pools nearfull > Okay, then this could be the root cause. > > If the pool nearful it could block flushing the journal logs to the pool > and then the MDS couldn't safe reply to the requests and then block them > like this. > > Could you fix the pool nearful issue first and then check could you see > it again ? > > >> [...] >> >> On July 31, 2023 >> >> ceph status/df/pool stats/health detail at 10:36:16: >> cluster: >> health: HEALTH_WARN >> 1 clients failing to advance oldest client/flush tid >> 1 pools nearfull >> >> cluster: >> health: HEALTH_WARN >> 1 pools nearfull >> >> - client evict command (date, time, command): >> >> 2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457 >> >> We have a 1h time difference between the date stamp of the command and the >> dmesg date stamps. 
However, there seems to be a weird 10min delay from >> issuing the evict command until it shows up in dmesg on the client. >> >> - dmesg: >> >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling un
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Gregory and Xiubo, we have a smoking gun. The error shows up when using python's shutil.copy function. It affects newer versions of python3. Here some test results (quoted e-mail from our user): > I now have a minimal example that reproduces the error: > > echo test > myfile.txt > ml Python/3.10.4-GCCcore-11.3.0-bare > python -c "import shutil; shutil.copy('myfile.txt','myfile2.txt')" > ls -l | grep myfile > srun --partition fatq --time 01:00:00 --nodes 1 --exclusive --pty bash > ls -l | grep myfile > > The problem occurs when I copy a file via python > (Python/3.10.4-GCCcore-11.3.0-bare) > The problems does not occur when running the script with the default python > > [user1@host1 ~]$ echo test > myfile.txt > [user1@host1 ~]$ ml Python/3.10.4-GCCcore-11.3.0-bare > [user1@host1 ~]$ python -c "import shutil; > shutil.copy('myfile.txt','myfile2.txt')" > [user1@host1 ~]$ ls -l | grep myfile > -rw-rw 1 user1 user1 5 Nov 3 07:23 myfile2.txt > -rw-rw 1 user1 user1 5 Nov 3 07:23 myfile.txt > [user1@host1 ~]$ srun --partition fatq --time 01:00:00 --nodes 1 --exclusive > --pty bash > [user1@host3 ~]$ ls -l | grep myfile > -rw-rw 1 user1 user1 0 Nov 3 07:23 myfile2.txt > -rw-rw 1 user1 user1 5 Nov 3 07:23 myfile.txt > Just tried with some other python versions > > Ok > Python/3.6.4-intel-2018a > Python/3.7.4-GCCcore-8.3.0 > > Fail > Python/3.8.2-GCCcore-9.3.0 > Python/3.9.5-GCCcore-10.3.0-bare > Python/3.9.5-GCCcore-10.3.0 > Python/3.10.4-GCCcore-11.3.0-bare > Python/3.10.4-GCCcore-11.3.0 > Python/3.10.8-GCCcore-12.2.0-bare These are easybuild python modules using different gcc versions to build. The default version of python referred to is Python 2.7.5. Is this a known problem with python3 and is there a patch we can apply? I wonder how python manages to break the file system so consistently. 
Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Thursday, November 2, 2023 12:12 PM To: Gregory Farnum; Xiubo Li Cc: ceph-users@ceph.io Subject: [ceph-users] Re: ceph fs (meta) data inconsistent Hi all, the problem re-appeared in the following way. After moving the problematic folder out and copying it back, all files showed the correct sizes. Today, we observe that the issue is back in the copy that was fine yesterday: [user1@host11 h2lib]$ ls -l total 37198 -rw-rw 1 user1 user1 11641 Nov 2 11:32 dll_wrapper.py [user2@host2 h2lib]# ls -l total 44 -rw-rw. 1 user1 user1 0 Nov 2 11:28 dll_wrapper.py It is correct in the snapshot though: [user2@host2 h2lib]# cd .snap [user2@host2 .snap]# ls _2023-11-02_000611+0100_daily_1 [user2@host2 .snap]# cd _2023-11-02_000611+0100_daily_1/ [user2@host2 _2023-11-02_000611+0100_daily_1]# ls -l total 37188 -rw-rw. 1 user1 user1 11641 Nov 1 13:30 dll_wrapper.py It seems related to the path, not the inode number. Could the re-appearance of the 0 length have been triggered by taking the snapshot? We plan to reboot the server where the file was written later today. Until then we can do diagnostics while the issue is visible. Please let us know what information we can provide. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Gregory Farnum Sent: Wednesday, November 1, 2023 4:57 PM To: Frank Schilder; Xiubo Li Cc: ceph-users@ceph.io Subject: Re: [ceph-users] ceph fs (meta) data inconsistent We have seen issues like this a few times and they have all been kernel client bugs with CephFS’ internal “capability” file locking protocol. I’m not aware of any extant bugs like this in our code base, but kernel patches can take a long and winding path before they end up on deployed systems. Most likely, if you were to restart some combination of the client which wrote the file and the client(s) reading it, the size would propagate correctly.
As long as you’ve synced the data, it’s definitely present in the cluster. Adding Xiubo, who has worked on these and may have other comments. -Greg On Wed, Nov 1, 2023 at 7:16 AM Frank Schilder <fr...@dtu.dk> wrote: Dear fellow cephers, today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts: [user1@host1 h2lib]$ ls -lh total 37M -rw-rw 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py [user2@host2 h2lib]# ls -l total 34 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test [user1@host1 h2lib]$ ls -l total 37199 -rw-rw 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py -rw-rw 1 user1 user1 11641 Nov
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi all, the problem re-appeared in the following way. After moving the problematic folder out and copying it back, all files showed the correct sizes. Today, we observe that the issue is back in the copy that was fine yesterday: [user1@host11 h2lib]$ ls -l total 37198 -rw-rw 1 user1 user1 11641 Nov 2 11:32 dll_wrapper.py [user2@host2 h2lib]# ls -l total 44 -rw-rw. 1 user1 user1 0 Nov 2 11:28 dll_wrapper.py It is correct in the snapshot though: [user2@host2 h2lib]# cd .snap [user2@host2 .snap]# ls _2023-11-02_000611+0100_daily_1 [user2@host2 .snap]# cd _2023-11-02_000611+0100_daily_1/ [user2@host2 _2023-11-02_000611+0100_daily_1]# ls -l total 37188 -rw-rw. 1 user1 user1 11641 Nov 1 13:30 dll_wrapper.py It seems related to the path, not the inode number. Could the re-appearance of the 0 length have been triggered by taking the snapshot? We plan to reboot the server where the file was written later today. Until then we can do diagnostics while the issue is visible. Please let us know what information we can provide. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Gregory Farnum Sent: Wednesday, November 1, 2023 4:57 PM To: Frank Schilder; Xiubo Li Cc: ceph-users@ceph.io Subject: Re: [ceph-users] ceph fs (meta) data inconsistent We have seen issues like this a few times and they have all been kernel client bugs with CephFS’ internal “capability” file locking protocol. I’m not aware of any extant bugs like this in our code base, but kernel patches can take a long and winding path before they end up on deployed systems. Most likely, if you were to restart some combination of the client which wrote the file and the client(s) reading it, the size would propagate correctly. As long as you’ve synced the data, it’s definitely present in the cluster. Adding Xiubo, who has worked on these and may have other comments.
-Greg On Wed, Nov 1, 2023 at 7:16 AM Frank Schilder <fr...@dtu.dk> wrote: Dear fellow cephers, today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts: [user1@host1 h2lib]$ ls -lh total 37M -rw-rw 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py [user2@host2 h2lib]# ls -l total 34 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test [user1@host1 h2lib]$ ls -l total 37199 -rw-rw 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py -rw-rw 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test [user2@host2 h2lib]# ls -l total 45 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py -rw-rw. 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test Executing a sync on all these hosts did not help. However, deleting the problematic file and replacing it with a copy seemed to work around the issue. We saw this with ceph kclients of different versions; it seems to be on the MDS side. How can this happen and how dangerous is it?
ceph fs status (showing ceph version): # ceph fs status con-fs2 - 1662 clients === RANK STATE MDS ACTIVITY DNS INOS 0 active ceph-15 Reqs: 14 /s 2307k 2278k 1 active ceph-11 Reqs: 159 /s 4208k 4203k 2 active ceph-17 Reqs: 3 /s 4533k 4501k 3 active ceph-24 Reqs: 3 /s 4593k 4300k 4 active ceph-14 Reqs: 1 /s 4228k 4226k 5 active ceph-13 Reqs: 5 /s 1994k 1782k 6 active ceph-16 Reqs: 8 /s 5022k 4841k 7 active ceph-23 Reqs: 9 /s 4140k 4116k POOL TYPE USED AVAIL con-fs2-meta1 metadata 2177G 7085G con-fs2-meta2 data 0 7085G con-fs2-data data 1242T 4233T con-fs2-data-ec-ssd data 706G 22.1T con-fs2-data2 data 3409T 3848T STANDBY MDS ceph-10 ceph-08 ceph-09 ceph-12 MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) There is no health issue: # ceph status cluster: id: abc health: HEALTH_WARN 3 pgs not deep-scrubbed in time services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 9w) mgr: ceph-25 (active, since 7w), standbys: ceph-26, ceph-01, ceph-03, ceph-02 mds: con-fs2:8 4 up:standby 8 up:active osd: 1284 osds: 1279 up (since 2d), 1279 in (since 5d) task status: data: pools: 14 pools, 25065 pgs objects: 2.20G objects, 3.9 PiB usage: 4.9 PiB used, 8.2 PiB / 13 PiB avail pgs: 25039 active+clean 26 active+clean+scrubbing+deep io: client: 799 MiB/s rd, 55 MiB/s wr, 3.12k op/s rd, 1.82k op/s wr The inconsistency seems undiagnosed; I couldn't find anything interesting in the cluster log. What should I look for and where? I moved the folder to another location for diagnosis. Unfortunately, I don't have 2 clients any more s
[ceph-users] ceph fs (meta) data inconsistent
Dear fellow cephers, today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts: [user1@host1 h2lib]$ ls -lh total 37M -rw-rw 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py [user2@host2 h2lib]# ls -l total 34 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test [user1@host1 h2lib]$ ls -l total 37199 -rw-rw 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py -rw-rw 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test [user2@host2 h2lib]# ls -l total 45 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py -rw-rw. 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test Executing a sync on all these hosts did not help. However, deleting the problematic file and replacing it with a copy seemed to work around the issue. We saw this with ceph kclients of different versions; it seems to be on the MDS side. How can this happen and how dangerous is it? ceph fs status (showing ceph version): # ceph fs status con-fs2 - 1662 clients === RANK STATE MDS ACTIVITY DNS INOS 0 active ceph-15 Reqs: 14 /s 2307k 2278k 1 active ceph-11 Reqs: 159 /s 4208k 4203k 2 active ceph-17 Reqs: 3 /s 4533k 4501k 3 active ceph-24 Reqs: 3 /s 4593k 4300k 4 active ceph-14 Reqs: 1 /s 4228k 4226k 5 active ceph-13 Reqs: 5 /s 1994k 1782k 6 active ceph-16 Reqs: 8 /s 5022k 4841k 7 active ceph-23 Reqs: 9 /s 4140k 4116k POOL TYPE USED AVAIL con-fs2-meta1 metadata 2177G 7085G con-fs2-meta2 data 0 7085G con-fs2-data data 1242T 4233T con-fs2-data-ec-ssd data 706G 22.1T con-fs2-data2 data 3409T 3848T STANDBY MDS ceph-10 ceph-08 ceph-09 ceph-12 MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) There is no health issue: # ceph status cluster: id: abc health: HEALTH_WARN 3 pgs not deep-scrubbed in time services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 9w) mgr: ceph-25 (active, since 7w), standbys: ceph-26, ceph-01, ceph-03, ceph-02 mds: con-fs2:8 4
up:standby 8 up:active osd: 1284 osds: 1279 up (since 2d), 1279 in (since 5d) task status: data: pools: 14 pools, 25065 pgs objects: 2.20G objects, 3.9 PiB usage: 4.9 PiB used, 8.2 PiB / 13 PiB avail pgs: 25039 active+clean 26 active+clean+scrubbing+deep io: client: 799 MiB/s rd, 55 MiB/s wr, 3.12k op/s rd, 1.82k op/s wr The inconsistency seems undiagnosed; I couldn't find anything interesting in the cluster log. What should I look for and where? I moved the folder to another location for diagnosis. Unfortunately, I don't have 2 clients any more showing different numbers; I see a 0 length now everywhere for the moved folder. I'm pretty sure, though, that the file is still non-zero length. Thanks for any pointers. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: find PG with large omap object
Hi Ramin, that's an interesting approach for part of the problem. Two comments: 1) We don't have RGW services; we observe large omap objects in a ceph fs meta-data pool. Hence, the script doesn't help here. 2) The command "radosgw-admin" has the very annoying property of creating pools without asking. It would be great if you could add a sanity check that confirms that RGW services are actually present *before* executing any radosgw-admin command, and exits if none are present. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Ramin Najjarbashi Sent: Tuesday, October 31, 2023 11:22 AM To: Frank Schilder Cc: Eugen Block; ceph-users@ceph.io Subject: Re: [ceph-users] Re: find PG with large omap object https://gist.github.com/RaminNietzsche/b8702014c333f3a44a995d5a6d4a56be On Mon, Oct 16, 2023 at 4:27 PM Frank Schilder <fr...@dtu.dk> wrote: Hi Eugen, the warning threshold is per omap object, not per PG (which apparently has more than 1 omap object). Still, I misread the numbers by one order of magnitude, which means that the difference between the last 2 entries is about 10, which does point towards a large omap object in PG 12.193. I issued a deep-scrub on this PG and the warning is resolved. Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block <ebl...@nde.ag> Sent: Monday, October 16, 2023 2:41 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: find PG with large omap object Hi Frank, > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } those numbers are > 1 million and the warning threshold is 200k. So a warning is expected. Quoting Frank Schilder <fr...@dtu.dk>: > Hi Eugen, > > thanks for the one-liner :) I'm afraid I'm in the same position as > before though.
> > I dumped all PGs to a file and executed these 2 commands: > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_bytes}] > | sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.193", > "nk": 1002401056 > }, > { > "id": "21.0", > "nk": 1235777228 > } > ] > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } > ] > > Neither is beyond the warn limit and pool 12 is indeed the pool > where the warnings came from. OK, now back to the logs: > > # zgrep -i 'Large omap object found. Object:' /var/log/ceph/ceph.log-* > /var/log/ceph/ceph.log-20231008.gz:2023-10-05T01:25:14.581962+0200 > osd.592 (osd.592) 104 : cluster [WRN] Large omap object found. > Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key > count: 21 Size (bytes): 230080309 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T04:33:02.678879+0200 > osd.949 (osd.949) 6897 : cluster [WRN] Large omap object found. > Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key > count: 200243 Size (bytes): 230307097 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T07:22:40.512228+0200 > osd.988 (osd.988) 4365 : cluster [WRN] Large omap object found. > Object: 12:eb96322f:::637.:head PG: 12.f44c69d7 (12.1d7) Key > count: 200329 Size (bytes): 230310393 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T15:08:03.785186+0200 > osd.50 (osd.50) 4549 : cluster [WRN] Large omap object found. > Object: 12:08fb0eb7:::635.:head PG: 12.ed70df10 (12.110) Key > count: 200183 Size (bytes): 230150641 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T16:37:12.901470+0200 > osd.18 (osd.18) 7011 : cluster [WRN] Large omap object found. > Object: 12:d6758956:::634.:head PG: 12.6a91ae6b (12.6b) Key > count: 200247 Size (bytes): 230343371 > /var/log/ceph/ceph.log-20231008.gz:2023-10-08T01:25:16.125068+0200 > osd.980 (osd.980) 308 : cluster [WRN] Large omap object found. 
> Object: 12:63f985e7:::639.:head PG: 12.e7a19fc6 (12.1c6) Key > count: 200160 Size (bytes): 230179282 > /var/log/ceph/ceph.log-20231015:2023-10-09T00:51:32.587849+0200 > osd.563 (osd.563) 3661 : cluster [WRN] Large omap object found. > Object: 12:44346421:::632.:head PG: 12.8
[ceph-users] Combining masks in ceph config
Hi all, I have a case where I want to set options for a set of HDDs under a common sub-tree with root A. I have also HDDs in another disjoint sub-tree with root B. Therefore, I would like to do something like ceph config set osd/class:hdd,datacenter:A option value The above does not give a syntax error, but I'm also not sure it does the right thing. Does the above mean "class:hdd and datacenter:A" or does it mean "for OSDs with device class 'hdd,datacenter:A'"? Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
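The two readings of the mask can be made concrete with a small illustrative parser. This is not Ceph's actual mask parser, just a sketch of the ambiguity being asked about: either the comma separates two ANDed constraints, or everything after the first colon is one literal device-class value.

```python
# Illustrative sketch only -- NOT ceph's actual config-mask parser.
# It demonstrates the two possible readings of "class:hdd,datacenter:A".

def parse_mask_as_and(mask: str) -> dict:
    """Reading (a): split on ',' into separate key:value constraints."""
    constraints = {}
    for part in mask.split(","):
        key, _, value = part.partition(":")
        constraints[key] = value
    return constraints

def parse_mask_as_single(mask: str) -> dict:
    """Reading (b): everything after the first ':' is one literal value."""
    key, _, value = mask.partition(":")
    return {key: value}

if __name__ == "__main__":
    mask = "class:hdd,datacenter:A"
    print(parse_mask_as_and(mask))     # {'class': 'hdd', 'datacenter': 'A'}
    print(parse_mask_as_single(mask))  # {'class': 'hdd,datacenter:A'}
```

If ceph uses reading (b), the option would silently apply to no OSD at all, since no device class is literally named "hdd,datacenter:A" -- which is why the lack of a syntax error proves nothing.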
[ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure
I just tried what sending SIGSTOP and SIGCONT do. After stopping the process 3 caps were returned. After resuming the process these 3 caps were allocated again. There seems to be a large number of stale caps that are not released. While the process was stopped the kworker thread continued to show 2% CPU usage even though there was no file IO going on. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Thursday, October 19, 2023 10:02 AM To: Stefan Kooman; ceph-users@ceph.io Subject: [ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure Hi Stefan, the jobs ended and the warning disappeared as expected. However, a new job started and the warning showed up again. There is something very strange going on and, maybe, you can help out here: We have a low client CAPS limit configured for performance reasons: # ceph config dump | grep client [...] mds advanced mds_max_caps_per_client 65536 The job in question holds more than that: # ceph tell mds.0 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | tail [...] {"id":172249397,"h":"sn272...","addr":"client.172249397 v1:192.168.57.143:0/195146548","fs":"/hpc/home","caps":105417,"req":1442} This CAPS allocation is stable over time, the number doesn't change (I queried multiple times with several minutes interval). My guess is that the MDS message is not about cache pressure but rather about caps trimming. We do have clients that regularly exceed the limit though without MDS warnings. My guess is that these return at least some CAPS on request and are, therefore, not flagged. The client above seems to sit on a fixed set of CAPS that doesn't change and this causes the warning to show up. 
The strange thing now is that very few files (on ceph fs) are actually open on the client: [USER@sn272 ~]$ lsof -u USER | grep -e /home -e /groups -e /apps | wc -l 170 The kworker thread is at about 3% CPU and should be able to release CAPS. I'm wondering why it doesn't happen though. I also don't believe that 170 open files can allocate 105417 client caps. Questions: - Why does the client have so many caps allocated? Is there another way than open files that requires allocations? - Is there a way to find out what these caps are for? - We will look at the code (it's Python + miniconda), any pointers on what to look for? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Tuesday, October 17, 2023 11:27 AM To: Stefan Kooman; ceph-users@ceph.io Subject: [ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure Hi Stefan, probably. It's 2 compute nodes and there are jobs running. Our epilogue script will drop the caches, at which point I indeed expect the warning to disappear. We have no time limit on these nodes though, so this can be a while. I was hoping there was an alternative to that, say, a user-level command that I could execute on the client without possibly affecting other users' jobs. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Stefan Kooman Sent: Tuesday, October 17, 2023 11:13 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] stuck MDS warning: Client HOST failing to respond to cache pressure On 17-10-2023 09:22, Frank Schilder wrote: > Hi all, > > I'm affected by a stuck MDS warning for 2 clients: "failing to respond to > cache pressure". This is a false alarm as no MDS is under any cache pressure. > The warning is stuck already for a couple of days. I found some old threads > about cases where the MDS does not update flags/triggers for this warning in > certain situations. 
Dating back to luminous and I'm probably hitting one of > these. > > In these threads I could find a lot except for instructions for how to clear > this out in a nice way. Is there something I can do on the clients to clear > this warning? I don't want to evict/reboot just because of that. echo 2 > /proc/sys/vm/drop_caches on the clients does that help? Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
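The jq pipeline quoted in this thread can be narrowed to just the over-limit sessions. Here is a Python sketch that assumes the same JSON shape as the `ceph tell mds.0 session ls` output shown above; field names can differ between Ceph versions, so treat them as an assumption:

```python
import json

# Sketch: filter "ceph tell mds.N session ls" output for clients holding
# more caps than mds_max_caps_per_client (65536 in this thread). Field
# names (.id, .client_metadata.hostname, .num_caps) follow the jq
# pipeline quoted above.

def sessions_over_limit(session_ls_json: str, max_caps: int = 65536):
    sessions = json.loads(session_ls_json)
    over = [
        {
            "id": s.get("id"),
            "host": s.get("client_metadata", {}).get("hostname"),
            "caps": s.get("num_caps", 0),
        }
        for s in sessions
        if s.get("num_caps", 0) > max_caps
    ]
    # worst offender first
    return sorted(over, key=lambda s: s["caps"], reverse=True)

if __name__ == "__main__":
    sample = json.dumps([
        {"id": 172249397,
         "client_metadata": {"hostname": "sn272"},
         "num_caps": 105417},
        {"id": 1, "client_metadata": {"hostname": "ok-host"}, "num_caps": 1200},
    ])
    for s in sessions_over_limit(sample):
        print(s)  # only the sn272-like session exceeds the 65536 limit
```

In practice one would pipe the real output in, e.g. `ceph tell mds.0 session ls > sessions.json` and load that file instead of the inline sample.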
[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes
Hi Zakhar, since it's a bit beyond the scope of the basics, could you please post the complete ceph.conf config section for these changes for reference? Thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zakhar Kirpichenko Sent: Wednesday, October 18, 2023 6:14 AM To: Eugen Block Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Ceph 16.2.x mon compactions, disk writes Many thanks for this, Eugen! I very much appreciate yours and Mykola's efforts and insight! Another thing I noticed was a reduction of the RocksDB store after reducing the total PG number by 30%, from 590-600 MB: 65M 3675511.sst 65M 3675512.sst 65M 3675513.sst 65M 3675514.sst 65M 3675515.sst 65M 3675516.sst 65M 3675517.sst 65M 3675518.sst 62M 3675519.sst to about half of the original size: -rw-r--r-- 1 167 167 7218886 Oct 13 16:16 3056869.log -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst Then when I restarted the monitors one by one before adding compression, the RocksDB store shrank even further. I am not sure why and what exactly got automatically removed from the store: -rw-r--r-- 1 167 167 841960 Oct 18 03:31 018779.log -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst Then I enabled LZ4 and LZ4HC compression in our small production cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression. I specifically went for LZ4 and LZ4HC because of the balance between compression/decompression speed and impact on CPU usage. The compression doesn't seem to affect the cluster in any negative way, the 3 monitors with compression are operating normally. 
The effect of the compression on RocksDB store size and disk writes is quite noticeable: Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB writes over 5 minutes: -rw-r--r-- 1 167 167 4227337 Oct 18 03:58 3080868.log -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon 155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/ 2471602 be/4 167 6.05 M 473.24 M 0.00 % 0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0] 2471633 be/4 167 188.00 K 40.91 M 0.00 % 0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch] 2471603 be/4 167 16.00 K 24.16 M 0.00 % 0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0] Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of writes over 5 minutes: -rw-r--r-- 1 167 167 5766659 Oct 18 03:56 3723355.log -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon 60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/ 2052031 be/4 167 1040.00 K 83.48 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0] 2052062 be/4 167 0.00 B 40.79 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch] 2052032 be/4 167 16.00 K 4.68 M 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0] 2052052 be/4 167 44.00 K 0.00 B 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster
[ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure
Hi Loïc, thanks for the pointer. It's kind of the opposite extreme to dropping just everything: I need to know the file name that is in cache. I'm looking for a middle way, say, "drop_caches -u USER" that drops all caches of files owned by user USER. This way I could try dropping caches for a bunch of users who are *not* running a job. I guess I have to wait for the jobs to end. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Loïc Tortay Sent: Tuesday, October 17, 2023 3:40 PM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure On 17/10/2023 11:27, Frank Schilder wrote: > Hi Stefan, > > probably. Its 2 compute nodes and there are jobs running. Our epilogue script > will drop the caches, at which point I indeed expect the warning to > disappear. We have no time limit on these nodes though, so this can be a > while. I was hoping there was an alternative to that, say, a user-level > command that I could execute on the client without possibly affecting other > users jobs. > Hello, If you know the names of the files to flush from the cache (from /proc/$PID/fd, lsof, batch job script, ...), you can use something like https://github.com/tortay/cache-toys/blob/master/drop-from-pagecache.c on the client. See comments line 16 to 22 of the source code for caveats/limitations. Loïc. -- | Loïc Tortay - IN2P3 Computing Centre | ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
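The "drop_caches -u USER" middle way could, in principle, be built from the same ingredients Loïc mentions (/proc/$PID/fd plus fadvise). The sketch below is hypothetical and Linux-only: the function names are ours, it needs root to inspect other users' processes, and dropping page-cache pages does not necessarily make the kernel client release the corresponding caps, which is exactly the open question in this thread.

```python
import os

# Hypothetical "drop_caches -u USER" sketch (Linux-only): enumerate files a
# user has open under given mount prefixes via /proc/<pid>/fd, then ask the
# kernel to drop their cached pages with POSIX_FADV_DONTNEED.

def under_mounts(path: str, mounts) -> bool:
    """True if path lives under one of the given mount prefixes."""
    return any(path == m or path.startswith(m.rstrip("/") + "/") for m in mounts)

def open_files_of_uid(uid: int, mounts):
    """Collect files under `mounts` held open by processes owned by `uid`."""
    files = set()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            if os.stat(f"/proc/{pid}").st_uid != uid:
                continue
            for fd in os.listdir(f"/proc/{pid}/fd"):
                target = os.readlink(f"/proc/{pid}/fd/{fd}")
                if under_mounts(target, mounts):
                    files.add(target)
        except OSError:
            continue  # process exited, or permission denied without root
    return sorted(files)

def drop_pages(path: str):
    """Advise the kernel to evict this file's pages from the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

if __name__ == "__main__" and os.path.isdir("/proc"):
    for f in open_files_of_uid(os.getuid(), ["/home", "/groups", "/apps"]):
        print(f)  # review the list before actually calling drop_pages(f)
```

Unlike `echo 2 > /proc/sys/vm/drop_caches`, this is scoped to one user's open files, so other users' jobs keep their caches warm.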
[ceph-users] stuck MDS warning: Client HOST failing to respond to cache pressure
Hi all, I'm affected by a stuck MDS warning for 2 clients: "failing to respond to cache pressure". This is a false alarm as no MDS is under any cache pressure. The warning is stuck already for a couple of days. I found some old threads about cases where the MDS does not update flags/triggers for this warning in certain situations. Dating back to luminous and I'm probably hitting one of these. In these threads I could find a lot except for instructions for how to clear this out in a nice way. Is there something I can do on the clients to clear this warning? I don't want to evict/reboot just because of that. Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: find PG with large omap object
Hi Eugen, the warning threshold is per omap object, not per PG (which apparently has more than 1 omap object). Still, I misread the numbers by 1 order, which means that the difference between the last 2 entries is about 10, which does point towards a large omap object in PG 12.193. I issued a deep-scrub on this PG and the warning is resolved. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Monday, October 16, 2023 2:41 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: find PG with large omap object Hi Frank, > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } those numbers are > 1 million and the warning threshold is 200k. So a warning is expected. Zitat von Frank Schilder : > Hi Eugen, > > thanks for the one-liner :) I'm afraid I'm in the same position as > before though. > > I dumped all PGs to a file and executed these 2 commands: > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_bytes}] > | sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.193", > "nk": 1002401056 > }, > { > "id": "21.0", > "nk": 1235777228 > } > ] > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } > ] > > Neither is beyond the warn limit and pool 12 is indeed the pool > where the warnings came from. OK, now back to the logs: > > # zgrep -i 'Large omap object found. Object:' /var/log/ceph/ceph.log-* > /var/log/ceph/ceph.log-20231008.gz:2023-10-05T01:25:14.581962+0200 > osd.592 (osd.592) 104 : cluster [WRN] Large omap object found. 
> Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key > count: 21 Size (bytes): 230080309 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T04:33:02.678879+0200 > osd.949 (osd.949) 6897 : cluster [WRN] Large omap object found. > Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key > count: 200243 Size (bytes): 230307097 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T07:22:40.512228+0200 > osd.988 (osd.988) 4365 : cluster [WRN] Large omap object found. > Object: 12:eb96322f:::637.:head PG: 12.f44c69d7 (12.1d7) Key > count: 200329 Size (bytes): 230310393 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T15:08:03.785186+0200 > osd.50 (osd.50) 4549 : cluster [WRN] Large omap object found. > Object: 12:08fb0eb7:::635.:head PG: 12.ed70df10 (12.110) Key > count: 200183 Size (bytes): 230150641 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T16:37:12.901470+0200 > osd.18 (osd.18) 7011 : cluster [WRN] Large omap object found. > Object: 12:d6758956:::634.:head PG: 12.6a91ae6b (12.6b) Key > count: 200247 Size (bytes): 230343371 > /var/log/ceph/ceph.log-20231008.gz:2023-10-08T01:25:16.125068+0200 > osd.980 (osd.980) 308 : cluster [WRN] Large omap object found. > Object: 12:63f985e7:::639.:head PG: 12.e7a19fc6 (12.1c6) Key > count: 200160 Size (bytes): 230179282 > /var/log/ceph/ceph.log-20231015:2023-10-09T00:51:32.587849+0200 > osd.563 (osd.563) 3661 : cluster [WRN] Large omap object found. > Object: 12:44346421:::632.:head PG: 12.84262c22 (12.22) Key > count: 200325 Size (bytes): 230481029 > /var/log/ceph/ceph.log-20231015:2023-10-09T15:35:28.803117+0200 > osd.949 (osd.949) 7088 : cluster [WRN] Large omap object found. > Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key > count: 200327 Size (bytes): 230404872 > /var/log/ceph/ceph.log-20231015:2023-10-09T18:51:35.615096+0200 > osd.592 (osd.592) 461 : cluster [WRN] Large omap object found. 
> Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key > count: 200228 Size (bytes): 230347361 > > The warnings report a key count > 20, but none of the PGs in the > dump does. Apparently, all these PGs were (deep-) scrubbed already > and the omap key count was updated (or am I misunderstanding > something here). I still don't know and can neither conclude which > PG the warning originates from. As far as I can tell, the warning > should not be there. > > Do you have an idea how to continue diagnosis from here apart from > just t
[ceph-users] Re: find PG with large omap object
Hi Eugen, thanks for the one-liner :) I'm afraid I'm in the same position as before though. I dumped all PGs to a file and executed these 2 commands: # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_bytes}] | sort_by(.nk)' pgs.dump | tail }, { "id": "12.193", "nk": 1002401056 }, { "id": "21.0", "nk": 1235777228 } ] # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | sort_by(.nk)' pgs.dump | tail }, { "id": "12.17b", "nk": 1493776 }, { "id": "12.193", "nk": 1583589 } ] Neither is beyond the warn limit and pool 12 is indeed the pool where the warnings came from. OK, now back to the logs: # zgrep -i 'Large omap object found. Object:' /var/log/ceph/ceph.log-* /var/log/ceph/ceph.log-20231008.gz:2023-10-05T01:25:14.581962+0200 osd.592 (osd.592) 104 : cluster [WRN] Large omap object found. Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key count: 21 Size (bytes): 230080309 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T04:33:02.678879+0200 osd.949 (osd.949) 6897 : cluster [WRN] Large omap object found. Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key count: 200243 Size (bytes): 230307097 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T07:22:40.512228+0200 osd.988 (osd.988) 4365 : cluster [WRN] Large omap object found. Object: 12:eb96322f:::637.:head PG: 12.f44c69d7 (12.1d7) Key count: 200329 Size (bytes): 230310393 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T15:08:03.785186+0200 osd.50 (osd.50) 4549 : cluster [WRN] Large omap object found. Object: 12:08fb0eb7:::635.:head PG: 12.ed70df10 (12.110) Key count: 200183 Size (bytes): 230150641 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T16:37:12.901470+0200 osd.18 (osd.18) 7011 : cluster [WRN] Large omap object found. Object: 12:d6758956:::634.:head PG: 12.6a91ae6b (12.6b) Key count: 200247 Size (bytes): 230343371 /var/log/ceph/ceph.log-20231008.gz:2023-10-08T01:25:16.125068+0200 osd.980 (osd.980) 308 : cluster [WRN] Large omap object found. 
Object: 12:63f985e7:::639.:head PG: 12.e7a19fc6 (12.1c6) Key count: 200160 Size (bytes): 230179282 /var/log/ceph/ceph.log-20231015:2023-10-09T00:51:32.587849+0200 osd.563 (osd.563) 3661 : cluster [WRN] Large omap object found. Object: 12:44346421:::632.:head PG: 12.84262c22 (12.22) Key count: 200325 Size (bytes): 230481029 /var/log/ceph/ceph.log-20231015:2023-10-09T15:35:28.803117+0200 osd.949 (osd.949) 7088 : cluster [WRN] Large omap object found. Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key count: 200327 Size (bytes): 230404872 /var/log/ceph/ceph.log-20231015:2023-10-09T18:51:35.615096+0200 osd.592 (osd.592) 461 : cluster [WRN] Large omap object found. Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key count: 200228 Size (bytes): 230347361 The warnings report a key count > 20, but none of the PGs in the dump does. Apparently, all these PGs were (deep-) scrubbed already and the omap key count was updated (or am I misunderstanding something here). I still don't know and can neither conclude which PG the warning originates from. As far as I can tell, the warning should not be there. Do you have an idea how to continue diagnosis from here apart from just trying a deep scrub on all PGs in the list from the log? 
Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Monday, October 16, 2023 1:41 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: find PG with large omap object Hi, not sure if this is what you need, but if you know the pool id (you probably should) you could try this, it's from an Octopus test cluster (assuming the warning was for the number of keys, not bytes): $ ceph -f json pg dump pgs 2>/dev/null | jq -r '.pg_stats[] | select (.pgid | startswith("17.")) | .pgid + " " + "\(.stat_sum.num_omap_keys)"' 17.6 191 17.7 759 17.4 358 17.5 0 17.2 177 17.3 1 17.0 375 17.1 176 If you don't know the pool you could sort the ouput by the second column and see which PG has the largest number of omap_keys. Regards, Eugen Zitat von Frank Schilder : > Hi all, > > we had a bunch of large omap object warnings after a user deleted a > lot of files on a ceph fs with snapshots. After the snapshots were > rotated out, all but one of these warnings disappeared over time. > However, one warning is stuck and I wonder if its something else. > > Is there a reasonable way (say, one-liner with no more than 120 > characters) to get ceph to tell me which PG this is coming from? I > just want to issue a deep scrub to check if it disappears and going > through th
[ceph-users] find PG with large omap object
Hi all, we had a bunch of large omap object warnings after a user deleted a lot of files on a ceph fs with snapshots. After the snapshots were rotated out, all but one of these warnings disappeared over time. However, one warning is stuck and I wonder if its something else. Is there a reasonable way (say, one-liner with no more than 120 characters) to get ceph to tell me which PG this is coming from? I just want to issue a deep scrub to check if it disappears and going through the logs and querying every single object for its key count seems a bit of a hassle for something that ought to be part of "ceph health detail". Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
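For reference, the jq one-liners that answer this question later in the thread can also be expressed in Python over the output of `ceph -f json pg dump pgs`. The field names (`.pg_stats[].stat_sum.num_omap_keys`) are taken from the output quoted in this thread:

```python
import json

# Sketch: rank PGs by omap key count from "ceph -f json pg dump pgs",
# equivalent to the jq pipelines used in this thread. Optionally restrict
# to one pool by its numeric id prefix (e.g. "12" for PGs "12.*").

def top_omap_pgs(pg_dump_json: str, pool: str = None, n: int = 10):
    stats = json.loads(pg_dump_json)["pg_stats"]
    if pool is not None:
        stats = [s for s in stats if s["pgid"].startswith(pool + ".")]
    ranked = sorted(stats, key=lambda s: s["stat_sum"]["num_omap_keys"])
    return [(s["pgid"], s["stat_sum"]["num_omap_keys"]) for s in ranked[-n:]]

if __name__ == "__main__":
    # sample data mirroring the numbers quoted in this thread
    sample = json.dumps({"pg_stats": [
        {"pgid": "12.17b", "stat_sum": {"num_omap_keys": 1493776}},
        {"pgid": "12.193", "stat_sum": {"num_omap_keys": 1583589}},
        {"pgid": "21.0", "stat_sum": {"num_omap_keys": 42}},
    ]})
    print(top_omap_pgs(sample, pool="12", n=2))
    # [('12.17b', 1493776), ('12.193', 1583589)]
```

In practice: `ceph -f json pg dump pgs > pgs.json 2>/dev/null`, then load that file; the last tuple is the candidate for a targeted deep-scrub.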
[ceph-users] Re: Please help collecting stats of Ceph monitor disk writes
> so that an average number of compactions per hour can be calculated For this you need the time window length for the counts. That's not automatic with the command you sent. For example, my MON log file started at 4 am because of a logrotate. So the assumption that the current time and count will give a correct average for the day starting at 00:00 failed in my situation. To get a correct count for today I would have had to combine 2 log files and send you the current time and count: # grep -i "manual compaction from" /var/log/ceph/ceph-mon.ceph-01.log-20231013 /var/log/ceph/ceph-mon.ceph-01.log | grep "2023-10-13" | wc -l 1226 # grep -i "manual compaction from" /var/log/ceph/ceph-mon.ceph-01.log | grep "2023-10-13" | wc -l 912 # date Fri Oct 13 16:07:57 CEST 2023 Apart from that, I agree that a shorter time window will do. It's the start time of the log that's missing. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
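The full-window requirement can be made explicit in a small script: count compaction lines per calendar day over the concatenated (rotated) logs, and only report days that lie strictly between the first and last date seen, since only those are provably covered from 00:00 to 23:59. A sketch, assuming log lines start with an ISO timestamp (YYYY-MM-DDT...) as in the ceph log excerpts quoted in these threads:

```python
from collections import Counter

# Sketch: per-day "manual compaction" counts over concatenated MON logs,
# reporting only days fully covered by the logs' time span (i.e. dates
# strictly between the first and last date observed).

def compactions_per_full_day(lines):
    dates_seen = set()
    counts = Counter()
    for line in lines:
        date = line[:10]
        if len(date) == 10 and date[4] == "-" and date[7] == "-":
            dates_seen.add(date)
            if "manual compaction from" in line:
                counts[date] += 1
    if not dates_seen:
        return {}
    lo, hi = min(dates_seen), max(dates_seen)
    return {d: counts[d] for d in sorted(dates_seen) if lo < d < hi}

if __name__ == "__main__":
    lines = [
        "2023-10-12T23:59:58 ... manual compaction from x",
        "2023-10-13T00:10:00 ... manual compaction from x",
        "2023-10-13T12:00:00 ... something else",
        "2023-10-13T16:00:00 ... manual compaction from x",
        "2023-10-14T00:00:01 ... boot",
    ]
    print(compactions_per_full_day(lines))  # {'2023-10-13': 2}
```

Feeding it `zcat -f /var/log/ceph/ceph-mon.*.log*` would sidestep the logrotate problem described above, because partial first and last days are dropped automatically.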
[ceph-users] Re: Please help collecting stats of Ceph monitor disk writes
Hi Zakhar, I'm pretty sure you wanted the #manual compactions for an entire day, not from whenever the log starts to current time, which is most often not 23:59. You need to get the date from the previous day and make sure the log contains a full 00:00-23:59 window. 1) iotop results: TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 2256 be/4 ceph 0.00 B 17.48 M 0.00 % 0.80 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [safe_timer] 2230 be/4 ceph 0.00 B 1514.19 M 0.00 % 0.37 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:low0] 2250 be/4 ceph 0.00 B 36.23 M 0.00 % 0.15 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [fn_monstore] 2231 be/4 ceph 0.00 B 50.52 M 0.00 % 0.02 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:high0] 2225 be/4 ceph 0.00 B 120.00 K 0.00 % 0.00 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [log] 2) manual compactions (over a full 24h window): 1882 3) monitor store.db size: 616M 4) cluster version and status: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) cluster: id: xxx health: HEALTH_WARN 1 large omap objects services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 7w) mgr: ceph-25(active, since 4w), standbys: ceph-26, ceph-01, ceph-03, ceph-02 mds: con-fs2:8 4 up:standby 8 up:active osd: 1284 osds: 1282 up (since 27h), 1282 in (since 3w) task status: data: pools: 14 pools, 25065 pgs objects: 2.18G objects, 3.9 PiB usage: 4.8 PiB used, 8.3 PiB / 13 PiB avail 
pgs: 25037 active+clean 26 active+clean+scrubbing+deep 2 active+clean+scrubbing io: client: 1.7 GiB/s rd, 1013 MiB/s wr, 3.02k op/s rd, 1.78k op/s wr Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes
Oh wow! I never bothered looking, because on our hardware the wear is so low: # iotop -ao -bn 2 -d 300 Total DISK READ : 0.00 B/s | Total DISK WRITE : 6.46 M/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 6.47 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 2230 be/4 ceph 0.00 B 1818.71 M 0.00 % 0.46 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:low0] 2256 be/4 ceph 0.00 B 19.27 M 0.00 % 0.43 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [safe_timer] 2250 be/4 ceph 0.00 B 42.38 M 0.00 % 0.26 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [fn_monstore] 2231 be/4 ceph 0.00 B 58.36 M 0.00 % 0.01 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:high0] 644 be/3 root 0.00 B 576.00 K 0.00 % 0.00 % [jbd2/sda3-8] 2225 be/4 ceph 0.00 B 128.00 K 0.00 % 0.00 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [log] 1637141 be/4 root 0.00 B 0.00 B 0.00 % 0.00 % [kworker/u113:2-flush-8:0] 1636453 be/4 root 0.00 B 0.00 B 0.00 % 0.00 % [kworker/u112:0-ceph0] 1560 be/4 root 0.00 B 20.00 K 0.00 % 0.00 % rsyslogd -n [in:imjournal] 1561 be/4 root 0.00 B 56.00 K 0.00 % 0.00 % rsyslogd -n [rs:main Q:Reg] 1.8GB every 5 minutes, that's 518GB per day. The 400G drives we have are rated 10DWPD and with the 6-drive RAID10 config this gives plenty of life-time. 
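The arithmetic behind these numbers, as a sketch; the RAID10 model (6 drives as 3 striped mirror pairs, so every byte lands on 2 of the 6 drives) is our assumption for illustration:

```python
# Sketch of the endurance arithmetic above: extrapolate an iotop sample to
# a daily write volume, then to drive-writes-per-day (DWPD) per drive.
# The RAID10 write-amplification model (copies=2 spread over n_drives=6)
# is an assumption, not a measurement.

def daily_writes_gb(sample_gb: float, sample_seconds: int) -> float:
    """Extrapolate a sampling window to GB written per 24h."""
    return sample_gb * 86400 / sample_seconds

def dwpd_per_drive(daily_gb: float, drive_gb: float,
                   n_drives: int = 1, copies: int = 1) -> float:
    """Per-drive drive-writes-per-day for an array of n_drives holding
    `copies` copies of every byte written."""
    per_drive_gb = daily_gb * copies / n_drives
    return per_drive_gb / drive_gb

if __name__ == "__main__":
    daily = daily_writes_gb(1.8, 300)  # 1.8 GB per 300 s sample
    print(round(daily, 1))             # 518.4 (GB/day, matching the text)
    # 400G drives in a 6-drive RAID10: 2 copies spread over 6 drives
    print(round(dwpd_per_drive(daily, 400, n_drives=6, copies=2), 3))  # 0.432
```

At roughly 0.43 DWPD per drive against a 10 DWPD rating, the "plenty of life-time" conclusion follows directly; the same functions make it easy to see how a single small consumer boot SSD (1 copy, 1 drive) would fare much worse.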
I guess this write load will kill any low-grade SSD (typical boot devices, even enterprise ones), especially if they are smaller drives and the controller doesn't reallocate cells according to remaining write endurance. I guess there was a reason for the recommendations by Dell. I always thought that the recent recommendations for MON store storage in the ceph docs are a "bit unrealistic", apparently both in size and in performance (including endurance). Well, I guess you need to look for write-intensive drives with decent specs. If you do, also go for sufficient size. This will absorb temporary usage peaks, which can be very large, and also provides extra endurance on SSDs with good controllers. I also think the recommendations in the ceph docs deserve a reality check.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Zakhar Kirpichenko
Sent: Wednesday, October 11, 2023 4:30 PM
To: Eugen Block
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

Eugen,

Thanks for your response. May I ask what numbers you're referring to? I am not referring to monitor store.db sizes. I am specifically referring to the writes monitors do to their store.db files by frequently rotating and replacing them with new versions during compactions. The size of the store.db remains more or less the same.
This is a 300s iotop snippet, sorted by aggregated disk writes:

Total DISK READ:   35.56 M/s | Total DISK WRITE:   23.89 M/s
Current DISK READ: 35.64 M/s | Current DISK WRITE: 24.09 M/s
    TID  PRIO  USER  DISK READ  DISK WRITE>  SWAPIN      IO    COMMAND
   4919  be/4  167    16.75 M     2.24 G     0.00 %  1.34 %  ceph-mon -n mon.ceph03 -f --setuser ceph --setgr~lt-mon-cluster-log-to-stderr=true [rocksdb:low0]
  15122  be/4  167     0.00 B   652.91 M     0.00 %  0.27 %  ceph-osd -n osd.31 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17073  be/4  167     0.00 B   651.86 M     0.00 %  0.27 %  ceph-osd -n osd.32 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17268  be/4  167     0.00 B   490.86 M     0.00 %  0.18 %  ceph-osd -n osd.25 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  18032  be/4  167     0.00 B   463.57 M     0.00 %  0.17 %  ceph-osd -n osd.26 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  16855  be/4  167     0.00 B   402.86 M     0.00 %  0.15 %  ceph-osd -n osd.22 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17406  be/4  167     0.00 B   387.03 M     0.00 %  0.14 %  ceph-osd -n osd.27 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17932  be/4  167     0.00 B   375.42 M     0.00 %  0.13 %  ceph-osd -n osd.29 -f -
[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes
I need to ask here: where exactly do you observe the hundreds of GB written per day? Are the mon logs huge? Is it the mon store? Is your cluster unhealthy?

We have an octopus cluster with 1282 OSDs, 1650 ceph fs clients and about 800 librbd clients. Per week our mon logs are about 70M, the cluster logs about 120M, the audit logs about 70M, and I see between 100-200 Kb/s writes to the mon store. That's in the low single-digit GB range per day. Hundreds of GB per day sounds completely over the top on a healthy cluster, unless you have MGR modules changing the OSD/cluster map continuously. Is autoscaler running and doing stuff? Is balancer running and doing stuff? Is backfill going on? Is recovery going on? Is your ceph version affected by the "excessive logging to MON store" issue that was present starting with pacific but should have been addressed by now?

@Eugen: Was there not an option to limit logging to the MON store?

For information to readers, we followed old recommendations from a Dell white paper for building a ceph cluster and have a 1 TB RAID10 array of 6 write-intensive SSDs for the MON stores. After 5 years we are below 10% wear. Average size of the MON store for a healthy cluster is 500M-1G, but we have seen this ballooning to 100+GB in degraded conditions.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Zakhar Kirpichenko
Sent: Wednesday, October 11, 2023 12:00 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

Thank you, Eugen. I'm specifically interested in finding out whether the huge amount of data written by monitors is expected. It is eating through the endurance of our system drives, which were not specced for high DWPD/TBW, as this is not a documented requirement, and monitors produce hundreds of gigabytes of writes per day. I am looking for ways to reduce the amount of writes, if possible.
/Z On Wed, 11 Oct 2023 at 12:41, Eugen Block wrote: > Hi, > > what you report is the expected behaviour, at least I see the same on > all clusters. I can't answer why the compaction is required that > often, but you can control the log level of the rocksdb output: > > ceph config set mon debug_rocksdb 1/5 (default is 4/5) > > This reduces the log entries and you wouldn't see the manual > compaction logs anymore. There are a couple more rocksdb options but I > probably wouldn't change too much, only if you know what you're doing. > Maybe Igor can comment if some other tuning makes sense here. > > Regards, > Eugen > > Zitat von Zakhar Kirpichenko : > > > Any input from anyone, please? > > > > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko > wrote: > > > >> Any input from anyone, please? > >> > >> It's another thing that seems to be rather poorly documented: it's > unclear > >> what to expect, what 'normal' behavior should be, and what can be done > >> about the huge amount of writes by monitors. > >> > >> /Z > >> > >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko > wrote: > >> > >>> Hi, > >>> > >>> Monitors in our 16.2.14 cluster appear to quite often run "manual > >>> compaction" tasks: > >>> > >>> debug 2023-10-09T09:30:53.888+ 7f48a329a700 4 rocksdb: > EVENT_LOG_v1 > >>> {"time_micros": 1696843853892760, "job": 64225, "event": > "flush_started", > >>> "num_memtables": 1, "num_entries": 715, "num_deletes": 251, > >>> "total_data_size": 3870352, "memory_usage": 3886744, "flush_reason": > >>> "Manual Compaction"} > >>> debug 2023-10-09T09:30:53.904+ 7f4899286700 4 rocksdb: > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction > >>> starting > >>> debug 2023-10-09T09:30:53.908+ 7f48a3a9b700 4 rocksdb: (Original > Log > >>> Time 2023/10/09-09:30:53.910204) > [db_impl/db_impl_compaction_flush.cc:2516] > >>> [default] Manual compaction from level-0 to level-5 from 'paxos .. 
> 'paxos; > >>> will stop at (end) > >>> debug 2023-10-09T09:30:53.908+ 7f4899286700 4 rocksdb: > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction > >>> starting > >>> debug 2023-10-09T09:30:53.908+ 7f4899286700 4 rocksdb: > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction > >>> starting > >>> debug 2023-10-09T09:30:53.908+ 7f4899286700 4 rocksdb: > >>> [db_impl
[ceph-users] Re: backfill_wait preventing deep scrubs
Thanks!

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mykola Golub
Sent: Thursday, September 21, 2023 4:53 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] backfill_wait preventing deep scrubs

On Thu, Sep 21, 2023 at 1:57 PM Frank Schilder wrote:
>
> Hi all,
>
> I replaced a disk in our octopus cluster and it is rebuilding. I noticed that
> since the replacement there is no scrubbing going on. Apparently, an OSD
> having a PG in backfill_wait state seems to block deep scrubbing all other
> PGs on that OSD as well - at least this is how it looks.
>
> Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs.
> A total of 144 PGs needed backfilling (were remapped after replacing the
> disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling.
> It will take a bit more than a week to complete.
>
> There is plenty of time and IOP/s available to deep-scrub PGs on the side,
> but since the backfill started there is zero scrubbing/deep scrubbing going
> on and "PGs not deep scrubbed in time" messages are piling up.
>
> Is there a way to allow (deep) scrub in this situation?

ceph config set osd osd_scrub_during_recovery true

--
Mykola Golub
[ceph-users] backfill_wait preventing deep scrubs
Hi all, I replaced a disk in our octopus cluster and it is rebuilding. I noticed that since the replacement there is no scrubbing going on. Apparently, an OSD having a PG in backfill_wait state seems to block deep scrubbing all other PGs on that OSD as well - at least this is how it looks. Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs. A total of 144 PGs needed backfilling (were remapped after replacing the disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling. It will take a bit more than a week to complete. There is plenty of time and IOP/s available to deep-scrub PGs on the side, but since the backfill started there is zero scrubbing/deep scrubbing going on and "PGs not deep scrubbed in time" messages are piling up. Is there a way to allow (deep) scrub in this situation? Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
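The "bit more than a week" estimate in the message above is simple proportional arithmetic; here is a small sketch using those numbers (it assumes the backfill rate stays roughly constant, which real clusters do not guarantee):

```python
# Rough backfill ETA from the numbers in the message above.
total_pgs = 144          # PGs remapped after the disk replacement
remaining = 115 + 3      # backfill_wait + backfilling after ~2 days
elapsed_days = 2

done = total_pgs - remaining       # 26 PGs finished in 2 days
rate = done / elapsed_days         # ~13 PGs/day
eta_days = remaining / rate        # ~9 days -> "a bit more than a week"

print(f"{rate:.0f} PGs/day, ~{eta_days:.1f} days remaining")
```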
[ceph-users] Re: libceph: mds1 IP+PORT wrong peer at address
Hi, in our case the problem was on the client side. When you write "logs from a host", do you mean an OSD host or a host where client connections come from? It's not clear from your problem description *who* is requesting the wrong peer.

The reason for this message is that something tries to talk to an older instance of a daemon. The number after the / is a nonce that is assigned after each restart to be able to distinguish different instances of the same service (say, OSD or MDS). Usually, these get updated on peering. If it's clients that are stuck, you probably need to reboot the client hosts.

Please specify what is trying to reach the outdated OSD instances. Then a relevant developer is more likely to look at it. Since it's not MDS-kclient interaction, it might be useful to open a new case.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: U S
Sent: Tuesday, September 19, 2023 5:35 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: libceph: mds1 IP+PORT wrong peer at address

Hi,

We are unable to resolve these issues and the OSD restarts have made the ceph cluster unusable. We are wondering about downgrading the ceph version from 18.2.0 to 17.2.6. Please let us know if this is supported and, if so, please point me to the procedure to do the same. Thanks.
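The nonce Frank describes can be pulled out of an entity address programmatically. A small sketch (the regex is my own approximation of the v1 address format; the sample address is a representative client address of that shape):

```python
import re

# Split a Ceph entity address of the form "v1:IP:PORT/NONCE" into its parts.
# The nonce after the slash changes on every daemon restart; that is how
# peers distinguish successive instances of the same OSD/MDS.
ADDR_RE = re.compile(r"^(v[12]):(?P<ip>[^:]+):(?P<port>\d+)/(?P<nonce>\d+)$")

def parse_addr(addr: str) -> dict:
    m = ADDR_RE.match(addr)
    if not m:
        raise ValueError(f"unrecognised address: {addr!r}")
    return {"ip": m["ip"], "port": int(m["port"]), "nonce": int(m["nonce"])}

print(parse_addr("v1:192.168.57.221:0/4262844211"))
# -> {'ip': '192.168.57.221', 'port': 0, 'nonce': 4262844211}
```

Comparing the nonce in a "wrong peer" log line against the one a daemon currently advertises is a quick way to confirm that someone is still talking to a stale instance.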
[ceph-users] Re: MDS daemons don't report any more
Hi Patrick,

I'm not sure that it's exactly the same issue. I observed that "ceph tell mds.xyz session ls" had all counters 0. On the Friday before, we had a power loss on a rack that took out a JBOD with a few meta-data disks, and I suspect that the reporting of zeroes started after this crash. No hard evidence though.

I uploaded all logs with a bit of explanation to tag 1c022c43-04a7-419d-bdb0-e33c97ef06b8. I don't have any more detail than that recorded. It was 3 other MDSes that restarted on the fail of rank 0, not 5 as I wrote before. The readme.txt contains pointers to find the info in the logs.

We didn't have any user complaints. Therefore, I'm reasonably confident that the file system was actually accessible the whole time (from Friday afternoon until Sunday night when I restarted everything). I hope you can find something useful.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Patrick Donnelly
Sent: Monday, September 11, 2023 7:51 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS daemons don't report any more

Hello Frank,

On Mon, Sep 11, 2023 at 8:39 AM Frank Schilder wrote:
>
> Update: I did a systematic fail of all MDSes that didn't report, starting
> with the stand-by daemons and continuing from high to low ranks. One by one
> they started showing up again with version and stats and the fail went as
> usual with one exception: rank 0.

It might be https://tracker.ceph.com/issues/24403

> The moment I failed rank 0 it took 5 other MDSes down with it.

Can you be more precise about what happened? Can you share logs?

> This is, in fact, the second time I have seen such an event, failing an MDS
> crashes others. Given the weird observation in my previous e-mail together
> with what I saw when restarting everything, does this indicate a problem with
> data integrity or is this an annoying yet harmless bug?

It sounds like an annoyance but certainly one we'd like to track down.
Keep in mind that the "fix" for https://tracker.ceph.com/issues/24403 is not going to Octopus. You need to upgrade and Pacific will soon be EOL. > Thanks for any help! > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: Sunday, September 10, 2023 12:39 AM > To: ceph-users@ceph.io > Subject: [ceph-users] MDS daemons don't report any more > > Hi all, I make a weird observation. 8 out of 12 MDS daemons seem not to > report to the cluster any more: > > # ceph fs status > con-fs2 - 1625 clients > === > RANK STATE MDS ACTIVITY DNSINOS > 0active ceph-16 Reqs:0 /s 0 0 > 1active ceph-09 Reqs: 128 /s 4251k 4250k > 2active ceph-17 Reqs:0 /s 0 0 > 3active ceph-15 Reqs:0 /s 0 0 > 4active ceph-24 Reqs: 269 /s 3567k 3567k > 5active ceph-11 Reqs:0 /s 0 0 > 6active ceph-14 Reqs:0 /s 0 0 > 7active ceph-23 Reqs:0 /s 0 0 > POOL TYPE USED AVAIL >con-fs2-meta1 metadata 2169G 7081G >con-fs2-meta2 data 0 7081G > con-fs2-data data1248T 4441T > con-fs2-data-ec-ssddata 705G 22.1T >con-fs2-data2 data3172T 4037T > STANDBY MDS > ceph-08 > ceph-10 > ceph-12 > ceph-13 > VERSION >DAEMONS > None > ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12 > ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus > (stable)ceph-09, ceph-24, ceph-08, ceph-13 > > Version is "none" for these and there are no stats. Ceph versions reports > only 4 MDSes out of the 12. 8 are not shown at all: > > [root@gnosis ~]# ceph versions > { > "mon": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 5 > }, > "mgr": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 5 > }, > "osd": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 1282 > }, > "mds": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 4 > }, > "overall": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632a
[ceph-users] Re: MDS daemons don't report any more
Update: I did a systematic fail of all MDSes that didn't report, starting with the stand-by daemons and continuing from high to low ranks. One by one they started showing up again with version and stats and the fail went as usual with one exception: rank 0. The moment I failed rank 0 it took 5 other MDSes down with it. This is, in fact, the second time I have seen such an event, failing an MDS crashes others. Given the weird observation in my previous e-mail together with what I saw when restarting everything, does this indicate a problem with data integrity or is this an annoying yet harmless bug? Thanks for any help! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Sunday, September 10, 2023 12:39 AM To: ceph-users@ceph.io Subject: [ceph-users] MDS daemons don't report any more Hi all, I make a weird observation. 8 out of 12 MDS daemons seem not to report to the cluster any more: # ceph fs status con-fs2 - 1625 clients === RANK STATE MDS ACTIVITY DNSINOS 0active ceph-16 Reqs:0 /s 0 0 1active ceph-09 Reqs: 128 /s 4251k 4250k 2active ceph-17 Reqs:0 /s 0 0 3active ceph-15 Reqs:0 /s 0 0 4active ceph-24 Reqs: 269 /s 3567k 3567k 5active ceph-11 Reqs:0 /s 0 0 6active ceph-14 Reqs:0 /s 0 0 7active ceph-23 Reqs:0 /s 0 0 POOL TYPE USED AVAIL con-fs2-meta1 metadata 2169G 7081G con-fs2-meta2 data 0 7081G con-fs2-data data1248T 4441T con-fs2-data-ec-ssddata 705G 22.1T con-fs2-data2 data3172T 4037T STANDBY MDS ceph-08 ceph-10 ceph-12 ceph-13 VERSION DAEMONS None ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12 ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)ceph-09, ceph-24, ceph-08, ceph-13 Version is "none" for these and there are no stats. Ceph versions reports only 4 MDSes out of the 12. 
8 are not shown at all: [root@gnosis ~]# ceph versions { "mon": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5 }, "mgr": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5 }, "osd": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282 }, "mds": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4 }, "overall": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296 } } Ceph status reports everything as up and OK: [root@gnosis ~]# ceph status cluster: id: e4ece518-f2cb-4708-b00f-b6bf511e91d9 health: HEALTH_OK services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w) mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26 mds: con-fs2:8 4 up:standby 8 up:active osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs data: pools: 14 pools, 25065 pgs objects: 2.14G objects, 3.7 PiB usage: 4.7 PiB used, 8.4 PiB / 13 PiB avail pgs: 79908208/18438361040 objects misplaced (0.433%) 23063 active+clean 1225 active+clean+snaptrim_wait 317 active+remapped+backfill_wait 250 active+remapped+backfilling 208 active+clean+snaptrim 2 active+clean+scrubbing+deep io: client: 596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr recovery: 8.7 GiB/s, 3.41k objects/s My first thought is that the status module failed. However, I don't manage to restart it (always on). An MGR fail-over did not help. Any ideas what is going on here? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MGR executes config rm all the time
Thanks for getting back. We are on octopus, looks like I can't do much about it. I hope it is just spam and harmless otherwise. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mykola Golub Sent: Sunday, September 10, 2023 3:04 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] MGR executes config rm all the time Hi Frank. It should be fixed in recent versions. See [1] and the related backport tickets. [1] https://tracker.ceph.com/issues/57726 On Sun, Sep 10, 2023 at 11:16 AM Frank Schilder wrote: > > Hi all, > > I recently started observing that our MGR seems to execute the same "config > rm" commands over and over again, in the audit log: > > 9/10/23 10:03:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' > cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: > dispatch > > 9/10/23 10:03:24 AM[INF]from='mgr.252911336 192.168.32.68:0/63' > entity='mgr.ceph-25' cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: > dispatch > > 9/10/23 10:03:19 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' > cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: > dispatch > > 9/10/23 10:03:19 AM[INF]from='mgr.252911336 192.168.32.68:0/63' > entity='mgr.ceph-25' cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: > dispatch > > 9/10/23 10:02:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' > cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: > dispatch > > We don't have mirroring, ts a single cluster. What is going on here and how > can I stop that? I already restarted all MGR daemons to no avail. 
> > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- Mykola Golub ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] MGR executes config rm all the time
Hi all,

I recently started observing that our MGR seems to execute the same "config rm" commands over and over again, in the audit log:

9/10/23 10:03:24 AM [INF] from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:24 AM [INF] from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:19 AM [INF] from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:03:19 AM [INF] from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:02:24 AM [INF] from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch

We don't have mirroring, it's a single cluster. What is going on here and how can I stop that? I already restarted all MGR daemons to no avail.

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
[ceph-users] MDS daemons don't report any more
Hi all,

I make a weird observation. 8 out of 12 MDS daemons seem not to report to the cluster any more:

# ceph fs status
con-fs2 - 1625 clients
===
RANK  STATE   MDS      ACTIVITY      DNS    INOS
 0    active  ceph-16  Reqs:   0 /s     0      0
 1    active  ceph-09  Reqs: 128 /s  4251k  4250k
 2    active  ceph-17  Reqs:   0 /s     0      0
 3    active  ceph-15  Reqs:   0 /s     0      0
 4    active  ceph-24  Reqs: 269 /s  3567k  3567k
 5    active  ceph-11  Reqs:   0 /s     0      0
 6    active  ceph-14  Reqs:   0 /s     0      0
 7    active  ceph-23  Reqs:   0 /s     0      0
        POOL           TYPE      USED  AVAIL
   con-fs2-meta1       metadata  2169G  7081G
   con-fs2-meta2       data          0  7081G
    con-fs2-data       data      1248T  4441T
con-fs2-data-ec-ssd    data       705G  22.1T
   con-fs2-data2       data      3172T  4037T
STANDBY MDS
  ceph-08
  ceph-10
  ceph-12
  ceph-13
VERSION                                                                           DAEMONS
None                                                                              ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)  ceph-09, ceph-24, ceph-08, ceph-13

Version is "none" for these and there are no stats. Ceph versions reports only 4 MDSes out of the 12. 8 are not shown at all:

[root@gnosis ~]# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
    }
}

Ceph status reports everything as up and OK:

[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
    mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.14G
objects, 3.7 PiB
    usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
    pgs:     79908208/18438361040 objects misplaced (0.433%)
             23063 active+clean
             1225  active+clean+snaptrim_wait
             317   active+remapped+backfill_wait
             250   active+remapped+backfilling
             208   active+clean+snaptrim
             2     active+clean+scrubbing+deep

  io:
    client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
    recovery: 8.7 GiB/s, 3.41k objects/s

My first thought is that the status module failed. However, I don't manage to restart it (it's always-on). An MGR fail-over did not help. Any ideas what is going on here?

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
[ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?
Hi,

I did something like that in the past. If you have a sufficient amount of cold data in general and you can bring the OSDs back with their original IDs, recovery was significantly faster than rebalancing. It really depends on how trivial the version update per object is. In my case it could re-use thousands of clean objects per dirty object. If you are unsure, it's probably best to do a wipe + rebalance.

What can take quite a while at the beginning is the osdmap update if they were down for such a long time. The first boot until they show up as "in" will take a while. Set norecover and norebalance until you see in the OSD log that they have the latest OSD map version.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Malte Stroem
Sent: Wednesday, September 6, 2023 4:16 PM
To: ceph-m...@rikdvk.mailer.me; ceph-users@ceph.io
Subject: [ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

Hi ceph-m...@rikdvk.mailer.me,

you could squeeze the OSDs back in, but it does not make sense. Just clean the disks with dd, for example, and add them as new disks to your cluster.

Best,
Malte

Am 04.09.23 um 09:39 schrieb ceph-m...@rikdvk.mailer.me:
> Hello,
>
> I have a ten node cluster with about 150 OSDs. One node went down a while
> back, several months. The OSDs on the node have been marked as down and out
> since.
>
> I am now in the position to return the node to the cluster, with all the OS
> and OSD disks. When I boot up the now working node, the OSDs do not start.
>
> Essentially, it seems to complain with "fail[ing] to load OSD map for
> [various epoch]s, got 0 bytes".
>
> I'm guessing the OSDs' on-disk maps are so old, they can't get back into the
> cluster?
>
> My questions are whether it's possible or worth it to try to squeeze these
> OSDs back in or to just replace them. And if I should just replace them,
> what's the best way? Manually remove [1] and recreate? Replace [2]? Purge in
> dashboard?
> > [1] > https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#removing-osds-manual > [2] > https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#replacing-an-osd > > Many thanks! > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: A couple OSDs not starting after host reboot
Hi Alison,

I have observed exactly that with OSDs "converted" from ceph-disk to ceph-volume. Someone thought it would be a great idea to store the /dev device name in the config instead of the uuid or any other stable device path:

# cat /etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json
{
    ...
    "block": {
        "path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
        "uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
    },
    ...
    "data": {
        "path": "/dev/sdm1",
        "uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
    },
    ...
}

Funnily enough, it has the by-uuid path stored as well, but the /dev path is actually used during activation. My "fix" is to re-generate the OSD JSON just before every ceph-disk OSD start.

You seem to be using LVM OSDs already, so this is a bit weird (it can't be the exact same issue). Still, I would not be surprised if you are bitten by something similar: some stored config (cache) overrides the actual drive location. It is really a blessing that the developers implemented a check that a partition actually points to the data with the correct OSD ID, otherwise our cluster would be rigged by now.

I would start by using low-level commands (ceph-volume) directly to see if the issue is low-level or sits in some higher-level interface. Log in to the OSD node and check what "ceph-volume inventory" says and whether you can manually activate/deactivate the OSD on disk (be careful to include the --no-systemd option everywhere to avoid unintended changes to persistent configurations).

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: apeis...@fnal.gov
Sent: Friday, August 25, 2023 10:29 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: A couple OSDs not starting after host reboot

Hi,

Thank you for your reply. I don't think the device names changed, but ceph seems to be confused about which device the OSD is on. It's reporting that there are 2 OSDs on the same device although this is not true.
ceph device ls-by-host | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX  sdu  osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX  sdu  osd.657

The osd.665 is actually on device sdm. Could this be the cause of the issue? Is there a way to correct it?

Thanks,
Alison
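The kind of pre-start fix-up Frank describes (re-generating the activation JSON so it no longer depends on a volatile /dev/sdX name) could be sketched like this. This is not Frank's actual script: the function name is mine, the file layout follows the JSON excerpt he shows, and whether /dev/disk/by-partuuid is the right stable symlink for a given data partition depends on how the OSD was created.

```python
import json
from pathlib import Path

def pin_data_path(json_path: str) -> None:
    """Rewrite the volatile "data.path" (e.g. /dev/sdm1) in a ceph-disk
    activation JSON to the stable by-partuuid symlink derived from the
    stored uuid, so a renamed device no longer breaks activation."""
    p = Path(json_path)
    cfg = json.loads(p.read_text())
    uuid = cfg["data"]["uuid"]
    stable = f"/dev/disk/by-partuuid/{uuid}"  # assumption: uuid is a partuuid
    if cfg["data"]["path"] != stable:
        cfg["data"]["path"] = stable
        p.write_text(json.dumps(cfg, indent=4))

# Hypothetical invocation against the file from the excerpt above:
# pin_data_path("/etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json")
```

Running something like this from a unit that orders itself before the OSD start would make the activation immune to device-name reshuffles after reboot.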
[ceph-users] Re: Client failing to respond to capability release
Hi Dhairya, this is the thing: the client appeared to be responsive and worked fine (file system was on-line and responsive as usual). There was something off though; see my response to Eugen. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dhairya Parmar Sent: Wednesday, August 23, 2023 9:05 AM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Client failing to respond to capability release Hi Frank, This usually happens when the client is buggy/unresponsive. This warning is triggered when the client fails to respond to the MDS's request to release caps in time, which is determined by session_timeout (defaults to 60 secs). Did you make any config changes? Dhairya Parmar Associate Software Engineer, CephFS Red Hat Inc. <https://www.redhat.com/> dpar...@redhat.com On Tue, Aug 22, 2023 at 9:12 PM Frank Schilder <fr...@dtu.dk> wrote: Hi all, I have this warning the whole day already (octopus latest cluster): HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301 mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877 mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887 mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695 If I look at the session info from mds.1 for these clients I see this: # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req:
.request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695 {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0} {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0} {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0} {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0} We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment. Why and what exactly is the MDS complaining about here? Thanks and best regards. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
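The settings discussed in this thread can be inspected directly; these are standard Ceph commands (the MDS name below matches the thread, adjust for your cluster). They are cluster-admin commands, so they only run against a live cluster:

```shell
# How long the MDS waits before flagging a client (see Dhairya's reply):
ceph config get mds session_timeout
# Floor below which the MDS should not ask clients to drop caps:
ceph config get mds mds_min_caps_per_client
# Full per-client session dump (JSON), the input to the jq filter above:
ceph tell mds.1 session ls
```

Comparing `num_caps` from `session ls` against `mds_min_caps_per_client`, as done above, is a reasonable first sanity check when this warning appears.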
[ceph-users] Re: Client failing to respond to capability release
Hi Eugen, thanks for that :D This time it was something different. Possibly a bug in the kclient. On these nodes I found sync commands stuck in D-state. I guess a file/dir was not possible to sync or there was some kind of corruption of buffer data. We had to reboot the servers to clear that out. On first inspection these clients looked OK. Only some deeper debugging revealed that something was off. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, August 23, 2023 8:55 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Client failing to respond to capability release Hi, pointing you to your own thread [1] ;-) [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HFILR5NMUCEZH7TJSGSACPI4P23XTULI/ Zitat von Frank Schilder : > Hi all, > > I have this warning the whole day already (octopus latest cluster): > > HEALTH_WARN 4 clients failing to respond to capability release; 1 > pgs not deep-scrubbed in time > [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to > capability release > mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 145698301 > mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 189511877 > mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 189511887 > mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 231250695 > > If I look at the session info from mds.1 for these clients I see this: > > # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: > .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, > caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep > -e 145698301 -e 189511877 -e 189511887 -e 231250695 > {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 > 
v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0} > {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 > v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0} > {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 > v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0} > {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 > v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0} > > We have mds_min_caps_per_client=4096, so it looks like the limit is > well satisfied. Also, the file system is pretty idle at the moment. > > Why and what exactly is the MDS complaining about here? > > Thanks and best regards. > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Client failing to respond to capability release
Hi all, I have this warning the whole day already (octopus latest cluster): HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301 mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877 mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887 mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695 If I look at the session info from mds.1 for these clients I see this: # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695 {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0} {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0} {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0} {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0} We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment. Why and what exactly is the MDS complaining about here? Thanks and best regards. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: snaptrim number of objects
Hi Angelo, was this cluster upgraded (major version upgrade) before these issues started? We observed similar problems after certain major-version upgrade paths, and the only way to fix them was to re-deploy all OSDs step by step. You can try a RocksDB compaction first. If that doesn't help, rebuilding the OSDs might be the only way out. You should also confirm that all ceph daemons are on the same version and that require-osd-release is reporting the same major version as well: ceph report | jq '.osdmap.require_osd_release' Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Angelo Hongens Sent: Saturday, August 19, 2023 9:58 AM To: Patrick Donnelly Cc: ceph-users@ceph.io Subject: [ceph-users] Re: snaptrim number of objects On 07/08/2023 18:04, Patrick Donnelly wrote: >> I'm trying to figure out what's happening to my backup cluster that >> often grinds to a halt when cephfs automatically removes snapshots. > > CephFS does not "automatically" remove snapshots. Do you mean the > snap_schedule mgr module? Yup. >> Almost all OSD's go to 100% CPU, ceph complains about slow ops, and >> CephFS stops doing client i/o. > > What health warnings do you see? You can try configuring snap trim: > > https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep Mostly a looot of SLOW_OPS. And I guess as a result of that MDS_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, MDS_SLOW_METADATA_IO, MDS_TRIM warnings. >> That won't explain why my cluster bogs down, but at least it gives >> some visibility. Running 17.2.6 everywhere by the way. > > Please let us know how configuring snaptrim helps or not. > When I set nosnaptrim, all I/O immediately restores. When I unset nosnaptrim, i/o stops again. One of the symptoms is that OSD's go to about 350% cpu per daemon. I got the feeling for a while that setting osd_snap_trim_sleep_ssd to 1 helped. I have 120 HDD osd's with wal/journal on ssd, does it even use this value?
Everything seemed stable, but eventually another few days passed, and suddenly removing a snapshot brought the cluster down again. So I guess that wasn't the cause. Now what I'm trying to do is set osd_max_trimming_pgs to 0 for all disks, and slowly setting it to 1 for a few osd's. This seems to work for a while, but still it brings the cluster down every now and then, and if not, the cluster is so slow it's almost unusable. This whole troubleshooting process is taking weeks. I just noticed that when 'the problem occurs', a lot of OSD's on a host (15 osd's per host) start using a lot of CPU, even though for example only 3 OSD's on this machine have their osd_max_trimming_pgs set to 1, the rest to 0. Disk doesn't seem to be the bottleneck. Restarting the daemons seems to solve the problem for a while, although the high cpu usage pops up on a different osd node every time. I am at a loss here. I'm almost thinking it's some kind of bug in the osd daemons, but I have no idea how to troubleshoot this. Angelo. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
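The checks and the compaction suggested at the top of this thread map to the following commands. These are standard Ceph tools, but the OSD id and data path are placeholders; they require a live cluster (or, for the offline variant, a stopped OSD), so they are shown as an admin sketch rather than a runnable script:

```shell
# All daemons on the same release?
ceph versions
# Does require-osd-release match the installed major version?
ceph report | jq '.osdmap.require_osd_release'
# Online RocksDB compaction, one OSD at a time to limit impact:
ceph tell osd.0 compact
# Offline alternative while the OSD daemon is stopped (path is an example):
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
```

Compacting one OSD first and watching latency/CPU before rolling through the rest keeps the blast radius small.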
[ceph-users] Re: MDS stuck in rejoin
Dear Xiubo, the nearfull pool is an RBD pool and has nothing to do with the file system. All pools for the file system have plenty of capacity. I think we have an idea what kind of workload caused the issue. We had a user run a computation that reads the same file over and over again. He started 100 such jobs in parallel and our storage servers were at 400% load. I saw 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. 11K. Clearly, most of this was served from RAM. It is possible that this extreme load situation triggered a race that remained undetected/unreported. There is literally no related message in any logs near the time the warning started popping up. It shows up out of nowhere. We asked the user to change his workflow to use local RAM disk for the input files. I don't think we can reproduce the problem anytime soon. About the bug fixes, I'm eagerly waiting for this and another one. Any idea when they might show up in distro kernels? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Tuesday, August 8, 2023 2:57 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin On 8/7/23 21:54, Frank Schilder wrote: > Dear Xiubo, > > I managed to collect some information. It looks like there is nothing in the > dmesg log around the time the client failed to advance its TID. I collected > short snippets around the critical time below. I have full logs in case you > are interested. Its large files, I will need to do an upload for that. > > I also have a dump of "mds session ls" output for clients that showed the > same issue later. Unfortunately, no consistent log information for a single > incident. 
> > Here the summary, please let me know if uploading the full package makes > sense: > > - Status: > > On July 29, 2023 > > ceph status/df/pool stats/health detail at 01:05:14: >cluster: > health: HEALTH_WARN > 1 pools nearfull > > ceph status/df/pool stats/health detail at 01:05:28: >cluster: > health: HEALTH_WARN > 1 clients failing to advance oldest client/flush tid > 1 pools nearfull Okay, then this could be the root cause. If the pool is nearfull, it could block flushing the journal logs to the pool, and then the MDS couldn't safely reply to the requests, blocking them like this. Could you fix the pool nearfull issue first and then check whether you see it again? > [...] > > On July 31, 2023 > > ceph status/df/pool stats/health detail at 10:36:16: >cluster: > health: HEALTH_WARN > 1 clients failing to advance oldest client/flush tid > 1 pools nearfull > >cluster: > health: HEALTH_WARN > 1 pools nearfull > > - client evict command (date, time, command): > > 2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457 > > We have a 1h time difference between the date stamp of the command and the > dmesg date stamps. However, there seems to be a weird 10min delay from > issuing the evict command until it shows up in dmesg on the client.
> > - dmesg: > > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3 > [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed > (con state OPEN) > [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed > (con state OPEN) > [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed > (con state OPEN) > [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22 This is a known bug and we have fixed it i
[ceph-users] Re: MDS stuck in rejoin
1 09:46:41 2023] ceph: mds1 closed our session [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start [Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset [Mon Jul 31 09:46:41 2023] libceph: reset on mds0 [Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start [Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset [Mon Jul 31 09:46:41 2023] libceph: reset on mds5 [Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start [Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, July 31, 2023 12:14 PM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin On 7/31/23 16:50, Frank Schilder wrote: > Hi Xiubo, > > its a kernel client. I actually made a mistake when trying to evict the > client and my command didn't do anything. I did another evict and this time > the client IP showed up in the blacklist. Furthermore, the warning > disappeared. I asked for the dmesg logs from the client node. Yeah, after the client's sessions are closed the corresponding warning should be cleared. Thanks > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Xiubo Li > Sent: Monday, July 31, 2023 4:12 AM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: MDS stuck in rejoin > > Hi Frank, > > On 7/30/23 16:52, Frank Schilder wrote: >> Hi Xiubo, >> >> it happened again. 
This time, we might be able to pull logs from the client >> node. Please take a look at my intermediate action below - thanks! >> >> I am in a bit of a calamity, I'm on holidays with terrible network >> connection and can't do much. My first priority is securing the cluster to >> avoid damage caused by this issue. I did an MDS evict by client ID on the >> MDS reporting the warning with the client ID reported in the warning. For >> some reason the client got blocked on 2 MDSes after this command, one of >> these is an ordinary stand-by daemon. Not sure if this is expected. >> >> Main question: is this sufficient to prevent any damaging IO on the cluster? >> I'm thinking here about the MDS eating through all its RAM until it crashes >> hard in an irrecoverable state (that was described as a consequence in an >> old post about this warning). If this is a safe state, I can keep it in this >> state until I return from holidays. > Yeah, I think so. > > BTW, are u using the kclients or user space clients ? I checked both > kclient and libcephfs, it seems buggy in libcephfs, which could cause > this issue. But for kclient it's okay till now. > > Thanks > > - Xiubo > >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Xiubo Li >> Sent: Friday, July 28, 2023 11:37 AM >> To: Frank Schilder; ceph-users@ceph.io >> Subject: Re: [ceph-users] Re: MDS stuck in rejoin >> >> >> On 7/26/23 22:13, Frank Schilder wrote: >>> Hi Xiubo. >>> >>>> ... I am more interested in the kclient side logs. Just want to >>>> know why that oldest request got stuck so long. >>> I'm afraid I'm a bad admin in this case. I don't have logs from the host >>> any more, I would have needed the output of dmesg and this is gone. In case >>> it happens again I will try to pull the info out. >>> >>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent >>> than our situation. 
We had no problems with the MDSes, the cache didn't >>> grow and the relevant one was also not put into read-only mode. It was just >>> this warning showing all the time, health was OK otherwise. I think the >>> warning was there for at least 16h before I failed the MDS. >>> >>> The MDS log contains nothing, this is the only line mentioning this client: >>> >>> 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] >>> : client.145678382 does not advance its oldest_clie
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, its a kernel client. I actually made a mistake when trying to evict the client and my command didn't do anything. I did another evict and this time the client IP showed up in the blacklist. Furthermore, the warning disappeared. I asked for the dmesg logs from the client node. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, July 31, 2023 4:12 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin Hi Frank, On 7/30/23 16:52, Frank Schilder wrote: > Hi Xiubo, > > it happened again. This time, we might be able to pull logs from the client > node. Please take a look at my intermediate action below - thanks! > > I am in a bit of a calamity, I'm on holidays with terrible network connection > and can't do much. My first priority is securing the cluster to avoid damage > caused by this issue. I did an MDS evict by client ID on the MDS reporting > the warning with the client ID reported in the warning. For some reason the > client got blocked on 2 MDSes after this command, one of these is an ordinary > stand-by daemon. Not sure if this is expected. > > Main question: is this sufficient to prevent any damaging IO on the cluster? > I'm thinking here about the MDS eating through all its RAM until it crashes > hard in an irrecoverable state (that was described as a consequence in an old > post about this warning). If this is a safe state, I can keep it in this > state until I return from holidays. Yeah, I think so. BTW, are u using the kclients or user space clients ? I checked both kclient and libcephfs, it seems buggy in libcephfs, which could cause this issue. But for kclient it's okay till now. 
Thanks - Xiubo > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Xiubo Li > Sent: Friday, July 28, 2023 11:37 AM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: MDS stuck in rejoin > > > On 7/26/23 22:13, Frank Schilder wrote: >> Hi Xiubo. >> >>> ... I am more interested in the kclient side logs. Just want to >>> know why that oldest request got stuck so long. >> I'm afraid I'm a bad admin in this case. I don't have logs from the host any >> more, I would have needed the output of dmesg and this is gone. In case it >> happens again I will try to pull the info out. >> >> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent >> than our situation. We had no problems with the MDSes, the cache didn't grow >> and the relevant one was also not put into read-only mode. It was just this >> warning showing all the time, health was OK otherwise. I think the warning >> was there for at least 16h before I failed the MDS. >> >> The MDS log contains nothing, this is the only line mentioning this client: >> >> 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] >> : client.145678382 does not advance its oldest_client_tid (16121616), 10 >> completed requests recorded in session > Okay, if so it's hard to say and dig out what has happened in client why > it didn't advance the tid. > > Thanks > > - Xiubo > > >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
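The evict-and-blacklist sequence described in this thread corresponds to the following commands. The flag names are standard for octopus (newer releases say "blocklist" instead of "blacklist"); the MDS name and client id are taken from the thread, and the address in the last line is a placeholder. These only run against a live cluster:

```shell
# Evict the client id from the warning on the MDS that reports it:
ceph tell mds.ceph-11 client evict id=145678457
# Confirm the client's address actually landed on the blacklist
# (the thread notes a first attempt that silently did nothing):
ceph osd blacklist ls
# After the client host has been rebooted/remounted, remove the entry:
ceph osd blacklist rm <addr:port/nonce>
```

Checking `blacklist ls` after the evict is the step that caught the failed first attempt here.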
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, it happened again. This time, we might be able to pull logs from the client node. Please take a look at my intermediate action below - thanks! I am in a bit of a calamity, I'm on holidays with terrible network connection and can't do much. My first priority is securing the cluster to avoid damage caused by this issue. I did an MDS evict by client ID on the MDS reporting the warning with the client ID reported in the warning. For some reason the client got blocked on 2 MDSes after this command, one of these is an ordinary stand-by daemon. Not sure if this is expected. Main question: is this sufficient to prevent any damaging IO on the cluster? I'm thinking here about the MDS eating through all its RAM until it crashes hard in an irrecoverable state (that was described as a consequence in an old post about this warning). If this is a safe state, I can keep it in this state until I return from holidays. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Friday, July 28, 2023 11:37 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin On 7/26/23 22:13, Frank Schilder wrote: > Hi Xiubo. > >> ... I am more interested in the kclient side logs. Just want to >> know why that oldest request got stuck so long. > I'm afraid I'm a bad admin in this case. I don't have logs from the host any > more, I would have needed the output of dmesg and this is gone. In case it > happens again I will try to pull the info out. > > The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent > than our situation. We had no problems with the MDSes, the cache didn't grow > and the relevant one was also not put into read-only mode. It was just this > warning showing all the time, health was OK otherwise. I think the warning > was there for at least 16h before I failed the MDS. 
> > The MDS log contains nothing, this is the only line mentioning this client: > > 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : > client.145678382 does not advance its oldest_client_tid (16121616), 10 > completed requests recorded in session Okay, if so it's hard to say and dig out what has happened in client why it didn't advance the tid. Thanks - Xiubo > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo. > ... I am more interested in the kclient side logs. Just want to > know why that oldest request got stuck so long. I'm afraid I'm a bad admin in this case. I don't have logs from the host any more, I would have needed the output of dmesg and this is gone. In case it happens again I will try to pull the info out. The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent than our situation. We had no problems with the MDSes, the cache didn't grow and the relevant one was also not put into read-only mode. It was just this warning showing all the time, health was OK otherwise. I think the warning was there for at least 16h before I failed the MDS. The MDS log contains nothing, this is the only line mentioning this client: 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 10 completed requests recorded in session Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, I seem to have gotten your e-mail twice. It's a very old kclient. It was in that state when I came to work in the morning and I looked at it in the afternoon. Was hoping the problem would clear by itself. It was probably a compute job that crashed it, it's a compute node in our HPC cluster. I didn't look at dmesg, but I can take a look at the MDS log when I'm back at work. I guess you are interested in the time before the warning started appearing. The MDS was probably stuck 5-10 minutes. This is very long on our cluster. An MDS failover usually takes 30s only. Also I couldn't see any progress in the MDS log, that's why I decided to fail the rank a second time. During the time it was stuck in rejoin it didn't log any other messages than the MDS map update messages. I don't think it was doing anything at all. Some background: Before I started recovery I found an old post stating that the MDS_CLIENT_OLDEST_TID warning is quite serious, because the affected MDS will experience growing cache allocation and eventually fail in a practically irrecoverable state. I did not observe unusual RAM consumption and there were no MDS large cache messages either. Seems like our situation was of a more harmless nature. Still, the fail did not go entirely smoothly. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Friday, July 21, 2023 1:32 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: MDS stuck in rejoin On 7/20/23 22:09, Frank Schilder wrote: > Hi all, > > we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients > failing to advance oldest client/flush tid". I looked at the client and there > was nothing going on, so I rebooted it. After the client was back, the > message was still there. To clean this up I failed the MDS. Unfortunately, > the MDS that took over remained stuck in rejoin without doing anything. > All that happened in the log was: BTW, are you using the kclient or user space client ?
How long was the MDS stuck in rejoin state? This means that on the client side the oldest request has been stuck too long; maybe under heavy load too many requests were generated in a short time and the oldest request was stuck too long in the MDS. > [root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log > 2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start > 2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start > 2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896614 from mon.4 > 2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896615 from mon.4 > [...] > 2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896653 from mon.4 > 2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896654 from mon.4 Did you see any slow request logs in the mds log files? And any other suspect logs from dmesg if it's a kclient? > After some time I decided to give another fail a try and, this time, the > replacement daemon went to active state really fast. > > If I have a message like the above, what is the clean way of getting the > client clean again (version: 15.2.17 > (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))? I think your steps are correct. Thanks - Xiubo > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io
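The fail-and-watch procedure used in this thread boils down to a few standard commands. The rank and daemon names below are taken from the log excerpt (mds.2, daemon ceph-10); adjust for your cluster. Cluster-admin commands only, not runnable standalone:

```shell
# Identify which rank is stuck in "rejoin":
ceph fs status
# Fail that rank so a standby takes over (rank 2 in the log above):
ceph mds fail 2
# Watch the replacement daemon's state until it reaches "active":
ceph tell mds.ceph-10 status
```

As the thread shows, a second `ceph mds fail` is sometimes needed when the first replacement also gets stuck in rejoin.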
[ceph-users] MDS stuck in rejoin
Hi all, we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was: [root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log 2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start 2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start 2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4 2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4 [...] 2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4 2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4 After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast. If I have a message like the above, what is the clean way of getting the client clean again (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))? Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD memory usage after cephadm adoption
Hi all, now that host masks seem to work, could somebody please shed some light on the relative priority of these settings: ceph config set osd memory_target X ceph config set osd/host:A memory_target Y ceph config set osd/class:B memory_target Z Which one wins for an OSD on host A in class B? Similar for an explicit ID. The expectation is that a setting for OSD.ID always wins. Then the masked values, then the generic osd setting, then the globals and last the defaults. The relative precedence of masked values is not defined anywhere, nor the precedence in general. This is missing even in the latest docs: https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#sections-and-masks . Would be great if someone could add this. Thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Luis Domingues Sent: Monday, July 17, 2023 9:36 AM To: Sridhar Seshasayee Cc: Mark Nelson; ceph-users@ceph.io Subject: [ceph-users] Re: OSD memory usage after cephadm adoption It looks indeed to be that bug that I hit. Thanks. Luis Domingues Proton AG --- Original Message --- On Monday, July 17th, 2023 at 07:45, Sridhar Seshasayee wrote: > Hello Luis, > > Please see my response below: > > But when I took a look on the memory usage of my OSDs, I was below of that > > > value, by quite a bit. Looking at the OSDs themselves, I have: > > > > "bluestore-pricache": { > > "target_bytes": 4294967296, > > "mapped_bytes": 1343455232, > > "unmapped_bytes": 16973824, > > "heap_bytes": 1360429056, > > "cache_bytes": 2845415832 > > }, > > > > And if I get the running config: > > "osd_memory_target": "4294967296", > > "osd_memory_target_autotune": "true", > > "osd_memory_target_cgroup_limit_ratio": "0.80", > > > > Which is not the value I observe from the config. I have 4294967296 > > instead of something around 7219293672. Did I miss something? > > This is very likely due to https://tracker.ceph.com/issues/48750.
> The fix was recently merged into the main branch and should be backported
> soon all the way to pacific.
>
> Until then, the workaround would be to set the osd_memory_target on each
> OSD individually to the desired value.
>
> -Sridhar
> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
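The precedence expectation from the question above (an OSD.ID setting always wins, then masked values, then the generic osd section, then globals, then defaults) can be written down as a small resolver. This is a sketch of the expected order, not documented behaviour; in particular the host-mask-before-class-mask tie-break is a pure assumption:

```python
def effective_value(settings, osd_id, host, dev_class):
    """Resolve which memory_target wins for one OSD.

    Precedence modelled here (the expectation from the mail, NOT a
    documented guarantee; the host-mask vs class-mask order is an
    assumption):
      osd.<id>  >  osd/host:X  >  osd/class:Y  >  osd  >  global
    """
    for key in (f"osd.{osd_id}", f"osd/host:{host}",
                f"osd/class:{dev_class}", "osd", "global"):
        if key in settings:
            return settings[key]
    return None

cfg = {"osd": 4_000_000_000,
       "osd/host:A": 6_000_000_000,
       "osd/class:B": 5_000_000_000}
# an OSD on host A in class B: the host mask wins under this model
print(effective_value(cfg, 7, "A", "B"))   # → 6000000000
```

An empirical check against a live cluster would be `ceph config get osd.<id> osd_memory_target`, which reports the value the daemon actually resolves.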
[ceph-users] Re: Cluster down after network outage
Hi all, trying to answer Dan and Stefan in one go. The OSDs were not booting (understood as the ceph-osd daemon starting up), they were running all the time. If they actually boot, they come up quite fast even though they are HDD and collocated. The time to be marked up is usually 30-60s after start, completely fine with me. We are not affected by pglog-dup accumulation or other issues causing excessive boot time and memory consumption. On the affected part of the cluster we have 972 OSDs running on 900 disks, 48 SSDs with 1 or 4 OSDs depending on size and 852 HDDs with 1 OSD. All OSDs were up and running, but network to their hosts was down for a few hours. After network came up, a peering-recover-remap frenzy broke loose and we observed the death spiral described in this old post: https://narkive.com/KAzvjjPc.4 Specifically, just setting nodown didn't help, I guess our cluster is simply too large. I really had to stop recovery as well (norebalance had no effect). I had to do a bit of guessing when all OSDs were both marked up and seen up at the same time (not trivial when the nodown flag is present), but after all OSDs were seen up for a while and the initial peering was over, enabling recovery worked as expected and didn't lead to further OSDs being seen as down again. In this particular situation, when the flag nodown is set, it would be extremely helpful to have an extra piece of info in the OSD part of ceph status showing the number "seen up", so that it becomes easier to be sure that an OSD that was marked up some time ago is not "seen down" again without being marked down due to the flag. This was really critical for getting the cluster back. During all this, no parts of the hardware were saturated. However, the way ceph handles such a simultaneous OSD mass-return-to-life event seems not very clever.
As soon as some PGs manage to peer into an active state, they start with recovery irrespective of whether or not they are remappings and the missing OSDs are actually there and waiting to be marked up. This in turn increases the load on the daemons unnecessarily. On top of this impatience-induced load amplification comes that the operations scheduler is not good at pushing high-priority messages like MON heartbeats through to busy daemons. As a consequence, one has a constant up-and-down marking of OSDs even though the OSDs are actually running fine and nothing crashed. This implies constant re-peering of all PGs and accumulation of additional pglog history for any successful recovery or remapped write. The better strategy in such a situation would actually be to wait for every single OSD to be marked up, start peering and only then look if something needs recovery. Trying to do this all at the same time is just a mess. In this particular case, setting osd_recovery_delay_start to a high value looks very promising to help dealing with such a situation in a bit more hands-free way - if this option does something in Octopus. I can see it available in the ceph config options on the command line but not in the docs (it's implemented but not documented). Thanks for bringing this to my attention! Looking at the time it took for all OSDs to come up and peering to complete, 10-30 minutes might do the job for our cluster. 10-30 minutes on an HDD pool after simultaneously marking 900 OSDs up is actually a good time. I assume that recovery for a PG will be postponed whenever this PG (or one of its OSDs) has a peering event? When looking at the sequence of events, I would actually also like an osd_up_peer_delay_start to postpone peering a bit in case several OSDs are starting up to avoid redundant re-peering.
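The "wait until every OSD is seen up and stays up" step can be scripted around `ceph osd dump --format json`. A minimal sketch, assuming the dump carries an `osds` list with numeric `up`/`in` flags as in the Octopus JSON output:

```python
import json
import subprocess
import time

def all_up(osd_dump):
    """True if every 'in' OSD is marked up in an 'osd dump' JSON blob."""
    return all(o["up"] == 1 for o in osd_dump["osds"] if o["in"] == 1)

def wait_until_stable(fetch, hold=300, poll=10):
    """Block until all OSDs have been continuously up for `hold` seconds;
    only then is it reasonably safe to unset norecover/norebalance."""
    up_since = None
    while True:
        if all_up(fetch()):
            up_since = up_since or time.time()
            if time.time() - up_since >= hold:
                return
        else:
            up_since = None   # an OSD dropped out, restart the clock
        time.sleep(poll)

def fetch_live():
    """Fetch the current OSD map from a live cluster."""
    return json.loads(subprocess.check_output(
        ["ceph", "osd", "dump", "--format", "json"]))

# Usage: wait_until_stable(fetch_live), then unset the recovery flags.
```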
The ideal recovery sequence after a mass-down of OSDs really seems to be to do each of these as atomic steps:
- get all OSDs up
- peer all PGs
- start recovery if needed

Being able to delay peering after an OSD comes up would give some space for the "I'm up" messages to arrive with priority at the MONs. Then the PGs have time to say "I'm active+clean" and only then will the trash be taken out. Thanks for replying to my messages and for pointing me to osd_recovery_delay_start. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Dan van der Ster Sent: Wednesday, July 12, 2023 6:58 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: Cluster down after network outage

On Wed, Jul 12, 2023 at 1:26 AM Frank Schilder <fr...@dtu.dk> wrote: Hi all, one problem solved, another coming up. For everyone ending up in the same situation, the trick seems to be to get all OSDs marked up and then allow recovery. Steps to take:
- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs
[ceph-users] Re: Cluster down after network outage
Answering myself for posterity. The rebalancing list disappeared after waiting even longer. Might just have been an MGR that needed to catch up. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Frank Schilder Sent: Wednesday, July 12, 2023 10:25 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Cluster down after network outage

Hi all, one problem solved, another coming up. For everyone ending up in the same situation, the trick seems to be to get all OSDs marked up and then allow recovery. Steps to take:
- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs become responsive
- unset nodown, noout

Now the new problem. I now have an ever-growing list of OSDs listed as rebalancing, but nothing is actually rebalancing. How can I stop this growth and how can I get rid of this list:

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            noout flag(s) set
            Slow OSD heartbeats on back (longest 634775.858ms)
            Slow OSD heartbeats on front (longest 635210.412ms)
            1 pools nearfull
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 6m)
    mgr: ceph-25(active, since 57m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1258 up (since 24m), 1258 in (since 45m)
         flags noout
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.97G objects, 3.5 PiB
    usage:   4.4 PiB used, 8.7 PiB / 13 PiB avail
    pgs:     25028 active+clean
             30    active+clean+scrubbing+deep
             7     active+clean+scrubbing
  io:
    client: 1.3 GiB/s rd, 718 MiB/s wr, 7.71k op/s rd, 2.54k op/s wr
  progress:
    Rebalancing after osd.135 marked in (1s) [=...]
    Rebalancing after osd.69 marked in (2s) []
    Rebalancing after osd.75 marked in (2s) [===.]
    Rebalancing after osd.173 marked in (2s) []
    Rebalancing after osd.42 marked in (1s) [=...] (remaining: 2s)
    Rebalancing after osd.104 marked in (2s) []
    Rebalancing after osd.82 marked in (2s) []
    Rebalancing after osd.107 marked in (2s) [===.]
    Rebalancing after osd.19 marked in (2s) [===.]
    Rebalancing after osd.67 marked in (2s) [=...]
    Rebalancing after osd.46 marked in (2s) [===.] (remaining: 1s)
    Rebalancing after osd.123 marked in (2s) [===.]
    Rebalancing after osd.66 marked in (2s) []
    Rebalancing after osd.12 marked in (2s) [==..] (remaining: 2s)
    Rebalancing after osd.95 marked in (2s) [=...]
    Rebalancing after osd.134 marked in (2s) [===.]
    Rebalancing after osd.14 marked in (1s) [===.]
    Rebalancing after osd.56 marked in (2s) [=...]
    Rebalancing after osd.143 marked in (1s) []
    Rebalancing after osd.118 marked in (2s) [===.]
    Rebalancing after osd.96 marked in (2s) []
    Rebalancing after osd.105 marked in (2s) [===.]
    Rebalancing after osd.44 marked in (1s) [===.] (remaining: 5s)
    Rebalancing after osd.41 marked in (1s) [==..] (remaining: 1s)
    Rebalancing after osd.9 marked in (2s) [=...] (remaining: 37s)
    Rebalancing after osd.58 marked in (2s) [==..] (remaining: 8s)
    Rebalancing after osd.140 marked in (1s) [===.]
    Rebalancing after osd.132 marked in (2s) []
    Rebalancing after osd.31 marked in (1s) [=...]
    Rebalancing after osd.110 marked in (2s) []
    Rebalancing after osd.21 marked in (2s) [=...]
    Rebalancing after osd.114 marked in (2s) [===.]
    Rebalancing after osd.83 marked in (2s) [===.]
    Rebalancing after osd.23 marked in (1s) [===.]
    Rebalancing after osd.25 marked in (1s) [==..]
    Rebalancing after osd.147 marked in (2s) []
    Rebalancing after osd.62
) []
    Rebalancing after osd.112 marked in (1s) [=...]
    Rebalancing after osd.154 marked in (1s) [=...]
    Rebalancing after osd.64 marked in (2s) [===.]
    Rebalancing after osd.34 marked in (2s) [] (remaining: 90s)
    Rebalancing after osd.161 marked in (1s) []
    Rebalancing after osd.160 marked in (2s) [===.]
    Rebalancing after osd.142 marked in (2s) [===.]

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Frank Schilder Sent: Wednesday, July 12, 2023 9:53 AM To: ceph-users@ceph.io Subject: [ceph-users] Cluster down after network outage

Hi all, we had a network outage tonight (power loss) and restored network in the morning. All OSDs were running during this period. After restoring network peering hell broke loose and the cluster has a hard time coming back up again. OSDs get marked down all the time and come back later. Peering never stops. Below is the current status, I had all OSDs shown as up for a while, but many were not responsive. Are there some flags that help bringing things up in a sequence that causes less overload on the system?

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            6 MDSs report slow metadata IOs
            3 MDSs report slow requests
            nodown,noout,nobackfill,norecover flag(s) set
            176 osds down
            Slow OSD heartbeats on back (longest 551718.679ms)
            Slow OSD heartbeats on front (longest 549598.330ms)
            Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
            Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
            1 pools nearfull
            17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
    mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
         flags nodown,noout,nobackfill,norecover
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.91G objects, 3.4 PiB
    usage:   3.1 PiB used, 6.0 PiB / 9.0 PiB avail
    pgs:     0.626% pgs unknown
             31.566% pgs not active
             1187354920/16402772667 objects degraded (7.239%)
             51/16402772667 objects misplaced (0.000%)
             11706 active+clean
             4752  active+undersized+degraded
             3286  down
             2702  peering
             799   undersized+degraded+peered
             464   stale+down
             418   stale+active+undersized+degraded
             214   remapped+peering
             157   unknown
             128   stale+peering
             117   stale+remapped+peering
             101   stale+undersized+degraded+peered
             57    stale+active+undersized+degraded+remapped+backfilling
             35    down+remapped
             26    stale+undersized+degraded+remapped+backfilling+peered
             23    undersized+degraded+remapped+backfilling+peered
             14    active+clean+scrubbing+deep
             9     stale+active+undersized+degraded+remapped+backfill_wait
             7     active+recovering+undersized+degraded
             7     stale+active+recovering+undersized+degraded
             6     active+undersized+degraded+remapped+backfilling
             6     active+undersized
             5     active+undersized+degraded+remapped+backfill_wait
             5     stale+remapped
             4     stale+activating+undersized+degraded
             3     active+undersized+remapped
             3     stale+undersized+degraded+remapped+backfill_wait+peered
             1     activating+undersized+degraded
             1     activating+undersized+degraded+remapped
             1     undersized+degraded+remapped+backfill_wait+peered
             1     stale+active+clean
             1     active+recovering
             1     stale+down+remapped
             1     undersized+peered
             1     active+undersized+degraded+remapped
             1     active+clean+scrubbing
             1     active+clean+remapped
             1     active+recovering+degraded
  io:
    client: 1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr

Thanks for any hints! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 1 pg inconsistent and does not recover
Hi Stefan, after you wrote that you issue hundreds of deep-scrub commands per day I was already suspecting something like > [...] deep_scrub daemon requests a deep-scrub [...] It's not a minion you hired who types these commands every so many seconds by hand and hopes for the best. My guess is rather that you are actually running a cluster specifically configured for manual scrub scheduling, as was discussed in a thread some time ago to solve the problem of "not deep scrubbed in time" messages due to the built-in scrub scheduler not using the last-scrubbed timestamp for priority (among other things). On such a system I would not be surprised that these commands have their desired effect. To know why it works for you it would be helpful to disclose the whole story, for example what ceph config parameters are active within the context of your daemon executing the deep-scrub instructions and what other ceph commands surround it, in the same way that the pg repair is surrounded by injectargs instructions in the script I posted. It doesn't work like that on a ceph cluster with default config. For example, on our cluster there is a very high likelihood that at least one OSD of any PG is part of a scrub at any time already. In that case, if a PG is not eligible for scrubbing because one of its OSDs already has osd_max_scrubs (default: 1) scrubs running, the reservation has no observable effect. Some time ago I had a ceph-users thread discussing exactly that: I wanted to increase the number of concurrent scrubs running without increasing osd_max_scrubs. The default scheduler seems to be very poor at ordering scrubs in such a way that a maximum number of PGs is scrubbed at any given time. One of the suggestions was to run manual scheduling, which seems to be exactly what you are doing. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Stefan Kooman Sent: Wednesday, June 28, 2023 2:17 PM To: Frank Schilder; Alexander E.
Patrakov; Niklas Hambüchen Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: 1 pg inconsistent and does not recover On 6/28/23 10:45, Frank Schilder wrote: > Hi Stefan, > > we run Octopus. The deep-scrub request is (immediately) cancelled if the > PG/OSD is already part of another (deep-)scrub or if some peering happens. As > far as I understood, the commands osd/pg deep-scrub and pg repair do not > create persistent reservations. If you issue this command, when does the PG > actually start scrubbing? As soon as another one finishes or when it is its > natural turn? Do you monitor the scrub order to confirm it was the manual > command that initiated a scrub? We request a deep-scrub ... a few seconds later it starts deep-scrubbing. We do not verify in this process if the PG really did start, but they do. See example from a PG below: Jun 27 22:59:50 mon1 pg_scrub[2478540]: [27-06-2023 22:59:34] Scrub PG 5.48a (last deep-scrub: 2023-06-16T22:54:58.684038+0200) ^^ deep_scrub daemon requests a deep-scrub, based on latest deep-scrub timestamp. After a couple of minutes it's deep-scrubbed. See below the deep-scrub timestamp (info from a PG query of 5.48a): "last_deep_scrub_stamp": "2023-06-27T23:06:01.823894+0200" We have been using this in Octopus (actually since Luminous, but in a different way). Now we are on Pacific. Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
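For reference, the scheduling policy Stefan describes can be sketched in a few lines: order PGs by their last deep-scrub stamp and only dispatch `ceph pg deep-scrub` for PGs whose acting OSDs are all idle, which models osd_max_scrubs = 1. A toy version with PG records as (pgid, stamp, acting set) tuples; the real data would come from `ceph pg dump`:

```python
def schedule_deep_scrubs(pgs, busy_osds=frozenset()):
    """Return pg ids to deep-scrub now, oldest stamp first, never
    using an OSD twice (models osd_max_scrubs = 1 per OSD)."""
    busy = set(busy_osds)
    chosen = []
    for pgid, stamp, acting in sorted(pgs, key=lambda p: p[1]):
        if busy.isdisjoint(acting):
            chosen.append(pgid)
            busy.update(acting)   # these OSDs are now reserved
    return chosen

pgs = [("5.48a", "2023-06-16T22:54:58", (1, 2, 3)),
       ("5.100", "2023-06-10T08:00:00", (3, 4, 5)),
       ("5.200", "2023-06-12T12:00:00", (6, 7, 8))]
# 5.48a is skipped this round: osd.3 is already reserved by 5.100
print(schedule_deep_scrubs(pgs))   # → ['5.100', '5.200']
```

The daemon would then issue `ceph pg deep-scrub <pgid>` for each returned id and re-run the selection once scrubs complete.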
[ceph-users] Re: 1 pg inconsistent and does not recover
A repair always includes a deep-scrub, so you can use a repair in place of a deep-scrub if you want. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Niklas Hambüchen Sent: Wednesday, June 28, 2023 3:05 PM To: Frank Schilder; Alexander E. Patrakov Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: 1 pg inconsistent and does not recover Hi Frank, > The response to that is not to try manual repair but to issue a deep-scrub. I am a bit confused, because in your script you do issue "ceph pg repair", not a scrub. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io