[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, so it is partly "unexpectedly expected" and partly buggy. I really wish the crush implementation honoured a few obvious invariants. It is extremely counter-intuitive that mappings taken from a sub-set change even if both the sub-set and the mapping instructions themselves don't.

> - Use different root names

That's what we are doing and it works like a charm, also for draining OSDs.

> more specific crush rules.

I guess you mean using something like "step take DCA class hdd" instead of "step take default class hdd", as in:

rule rule-ec-k7m11 {
    id 1
    type erasure
    min_size 3
    max_size 18
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take DCA class hdd
    step chooseleaf indep 9 type host
    step take DCB class hdd
    step chooseleaf indep 9 type host
    step emit
}

According to the documentation, this should actually work and be almost equivalent to your crush rule. The difference is that it will make sure that the first 9 shards are from DCA and the second 9 shards from DCB (it's an ordering). A side effect is that all primary OSDs will be in DCA if both DCs are up. I remember people asking for that as a feature in multi-DC set-ups, to have the primary OSDs in the DC with the lowest latency by default.

Can you give this crush rule a try and report back whether or not the behaviour when adding hosts changes? In case you have time, it would be great if you could collect information on (reproducing) the fatal peering problem. While remappings might be "unexpectedly expected", it is clearly a serious bug that incomplete and unknown PGs show up in the process of adding hosts at the root.
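To make the ordering property concrete, here is a toy sketch in plain Python (made-up host names; this mimics only the *ordering* implied by the two "step take" blocks, not the real CRUSH placement algorithm): the first 9 shards are drawn from DCA and the next 9 from DCB, so the primary, i.e. the first entry of the mapping, is always in DCA while both DCs are up.

```python
import random

# Hypothetical host names, purely for illustration.
DCA_HOSTS = [f"dca-host{i}" for i in range(1, 10)]
DCB_HOSTS = [f"dcb-host{i}" for i in range(1, 10)]

def map_pg(pg_seed: int) -> list:
    """Emulate: take DCA -> chooseleaf indep 9, then take DCB -> chooseleaf indep 9."""
    rng = random.Random(pg_seed)
    return rng.sample(DCA_HOSTS, 9) + rng.sample(DCB_HOSTS, 9)

for pg in range(100):
    mapping = map_pg(pg)
    assert len(set(mapping)) == 18   # 9 distinct hosts per DC
    assert mapping[0] in DCA_HOSTS   # primary always lands in DCA
```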
Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: Friday, May 24, 2024 2:51 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

I start to think that the root cause of the remapping is just the fact that the crush rule(s) contain(s) the "step take default" line:

step take default class hdd

My interpretation is that crush simply tries to honor the rule: consider everything underneath the "default" root, so PGs get remapped if new hosts are added there (but not in their designated subtree buckets). The effect (unknown PGs) is bad, but there are a couple of options to avoid that:

- Use different root names and/or more specific crush rules.
- Use host spec file(s) to place new hosts directly where they belong.
- Set osd_crush_initial_weight = 0 to avoid remapping until everything is where it's supposed to be, then reweight the OSDs.

Zitat von Eugen Block:

> Hi Frank,
>
> thanks for looking up those trackers. I haven't looked into them
> yet, I'll read your response in detail later, but I wanted to add
> some new observation:
>
> I added another root bucket (custom) to the osd tree:
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> -12         0        root custom
>  -1         0.27698  root default
>  -8         0.09399      room room1
>  -3         0.04700          host host1
>   7    hdd  0.02299              osd.7    up       1.0      1.0
>  10    hdd  0.02299              osd.10   up       1.0      1.0
> ...
>
> Then I tried this approach to add a new host directly to the
> non-default root:
>
> # cat host5.yaml
> service_type: host
> hostname: host5
> addr: 192.168.168.54
> location:
>   root: custom
> labels:
> - osd
>
> # ceph orch apply -i host5.yaml
>
> # ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
> -12         0.04678  root custom
> -23         0.04678      host host5
>   1    hdd  0.02339          osd.1    up       1.0      1.0
>  13    hdd  0.02339          osd.13   up       1.0      1.0
>  -1         0.27698  root default
>  -8         0.09399      room room1
>  -3         0.04700          host host1
>   7    hdd  0.02299              osd.7    up       1.0      1.0
>  10    hdd  0.02299              osd.10   up       1.0      1.0
> ...
>
> host5 is placed directly underneath the new custom root correctly,
> but not a single PG is marked "remapped"! So this is actually what I
> (or we) expected. I'm not sure yet what to make of it, but I'm
> leaning towards using this approach in the future: add hosts
> underneath a different root first, and then move them to their
> designated location.
>
> Just to validate again, I added host6 without a location spec, so
> it's placed underneath the default root again:
>
> # ceph osd tree
> ID CLASS WEIGHT TYPE NAME STATUS REW
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, just to add another strange observation from long ago: https://www.spinics.net/lists/ceph-users/msg74655.html. I didn't see any reweights in your trees, so it's something else. However, there seem to be multiple issues with EC pools and peering.

I also want to clarify:

> If this is the case, it is possible that this is partly intentional and
> partly buggy.

"Partly intentional" here means the code behaviour changes when you add OSDs to the root outside the rooms and this change is not considered a bug. It is clearly *not* expected, as it means you cannot do maintenance on a pool living on a tree A without affecting pools on the same device class living on an unmodified subtree of A.

From a ceph user's point of view everything you observe looks buggy. I would really like to see a good explanation why the mappings in the subtree *should* change when adding OSDs above that subtree as in your case, when the expectation, for good reasons, is that they don't. This would help devising clean procedures for adding hosts when you (and I) want to add OSDs first without any peering and then move OSDs into place, so that peering happens separately from adding and not as a total mess with everything in parallel.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

____
From: Frank Schilder
Sent: Thursday, May 23, 2024 6:32 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi Eugen, I'm at home now. Could you please check all the remapped PGs that they have no shards on the new OSDs, i.e. that it's just shuffling around mappings within the same set of OSDs under the rooms? If this is the case, it is possible that this is partly intentional and partly buggy.
The remapping is then probably intentional, and the method I use with a disjoint tree for new hosts prevents such remappings initially (the crush code sees the new OSDs in the root and doesn't use them, but their presence does change choice orders, resulting in remapped PGs). However, the unknown PGs should clearly not occur. I'm afraid that the peering code has quite a few bugs; I reported something at least similarly weird a long time ago: https://tracker.ceph.com/issues/56995 and https://tracker.ceph.com/issues/46847. Might even be related. It looks like peering can lose track of PG members in certain situations (specifically, after adding OSDs until rebalancing completed). In my cases, I get degraded objects even though everything is obviously still around. Flipping between the crush-maps before/after the change re-discovers everything again. Issue 46847 is long-standing and still unresolved. In case you need to file a tracker, please consider referring to the two above as "might be related" if you deem them related.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
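The check asked for above (do the remapped PGs really avoid the new OSDs?) is easy to script. A minimal sketch with made-up PG-to-OSD maps (in practice you would feed it two snapshots of the up/acting sets from `ceph pg dump`; the PG IDs and OSD numbers here are hypothetical):

```python
# Compare a before/after snapshot of PG mappings: which PGs were
# remapped, and did any of them pick up a shard on the new OSDs?
before = {"6.8": [7, 6, 11], "6.9": [3, 5, 9], "6.a": [2, 8, 4]}
after  = {"6.8": [6, 7, 11], "6.9": [3, 5, 9], "6.a": [2, 8, 4]}
new_osds = {20, 21}  # OSDs added at the root

remapped = sorted(pg for pg in before if before[pg] != after[pg])
on_new = sorted(pg for pg in remapped if set(after[pg]) & new_osds)

print(remapped)  # ['6.8'] -> mapping changed
print(on_new)    # []      -> but only reshuffled within the old OSD set
```

An empty second list means a pure reshuffle within the old set of OSDs, i.e. the "partly intentional" case described in the mail.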
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, I'm at home now. Could you please check all the remapped PGs that they have no shards on the new OSDs, i.e. that it's just shuffling around mappings within the same set of OSDs under the rooms? If this is the case, it is possible that this is partly intentional and partly buggy. The remapping is then probably intentional, and the method I use with a disjoint tree for new hosts prevents such remappings initially (the crush code sees the new OSDs in the root and doesn't use them, but their presence does change choice orders, resulting in remapped PGs). However, the unknown PGs should clearly not occur. I'm afraid that the peering code has quite a few bugs; I reported something at least similarly weird a long time ago: https://tracker.ceph.com/issues/56995 and https://tracker.ceph.com/issues/46847. Might even be related. It looks like peering can lose track of PG members in certain situations (specifically, after adding OSDs until rebalancing completed). In my cases, I get degraded objects even though everything is obviously still around. Flipping between the crush-maps before/after the change re-discovers everything again. Issue 46847 is long-standing and still unresolved. In case you need to file a tracker, please consider referring to the two above as "might be related" if you deem them related.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: does the RBD client block write when the Watcher times out?
Hi, we ran into the same issue, and there is actually another use case: live-migration of VMs. This requires an RBD image being mapped to two clients simultaneously, so this is intentional. If multiple clients map an image in RW mode, the ceph back-end will cycle the write lock between the clients to allow each of them to flush writes; this is intentional too. Coordinating access is the job of the orchestrator. In the live-migration case specifically, it's explicitly managing a write lock such that writes occur in the correct order. It's not a ceph job, it's an orchestration job. The rbd interface just provides the tools to do it; for example, you can attach information that helps you hunt down dead-looking clients and kill them properly before mapping an image somewhere else.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Ilya Dryomov
Sent: Thursday, May 23, 2024 2:05 PM
To: Yuma Ogami
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: does the RBD client block write when the Watcher times out?

On Thu, May 23, 2024 at 4:48 AM Yuma Ogami wrote:
>
> Hello.
>
> I'm currently verifying the behavior of RBD on failure. I'm wondering
> about the consistency of RBD images after network failures. As a
> result of my investigation, I found that RBD sets a Watcher to RBD
> image if a client mounts this volume to prevent multiple mounts. In

Hi Yuma,

The watcher is created to watch for updates (technically, to listen to notifications) on the RBD image, not to prevent multiple mounts. RBD allows the same image to be mapped multiple times on the same node or on different nodes.

> addition, I found that if the client is isolated from the network for
> a long time, the Watcher is released. However, the client still mounts
> this image. In this situation, if another client can also mount this
> image and the image is writable from both clients, data corruption
> occurs. Could you tell me whether this is a realistic scenario?
Yes, this is a realistic scenario which can occur even if the client isn't isolated from the network. If the user does this, it's up to the user to ensure that everything remains consistent. One use case for mapping the same image on multiple nodes is a clustered (also referred to as a shared disk) filesystem, such as OCFS2.

Thanks,
Ilya

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen, thanks for this clarification. Yes, with the observations you describe for transition 1->2, something is very wrong. Nothing should happen. Unfortunately, I'm going to be on holidays and, generally, don't have too much time. If they can afford to share the osdmap (ceph osd getmap -o file), I could also take a look at some point.

I don't think it has to do with set_choose_tries; there is likely something else screwed up badly. There should simply not be any remapping going on at this stage. Just for fun, you should be able to produce a clean crushmap from scratch with a similar or the same tree and check if you see the same problems. Using the full osdmap with osdmaptool allows one to reproduce the exact mappings as used in the cluster, and it encodes other important information as well. That's why I'm asking for this instead of just the crush map.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: Thursday, May 23, 2024 1:26 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi Frank,

thanks for chiming in here.

> Please correct if this is wrong. Assuming it's correct, I conclude
> the following.

You assume correctly.

> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.

The unknown PGs were observed during/after 1 -> 2. All or almost all PGs were reported as "remapped"; I don't remember the exact number, but it was more than 4k, and the largest pool has 4096 PGs. We didn't see down OSDs at all. Only after moving the hosts into their designated location (the DCs) did the unknown PGs clear and the application resume its operation. I don't want to overload this thread, but I asked for a copy of their crushmap to play around a bit.
I moved the new hosts out of the DCs into the default root via 'crushtool --move ...'; then running the crushtool --test command

# crushtool -i crushmap --test --rule 1 --num-rep 18 --show-choose-tries [--show-bad-mappings] --show-utilization

results in a couple of issues:

- There are lots of bad mappings, no matter how high the number for set_choose_tries is set.
- The --show-utilization output shows 240 OSDs in use (there were 240 OSDs before the expansion), but plenty of PGs have only 9 chunks assigned:

rule 1 (rule-ec-k7m11), x = 0..1023, numrep = 18..18
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0: 55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9: 488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18: 481/1024

And this reminds me of the inactive PGs we saw before I failed the mgr; those inactive PGs showed only 9 chunks in the acting set. With k=7 (and min_size=8) that should still be enough, and we have successfully tested disaster recovery with one entire DC down multiple times.

- With --show-mappings some lines contain an empty set like this:

CRUSH rule 1 x 22 []

And one more observation: with the currently active crushmap there are no bad mappings at all when the hosts are in their designated location. So there's definitely something wrong here, I just can't tell what it is yet. I'll play a bit more with that crushmap...

Thanks!
Eugen

Zitat von Frank Schilder:

> Hi Eugen,
>
> I'm afraid the description of your observation breaks a bit with
> causality and this might be the reason for the few replies. To
> produce a bit more structure for when exactly what happened, let's
> look at what I did and didn't get:
>
> Before adding the hosts you have situation
>
> 1)
> default
>   DCA
>     host A1 ... AN
>   DCB
>     host B1 ... BM
>
> Now you add K+L hosts, they go into the default root and we have situation
>
> 2)
> default
>   host C1 ... CK, D1 ... DL
>   DCA
>     host A1 ... AN
>   DCB
>     host B1 ...
BM
>
> As a last step, you move the hosts to their final locations and we
> arrive at situation
>
> 3)
> default
>   DCA
>     host A1 ... AN, C1 ... CK
>   DCB
>     host B1 ... BM, D1 ... DL
>
> Please correct if this is wrong. Assuming it's correct, I conclude
> the following.
>
> Now, from your description it is not clear to me on which of the
> transitions 1->2 or 2->3 you observe
> - peering and/or
> - unknown PGs.
>
> We use a somewhat similar procedure except that we have a second
> root (separate disjoint tree) for new hosts/OSDs. However, in terms
> of peering it is the same, and if everything is configured correctly
> I would expect this to happen (this is what happens when we add
> OSDs/hosts):
>
> transition 1->2: hosts get added: no peering, no remapped objects,
> nothing, just new OSDs doing nothing
> transition 2->3:
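The result-size tallies quoted in this message can be checked mechanically. A small sketch (assuming the crushtool --show-utilization text format shown above) that parses the tally lines and reports what fraction of the sampled PGs got a complete 18-shard mapping:

```python
import re

# The three tally lines quoted above, verbatim.
OUTPUT = """\
rule 1 (rule-ec-k7m11) num_rep 18 result size == 0: 55/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 9: 488/1024
rule 1 (rule-ec-k7m11) num_rep 18 result size == 18: 481/1024
"""

sizes = {}
total = 0
for m in re.finditer(r"result size == (\d+):\s*(\d+)/(\d+)", OUTPUT):
    size, count, total = (int(g) for g in m.groups())
    sizes[size] = count

assert sum(sizes.values()) == total == 1024
print(f"complete mappings: {sizes[18] / total:.1%}")      # 47.0%
print(f"half mappings (one DC): {sizes[9] / total:.1%}")  # 47.7%
```

Less than half of the sampled PGs getting a full mapping, and 55 getting none at all, is consistent with the "badly broken rule" reading in the mail.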
[ceph-users] Re: unknown PGs after adding hosts in different subtree
Hi Eugen,

I'm afraid the description of your observation breaks a bit with causality, and this might be the reason for the few replies. To produce a bit more structure for when exactly what happened, let's look at what I did and didn't get:

Before adding the hosts you have situation

1)
default
  DCA
    host A1 ... AN
  DCB
    host B1 ... BM

Now you add K+L hosts, they go into the default root, and we have situation

2)
default
  host C1 ... CK, D1 ... DL
  DCA
    host A1 ... AN
  DCB
    host B1 ... BM

As a last step, you move the hosts to their final locations and we arrive at situation

3)
default
  DCA
    host A1 ... AN, C1 ... CK
  DCB
    host B1 ... BM, D1 ... DL

Please correct if this is wrong. Assuming it's correct, I conclude the following.

Now, from your description it is not clear to me on which of the transitions 1->2 or 2->3 you observe
- peering and/or
- unknown PGs.

We use a somewhat similar procedure except that we have a second root (separate disjoint tree) for new hosts/OSDs. However, in terms of peering it is the same, and if everything is configured correctly I would expect this to happen (this is what happens when we add OSDs/hosts):

transition 1->2: hosts get added: no peering, no remapped objects, nothing, just new OSDs doing nothing
transition 2->3: hosts get moved: peering starts and remapped objects appear, all PGs active+clean

Unknown PGs should not occur (maybe only temporarily when the primary changes or the PG is slow to respond/report status?). The crush bug with too few set_choose_tries is observed if one has *just enough hosts* for the EC profile, and should not be observed if all PGs are active+clean and one *adds hosts*. Persistent unknown PGs can (to my understanding; does unknown mean "has no primary"?) only occur if the number of PGs changes (autoscaler messing around?), because all PGs were active+clean before. The crush bug leads to incomplete PGs, so PGs can go incomplete, but they should always have an acting primary.
This is assuming no OSDs went down/out during the process. Can you please check if my interpretation is correct and describe at which step exactly things start diverging from my expectations.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Eugen Block
Sent: Thursday, May 23, 2024 12:05 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: unknown PGs after adding hosts in different subtree

Hi again,

I'm still wondering if I misunderstand some of the ceph concepts. Let's assume the choose_tries value is too low and ceph can't find enough OSDs for the remapping. I would expect that there are some PG chunks in remapping state or unknown or whatever, but why would it affect the otherwise healthy cluster in such a way? Even if ceph doesn't know where to put some of the chunks, I wouldn't expect inactive PGs and a service interruption. What am I missing here?

Thanks,
Eugen

Zitat von Eugen Block:

> Thanks, Konstantin.
> It's been a while since I was last bitten by the choose_tries being
> too low... Unfortunately, I won't be able to verify that... But I'll
> definitely keep that in mind, or at least I'll try to. :-D
>
> Thanks!
>
> Zitat von Konstantin Shalygin:
>
>> Hi Eugen
>>
>>> On 21 May 2024, at 15:26, Eugen Block wrote:
>>>
>>> step set_choose_tries 100
>>
>> I think you should try to increase set_choose_tries to 200.
>> Last year we had a Pacific EC 8+2 deployment of 10 racks. And even
>> with 50 hosts, the value of 100 did not work for us.
>>
>> k

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
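The "too few set_choose_tries with just enough hosts" failure mode discussed in this thread can be illustrated with a toy rejection-sampling model (an illustration only, not the real CRUSH code): each try draws a candidate host and rejects duplicates, so with barely enough hosts for the rule's numrep, a small retry budget frequently runs out before enough distinct hosts are found — which shows up as bad/incomplete mappings.

```python
import random

def choose_hosts(hosts, want, tries, rng):
    """Draw candidates until `want` distinct hosts are found or the
    retry budget `tries` is exhausted (a stand-in for set_choose_tries)."""
    chosen = []
    for _ in range(tries):
        if len(chosen) == want:
            break
        candidate = rng.choice(hosts)
        if candidate not in chosen:
            chosen.append(candidate)
    return chosen

rng = random.Random(1)
hosts = [f"host{i}" for i in range(10)]   # barely enough for numrep=9

def bad_mappings(tries):
    return sum(len(choose_hosts(hosts, 9, tries, rng)) < 9
               for _ in range(1024))

print(bad_mappings(15))    # small budget: many incomplete mappings
print(bad_mappings(100))   # larger budget: failures vanish
```

The point of the model: raising the budget makes the incomplete mappings disappear, which is why bumping set_choose_tries helps in the "just enough hosts" case but should not be needed when merely adding hosts to a healthy tree.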
[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?
Hi Stefan, ahh OK, misunderstood your e-mail. It sounded like it was a custom profile, not a standard one shipped with tuned. Thanks for the clarification!

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Stefan Bauer
Sent: Wednesday, May 22, 2024 12:44 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

Hi Frank,

it's pretty straightforward. Just follow the steps:

apt install tuned
tuned-adm profile network-latency

According to [1]:

network-latency
A server profile focused on lowering network latency. This profile favors performance over power savings by setting intel_pstate and min_perf_pct=100. It disables transparent huge pages and automatic NUMA balancing. It also uses cpupower to set the performance cpufreq governor, and requests a cpu_dma_latency value of 1. It also sets busy_read and busy_poll times to 50 μs, and tcp_fastopen to 3.

[1] https://access.redhat.com/documentation/de-de/red_hat_enterprise_linux/7/html/performance_tuning_guide/sect-red_hat_enterprise_linux-performance_tuning_guide-tool_reference-tuned_adm

Cheers,
Stefan

Am 22.05.24 um 12:18 schrieb Frank Schilder:
> Hi Stefan,
>
> can you provide a link to or copy of the contents of the tuned profile so
> others can also profit from it?
>
> Thanks!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
>
> From: Stefan Bauer
> Sent: Wednesday, May 22, 2024 10:51 AM
> To: Anthony D'Atri; ceph-users@ceph.io
> Subject: [ceph-users] Re: How network latency affects ceph performance really
> with NVME only storage?
>
> Hi Anthony and others,
>
> thank you for your reply. To be honest, I'm not even looking for a
> solution, i just wanted to ask if latency affects the performance at all
> in my case and how others handle this ;)
>
> One of our partners delivered a solution with a latency-optimized
> profile for the tuned daemon.
> Now the latency is much better:
>
> apt install tuned
> tuned-adm profile network-latency
>
> # ping 10.1.4.13
> PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
> 64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
> 64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
> 64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
> 64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
> 64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
> 64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
> 64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
> 64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
> 64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
> ^C
> --- 10.1.4.13 ping statistics ---
> 10 packets transmitted, 10 received, 0% packet loss, time 9001ms
> rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms
>
> Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
>> Check the netmask on your interfaces, is it possible that you're sending
>> inter-node traffic up and back down needlessly?
>>
>>> On May 21, 2024, at 06:02, Stefan Bauer wrote:
>>>
>>> Dear Users,
>>>
>>> i recently setup a new ceph 3 node cluster. Network is meshed between all
>>> nodes (2 x 25G with DAC).
>>> Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20)
>>>
>>> The latency with ping tests between the nodes shows:
>>>
>>> # ping 10.1.3.13
>>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>>> --- 10.1.3.13 ping statistics ---
>>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>>
>>> On another cluster i have much better values, with 10G SFP+ and
>>> fibre-cables:
>>>
>>> 64 bytes fr
[ceph-users] Re: How network latency affects ceph performance really with NVME only storage?
Hi Stefan, can you provide a link to or copy of the contents of the tuned profile so others can also profit from it?

Thanks!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Stefan Bauer
Sent: Wednesday, May 22, 2024 10:51 AM
To: Anthony D'Atri; ceph-users@ceph.io
Subject: [ceph-users] Re: How network latency affects ceph performance really with NVME only storage?

Hi Anthony and others,

thank you for your reply. To be honest, I'm not even looking for a solution, i just wanted to ask if latency affects the performance at all in my case and how others handle this ;)

One of our partners delivered a solution with a latency-optimized profile for the tuned daemon. Now the latency is much better:

apt install tuned
tuned-adm profile network-latency

# ping 10.1.4.13
PING 10.1.4.13 (10.1.4.13) 56(84) bytes of data.
64 bytes from 10.1.4.13: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 10.1.4.13: icmp_seq=2 ttl=64 time=0.028 ms
64 bytes from 10.1.4.13: icmp_seq=3 ttl=64 time=0.025 ms
64 bytes from 10.1.4.13: icmp_seq=4 ttl=64 time=0.020 ms
64 bytes from 10.1.4.13: icmp_seq=5 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=6 ttl=64 time=0.026 ms
64 bytes from 10.1.4.13: icmp_seq=7 ttl=64 time=0.024 ms
64 bytes from 10.1.4.13: icmp_seq=8 ttl=64 time=0.023 ms
64 bytes from 10.1.4.13: icmp_seq=9 ttl=64 time=0.033 ms
64 bytes from 10.1.4.13: icmp_seq=10 ttl=64 time=0.021 ms
^C
--- 10.1.4.13 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9001ms
rtt min/avg/max/mdev = 0.020/0.027/0.047/0.007 ms

Am 21.05.24 um 15:08 schrieb Anthony D'Atri:
> Check the netmask on your interfaces, is it possible that you're sending
> inter-node traffic up and back down needlessly?
>
>> On May 21, 2024, at 06:02, Stefan Bauer wrote:
>>
>> Dear Users,
>>
>> i recently setup a new ceph 3 node cluster. Network is meshed between all
>> nodes (2 x 25G with DAC).
>> Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20)
>>
>> The latency with ping tests between the nodes shows:
>>
>> # ping 10.1.3.13
>> PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
>> 64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
>> 64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
>> 64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
>> 64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
>> 64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
>> 64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
>> 64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
>> 64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
>> 64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
>> 64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
>> --- 10.1.3.13 ping statistics ---
>> 11 packets transmitted, 11 received, 0% packet loss, time 10242ms
>> rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
>>
>> On another cluster i have much better values, with 10G SFP+ and fibre-cables:
>>
>> 64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
>> 64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
>> ^C
>> --- long-ipv6-ip ping statistics ---
>> 53 packets transmitted, 53 received, 0% packet loss,
time 53260ms
>> rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms
>>
>> If i want best performance, does the latency difference matter at all?
>> Should i change DAC to SFP transceivers with fibre cables to improve
>> overall ceph performance or is this nitpicking?
>>
>> Thanks a lot.
>>
>> Stefan
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Mit freundlichen Grüßen
Stefan Bauer
Schulstraße 5
83308 Trostberg
0179-1194767

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
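To put the two RTT figures from this thread into perspective, here is back-of-envelope arithmetic (a sketch; the 2x-RTT cost model and the 0.02 ms device-commit latency are my assumptions, not measurements from these clusters): a replicated sync write pays roughly one client-to-primary round trip plus one primary-to-replicas round trip (replicas written in parallel), so at queue depth 1 the RTT sets a hard ceiling on IOPS.

```python
def qd1_write_latency_ms(rtt_ms, device_ms=0.02):
    # client -> primary RTT, plus primary -> replicas RTT (in parallel),
    # plus an assumed device/commit latency
    return 2 * rtt_ms + device_ms

for label, rtt in [("DAC, untuned (avg 0.137 ms)", 0.137),
                   ("DAC, tuned profile (avg 0.027 ms)", 0.027)]:
    lat = qd1_write_latency_ms(rtt)
    print(f"{label}: ~{lat:.3f} ms per write -> ~{1000 / lat:.0f} IOPS at qd=1")
```

The gap only matters for latency-sensitive, low-queue-depth workloads; at higher queue depths the per-write RTT largely overlaps and throughput is governed by other bottlenecks.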
[ceph-users] Re: dkim on this mailing list
Hi Marc, in case you are working on the list server: at least for me the situation seems to have improved no more than 2-3 hours ago. My own e-mails to the list now pass.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Marc
Sent: Tuesday, May 21, 2024 5:34 PM
To: ceph-users@ceph.io
Subject: [ceph-users] dkim on this mailing list

Just to confirm if I am messing up my mailserver configs. But currently all messages from this mailing list should generate a dkim pass status?

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Please discuss about Slow Peering
> Not with the most recent Ceph releases.

Actually, this depends. For SSDs whose IOPS profit from higher iodepth, it is very likely to improve performance, because to this day each OSD has only one kv_sync_thread, and this is typically the bottleneck under heavy IOPS load. Having 2-4 kv_sync_threads per SSD, meaning 2-4 OSDs per disk, will help a lot if this thread is saturating. For NVMes this is usually not required.

The question still remains: do you have enough CPU? If you have 13 disks with 4 OSDs each, you will need a core count of at least 50-ish per host. Newer OSDs might be able to utilize even more on fast disks. You will also need 4 times the RAM.

> I suspect your PGs are too few though.

In addition, on these drives you should aim for 150-200 PGs per OSD (another reason to go x4 OSDs - x4 PGs per drive). We have 198 PGs/OSD on average and this helps a lot with IO, recovery, everything.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Anthony D'Atri
Sent: Tuesday, May 21, 2024 3:06 PM
To: 서민우
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Please discuss about Slow Peering

> I have additional questions,
> We use 13 disks (3.2TB NVMe) per server and allocate one OSD to each disk. In other words 1 node has 13 OSDs.
> Do you think this is inefficient?
> Is it better to create more OSDs by creating LVs on the disk?

Not with the most recent Ceph releases. I suspect your PGs are too few though.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
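For concreteness, the sizing advice above works out as follows for the 13-disk example (my arithmetic; the 6 GB-per-OSD memory target is the figure quoted elsewhere in this thread, and 175 PGs/OSD is simply the middle of the suggested 150-200 range):

```python
disks = 13
osds_per_disk = 4
ram_per_osd_gb = 6     # memory limit per OSD used elsewhere in the thread
pgs_per_osd = 175      # middle of the suggested 150-200 range

osds = disks * osds_per_disk
print(osds)                   # OSD daemons per host -> ~1 core each
print(osds * ram_per_osd_gb)  # GB of RAM for the OSD daemons alone
print(osds * pgs_per_osd)     # PG replicas hosted per node at 175/OSD
```

That is 52 OSDs, so the "core count of at least 50-ish" above is just one core per OSD, and the RAM budget quadruples along with the OSD count.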
[ceph-users] Re: Please discuss about Slow Peering
We are using the read-intensive Kioxia drives (octopus cluster) in RBD pools and are very happy with them. I don't think it's the drives. The last possibility I could think of is CPU. We run 4 OSDs per 1.92TB Kioxia drive to utilize their performance (a single OSD per disk doesn't cut it at all) and have 2x 16-core Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz per server. During normal operations the CPU is only lightly loaded. During peering this load peaks at or above 100%. If not enough CPU power is available, peering will be hit very badly.

What you could check is:

- Number of cores: at least 1 HT per OSD, better 1 core per OSD.
- C-states disabled: we run with the virtualization-performance profile and the CPU is basically always at all-core boost (3.2GHz).
- Sufficient RAM: we run these OSDs with a 6G memory limit, that's 24G per disk! Still, the servers have 50% OSD RAM utilisation and 50% buffers, so there is enough for fast peak allocations during peering.
- Check vm.min_free_kbytes: the default is way too low for OSD hosts; we use vm.min_free_kbytes=4194304 (4G). This can have latency impact for network connections.
- Swap disabled: disable swap on OSD hosts.
- Sysctl network tuning: check that your network parameters are appropriate for your network cards; the kernel defaults are still for 1G connections. There are great tuning guides on-line, here some of our settings for 10G NICs:

# Increase autotuning TCP buffer limits
# 10G fiber/64MB buffers (67108864)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.core.rmem_default = 67108864
net.core.wmem_default = 67108864
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 22500 218450 67108864
net.ipv4.tcp_wmem = 22500 81920 67108864

- Last check: are you using WPQ or mclock? The mclock scheduler still has serious issues and switching to WPQ might help.

If none of these help, I'm out of ideas.
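As a sanity check on the fragment above, the buffer values can be parsed and verified to be the 64 MiB the comment claims (a throwaway sketch, not part of any tool; note that tcp_rmem/tcp_wmem are min/default/max triples, so only their last field is the 64 MiB cap):

```python
# A few of the sysctl lines from above, verbatim.
FRAGMENT = """\
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 22500 218450 67108864
net.ipv4.tcp_wmem = 22500 81920 67108864
"""

def parse_sysctl(text):
    """Parse sysctl-style 'key = value [value ...]' lines into int lists."""
    cfg = {}
    for line in text.splitlines():
        key, _, value = line.partition("=")
        cfg[key.strip()] = [int(v) for v in value.split()]
    return cfg

cfg = parse_sysctl(FRAGMENT)
assert cfg["net.core.rmem_max"] == [64 * 1024 * 1024]
assert cfg["net.ipv4.tcp_rmem"][2] == 64 * 1024 * 1024  # max of the triple
```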
For us the Kioxia drives work like a charm; it's the pool that is easiest to manage and maintain, with super-fast recovery and really good sustained performance.

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: 서민우 Sent: Tuesday, May 21, 2024 11:25 AM To: Anthony D'Atri Cc: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Please discuss about Slow Peering

We used the "kioxia kcd6xvul3t20" model. Any infamous information about this model?

On Fri, May 17, 2024 at 2:58 AM, Anthony D'Atri <anthony.da...@gmail.com> wrote: If using jumbo frames, also ensure that they're consistently enabled on all OS instances and network devices.

> On May 16, 2024, at 09:30, Frank Schilder <fr...@dtu.dk> > wrote: > > This is a long shot: if you are using octopus, you might be hit by this > pglog-dup problem: > https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't > mention slow peering explicitly in the blog, but it's also a consequence > because the up+acting OSDs need to go through the PG_log during peering. > > We are also using octopus and I'm not sure if we have ever seen slow ops > caused by peering alone. It usually happens when a disk cannot handle load > under peering. We have, unfortunately, disks that show random latency spikes > (firmware update pending). You can try to monitor OPS latencies for your > drives when peering and look for something that sticks out. People on this > list were reporting quite bad results for certain infamous NVMe brands. If > you state your model numbers, someone else might recognize it.
> > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: 서민우 <smw940...@gmail.com> > Sent: Thursday, May 16, 2024 7:39 AM > To: ceph-users@ceph.io > Subject: [ceph-users] Please discuss about Slow Peering > > Env: > - OS: Ubuntu 20.04 > - Ceph Version: Octopus 15.0.0.1 > - OSD Disk: 2.9TB NVMe > - BlockStorage (Replication 3) > > Symptom: > - Peering when an OSD's node comes up is very slow. Peering speed varies from PG to > PG, and some PGs may even take 10 seconds. But there is no log for those 10 > seconds. > - I checked the effect on client VMs. Actually, slow queries of mysql > occur at the same time. > > There are Ceph OSD logs of both the best and worst cases. > > Best Peering Case (0.5 Seconds) > 2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8] > state: transitioning to Primary > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] > state: Peering, affected_by_map, going to Reset > 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] > start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11], > acting_primary 7 -> 6, up_p
[ceph-users] Re: Please discuss about Slow Peering
This is a long shot: if you are using octopus, you might be hit by this pglog-dup problem: https://docs.clyso.com/blog/osds-with-unlimited-ram-growth/. They don't mention slow peering explicitly in the blog, but it's also a consequence because the up+acting OSDs need to go through the PG_log during peering.

We are also using octopus and I'm not sure if we have ever seen slow ops caused by peering alone. It usually happens when a disk cannot handle load under peering. We have, unfortunately, disks that show random latency spikes (firmware update pending). You can try to monitor OPS latencies for your drives when peering and look for something that sticks out. People on this list were reporting quite bad results for certain infamous NVMe brands. If you state your model numbers, someone else might recognize it.

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: 서민우 Sent: Thursday, May 16, 2024 7:39 AM To: ceph-users@ceph.io Subject: [ceph-users] Please discuss about Slow Peering

Env: - OS: Ubuntu 20.04 - Ceph Version: Octopus 15.0.0.1 - OSD Disk: 2.9TB NVMe - BlockStorage (Replication 3)

Symptom: - Peering when an OSD's node comes up is very slow. Peering speed varies from PG to PG, and some PGs may even take 10 seconds. But there is no log for those 10 seconds. - I checked the effect on client VMs. Actually, slow queries of mysql occur at the same time.

There are Ceph OSD logs of both the best and worst cases.
Best Peering Case (0.5 Seconds) 2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8] state: transitioning to Primary 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] state: Peering, affected_by_map, going to Reset 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8] start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11], acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] state: transitioning to Primary 2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8] start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11], acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting Worst Peering Case (11.6 Seconds) 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20] start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20] start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1], acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] state: transitioning to Stray 2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20] start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1], acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting *I wish to know about* - Why some PG's take 10 seconds until Peering finishes. - Why Ceph log is quiet during peering. - Is this symptom intended in Ceph. 
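For what it's worth, the "worst case" can be quantified directly from the log timestamps: the largest gap between consecutive events for pg[30.20] reproduces the quoted 11.6 seconds. A small sketch:

```python
# Measure a peering stall as the largest gap between consecutive pg log
# events. The three timestamps below are taken from the worst-case log
# excerpt above (timezone offset dropped for simplicity).
from datetime import datetime

def max_gap_seconds(timestamps):
    ts = [datetime.fromisoformat(t) for t in timestamps]
    return max((b - a).total_seconds() for a, b in zip(ts, ts[1:]))

worst = [
    "2024-04-11T15:32:45.169",  # start_peering_interval
    "2024-04-11T15:32:46.173",  # next epoch, transitioning to Stray
    "2024-04-11T15:32:57.794",  # quiet for ~11.6 s before this event
]
print(max_gap_seconds(worst))  # 11.621
```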
*And please give some advice,* - Is there any way to improve peering speed? - Or is there a way to not affect clients when peering occurs? P.S. - I checked the symptoms in the following environments: -> Octopus version, Reef version, cephadm, ceph-ansible ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503
I think you are panicking way too much. Chances are that you will never need that command, so don't get fussed out by an old post. Just follow what I wrote and, in the extremely rare case that recovery does not complete due to missing information, send an e-mail to this list and state that you still have the disk of the down OSD. Someone will send you the export/import commands within a short time. So stop worrying and just administer your cluster with common storage-admin sense.

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Mary Zhang Sent: Tuesday, April 30, 2024 5:00 PM To: Frank Schilder Cc: Eugen Block; ceph-users@ceph.io; Wesley Dillingham Subject: Re: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

Thank you Frank for sharing such valuable experience! I really appreciate it. We observe similar timelines: it took more than 1 week to drain our OSD. Regarding exporting PGs from a failed disk and injecting them back into the cluster, do you have any documentation? I found this online: Ceph.io — Incomplete PGs -- OH MY! <https://ceph.io/en/news/blog/2015/incomplete-pgs-oh-my/>, but I'm not sure whether it's the standard process. Thanks, Mary

On Tue, Apr 30, 2024 at 3:27 AM Frank Schilder <fr...@dtu.dk> wrote: Hi all, I second Eugen's recommendation. We have a cluster with large HDD OSDs where the following timings are found: - drain an OSD: 2 weeks. - down an OSD and let the cluster recover: 6 hours. The drain-OSD procedure is - in my experience - a complete waste of time; it actually puts your cluster at higher risk of a second failure (it's not guaranteed that the bad PG(s) is/are drained first) and also screws up all sorts of internal operations like scrubs etc. for an unnecessarily long time. The recovery procedure is much faster, because it uses all-to-all recovery while drain is limited to no more than max_backfills PGs at a time, and your broken disk sits much longer in the cluster.
On SSDs the "down OSD" method shows a similar speed-up factor. As a safety measure, don't destroy the OSD right away; wait for recovery to complete and only then destroy the OSD and throw away the disk. In case an error occurs during recovery, you can almost always still export PGs from a failed disk and inject them back into the cluster. This, however, requires taking disks out as soon as they show problems and before they fail hard. Keep a little bit of lifetime to have a chance to recover data. Look at the manual of ddrescue for why it is important to stop IO on a failing disk as soon as possible.

Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Eugen Block <ebl...@nde.ag> Sent: Saturday, April 27, 2024 10:29 AM To: Mary Zhang Cc: ceph-users@ceph.io; Wesley Dillingham Subject: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

If the rest of the cluster is healthy and your resiliency is configured properly, for example to sustain the loss of one or more hosts at a time, you don't need to worry about a single disk. Just take it out and remove it (forcefully) so it doesn't have any clients anymore. Ceph will immediately assign different primary OSDs and your clients will be happy again. ;-)

Zitat von Mary Zhang <maryzhang0...@gmail.com>: > Thank you Wesley for the clear explanation between the 2 methods! > The tracker issue you mentioned https://tracker.ceph.com/issues/44400 talks > about primary-affinity. Could primary-affinity help remove an OSD with > hardware issue from the cluster gracefully? > > Thanks, > Mary > > > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham > <w...@wesdillingham.com> > wrote: > >> What you want to do is to stop the OSD (and all its copies of data it >> contains) by stopping the OSD service immediately. The downside of this >> approach is it causes the PGs on that OSD to be degraded.
But the upside is >> the OSD which has bad hardware is immediately no longer participating in >> any client IO (the source of your RGW 503s). In this situation the PGs go >> into degraded+backfilling >> >> The alternative method is to keep the failing OSD up and in the cluster >> but slowly migrate the data off of it; this would be a long drawn-out >> period of time in which the failing disk would continue to serve client >> reads and also facilitate backfill, but you wouldn't take a copy of the data >> out of the cluster and cause degraded PGs. In this scenario the PGs would >> be remapped+backfilling >> >> I tried to find a way to have your cake and eat it too in relation to this >> "predicament" in this tracker issue: https://tracker.ceph.com/issues/4440
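As plain arithmetic, the speed-up reported in this thread (roughly 2 weeks to drain vs. 6 hours to down-and-recover on the same HDD cluster) works out as:

```python
# Drain is throttled by max_backfills on the draining OSD, while
# "down + recover" uses all-to-all recovery. With the timings quoted
# in this thread, the recovery path is ~56x faster.
drain_hours = 14 * 24   # ~2 weeks to drain an OSD
recover_hours = 6       # down the OSD and let the cluster recover
print(drain_hours / recover_hours)  # 56.0
```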
[ceph-users] Re: Remove an OSD with hardware issue caused rgw 503
Hi all, I second Eugen's recommendation. We have a cluster with large HDD OSDs where the following timings are found:

- drain an OSD: 2 weeks.
- down an OSD and let the cluster recover: 6 hours.

The drain-OSD procedure is - in my experience - a complete waste of time; it actually puts your cluster at higher risk of a second failure (it's not guaranteed that the bad PG(s) is/are drained first) and also screws up all sorts of internal operations like scrubs etc. for an unnecessarily long time. The recovery procedure is much faster, because it uses all-to-all recovery while drain is limited to no more than max_backfills PGs at a time, and your broken disk sits much longer in the cluster. On SSDs the "down OSD" method shows a similar speed-up factor. As a safety measure, don't destroy the OSD right away; wait for recovery to complete and only then destroy the OSD and throw away the disk. In case an error occurs during recovery, you can almost always still export PGs from a failed disk and inject them back into the cluster. This, however, requires taking disks out as soon as they show problems and before they fail hard. Keep a little bit of lifetime to have a chance to recover data. Look at the manual of ddrescue for why it is important to stop IO on a failing disk as soon as possible.

Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Eugen Block Sent: Saturday, April 27, 2024 10:29 AM To: Mary Zhang Cc: ceph-users@ceph.io; Wesley Dillingham Subject: [ceph-users] Re: Remove an OSD with hardware issue caused rgw 503

If the rest of the cluster is healthy and your resiliency is configured properly, for example to sustain the loss of one or more hosts at a time, you don't need to worry about a single disk. Just take it out and remove it (forcefully) so it doesn't have any clients anymore. Ceph will immediately assign different primary OSDs and your clients will be happy again.
;-) Zitat von Mary Zhang : > Thank you Wesley for the clear explanation between the 2 methods! > The tracker issue you mentioned https://tracker.ceph.com/issues/44400 talks > about primary-affinity. Could primary-affinity help remove an OSD with > hardware issue from the cluster gracefully? > > Thanks, > Mary > > > On Fri, Apr 26, 2024 at 8:43 AM Wesley Dillingham > wrote: > >> What you want to do is to stop the OSD (and all its copies of data it >> contains) by stopping the OSD service immediately. The downside of this >> approach is it causes the PGs on that OSD to be degraded. But the upside is >> the OSD which has bad hardware is immediately no longer participating in >> any client IO (the source of your RGW 503s). In this situation the PGs go >> into degraded+backfilling >> >> The alternative method is to keep the failing OSD up and in the cluster >> but slowly migrate the data off of it; this would be a long drawn-out >> period of time in which the failing disk would continue to serve client >> reads and also facilitate backfill, but you wouldn't take a copy of the data >> out of the cluster and cause degraded PGs. In this scenario the PGs would >> be remapped+backfilling >> >> I tried to find a way to have your cake and eat it too in relation to this >> "predicament" in this tracker issue: https://tracker.ceph.com/issues/44400 >> but it was deemed "won't fix". >> >> Respectfully, >> >> *Wes Dillingham* >> LinkedIn <http://www.linkedin.com/in/wesleydillingham> >> w...@wesdillingham.com >> >> >> >> >> On Fri, Apr 26, 2024 at 11:25 AM Mary Zhang >> wrote: >> >>> Thank you Eugen for your warm help! >>> >>> I'm trying to understand the difference between the 2 methods. >>> For method 1, or "ceph orch osd rm osd_id", OSD Service — Ceph >>> Documentation >>> <https://docs.ceph.com/en/latest/cephadm/services/osd/#remove-an-osd> >>> says >>> it involves 2 steps: >>> >>>1. >>> >>>evacuating all placement groups (PGs) from the OSD >>>2.
>>> >>>removing the PG-free OSD from the cluster >>> >>> For method 2, or the procedure you recommended, Adding/Removing OSDs — >>> Ceph >>> Documentation >>> < >>> https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/#removing-osds-manual >>> > >>> says >>> "After the OSD has been taken out of the cluster, Ceph begins rebalancing >>> the cluster by migrating placement groups out of the OSD that was removed. >>> " >>> >>> What's the difference between "evacuatin
[ceph-users] Re: Latest Doco Out Of Date?
Hi Eugen, I would ask for a slight change here:

> If a client already has a capability for file-system name a and path > dir1, running fs authorize again for FS name a but path dir2, > instead of modifying the capabilities client already holds, a new > cap for dir2 will be granted

The formulation "a new cap for dir2 will be granted" is very misleading. I would also read it as saying that the new cap is in addition to the already existing cap. I tried to modify caps with fs authorize as well in the past, because it will set caps using pool tags and the docs sounded like it would allow modifying caps. In my case, I got the same error, thought that its implementation was buggy, and did it with the authtool. To be honest, when I look at the command sequence

ceph fs authorize a client.x /dir1 rw
ceph fs authorize a client.x /dir2 rw

and it goes through without error, I would expect the client to have both permissions as a result - no matter what the documentation says. There is no "revoke caps" instruction anywhere. Revoking caps in this way is a really dangerous side effect, and telling people to read the documentation about a command that should follow how other Linux tools manage permissions is not the best answer. There is something called parallelism in software engineering, and this command-line syntax violates it in a highly un-intuitive way. The intuition of the syntax clearly is that it *adds* capabilities; it's incremental. A command like this should follow how existing Linux tools work so that context switching will be easier for admins. Here, the choice of the term "authorize" seems to be unlucky.
A more explicit command that follows setfacl a bit could be ceph fs caps set a client.x /dir1 rw ceph fs caps modify a client.x /dir2 rw or even more parallel ceph fs setcaps a client.x /dir1 rw ceph fs setcaps -m a client.x /dir2 rw Such parallel syntax will not only avoid the reported confusion but also make it possible to implement a modify operation in the future without breaking stuff. And you can save time on the documentation, because it works like other stuff. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, April 24, 2024 9:02 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Latest Doco Out Of Date? Hi, I believe the docs [2] are okay, running 'ceph fs authorize' will overwrite the existing caps, it will not add more caps to the client: > Capabilities can be modified by running fs authorize only in the > case when read/write permissions must be changed. > If a client already has a capability for file-system name a and path > dir1, running fs authorize again for FS name a but path dir2, > instead of modifying the capabilities client already holds, a new > cap for dir2 will be granted To add more caps you'll need to use the 'ceph auth caps' command, for example: quincy-1:~ # ceph fs authorize cephfs client.usera /dir1 rw [client.usera] key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw== quincy-1:~ # ceph auth get client.usera [client.usera] key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw== caps mds = "allow rw fsname=cephfs path=/dir1" caps mon = "allow r fsname=cephfs" caps osd = "allow rw tag cephfs data=cephfs" quincy-1:~ # ceph auth caps client.usera mds 'allow rw fsname=cephfs path=/dir1, allow rw fsname=cephfs path=/dir2' mon 'allow r fsname=cephfs' osd 'allow rw tag cephfs data=cephfs' updated caps for client.usera quincy-1:~ # ceph auth get client.usera [client.usera] key = AQDOrShmk6XhGxAAwz07ngr0JtPSID06RH8lAw== caps mds = "allow rw fsname=cephfs path=/dir1, allow rw fsname=cephfs path=/dir2" caps 
mon = "allow r fsname=cephfs" caps osd = "allow rw tag cephfs data=cephfs" Note that I don't actually have these directories in that cephfs, it's just to demonstrate, so you'll need to make sure your caps actually work. Thanks, Eugen [2] https://docs.ceph.com/en/latest/cephfs/client-auth/#changing-rw-permissions-in-caps Zitat von Zac Dover : > It's in my list of ongoing initiatives. I'll stay up late tonight > and ask Venky directly what's going on in this instance. > > Sometime later today, I'll create an issue tracking bug and I'll > send it to you for review. Make sure that I haven't misrepresented > this issue. > > Zac > > On Wednesday, April 24th, 2024 at 2:10 PM, duluxoz wrote: > >> Hi Zac, >> >> Any movement on this? We really need to come up with an >> answer/solution - thanks >> >> Dulux-Oz >> >> On 19/04/2024 18:03, duluxoz wrote: >> >>> Cool! >>> >>> Thanks for that :-) >>> >>> On 19/04/2024 18:01, Zac Dover wrote:
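To illustrate the distinction this thread is arguing about - overwrite-style "authorize" semantics vs. an incremental, setfacl-style update - here is a toy model. Plain dicts stand in for the mon's cap store; this is not Ceph code, and the function names are invented:

```python
# Toy model of the two semantics. "authorize" here models the
# surprising behaviour discussed above (existing path caps do not
# survive); "caps_modify" models the proposed incremental command.

def authorize(caps, path, perm):
    return {path: perm}             # overwrite: previous path caps are gone

def caps_modify(caps, path, perm):
    merged = dict(caps)
    merged[path] = perm             # incremental: existing path caps survive
    return merged

caps = {"/dir1": "rw"}
print(authorize(caps, "/dir2", "rw"))    # {'/dir2': 'rw'}
print(caps_modify(caps, "/dir2", "rw"))  # {'/dir1': 'rw', '/dir2': 'rw'}
```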
[ceph-users] (deep-)scrubs blocked by backfill
Hi all, I have a technical question about scrub scheduling. I replaced a disk and it is back-filling slowly. We have set osd_scrub_during_recovery = true and still observe that scrub times continuously increase (the number of PGs not scrubbed in time is continuously increasing).

Investigating the situation, it looks like any OSD that has a PG in states "backfill_wait" or "backfilling" is preventing scrubs from being scheduled on PGs it is a member of. However, it seems it is not quite like that. On the one hand, I have never seen a PG in a state like "active+scrubbing+remapped+backfilling", so backfilling PGs at least never seem to scrub. On the other hand, it seems like more PGs are scrubbed than would be eligible if *all* OSDs with a remapped PG on them would refuse scrubs. It looks like something in between "only OSDs with a backfilling PG block requests for scrub reservations" and "all OSDs with a PG in states backfilling or backfill_wait block requests for scrub reservations". Does the position in the backfill reservation queue play a role?

If anyone has insight into how scrub reservations are granted (and when not) in the situation of an OSD backfilling, that would be great. My naive interpretation of "osd_scrub_during_recovery = true" was that scrubs proceed as if no backfill was going on. This, however, is clearly not the case. Having an answer to my question above would help me a lot to get an idea when things will go back to normal.

Thanks a lot and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker
Question about HA here: I understood the documentation of the fuse NFS client such that the connection state of all NFS clients is stored on ceph in rados objects and, if using a floating IP, the NFS clients should just recover from a short network timeout. Not sure if this is what should happen with this specific HA set-up in the original request, but a fail-over of the NFS server ought to be handled gracefully by starting a new one up with the IP of the down one. Or not? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Tuesday, April 16, 2024 11:24 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Have a problem with haproxy/keepalived/ganesha/docker Ah, okay, thanks for the hint. In that case what I see is expected. Zitat von Robert Sander : > Hi, > > On 16.04.24 10:49, Eugen Block wrote: >> >> I believe I can confirm your suspicion, I have a test cluster on >> Reef 18.2.1 and deployed nfs without HAProxy but with keepalived [1]. >> Stopping the active NFS daemon doesn't trigger anything, the MGR >> notices that it's stopped at some point, but nothing else seems to >> happen. > > There is currently no failover for NFS. > > The ingress service (haproxy + keepalived) that cephadm deploys for > an NFS cluster does not have a health check configured. Haproxy does > not notice if a backend NFS server dies. This does not matter as > there is no failover and the NFS client cannot be "load balanced" to > another backend NFS server. > > There is no use to configure an ingress service currently without failover. > > The NFS clients have to remount the NFS share in case of their > current NFS server dies anyway. > > Regards > -- > Robert Sander > Heinlein Consulting GmbH > Schwedter Str. 8/9b, 10119 Berlin > > http://www.heinlein-support.de > > Tel: 030 / 405051-43 > Fax: 030 / 405051-19 > > Zwangsangaben lt. 
§35a GmbHG: > HRB 220009 B / Amtsgericht Berlin-Charlottenburg, > Geschäftsführer: Peer Heinlein -- Sitz: Berlin > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Performance improvement suggestion
>>> Fast write enabled would mean that the primary OSD sends #size copies to the >>> entire active set (including itself) in parallel and sends an ACK to the >>> client as soon as min_size ACKs have been received from the peers (including >>> itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow >>> for whatever reason) without suffering performance penalties immediately >>> (only after too many requests started piling up, which will show as a slow >>> requests warning). >>> >> What happens if an error occurs on the slowest OSD after the min_size >> ACK has already been sent to the client? >> >This should be no different from what exists today - unless, of course, the >error happens on the local/primary OSD

Can this be addressed with reasonable effort? I don't expect this to be a quick fix and it should be tested. However, beating the tail-latency statistics with the extra redundancy should be worth it. I observe fluctuations of latencies: OSDs become randomly slow for whatever reason for short time intervals and then return to normal. A reason for this could be DB compaction; I think latency tends to spike during compaction. A fast-write option would effectively remove the impact of this.

Best regards and thanks for considering this! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
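A toy model of the latency argument quoted above: with "fast write", the client ack follows the min_size-th fastest replica commit, so up to (size-min_size) stragglers add no client latency. The millisecond delays below are invented per-OSD commit times, with one spike standing in for e.g. a DB compaction stall:

```python
# Fast-write ack latency: the client is acked when the min_size-th
# fastest replica (including the primary) has committed. Today's
# behaviour waits for the whole active set, i.e. the slowest replica.

def ack_latency_ms(replica_delays_ms, min_size):
    return sorted(replica_delays_ms)[min_size - 1]

delays = [3, 4, 250, 5]                    # size=4 active set, one stalled OSD
print(ack_latency_ms(delays, min_size=2))  # fast write acks at 4 ms
print(max(delays))                         # waiting for all acks: 250 ms
```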
[ceph-users] Re: Performance improvement suggestion
Hi all, coming late to the party but want to ship in as well with some experience. The problem of tail latencies of individual OSDs is a real pain for any redundant storage system. However, there is a way to deal with this in an elegant way when using large replication factors. The idea is to use the counterpart of the "fast read" option that exists for EC pools and: 1) make this option available to replicated pools as well (is on the road map as far as I know), but also 2) implement an option "fast write" for all pool types. Fast write enabled would mean that the primary OSD sends #size copies to the entire active set (including itself) in parallel and sends an ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without suffering performance penalties immediately (only after too many requests started piling up, which will show as a slow requests warning). I have fast read enabled on all EC pools. This does increase the cluster-internal network traffic, which is nowadays absolutely no problem (in the good old 1G times it potentially would be). In return, the read latencies on the client side are lower and much more predictable. In effect, the user experience improved dramatically. I would really wish that such an option gets added as we use wide replication profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs) and exploiting large replication factors (more precisely, large (size-min_size)) to mitigate the impact of slow OSDs would be awesome. It would also add some incentive to stop the ridiculous size=2 min_size=1 habit, because one gets an extra gain from replication on top of redundancy. 
In the long run, the ceph write path should try to deal with a-priori known different-latency connections (fast local ACK with async remote completion, was asked for a couple of times), for example, for stretched clusters where one has an internal connection for the local part and external connections for the remote parts. It would be great to have similar ways of mitigating some penalties of the slow write paths to remote sites. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Peter Grandi Sent: Wednesday, February 21, 2024 1:10 PM To: list Linux fs Ceph Subject: [ceph-users] Re: Performance improvement suggestion > 1. Write object A from client. > 2. Fsync to primary device completes. > 3. Ack to client. > 4. Writes sent to replicas. [...] As mentioned in the discussion this proposal is the opposite of what the current policy, is, which is to wait for all replicas to be written before writes are acknowledged to the client: https://github.com/ceph/ceph/blob/main/doc/architecture.rst "After identifying the target placement group, the client writes the object to the identified placement group's primary OSD. The primary OSD then [...] confirms that the object was stored successfully in the secondary and tertiary OSDs, and reports to the client that the object was stored successfully." A more revolutionary option would be for 'librados' to write in parallel to all the "active set" OSDs and report this to the primary, but that would greatly increase client-Ceph traffic, while the current logic increases traffic only among OSDs. > So I think that to maintain any semblance of reliability, > you'd need to at least wait for a commit ack from the first > replica (i.e. min_size=2). Perhaps it could be similar to 'k'+'m' for EC, that is 'k' synchronous (write completes to the client only when all at least 'k' replicas, including primary, have been committed) and 'm' asynchronous, instead of 'k' being just 1 or 2. 
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 6 pgs not deep-scrubbed in time
You will have to look at the output of "ceph df" and make a decision to balance "objects per PG" and "GB per PG". Increase the PG count most for the pools with the worst of these two numbers, such that it balances out as much as possible. If you have pools that see significantly more user IO than others, prioritise these. You will have to find out for your specific cluster; we can only give general guidelines. Make changes, run benchmarks, re-evaluate. Take the time for it. The better you know your cluster and your users, the better the end result will be.

Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Michel Niyoyita Sent: Monday, January 29, 2024 2:04 PM To: Janne Johansson Cc: Frank Schilder; E Taka; ceph-users Subject: Re: [ceph-users] Re: 6 pgs not deep-scrubbed in time

This is how it is set; if you suggest making some changes, please advise. Thank you.

ceph osd pool ls detail pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1407 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1393 flags hashpspool stripe_width 0 application rgw pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1394 flags hashpspool stripe_width 0 application rgw pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1395 flags hashpspool stripe_width 0 application rgw pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1396 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw pool 6 'volumes' replicated size 3 min_size 2
crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 108802 lfor 0/0/14812 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd removed_snaps_queue [22d7~3,11561~2,11571~1,11573~1c,11594~6,1159b~f,115b0~1,115b3~1,115c3~1,115f3~1,115f5~e,11613~6,1161f~c,11637~1b,11660~1,11663~2,11673~1,116d1~c,116f5~10,11721~c]
pool 7 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94609 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 8 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1399 flags hashpspool stripe_width 0 application rbd
pool 9 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 108783 lfor 0/561/559 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd removed_snaps_queue [3fa~1,3fc~3,400~1,402~1]
pool 10 'testbench' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 20931 lfor 0/20931/20929 flags hashpspool stripe_width 0
On Mon, Jan 29, 2024 at 2:09 PM Michel Niyoyita <mico...@gmail.com> wrote: Thank you Janne, no need to set some flags like "ceph osd set nodeep-scrub"? Thank you On Mon, Jan 29, 2024 at 2:04 PM Janne Johansson <icepic...@gmail.com> wrote: On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita <mico...@gmail.com> wrote: > > Thank you Frank , > > All disks are HDDs . Would like to know if I can increase the number of PGs > live in production without a negative impact to the cluster. if yes which > commands to use . Yes. "ceph osd pool set <pool> pg_num <value>" where the number usually should be a power of two that leads to a number of PGs per OSD between 100-200. -- May the most significant bit of your life be positive.
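Janne's rule of thumb -- a power-of-two pg_num that lands between 100 and 200 PGs per OSD -- can be sketched as a quick shell calculation. The numbers below (48 OSDs, replica 3, target of 150 PGs per OSD) are illustrative assumptions, not taken from Michel's cluster:

```shell
# PG budget across all pools: osds * target_pgs_per_osd / replication factor,
# then rounded down to a power of two for pg_num.
osds=48
replica=3
target_pgs_per_osd=150   # aim for the middle of the 100-200 range

budget=$(( osds * target_pgs_per_osd / replica ))

pg_num=1
while [ $(( pg_num * 2 )) -le "$budget" ]; do
  pg_num=$(( pg_num * 2 ))
done

echo "PG budget: $budget, suggested pg_num (power of two): $pg_num"
```

With these inputs the suggested pg_num is 2048, which after replication works out to about 128 PGs per OSD -- inside Janne's 100-200 window.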
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 6 pgs not deep-scrubbed in time
Setting osd_max_scrubs = 2 for HDD OSDs was a mistake I made. The result was that PGs needed a bit more than twice as long to deep-scrub. Net effect: high scrub load, much less user IO and, last but not least, the "not deep-scrubbed in time" problem got worse, because (2+eps)/2 > 1. For spinners you need to look at the actually available drive performance, plus a few more things like PG count, distribution etc. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Wesley Dillingham Sent: Monday, January 29, 2024 7:14 PM To: Michel Niyoyita Cc: Josh Baergen; E Taka; ceph-users Subject: [ceph-users] Re: 6 pgs not deep-scrubbed in time Respond back with "ceph versions" output. If your sole goal is to eliminate the "not scrubbed in time" errors you can increase the aggressiveness of scrubbing by setting: osd_max_scrubs = 2 The default in pacific is 1. If you are going to start tinkering manually with the pg_num you will want to turn off the pg autoscaler on the pools you are touching. Reducing the size of your PGs may make sense and help with scrubbing, but if the pool has a lot of data it will take a long, long time to finish. Respectfully, *Wes Dillingham* w...@wesdillingham.com LinkedIn <http://www.linkedin.com/in/wesleydillingham> On Mon, Jan 29, 2024 at 10:08 AM Michel Niyoyita wrote: > I am running ceph pacific , version 16 , ubuntu 20 OS , deployed using > ceph-ansible. > > Michel > > On Mon, Jan 29, 2024 at 4:47 PM Josh Baergen > wrote: > > > Make sure you're on a fairly recent version of Ceph before doing this, > > though. > > > > Josh > > > > On Mon, Jan 29, 2024 at 5:05 AM Janne Johansson > > wrote: > > > > > > On Mon, Jan 29, 2024 at 12:58, Michel Niyoyita > wrote: > > > > > > > > Thank you Frank , > > > > > > > > All disks are HDDs . Would like to know if I can increase the number > > of PGs > > > > live in production without a negative impact to the cluster. if yes > > which > > > > commands to use . > > > > > > Yes.
"ceph osd pool set pg_num " > > > where the number usually should be a power of two that leads to a > > > number of PGs per OSD between 100-200. > > > > > > -- > > > May the most significant bit of your life be positive. > > > ___ > > > ceph-users mailing list -- ceph-users@ceph.io > > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 6 pgs not deep-scrubbed in time
Hi Michel, are your OSDs HDD or SSD? If they are HDD, it's possible that they can't handle the deep-scrub load with default settings. In that case, have a look at this post https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/ for some basic tuning info and a script to check your scrub stamp distribution. You should also get rid of slow/failing disks. Look at smartctl output and throw out disks with remapped sectors, uncorrectable r/w errors or unusually many corrected read-write errors (assuming you have disks with ECC). A basic calculation for deep-scrub is as follows: the max number of PGs that can be scrubbed at the same time is A = #OSDs / replication factor (rounded down). Take B = the deep-scrub time per PG from the OSD logs (grep for deep-scrub) in minutes. Your pool can then deep-scrub at most A*24*(60/B) PGs per day. For reasonable operations you should not do more than 50% of that. With that you can calculate how many days it takes to deep-scrub all your PGs. A usual reason for slow deep-scrub progress is too few PGs. With replication factor 3 and 48 OSDs you have a PG budget of ca. 3200 (ca. 200/OSD) but use only 385. You should consider increasing the PG count for pools with lots of data. This should already relax the situation somewhat. Then do the calculation above and tune deep-scrub times per pool such that they match the disk performance. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Michel Niyoyita Sent: Monday, January 29, 2024 7:42 AM To: E Taka Cc: ceph-users Subject: [ceph-users] Re: 6 pgs not deep-scrubbed in time Now they are increasing. Friday I tried deep-scrubbing manually and they were successfully done, but Monday morning I found that they have increased to 37. Is it best to deep-scrub manually while we are using the cluster? If not, what is the best thing to do in order to address that. Best Regards.
Michel

ceph -s
  cluster:
    id:     cb0caedc-eb5b-42d1-a34f-96facfda8c27
    health: HEALTH_WARN
            37 pgs not deep-scrubbed in time
  services:
    mon: 3 daemons, quorum ceph-mon1,ceph-mon2,ceph-mon3 (age 11M)
    mgr: ceph-mon2(active, since 11M), standbys: ceph-mon3, ceph-mon1
    osd: 48 osds: 48 up (since 11M), 48 in (since 11M)
    rgw: 6 daemons active (6 hosts, 1 zones)
  data:
    pools:   10 pools, 385 pgs
    objects: 6.00M objects, 23 TiB
    usage:   151 TiB used, 282 TiB / 433 TiB avail
    pgs:     381 active+clean
             4   active+clean+scrubbing+deep
  io:
    client: 265 MiB/s rd, 786 MiB/s wr, 3.87k op/s rd, 699 op/s wr

On Sun, Jan 28, 2024 at 6:14 PM E Taka <0eta...@gmail.com> wrote: > 22 is more often there than the others. Other operations may be blocked > because a deep-scrub is not finished yet. I would remove OSD 22, just to > be sure about this: ceph orch osd rm osd.22 > > If this does not help, just add it again. > > On Fri, Jan 26, 2024 at 08:05, Michel Niyoyita < > mico...@gmail.com> wrote: >> These seem to be different OSDs as shown here. How have you managed to >> sort this out?
>> >> ceph pg dump | grep -F 6.78 >> dumped all >> 6.78 44268 0 0 00 >> 1786796401180 0 10099 10099 >> active+clean 2024-01-26T03:51:26.781438+0200 107547'115445304 >> 107547:225274427 [12,36,37] 12 [12,36,37] 12 >> 106977'114532385 2024-01-24T08:37:53.597331+0200 101161'109078277 >> 2024-01-11T16:07:54.875746+0200 0 >> root@ceph-osd3:~# ceph pg dump | grep -F 6.60 >> dumped all >> 6.60 9 0 0 00 >> 179484338742 716 36 10097 10097 >> active+clean 2024-01-26T03:50:44.579831+0200 107547'153238805 >> 107547:287193139 [32,5,29] 32 [32,5,29] 32 >> 107231'152689835 2024-01-25T02:34:01.849966+0200 102171'147920798 >> 2024-01-13T19:44:26.922000+0200 0 >> 6.3a 44807 0 0 00 >> 1809690056940 0 10093 10093 >> active+clean 2024-01-26T03:53:28.837685+0200 107547'114765984 >> 107547:238170093 [22,13,11] 22 [22,13,11] 22 >> 106945'113739877 2024-01-24T04:10:17.224982+0200 102863'109559444 >> 2024-01-15T05:31:36.606478+0200 0 >> root@ceph-osd3:~# ceph pg dump | grep -F 6.5c >> 6.5c 44277 0 0 00 >> 1787649782300 0 10051 10051 >> active+clean 2024-01-26T03:55:23.339584+0200 107547'126480090 >> 107547:264432655 [22,37,30] 22 [22,37,30] 22 >> 107205'12
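Frank's deep-scrub budget from earlier in this thread can be sketched as below, using Michel's 48 OSDs and replication factor 3. The 30-minute deep-scrub time per PG is an assumed value -- take B from your own OSD logs (grep for deep-scrub):

```shell
# A = PGs that can deep-scrub in parallel; B = minutes per PG deep-scrub.
# Max throughput is A*24*(60/B) PGs per day; stay below 50% of that.
osds=48
replica=3
scrub_minutes=30   # B (assumed; read it from your OSD logs)

A=$(( osds / replica ))
per_day=$(( A * 24 * 60 / scrub_minutes ))
safe_per_day=$(( per_day / 2 ))

echo "parallel deep-scrubs: $A"
echo "max PGs/day: $per_day, recommended max: $safe_per_day"
```

With these assumptions: A=16, 768 PGs/day maximum, 384 at the recommended 50% -- so a cluster with 385 PGs would need roughly a day to cycle through all of them.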
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hi Özkan, > ... The client is actually at idle mode and there is no reason to fail at > all. ... If you re-read my message, you will notice that I wrote that it's not the client failing, it's a false-positive error flag that is not cleared for idle clients. You seem to encounter exactly this situation, and a simple echo 3 > /proc/sys/vm/drop_caches would probably have cleared the warning. There is nothing wrong with your client; it's an issue with the client-MDS communication protocol that is probably still under review. You will encounter these warnings every now and then until it's fixed. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6)
Hi, this message is one of those that are often spurious. I don't recall in which thread/PR/tracker I read it, but the story was something like this: if an MDS gets under memory pressure it will request dentry items back from *all* clients, not just the active ones or the ones holding many of them. If you have a client that's below the min-threshold for dentries (it's one of the client/MDS tuning options), it will not respond. This client will be flagged as not responding, which is a false positive. I believe the devs are working on a fix to get rid of these spurious warnings. There is a "bug/feature" in the MDS that does not clear this warning flag for inactive clients. Hence, the message hangs around and never disappears. I usually clear it with an "echo 3 > /proc/sys/vm/drop_caches" on the client. However, except for being annoying in the dashboard, it has no performance or other negative impact. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Friday, January 26, 2024 10:05 AM To: Özkan Göksu Cc: ceph-users@ceph.io Subject: [ceph-users] Re: 1 clients failing to respond to cache pressure (quincy:17.2.6) Performance for small files is more about IOPS than throughput, and the IOPS in your fio tests look okay to me. What you could try is to split the PGs to get around 150 or 200 PGs per OSD. You're currently at around 60 according to the ceph osd df output. Before you do that, can you share 'ceph pg ls-by-pool cephfs.ud-data.data | head'? I don't need the whole output, just to see how many objects each PG has. We had a case once where that helped, but it was an older cluster and the pool was backed by HDDs and separate rocksDB on SSDs. So this might not be the solution here, but it could improve things as well. Zitat von Özkan Göksu : > Every user has a 1x subvolume and I only have 1 pool. > At the beginning we were using each subvolume for ldap home directory + > user data.
> When a user logged in to any docker on any host, it was using the cluster for > home, and for user-related data we had a second directory in the > same subvolume. > From time to time users experienced a very slow home environment, and after a > month it became almost impossible to use home. VNC sessions became > unresponsive and slow etc. > > 2 weeks ago, I had to migrate home to ZFS storage and now the overall > performance is better for user_data alone, without home. > But the performance is still not as good as I expected because of the > problems related to the MDS. > The usage is low but allocation is high, and CPU usage is high. You saw the > IO op/s, it's nothing, but allocation is high. > > I developed a fio benchmark script and ran it on 4x test servers at > the same time; the results are below: > Script: > https://github.com/ozkangoksu/benchmark/blob/8f5df87997864c25ef32447e02fcd41fda0d2a67/iobench.sh > > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-01.txt > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-02.txt > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-03.txt > https://github.com/ozkangoksu/benchmark/blob/main/benchmark-results/iobench-client-04.txt > > While running the benchmark, I took sample values for each type of iobench run.
> > Seq Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 70 MiB/s rd, 762 MiB/s wr, 337 op/s rd, 24.41k op/s wr > client: 60 MiB/s rd, 551 MiB/s wr, 303 op/s rd, 35.12k op/s wr > client: 13 MiB/s rd, 161 MiB/s wr, 101 op/s rd, 41.30k op/s wr > > Seq Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 1.6 GiB/s rd, 219 KiB/s wr, 28.76k op/s rd, 89 op/s wr > client: 370 MiB/s rd, 475 KiB/s wr, 90.38k op/s rd, 89 op/s wr > > Rand Write benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 63 MiB/s rd, 1.5 GiB/s wr, 8.77k op/s rd, 5.50k op/s wr > client: 14 MiB/s rd, 1.8 GiB/s wr, 81 op/s rd, 13.86k op/s wr > client: 6.6 MiB/s rd, 1.2 GiB/s wr, 61 op/s rd, 30.13k op/s wr > > Rand Read benchmarking: size=1G,direct=1,numjobs=3,iodepth=32 > client: 317 MiB/s rd, 841 MiB/s wr, 426 op/s rd, 10.98k op/s wr > client: 2.8 GiB/s rd, 882 MiB/s wr, 25.68k op/s rd, 291 op/s wr > client: 4.0 GiB/s rd, 226 MiB/s wr, 89.63k op/s rd, 124 op/s wr > client: 2.4 GiB/s rd, 295 KiB/s wr, 197.86k op/s rd, 20 op/s wr > > It seems I only have problems with the 4K,8K,16K other sector sizes. > > > > > Eugen Block , 25 Oca 2024 Per, 19:06 tarihinde şunu yazdı: > >> I understand that your MDS shows a high CPU usage, but other than that >> what is your performance issue? Do users
[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out
Hi, Hector also claims that he observed an incomplete acting set after *adding* an OSD. Assuming that the cluster was HEALTH_OK before that, this should not happen in theory. In practice it has been observed with certain definitions of crush maps. There is, for example, the issue with "choose" and "chooseleaf" not doing the same thing in situations where they should. Another one was that spurious (temporary) allocations of PGs could exceed hard limits without being obvious or reported at all. Without seeing the crush maps it's hard to tell what is going on. With just 3 hosts and 4 OSDs per host the cluster might be hitting corner cases with such a wide EC profile. Having the osdmap of the cluster in normal conditions would allow simulating OSD downs and ups off-line, and one might gain insight into why crush fails to compute a complete acting set (yes, I'm not talking about the up set, I was always talking about the acting set). There might also be an issue with the PG-/OSD-map logs tracking the full history of the PGs in question. A possible way to test is to issue a re-peer command after all peering has finished on a PG with an incomplete acting set, to see if this resolves the PG. If so, there is a temporary condition that prevents the PGs from becoming clean when going through the standard peering procedure. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, January 24, 2024 9:45 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Degraded PGs on EC pool when marking an OSD out Hi, this topic pops up every now and then, and although I don't have definitive proof for my assumptions I still stand with them. ;-) As the docs [2] already state, it's expected that PGs become degraded after some sort of failure (setting an OSD "out" falls into that category IMO): > It is normal for placement groups to enter “degraded” or “peering” > states after a component failure.
Normally, these states reflect the > expected progression through the failure recovery process. However, > a placement group that stays in one of these states for a long time > might be an indication of a larger problem. And you report that your PGs do not stay in that state but eventually recover. My understanding is as follows: PGs have to be recreated on different hosts/OSDs after setting an OSD "out". During this transition (peering) the PGs are degraded until the newly assigned OSDs have noticed their new responsibility (I'm not familiar with the actual data flow). The degraded state then clears as long as the out OSD is up (its PGs are active). If you stop that OSD ("down"), the PGs become and stay degraded until they have been fully recreated on different hosts/OSDs. I'm not sure what impacts the duration until the degraded state clears, but in my small test cluster (similar osd tree to yours) the degraded state clears after a few seconds only, but I only have a few (almost empty) PGs in the EC test pool. I guess a comment from the devs couldn't hurt to clear this up. [2] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#stuck-placement-groups Zitat von Hector Martin : > On 2024/01/22 19:06, Frank Schilder wrote: >> You seem to have a problem with your crush rule(s): >> >> 14.3d ... [18,17,16,3,1,0,NONE,NONE,12] >> >> If you really just took out 1 OSD, having 2xNONE in the acting set >> indicates that your crush rule can't find valid mappings. You might >> need to tune crush tunables: >> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs > > Look closely: that's the *acting* (second column) OSD set, not the *up* > (first column) OSD set. It's supposed to be the *previous* set of OSDs > assigned to that PG, but inexplicably some OSDs just "fall off" when the > PGs get remapped around. > > Simply waiting lets the data recover.
At no point are any of my PGs > actually missing OSDs according to the current cluster state, and CRUSH > always finds a valid mapping. Rather the problem is that the *previous* > set of OSDs just loses some entries some for some reason. > > The same problem happens when I *add* an OSD to the cluster. For > example, right now, osd.15 is out. This is the state of one pg: > > 14.3d 1044 0 0 00 > 157307567310 0 1630 0 1630 > active+clean 2024-01-22T20:15:46.684066+0900 15550'1630 > 15550:16184 [18,17,16,3,1,0,11,14,12] 18 > [18,17,16,3,1,0,11,14,12] 18 15550'1629 > 2024-01-22T20:15:46.683491+0900 0'0 > 2024-01-08T15:18:21.654679+0900 02 > periodic scrub scheduled @ 2024-01-31T
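Frank's re-peer test from earlier in this thread could look roughly like the sketch below. The PG id 14.3d is taken from Hector's output and would need adjusting; this requires a running cluster, so treat it as a CLI sketch rather than something to copy blindly:

```shell
# After peering has settled, force one affected PG to re-peer, then check
# whether its acting set is complete (no NONE entries).
ceph pg repeer 14.3d
ceph pg map 14.3d
```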
[ceph-users] List contents of stray buckets with octopus
Hi all, I need to list the contents of the stray buckets on one of our MDSes. The MDS reports 772674 stray entries. However, if I dump its cache and grep for stray I get only 216 hits. How can I get to the contents of the stray buckets? Please note that Octopus is still hit by https://tracker.ceph.com/issues/57059 so a "dump tree" will not work. In addition, I clearly don't just need the entries in cache, I need a listing of everything. How can I get that? I'm willing to run rados commands and pipe through ceph-dencoder if necessary. Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14
[ceph-users] Re: Degraded PGs on EC pool when marking an OSD out
You seem to have a problem with your crush rule(s): 14.3d ... [18,17,16,3,1,0,NONE,NONE,12] If you really just took out 1 OSD, having 2xNONE in the acting set indicates that your crush rule can't find valid mappings. You might need to tune crush tunables: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/?highlight=crush%20gives%20up#troubleshooting-pgs It is possible that your low OSD count causes the "crush gives up too soon" issue. You might also consider using a crush rule that places exactly 3 shards per host (examples were in posts just last week). Otherwise, it is not guaranteed that "... data remains available if a whole host goes down ..." because you might have 4 chunks on one of the hosts and fall below min_size (the failure domain of your crush rule for the EC profiles is OSD). To test whether your crush rules can generate valid mappings, you can pull the osdmap of your cluster and use osdmaptool to experiment with it without any risk of destroying anything. It allows you to try different crush rules and failure scenarios off-line on real cluster meta-data. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Hector Martin Sent: Friday, January 19, 2024 10:12 AM To: ceph-users@ceph.io Subject: [ceph-users] Degraded PGs on EC pool when marking an OSD out I'm having a bit of a weird issue with cluster rebalances with a new EC pool. I have a 3-machine cluster, each machine with 4 HDD OSDs (+1 SSD). Until now I've been using an erasure coded k=5 m=3 pool for most of my data. I've recently started to migrate to a k=5 m=4 pool, so I can configure the CRUSH rule to guarantee that data remains available if a whole host goes down (3 chunks per host, 9 total). I also moved the 5,3 pool to this setup, although by nature I know its PGs will become inactive if a host goes down (need at least k+1 OSDs to be up).
I've only just started migrating data to the 5,4 pool, but I've noticed that any time I trigger any kind of backfilling (e.g. take one OSD out), a bunch of PGs in the 5,4 pool become degraded (instead of just misplaced/backfilling). This always seems to happen on that pool only, and the object count is a significant fraction of the total pool object count (it's not just "a few recently written objects while PGs were repeering" or anything like that, I know about that effect). Here are the pools:

pool 13 'cephfs2_data_hec5.3' erasure profile ec5.3 size 8 min_size 6 crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14133 lfor 0/11307/11305 flags hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs
pool 14 'cephfs2_data_hec5.4' erasure profile ec5.4 size 9 min_size 6 crush_rule 7 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 14509 lfor 0/0/14234 flags hashpspool,ec_overwrites,bulk stripe_width 20480 application cephfs

EC profiles:

# ceph osd erasure-code-profile get ec5.3
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=3
plugin=jerasure
technique=reed_sol_van
w=8

# ceph osd erasure-code-profile get ec5.4
crush-device-class=
crush-failure-domain=osd
crush-root=default
jerasure-per-chunk-alignment=false
k=5
m=4
plugin=jerasure
technique=reed_sol_van
w=8

They both use the same CRUSH rule, which is designed to select 9 OSDs balanced across the hosts (of which only 8 slots get used for the older 5,3 pool):

rule hdd-ec-x3 {
    id 7
    type erasure
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default class hdd
    step choose indep 3 type host
    step choose indep 3 type osd
    step emit
}

If I take out an OSD (14), I get something like this:

health: HEALTH_WARN
    Degraded data redundancy: 37631/120155160 objects degraded (0.031%), 38 pgs degraded

All the degraded PGs are in the 5,4 pool, and the total object count is around 50k, so this
is *most* of the data in the pool becoming degraded just because I marked an OSD out (without stopping it). If I mark the OSD in again, the degraded state goes away. Example degraded PGs: # ceph pg dump | grep degraded dumped all 14.3c812 0 838 00 119250277580 0 1088 0 1088 active+recovery_wait+undersized+degraded+remapped 2024-01-19T18:06:41.786745+0900 15440'1088 15486:10772 [18,17,16,1,3,2,11,13,12] 18[18,17,16,1,3,2,11,NONE,12] 18 14537'432 2024-01-12T11:25:54.168048+0900 0'0 2024-01-08T15:18:21.654679+0900 02 periodic scrub scheduled @ 2024-01-21T08:00:23.572904+0900 2410 14.3d772 0 1602 00 113032802230 0 1283 0 1283 active+recovery_wait+unders
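To try Frank's osdmaptool suggestion from the first message against this setup, a session could look roughly like the sketch below. The file names and pool id 14 are placeholders; everything operates on an offline copy, nothing here writes to the cluster:

```shell
# Grab the current osdmap and test CRUSH mappings offline.
ceph osd getmap -o /tmp/osdmap
osdmaptool /tmp/osdmap --test-map-pgs --pool 14

# Export the crush map, edit it, and re-test with the modified rule.
osdmaptool /tmp/osdmap --export-crush /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# ... edit /tmp/crush.txt ...
crushtool -c /tmp/crush.txt -o /tmp/crush.new
osdmaptool /tmp/osdmap --import-crush /tmp/crush.new --test-map-pgs --pool 14
```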
[ceph-users] Re: Adding OSD's results in slow ops, inactive PG's
Hi, maybe this is related. On a system with many disks I also had aio problems causing OSDs to hang. Here it was the kernel parameter fs.aio-max-nr that was way too low by default. I bumped it to fs.aio-max-nr = 1048576 (sysctl/tuned) and OSDs came up right away. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Thursday, January 18, 2024 9:46 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Adding OSD's results in slow ops, inactive PG's I'm glad to hear (or read) that it worked for you as well. :-) Zitat von Torkil Svensgaard : > On 18/01/2024 09:30, Eugen Block wrote: >> Hi, >> >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.00 >> >> I don't think this is the right tool, it says: >> >>> --show-config-value Print the corresponding ceph.conf value >>> that matches the specified key. >>> Also searches >>> global defaults. >> >> I suggest to query the daemon directly: >> >> storage01:~ # ceph config set osd osd_max_pg_per_osd_hard_ratio 5 >> >> storage01:~ # ceph tell osd.0 config get osd_max_pg_per_osd_hard_ratio >> { >> "osd_max_pg_per_osd_hard_ratio": "5.00" >> } > > Copy that, verified to be 5 now. > >>> Daemons are running but those last OSDs won't come online. >>> I've tried upping bdev_aio_max_queue_depth but it didn't seem to >>> make a difference. >> >> I don't have any good idea for that right now except what you >> already tried. Which values for bdev_aio_max_queue_depth have you >> tried? > > The previous value was 1024, I bumped it to 4096. > > A couple of the OSDs seemingly stuck on the aio thing has now come > to, so I went ahead and added the rest. Some of them came in right > away, some are stuck on the aio thing. Hopefully they will recover > eventually. > > Thanks you again for the osd_max_pg_per_osd_hard_ratio suggestion, > that seems to have solved the core issue =) > > Mvh. 
> > Torkil > >> >> Zitat von Torkil Svensgaard : >> >>> On 18/01/2024 07:48, Eugen Block wrote: >>>> Hi, >>>> >>>>> -3281> 2024-01-17T14:57:54.611+ 7f2c6f7ef540 0 osd.431 >>>>> 2154828 load_pgs opened 750 pgs <--- >>>> >>>> I'd say that's close enough to what I suspected. ;-) Not sure why >>>> the "maybe_wait_for_max_pg" message isn't there but I'd give it a >>>> try with a higher osd_max_pg_per_osd_hard_ratio. >>> >>> Might have helped, not quite sure. >>> >>> I've set these since I wasn't sure which one was the right one?: >>> >>> " >>> ceph config dump | grep osd_max_pg_per_osd_hard_ratio >>> globaladvanced osd_max_pg_per_osd_hard_ratio >>> 5.00 osd advanced >>> osd_max_pg_per_osd_hard_ratio5.00 >>> " >>> >>> Restarted MONs and MGRs. Still getting this with ceph-conf though: >>> >>> " >>> [ceph: root@lazy /]# ceph-conf --show-config | egrep >>> osd_max_pg_per_osd_hard_ratio >>> osd_max_pg_per_osd_hard_ratio = 3.00 >>> " >>> >>> I re-added a couple small SSD OSDs and they came in just fine. I >>> then added a couple HDD OSDs and they also came in after a bit of >>> aio_submit spam. I added a couple more and have now been looking >>> at this for 40 minutes: >>> >>> >>> " >>> ... >>> >>> 2024-01-18T07:42:01.789+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 10 >>> 2024-01-18T07:42:01.808+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 4 >>> 2024-01-18T07:42:01.819+ 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 82 >>> 2024-01-18T07:42:07.499+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 6 >>> 2024-01-18T07:42:07.542+ 7f734fa04700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 8 >>> 2024-01-18T07:42:07.554+ 7f735d1b8700 -1 bdev(0x56295d586400 >>> /var/lib/ceph/osd/ceph-436/block) aio_submit retries 108 >>> ... >>> " >>> >&g
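Frank's fs.aio-max-nr bump from the start of this message, written out as a persistent sysctl fragment (the file name is just a suggestion):

```
# /etc/sysctl.d/90-ceph-aio.conf -- raise the kernel async-IO request limit
# so hosts with many OSDs don't run out of aio contexts at OSD start.
fs.aio-max-nr = 1048576
```

Apply with "sysctl --system", or at runtime with "sysctl -w fs.aio-max-nr=1048576".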
[ceph-users] Re: Performance impact of Heterogeneous environment
For the multi- vs. single-OSD per flash drive decision, the following might be useful: we found dramatic improvements using multiple OSDs per flash drive with octopus *if* the bottleneck is the kv_sync_thread. Apparently, each OSD has only one, and this thread effectively serializes otherwise async IO when saturated. There was a dev discussion about having more kv_sync_threads per OSD daemon by splitting up the rocks-dbs for PGs, but I don't know if this ever materialized. My guess is that for good NVMe drives it is possible that a single kv_sync_thread can saturate the device, in which case there will be no advantage to having more OSDs per device. On not so good drives (SATA/SAS flash) multi-OSD deployments are usually better, because the on-disk controller requires concurrency to saturate the drive. It's not possible to saturate usual SAS/SATA SSDs with iodepth=1. With good NVMe drives I have seen fio tests with direct IO saturate the drive with 4K random IO and iodepth=1. You need enough PCI lanes per drive for that, and I could imagine that here 1 OSD/drive is sufficient. For such drives, storage access quickly becomes CPU bound, so some benchmarking taking all system properties into account is required. If you are already CPU bound (too many NVMe drives per core; many standard servers with 24+ NVMe drives have that property) there is no point adding extra CPU load with more OSD daemons. Don't just look at single disks, look at the whole system. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Bailey Allison Sent: Thursday, January 18, 2024 12:36 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Performance impact of Heterogeneous environment +1 to this, great article and great research. Something we've been keeping a very close eye on ourselves. Overall we've mostly settled on the old keep-it-simple-stupid methodology with good results.
Especially as the benefits have become smaller the more recent your ceph version is. We have been rocking single OSD/NVMe, but as always everything is workload dependent and there is sometimes a need for doubling up. Regards, Bailey > -Original Message- > From: Maged Mokhtar > Sent: January 17, 2024 4:59 PM > To: Mark Nelson ; ceph-users@ceph.io > Subject: [ceph-users] Re: Performance impact of Heterogeneous > environment > > Very informative article you did Mark. > > IMHO if you find yourself with very high per-OSD core count, it may be logical > to just pack/add more nvmes per host, you'd be getting the best price per > performance and capacity. > > /Maged > > > On 17/01/2024 22:00, Mark Nelson wrote: > > It's a little tricky. In the upstream lab we don't strictly see an > > IOPS or average latency advantage with heavy parallelism by running > > multiple OSDs per NVMe drive until per-OSD core counts get very high. > > There does seem to be a fairly consistent tail latency advantage even > > at moderately low core counts however. Results are here: > > > > https://ceph.io/en/news/blog/2023/reef-osds-per-nvme/ > > > > Specifically for jitter, there is probably an advantage to using 2 > > cores per OSD unless you are very CPU starved, but how much that > > actually helps in practice for a typical production workload is > > questionable imho. You do pay some overhead for running 2 OSDs per > > NVMe as well. > > > > > > Mark > > > > > > On 1/17/24 12:24, Anthony D'Atri wrote: > >> Conventional wisdom is that with recent Ceph releases there is no > >> longer a clear advantage to this. > >> > >>> On Jan 17, 2024, at 11:56, Peter Sabaini wrote: > >>> > >>> One thing that I've heard people do but haven't done personally with > >>> fast NVMes (not familiar with the IronWolf so not sure if they > >>> qualify) is partition them up so that they run more than one OSD > >>> (say 2 to 4) on a single NVMe to better utilize the NVMe bandwidth.
> >>> See
> >>> https://ceph.com/community/bluestore-default-vs-tuned-performance-comparison/
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> >> email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
> to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
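As a footnote to the iodepth=1 discussion at the top of this thread: the saturation test Frank describes can be expressed as a fio job file. This is only a sketch; the device path is a placeholder and must point at a disposable test device on your own system.

```ini
; Sketch of the iodepth=1 4K random-read saturation test mentioned above.
; /dev/nvme0n1 is a placeholder test device.
[global]
ioengine=libaio
direct=1
time_based
runtime=60

[qd1-randread]
filename=/dev/nvme0n1
rw=randread
bs=4k
iodepth=1
```

Comparing this against a second run with a high iodepth gives a rough idea of how much concurrency the drive needs: if qd1 already saturates the device, extra OSDs per drive mainly add CPU load.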
[ceph-users] Re: Recomand number of k and m erasure code
I would like to add a detail here that is often overlooked: maintainability under degraded conditions. For production systems I would recommend using EC profiles with at least m=3. The reason is that if you have a longer-lasting problem with a node that is down and m=2, it is not possible to do any maintenance on the system without losing write access. Don't trust what users claim they are willing to tolerate - at least get it in writing. Once a problem occurs they will be at your doorstep no matter what they said before. Similarly, when doing a longer maintenance task with m=2, any disk failure during maintenance implies losing write access. Having m=3 or larger allows 2 (or more) hosts/OSDs to be unavailable simultaneously while the service remains fully operational. That can be a life saver in many situations.

An additional reason for larger m is systematic failures of drives if your vendor doesn't mix drives from different batches and factories. If a batch has a systematic production error, failures are no longer statistically independent. In such a situation, if one drive fails, the likelihood that more drives fail at the same time is very high. Having a larger number of parity shards increases the chances of recovering from such events. For similar reasons I would recommend deploying 5 MONs instead of 3. My life got so much better after having the extra redundancy.

As some background: in our situation we experience(d) somewhat heavy maintenance operations, including modifying/updating ceph nodes (hardware, not software), exchanging racks, switches, cooling, power etc. This required longer downtime and/or moving of servers, and moving the ceph hardware was the easiest part compared with other systems due to the extra redundancy bits in it. We had no service outages during such operations.
Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Anthony D'Atri
Sent: Saturday, January 13, 2024 5:36 PM
To: Phong Tran Thanh
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Recomand number of k and m erasure code

There are nuances, but in general the higher the sum of k+m, the lower the performance, because *every* operation has to hit that many drives, which is especially impactful with HDDs. So there's a tradeoff between storage efficiency and performance. And as you've seen, larger parity groups especially mean slower recovery/backfill.

There's also a modest benefit to choosing values of m and k that have small prime factors, but I wouldn't worry too much about that. You can find EC efficiency tables on the net: https://docs.netapp.com/us-en/storagegrid-116/ilm/what-erasure-coding-schemes-are.html I should really add a table to the docs, making a note to do that. There's a nice calculator at the OSNEXUS site: https://www.osnexus.com/ceph-designer

The overhead factor is (k+m) / k. So for a 4,2 profile, that's 6 / 4 == 1.5. For 6,2, 8 / 6 = 1.33. For 10,2, 12 / 10 = 1.2, and so forth. As k increases, the incremental efficiency gain sees diminishing returns, but performance continues to decrease.

Think of m as the number of copies you can lose without losing data, and m-1 as the number you can lose / have down and still have data *available*. I also suggest that the number of failure domains - in your case this means OSD nodes - be *at least* k+m+1, so in your case you want k+m to be at most 9.

With RBD and many CephFS implementations, we mostly have relatively large RADOS objects that are striped over many OSDs. When using RGW especially, one should attend to average and median S3 object size. There's an analysis of the potential for space amplification in the docs so I won't repeat it here in detail. This sheet https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit#gid=358760253 visually demonstrates this.
Basically, for an RGW bucket pool - or for a CephFS data pool storing unusually small objects - if you have a lot of S3 objects in the multi-KB size range, you waste a significant fraction of the underlying storage. This is exacerbated by EC, and the larger the sum of k+m, the more waste.

When people ask me about replication vs EC and the EC profile, the first question I ask is what they're storing. When EC isn't a non-starter, I tend to recommend 4,2 as a profile until / unless someone has specific needs and can understand the tradeoffs. This lets you store roughly 2x the data of 3x replication while not going overboard on the performance hit.

If you care about your data, do not set m=1. If you need to survive the loss of many drives, say if your cluster is across multiple buildings or sites, choose a larger value of m. There are people running profiles like 4,6 because they have unusual and specific needs.

> On Jan 13, 2024, at 10:32 AM, Phong Tran Thanh wrote:
>
> Hi ceph user!
>
> I need to determine whi
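The arithmetic above condenses into a few lines. A small sketch, assuming the common EC default of min_size = k+1 (check your own pool's min_size), that reproduces both the overhead factors and the "m-1 failures with data still available" rule of thumb:

```python
# EC overhead factor and failure tolerance, per the rules of thumb above.
# Assumes min_size = k + 1, the usual default -- verify for your pool.

def overhead(k, m):
    """Raw-to-usable overhead factor (k+m)/k."""
    return (k + m) / k

def tolerable_failures(k, m, min_size=None):
    """Shards that can be down while the pool still serves IO."""
    if min_size is None:
        min_size = k + 1
    return (k + m) - min_size  # equals m - 1 for the default min_size

for k, m in [(4, 2), (6, 2), (10, 2), (4, 5)]:
    print(f"k={k} m={m}: overhead {overhead(k, m):.2f}, "
          f"{tolerable_failures(k, m)} shard(s) may be down with IO intact")
```

This matches the figures in the message: 4,2 gives 1.50, 6,2 gives 1.33, 10,2 gives 1.20, and the unusual 4,6 profile stays available through 4 concurrent failures.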
[ceph-users] Re: 3 DC with 4+5 EC not quite working
Is it maybe this here: https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon I always have to tweak the num-tries parameters.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Torkil Svensgaard
Sent: Friday, January 12, 2024 10:17 AM
To: Frédéric Nass
Cc: ceph-users@ceph.io; Ruben Vestergaard
Subject: [ceph-users] Re: 3 DC with 4+5 EC not quite working

On 12-01-2024 09:35, Frédéric Nass wrote:
>
> Hello Torkil,

Hi Frédéric

> We're using the same ec scheme as yours, with k=5 and m=4 over 3 DCs, with
> the below rule:
>
> rule ec54 {
>         id 3
>         type erasure
>         min_size 3
>         max_size 9
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step choose indep 0 type datacenter
>         step chooseleaf indep 3 type host
>         step emit
> }
>
> Works fine. The only difference I see with your EC rule is the fact that we
> set min_size and max_size, but I doubt this has anything to do with your
> situation.

Great, thanks. I wonder if we might need to tweak the min_size. I think I tried lowering it to no avail and then set it back to 5 after editing the crush rule.

> Since the cluster still complains about "Pool cephfs.hdd.data has 1024
> placement groups, should have 2048", did you run "ceph osd pool set
> cephfs.hdd.data pgp_num 2048" right after running "ceph osd pool set
> cephfs.hdd.data pg_num 2048"? [1]
>
> Might be that the pool still has 1024 PGs.

Hmm, coming from RHCS we didn't do this, as:

"
RHCS 4.x and 5.x does not require the pgp_num value to be set. This will be done by ceph-mgr automatically. For RHCS 4.x and 5.x, only the pg_num is required to be incremented for the necessary pools.
"

So I only did "ceph osd pool set cephfs.hdd.data pg_num 2048" and let the mgr handle the rest. I had a watch running to see how it went and the pool was up to something like 1922 PGs when it got stuck.
As I read the documentation[1] this shouldn't get us stuck like we did, but we would have to set the pgp_num eventually to get it to rebalance?

Mvh.

Torkil

[1] https://docs.ceph.com/en/quincy/rados/operations/placement-groups/#setting-the-number-of-pgs

>
> Regards,
> Frédéric.
>
> [1] https://docs.ceph.com/en/mimic/rados/operations/placement-groups/#set-the-number-of-placement-groups
>
> -----Original Message-----
> From: Torkil
> To: ceph-users
> Cc: Ruben
> Sent: Friday, January 12, 2024 09:00 CET
> Subject: [ceph-users] 3 DC with 4+5 EC not quite working
>
> We are looking to create a 3 datacenter 4+5 erasure coded pool but can't
> quite get it to work. Ceph version 17.2.7. These are the hosts (there
> will eventually be 6 hdd hosts in each datacenter):
>
> -33  886.00842  datacenter 714
>  -7  209.93135      host ceph-hdd1
> -69   69.86389      host ceph-flash1
>  -6  188.09579      host ceph-hdd2
>  -3  233.57649      host ceph-hdd3
> -12  184.54091      host ceph-hdd4
> -34  824.47168  datacenter DCN
> -73   69.86389      host ceph-flash2
>  -2  201.78067      host ceph-hdd5
> -81  288.26501      host ceph-hdd6
> -31  264.56207      host ceph-hdd7
> -36 1284.48621  datacenter TBA
> -77   69.86389      host ceph-flash3
> -21  190.83224      host ceph-hdd8
> -29  199.08838      host ceph-hdd9
> -11  193.85382      host ceph-hdd10
>  -9  237.28154      host ceph-hdd11
> -26  187.19536      host ceph-hdd12
>  -4  206.37102      host ceph-hdd13
>
> We did this:
>
> ceph osd erasure-code-profile set DRCMR_k4m5_datacenter_hdd
> plugin=jerasure k=4 m=5 technique=reed_sol_van crush-root=default
> crush-failure-domain=datacenter crush-device-class=hdd
>
> ceph osd pool create cephfs.hdd.data erasure DRCMR_k4m5_datacenter_hdd
> ceph osd pool set cephfs.hdd.data allow_ec_overwrites true
> ceph osd pool set cephfs.hdd.data pg_autoscale_mode warn
>
> Didn't quite work:
>
> "
> [WARN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive, 1 pg
> incomplete
>     pg 33.0 is creating+incomplete, acting
> [104,219,NONE,NONE,NONE,41,NONE,NONE,NONE] (reducing pool
> cephfs.hdd.data min_size from 5 may help; search ceph.com/docs for
> 'incomplete')
> "
>
> I then manually changed the crush rule from this:
>
> "
> rule cephfs.hdd.data {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_choose_tries 100
>         step take default class hdd
>         step chooseleaf indep 0 type datacenter
>         step emit
> }
> "
>
> To this:
>
> "
> rule cephfs.hdd.data {
>         id 7
>         type erasure
>         step set_chooseleaf_tries 5
>         step set_ch
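For reference, the "CRUSH gives up too soon" fix Frank points to earlier in this thread amounts to raising the tries in the rule itself. A sketch based on Frédéric's ec54 rule from above; 200 is an illustrative value, not something tested on this cluster:

```
rule ec54 {
        id 3
        type erasure
        min_size 3
        max_size 9
        step set_chooseleaf_tries 5
        step set_choose_tries 200       # raised from 100; illustrative value
        step take default class hdd
        step choose indep 0 type datacenter
        step chooseleaf indep 3 type host
        step emit
}
```

After recompiling the map, `crushtool -i <map> --test --rule 3 --num-rep 9 --show-bad-mappings` shows whether the extra tries remove the NONE entries before the map is injected back into the cluster.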
[ceph-users] Re: Rack outage test failing when nodes get integrated again
Hi Steve, I also observed that setting mon_osd_reporter_subtree_level to anything other than host leads to incorrect behavior. In our case, I actually observed the opposite. I had mon_osd_reporter_subtree_level=datacenter (we have 3 DCs in the crush tree). After cutting off a single host with ifdown - also a network cut-off, albeit not via firewall rules - I observed that not all OSDs on that host were marked down (neither was the host), leading to blocked IO. I didn't wait for very long (only a few minutes, less than 5), because it's a production system. I also didn't find the time to file a tracker issue.

I observed this with Mimic, but since you report it for Pacific I'm pretty sure it affects all versions. My guess is that this is not part of the CI testing, at least not in a way that covers network cut-offs.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Steve Baker
Sent: Thursday, January 11, 2024 8:45 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Rack outage test failing when nodes get integrated again

Hi, we're currently testing a ceph (v 16.2.14) cluster, 3 mon nodes, 6 osd nodes à 8 nvme ssd osds, distributed over 3 racks. Daemons are deployed in containers with cephadm / podman. We got 2 pools on it, one with 3x replication and min_size=2, one with an EC (k3m3). With 1 mon node and 2 osd nodes in each rack, the crush rules are configured in a way (for the 3x pool chooseleaf_firstn rack, for the ec pool choose_indep 3 rack / chooseleaf_indep 2 host) so that a full rack can go down while the cluster stays accessible for client operations. Other options we have set are mon_osd_down_out_subtree_limit=host, so that in case of a host/rack outage the cluster does not automatically start to backfill but will continue to run in a degraded state until human interaction comes to fix it. Also we set mon_osd_reporter_subtree_level=rack.
We tested - while under (synthetic test-)client load - what happens if we take a full rack (one mon node and 2 osd nodes) out of the cluster. We did that using iptables to block the nodes of the rack from the other nodes of the cluster (global and cluster network), as well as from the clients. As expected, the remainder of the cluster continues to run in a degraded state without starting any backfilling or recovery processes. All client requests get served while the rack is out.

But then a strange thing happens when we take the rack (1 mon node, 2 osd nodes) back into the cluster again by deleting all firewall rules with iptables -F at once. Some osds get integrated into the cluster again immediately, but some others remain in state "down" for exactly 10 minutes. These osds that stay down for the 10 minutes are in a state where they still seem unable to reach other osd nodes (see heartbeat_check logs below). After these 10 minutes have passed, these osds come up as well, but at exactly that time many PGs get stuck in peering state, other osds that were in the cluster the whole time get slow requests, and the cluster blocks client traffic (I think it's just the PGs stuck in peering soaking up all the client threads). Then, exactly 45 minutes after the nodes of the rack were made reachable again by iptables -F, the situation recovers, peering succeeds and client load gets handled again.

We have repeated this test several times and it's always exactly the same 10 min "down interval" and 45 min of affected client requests. When we integrate the nodes into the cluster again one after another with a delay of some minutes in between, this does not happen at all. I wonder what's happening there. It must be some kind of split-brain situation having to do with blocking the nodes using iptables but not rebooting them completely. The 10 min and 45 min intervals I described occur every time. For the 10 minutes, some osds stay down after the hosts got integrated again.
It's not all of the 16 osds from the 2 osd hosts that got integrated again, but just some of them. Which ones varies randomly. Sometimes it's also just one. We also observed: the longer the hosts were out of the cluster, the more osds are affected. Then, even after they come up again after 10 minutes, it takes another 45 minutes until the stuck peering situation resolves. Also during these 45 minutes, we see slow ops on osds that remained in the cluster. See here some OSD logs that get written after the reintegration:

2024-01-04T08:25:03.856+ 7f369132b700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2024-01-04T07:25:03.860426+)
2024-01-04T08:25:06.556+ 7f3682882700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.0 down, but it is still running
2024-01-04T08:25:06.556+ 7f3682882700 0 log_channel(clus
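Given Frank's observation above, a conservative sketch of the two mon settings discussed in this thread (host-level reporting as he suggests, plus Steve's down-out limit). The values are illustrative, not a tested recommendation:

```ini
# ceph.conf sketch -- illustrative values only
[mon]
# keep failure reporting at host granularity; higher subtree levels
# showed incorrect down-marking behaviour in both reports in this thread
mon_osd_reporter_subtree_level = host
# don't auto-backfill on host outages; wait for an operator
mon_osd_down_out_subtree_limit = host
```

Both options can also be changed at runtime via `ceph config set mon ...` on cephadm-managed clusters.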
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
Quick answers:

* ... osd_deep_scrub_randomize_ratio ... but not on Octopus: is it still a valid parameter?

Yes, this parameter exists and can be used to prevent premature deep-scrubs. The effect is dramatic.

* ... essentially by playing with osd_scrub_min_interval, ...

The main parameter is actually osd_deep_scrub_randomize_ratio; all other parameters have less effect in terms of scrub load. osd_scrub_min_interval is the second most important parameter and needs increasing for large SATA-/NL-SAS HDDs. For sufficiently fast drives the default of 24h is good (although it might be a bit aggressive/paranoid).

* Another small question: you opt for osd_max_scrubs=1 just to make sure your I/O is not adversely affected by scrubbing, or is there a more profound reason for that?

Well, not affecting user IO too much is a quite profound reason, and many admins try to avoid scrubbing at all when users are on the system. It makes IO somewhat unpredictable and can trigger user complaints. However, there is another profound reason: for HDDs it increases deep-scrub load (that is, interference with user IO) a lot while it actually slows down the deep-scrubbing. HDDs can't handle the implied random IO of concurrent deep-scrubs well. On my system I saw that with osd_max_scrubs=2 the scrub time for a PG slightly more than doubled. In other words: more scrub load, less scrub progress = useless; do not do this.

I plan to document the script a bit more and am waiting for some deep-scrub histograms to converge to equilibrium. This takes months for our large pools, but I would like to have the numbers for an example of how it should look.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Fulvio Galeazzi
Sent: Monday, January 8, 2024 4:21 PM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?

Hallo Frank, just found this post, thank you!
I have also been puzzled/struggling with scrub/deep-scrub and found your post very useful: will give this a try, soon. One thing, first: I am using Octopus, too, but I cannot find any documentation about osd_deep_scrub_randomize_ratio. I do see that in past releases, but not on Octopus: is it still a valid parameter? Let me check whether I understood your procedure: you optimize scrub time distribution essentially by playing with osd_scrub_min_interval, thus "forcing" the automated algorithm to preferentially select older-scrubbed PGs, am I correct? Another small question: you opt for osd_max_scrubs=1 just to make sure your I/O is not adversely affected by scrubbing, or is there a more profound reason for that? Thanks! Fulvio On 12/13/23 13:36, Frank Schilder wrote: > Hi all, > > since there seems to be some interest, here some additional notes. > > 1) The script is tested on octopus. It seems that there was a change in the > output of ceph commands used and it might need some tweaking to get it to > work on other versions. > > 2) If you want to give my findings a shot, you can do so in a gradual way. > The most important change is setting osd_deep_scrub_randomize_ratio=0 (with > osd_max_scrubs=1), this will make osd_deep_scrub_interval work exactly as the > requested osd_deep_scrub_min_interval setting, PGs with a deep-scrub stamp > younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the > one change to test, all other settings have less impact. The script will not > report some numbers at the end, but the histogram will be correct. Let it run > a few deep-scrub-interval rounds until the histogram is evened out. > > If you start your test after using osd_max_scrubs>1 for a while -as I did - > you will need a lot of patience and might need to mute some scrub warnings > for a while. > > 3) The changes are mostly relevant for large HDDs that take a long time to > deep-scrub (many small objects). 
> The overall load reduction, however, is useful in general.
>
> Best regards,
> =====
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

--
Fulvio Galeazzi
GARR-Net Department
tel.: +39-334-6533-250
skype: fgaleazzi70

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
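Pulling the settings from this thread together, the tuning might be sketched in ceph.conf form as below. The values mirror the example configuration Frank posted on this list; treat this as a starting point to adapt to your drive sizes and load, not a recommendation:

```ini
# ceph.conf sketch of the scrub tuning discussed in this thread
# (values from Frank's example; adjust to your hardware)
[osd]
osd_max_scrubs = 1                      # concurrent (deep-)scrubs per OSD
osd_deep_scrub_randomize_ratio = 0.0    # no premature deep-scrubs
osd_scrub_interval_randomize_ratio = 0.363636
osd_scrub_backoff_ratio = 0.9319
```

The per-pool intervals (scrub_min_interval, scrub_max_interval, deep_scrub_interval) are set separately with `ceph osd pool set <pool> <option> <seconds>`.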
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
Hi all, another quick update: please use this link to download the script: https://github.com/frans42/ceph-goodies/blob/main/scripts/pool-scrub-report The one I sent originally does not track the latest version.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
Hi all, since there seems to be some interest, here some additional notes.

1) The script is tested on Octopus. It seems that there was a change in the output of the ceph commands used, and it might need some tweaking to get it to work on other versions.

2) If you want to give my findings a shot, you can do so in a gradual way. The most important change is setting osd_deep_scrub_randomize_ratio=0 (with osd_max_scrubs=1); this will make osd_deep_scrub_interval work exactly like the requested osd_deep_scrub_min_interval setting: PGs with a deep-scrub stamp younger than osd_deep_scrub_interval will *not* be deep-scrubbed. This is the one change to test; all other settings have less impact. The script will not report some numbers at the end, but the histogram will be correct. Let it run a few deep-scrub-interval rounds until the histogram has evened out.

If you start your test after using osd_max_scrubs>1 for a while - as I did - you will need a lot of patience and might need to mute some scrub warnings for a while.

3) The changes are mostly relevant for large HDDs that take a long time to deep-scrub (many small objects). The overall load reduction, however, is useful in general.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: increasing number of (deep) scrubs
Yes, octopus.

-- Frank
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Szabo, Istvan (Agoda)
Sent: Wednesday, December 13, 2023 6:13 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: increasing number of (deep) scrubs

Hi,

You are on octopus right?

Istvan Szabo
Staff Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com<mailto:istvan.sz...@agoda.com>
---

From: Frank Schilder
Sent: Tuesday, December 12, 2023 7:33 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: increasing number of (deep) scrubs

Email received from the internet. If in doubt, don't click any link nor open any attachment !

Hi all, if you follow this thread, please see the update in "How to configure something like osd_deep_scrub_min_interval?" (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/). I found out how to tune the scrub machine, and I posted a quick update in the other thread, because the solution was not to increase the number of scrubs but to tune parameters.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________
From: Frank Schilder
Sent: Monday, January 9, 2023 9:14 AM
To: Dan van der Ster
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Dan, thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer-grained way of controlling scrubbing than just doubling or tripling it right away.

Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used, and I already had to increase the warning interval to 2 weeks. With all the warning grace parameters this means that we manage to deep-scrub everything about every month.
I need to plan for 75% utilisation, and a 3-month period is a bit far on the risky side. Our data is to a large percentage cold data. Client reads will not do the check for us; we need to combat bit-rot pro-actively.

The reasons I'm interested in parameters initiating more scrubs, while also converting more scrubs into deep scrubs, are:

1) Scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing".

2) I suspect the low deep-scrub count is due to a low number of deep-scrubs scheduled, not due to conflicting per-OSD deep-scrub reservations. With the OSD count we have and the distribution over 12 servers, I would expect at least a peak of 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD, it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up.

For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit. If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a deep scrub with 75% full OSDs in about 30 days. This is the current tail-time with 25% utilisation. I believe currently a deep scrub of a PG in these pools takes 2-3 hours. It's just a gut feeling from some repair and deep-scrub commands; I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further, and not the only, option to push for more deep scrubbing. My expectation is that values of 2-3 are fine due to the increasingly higher percentage of cold data, for which no interference with client IO will happen.
Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it? With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11 OSDs, so at a minimum it could take 3-4x the amount of time to scrub the data than if those were replicated pools. If you want the scrub to complete in t
[ceph-users] Re: increasing number of (deep) scrubs
Hi all, if you follow this thread, please see the update in "How to configure something like osd_deep_scrub_min_interval?" (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/YUHWQCDAKP5MPU6ODTXUSKT7RVPERBJF/). I found out how to tune the scrub machine and I posted a quick update in the other thread, because the solution was not to increase the number of scrubs, but to tune parameters. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Frank Schilder Sent: Monday, January 9, 2023 9:14 AM To: Dan van der Ster Cc: ceph-users@ceph.io Subject: Re: [ceph-users] increasing number of (deep) scrubs Hi Dan, thanks for your answer. I don't have a problem with increasing osd_max_scrubs (=1 at the moment) as such. I would simply prefer a somewhat finer grained way of controlling scrubbing than just doubling or tripling it right away. Some more info. These 2 pools are data pools for a large FS. Unfortunately, we have a large percentage of small files, which is a pain for recovery and seemingly also for deep scrubbing. Our OSDs are about 25% used and I had to increase the warning interval already to 2 weeks. With all the warning grace parameters this means that we manage to deep scrub everything about every month. I need to plan for 75% utilisation and a 3 months period is a bit far on the risky side. Our data is to a large percentage cold data. Client reads will not do the check for us, we need to combat bit-rot pro-actively. The reasons I'm interested in parameters initiating more scrubs while also converting more scrubs into deep scrubs are, that 1) scrubs seem to complete very fast. I almost never catch a PG in state "scrubbing", I usually only see "deep scrubbing". 2) I suspect the low deep-scrub count is due to a low number of deep-scrubs scheduled and not due to conflicting per-OSD deep scrub reservations. 
With the OSD count we have and the distribution over 12 servers, I would expect at least a peak of 50% of OSDs being active in scrubbing instead of the 25% peak I'm seeing now. It ought to be possible to schedule more PGs for deep scrub than actually are.

3) Every OSD having only 1 deep scrub active seems to have no measurable impact on user IO. If I could just get more PGs scheduled with 1 deep scrub per OSD, it would already help a lot. Once this is working, I can eventually increase osd_max_scrubs when the OSDs fill up.

For now I would just like (deep) scrub scheduling to look a bit harder and schedule more eligible PGs per time unit. If we can get deep scrubbing up to an average of 42 PGs completing per hour while keeping osd_max_scrubs=1 to maintain the current IO impact, we should be able to complete a deep scrub with 75% full OSDs in about 30 days. This is the current tail-time with 25% utilisation. I believe currently a deep scrub of a PG in these pools takes 2-3 hours. It's just a gut feeling from some repair and deep-scrub commands; I would need to check logs for more precise info.

Increasing osd_max_scrubs would then be a further, and not the only, option to push for more deep scrubbing. My expectation is that values of 2-3 are fine due to the increasingly higher percentage of cold data, for which no interference with client IO will happen.

Hope that makes sense and there is a way beyond bumping osd_max_scrubs to increase the number of scheduled and executed deep scrubs.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster
Sent: 05 January 2023 15:36
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] increasing number of (deep) scrubs

Hi Frank,

What is your current osd_max_scrubs, and why don't you want to increase it? With 8+2, 8+3 pools each scrub is occupying the scrub slot on 10 or 11 OSDs, so at a minimum it could take 3-4x the amount of time to scrub the data than if those were replicated pools.
If you want the scrub to complete in time, you need to increase the number of scrub slots accordingly. On the other hand, IMHO the 1-week deadline for deep scrubs is often much too ambitious for large clusters -- increasing the scrub intervals is one solution, or I find it simpler to increase mon_warn_pg_not_scrubbed_ratio and mon_warn_pg_not_deep_scrubbed_ratio until you find a ratio that works for your cluster.

Of course, all of this can impact detection of bit-rot, which anyway can be covered by client reads if most data is accessed periodically. But if the cluster is mostly idle or objects are generally not read, then it would be preferable to increase the osd_max_scrubs slots.

Cheers, Dan

On Tue, Jan 3, 2023 at 2:30 AM Frank Schilder wrote:
>
> Hi all,
>
> we are using 16T and 18T spinning drives as OSDs and I'm observing that they
> are not scrubbed as often as I would like. It looks like too few scr
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
d since 4 intervals ( 96h) [1 scrubbing]
 31%  619 PGs not deep-scrubbed since  5 intervals (120h)
 39%  660 PGs not deep-scrubbed since  6 intervals (144h)
 47%  726 PGs not deep-scrubbed since  7 intervals (168h) [1 scrubbing]
 57%  743 PGs not deep-scrubbed since  8 intervals (192h)
 65%  727 PGs not deep-scrubbed since  9 intervals (216h) [1 scrubbing]
 73%  656 PGs not deep-scrubbed since 10 intervals (240h)
 75%  107 PGs not deep-scrubbed since 11 intervals (264h)
 82%  626 PGs not deep-scrubbed since 12 intervals (288h)
 90%  588 PGs not deep-scrubbed since 13 intervals (312h)
 94%  388 PGs not deep-scrubbed since 14 intervals (336h) [1 scrubbing]
 96%  129 PGs not deep-scrubbed since 15 intervals (360h) 2 scrubbing+deep
 98%  207 PGs not deep-scrubbed since 16 intervals (384h) 2 scrubbing+deep
 99%   79 PGs not deep-scrubbed since 17 intervals (408h) 1 scrubbing+deep
100%   10 PGs not deep-scrubbed since 18 intervals (432h) 2 scrubbing+deep
8192 PGs out of 8192 reported, 0 missing, 7 scrubbing+deep, 0 unclean.

con-fs2-data2 scrub_min_interval=66h (11i/84%/625PGs÷i)
con-fs2-data2 scrub_max_interval=168h (7d)
con-fs2-data2 deep_scrub_interval=336h (14d/~89%/~520PGs÷d)
osd.338 osd_scrub_interval_randomize_ratio=0.363636 scrubs start after: 66h..90h
osd.338 osd_deep_scrub_randomize_ratio=0.00
osd.338 osd_max_scrubs=1
osd.338 osd_scrub_backoff_ratio=0.931900 rec. this pool: .9319 (class hdd, size 11)
mon.ceph-01 mon_warn_pg_not_scrubbed_ratio=0.50 warn: 10.5d (42.0i)
mon.ceph-01 mon_warn_pg_not_deep_scrubbed_ratio=0.75 warn: 24.5d

Best regards, merry Christmas and a happy new year to everyone!
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
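A quick sanity check of the listing above: with 8192 PGs and deep_scrub_interval=336h, the required average deep-scrub rate works out as below, and the pool's observed ~520 PGs/day is indeed roughly the quoted ~89% of it (this reading of the ÷d notation is my interpretation):

```python
# Deep-scrub rate sanity check for the pool listing above.
pgs = 8192
deep_scrub_interval_h = 336          # 14 days

required_per_day = pgs / (deep_scrub_interval_h / 24)
print(f"required: {required_per_day:.0f} PG deep-scrubs/day")   # ~585

observed_per_day = 520               # the ~520PGs/d figure in the listing
print(f"coverage: {observed_per_day / required_per_day:.0%}")   # ~89%
```

Falling short of the required rate is what makes the histogram's tail grow past the configured interval.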
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, I will update the case. I'm afraid this will have to wait a little bit though. I'm too occupied for a while and also don't have a test cluster that would help speed things up. I will update you, please keep the tracker open. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Tuesday, December 5, 2023 1:58 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent Frank, By using your script I still couldn't reproduce it. Locally my python version is 3.9.16, and I didn't have other VMs to test python other versions. Could you check the tracker to provide the debug logs ? Thanks - Xiubo On 12/1/23 21:08, Frank Schilder wrote: > Hi Xiubo, > > I uploaded a test script with session output showing the issue. When I look > at your scripts, I can't see the stat-check on the second host anywhere. > Hence, I don't really know what you are trying to compare. > > If you want me to run your test scripts on our system for comparison, please > include the part executed on the second host explicitly in an ssh-command. > Running your scripts alone in their current form will not reproduce the issue. > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Xiubo Li > Sent: Monday, November 27, 2023 3:59 AM > To: Frank Schilder; Gregory Farnum > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent > > > On 11/24/23 21:37, Frank Schilder wrote: >> Hi Xiubo, >> >> thanks for the update. I will test your scripts in our system next week. >> Something important: running both scripts on a single client will not >> produce a difference. You need 2 clients. The inconsistency is between >> clients, not on the same client. For example: > Frank, > > Yeah, I did this with 2 different kclients. 
> > Thanks > >> Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs >> >> Test 1 >> - on host1: execute shutil.copy2 >> - execute ls -l /mnt/kcephfs/ on host1 and host2: same result >> >> Test 2 >> - on host1: shutil.copy >> - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 >> while correct on host 1 >> >> Your scripts only show output of one host, but the inconsistency requires >> two hosts for observation. The stat information is updated on host1, but not >> synchronized to host2 in the second test. In case you can't reproduce that, >> I will append results from our system to the case. >> >> Also it would be important to know the python and libc versions. We observe >> this only for newer versions of both. >> >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Xiubo Li >> Sent: Thursday, November 23, 2023 3:47 AM >> To: Frank Schilder; Gregory Farnum >> Cc: ceph-users@ceph.io >> Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent >> >> I just raised one tracker to follow this: >> https://tracker.ceph.com/issues/63510 >> >> Thanks >> >> - Xiubo >> >> > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: EC Profiles & DR
Hi, the post linked in the previous message is a good source for different approaches. To provide some first-hand experience, I was operating a pool with a 6+2 EC profile on 4 hosts for a while (until we got more hosts) and the "subdivide a physical host into 2 crush-buckets" approach is actually working best (I basically tried all the approaches described in the linked post and they all had pitfalls). The procedure is more or less:
- add a second (logical) host bucket for each physical host by suffixing the host name with "-B" (ceph osd crush add-bucket )
- move half the OSDs per host to this new host bucket (ceph osd crush move osd.ID host=HOSTNAME-B)
- make this location persist across OSD reboots (ceph config set osd.ID crush_location "host=HOSTNAME-B")
This will allow you to move OSDs back easily when you get more hosts and can afford the recommended 1 shard per host. It will also show which and where OSDs are moved to with a simple "ceph config dump | grep crush_location". Best of all, you don't have to fiddle around with crush maps and hope they do what you want. Just use failure domain host and you are good. No more than 2 host buckets per physical host means no more than 2 shards per physical host with default placement rules. I was operating this set-up with min_size=6 and feeling bad about it due to the reduced maintainability (risk of data loss during maintenance). It's not great really, but sometimes there is no way around it. I was happy when I got the extra hosts. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Curt Sent: Wednesday, December 6, 2023 3:56 PM To: Patrick Begou Cc: ceph-users@ceph.io Subject: [ceph-users] Re: EC Profiles & DR Hi Patrick, Yes K and M are chunks, but the default crush map is a chunk per host, which is probably the best way to do it, but I'm no expert. I'm not sure why you would want to do a crush map with 2 chunks per host and min size 4 as it's just asking for trouble at some point, in my opinion. 
Anyway, take a look at this post if you're interested in doing 2 chunks per host; it will give you an idea of the crushmap setup: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NB3M22GNAC7VNWW7YBVYTH6TBZOYLTWA/ . Regards, Curt ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
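The split-host procedure described above can be written out end to end. This is a sketch with placeholder names (node1, osd.1, root=default are assumptions; adjust to your topology and try on a test cluster first):

```shell
# Split physical host "node1" into two logical host buckets (sketch,
# placeholder names; verify each step with `ceph osd tree` before the next).
ceph osd crush add-bucket node1-B host               # second logical host bucket
ceph osd crush move node1-B root=default             # place it in the hierarchy
ceph osd crush move osd.1 host=node1-B               # move half the OSDs over
ceph config set osd.1 crush_location "host=node1-B"  # survive OSD restarts
ceph config dump | grep crush_location               # verify which OSDs moved
```

With failure domain host and default placement rules, this caps the pool at 2 shards per physical host, and moving the OSDs back later is just the reverse `crush move` plus removing the config entries.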
[ceph-users] ceph df reports incorrect stats
74.27673 host ceph-09
-23 1075.67920 host ceph-10
-15 1067.16492 host ceph-11
-25 1080.21912 host ceph-12
-83 1061.17480 host ceph-13
-85 1047.70276 host ceph-14
-87 1079.02820 host ceph-15
-136 1012.55048 host ceph-16
-139 1073.61475 host ceph-17
-261 1125.57202 host ceph-23
-262 1054.32227 host ceph-24
-148 885.49133 datacenter MultiSite
-65 86.16304 host ceph-04
-67 101.50623 host ceph-05
-69 104.85805 host ceph-06
-71 96.39923 host ceph-07
-81 97.54230 host ceph-18
-94 98.48271 host ceph-19
-4 97.20181 host ceph-20
-64 99.77657 host ceph-21
-66 103.56137 host ceph-22
-49 885.49133 datacenter ServerRoom
-55 885.49133 room SR-113
-65 86.16304 host ceph-04
-67 101.50623 host ceph-05
-69 104.85805 host ceph-06
-71 96.39923 host ceph-07
-81 97.54230 host ceph-18
-94 98.48271 host ceph-19
-4 97.20181 host ceph-20
-64 99.77657 host ceph-21
-66 103.56137 host ceph-22
-1 0 root default
Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [ext] CephFS pool not releasing space after data deletion
Hi Mathias, have you made any progress on this? Did the capacity become available eventually? Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Kuhring, Mathias Sent: Friday, October 27, 2023 3:52 PM To: ceph-users@ceph.io; Frank Schilder Subject: Re: [ext] [ceph-users] CephFS pool not releasing space after data deletion Dear ceph users, We are wondering if this might be the same issue as with this bug: https://tracker.ceph.com/issues/52581 Except that we seem to have snapshots dangling on the old pool, while the bug report has snapshots dangling on the new pool. But maybe it's both? I mean, once the global root layout was pointed to a new pool, the new pool became in charge of snapshotting at least the new data, right? What about data which is overwritten? Is there a conflict of responsibility? We do have similar listings of snaps with "ceph osd pool ls detail", I think:
0|0[root@osd-1 ~]# ceph osd pool ls detail | grep -B 1 removed_snaps_queue
pool 1 'cephfs_data' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 115 pgp_num 107 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 803558 lfor 0/803250/803248 flags hashpspool,selfmanaged_snaps stripe_width 0 expected_num_objects 1 application cephfs
removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 3 'hdd_ec' erasure profile hdd_ec size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 803558 lfor 0/87229/87229 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 8192 application cephfs
removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
--
pool 20 'hdd_ec_8_2_pool' erasure profile hdd_ec_8_2_profile size 10 min_size 9 crush_rule 5 object_hash rjenkins pg_num 8192 pgp_num 8192 
autoscale_mode off last_change 803558 lfor 0/0/681917 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 32768 application cephfs
removed_snaps_queue [3541~1,36e4~1,379f~2,3862~1,3876~1,387d~1,388b~1,389a~1,38a6~1,38bc~1,3993~1,3999~1,39a0~1,39a7~1,39ae~1,39b5~3,39be~1,39c5~1,39cc~1]
Here, pool hdd_ec_8_2_pool is the one we recently assigned to the root layout. Pool hdd_ec is the one which was assigned before and which won't release space (at least as far as I know). Is this removed_snaps_queue the same as removed_snaps in the bug issue (i.e. the label was renamed)? And is it normal that all queues list the same info, or should this be different per pool? Might this be related to pools now sharing responsibility over some snaps due to layout changes? And for the big question: How can I actually trigger/speed up the removal of those snaps? I find the removed_snaps/removed_snaps_queue mentioned a few times in the user list, but never with a conclusive answer on how to deal with them. And the only mentions in the docs are just change logs. I also looked into and started cephfs stray scrubbing: https://docs.ceph.com/en/latest/cephfs/scrub/#evaluate-strays-using-recursive-scrub But according to the status output, no scrubbing is actually active. I would appreciate any further ideas. Thanks a lot. Best Wishes, Mathias On 10/23/2023 12:42 PM, Kuhring, Mathias wrote: > Dear Ceph users, > > Our CephFS is not releasing/freeing up space after deleting hundreds of > terabytes of data. > By now, this drives us in a "nearfull" osd/pool situation and thus > throttles IO. > > We are on ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) > quincy (stable). > > Recently, we moved a bunch of data to a new pool with better EC. > This was done by adding a new EC pool to the FS. > Then assigning the FS root to the new EC pool via the directory layout xattr > (so all new data is written to the new pool). > And finally copying old data to new folders. 
> > I swapped the data as follows to retain the old directory structure. > I also made snapshots for validation purposes. > > So basically: > cp -r mymount/mydata/ mymount/new/ # this creates the copy on the new pool > mkdir mymount/mydata/.snap/tovalidate > mkdir mymount/new/mydata/.snap/tovalidate > mv mymount/mydata/ mymount/old/ > mv mymount/new/mydata mymount/ > > I could see the increase of data in the new pool as expected (ceph df). > I compared the snapshots with hashdeep to make sure the new data is alright. > > Then I went ahead deleting the old data, basically: > rmdir mymount/old/mydata/.snap/* # this also included a bunch of other > older snapshots > rm -r mymount/old/mydata > > At first we had a bunch of PGs with snaptrim/snaptrim_wait. > But they are done for quite
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, I uploaded a test script with session output showing the issue. When I look at your scripts, I can't see the stat-check on the second host anywhere. Hence, I don't really know what you are trying to compare. If you want me to run your test scripts on our system for comparison, please include the part executed on the second host explicitly in an ssh-command. Running your scripts alone in their current form will not reproduce the issue. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, November 27, 2023 3:59 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent On 11/24/23 21:37, Frank Schilder wrote: > Hi Xiubo, > > thanks for the update. I will test your scripts in our system next week. > Something important: running both scripts on a single client will not produce > a difference. You need 2 clients. The inconsistency is between clients, not > on the same client. For example: Frank, Yeah, I did this with 2 different kclients. Thanks > Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs > > Test 1 > - on host1: execute shutil.copy2 > - execute ls -l /mnt/kcephfs/ on host1 and host2: same result > > Test 2 > - on host1: shutil.copy > - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while > correct on host 1 > > Your scripts only show output of one host, but the inconsistency requires two > hosts for observation. The stat information is updated on host1, but not > synchronized to host2 in the second test. In case you can't reproduce that, I > will append results from our system to the case. > > Also it would be important to know the python and libc versions. We observe > this only for newer versions of both. 
> > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Xiubo Li > Sent: Thursday, November 23, 2023 3:47 AM > To: Frank Schilder; Gregory Farnum > Cc: ceph-users@ceph.io > Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent > > I just raised one tracker to follow this: > https://tracker.ceph.com/issues/63510 > > Thanks > > - Xiubo > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, thanks for the update. I will test your scripts in our system next week. Something important: running both scripts on a single client will not produce a difference. You need 2 clients. The inconsistency is between clients, not on the same client. For example: Setup: host1 and host2 with a kclient mount to a cephfs under /mnt/kcephfs Test 1 - on host1: execute shutil.copy2 - execute ls -l /mnt/kcephfs/ on host1 and host2: same result Test 2 - on host1: shutil.copy - execute ls -l /mnt/kcephfs/ on host1 and host2: file size=0 on host 2 while correct on host 1 Your scripts only show output of one host, but the inconsistency requires two hosts for observation. The stat information is updated on host1, but not synchronized to host2 in the second test. In case you can't reproduce that, I will append results from our system to the case. Also it would be important to know the python and libc versions. We observe this only for newer versions of both. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Thursday, November 23, 2023 3:47 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent I just raised one tracker to follow this: https://tracker.ceph.com/issues/63510 Thanks - Xiubo ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
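The cross-host part of the test above needs two kclients, but the mechanism behind the difference can be shown on a single machine: shutil.copy writes data and permission bits only, while shutil.copy2 also copies file metadata (mtime etc.), which results in a different metadata-update pattern that CephFS then has to propagate between clients. A self-contained local sketch (plain local filesystem, not CephFS):

```python
import os
import shutil
import tempfile
import time

def compare_copy_mtimes():
    """Return (mtime preserved by shutil.copy, mtime preserved by shutil.copy2)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src.bin")
        with open(src, "wb") as f:
            f.write(b"x" * 4096)
        backdated = time.time() - 3600
        os.utime(src, (backdated, backdated))  # pretend the file is an hour old

        plain = os.path.join(d, "copy.bin")
        meta = os.path.join(d, "copy2.bin")
        shutil.copy(src, plain)   # data + permission bits only
        shutil.copy2(src, meta)   # data + metadata (mtime, etc.)

        return (abs(os.stat(plain).st_mtime - backdated) < 1,
                abs(os.stat(meta).st_mtime - backdated) < 1)

print(compare_copy_mtimes())  # (False, True): only copy2 preserves the mtime
```

The extra utime/setattr call made by copy2 is the plausible reason the two code paths behave differently across clients; the actual cross-host stat check still requires the two-host setup described in the message.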
[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered
Hi Denis, I have to ask a clarifying question. If I understand the intent of osd_fast_fail_on_connection_refused correctly, an OSD that receives a connection_refused should get marked down fast to avoid unnecessarily long wait times. And *only* OSDs that receive connection refused. In your case, did booting up the server actually create a network route for all other OSDs to the wrong network as well? In other words, did it act as a gateway so that all OSDs received connection refused messages and not just the ones on the critical host? If so, your observation would be expected. If not, then there is something wrong with the down reporting that should be looked at. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Friday, November 24, 2023 1:20 PM To: Denis Krienbühl; Burkhard Linke Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered Hi Denis. > The mon then propagates that failure, without taking any other reports into > consideration: Exactly. I cannot imagine that this change of behavior is intended. The configs on OSD down reporting ought to be honored in any failure situation. Since you already investigated the relevant code lines, please update/create the tracker with your findings. Hope a dev looks at this. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Denis Krienbühl Sent: Friday, November 24, 2023 12:04 PM To: Burkhard Linke Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered > On 24 Nov 2023, at 11:49, Burkhard Linke > wrote: > > This should not be case in the reported situation unless setting > osd_fast_fail_on_connection_refused<https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused>=true > changes this behaviour. In our tests it does change the behavior. 
Usually the mons take mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In our tests, this is the case if an OSD heartbeat is dropped and the OSD is still able to talk to the mons. However, if the OSD heartbeat is rejected, in our case because of an unrelated firewall change, the OSD sends an immediate failure to the mon: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434 The mon then propagates that failure, without taking any other reports into consideration: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367 This is fine when a single OSD goes down and everything else is okay. It then has the intended effect of getting rid of the OSD fast. The assumption presumably being: If a host can answer with a rejection to the OSD heartbeat, it is only the OSD that is affected. In our case however, a network change caused rejections from an entirely different host (a gateway), while a network path to the mons was still available. In this case, Ceph does not apply the safe-guards it usually does. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
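The safeguard being bypassed can be illustrated with a toy model. This is a simplified sketch of the behavior described above, not actual Ceph code: ordinary failure reports must come from at least mon_osd_min_down_reporters distinct reporters (deduplicated at mon_osd_reporter_subtree_level), while an immediate failure report, as generated on ECONNREFUSED when osd_fast_fail_on_connection_refused=true, bypasses the threshold entirely.

```python
# Toy model (not Ceph code) of the mon's down-marking logic described above.

def should_mark_down(reporter_subtrees, min_down_reporters=2, immediate=False):
    """reporter_subtrees: subtree names (e.g. hosts) that reported the OSD down."""
    if immediate:
        # osd_fast_fail_on_connection_refused path: a single report suffices
        return True
    # normal path: require enough *distinct* reporting subtrees
    return len(set(reporter_subtrees)) >= min_down_reporters

# Dropped packets: reports from one misconfigured host alone are not enough.
print(should_mark_down(["bad-host"]))                  # False
print(should_mark_down(["bad-host", "host-b"]))        # True
# Refused packets: the single bad host takes the OSD down immediately.
print(should_mark_down(["bad-host"], immediate=True))  # True
```

This captures why rejection (rather than a drop) from one misconfigured gateway host could take down every OSD: the immediate path never consults the reporter count.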
[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered
Hi Denis. > The mon then propagates that failure, without taking any other reports into > consideration: Exactly. I cannot imagine that this change of behavior is intended. The configs on OSD down reporting ought to be honored in any failure situation. Since you already investigated the relevant code lines, please update/create the tracker with your findings. Hope a dev looks at this. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Denis Krienbühl Sent: Friday, November 24, 2023 12:04 PM To: Burkhard Linke Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered > On 24 Nov 2023, at 11:49, Burkhard Linke > wrote: > > This should not be case in the reported situation unless setting > osd_fast_fail_on_connection_refused<https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_fast_fail_on_connection_refused>=true > changes this behaviour. In our tests it does change the behavior. Usually the mons take mon_osd_reporter_subtree_level and mon_osd_min_down_reporters into account. In our tests, this is the case if an OSD heartbeat is dropped and the OSD is still able to talk to the mons. However, if the OSD heartbeat is rejected, in our case because of an unrelated firewall change, the OSD sends an immediate failure to the mon: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/osd/OSD.cc#L6434 ceph/src/osd/OSD.cc at febfdd83a7838338033486826ef1fc9a5e8d588e · ceph/ceph github.com The mon then propagates that failure, without taking any other reports into consideration: https://github.com/ceph/ceph/blob/febfdd83a7838338033486826ef1fc9a5e8d588e/src/mon/OSDMonitor.cc#L3367 ceph/src/mon/OSDMonitor.cc at febfdd83a7838338033486826ef1fc9a5e8d588e · ceph/ceph github.com This is fine when a single OSD goes down and everything else is okay. It then has the intended effect of getting rid of the OSD fast. 
The assumption presumably being: If a host can answer with a rejection to the OSD heartbeat, it is only the OSD that is affected. In our case however, a network change caused rejections from an entirely different host (a gateway), while a network path to the mons was still available. In this case, Ceph does not apply the safe-guards it usually does. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Full cluster outage when ECONNREFUSED is triggered
Hi Denis, I would agree with you that a single misconfigured host should not take out healthy hosts under any circumstances. I'm not sure if your incident is actually covered by the devs comments, it is quite possible that you observed an unintended side effect that is a bug in handling the connection error. I think the intention is to shut down fast the OSDs with connection refused (where timeouts are not required) and not other OSDs. A bug report with tracker seems warranted. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Denis Krienbühl Sent: Friday, November 24, 2023 9:01 AM To: ceph-users Subject: [ceph-users] Full cluster outage when ECONNREFUSED is triggered Hi We’ve recently had a serious outage at work, after a host had a network problem: - We rebooted a single host in a cluster of fifteen hosts across three racks. - The single host had a bad network configuration after booting, causing it to send some packets to the wrong network. - One network still worked and offered a connection to the mons. - The other network connection was bad. Packets were refused, not dropped. - Due to osd_fast_fail_on_connection_refused=true, the broken host forced the mons to take all other OSDs down (immediate failure). - Only after shutting down the faulty host, was it possible to start the shut down OSDs, to restore the cluster. We have since solved the problem by removing the default route that caused the packets to end up in the wrong network, where they were summarily rejected by a firewall. That is, we made sure that packets would be dropped in the future, not rejected. Still, I figured I’ll send this experience of ours to this mailing list, as this seems to be something others might encounter as well. 
In the following PR, which introduced osd_fast_fail_on_connection_refused, there’s this description: > This changeset adds additional handler (handle_refused()) to the dispatchers > and code that detects when connection attempt fails with ECONNREFUSED error > (connection refused) which is a clear indication that host is alive, but > daemon isn't, so daemons can instantly mark the other side as undoubtly > downed without the need for grace timer. And this comment: > As for flapping, we discussed it on ceph-devel ml > and came to conclusion that it requires either broken firewall or network > configuration to cause this, and these are more serious issues that should > be resolved first before worrying about OSDs flapping (either way, flapping > OSDs could be good for getting someone's attention). https://github.com/ceph/ceph/pull/8558 It has left us wondering if these are the right assumptions. An ECONNREFUSED condition can bring down a whole cluster, and I wonder if there should be some kind of safe-guard to ensure that this is avoided. One badly configured host should generally not be able to do that, and if the packets are dropped, instead of refused, the cluster notices that the OSD down reports come only from one host, and acts accordingly. What do you think? Does this warrant a change in Ceph? I’m happy to provide details and create a ticket. Cheers, Denis ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported"
There are some unhandled race conditions in the MDS cluster in rare circumstances. We had this issue with mimic and octopus and it went away after manually pinning sub-dirs to MDS ranks; see https://docs.ceph.com/en/nautilus/cephfs/multimds/?highlight=dir%20pin#manually-pinning-directory-trees-to-a-particular-rank. This has the added advantage that one can bypass the internal load-balancer, which was horrible for our work loads. I have a related post about ephemeral pinning on this list one-two years ago. You should be able to find it. Short story: after manually pinning all user directories to ranks, all our problems disappeared and performance improved a lot. MDS load dropped from 130% average to 10-20%. So did memory consumption and cache recycling. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, November 22, 2023 12:30 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: mds slow request with “failed to authpin, subtree is being exported" Hi, we've seen this a year ago in a Nautilus cluster with multi-active MDS as well. It turned up only once within several years and we decided not to look too closely at that time. How often do you see it? Is it reproducable? In that case I'd recommend to create a tracker issue. Regards, Eugen Zitat von zxcs : > HI, Experts, > > we are using cephfs with 16.2.* with multi active mds, and > recently, we have two nodes mount with ceph-fuse due to the old os > system. > > and one nodes run a python script with `glob.glob(path)`, and > another client doing `cp` operation on the same path. > > then we see some log about `mds slow request`, and logs complain > “failed to authpin, subtree is being exported" > > then need to restart mds, > > > our question is, does there any dead lock? how can we avoid this > and how to fix it without restart mds(it will influence other users) ? > > > Thanks a ton! 
> > > xz > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
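For reference, the manual pinning described above is done with an extended attribute on the directory. A sketch with placeholder mount point, path, and rank:

```shell
# Pin a directory tree to MDS rank 0 (path and rank are placeholders).
setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/users/alice

# Inspect the current pin; -v -1 removes the pin again.
getfattr -n ceph.dir.pin /mnt/cephfs/users/alice
setfattr -n ceph.dir.pin -v -1 /mnt/cephfs/users/alice
```

Pinning each top-level user directory to a fixed rank this way bypasses the internal balancer, which is what made the export/authpin races disappear in the setup described above.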
[ceph-users] Re: How to use hardware
Hi Simon, we are using something similar for ceph-fs. For a backup system your setup can work, depending on how you back up. While HDD pools have poor IOP/s performance, they are very good for streaming workloads. If you are using something like Borg backup that writes huge files sequentially, a HDD back-end should be OK. Here some things to consider and try out: 1. You really need to get a bunch of enterprise SSDs with power loss protection for the FS meta data pool (disable write cache if enabled, this will disable volatile write cache and switch to protected caching). We are using (formerly Intel) 1.8T SATA drives that we subdivide into 4 OSDs each to raise performance. Place the meta-data pool and the primary data pool on these disks. Create a secondary data pool on the HDDs and assign it to the root *before* creating anything on the FS (see the recommended 3-pool layout for ceph file systems in the docs). I would not even consider running this without SSDs. 1 such SSD per host is the minimum, 2 is better. If Borg or whatever can make use of a small fast storage directory, assign a sub-dir of the root to the primary data pool. 2. Calculate with sufficient extra disk space. As long as utilization stays below 60-70% bluestore will try to make large object writes sequential, which is really important for HDDs. On our cluster we currently have 40% utilization and I get full HDD bandwidth out for large sequential reads/writes. Make sure your backup application makes large sequential IO requests. 3. As Anthony said, add RAM. You should go for 512G on 50 HDD-nodes. You can run the MDS daemons on the OSD nodes. Set a reasonable cache limit and use ephemeral pinning. Depending on the CPUs you are using, 48 cores can be plenty. The latest generation Intel Xeon Scalable Processors is so efficient with ceph that 1HT per HDD is more than enough. 4. 3 MON+MGR nodes are sufficient. You can do something else with the remaining 2 nodes. 
Of course, you can use them as additional MON+MGR nodes. We also use 5 and it improves maintainability a lot. Something more exotic if you have time: 5. To improve sequential performance further, you can experiment with larger min_alloc_sizes for OSDs (at creation time; you will need to scrap and re-deploy the cluster to test different values). Every HDD has a preferred IO-size for which random IO achieves nearly the same bandwidth as sequential writes. (But see 7.) 6. On your set-up you will probably go for a 4+2 EC data pool on HDD. With object size 4M the max. chunk size per OSD will be 1M. For many HDDs this is the preferred IO size (usually between 256K-1M). (But see 7.) 7. Important: large min_alloc_sizes are only good if your workload *never* modifies files, but only replaces them. A bit like a pool without EC overwrite enabled. The implementation of EC overwrites has a "feature" that can lead to massive allocation amplification. If your backup workload does modifications to files instead of adding new+deleting old, do *not* experiment with options 5.-7. Instead, use the default and make sure you have sufficient unused capacity to increase the chances for large bluestore writes (keep utilization below 60-70% and just buy extra disks). A workload with large min_alloc_sizes has to be S3-like, only upload, download and delete are allowed. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Anthony D'Atri Sent: Saturday, November 18, 2023 3:24 PM To: Simon Kepp Cc: Albert Shih; ceph-users@ceph.io Subject: [ceph-users] Re: How to use hardware Common motivations for this strategy include the lure of unit economics and RUs. Often ultra dense servers can’t fill racks anyway due to power and weight limits. Here the osd_memory_target would have to be severely reduced to avoid oomkilling. Assuming the OSDs are top load LFF HDDs with expanders, the HBA will be a bottleneck as well. I’ve suffered similar systems for RGW. 
All the clever juggling in the world could not override the math, and the solution was QLC. “We can lose 4 servers” Do you realize that your data would then be unavailable ? When you lose even one, you will not be able to restore redundancy and your OSDs likely will oomkill. If you’re running CephFS, how are you provisioning fast OSDs for the metadata pool? Are the CPUs high-clock for MDS responsiveness? Even given the caveats this seems like a recipe for at best disappointment. At the very least add RAM. 8GB per OSD plus ample for other daemons. Better would be 3x normal additional hosts for the others. > On Nov 17, 2023, at 8:33 PM, Simon Kepp wrote: > > I know that your question is regarding the service servers, but may I ask, > why you are planning to place so many OSDs ( 300) on so few OSD hosts( 6) > (= 50 OSDs per node)? > This is possible to do, but sounds like the nodes were designe
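The chunk-size arithmetic behind point 6 above is simple: with a k+m EC profile, each object is split into k data chunks, so the per-OSD write size is object_size / k. A hedged sketch (the 6+2 comparison line is my addition for contrast):

```python
# Per-OSD chunk size for a k+m erasure-coded pool: each object is split
# into k data chunks, so larger k means smaller (and less HDD-friendly) IO.

def ec_chunk_size(object_size, k):
    return object_size // k

M = 1024 * 1024
print(ec_chunk_size(4 * M, 4) // 1024, "KiB")  # 4+2 profile -> 1024 KiB
print(ec_chunk_size(4 * M, 6) // 1024, "KiB")  # 6+2 profile -> 682 KiB
```

This is why a 4+2 pool with 4M objects lands exactly in the 256K-1M window many HDDs prefer, as the message notes.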
[ceph-users] Re: How to configure something like osd_deep_scrub_min_interval?
to the time interval for which about 70% of PGs were scrubbed (leaving 30% eligible), the allocation of deep-scrub states is much, much better. I expect both tails to get shorter and the overall deep-scrub load to go down as well. I hope to reach a state where I only need to issue a few deep-scrubs manually per day to get everything scrubbed within 1 week and deep-scrubbed within 3-4 weeks. For now I will wait and see what effect the global settings have on the SSD pools and what the HDD pool converges to. This will need 1-2 months of observation and I will report back when significant changes show up. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Wednesday, November 15, 2023 11:14 AM To: ceph-users@ceph.io Subject: [ceph-users] How to configure something like osd_deep_scrub_min_interval? Hi folks, I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely cause of why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see code) to find out how (deep-) scrub times are distributed. Output below. What I expected is along the lines that HDD-OSDs try to scrub every 1-3 days, while they try to deep-scrub every 7-14 days. In other words, OSDs that have been deep-scrubbed within the last 7 days would *never* be in scrubbing+deep state. However, what I see is completely different. There seems to be no distinction between scrub and deep-scrub start times. This is really unexpected, as nobody would try to deep-scrub HDDs every day. Weekly to bi-weekly is normal, specifically for large drives. Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?
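The interval bucketing the script computes boils down to ceil((now - last_stamp) / interval) — here is a minimal Python sketch of the same computation (made-up timestamps, equivalent to the jq `ceil` expression in the script further below):

```python
import math
from collections import Counter

SCRUB_INTERVAL = 6 * 3600  # the report counts 6h scrub intervals (24h for deep-scrub)

def intervals_since(last_stamp: float, now: float, interval: int = SCRUB_INTERVAL) -> int:
    """Whole (deep-)scrub intervals elapsed since the last stamp, rounded up."""
    return math.ceil((now - last_stamp) / interval)

now = 1_700_000_000  # hypothetical "now" (epoch seconds)
last_scrub_stamps = [now - 3 * 3600, now - 7 * 3600, now - 13 * 3600]  # made-up stamps
hist = Counter(intervals_since(s, now) for s in last_scrub_stamps)
print(sorted(hist.items()))  # [(1, 1), (2, 1), (3, 1)]
```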
The observed behavior is very unusual for RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and too high deep-scrub IO load when such a large percentage of OSDs is needlessly deep-scrubbed again after only 1-6 days. Sample output:

# scrub-report
dumped pgs

Scrub report:
   4121 PGs not scrubbed since  1 intervals (6h)
   3831 PGs not scrubbed since  2 intervals (6h)
   4012 PGs not scrubbed since  3 intervals (6h)
   3986 PGs not scrubbed since  4 intervals (6h)
   2998 PGs not scrubbed since  5 intervals (6h)
   1488 PGs not scrubbed since  6 intervals (6h)
    909 PGs not scrubbed since  7 intervals (6h)
    771 PGs not scrubbed since  8 intervals (6h)
    582 PGs not scrubbed since  9 intervals (6h) 2 scrubbing
    431 PGs not scrubbed since 10 intervals (6h)
    333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
    265 PGs not scrubbed since 12 intervals (6h)
    195 PGs not scrubbed since 13 intervals (6h)
    116 PGs not scrubbed since 14 intervals (6h)
     78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
     72 PGs not scrubbed since 16 intervals (6h)
     37 PGs not scrubbed since 17 intervals (6h)
      5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
     33 PGs not scrubbed since 20 intervals (6h)
     23 PGs not scrubbed since 21 intervals (6h)
     16 PGs not scrubbed since 22 intervals (6h)
     12 PGs not scrubbed since 23 intervals (6h)
      8 PGs not scrubbed since 24 intervals (6h)
      2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
      4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
      5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
      6 PGs not scrubbed since 28 intervals (6h)
      2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
      3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
      1 PGs not scrubbed since 32 intervals (6h) 19.133f*
      1 PGs not scrubbed since 33 intervals (6h) 19.1103*
      3 PGs not scrubbed since 36
intervals (6h) 19.19cc* 19.12f4* 19.248*
      1 PGs not scrubbed since 39 intervals (6h) 19.1984*
      1 PGs not scrubbed since 41 intervals (6h) 14.449*
      1 PGs not scrubbed since 44 intervals (6h) 19.179f*

Deep-scrub report:
   3723 PGs not deep-scrubbed since 1 intervals (24h)
   4621 PGs not deep-scrubbed since 2 intervals (24h) 8 scrubbing+deep
   3588 PGs not deep-scrubbed since 3 intervals (24h) 8 scrubbing+deep
   2929 PGs not deep-scrubbed since 4 intervals (24h) 3 scrubbing+deep
   1705 PGs not deep-scrubbed since 5 intervals (24h) 4 scrubbing+deep
   1904 PGs not deep-scrubbed since 6 intervals (24h) 5 scrubbing+deep
   1540 PGs not deep-scrubbed since 7 intervals (24h) 7 scrubbing+deep
   1304 PGs not deep-scrubbed since 8 interva
[ceph-users] How to configure something like osd_deep_scrub_min_interval?
ce 21 intervals (24h) 19.1322* 19.10f6*
      1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
      1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*

PGs marked with a * are on busy OSDs and not eligible for scrubbing.

The script (pasted here because attaching doesn't work):

# cat bin/scrub-report
#!/bin/bash
# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.

ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""

T0="$(date +%s)"
scrub_info="$(jq --arg T0 "$T0" -rc '
    .pg_stats[] | [
        .pgid,
        (.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
        (.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
        .state,
        (.acting | join(" "))
    ] | @tsv
' /root/.cache/ceph/pgs_dump.json)"
# less <<<"$scrub_info"

#  1      2           3                4       5..NF
#  pg_id  scrub-ints  deep-scrub-ints  status  acting[]
awk <<<"$scrub_info" '{
    for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i
    if($4 == "active+clean") {
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
    } else if($4 ~ /scrubbing\+deep/) {
        deep_scrubbing[$3]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else if($4 ~ /scrubbing/) {
        scrubbing[$2]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else {
        unclean[$2]++
        unclean_d[$3]++
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    }
}
END {
    print "Scrub report:"
    for(si=1; si<=si_mx; ++si) {
        if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
        printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
        if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
        if(unclean[si]) printf(" %d unclean", unclean[si])
        if(pg_sn[si]<=5) {
            split(pg_sn_ids[si], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "Deep-scrub report:"
    for(dsi=1; dsi<=dsi_mx; ++dsi) {
        if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
        printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
        if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
        if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
        if(pg_dsn[dsi]<=5) {
            split(pg_dsn_ids[dsi], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'

Don't forget the last "'" when copy-pasting. Thanks for any pointers. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: ceph fs (meta) data inconsistent
>>> It looks like the cap update request was dropped to the ground in MDS. >>> [...] >>> If you can reproduce it, then please provide the mds logs by setting: >>> [...] >> I can do a test with MDS logs on high level. Before I do that, looking at >> the python >> findings above, is this something that should work on ceph or is it a python >> issue? > > Not sure yet. I need to understand what exactly shutil.copy does in kclient. Thanks! Will wait for further instructions. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Xiubo Li Sent: Friday, November 10, 2023 3:14 AM To: Frank Schilder; Gregory Farnum Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: ceph fs (meta) data inconsistent On 11/10/23 00:18, Frank Schilder wrote: > Hi Xiubo, > > I will try to answer questions from all your 3 e-mails here together with > some new information we have. > > New: The problem occurs in newer python versions when using the shutil.copy > function. There is also a function shutil.copy2 for which the problem does > not show up. Copy2 behaves a bit like "cp -p" while copy is like "cp". The > only code difference (linux) between these 2 functions is that copy calls > copyfile+copymode while copy2 calls copyfile+copystat. For now we asked our > users to use copy2 to avoid the issue. > > The copyfile function calls _fastcopy_sendfile on linux, which in turn calls > os.sendfile, which seems to be part of libc: > > #include > ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count); > > I'm wondering if using this function requires explicit meta-data updates or > should be safe on ceph-fs. I'm also not sure if a user-space client even > supports this function (seems to be meaningless). Should this function be > safe to use on ceph kclient? I didn't foresee any limit for this in kclient. The shutil.copy will only copy the contents of the file, while the shutil.copy2 will also copy the metadata. 
I need to know what exactly they do in kclient for shutil.copy and shutil.copy2. > Answers to questions: > >> BTW, have you test the ceph-fuse with the same test ? Is also the same ? > I don't have fuse clients available, so can't test right now. > >> Have you tried other ceph version ? > We are in the process of deploying a new test cluster, the old one is > scrapped already. I can't test this at the moment. > >> It looks like the cap update request was dropped to the ground in MDS. >> [...] >> If you can reproduce it, then please provide the mds logs by setting: >> [...] > I can do a test with MDS logs on high level. Before I do that, looking at the > python findings above, is this something that should work on ceph or is it a > python issue? Not sure yet. I need to understand what exactly shutil.copy does in kclient. Thanks - Xiubo > > Thanks for your help! > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
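The copy-vs-copy2 difference discussed in this thread can be demonstrated locally (a sketch on a plain temp directory, nothing Ceph-specific): both copy the file content, but only copy2 (copyfile+copystat) carries the mtime over, while copy (copyfile+copymode) leaves the destination with a fresh mtime.

```python
import os
import shutil
import tempfile
import time

d = tempfile.mkdtemp()
src = os.path.join(d, "myfile.txt")
with open(src, "w") as f:
    f.write("test\n")
old = time.time() - 3600                 # back-date the source mtime by one hour
os.utime(src, (old, old))

cp = shutil.copy(src, os.path.join(d, "copy.txt"))     # copyfile + copymode
cp2 = shutil.copy2(src, os.path.join(d, "copy2.txt"))  # copyfile + copystat

same_content = open(cp).read() == open(cp2).read() == "test\n"
copy2_keeps_mtime = abs(os.path.getmtime(cp2) - old) < 2
copy_resets_mtime = os.path.getmtime(cp) - old > 60
print(same_content, copy2_keeps_mtime, copy_resets_mtime)  # True True True
```

On a local filesystem both copies of course end up with the right size; the size inconsistency in this thread only shows up through the CephFS kclient.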
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Xiubo, I will try to answer questions from all your 3 e-mails here together with some new information we have. New: The problem occurs in newer python versions when using the shutil.copy function. There is also a function shutil.copy2 for which the problem does not show up. Copy2 behaves a bit like "cp -p" while copy is like "cp". The only code difference (linux) between these 2 functions is that copy calls copyfile+copymode while copy2 calls copyfile+copystat. For now we asked our users to use copy2 to avoid the issue. The copyfile function calls _fastcopy_sendfile on linux, which in turn calls os.sendfile, which seems to be part of libc: #include <sys/sendfile.h> ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count); I'm wondering if using this function requires explicit meta-data updates or should be safe on ceph-fs. I'm also not sure if a user-space client even supports this function (seems to be meaningless). Should this function be safe to use on ceph kclient? Answers to questions: > BTW, have you test the ceph-fuse with the same test ? Is also the same ? I don't have fuse clients available, so can't test right now. > Have you tried other ceph version ? We are in the process of deploying a new test cluster, the old one is scrapped already. I can't test this at the moment. > It looks like the cap update request was dropped to the ground in MDS. > [...] > If you can reproduce it, then please provide the mds logs by setting: > [...] I can do a test with MDS logs on high level. Before I do that, looking at the python findings above, is this something that should work on ceph or is it a python issue? Thanks for your help! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
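To poke at the sendfile path discussed above directly from Python, the following Linux-only sketch copies a regular file with os.sendfile (the syscall the message says _fastcopy_sendfile uses). Note that it transfers file data only, never metadata, which matches the copy-vs-copy2 distinction in the thread.

```python
import os
import tempfile

# create a small source file
fd, src_path = tempfile.mkstemp()
os.write(fd, b"hello ceph")
os.close(fd)
dst_path = src_path + ".copy"

# copy it via sendfile(2); loop because sendfile may do short transfers
with open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
    remaining = os.fstat(fin.fileno()).st_size
    offset = 0
    while remaining > 0:
        sent = os.sendfile(fout.fileno(), fin.fileno(), offset, remaining)
        if sent == 0:
            break
        offset += sent
        remaining -= sent

with open(dst_path, "rb") as f:
    data = f.read()
print(data)  # b'hello ceph'
```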
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, great! I'm not sure if we observed this particular issue, but we did have the "oldest_client_tid updates not advancing" message in a context that might be related. If this fix is not too large, it would be really great if it could be included in the last Pacific point release. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Wednesday, November 8, 2023 1:38 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin Hi Frank, Recently I found a new possible case that could cause this, please see https://github.com/ceph/ceph/pull/54259. This is just a ceph-side fix; after this we need to fix it in kclient too, which hasn't been done yet. Thanks - Xiubo On 8/8/23 17:44, Frank Schilder wrote: > Dear Xiubo, > > the nearfull pool is an RBD pool and has nothing to do with the file system. > All pools for the file system have plenty of capacity. > > I think we have an idea what kind of workload caused the issue. We had a user > run a computation that reads the same file over and over again. He started > 100 such jobs in parallel and our storage servers were at 400% load. I saw > 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. > 11K. Clearly, most of this was served from RAM. > > It is possible that this extreme load situation triggered a race that > remained undetected/unreported. There is literally no related message in any > logs near the time the warning started popping up. It shows up out of nowhere. > > We asked the user to change his workflow to use local RAM disk for the input > files. I don't think we can reproduce the problem anytime soon. > > About the bug fixes, I'm eagerly waiting for this and another one. Any idea > when they might show up in distro kernels?
> > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Xiubo Li > Sent: Tuesday, August 8, 2023 2:57 AM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: MDS stuck in rejoin > > > On 8/7/23 21:54, Frank Schilder wrote: >> Dear Xiubo, >> >> I managed to collect some information. It looks like there is nothing in the >> dmesg log around the time the client failed to advance its TID. I collected >> short snippets around the critical time below. I have full logs in case you >> are interested. Its large files, I will need to do an upload for that. >> >> I also have a dump of "mds session ls" output for clients that showed the >> same issue later. Unfortunately, no consistent log information for a single >> incident. >> >> Here the summary, please let me know if uploading the full package makes >> sense: >> >> - Status: >> >> On July 29, 2023 >> >> ceph status/df/pool stats/health detail at 01:05:14: >> cluster: >> health: HEALTH_WARN >> 1 pools nearfull >> >> ceph status/df/pool stats/health detail at 01:05:28: >> cluster: >> health: HEALTH_WARN >> 1 clients failing to advance oldest client/flush tid >> 1 pools nearfull > Okay, then this could be the root cause. > > If the pool nearful it could block flushing the journal logs to the pool > and then the MDS couldn't safe reply to the requests and then block them > like this. > > Could you fix the pool nearful issue first and then check could you see > it again ? > > >> [...] >> >> On July 31, 2023 >> >> ceph status/df/pool stats/health detail at 10:36:16: >> cluster: >> health: HEALTH_WARN >> 1 clients failing to advance oldest client/flush tid >> 1 pools nearfull >> >> cluster: >> health: HEALTH_WARN >> 1 pools nearfull >> >> - client evict command (date, time, command): >> >> 2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457 >> >> We have a 1h time difference between the date stamp of the command and the >> dmesg date stamps. 
However, there seems to be a weird 10min delay from >> issuing the evict command until it shows up in dmesg on the client. >> >> - dmesg: >> >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey >> [Fri Jul 28 12:59:14 2023] beegfs: enabling un
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi Gregory and Xiubo, we have a smoking gun. The error shows up when using python's shutil.copy function. It affects newer versions of python3. Here some test results (quoted e-mail from our user): > I now have a minimal example that reproduces the error: > > echo test > myfile.txt > ml Python/3.10.4-GCCcore-11.3.0-bare > python -c "import shutil; shutil.copy('myfile.txt','myfile2.txt')" > ls -l | grep myfile > srun --partition fatq --time 01:00:00 --nodes 1 --exclusive --pty bash > ls -l | grep myfile > > The problem occurs when I copy a file via python > (Python/3.10.4-GCCcore-11.3.0-bare) > The problems does not occur when running the script with the default python > > [user1@host1 ~]$ echo test > myfile.txt > [user1@host1 ~]$ ml Python/3.10.4-GCCcore-11.3.0-bare > [user1@host1 ~]$ python -c "import shutil; > shutil.copy('myfile.txt','myfile2.txt')" > [user1@host1 ~]$ ls -l | grep myfile > -rw-rw 1 user1 user1 5 Nov 3 07:23 myfile2.txt > -rw-rw 1 user1 user1 5 Nov 3 07:23 myfile.txt > [user1@host1 ~]$ srun --partition fatq --time 01:00:00 --nodes 1 --exclusive > --pty bash > [user1@host3 ~]$ ls -l | grep myfile > -rw-rw 1 user1 user1 0 Nov 3 07:23 myfile2.txt > -rw-rw 1 user1 user1 5 Nov 3 07:23 myfile.txt > Just tried with some other python versions > > Ok > Python/3.6.4-intel-2018a > Python/3.7.4-GCCcore-8.3.0 > > Fail > Python/3.8.2-GCCcore-9.3.0 > Python/3.9.5-GCCcore-10.3.0-bare > Python/3.9.5-GCCcore-10.3.0 > Python/3.10.4-GCCcore-11.3.0-bare > Python/3.10.4-GCCcore-11.3.0 > Python/3.10.8-GCCcore-12.2.0-bare These are easybuild python modules using different gcc versions to build. The default version of python referred to is Python 2.7.5. Is this a known problem with python3 and is there a patch we can apply? I wonder how python manages to break the file system so consistently. 
Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Thursday, November 2, 2023 12:12 PM To: Gregory Farnum; Xiubo Li Cc: ceph-users@ceph.io Subject: [ceph-users] Re: ceph fs (meta) data inconsistent Hi all, the problem re-appeared in the following way. After moving the problematic folder out and copying it back, all files showed the correct sizes. Today, we observe that the issue is back in the copy that was fine yesterday: [user1@host11 h2lib]$ ls -l total 37198 -rw-rw 1 user1 user1 11641 Nov 2 11:32 dll_wrapper.py [user2@host2 h2lib]# ls -l total 44 -rw-rw. 1 user1 user1 0 Nov 2 11:28 dll_wrapper.py It is correct in the snapshot though: [user2@host2 h2lib]# cd .snap [user2@host2 .snap]# ls _2023-11-02_000611+0100_daily_1 [user2@host2 .snap]# cd _2023-11-02_000611+0100_daily_1/ [user2@host2 _2023-11-02_000611+0100_daily_1]# ls -l total 37188 -rw-rw. 1 user1 user1 11641 Nov 1 13:30 dll_wrapper.py It seems related to the path, not the inode number. Could the re-appearance of the 0 length have been triggered by taking the snapshot? We plan to reboot the server where the file was written later today. Until then we can do diagnostics while the issue is visible. Please let us know what information we can provide. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Gregory Farnum Sent: Wednesday, November 1, 2023 4:57 PM To: Frank Schilder; Xiubo Li Cc: ceph-users@ceph.io Subject: Re: [ceph-users] ceph fs (meta) data inconsistent We have seen issues like this a few times and they have all been kernel client bugs with CephFS’ internal “capability” file locking protocol. I’m not aware of any extant bugs like this in our code base, but kernel patches can take a long and winding path before they end up on deployed systems. Most likely, if you were to restart some combination of the client which wrote the file and the client(s) reading it, the size would propagate correctly.
As long as you’ve synced the data, it’s definitely present in the cluster. Adding Xiubo, who has worked on these and may have other comments. -Greg On Wed, Nov 1, 2023 at 7:16 AM Frank Schilder <fr...@dtu.dk> wrote: Dear fellow cephers, today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts: [user1@host1 h2lib]$ ls -lh total 37M -rw-rw 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py [user2@host2 h2lib]# ls -l total 34 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test [user1@host1 h2lib]$ ls -l total 37199 -rw-rw 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py -rw-rw 1 user1 user1 11641 Nov
[ceph-users] Re: ceph fs (meta) data inconsistent
Hi all, the problem re-appeared in the following way. After moving the problematic folder out and copying it back, all files showed the correct sizes. Today, we observe that the issue is back in the copy that was fine yesterday: [user1@host11 h2lib]$ ls -l total 37198 -rw-rw 1 user1 user1 11641 Nov 2 11:32 dll_wrapper.py [user2@host2 h2lib]# ls -l total 44 -rw-rw. 1 user1 user1 0 Nov 2 11:28 dll_wrapper.py It is correct in the snapshot though: [user2@host2 h2lib]# cd .snap [user2@host2 .snap]# ls _2023-11-02_000611+0100_daily_1 [user2@host2 .snap]# cd _2023-11-02_000611+0100_daily_1/ [user2@host2 _2023-11-02_000611+0100_daily_1]# ls -l total 37188 -rw-rw. 1 user1 user1 11641 Nov 1 13:30 dll_wrapper.py It seems related to the path, not the inode number. Could the re-appearance of the 0 length have been triggered by taking the snapshot? We plan to reboot the server where the file was written later today. Until then we can do diagnostics while the issue is visible. Please let us know what information we can provide. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Gregory Farnum Sent: Wednesday, November 1, 2023 4:57 PM To: Frank Schilder; Xiubo Li Cc: ceph-users@ceph.io Subject: Re: [ceph-users] ceph fs (meta) data inconsistent We have seen issues like this a few times and they have all been kernel client bugs with CephFS’ internal “capability” file locking protocol. I’m not aware of any extant bugs like this in our code base, but kernel patches can take a long and winding path before they end up on deployed systems. Most likely, if you were to restart some combination of the client which wrote the file and the client(s) reading it, the size would propagate correctly. As long as you’ve synced the data, it’s definitely present in the cluster. Adding Xiubo, who has worked on these and may have other comments.
-Greg On Wed, Nov 1, 2023 at 7:16 AM Frank Schilder <fr...@dtu.dk> wrote: Dear fellow cephers, today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts: [user1@host1 h2lib]$ ls -lh total 37M -rw-rw 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py [user2@host2 h2lib]# ls -l total 34 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test [user1@host1 h2lib]$ ls -l total 37199 -rw-rw 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py -rw-rw 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test [user2@host2 h2lib]# ls -l total 45 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py -rw-rw. 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test Executing a sync on all these hosts did not help. However, deleting the problematic file and replacing it with a copy seemed to work around the issue. We saw this with ceph kclients of different versions; it seems to be on the MDS side. How can this happen and how dangerous is it?
ceph fs status (showing ceph version): # ceph fs status con-fs2 - 1662 clients === RANK STATE MDS ACTIVITY DNS INOS 0 active ceph-15 Reqs: 14 /s 2307k 2278k 1 active ceph-11 Reqs: 159 /s 4208k 4203k 2 active ceph-17 Reqs: 3 /s 4533k 4501k 3 active ceph-24 Reqs: 3 /s 4593k 4300k 4 active ceph-14 Reqs: 1 /s 4228k 4226k 5 active ceph-13 Reqs: 5 /s 1994k 1782k 6 active ceph-16 Reqs: 8 /s 5022k 4841k 7 active ceph-23 Reqs: 9 /s 4140k 4116k POOL TYPE USED AVAIL con-fs2-meta1 metadata 2177G 7085G con-fs2-meta2 data 0 7085G con-fs2-data data 1242T 4233T con-fs2-data-ec-ssd data 706G 22.1T con-fs2-data2 data 3409T 3848T STANDBY MDS ceph-10 ceph-08 ceph-09 ceph-12 MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) There is no health issue: # ceph status cluster: id: abc health: HEALTH_WARN 3 pgs not deep-scrubbed in time services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 9w) mgr: ceph-25 (active, since 7w), standbys: ceph-26, ceph-01, ceph-03, ceph-02 mds: con-fs2:8 4 up:standby 8 up:active osd: 1284 osds: 1279 up (since 2d), 1279 in (since 5d) task status: data: pools: 14 pools, 25065 pgs objects: 2.20G objects, 3.9 PiB usage: 4.9 PiB used, 8.2 PiB / 13 PiB avail pgs: 25039 active+clean 26 active+clean+scrubbing+deep io: client: 799 MiB/s rd, 55 MiB/s wr, 3.12k op/s rd, 1.82k op/s wr The inconsistency seems undiagnosed; I couldn't find anything interesting in the cluster log. What should I look for and where? I moved the folder to another location for diagnosis. Unfortunately, I don't have 2 clients any more s
[ceph-users] ceph fs (meta) data inconsistent
Dear fellow cephers, today we observed a somewhat worrisome inconsistency on our ceph fs. A file created on one host showed up as 0 length on all other hosts: [user1@host1 h2lib]$ ls -lh total 37M -rw-rw 1 user1 user1 12K Nov 1 11:59 dll_wrapper.py [user2@host2 h2lib]# ls -l total 34 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py [user1@host1 h2lib]$ cp dll_wrapper.py dll_wrapper.py.test [user1@host1 h2lib]$ ls -l total 37199 -rw-rw 1 user1 user1 11641 Nov 1 11:59 dll_wrapper.py -rw-rw 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test [user2@host2 h2lib]# ls -l total 45 -rw-rw. 1 user1 user1 0 Nov 1 11:59 dll_wrapper.py -rw-rw. 1 user1 user1 11641 Nov 1 13:10 dll_wrapper.py.test Executing a sync on all these hosts did not help. However, deleting the problematic file and replacing it with a copy seemed to work around the issue. We saw this with ceph kclients of different versions; it seems to be on the MDS side. How can this happen and how dangerous is it? ceph fs status (showing ceph version): # ceph fs status con-fs2 - 1662 clients === RANK STATE MDS ACTIVITY DNS INOS 0 active ceph-15 Reqs: 14 /s 2307k 2278k 1 active ceph-11 Reqs: 159 /s 4208k 4203k 2 active ceph-17 Reqs: 3 /s 4533k 4501k 3 active ceph-24 Reqs: 3 /s 4593k 4300k 4 active ceph-14 Reqs: 1 /s 4228k 4226k 5 active ceph-13 Reqs: 5 /s 1994k 1782k 6 active ceph-16 Reqs: 8 /s 5022k 4841k 7 active ceph-23 Reqs: 9 /s 4140k 4116k POOL TYPE USED AVAIL con-fs2-meta1 metadata 2177G 7085G con-fs2-meta2 data 0 7085G con-fs2-data data 1242T 4233T con-fs2-data-ec-ssd data 706G 22.1T con-fs2-data2 data 3409T 3848T STANDBY MDS ceph-10 ceph-08 ceph-09 ceph-12 MDS version: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) There is no health issue: # ceph status cluster: id: abc health: HEALTH_WARN 3 pgs not deep-scrubbed in time services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 9w) mgr: ceph-25 (active, since 7w), standbys: ceph-26, ceph-01, ceph-03, ceph-02 mds: con-fs2:8 4
up:standby 8 up:active osd: 1284 osds: 1279 up (since 2d), 1279 in (since 5d) task status: data: pools: 14 pools, 25065 pgs objects: 2.20G objects, 3.9 PiB usage: 4.9 PiB used, 8.2 PiB / 13 PiB avail pgs: 25039 active+clean 26 active+clean+scrubbing+deep io: client: 799 MiB/s rd, 55 MiB/s wr, 3.12k op/s rd, 1.82k op/s wr The inconsistency seems undiagnosed; I couldn't find anything interesting in the cluster log. What should I look for and where? I moved the folder to another location for diagnosis. Unfortunately, I don't have 2 clients any more showing different numbers; I see a 0 length now everywhere for the moved folder. I'm pretty sure, though, that the file is still non-zero length. Thanks for any pointers. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: find PG with large omap object
Hi Ramin, that's an interesting approach for part of the problem. Two comments: 1) We don't have RGW services; we observe large omap objects in a ceph fs meta-data pool. Hence, the script doesn't help here. 2) The command "radosgw-admin" has the very annoying property of creating pools without asking. It would be great if you could add a sanity check that confirms that RGW services are actually present *before* executing any radosgw-admin command, and exits if none are present. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Ramin Najjarbashi Sent: Tuesday, October 31, 2023 11:22 AM To: Frank Schilder Cc: Eugen Block; ceph-users@ceph.io Subject: Re: [ceph-users] Re: find PG with large omap object https://gist.github.com/RaminNietzsche/b8702014c333f3a44a995d5a6d4a56be On Mon, Oct 16, 2023 at 4:27 PM Frank Schilder <fr...@dtu.dk> wrote: Hi Eugen, the warning threshold is per omap object, not per PG (which apparently has more than 1 omap object). Still, I misread the numbers by one order of magnitude, which means that the difference between the last 2 entries is about 10, which does point towards a large omap object in PG 12.193. I issued a deep-scrub on this PG and the warning is resolved. Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block <ebl...@nde.ag> Sent: Monday, October 16, 2023 2:41 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: find PG with large omap object Hi Frank, > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } those numbers are > 1 million and the warning threshold is 200k. So a warning is expected. Quoting Frank Schilder <fr...@dtu.dk>: > Hi Eugen, > > thanks for the one-liner :) I'm afraid I'm in the same position as > before though.
> > I dumped all PGs to a file and executed these 2 commands: > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_bytes}] > | sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.193", > "nk": 1002401056 > }, > { > "id": "21.0", > "nk": 1235777228 > } > ] > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } > ] > > Neither is beyond the warn limit and pool 12 is indeed the pool > where the warnings came from. OK, now back to the logs: > > # zgrep -i 'Large omap object found. Object:' /var/log/ceph/ceph.log-* > /var/log/ceph/ceph.log-20231008.gz:2023-10-05T01:25:14.581962+0200 > osd.592 (osd.592) 104 : cluster [WRN] Large omap object found. > Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key > count: 21 Size (bytes): 230080309 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T04:33:02.678879+0200 > osd.949 (osd.949) 6897 : cluster [WRN] Large omap object found. > Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key > count: 200243 Size (bytes): 230307097 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T07:22:40.512228+0200 > osd.988 (osd.988) 4365 : cluster [WRN] Large omap object found. > Object: 12:eb96322f:::637.:head PG: 12.f44c69d7 (12.1d7) Key > count: 200329 Size (bytes): 230310393 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T15:08:03.785186+0200 > osd.50 (osd.50) 4549 : cluster [WRN] Large omap object found. > Object: 12:08fb0eb7:::635.:head PG: 12.ed70df10 (12.110) Key > count: 200183 Size (bytes): 230150641 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T16:37:12.901470+0200 > osd.18 (osd.18) 7011 : cluster [WRN] Large omap object found. > Object: 12:d6758956:::634.:head PG: 12.6a91ae6b (12.6b) Key > count: 200247 Size (bytes): 230343371 > /var/log/ceph/ceph.log-20231008.gz:2023-10-08T01:25:16.125068+0200 > osd.980 (osd.980) 308 : cluster [WRN] Large omap object found. 
> Object: 12:63f985e7:::639.:head PG: 12.e7a19fc6 (12.1c6) Key > count: 200160 Size (bytes): 230179282 > /var/log/ceph/ceph.log-20231015:2023-10-09T00:51:32.587849+0200 > osd.563 (osd.563) 3661 : cluster [WRN] Large omap object found. > Object: 12:44346421:::632.:head PG: 12.8
[ceph-users] Combining masks in ceph config
Hi all, I have a case where I want to set options for a set of HDDs under a common sub-tree with root A. I have also HDDs in another disjoint sub-tree with root B. Therefore, I would like to do something like ceph config set osd/class:hdd,datacenter:A option value The above does not give a syntax error, but I'm also not sure it does the right thing. Does the above mean "class:hdd and datacenter:A" or does it mean "for OSDs with device class 'hdd,datacenter:A'"? Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
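The two readings of the mask can be made concrete with a small illustrative parser. This is not Ceph's actual mask parser, just a sketch of the ambiguity being asked about: either the comma separates two ANDed constraints, or everything after the first colon is one literal device-class value.

```python
# Illustrative sketch only -- NOT ceph's actual config-mask parser.
# It demonstrates the two possible readings of "class:hdd,datacenter:A".

def parse_mask_as_and(mask: str) -> dict:
    """Reading (a): split on ',' into separate key:value constraints."""
    constraints = {}
    for part in mask.split(","):
        key, _, value = part.partition(":")
        constraints[key] = value
    return constraints

def parse_mask_as_single(mask: str) -> dict:
    """Reading (b): everything after the first ':' is one literal value."""
    key, _, value = mask.partition(":")
    return {key: value}

if __name__ == "__main__":
    mask = "class:hdd,datacenter:A"
    print(parse_mask_as_and(mask))     # {'class': 'hdd', 'datacenter': 'A'}
    print(parse_mask_as_single(mask))  # {'class': 'hdd,datacenter:A'}
```

If ceph uses reading (b), the option would silently apply to no OSD at all, since no device class is literally named "hdd,datacenter:A" -- which is why the lack of a syntax error proves nothing.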
[ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure
I just tried what sending SIGSTOP and SIGCONT do. After stopping the process 3 caps were returned. After resuming the process these 3 caps were allocated again. There seems to be a large number of stale caps that are not released. While the process was stopped the kworker thread continued to show 2% CPU usage even though there was no file IO going on. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Thursday, October 19, 2023 10:02 AM To: Stefan Kooman; ceph-users@ceph.io Subject: [ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure Hi Stefan, the jobs ended and the warning disappeared as expected. However, a new job started and the warning showed up again. There is something very strange going on and, maybe, you can help out here: We have a low client CAPS limit configured for performance reasons: # ceph config dump | grep client [...] mds advanced mds_max_caps_per_client 65536 The job in question holds more than that: # ceph tell mds.0 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | tail [...] {"id":172249397,"h":"sn272...","addr":"client.172249397 v1:192.168.57.143:0/195146548","fs":"/hpc/home","caps":105417,"req":1442} This CAPS allocation is stable over time, the number doesn't change (I queried multiple times with several minutes interval). My guess is that the MDS message is not about cache pressure but rather about caps trimming. We do have clients that regularly exceed the limit though without MDS warnings. My guess is that these return at least some CAPS on request and are, therefore, not flagged. The client above seems to sit on a fixed set of CAPS that doesn't change and this causes the warning to show up. 
The strange thing now is that very few files (on ceph fs) are actually open on the client: [USER@sn272 ~]$ lsof -u USER | grep -e /home -e /groups -e /apps | wc -l 170 The kworker thread is at about 3% CPU and should be able to release CAPS. I'm wondering why it doesn't happen though. I also don't believe that 170 open files can allocate 105417 client caps. Questions: - Why does the client have so many caps allocated? Is there another way than open files that requires allocations? - Is there a way to find out what these caps are for? - We will look at the code (it's Python + miniconda), any pointers on what to look for? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Tuesday, October 17, 2023 11:27 AM To: Stefan Kooman; ceph-users@ceph.io Subject: [ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure Hi Stefan, probably. It's 2 compute nodes and there are jobs running. Our epilogue script will drop the caches, at which point I indeed expect the warning to disappear. We have no time limit on these nodes though, so this can be a while. I was hoping there was an alternative to that, say, a user-level command that I could execute on the client without possibly affecting other users' jobs. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ____ From: Stefan Kooman Sent: Tuesday, October 17, 2023 11:13 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] stuck MDS warning: Client HOST failing to respond to cache pressure On 17-10-2023 09:22, Frank Schilder wrote: > Hi all, > > I'm affected by a stuck MDS warning for 2 clients: "failing to respond to > cache pressure". This is a false alarm as no MDS is under any cache pressure. > The warning is stuck already for a couple of days. I found some old threads > about cases where the MDS does not update flags/triggers for this warning in > certain situations. 
Dating back to luminous and I'm probably hitting one of > these. > > In these threads I could find a lot except for instructions for how to clear > this out in a nice way. Is there something I can do on the clients to clear > this warning? I don't want to evict/reboot just because of that. echo 2 > /proc/sys/vm/drop_caches on the clients does that help? Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
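The jq pipeline quoted in this thread can be narrowed to just the over-limit sessions. Here is a Python sketch that assumes the same JSON shape as the `ceph tell mds.0 session ls` output shown above; field names can differ between Ceph versions, so treat them as an assumption:

```python
import json

# Sketch: filter "ceph tell mds.N session ls" output for clients holding
# more caps than mds_max_caps_per_client (65536 in this thread). Field
# names (.id, .client_metadata.hostname, .num_caps) follow the jq
# pipeline quoted above.

def sessions_over_limit(session_ls_json: str, max_caps: int = 65536):
    sessions = json.loads(session_ls_json)
    over = [
        {
            "id": s.get("id"),
            "host": s.get("client_metadata", {}).get("hostname"),
            "caps": s.get("num_caps", 0),
        }
        for s in sessions
        if s.get("num_caps", 0) > max_caps
    ]
    # worst offender first
    return sorted(over, key=lambda s: s["caps"], reverse=True)

if __name__ == "__main__":
    sample = json.dumps([
        {"id": 172249397,
         "client_metadata": {"hostname": "sn272"},
         "num_caps": 105417},
        {"id": 1, "client_metadata": {"hostname": "ok-host"}, "num_caps": 1200},
    ])
    for s in sessions_over_limit(sample):
        print(s)  # only the sn272-like session exceeds the 65536 limit
```

In practice one would pipe the real output in, e.g. `ceph tell mds.0 session ls > sessions.json` and load that file instead of the inline sample.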
[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes
Hi Zakhar, since it's a bit beyond the scope of the basics, could you please post the complete ceph.conf config section for these changes for reference? Thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Zakhar Kirpichenko Sent: Wednesday, October 18, 2023 6:14 AM To: Eugen Block Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Ceph 16.2.x mon compactions, disk writes Many thanks for this, Eugen! I very much appreciate yours and Mykola's efforts and insight! Another thing I noticed was a reduction of the RocksDB store after reducing the total PG number by 30%, from 590-600 MB: 65M 3675511.sst 65M 3675512.sst 65M 3675513.sst 65M 3675514.sst 65M 3675515.sst 65M 3675516.sst 65M 3675517.sst 65M 3675518.sst 62M 3675519.sst to about half of the original size: -rw-r--r-- 1 167 167 7218886 Oct 13 16:16 3056869.log -rw-r--r-- 1 167 167 67250650 Oct 13 16:15 3056871.sst -rw-r--r-- 1 167 167 67367527 Oct 13 16:15 3056872.sst -rw-r--r-- 1 167 167 63268486 Oct 13 16:15 3056873.sst Then when I restarted the monitors one by one before adding compression, the RocksDB store shrank even further. I am not sure why and what exactly got automatically removed from the store: -rw-r--r-- 1 167 167 841960 Oct 18 03:31 018779.log -rw-r--r-- 1 167 167 67290532 Oct 18 03:31 018781.sst -rw-r--r-- 1 167 167 53287626 Oct 18 03:31 018782.sst Then I enabled LZ4 and LZ4HC compression in our small production cluster (6 nodes, 96 OSDs) on 3 out of 5 monitors: compression=kLZ4Compression,bottommost_compression=kLZ4HCCompression. I specifically went for LZ4 and LZ4HC because of the balance between compression/decompression speed and impact on CPU usage. The compression doesn't seem to affect the cluster in any negative way, the 3 monitors with compression are operating normally. 
The effect of the compression on RocksDB store size and disk writes is quite noticeable: Compression disabled, 155 MB store.db, ~125 MB RocksDB sst, and ~530 MB writes over 5 minutes: -rw-r--r-- 1 167 167 4227337 Oct 18 03:58 3080868.log -rw-r--r-- 1 167 167 67253592 Oct 18 03:57 3080870.sst -rw-r--r-- 1 167 167 57783180 Oct 18 03:57 3080871.sst # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon 155M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph04/store.db/ 2471602 be/4 167 6.05 M 473.24 M 0.00 % 0.16 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0] 2471633 be/4 167 188.00 K 40.91 M 0.00 % 0.02 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch] 2471603 be/4 167 16.00 K 24.16 M 0.00 % 0.01 % ceph-mon -n mon.ceph04 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0] Compression enabled, 60 MB store.db, ~23 MB RocksDB sst, and ~130 MB of writes over 5 minutes: -rw-r--r-- 1 167 167 5766659 Oct 18 03:56 3723355.log -rw-r--r-- 1 167 167 22240390 Oct 18 03:56 3723357.sst # du -hs /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/; iotop -ao -bn 2 -d 300 2>&1 | grep ceph-mon 60M /var/lib/ceph/3f50555a-ae2a-11eb-a2fc-ffde44714d86/mon.ceph03/store.db/ 2052031 be/4 167 1040.00 K 83.48 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true 
--default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:low0] 2052062 be/4 167 0.00 B 40.79 M 0.00 % 0.01 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [ms_dispatch] 2052032 be/4 167 16.00 K 4.68 M 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster-log-to-stderr=true [rocksdb:high0] 2052052 be/4 167 44.00 K 0.00 B 0.00 % 0.00 % ceph-mon -n mon.ceph03 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug --default-mon-cluster-log-to-file=false --default-mon-cluster
[ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure
Hi Loïc, thanks for the pointer. It's kind of the opposite extreme to dropping just everything: I need to know the file name that is in cache. I'm looking for a middle way, say, "drop_caches -u USER" that drops all caches of files owned by user USER. This way I could try dropping caches for a bunch of users who are *not* running a job. I guess I have to wait for the jobs to end. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Loïc Tortay Sent: Tuesday, October 17, 2023 3:40 PM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: stuck MDS warning: Client HOST failing to respond to cache pressure On 17/10/2023 11:27, Frank Schilder wrote: > Hi Stefan, > > probably. Its 2 compute nodes and there are jobs running. Our epilogue script > will drop the caches, at which point I indeed expect the warning to > disappear. We have no time limit on these nodes though, so this can be a > while. I was hoping there was an alternative to that, say, a user-level > command that I could execute on the client without possibly affecting other > users jobs. > Hello, If you know the names of the files to flush from the cache (from /proc/$PID/fd, lsof, batch job script, ...), you can use something like https://github.com/tortay/cache-toys/blob/master/drop-from-pagecache.c on the client. See comments line 16 to 22 of the source code for caveats/limitations. Loïc. -- | Loïc Tortay - IN2P3 Computing Centre | ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
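The "drop_caches -u USER" middle way could, in principle, be built from the same ingredients Loïc mentions (/proc/$PID/fd plus fadvise). The sketch below is hypothetical and Linux-only: the function names are ours, it needs root to inspect other users' processes, and dropping page-cache pages does not necessarily make the kernel client release the corresponding caps, which is exactly the open question in this thread.

```python
import os

# Hypothetical "drop_caches -u USER" sketch (Linux-only): enumerate files a
# user has open under given mount prefixes via /proc/<pid>/fd, then ask the
# kernel to drop their cached pages with POSIX_FADV_DONTNEED.

def under_mounts(path: str, mounts) -> bool:
    """True if path lives under one of the given mount prefixes."""
    return any(path == m or path.startswith(m.rstrip("/") + "/") for m in mounts)

def open_files_of_uid(uid: int, mounts):
    """Collect files under `mounts` held open by processes owned by `uid`."""
    files = set()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            if os.stat(f"/proc/{pid}").st_uid != uid:
                continue
            for fd in os.listdir(f"/proc/{pid}/fd"):
                target = os.readlink(f"/proc/{pid}/fd/{fd}")
                if under_mounts(target, mounts):
                    files.add(target)
        except OSError:
            continue  # process exited, or permission denied without root
    return sorted(files)

def drop_pages(path: str):
    """Advise the kernel to evict this file's pages from the page cache."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

if __name__ == "__main__" and os.path.isdir("/proc"):
    for f in open_files_of_uid(os.getuid(), ["/home", "/groups", "/apps"]):
        print(f)  # review the list before actually calling drop_pages(f)
```

Unlike `echo 2 > /proc/sys/vm/drop_caches`, this is scoped to one user's open files, so other users' jobs keep their caches warm.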
[ceph-users] stuck MDS warning: Client HOST failing to respond to cache pressure
Hi all, I'm affected by a stuck MDS warning for 2 clients: "failing to respond to cache pressure". This is a false alarm as no MDS is under any cache pressure. The warning is stuck already for a couple of days. I found some old threads about cases where the MDS does not update flags/triggers for this warning in certain situations. Dating back to luminous and I'm probably hitting one of these. In these threads I could find a lot except for instructions for how to clear this out in a nice way. Is there something I can do on the clients to clear this warning? I don't want to evict/reboot just because of that. Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: find PG with large omap object
Hi Eugen, the warning threshold is per omap object, not per PG (which apparently has more than 1 omap object). Still, I misread the numbers by 1 order, which means that the difference between the last 2 entries is about 10, which does point towards a large omap object in PG 12.193. I issued a deep-scrub on this PG and the warning is resolved. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Monday, October 16, 2023 2:41 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: find PG with large omap object Hi Frank, > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } those numbers are > 1 million and the warning threshold is 200k. So a warning is expected. Zitat von Frank Schilder : > Hi Eugen, > > thanks for the one-liner :) I'm afraid I'm in the same position as > before though. > > I dumped all PGs to a file and executed these 2 commands: > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_bytes}] > | sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.193", > "nk": 1002401056 > }, > { > "id": "21.0", > "nk": 1235777228 > } > ] > > # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | > sort_by(.nk)' pgs.dump | tail > }, > { > "id": "12.17b", > "nk": 1493776 > }, > { > "id": "12.193", > "nk": 1583589 > } > ] > > Neither is beyond the warn limit and pool 12 is indeed the pool > where the warnings came from. OK, now back to the logs: > > # zgrep -i 'Large omap object found. Object:' /var/log/ceph/ceph.log-* > /var/log/ceph/ceph.log-20231008.gz:2023-10-05T01:25:14.581962+0200 > osd.592 (osd.592) 104 : cluster [WRN] Large omap object found. 
> Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key > count: 21 Size (bytes): 230080309 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T04:33:02.678879+0200 > osd.949 (osd.949) 6897 : cluster [WRN] Large omap object found. > Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key > count: 200243 Size (bytes): 230307097 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T07:22:40.512228+0200 > osd.988 (osd.988) 4365 : cluster [WRN] Large omap object found. > Object: 12:eb96322f:::637.:head PG: 12.f44c69d7 (12.1d7) Key > count: 200329 Size (bytes): 230310393 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T15:08:03.785186+0200 > osd.50 (osd.50) 4549 : cluster [WRN] Large omap object found. > Object: 12:08fb0eb7:::635.:head PG: 12.ed70df10 (12.110) Key > count: 200183 Size (bytes): 230150641 > /var/log/ceph/ceph.log-20231008.gz:2023-10-07T16:37:12.901470+0200 > osd.18 (osd.18) 7011 : cluster [WRN] Large omap object found. > Object: 12:d6758956:::634.:head PG: 12.6a91ae6b (12.6b) Key > count: 200247 Size (bytes): 230343371 > /var/log/ceph/ceph.log-20231008.gz:2023-10-08T01:25:16.125068+0200 > osd.980 (osd.980) 308 : cluster [WRN] Large omap object found. > Object: 12:63f985e7:::639.:head PG: 12.e7a19fc6 (12.1c6) Key > count: 200160 Size (bytes): 230179282 > /var/log/ceph/ceph.log-20231015:2023-10-09T00:51:32.587849+0200 > osd.563 (osd.563) 3661 : cluster [WRN] Large omap object found. > Object: 12:44346421:::632.:head PG: 12.84262c22 (12.22) Key > count: 200325 Size (bytes): 230481029 > /var/log/ceph/ceph.log-20231015:2023-10-09T15:35:28.803117+0200 > osd.949 (osd.949) 7088 : cluster [WRN] Large omap object found. > Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key > count: 200327 Size (bytes): 230404872 > /var/log/ceph/ceph.log-20231015:2023-10-09T18:51:35.615096+0200 > osd.592 (osd.592) 461 : cluster [WRN] Large omap object found. 
> Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key > count: 200228 Size (bytes): 230347361 > > The warnings report a key count > 20, but none of the PGs in the > dump does. Apparently, all these PGs were (deep-) scrubbed already > and the omap key count was updated (or am I misunderstanding > something here). I still don't know and can neither conclude which > PG the warning originates from. As far as I can tell, the warning > should not be there. > > Do you have an idea how to continue diagnosis from here apart from > just t
[ceph-users] Re: find PG with large omap object
Hi Eugen, thanks for the one-liner :) I'm afraid I'm in the same position as before though. I dumped all PGs to a file and executed these 2 commands: # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_bytes}] | sort_by(.nk)' pgs.dump | tail }, { "id": "12.193", "nk": 1002401056 }, { "id": "21.0", "nk": 1235777228 } ] # jq '[.pg_stats[] | {"id": .pgid, "nk": .stat_sum.num_omap_keys}] | sort_by(.nk)' pgs.dump | tail }, { "id": "12.17b", "nk": 1493776 }, { "id": "12.193", "nk": 1583589 } ] Neither is beyond the warn limit and pool 12 is indeed the pool where the warnings came from. OK, now back to the logs: # zgrep -i 'Large omap object found. Object:' /var/log/ceph/ceph.log-* /var/log/ceph/ceph.log-20231008.gz:2023-10-05T01:25:14.581962+0200 osd.592 (osd.592) 104 : cluster [WRN] Large omap object found. Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key count: 21 Size (bytes): 230080309 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T04:33:02.678879+0200 osd.949 (osd.949) 6897 : cluster [WRN] Large omap object found. Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key count: 200243 Size (bytes): 230307097 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T07:22:40.512228+0200 osd.988 (osd.988) 4365 : cluster [WRN] Large omap object found. Object: 12:eb96322f:::637.:head PG: 12.f44c69d7 (12.1d7) Key count: 200329 Size (bytes): 230310393 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T15:08:03.785186+0200 osd.50 (osd.50) 4549 : cluster [WRN] Large omap object found. Object: 12:08fb0eb7:::635.:head PG: 12.ed70df10 (12.110) Key count: 200183 Size (bytes): 230150641 /var/log/ceph/ceph.log-20231008.gz:2023-10-07T16:37:12.901470+0200 osd.18 (osd.18) 7011 : cluster [WRN] Large omap object found. Object: 12:d6758956:::634.:head PG: 12.6a91ae6b (12.6b) Key count: 200247 Size (bytes): 230343371 /var/log/ceph/ceph.log-20231008.gz:2023-10-08T01:25:16.125068+0200 osd.980 (osd.980) 308 : cluster [WRN] Large omap object found. 
Object: 12:63f985e7:::639.:head PG: 12.e7a19fc6 (12.1c6) Key count: 200160 Size (bytes): 230179282 /var/log/ceph/ceph.log-20231015:2023-10-09T00:51:32.587849+0200 osd.563 (osd.563) 3661 : cluster [WRN] Large omap object found. Object: 12:44346421:::632.:head PG: 12.84262c22 (12.22) Key count: 200325 Size (bytes): 230481029 /var/log/ceph/ceph.log-20231015:2023-10-09T15:35:28.803117+0200 osd.949 (osd.949) 7088 : cluster [WRN] Large omap object found. Object: 12:c9a32586:::63a.:head PG: 12.61a4c593 (12.193) Key count: 200327 Size (bytes): 230404872 /var/log/ceph/ceph.log-20231015:2023-10-09T18:51:35.615096+0200 osd.592 (osd.592) 461 : cluster [WRN] Large omap object found. Object: 12:c05de58b:::63b.:head PG: 12.d1a7ba03 (12.3) Key count: 200228 Size (bytes): 230347361 The warnings report a key count > 20, but none of the PGs in the dump does. Apparently, all these PGs were (deep-) scrubbed already and the omap key count was updated (or am I misunderstanding something here). I still don't know and can neither conclude which PG the warning originates from. As far as I can tell, the warning should not be there. Do you have an idea how to continue diagnosis from here apart from just trying a deep scrub on all PGs in the list from the log? 
Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Monday, October 16, 2023 1:41 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: find PG with large omap object Hi, not sure if this is what you need, but if you know the pool id (you probably should) you could try this, it's from an Octopus test cluster (assuming the warning was for the number of keys, not bytes): $ ceph -f json pg dump pgs 2>/dev/null | jq -r '.pg_stats[] | select (.pgid | startswith("17.")) | .pgid + " " + "\(.stat_sum.num_omap_keys)"' 17.6 191 17.7 759 17.4 358 17.5 0 17.2 177 17.3 1 17.0 375 17.1 176 If you don't know the pool you could sort the ouput by the second column and see which PG has the largest number of omap_keys. Regards, Eugen Zitat von Frank Schilder : > Hi all, > > we had a bunch of large omap object warnings after a user deleted a > lot of files on a ceph fs with snapshots. After the snapshots were > rotated out, all but one of these warnings disappeared over time. > However, one warning is stuck and I wonder if its something else. > > Is there a reasonable way (say, one-liner with no more than 120 > characters) to get ceph to tell me which PG this is coming from? I > just want to issue a deep scrub to check if it disappears and going > through th
[ceph-users] find PG with large omap object
Hi all, we had a bunch of large omap object warnings after a user deleted a lot of files on a ceph fs with snapshots. After the snapshots were rotated out, all but one of these warnings disappeared over time. However, one warning is stuck and I wonder if its something else. Is there a reasonable way (say, one-liner with no more than 120 characters) to get ceph to tell me which PG this is coming from? I just want to issue a deep scrub to check if it disappears and going through the logs and querying every single object for its key count seems a bit of a hassle for something that ought to be part of "ceph health detail". Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
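For reference, the jq one-liners that answer this question later in the thread can also be expressed in Python over the output of `ceph -f json pg dump pgs`. The field names (`.pg_stats[].stat_sum.num_omap_keys`) are taken from the output quoted in this thread:

```python
import json

# Sketch: rank PGs by omap key count from "ceph -f json pg dump pgs",
# equivalent to the jq pipelines used in this thread. Optionally restrict
# to one pool by its numeric id prefix (e.g. "12" for PGs "12.*").

def top_omap_pgs(pg_dump_json: str, pool: str = None, n: int = 10):
    stats = json.loads(pg_dump_json)["pg_stats"]
    if pool is not None:
        stats = [s for s in stats if s["pgid"].startswith(pool + ".")]
    ranked = sorted(stats, key=lambda s: s["stat_sum"]["num_omap_keys"])
    return [(s["pgid"], s["stat_sum"]["num_omap_keys"]) for s in ranked[-n:]]

if __name__ == "__main__":
    # sample data mirroring the numbers quoted in this thread
    sample = json.dumps({"pg_stats": [
        {"pgid": "12.17b", "stat_sum": {"num_omap_keys": 1493776}},
        {"pgid": "12.193", "stat_sum": {"num_omap_keys": 1583589}},
        {"pgid": "21.0", "stat_sum": {"num_omap_keys": 42}},
    ]})
    print(top_omap_pgs(sample, pool="12", n=2))
    # [('12.17b', 1493776), ('12.193', 1583589)]
```

In practice: `ceph -f json pg dump pgs > pgs.json 2>/dev/null`, then load that file; the last tuple is the candidate for a targeted deep-scrub.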
[ceph-users] Re: Please help collecting stats of Ceph monitor disk writes
> so that an average number of compactions per hour can be calculated For this you need the time window length for the counts. That's not automatic with the command you sent. For example, my MON log file started at 4 am because of a logrotate. So the assumption that the current time and count will give a correct average for the day starting at 00:00 failed in my situation. To get a correct count for today I would have had to combine 2 log files and send you the current time and count: # grep -i "manual compaction from" /var/log/ceph/ceph-mon.ceph-01.log-20231013 /var/log/ceph/ceph-mon.ceph-01.log | grep "2023-10-13" | wc -l 1226 # grep -i "manual compaction from" /var/log/ceph/ceph-mon.ceph-01.log | grep "2023-10-13" | wc -l 912 # date Fri Oct 13 16:07:57 CEST 2023 Apart from that, I agree that a shorter time window will do. It's the start time of the log that's missing. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
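The full-window requirement can be made explicit in a small script: count compaction lines per calendar day over the concatenated (rotated) logs, and only report days that lie strictly between the first and last date seen, since only those are provably covered from 00:00 to 23:59. A sketch, assuming log lines start with an ISO timestamp (YYYY-MM-DDT...) as in the ceph log excerpts quoted in these threads:

```python
from collections import Counter

# Sketch: per-day "manual compaction" counts over concatenated MON logs,
# reporting only days fully covered by the logs' time span (i.e. dates
# strictly between the first and last date observed).

def compactions_per_full_day(lines):
    dates_seen = set()
    counts = Counter()
    for line in lines:
        date = line[:10]
        if len(date) == 10 and date[4] == "-" and date[7] == "-":
            dates_seen.add(date)
            if "manual compaction from" in line:
                counts[date] += 1
    if not dates_seen:
        return {}
    lo, hi = min(dates_seen), max(dates_seen)
    return {d: counts[d] for d in sorted(dates_seen) if lo < d < hi}

if __name__ == "__main__":
    lines = [
        "2023-10-12T23:59:58 ... manual compaction from x",
        "2023-10-13T00:10:00 ... manual compaction from x",
        "2023-10-13T12:00:00 ... something else",
        "2023-10-13T16:00:00 ... manual compaction from x",
        "2023-10-14T00:00:01 ... boot",
    ]
    print(compactions_per_full_day(lines))  # {'2023-10-13': 2}
```

Feeding it `zcat -f /var/log/ceph/ceph-mon.*.log*` would sidestep the logrotate problem described above, because partial first and last days are dropped automatically.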
[ceph-users] Re: Please help collecting stats of Ceph monitor disk writes
Hi Zakhar, I'm pretty sure you wanted the #manual compactions for an entire day, not from whenever the log starts to current time, which is most often not 23:59. You need to get the date from the previous day and make sure the log contains a full 00:00-23:59 window. 1) iotop results: TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 2256 be/4 ceph 0.00 B 17.48 M 0.00 % 0.80 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [safe_timer] 2230 be/4 ceph 0.00 B 1514.19 M 0.00 % 0.37 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:low0] 2250 be/4 ceph 0.00 B 36.23 M 0.00 % 0.15 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [fn_monstore] 2231 be/4 ceph 0.00 B 50.52 M 0.00 % 0.02 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:high0] 2225 be/4 ceph 0.00 B 120.00 K 0.00 % 0.00 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [log] 2) manual compactions (over a full 24h window): 1882 3) monitor store.db size: 616M 4) cluster version and status: ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) cluster: id: xxx health: HEALTH_WARN 1 large omap objects services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 7w) mgr: ceph-25(active, since 4w), standbys: ceph-26, ceph-01, ceph-03, ceph-02 mds: con-fs2:8 4 up:standby 8 up:active osd: 1284 osds: 1282 up (since 27h), 1282 in (since 3w) task status: data: pools: 14 pools, 25065 pgs objects: 2.18G objects, 3.9 PiB usage: 4.8 PiB used, 8.3 PiB / 13 PiB avail 
pgs: 25037 active+clean 26 active+clean+scrubbing+deep 2 active+clean+scrubbing io: client: 1.7 GiB/s rd, 1013 MiB/s wr, 3.02k op/s rd, 1.78k op/s wr Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes
Oh wow! I never bothered looking, because on our hardware the wear is so low: # iotop -ao -bn 2 -d 300 Total DISK READ : 0.00 B/s | Total DISK WRITE : 6.46 M/s Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 6.47 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND 2230 be/4 ceph 0.00 B 1818.71 M 0.00 % 0.46 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:low0] 2256 be/4 ceph 0.00 B 19.27 M 0.00 % 0.43 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [safe_timer] 2250 be/4 ceph 0.00 B 42.38 M 0.00 % 0.26 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [fn_monstore] 2231 be/4 ceph 0.00 B 58.36 M 0.00 % 0.01 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [rocksdb:high0] 644 be/3 root 0.00 B 576.00 K 0.00 % 0.00 % [jbd2/sda3-8] 2225 be/4 ceph 0.00 B 128.00 K 0.00 % 0.00 % ceph-mon --cluster ceph --setuser ceph --setgroup ceph --foreground -i ceph-01 --mon-data /var/lib/ceph/mon/ceph-ceph-01 --public-addr 192.168.32.65 [log] 1637141 be/4 root 0.00 B 0.00 B 0.00 % 0.00 % [kworker/u113:2-flush-8:0] 1636453 be/4 root 0.00 B 0.00 B 0.00 % 0.00 % [kworker/u112:0-ceph0] 1560 be/4 root 0.00 B 20.00 K 0.00 % 0.00 % rsyslogd -n [in:imjournal] 1561 be/4 root 0.00 B 56.00 K 0.00 % 0.00 % rsyslogd -n [rs:main Q:Reg] 1.8GB every 5 minutes, that's 518GB per day. The 400G drives we have are rated 10DWPD and with the 6-drive RAID10 config this gives plenty of life-time. 
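The arithmetic behind these numbers, as a sketch; the RAID10 model (6 drives as 3 striped mirror pairs, so every byte lands on 2 of the 6 drives) is our assumption for illustration:

```python
# Sketch of the endurance arithmetic above: extrapolate an iotop sample to
# a daily write volume, then to drive-writes-per-day (DWPD) per drive.
# The RAID10 write-amplification model (copies=2 spread over n_drives=6)
# is an assumption, not a measurement.

def daily_writes_gb(sample_gb: float, sample_seconds: int) -> float:
    """Extrapolate a sampling window to GB written per 24h."""
    return sample_gb * 86400 / sample_seconds

def dwpd_per_drive(daily_gb: float, drive_gb: float,
                   n_drives: int = 1, copies: int = 1) -> float:
    """Per-drive drive-writes-per-day for an array of n_drives holding
    `copies` copies of every byte written."""
    per_drive_gb = daily_gb * copies / n_drives
    return per_drive_gb / drive_gb

if __name__ == "__main__":
    daily = daily_writes_gb(1.8, 300)  # 1.8 GB per 300 s sample
    print(round(daily, 1))             # 518.4 (GB/day, matching the text)
    # 400G drives in a 6-drive RAID10: 2 copies spread over 6 drives
    print(round(dwpd_per_drive(daily, 400, n_drives=6, copies=2), 3))  # 0.432
```

At roughly 0.43 DWPD per drive against a 10 DWPD rating, the "plenty of life-time" conclusion follows directly; the same functions make it easy to see how a single small consumer boot SSD (1 copy, 1 drive) would fare much worse.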
I guess this write load will kill any low-grade SSD (typical boot devices, even enterprise ones), especially if they are smaller drives and the controller doesn't reallocate cells according to remaining write endurance. I guess there was a reason for the recommendations by Dell. I always thought that the recent recommendations for MON store storage in the ceph docs are a "bit unrealistic", apparently both in size and in performance (including endurance). Well, I guess you need to look for write-intensive drives with decent specs. If you do, also go for sufficient size. This will absorb temporary usage peaks, which can be very large, and also provides extra endurance on SSDs with good controllers. I also think the recommendations in the ceph docs deserve a reality check.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Zakhar Kirpichenko
Sent: Wednesday, October 11, 2023 4:30 PM
To: Eugen Block
Cc: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

Eugen,

Thanks for your response. May I ask what numbers you're referring to? I am not referring to monitor store.db sizes. I am specifically referring to the writes monitors do to their store.db files by frequently rotating and replacing them with new versions during compactions. The size of the store.db remains more or less the same.
This is a 300s iotop snippet, sorted by aggregated disk writes:

Total DISK READ:   35.56 M/s | Total DISK WRITE:   23.89 M/s
Current DISK READ: 35.64 M/s | Current DISK WRITE: 24.09 M/s
    TID  PRIO  USER  DISK READ  DISK WRITE>  SWAPIN      IO    COMMAND
   4919  be/4  167    16.75 M     2.24 G     0.00 %  1.34 %  ceph-mon -n mon.ceph03 -f --setuser ceph --setgr~lt-mon-cluster-log-to-stderr=true [rocksdb:low0]
  15122  be/4  167     0.00 B   652.91 M     0.00 %  0.27 %  ceph-osd -n osd.31 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17073  be/4  167     0.00 B   651.86 M     0.00 %  0.27 %  ceph-osd -n osd.32 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17268  be/4  167     0.00 B   490.86 M     0.00 %  0.18 %  ceph-osd -n osd.25 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  18032  be/4  167     0.00 B   463.57 M     0.00 %  0.17 %  ceph-osd -n osd.26 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  16855  be/4  167     0.00 B   402.86 M     0.00 %  0.15 %  ceph-osd -n osd.22 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17406  be/4  167     0.00 B   387.03 M     0.00 %  0.14 %  ceph-osd -n osd.27 -f --setuser ceph --setgroup ~default-log-stderr-prefix=debug [bstore_kv_sync]
  17932  be/4  167     0.00 B   375.42 M     0.00 %  0.13 %  ceph-osd -n osd.29 -f -
[ceph-users] Re: Ceph 16.2.x mon compactions, disk writes
I need to ask here: where exactly do you observe the hundreds of GB written per day? Are the mon logs huge? Is it the mon store? Is your cluster unhealthy?

We have an octopus cluster with 1282 OSDs, 1650 ceph fs clients and about 800 librbd clients. Per week our mon logs are about 70M, the cluster logs about 120M, the audit logs about 70M, and I see between 100-200 Kb/s writes to the mon store. That's in the low single-digit GB range per day. Hundreds of GB per day sounds completely over the top on a healthy cluster, unless you have MGR modules changing the OSD/cluster map continuously. Is autoscaler running and doing stuff? Is balancer running and doing stuff? Is backfill going on? Is recovery going on? Is your ceph version affected by the "excessive logging to MON store" issue that was present starting with pacific but should have been addressed by now?

@Eugen: Was there not an option to limit logging to the MON store?

For information to readers, we followed old recommendations from a Dell white paper for building a ceph cluster and have a 1 TB RAID10 array of 6 write-intensive SSDs for the MON stores. After 5 years we are below 10% wear. Average size of the MON store for a healthy cluster is 500M-1G, but we have seen this ballooning to 100+GB in degraded conditions.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Zakhar Kirpichenko
Sent: Wednesday, October 11, 2023 12:00 PM
To: Eugen Block
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph 16.2.x mon compactions, disk writes

Thank you, Eugen. I'm specifically interested in finding out whether the huge amount of data written by monitors is expected. It is eating through the endurance of our system drives, which were not specced for high DWPD/TBW, as this is not a documented requirement, and monitors produce hundreds of gigabytes of writes per day. I am looking for ways to reduce the amount of writes, if possible.
/Z On Wed, 11 Oct 2023 at 12:41, Eugen Block wrote: > Hi, > > what you report is the expected behaviour, at least I see the same on > all clusters. I can't answer why the compaction is required that > often, but you can control the log level of the rocksdb output: > > ceph config set mon debug_rocksdb 1/5 (default is 4/5) > > This reduces the log entries and you wouldn't see the manual > compaction logs anymore. There are a couple more rocksdb options but I > probably wouldn't change too much, only if you know what you're doing. > Maybe Igor can comment if some other tuning makes sense here. > > Regards, > Eugen > > Zitat von Zakhar Kirpichenko : > > > Any input from anyone, please? > > > > On Tue, 10 Oct 2023 at 09:44, Zakhar Kirpichenko > wrote: > > > >> Any input from anyone, please? > >> > >> It's another thing that seems to be rather poorly documented: it's > unclear > >> what to expect, what 'normal' behavior should be, and what can be done > >> about the huge amount of writes by monitors. > >> > >> /Z > >> > >> On Mon, 9 Oct 2023 at 12:40, Zakhar Kirpichenko > wrote: > >> > >>> Hi, > >>> > >>> Monitors in our 16.2.14 cluster appear to quite often run "manual > >>> compaction" tasks: > >>> > >>> debug 2023-10-09T09:30:53.888+ 7f48a329a700 4 rocksdb: > EVENT_LOG_v1 > >>> {"time_micros": 1696843853892760, "job": 64225, "event": > "flush_started", > >>> "num_memtables": 1, "num_entries": 715, "num_deletes": 251, > >>> "total_data_size": 3870352, "memory_usage": 3886744, "flush_reason": > >>> "Manual Compaction"} > >>> debug 2023-10-09T09:30:53.904+ 7f4899286700 4 rocksdb: > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction > >>> starting > >>> debug 2023-10-09T09:30:53.908+ 7f48a3a9b700 4 rocksdb: (Original > Log > >>> Time 2023/10/09-09:30:53.910204) > [db_impl/db_impl_compaction_flush.cc:2516] > >>> [default] Manual compaction from level-0 to level-5 from 'paxos .. 
> 'paxos; > >>> will stop at (end) > >>> debug 2023-10-09T09:30:53.908+ 7f4899286700 4 rocksdb: > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction > >>> starting > >>> debug 2023-10-09T09:30:53.908+ 7f4899286700 4 rocksdb: > >>> [db_impl/db_impl_compaction_flush.cc:1443] [default] Manual compaction > >>> starting > >>> debug 2023-10-09T09:30:53.908+ 7f4899286700 4 rocksdb: > >>> [db_impl
[ceph-users] Re: backfill_wait preventing deep scrubs
Thanks!

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Mykola Golub
Sent: Thursday, September 21, 2023 4:53 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] backfill_wait preventing deep scrubs

On Thu, Sep 21, 2023 at 1:57 PM Frank Schilder wrote:
>
> Hi all,
>
> I replaced a disk in our octopus cluster and it is rebuilding. I noticed that
> since the replacement there is no scrubbing going on. Apparently, an OSD
> having a PG in backfill_wait state seems to block deep scrubbing all other
> PGs on that OSD as well - at least this is how it looks.
>
> Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs.
> A total of 144 PGs needed backfilling (were remapped after replacing the
> disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling.
> It will take a bit more than a week to complete.
>
> There is plenty of time and IOP/s available to deep-scrub PGs on the side,
> but since the backfill started there is zero scrubbing/deep scrubbing going
> on and "PGs not deep scrubbed in time" messages are piling up.
>
> Is there a way to allow (deep) scrub in this situation?

ceph config set osd osd_scrub_during_recovery true

--
Mykola Golub
[ceph-users] backfill_wait preventing deep scrubs
Hi all, I replaced a disk in our octopus cluster and it is rebuilding. I noticed that since the replacement there is no scrubbing going on. Apparently, an OSD having a PG in backfill_wait state seems to block deep scrubbing all other PGs on that OSD as well - at least this is how it looks. Some numbers: the pool in question has 8192 PGs with EC 8+3 and ca 850 OSDs. A total of 144 PGs needed backfilling (were remapped after replacing the disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling. It will take a bit more than a week to complete. There is plenty of time and IOP/s available to deep-scrub PGs on the side, but since the backfill started there is zero scrubbing/deep scrubbing going on and "PGs not deep scrubbed in time" messages are piling up. Is there a way to allow (deep) scrub in this situation? Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
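The "bit more than a week" estimate in the message above is simple proportional arithmetic; here is a small sketch using those numbers (it assumes the backfill rate stays roughly constant, which real clusters do not guarantee):

```python
# Rough backfill ETA from the numbers in the message above.
total_pgs = 144          # PGs remapped after the disk replacement
remaining = 115 + 3      # backfill_wait + backfilling after ~2 days
elapsed_days = 2

done = total_pgs - remaining       # 26 PGs finished in 2 days
rate = done / elapsed_days         # ~13 PGs/day
eta_days = remaining / rate        # ~9 days -> "a bit more than a week"

print(f"{rate:.0f} PGs/day, ~{eta_days:.1f} days remaining")
```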
[ceph-users] Re: libceph: mds1 IP+PORT wrong peer at address
Hi, in our case the problem was on the client side. When you write "logs from a host", do you mean an OSD host or a host where client connections come from? It's not clear from your problem description *who* is requesting the wrong peer.

The reason for this message is that something tries to talk to an older instance of a daemon. The number after the / is a nonce that is assigned after each restart to be able to distinguish different instances of the same service (say, OSD or MDS). Usually, these get updated on peering. If it's clients that are stuck, you probably need to reboot the client hosts.

Please specify what is trying to reach the outdated OSD instances. Then a relevant developer is more likely to look at it. Since it's not MDS-kclient interaction, it might be useful to open a new case.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: U S
Sent: Tuesday, September 19, 2023 5:35 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: libceph: mds1 IP+PORT wrong peer at address

Hi,

We are unable to resolve these issues and the OSD restarts have made the ceph cluster unusable. We are wondering about downgrading the ceph version from 18.2.0 to 17.2.6. Please let us know if this is supported and, if so, please point me to the procedure to do the same. Thanks.
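The nonce Frank describes can be pulled out of an entity address programmatically. A small sketch (the regex is my own approximation of the v1 address format; the sample address is a representative client address of that shape):

```python
import re

# Split a Ceph entity address of the form "v1:IP:PORT/NONCE" into its parts.
# The nonce after the slash changes on every daemon restart; that is how
# peers distinguish successive instances of the same OSD/MDS.
ADDR_RE = re.compile(r"^(v[12]):(?P<ip>[^:]+):(?P<port>\d+)/(?P<nonce>\d+)$")

def parse_addr(addr: str) -> dict:
    m = ADDR_RE.match(addr)
    if not m:
        raise ValueError(f"unrecognised address: {addr!r}")
    return {"ip": m["ip"], "port": int(m["port"]), "nonce": int(m["nonce"])}

print(parse_addr("v1:192.168.57.221:0/4262844211"))
# -> {'ip': '192.168.57.221', 'port': 0, 'nonce': 4262844211}
```

Comparing the nonce in a "wrong peer" log line against the one a daemon currently advertises is a quick way to confirm that someone is still talking to a stale instance.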
[ceph-users] Re: MDS daemons don't report any more
Hi Patrick,

I'm not sure that it's exactly the same issue. I observed that "ceph tell mds.xyz session ls" had all counters 0. On the Friday before, we had a power loss on a rack that took out a JBOD with a few meta-data disks, and I suspect that the reporting of zeroes started after this crash. No hard evidence though.

I uploaded all logs with a bit of explanation to tag 1c022c43-04a7-419d-bdb0-e33c97ef06b8. I don't have any more detail than that recorded. It was 3 other MDSes that restarted on the fail of rank 0, not 5 as I wrote before. The readme.txt contains pointers to find the info in the logs.

We didn't have any user complaints. Therefore, I'm reasonably confident that the file system was actually accessible the whole time (from Friday afternoon until Sunday night when I restarted everything). I hope you can find something useful.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Patrick Donnelly
Sent: Monday, September 11, 2023 7:51 PM
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS daemons don't report any more

Hello Frank,

On Mon, Sep 11, 2023 at 8:39 AM Frank Schilder wrote:
>
> Update: I did a systematic fail of all MDSes that didn't report, starting
> with the stand-by daemons and continuing from high to low ranks. One by one
> they started showing up again with version and stats and the fail went as
> usual with one exception: rank 0.

It might be https://tracker.ceph.com/issues/24403

> The moment I failed rank 0 it took 5 other MDSes down with it.

Can you be more precise about what happened? Can you share logs?

> This is, in fact, the second time I have seen such an event, failing an MDS
> crashes others. Given the weird observation in my previous e-mail together
> with what I saw when restarting everything, does this indicate a problem with
> data integrity or is this an annoying yet harmless bug?

It sounds like an annoyance but certainly one we'd like to track down.
Keep in mind that the "fix" for https://tracker.ceph.com/issues/24403 is not going to Octopus. You need to upgrade and Pacific will soon be EOL. > Thanks for any help! > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Frank Schilder > Sent: Sunday, September 10, 2023 12:39 AM > To: ceph-users@ceph.io > Subject: [ceph-users] MDS daemons don't report any more > > Hi all, I make a weird observation. 8 out of 12 MDS daemons seem not to > report to the cluster any more: > > # ceph fs status > con-fs2 - 1625 clients > === > RANK STATE MDS ACTIVITY DNSINOS > 0active ceph-16 Reqs:0 /s 0 0 > 1active ceph-09 Reqs: 128 /s 4251k 4250k > 2active ceph-17 Reqs:0 /s 0 0 > 3active ceph-15 Reqs:0 /s 0 0 > 4active ceph-24 Reqs: 269 /s 3567k 3567k > 5active ceph-11 Reqs:0 /s 0 0 > 6active ceph-14 Reqs:0 /s 0 0 > 7active ceph-23 Reqs:0 /s 0 0 > POOL TYPE USED AVAIL >con-fs2-meta1 metadata 2169G 7081G >con-fs2-meta2 data 0 7081G > con-fs2-data data1248T 4441T > con-fs2-data-ec-ssddata 705G 22.1T >con-fs2-data2 data3172T 4037T > STANDBY MDS > ceph-08 > ceph-10 > ceph-12 > ceph-13 > VERSION >DAEMONS > None > ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12 > ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus > (stable)ceph-09, ceph-24, ceph-08, ceph-13 > > Version is "none" for these and there are no stats. Ceph versions reports > only 4 MDSes out of the 12. 8 are not shown at all: > > [root@gnosis ~]# ceph versions > { > "mon": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 5 > }, > "mgr": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 5 > }, > "osd": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 1282 > }, > "mds": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) > octopus (stable)": 4 > }, > "overall": { > "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632a
[ceph-users] Re: MDS daemons don't report any more
Update: I did a systematic fail of all MDSes that didn't report, starting with the stand-by daemons and continuing from high to low ranks. One by one they started showing up again with version and stats and the fail went as usual with one exception: rank 0. The moment I failed rank 0 it took 5 other MDSes down with it. This is, in fact, the second time I have seen such an event, failing an MDS crashes others. Given the weird observation in my previous e-mail together with what I saw when restarting everything, does this indicate a problem with data integrity or is this an annoying yet harmless bug? Thanks for any help! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Frank Schilder Sent: Sunday, September 10, 2023 12:39 AM To: ceph-users@ceph.io Subject: [ceph-users] MDS daemons don't report any more Hi all, I make a weird observation. 8 out of 12 MDS daemons seem not to report to the cluster any more: # ceph fs status con-fs2 - 1625 clients === RANK STATE MDS ACTIVITY DNSINOS 0active ceph-16 Reqs:0 /s 0 0 1active ceph-09 Reqs: 128 /s 4251k 4250k 2active ceph-17 Reqs:0 /s 0 0 3active ceph-15 Reqs:0 /s 0 0 4active ceph-24 Reqs: 269 /s 3567k 3567k 5active ceph-11 Reqs:0 /s 0 0 6active ceph-14 Reqs:0 /s 0 0 7active ceph-23 Reqs:0 /s 0 0 POOL TYPE USED AVAIL con-fs2-meta1 metadata 2169G 7081G con-fs2-meta2 data 0 7081G con-fs2-data data1248T 4441T con-fs2-data-ec-ssddata 705G 22.1T con-fs2-data2 data3172T 4037T STANDBY MDS ceph-08 ceph-10 ceph-12 ceph-13 VERSION DAEMONS None ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12 ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)ceph-09, ceph-24, ceph-08, ceph-13 Version is "none" for these and there are no stats. Ceph versions reports only 4 MDSes out of the 12. 
8 are not shown at all: [root@gnosis ~]# ceph versions { "mon": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5 }, "mgr": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5 }, "osd": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282 }, "mds": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4 }, "overall": { "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296 } } Ceph status reports everything as up and OK: [root@gnosis ~]# ceph status cluster: id: e4ece518-f2cb-4708-b00f-b6bf511e91d9 health: HEALTH_OK services: mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w) mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26 mds: con-fs2:8 4 up:standby 8 up:active osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs data: pools: 14 pools, 25065 pgs objects: 2.14G objects, 3.7 PiB usage: 4.7 PiB used, 8.4 PiB / 13 PiB avail pgs: 79908208/18438361040 objects misplaced (0.433%) 23063 active+clean 1225 active+clean+snaptrim_wait 317 active+remapped+backfill_wait 250 active+remapped+backfilling 208 active+clean+snaptrim 2 active+clean+scrubbing+deep io: client: 596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr recovery: 8.7 GiB/s, 3.41k objects/s My first thought is that the status module failed. However, I don't manage to restart it (always on). An MGR fail-over did not help. Any ideas what is going on here? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MGR executes config rm all the time
Thanks for getting back. We are on octopus, looks like I can't do much about it. I hope it is just spam and harmless otherwise. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Mykola Golub Sent: Sunday, September 10, 2023 3:04 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] MGR executes config rm all the time Hi Frank. It should be fixed in recent versions. See [1] and the related backport tickets. [1] https://tracker.ceph.com/issues/57726 On Sun, Sep 10, 2023 at 11:16 AM Frank Schilder wrote: > > Hi all, > > I recently started observing that our MGR seems to execute the same "config > rm" commands over and over again, in the audit log: > > 9/10/23 10:03:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' > cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: > dispatch > > 9/10/23 10:03:24 AM[INF]from='mgr.252911336 192.168.32.68:0/63' > entity='mgr.ceph-25' cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: > dispatch > > 9/10/23 10:03:19 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' > cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: > dispatch > > 9/10/23 10:03:19 AM[INF]from='mgr.252911336 192.168.32.68:0/63' > entity='mgr.ceph-25' cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: > dispatch > > 9/10/23 10:02:24 AM[INF]from='mgr.252911336 ' entity='mgr.ceph-25' > cmd=[{"prefix":"config > rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: > dispatch > > We don't have mirroring, ts a single cluster. What is going on here and how > can I stop that? I already restarted all MGR daemons to no avail. 
> > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io -- Mykola Golub ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] MGR executes config rm all the time
Hi all,

I recently started observing that our MGR seems to execute the same "config rm" commands over and over again, in the audit log:

9/10/23 10:03:24 AM [INF] from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:24 AM [INF] from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch
9/10/23 10:03:19 AM [INF] from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:03:19 AM [INF] from='mgr.252911336 192.168.32.68:0/63' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/trash_purge_schedule"}]: dispatch
9/10/23 10:02:24 AM [INF] from='mgr.252911336 ' entity='mgr.ceph-25' cmd=[{"prefix":"config rm","who":"mgr","name":"mgr/rbd_support/ceph-25/mirror_snapshot_schedule"}]: dispatch

We don't have mirroring, it's a single cluster. What is going on here and how can I stop that? I already restarted all MGR daemons to no avail.

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
[ceph-users] MDS daemons don't report any more
Hi all,

I make a weird observation. 8 out of 12 MDS daemons seem not to report to the cluster any more:

# ceph fs status
con-fs2 - 1625 clients
===
RANK  STATE   MDS      ACTIVITY      DNS    INOS
 0    active  ceph-16  Reqs:   0 /s     0      0
 1    active  ceph-09  Reqs: 128 /s  4251k  4250k
 2    active  ceph-17  Reqs:   0 /s     0      0
 3    active  ceph-15  Reqs:   0 /s     0      0
 4    active  ceph-24  Reqs: 269 /s  3567k  3567k
 5    active  ceph-11  Reqs:   0 /s     0      0
 6    active  ceph-14  Reqs:   0 /s     0      0
 7    active  ceph-23  Reqs:   0 /s     0      0
        POOL           TYPE      USED  AVAIL
   con-fs2-meta1       metadata  2169G  7081G
   con-fs2-meta2       data          0  7081G
    con-fs2-data       data      1248T  4441T
con-fs2-data-ec-ssd    data       705G  22.1T
   con-fs2-data2       data      3172T  4037T
STANDBY MDS
  ceph-08
  ceph-10
  ceph-12
  ceph-13
VERSION                                                                           DAEMONS
None                                                                              ceph-16, ceph-17, ceph-15, ceph-11, ceph-14, ceph-23, ceph-10, ceph-12
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)  ceph-09, ceph-24, ceph-08, ceph-13

Version is "none" for these and there are no stats. Ceph versions reports only 4 MDSes out of the 12. 8 are not shown at all:

[root@gnosis ~]# ceph versions
{
    "mon": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "mgr": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 5
    },
    "osd": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1282
    },
    "mds": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 4
    },
    "overall": {
        "ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)": 1296
    }
}

Ceph status reports everything as up and OK:

[root@gnosis ~]# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 2w)
    mgr: ceph-03(active, since 61s), standbys: ceph-25, ceph-01, ceph-02, ceph-26
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1284 osds: 1282 up (since 31h), 1282 in (since 33h); 567 remapped pgs

  data:
    pools:   14 pools, 25065 pgs
    objects: 2.14G
objects, 3.7 PiB
    usage:   4.7 PiB used, 8.4 PiB / 13 PiB avail
    pgs:     79908208/18438361040 objects misplaced (0.433%)
             23063 active+clean
             1225  active+clean+snaptrim_wait
             317   active+remapped+backfill_wait
             250   active+remapped+backfilling
             208   active+clean+snaptrim
             2     active+clean+scrubbing+deep

  io:
    client:   596 MiB/s rd, 717 MiB/s wr, 4.16k op/s rd, 3.04k op/s wr
    recovery: 8.7 GiB/s, 3.41k objects/s

My first thought is that the status module failed. However, I don't manage to restart it (it's always-on). An MGR fail-over did not help. Any ideas what is going on here?

Thanks and best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
[ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?
Hi,

I did something like that in the past. If you have a sufficient amount of cold data in general and you can bring the OSDs back with their original IDs, recovery was significantly faster than rebalancing. It really depends on how trivial the version update per object is. In my case it could re-use thousands of clean objects per dirty object. If you are unsure, it's probably best to do a wipe + rebalance.

What can take quite a while at the beginning is the osdmap update if they were down for such a long time. The first boot until they show up as "in" will take a while. Set norecover and norebalance until you see in the OSD log that they have the latest OSD map version.

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Malte Stroem
Sent: Wednesday, September 6, 2023 4:16 PM
To: ceph-m...@rikdvk.mailer.me; ceph-users@ceph.io
Subject: [ceph-users] Re: Is it possible (or meaningful) to revive old OSDs?

Hi ceph-m...@rikdvk.mailer.me,

you could squeeze the OSDs back in, but it does not make sense. Just clean the disks with dd, for example, and add them as new disks to your cluster.

Best,
Malte

Am 04.09.23 um 09:39 schrieb ceph-m...@rikdvk.mailer.me:
> Hello,
>
> I have a ten node cluster with about 150 OSDs. One node went down a while
> back, several months. The OSDs on the node have been marked as down and out
> since.
>
> I am now in the position to return the node to the cluster, with all the OS
> and OSD disks. When I boot up the now working node, the OSDs do not start.
>
> Essentially, it seems to complain with "fail[ing] to load OSD map for
> [various epoch]s, got 0 bytes".
>
> I'm guessing the OSDs' on-disk maps are so old, they can't get back into the
> cluster?
>
> My questions are whether it's possible or worth it to try to squeeze these
> OSDs back in or to just replace them. And if I should just replace them,
> what's the best way? Manually remove [1] and recreate? Replace [2]? Purge in
> dashboard?
> > [1] > https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#removing-osds-manual > [2] > https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/#replacing-an-osd > > Many thanks! > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: A couple OSDs not starting after host reboot
Hi Alison,

I have observed exactly that with OSDs "converted" from ceph-disk to ceph-volume. Someone thought it would be a great idea to store the /dev device name in the config instead of the uuid or any other stable device path:

# cat /etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json
{
    ...
    "block": {
        "path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
        "uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
    },
    ...
    "data": {
        "path": "/dev/sdm1",
        "uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
    },
    ...
}

Funnily enough, it has the by-uuid path stored as well, but the /dev path is actually used during activation. My "fix" is to re-generate the OSD JSON just before every ceph-disk OSD start.

You seem to be using LVM OSDs already, so this is a bit weird (it can't be the exact same issue). Still, I would not be surprised if you are bitten by something similar: some stored config (cache) overrides the actual drive location. It is really a blessing that the developers implemented a check that a partition actually points to the data with the correct OSD ID, otherwise our cluster would be rigged by now.

I would start by using low-level commands (ceph-volume) directly to see if the issue is low-level or sits in some higher-level interface. Log in to the OSD node and check what "ceph-volume inventory" says and whether you can manually activate/deactivate the OSD on disk (be careful to include the --no-systemd option everywhere to avoid unintended changes to persistent configurations).

Best regards,
=====
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: apeis...@fnal.gov
Sent: Friday, August 25, 2023 10:29 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: A couple OSDs not starting after host reboot

Hi,

Thank you for your reply. I don't think the device names changed, but ceph seems to be confused about which device the OSD is on. It's reporting that there are 2 OSDs on the same device although this is not true.
ceph device ls-by-host | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX  sdu  osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX  sdu  osd.657

The osd.665 is actually on device sdm. Could this be the cause of the issue? Is there a way to correct it?

Thanks,
Alison
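The kind of pre-start fix-up Frank describes (re-generating the activation JSON so it no longer depends on a volatile /dev/sdX name) could be sketched like this. This is not Frank's actual script: the function name is mine, the file layout follows the JSON excerpt he shows, and whether /dev/disk/by-partuuid is the right stable symlink for a given data partition depends on how the OSD was created.

```python
import json
from pathlib import Path

def pin_data_path(json_path: str) -> None:
    """Rewrite the volatile "data.path" (e.g. /dev/sdm1) in a ceph-disk
    activation JSON to the stable by-partuuid symlink derived from the
    stored uuid, so a renamed device no longer breaks activation."""
    p = Path(json_path)
    cfg = json.loads(p.read_text())
    uuid = cfg["data"]["uuid"]
    stable = f"/dev/disk/by-partuuid/{uuid}"  # assumption: uuid is a partuuid
    if cfg["data"]["path"] != stable:
        cfg["data"]["path"] = stable
        p.write_text(json.dumps(cfg, indent=4))

# Hypothetical invocation against the file from the excerpt above:
# pin_data_path("/etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json")
```

Running something like this from a unit that orders itself before the OSD start would make the activation immune to device-name reshuffles after reboot.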
[ceph-users] Re: Client failing to respond to capability release
Hi Dhairya, this is the thing: the client appeared to be responsive and worked fine (file system was on-line and responsive as usual). There was something off though; see my response to Eugen. Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Dhairya Parmar Sent: Wednesday, August 23, 2023 9:05 AM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Client failing to respond to capability release Hi Frank, This usually happens when the client is buggy/unresponsive. This warning is triggered when the client fails to respond to the MDS's request to release caps in time, which is determined by session_timeout (defaults to 60 secs). Did you make any config changes? Dhairya Parmar Associate Software Engineer, CephFS Red Hat Inc. <https://www.redhat.com/> dpar...@redhat.com On Tue, Aug 22, 2023 at 9:12 PM Frank Schilder <fr...@dtu.dk> wrote: Hi all, I have this warning the whole day already (octopus latest cluster): HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301 mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877 mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887 mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695 If I look at the session info from mds.1 for these clients I see this: # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req:
.request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695 {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0} {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0} {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0} {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0} We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment. Why and what exactly is the MDS complaining about here? Thanks and best regards. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
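The settings discussed in this thread can be inspected directly; these are standard Ceph commands (the MDS name below matches the thread, adjust for your cluster). They are cluster-admin commands, so they only run against a live cluster:

```shell
# How long the MDS waits before flagging a client (see Dhairya's reply):
ceph config get mds session_timeout
# Floor below which the MDS should not ask clients to drop caps:
ceph config get mds mds_min_caps_per_client
# Full per-client session dump (JSON), the input to the jq filter above:
ceph tell mds.1 session ls
```

Comparing `num_caps` from `session ls` against `mds_min_caps_per_client`, as done above, is a reasonable first sanity check when this warning appears.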
[ceph-users] Re: Client failing to respond to capability release
Hi Eugen, thanks for that :D This time it was something different. Possibly a bug in the kclient. On these nodes I found sync commands stuck in D-state. I guess a file/dir was not possible to sync or there was some kind of corruption of buffer data. We had to reboot the servers to clear that out. On first inspection these clients looked OK. Only some deeper debugging revealed that something was off. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Eugen Block Sent: Wednesday, August 23, 2023 8:55 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Client failing to respond to capability release Hi, pointing you to your own thread [1] ;-) [1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HFILR5NMUCEZH7TJSGSACPI4P23XTULI/ Zitat von Frank Schilder : > Hi all, > > I have this warning the whole day already (octopus latest cluster): > > HEALTH_WARN 4 clients failing to respond to capability release; 1 > pgs not deep-scrubbed in time > [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to > capability release > mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 145698301 > mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 189511877 > mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 189511887 > mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc > failing to respond to capability release client_id: 231250695 > > If I look at the session info from mds.1 for these clients I see this: > > # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: > .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, > caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep > -e 145698301 -e 189511877 -e 189511887 -e 231250695 > {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 > 
v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0} > {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 > v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0} > {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 > v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0} > {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 > v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0} > > We have mds_min_caps_per_client=4096, so it looks like the limit is > well satisfied. Also, the file system is pretty idle at the moment. > > Why and what exactly is the MDS complaining about here? > > Thanks and best regards. > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Client failing to respond to capability release
Hi all, I have this warning the whole day already (octopus latest cluster): HEALTH_WARN 4 clients failing to respond to capability release; 1 pgs not deep-scrubbed in time [WRN] MDS_CLIENT_LATE_RELEASE: 4 clients failing to respond to capability release mds.ceph-24(mds.1): Client sn352.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 145698301 mds.ceph-24(mds.1): Client sn463.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511877 mds.ceph-24(mds.1): Client sn350.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 189511887 mds.ceph-24(mds.1): Client sn403.hpc.ait.dtu.dk:con-fs2-hpc failing to respond to capability release client_id: 231250695 If I look at the session info from mds.1 for these clients I see this: # ceph tell mds.1 session ls | jq -c '[.[] | {id: .id, h: .client_metadata.hostname, addr: .inst, fs: .client_metadata.root, caps: .num_caps, req: .request_load_avg}]|sort_by(.caps)|.[]' | grep -e 145698301 -e 189511877 -e 189511887 -e 231250695 {"id":189511887,"h":"sn350.hpc.ait.dtu.dk","addr":"client.189511887 v1:192.168.57.221:0/4262844211","fs":"/hpc/groups","caps":2,"req":0} {"id":231250695,"h":"sn403.hpc.ait.dtu.dk","addr":"client.231250695 v1:192.168.58.18:0/1334540218","fs":"/hpc/groups","caps":3,"req":0} {"id":189511877,"h":"sn463.hpc.ait.dtu.dk","addr":"client.189511877 v1:192.168.58.78:0/3535879569","fs":"/hpc/groups","caps":4,"req":0} {"id":145698301,"h":"sn352.hpc.ait.dtu.dk","addr":"client.145698301 v1:192.168.57.223:0/2146607320","fs":"/hpc/groups","caps":7,"req":0} We have mds_min_caps_per_client=4096, so it looks like the limit is well satisfied. Also, the file system is pretty idle at the moment. Why and what exactly is the MDS complaining about here? Thanks and best regards. = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: snaptrim number of objects
Hi Angelo, was this cluster upgraded (major version upgrade) before these issues started? We observed similar problems after certain major-version upgrade paths, and the only way to fix them was to re-deploy all OSDs step by step. You can try a RocksDB compaction first. If that doesn't help, rebuilding the OSDs might be the only way out. You should also confirm that all ceph daemons are on the same version and that require-osd-release is reporting the same major version as well: ceph report | jq '.osdmap.require_osd_release' Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Angelo Hongens Sent: Saturday, August 19, 2023 9:58 AM To: Patrick Donnelly Cc: ceph-users@ceph.io Subject: [ceph-users] Re: snaptrim number of objects On 07/08/2023 18:04, Patrick Donnelly wrote: >> I'm trying to figure out what's happening to my backup cluster that >> often grinds to a halt when cephfs automatically removes snapshots. > > CephFS does not "automatically" remove snapshots. Do you mean the > snap_schedule mgr module? Yup. >> Almost all OSD's go to 100% CPU, ceph complains about slow ops, and >> CephFS stops doing client i/o. > > What health warnings do you see? You can try configuring snap trim: > > https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_snap_trim_sleep Mostly a looot of SLOW_OPS. And I guess as a result of that MDS_CLIENT_LATE_RELEASE, MDS_CLIENT_OLDEST_TID, MDS_SLOW_METADATA_IO, MDS_TRIM warnings. >> That won't explain why my cluster bogs down, but at least it gives >> some visibility. Running 17.2.6 everywhere by the way. > > Please let us know how configuring snaptrim helps or not. > When I set nosnaptrim, all I/O immediately restores. When I unset nosnaptrim, i/o stops again. One of the symptoms is that OSD's go to about 350% cpu per daemon. I got the feeling for a while that setting osd_snap_trim_sleep_ssd to 1 helped. I have 120 HDD osd's with wal/journal on ssd, does it even use this value?
Everything seemed stable, but eventually another few days passed, and suddenly removing a snapshot brought the cluster down again. So I guess that wasn't the cause. Now what I'm trying to do is set osd_max_trimming_pgs to 0 for all disks, and slowly setting it to 1 for a few osd's. This seems to work for a while, but still it brings the cluster down every now and then, and if not, the cluster is so slow it's almost unusable. This whole troubleshooting process is taking weeks. I just noticed that when 'the problem occurs', a lot of OSD's on a host (15 osd's per host) start using a lot of CPU, even though for example only 3 OSD's on this machine have their osd_max_trimming_pgs set to 1, the rest to 0. Disk doesn't seem to be the bottleneck. Restarting the daemons seems to solve the problem for a while, although the high cpu usage pops up on a different osd node every time. I am at a loss here. I'm almost thinking it's some kind of bug in the osd daemons, but I have no idea how to troubleshoot this. Angelo. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
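The checks and the compaction suggested at the top of this thread map to the following commands. These are standard Ceph tools, but the OSD id and data path are placeholders; they require a live cluster (or, for the offline variant, a stopped OSD), so they are shown as an admin sketch rather than a runnable script:

```shell
# All daemons on the same release?
ceph versions
# Does require-osd-release match the installed major version?
ceph report | jq '.osdmap.require_osd_release'
# Online RocksDB compaction, one OSD at a time to limit impact:
ceph tell osd.0 compact
# Offline alternative while the OSD daemon is stopped (path is an example):
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-0 compact
```

Compacting one OSD first and watching latency/CPU before rolling through the rest keeps the blast radius small.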
[ceph-users] Re: MDS stuck in rejoin
Dear Xiubo, the nearfull pool is an RBD pool and has nothing to do with the file system. All pools for the file system have plenty of capacity. I think we have an idea what kind of workload caused the issue. We had a user run a computation that reads the same file over and over again. He started 100 such jobs in parallel and our storage servers were at 400% load. I saw 167K read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. 11K. Clearly, most of this was served from RAM. It is possible that this extreme load situation triggered a race that remained undetected/unreported. There is literally no related message in any logs near the time the warning started popping up. It shows up out of nowhere. We asked the user to change his workflow to use local RAM disk for the input files. I don't think we can reproduce the problem anytime soon. About the bug fixes, I'm eagerly waiting for this and another one. Any idea when they might show up in distro kernels? Thanks and best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Tuesday, August 8, 2023 2:57 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin On 8/7/23 21:54, Frank Schilder wrote: > Dear Xiubo, > > I managed to collect some information. It looks like there is nothing in the > dmesg log around the time the client failed to advance its TID. I collected > short snippets around the critical time below. I have full logs in case you > are interested. Its large files, I will need to do an upload for that. > > I also have a dump of "mds session ls" output for clients that showed the > same issue later. Unfortunately, no consistent log information for a single > incident. 
> > Here the summary, please let me know if uploading the full package makes > sense: > > - Status: > > On July 29, 2023 > > ceph status/df/pool stats/health detail at 01:05:14: >cluster: > health: HEALTH_WARN > 1 pools nearfull > > ceph status/df/pool stats/health detail at 01:05:28: >cluster: > health: HEALTH_WARN > 1 clients failing to advance oldest client/flush tid > 1 pools nearfull Okay, then this could be the root cause. If the pool is nearfull, it could block flushing the journal logs to the pool, and then the MDS couldn't safely reply to the requests, blocking them like this. Could you fix the pool nearfull issue first and then check whether you see it again? > [...] > > On July 31, 2023 > > ceph status/df/pool stats/health detail at 10:36:16: >cluster: > health: HEALTH_WARN > 1 clients failing to advance oldest client/flush tid > 1 pools nearfull > >cluster: > health: HEALTH_WARN > 1 pools nearfull > > - client evict command (date, time, command): > > 2023-07-31 10:36 ceph tell mds.ceph-11 client evict id=145678457 > > We have a 1h time difference between the date stamp of the command and the > dmesg date stamps. However, there seems to be a weird 10min delay from > issuing the evict command until it shows up in dmesg on the client.
> > - dmesg: > > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey > [Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3 > [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed > (con state OPEN) > [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed > (con state OPEN) > [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed > (con state OPEN) > [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start > [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success > [Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22 This is a known bug and we have fixed it i
[ceph-users] Re: MDS stuck in rejoin
1 09:46:41 2023] ceph: mds1 closed our session [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect start [Mon Jul 31 09:46:41 2023] libceph: mds0 192.168.32.81:6801 connection reset [Mon Jul 31 09:46:41 2023] libceph: reset on mds0 [Mon Jul 31 09:46:41 2023] ceph: mds0 closed our session [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect start [Mon Jul 31 09:46:41 2023] libceph: mds5 192.168.32.78:6801 connection reset [Mon Jul 31 09:46:41 2023] libceph: reset on mds5 [Mon Jul 31 09:46:41 2023] ceph: mds5 closed our session [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect start [Mon Jul 31 09:46:41 2023] ceph: mds2 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds1 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds0 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds5 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds3 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds7 reconnect denied [Mon Jul 31 09:46:41 2023] ceph: mds4 reconnect denied Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, July 31, 2023 12:14 PM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin On 7/31/23 16:50, Frank Schilder wrote: > Hi Xiubo, > > its a kernel client. I actually made a mistake when trying to evict the > client and my command didn't do anything. I did another evict and this time > the client IP showed up in the blacklist. Furthermore, the warning > disappeared. I asked for the dmesg logs from the client node. Yeah, after the client's sessions are closed the corresponding warning should be cleared. Thanks > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > > From: Xiubo Li > Sent: Monday, July 31, 2023 4:12 AM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: MDS stuck in rejoin > > Hi Frank, > > On 7/30/23 16:52, Frank Schilder wrote: >> Hi Xiubo, >> >> it happened again. 
This time, we might be able to pull logs from the client >> node. Please take a look at my intermediate action below - thanks! >> >> I am in a bit of a calamity, I'm on holidays with terrible network >> connection and can't do much. My first priority is securing the cluster to >> avoid damage caused by this issue. I did an MDS evict by client ID on the >> MDS reporting the warning with the client ID reported in the warning. For >> some reason the client got blocked on 2 MDSes after this command, one of >> these is an ordinary stand-by daemon. Not sure if this is expected. >> >> Main question: is this sufficient to prevent any damaging IO on the cluster? >> I'm thinking here about the MDS eating through all its RAM until it crashes >> hard in an irrecoverable state (that was described as a consequence in an >> old post about this warning). If this is a safe state, I can keep it in this >> state until I return from holidays. > Yeah, I think so. > > BTW, are u using the kclients or user space clients ? I checked both > kclient and libcephfs, it seems buggy in libcephfs, which could cause > this issue. But for kclient it's okay till now. > > Thanks > > - Xiubo > >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> >> From: Xiubo Li >> Sent: Friday, July 28, 2023 11:37 AM >> To: Frank Schilder; ceph-users@ceph.io >> Subject: Re: [ceph-users] Re: MDS stuck in rejoin >> >> >> On 7/26/23 22:13, Frank Schilder wrote: >>> Hi Xiubo. >>> >>>> ... I am more interested in the kclient side logs. Just want to >>>> know why that oldest request got stuck so long. >>> I'm afraid I'm a bad admin in this case. I don't have logs from the host >>> any more, I would have needed the output of dmesg and this is gone. In case >>> it happens again I will try to pull the info out. >>> >>> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent >>> than our situation. 
We had no problems with the MDSes, the cache didn't >>> grow and the relevant one was also not put into read-only mode. It was just >>> this warning showing all the time, health was OK otherwise. I think the >>> warning was there for at least 16h before I failed the MDS. >>> >>> The MDS log contains nothing, this is the only line mentioning this client: >>> >>> 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] >>> : client.145678382 does not advance its oldest_clie
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, its a kernel client. I actually made a mistake when trying to evict the client and my command didn't do anything. I did another evict and this time the client IP showed up in the blacklist. Furthermore, the warning disappeared. I asked for the dmesg logs from the client node. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Monday, July 31, 2023 4:12 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin Hi Frank, On 7/30/23 16:52, Frank Schilder wrote: > Hi Xiubo, > > it happened again. This time, we might be able to pull logs from the client > node. Please take a look at my intermediate action below - thanks! > > I am in a bit of a calamity, I'm on holidays with terrible network connection > and can't do much. My first priority is securing the cluster to avoid damage > caused by this issue. I did an MDS evict by client ID on the MDS reporting > the warning with the client ID reported in the warning. For some reason the > client got blocked on 2 MDSes after this command, one of these is an ordinary > stand-by daemon. Not sure if this is expected. > > Main question: is this sufficient to prevent any damaging IO on the cluster? > I'm thinking here about the MDS eating through all its RAM until it crashes > hard in an irrecoverable state (that was described as a consequence in an old > post about this warning). If this is a safe state, I can keep it in this > state until I return from holidays. Yeah, I think so. BTW, are u using the kclients or user space clients ? I checked both kclient and libcephfs, it seems buggy in libcephfs, which could cause this issue. But for kclient it's okay till now. 
Thanks - Xiubo > > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ____ > From: Xiubo Li > Sent: Friday, July 28, 2023 11:37 AM > To: Frank Schilder; ceph-users@ceph.io > Subject: Re: [ceph-users] Re: MDS stuck in rejoin > > > On 7/26/23 22:13, Frank Schilder wrote: >> Hi Xiubo. >> >>> ... I am more interested in the kclient side logs. Just want to >>> know why that oldest request got stuck so long. >> I'm afraid I'm a bad admin in this case. I don't have logs from the host any >> more, I would have needed the output of dmesg and this is gone. In case it >> happens again I will try to pull the info out. >> >> The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent >> than our situation. We had no problems with the MDSes, the cache didn't grow >> and the relevant one was also not put into read-only mode. It was just this >> warning showing all the time, health was OK otherwise. I think the warning >> was there for at least 16h before I failed the MDS. >> >> The MDS log contains nothing, this is the only line mentioning this client: >> >> 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] >> : client.145678382 does not advance its oldest_client_tid (16121616), 10 >> completed requests recorded in session > Okay, if so it's hard to say and dig out what has happened in client why > it didn't advance the tid. > > Thanks > > - Xiubo > > >> Best regards, >> = >> Frank Schilder >> AIT Risø Campus >> Bygning 109, rum S14 >> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
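The evict-and-blacklist sequence described in this thread corresponds to the following commands. The flag names are standard for octopus (newer releases say "blocklist" instead of "blacklist"); the MDS name and client id are taken from the thread, and the address in the last line is a placeholder. These only run against a live cluster:

```shell
# Evict the client id from the warning on the MDS that reports it:
ceph tell mds.ceph-11 client evict id=145678457
# Confirm the client's address actually landed on the blacklist
# (the thread notes a first attempt that silently did nothing):
ceph osd blacklist ls
# After the client host has been rebooted/remounted, remove the entry:
ceph osd blacklist rm <addr:port/nonce>
```

Checking `blacklist ls` after the evict is the step that caught the failed first attempt here.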
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, it happened again. This time, we might be able to pull logs from the client node. Please take a look at my intermediate action below - thanks! I am in a bit of a calamity, I'm on holidays with terrible network connection and can't do much. My first priority is securing the cluster to avoid damage caused by this issue. I did an MDS evict by client ID on the MDS reporting the warning with the client ID reported in the warning. For some reason the client got blocked on 2 MDSes after this command, one of these is an ordinary stand-by daemon. Not sure if this is expected. Main question: is this sufficient to prevent any damaging IO on the cluster? I'm thinking here about the MDS eating through all its RAM until it crashes hard in an irrecoverable state (that was described as a consequence in an old post about this warning). If this is a safe state, I can keep it in this state until I return from holidays. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Friday, July 28, 2023 11:37 AM To: Frank Schilder; ceph-users@ceph.io Subject: Re: [ceph-users] Re: MDS stuck in rejoin On 7/26/23 22:13, Frank Schilder wrote: > Hi Xiubo. > >> ... I am more interested in the kclient side logs. Just want to >> know why that oldest request got stuck so long. > I'm afraid I'm a bad admin in this case. I don't have logs from the host any > more, I would have needed the output of dmesg and this is gone. In case it > happens again I will try to pull the info out. > > The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent > than our situation. We had no problems with the MDSes, the cache didn't grow > and the relevant one was also not put into read-only mode. It was just this > warning showing all the time, health was OK otherwise. I think the warning > was there for at least 16h before I failed the MDS. 
> > The MDS log contains nothing, this is the only line mentioning this client: > > 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : > client.145678382 does not advance its oldest_client_tid (16121616), 10 > completed requests recorded in session Okay, if so it's hard to say and dig out what has happened in client why it didn't advance the tid. Thanks - Xiubo > Best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo. > ... I am more interested in the kclient side logs. Just want to > know why that oldest request got stuck so long. I'm afraid I'm a bad admin in this case. I don't have logs from the host any more, I would have needed the output of dmesg and this is gone. In case it happens again I will try to pull the info out. The tracker https://tracker.ceph.com/issues/22885 sounds a lot more violent than our situation. We had no problems with the MDSes, the cache didn't grow and the relevant one was also not put into read-only mode. It was just this warning showing all the time, health was OK otherwise. I think the warning was there for at least 16h before I failed the MDS. The MDS log contains nothing, this is the only line mentioning this client: 2023-07-20T00:22:05.518+0200 7fe13df59700 0 log_channel(cluster) log [WRN] : client.145678382 does not advance its oldest_client_tid (16121616), 10 completed requests recorded in session Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS stuck in rejoin
Hi Xiubo, I seem to have gotten your e-mail twice. It's a very old kclient. It was in that state when I came to work in the morning and I looked at it in the afternoon. Was hoping the problem would clear by itself. It was probably a compute job that crashed it, it's a compute node in our HPC cluster. I didn't look at dmesg, but I can take a look at the MDS log when I'm back at work. I guess you are interested in the time before the warning started appearing. The MDS was probably stuck 5-10 minutes. This is very long on our cluster. An MDS failover usually takes 30s only. Also I couldn't see any progress in the MDS log, that's why I decided to fail the rank a second time. During the time it was stuck in rejoin it didn't log any other messages than the MDS map update messages. I don't think it was doing anything at all. Some background: Before I started recovery I found an old post stating that the MDS_CLIENT_OLDEST_TID warning is quite serious, because the affected MDS will experience growing cache allocation and eventually fail in a practically irrecoverable state. I did not observe unusual RAM consumption and there were no MDS large cache messages either. Seems like our situation was of a more harmless nature. Still, the fail did not go entirely smoothly. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Xiubo Li Sent: Friday, July 21, 2023 1:32 PM To: ceph-users@ceph.io Subject: [ceph-users] Re: MDS stuck in rejoin On 7/20/23 22:09, Frank Schilder wrote: > Hi all, > > we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients > failing to advance oldest client/flush tid". I looked at the client and there > was nothing going on, so I rebooted it. After the client was back, the > message was still there. To clean this up I failed the MDS. Unfortunately, > the MDS that took over remained stuck in rejoin without doing anything. > All that happened in the log was: BTW, are you using the kclient or user space client ?
How long was the MDS stuck in rejoin state? This means that on the client side the oldest request has been stuck too long; maybe under heavy load too many requests were generated in a short time and the oldest request was stuck too long in the MDS. > [root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log > 2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start > 2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start > 2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896614 from mon.4 > 2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896615 from mon.4 > [...] > 2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896653 from mon.4 > 2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to > version 896654 from mon.4 Did you see any slow request logs in the mds log files? And any other suspect logs from dmesg if it's a kclient? > After some time I decided to give another fail a try and, this time, the > replacement daemon went to active state really fast. > > If I have a message like the above, what is the clean way of getting the > client clean again (version: 15.2.17 > (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))? I think your steps are correct. Thanks - Xiubo > Thanks and best regards, > = > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io
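The fail-and-watch procedure used in this thread boils down to a few standard commands. The rank and daemon names below are taken from the log excerpt (mds.2, daemon ceph-10); adjust for your cluster. Cluster-admin commands only, not runnable standalone:

```shell
# Identify which rank is stuck in "rejoin":
ceph fs status
# Fail that rank so a standby takes over (rank 2 in the log above):
ceph mds fail 2
# Watch the replacement daemon's state until it reaches "active":
ceph tell mds.ceph-10 status
```

As the thread shows, a second `ceph mds fail` is sometimes needed when the first replacement also gets stuck in rejoin.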
[ceph-users] MDS stuck in rejoin
Hi all, we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was: [root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log 2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start 2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start 2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4 2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4 [...] 2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4 2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4 After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast. If I have a message like the above, what is the clean way of getting the client clean again (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))? Thanks and best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: OSD memory usage after cephadm adoption
Hi all, now that host masks seem to work, could somebody please shed some light on the relative priority of these settings: ceph config set osd memory_target X ceph config set osd/host:A memory_target Y ceph config set osd/class:B memory_target Z Which one wins for an OSD on host A in class B? Similar for an explicit ID. The expectation is that a setting for OSD.ID always wins. Then the masked values, then the generic osd setting, then the globals and last the defaults. The relative precedence of masked values is not defined anywhere, nor the precedence in general. This is missing even in the latest docs: https://docs.ceph.com/en/quincy/rados/configuration/ceph-conf/#sections-and-masks . Would be great if someone could add this. Thanks! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Luis Domingues Sent: Monday, July 17, 2023 9:36 AM To: Sridhar Seshasayee Cc: Mark Nelson; ceph-users@ceph.io Subject: [ceph-users] Re: OSD memory usage after cephadm adoption It looks indeed to be that bug that I hit. Thanks. Luis Domingues Proton AG --- Original Message --- On Monday, July 17th, 2023 at 07:45, Sridhar Seshasayee wrote: > Hello Luis, > > Please see my response below: > > But when I took a look on the memory usage of my OSDs, I was below of that > > > value, by quite a bit. Looking at the OSDs themselves, I have: > > > > "bluestore-pricache": { > > "target_bytes": 4294967296, > > "mapped_bytes": 1343455232, > > "unmapped_bytes": 16973824, > > "heap_bytes": 1360429056, > > "cache_bytes": 2845415832 > > }, > > > > And if I get the running config: > > "osd_memory_target": "4294967296", > > "osd_memory_target_autotune": "true", > > "osd_memory_target_cgroup_limit_ratio": "0.80", > > > > Which is not the value I observe from the config. I have 4294967296 > > instead of something around 7219293672. Did I miss something? > > This is very likely due to https://tracker.ceph.com/issues/48750.
> The fix was recently merged into the main branch and should be backported
> soon all the way to pacific.
>
> Until then, the workaround would be to set the osd_memory_target on each
> OSD individually to the desired value.
>
> -Sridhar
> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
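The precedence expectation from the question above (an OSD.ID setting always wins, then masked values, then the generic osd section, then globals, then defaults) can be written down as a small resolver. This is a sketch of the expected order, not documented behaviour; in particular the host-mask-before-class-mask tie-break is a pure assumption:

```python
def effective_value(settings, osd_id, host, dev_class):
    """Resolve which memory_target wins for one OSD.

    Precedence modelled here (the expectation from the mail, NOT a
    documented guarantee; the host-mask vs class-mask order is an
    assumption):
      osd.<id>  >  osd/host:X  >  osd/class:Y  >  osd  >  global
    """
    for key in (f"osd.{osd_id}", f"osd/host:{host}",
                f"osd/class:{dev_class}", "osd", "global"):
        if key in settings:
            return settings[key]
    return None

cfg = {"osd": 4_000_000_000,
       "osd/host:A": 6_000_000_000,
       "osd/class:B": 5_000_000_000}
# an OSD on host A in class B: the host mask wins under this model
print(effective_value(cfg, 7, "A", "B"))   # → 6000000000
```

An empirical check against a live cluster would be `ceph config get osd.<id> osd_memory_target`, which reports the value the daemon actually resolves.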
[ceph-users] Re: Cluster down after network outage
Hi all, trying to answer Dan and Stefan in one go. The OSDs were not booting (understood as the ceph-osd daemon starting up), they were running all the time. If they actually boot, they come up quite fast even though they are HDD and collocated. The time to be marked up is usually 30-60s after start, completely fine with me. We are not affected by pglog-dup accumulation or other issues causing excessive boot time and memory consumption. On the affected part of the cluster we have 972 OSDs running on 900 disks, 48 SSDs with 1 or 4 OSDs depending on size and 852 HDDs with 1 OSD. All OSDs were up and running, but network to their hosts was down for a few hours. After network came up, a peering-recover-remap frenzy broke loose and we observed the death spiral described in this old post: https://narkive.com/KAzvjjPc.4 Specifically, just setting nodown didn't help, I guess our cluster is simply too large. I really had to stop recovery as well (norebalance had no effect). I had to do a bit of guessing when all OSDs were both marked up and seen up at the same time (not trivial when the nodown flag is present), but after all OSDs were seen up for a while and the initial peering was over, enabling recovery worked as expected and didn't lead to further OSDs being seen as down again. In this particular situation, when the flag nodown is set, it would be extremely helpful to have an extra piece of info in the OSD part of ceph status showing the number "seen up", so that it becomes easier to be sure that an OSD that was marked up some time ago is not "seen down" again without being marked down due to the flag. This was really critical for getting the cluster back. During all this, no parts of the hardware were saturated. However, the way ceph handles such a simultaneous OSD mass-return-to-life event seems not very clever.
As soon as some PGs manage to peer into an active state, they start with recovery irrespective of whether or not they are remappings and the missing OSDs are actually there and waiting to be marked up. This in turn increases the load on the daemons unnecessarily. On top of this impatience-induced load amplification comes that the operations scheduler is not good at pushing high-priority messages like MON heartbeats through to busy daemons. As a consequence, one has a constant up-and-down marking of OSDs even though the OSDs are actually running fine and nothing crashed. This implies constant re-peering of all PGs and accumulation of additional pglog history for any successful recovery or remapped write. The better strategy in such a situation would actually be to wait for every single OSD to be marked up, start peering and only then look if something needs recovery. Trying to do this all at the same time is just a mess. In this particular case, setting osd_recovery_delay_start to a high value looks very promising to help dealing with such a situation in a bit more hands-free way - if this option does something in Octopus. I can see it available in the ceph config options on the command line but not in the docs (it's implemented but not documented). Thanks for bringing this to my attention! Looking at the time it took for all OSDs to come up and peering to complete, 10-30 minutes might do the job for our cluster. 10-30 minutes on an HDD pool after simultaneously marking 900 OSDs up is actually a good time. I assume that recovery for a PG will be postponed whenever this PG (or one of its OSDs) has a peering event? When looking at the sequence of events, I would actually also like an osd_up_peer_delay_start to postpone peering a bit in case several OSDs are starting up to avoid redundant re-peering.
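The "wait until every OSD is seen up and stays up" step can be scripted around `ceph osd dump --format json`. A minimal sketch, assuming the dump carries an `osds` list with numeric `up`/`in` flags as in the Octopus JSON output:

```python
import json
import subprocess
import time

def all_up(osd_dump):
    """True if every 'in' OSD is marked up in an 'osd dump' JSON blob."""
    return all(o["up"] == 1 for o in osd_dump["osds"] if o["in"] == 1)

def wait_until_stable(fetch, hold=300, poll=10):
    """Block until all OSDs have been continuously up for `hold` seconds;
    only then is it reasonably safe to unset norecover/norebalance."""
    up_since = None
    while True:
        if all_up(fetch()):
            up_since = up_since or time.time()
            if time.time() - up_since >= hold:
                return
        else:
            up_since = None   # an OSD dropped out, restart the clock
        time.sleep(poll)

def fetch_live():
    """Fetch the current OSD map from a live cluster."""
    return json.loads(subprocess.check_output(
        ["ceph", "osd", "dump", "--format", "json"]))

# Usage: wait_until_stable(fetch_live), then unset the recovery flags.
```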
The ideal recovery sequence after a mass-down of OSDs really seems to be to do each of these as atomic steps:
- get all OSDs up
- peer all PGs
- start recovery if needed

Being able to delay peering after an OSD comes up would give some space for the "I'm up" messages to arrive with priority at the MONs. Then the PGs have time to say "I'm active+clean" and only then will the trash be taken out. Thanks for replying to my messages and for pointing me to osd_recovery_delay_start. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Dan van der Ster Sent: Wednesday, July 12, 2023 6:58 PM To: Frank Schilder Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: Cluster down after network outage

On Wed, Jul 12, 2023 at 1:26 AM Frank Schilder <fr...@dtu.dk> wrote: Hi all, one problem solved, another coming up. For everyone ending up in the same situation, the trick seems to be to get all OSDs marked up and then allow recovery. Steps to take:
- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs
[ceph-users] Re: Cluster down after network outage
Answering myself for posterity. The rebalancing list disappeared after waiting even longer. Might just have been an MGR that needed to catch up. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Frank Schilder Sent: Wednesday, July 12, 2023 10:25 AM To: ceph-users@ceph.io Subject: [ceph-users] Re: Cluster down after network outage

Hi all, one problem solved, another coming up. For everyone ending up in the same situation, the trick seems to be to get all OSDs marked up and then allow recovery. Steps to take:
- set noout, nodown, norebalance, norecover
- wait patiently until all OSDs are shown as up
- unset norebalance, norecover
- wait wait wait, PGs will eventually become active as OSDs become responsive
- unset nodown, noout

Now the new problem. I now have an ever-growing list of OSDs listed as rebalancing, but nothing is actually rebalancing. How can I stop this growth and how can I get rid of this list:

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            noout flag(s) set
            Slow OSD heartbeats on back (longest 634775.858ms)
            Slow OSD heartbeats on front (longest 635210.412ms)
            1 pools nearfull
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 6m)
    mgr: ceph-25(active, since 57m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1258 up (since 24m), 1258 in (since 45m)
         flags noout
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.97G objects, 3.5 PiB
    usage:   4.4 PiB used, 8.7 PiB / 13 PiB avail
    pgs:     25028 active+clean
             30    active+clean+scrubbing+deep
             7     active+clean+scrubbing
  io:
    client: 1.3 GiB/s rd, 718 MiB/s wr, 7.71k op/s rd, 2.54k op/s wr
  progress:
    Rebalancing after osd.135 marked in (1s) [=...]
    Rebalancing after osd.69 marked in (2s) []
    Rebalancing after osd.75 marked in (2s) [===.]
    Rebalancing after osd.173 marked in (2s) []
    Rebalancing after osd.42 marked in (1s) [=...] (remaining: 2s)
    Rebalancing after osd.104 marked in (2s) []
    Rebalancing after osd.82 marked in (2s) []
    Rebalancing after osd.107 marked in (2s) [===.]
    Rebalancing after osd.19 marked in (2s) [===.]
    Rebalancing after osd.67 marked in (2s) [=...]
    Rebalancing after osd.46 marked in (2s) [===.] (remaining: 1s)
    Rebalancing after osd.123 marked in (2s) [===.]
    Rebalancing after osd.66 marked in (2s) []
    Rebalancing after osd.12 marked in (2s) [==..] (remaining: 2s)
    Rebalancing after osd.95 marked in (2s) [=...]
    Rebalancing after osd.134 marked in (2s) [===.]
    Rebalancing after osd.14 marked in (1s) [===.]
    Rebalancing after osd.56 marked in (2s) [=...]
    Rebalancing after osd.143 marked in (1s) []
    Rebalancing after osd.118 marked in (2s) [===.]
    Rebalancing after osd.96 marked in (2s) []
    Rebalancing after osd.105 marked in (2s) [===.]
    Rebalancing after osd.44 marked in (1s) [===.] (remaining: 5s)
    Rebalancing after osd.41 marked in (1s) [==..] (remaining: 1s)
    Rebalancing after osd.9 marked in (2s) [=...] (remaining: 37s)
    Rebalancing after osd.58 marked in (2s) [==..] (remaining: 8s)
    Rebalancing after osd.140 marked in (1s) [===.]
    Rebalancing after osd.132 marked in (2s) []
    Rebalancing after osd.31 marked in (1s) [=...]
    Rebalancing after osd.110 marked in (2s) []
    Rebalancing after osd.21 marked in (2s) [=...]
    Rebalancing after osd.114 marked in (2s) [===.]
    Rebalancing after osd.83 marked in (2s) [===.]
    Rebalancing after osd.23 marked in (1s) [===.]
    Rebalancing after osd.25 marked in (1s) [==..]
    Rebalancing after osd.147 marked in (2s) []
    Rebalancing after osd.62
) []
    Rebalancing after osd.112 marked in (1s) [=...]
    Rebalancing after osd.154 marked in (1s) [=...]
    Rebalancing after osd.64 marked in (2s) [===.]
    Rebalancing after osd.34 marked in (2s) [] (remaining: 90s)
    Rebalancing after osd.161 marked in (1s) []
    Rebalancing after osd.160 marked in (2s) [===.]
    Rebalancing after osd.142 marked in (2s) [===.]

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Frank Schilder Sent: Wednesday, July 12, 2023 9:53 AM To: ceph-users@ceph.io Subject: [ceph-users] Cluster down after network outage

Hi all, we had a network outage tonight (power loss) and restored network in the morning. All OSDs were running during this period. After restoring network peering hell broke loose and the cluster has a hard time coming back up again. OSDs get marked down all the time and come back later. Peering never stops. Below is the current status, I had all OSDs shown as up for a while, but many were not responsive. Are there some flags that help bringing things up in a sequence that causes less overload on the system?

[root@gnosis ~]# ceph status
  cluster:
    id:     XXX
    health: HEALTH_WARN
            2 clients failing to respond to capability release
            6 MDSs report slow metadata IOs
            3 MDSs report slow requests
            nodown,noout,nobackfill,norecover flag(s) set
            176 osds down
            Slow OSD heartbeats on back (longest 551718.679ms)
            Slow OSD heartbeats on front (longest 549598.330ms)
            Reduced data availability: 8069 pgs inactive, 3786 pgs down, 3161 pgs peering, 1341 pgs stale
            Degraded data redundancy: 1187354920/16402772667 objects degraded (7.239%), 6222 pgs degraded, 6231 pgs undersized
            1 pools nearfull
            17386 slow ops, oldest one blocked for 1811 sec, daemons [osd.1128,osd.1152,osd.1154,osd.12,osd.1227,osd.1244,osd.328,osd.354,osd.381,osd.4]... have slow ops.
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 28m)
    mgr: ceph-25(active, since 30m), standbys: ceph-26, ceph-01, ceph-02, ceph-03
    mds: con-fs2:8 4 up:standby 8 up:active
    osd: 1260 osds: 1082 up (since 6m), 1258 in (since 18m); 266 remapped pgs
         flags nodown,noout,nobackfill,norecover
  data:
    pools:   14 pools, 25065 pgs
    objects: 1.91G objects, 3.4 PiB
    usage:   3.1 PiB used, 6.0 PiB / 9.0 PiB avail
    pgs:     0.626% pgs unknown
             31.566% pgs not active
             1187354920/16402772667 objects degraded (7.239%)
             51/16402772667 objects misplaced (0.000%)
             11706 active+clean
             4752  active+undersized+degraded
             3286  down
             2702  peering
             799   undersized+degraded+peered
             464   stale+down
             418   stale+active+undersized+degraded
             214   remapped+peering
             157   unknown
             128   stale+peering
             117   stale+remapped+peering
             101   stale+undersized+degraded+peered
             57    stale+active+undersized+degraded+remapped+backfilling
             35    down+remapped
             26    stale+undersized+degraded+remapped+backfilling+peered
             23    undersized+degraded+remapped+backfilling+peered
             14    active+clean+scrubbing+deep
             9     stale+active+undersized+degraded+remapped+backfill_wait
             7     active+recovering+undersized+degraded
             7     stale+active+recovering+undersized+degraded
             6     active+undersized+degraded+remapped+backfilling
             6     active+undersized
             5     active+undersized+degraded+remapped+backfill_wait
             5     stale+remapped
             4     stale+activating+undersized+degraded
             3     active+undersized+remapped
             3     stale+undersized+degraded+remapped+backfill_wait+peered
             1     activating+undersized+degraded
             1     activating+undersized+degraded+remapped
             1     undersized+degraded+remapped+backfill_wait+peered
             1     stale+active+clean
             1     active+recovering
             1     stale+down+remapped
             1     undersized+peered
             1     active+undersized+degraded+remapped
             1     active+clean+scrubbing
             1     active+clean+remapped
             1     active+recovering+degraded
  io:
    client: 1.8 MiB/s rd, 18 MiB/s wr, 409 op/s rd, 796 op/s wr

Thanks for any hints! = Frank Schilder AIT Risø Campus Bygning 109, rum S14 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 1 pg inconsistent and does not recover
Hi Stefan, after you wrote that you issue hundreds of deep-scrub commands per day I was already suspecting something like > [...] deep_scrub daemon requests a deep-scrub [...] It's not a minion you hired who types these commands every so many seconds by hand and hopes for the best. My guess is rather that you are actually running a cluster specifically configured for manual scrub scheduling, as was discussed in a thread some time ago to solve the problem of "not deep scrubbed in time" messages due to the built-in scrub scheduler not using the last-scrubbed timestamp for priority (among other things). On such a system I would not be surprised that these commands have their desired effect. To know why it works for you it would be helpful to disclose the whole story, for example what ceph config parameters are active within the context of your daemon executing the deep-scrub instructions and what other ceph commands surround it, in the same way that the pg repair is surrounded by injectargs instructions in the script I posted. It doesn't work like that on a ceph cluster with default config. For example, on our cluster there is a very high likelihood that at least one OSD of any PG is part of a scrub at any time already. In that case, if a PG is not eligible for scrubbing because one of its OSDs already has osd_max_scrubs (default: 1) scrubs running, the reservation has no observable effect. Some time ago I had a ceph-users thread discussing exactly that: I wanted to increase the number of concurrent scrubs running without increasing osd_max_scrubs. The default scheduler seems to be very poor at ordering scrubs in such a way that a maximum number of PGs is scrubbed at any given time. One of the suggestions was to run manual scheduling, which seems to be exactly what you are doing. Best regards, ===== Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Stefan Kooman Sent: Wednesday, June 28, 2023 2:17 PM To: Frank Schilder; Alexander E.
Patrakov; Niklas Hambüchen Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: 1 pg inconsistent and does not recover On 6/28/23 10:45, Frank Schilder wrote: > Hi Stefan, > > we run Octopus. The deep-scrub request is (immediately) cancelled if the > PG/OSD is already part of another (deep-)scrub or if some peering happens. As > far as I understood, the commands osd/pg deep-scrub and pg repair do not > create persistent reservations. If you issue this command, when does the PG > actually start scrubbing? As soon as another one finishes or when it is its > natural turn? Do you monitor the scrub order to confirm it was the manual > command that initiated a scrub? We request a deep-scrub ... a few seconds later it starts deep-scrubbing. We do not verify in this process if the PG really did start, but they do. See example from a PG below: Jun 27 22:59:50 mon1 pg_scrub[2478540]: [27-06-2023 22:59:34] Scrub PG 5.48a (last deep-scrub: 2023-06-16T22:54:58.684038+0200) ^^ deep_scrub daemon requests a deep-scrub, based on latest deep-scrub timestamp. After a couple of minutes it's deep-scrubbed. See below the deep-scrub timestamp (info from a PG query of 5.48a): "last_deep_scrub_stamp": "2023-06-27T23:06:01.823894+0200" We have been using this in Octopus (actually since Luminous, but in a different way). Now we are on Pacific. Gr. Stefan ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
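For reference, the scheduling policy Stefan describes can be sketched in a few lines: order PGs by their last deep-scrub stamp and only dispatch `ceph pg deep-scrub` for PGs whose acting OSDs are all idle, which models osd_max_scrubs = 1. A toy version with PG records as (pgid, stamp, acting set) tuples; the real data would come from `ceph pg dump`:

```python
def schedule_deep_scrubs(pgs, busy_osds=frozenset()):
    """Return pg ids to deep-scrub now, oldest stamp first, never
    using an OSD twice (models osd_max_scrubs = 1 per OSD)."""
    busy = set(busy_osds)
    chosen = []
    for pgid, stamp, acting in sorted(pgs, key=lambda p: p[1]):
        if busy.isdisjoint(acting):
            chosen.append(pgid)
            busy.update(acting)   # these OSDs are now reserved
    return chosen

pgs = [("5.48a", "2023-06-16T22:54:58", (1, 2, 3)),
       ("5.100", "2023-06-10T08:00:00", (3, 4, 5)),
       ("5.200", "2023-06-12T12:00:00", (6, 7, 8))]
# 5.48a is skipped this round: osd.3 is already reserved by 5.100
print(schedule_deep_scrubs(pgs))   # → ['5.100', '5.200']
```

The daemon would then issue `ceph pg deep-scrub <pgid>` for each returned id and re-run the selection once scrubs complete.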
[ceph-users] Re: 1 pg inconsistent and does not recover
A repair always includes a deep-scrub, so you can use a repair in place of a deep-scrub if you want. Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14 From: Niklas Hambüchen Sent: Wednesday, June 28, 2023 3:05 PM To: Frank Schilder; Alexander E. Patrakov Cc: ceph-users@ceph.io Subject: Re: [ceph-users] Re: 1 pg inconsistent and does not recover Hi Frank, > The response to that is not to try manual repair but to issue a deep-scrub. I am a bit confused, because in your script you do issue "ceph pg repair", not a scrub. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io