[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-28 Thread Kai Stian Olstad

On 26.01.2024 23:09, Mark Nelson wrote:
For what it's worth, we saw this last week at Clyso on two separate 
customer clusters on 17.2.7 and also solved it by moving back to wpq.  
We've been traveling this week so haven't created an upstream tracker 
for it yet, but we're back to recommending wpq to our customers for all 
production cluster deployments until we figure out what's going on.


Thank you for confirming, switching to wpq solved my problem too,
and I have switched all production clusters to wpq.

I guess all my logs are gone by now, but I'll try to recreate the
situation in the test cluster.
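
For anyone wanting to do the same, a minimal sketch of the switch
(assuming the option is managed through the central config database;
osd_op_queue is only read at OSD start-up, so the daemons must be
restarted afterwards):

  ceph config set osd osd_op_queue wpq   # wpq instead of mclock_scheduler for all OSDs
  ceph config get osd osd_op_queue       # verify the stored value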



--
Kai Stian Olstad


[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-28 Thread Kai Stian Olstad

On 26.01.2024 22:08, Wesley Dillingham wrote:
I faced a similar issue. The PG just would never finish recovery.
Changing all OSDs in the PG to "osd_op_queue wpq" and then restarting
them serially ultimately allowed the PG to recover. Seemed to be some
issue with mclock.


Thank you Wes, switching to wpq and restarting the OSDs fixed it for
me too.
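
In case it is useful to anyone else, restarting them serially can be
scripted roughly like this. This is only a sketch: the OSD ids are
examples (here the acting set of the stuck PG), and "ceph orch daemon
restart" assumes a cephadm-managed cluster.

  for id in 223 274 243 290 286 283; do
      # wait until it is safe to take this OSD down
      until ceph osd ok-to-stop $id; do sleep 30; done
      ceph orch daemon restart osd.$id
      sleep 120   # crude pause to let the OSD rejoin before the next one
  done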



--
Kai Stian Olstad


[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-26 Thread Mark Nelson
For what it's worth, we saw this last week at Clyso on two separate 
customer clusters on 17.2.7 and also solved it by moving back to wpq.  
We've been traveling this week so haven't created an upstream tracker 
for it yet, but we're back to recommending wpq to our customers for all 
production cluster deployments until we figure out what's going on.
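
To double-check which queue a given OSD is actually running with,
something like this should do (osd.0 is just an example id):

  ceph tell osd.0 config get osd_op_queue   # scheduler the running daemon uses
  ceph config get osd osd_op_queue          # value stored in the config database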



Mark


On 1/26/24 15:08, Wesley Dillingham wrote:

I faced a similar issue. The PG just would never finish recovery. Changing
all OSDs in the PG to "osd_op_queue wpq" and then restarting them serially
ultimately allowed the PG to recover. Seemed to be some issue with mclock.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jan 26, 2024 at 7:57 AM Kai Stian Olstad wrote:


Hi,

This is a cluster running 17.2.7, upgraded from 16.2.6 on 15 January
2024.

On Monday 22 January we had 4 HDDs, all on different servers, with I/O
errors because of some damaged sectors. The OSDs are hybrid, so the DB
is on SSD; 5 HDDs share 1 SSD.
I set the OSDs out, "ceph osd out 223 269 290 318", and all hell broke
loose.

It took only minutes before the users complained about Ceph not
working.
Ceph status reported slow ops on the OSDs that were set out, and "ceph
tell osd.<id> dump_ops_in_flight" against the out OSDs just hung; after
30 minutes I stopped the dump command.
Long story short, I ended up running "ceph osd set nobackfill" until
the slow ops were gone and then unsetting it when the slow ops message
disappeared.
I needed to do that all the time so the cluster didn't come to a halt,
so this one-liner loop was used:

while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" &&
(date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill);
sleep 10; done
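
The same loop written out with comments, purely for readability:

  while true; do
      # "oldest one blocked for NN" with 2+ digits = an op blocked for 10s or more
      if ceph -s | grep -qE "oldest one blocked for [0-9]{2,}"; then
          date
          ceph osd set nobackfill     # pause backfill so the slow ops can drain
          sleep 15
          ceph osd unset nobackfill
      fi
      sleep 10
  done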


But now, 4 days later, the backfilling has stopped progressing
completely and the number of misplaced objects is increasing.
Some PGs have 0 misplaced objects but still have the backfilling state,
and have been in this state for over 24 hours now.

I have a hunch that it's because PG 404.6e7 is in state
"active+recovering+degraded+remapped"; it's been in this state for over
48 hours.
It possibly has 2 missing objects, but since they are not unfound I
can't delete them with "ceph pg 404.6e7 mark_unfound_lost delete".
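
As a quick check, the missing/unfound counters for that PG can be read
with, for example (assuming jq is available):

  ceph pg 404.6e7 list_unfound | jq '{num_missing, num_unfound}'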

Could someone please help solve this?
Below is some output of ceph commands; I'll also attach them.


ceph status (only removed information about no running scrub and
deep_scrub)
---
cluster:
  id: b321e76e-da3a-11eb-b75c-4f948441dcd0
  health: HEALTH_WARN
  Degraded data redundancy: 2/6294904971 objects degraded
(0.000%), 1 pg degraded

services:
  mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d)
  mgr: ceph-mon-1.ptrsea(active, since 11d), standbys:
ceph-mon-2.mfdanx
  mds: 1/1 daemons up, 1 standby
  osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped
pgs
  rgw: 7 daemons active (7 hosts, 1 zones)

data:
  volumes: 1/1 healthy
  pools:   14 pools, 3945 pgs
  objects: 1.14G objects, 1.1 PiB
  usage:   1.8 PiB used, 1.2 PiB / 3.0 PiB avail
  pgs: 2/6294904971 objects degraded (0.000%)
       2980455/6294904971 objects misplaced (0.047%)
       3901 active+clean
       22   active+clean+scrubbing+deep
       17   active+remapped+backfilling
       4    active+clean+scrubbing
       1    active+recovering+degraded+remapped

io:
  client:   167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr


ceph health detail (only removed information about no running scrub and
deep_scrub)
---
HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded
(0.000%), 1 pg degraded
[WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects
degraded (0.000%), 1 pg degraded
  pg 404.6e7 is active+recovering+degraded+remapped, acting
[223,274,243,290,286,283]


ceph pg 202.6e7 list_unfound
---
{
  "num_missing": 2,
  "num_unfound": 0,
  "objects": [],
  "state": "Active",
  "available_might_have_unfound": true,
  "might_have_unfound": [],
  "more": false
}

ceph pg 404.6e7 query | jq .recovery_state
---
[
{
  "name": "Started/Primary/Active",
  "enter_time": "2024-01-26T09:08:41.918637+",
  "might_have_unfound": [
{
  "osd": "243(2)",
  "status": "already probed"
},
{
  "osd": "274(1)",
  "status": "already probed"
},
{
  "osd": "275(0)",
  "status": "already probed"
},
{
  "osd": "283(5)",
  "status": "already probed"
},
{
  "osd": "286(4)",
  "status": "already probed"
},
{
  "osd": "290(3)",
  "status": "already probed"
},
{
  "osd": "335(3)",
  "status": "already probed"
}
  ],
  "recovery_progress": {
    "backfill_targets": [
      "275(0)",
      "335(3)"
    ],
    "waiting_on_backfill": [],
    "last_backfill_started":

[ceph-users] Re: 17.2.7: Backfilling deadlock / stall / stuck / standstill

2024-01-26 Thread Wesley Dillingham
I faced a similar issue. The PG just would never finish recovery. Changing
all OSDs in the PG to "osd_op_queue wpq" and then restarting them serially
ultimately allowed the PG to recover. Seemed to be some issue with mclock.
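
To see which OSDs a PG maps to (i.e. which ones to switch and restart),
something like this works, using the PG id from the thread as an
example:

  ceph pg map 404.6e7   # prints the up and acting OSD sets for the PG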

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 

