Hi Frank,

Unfortunately, IMO it's not an easy task to identify the relevant differences between mimic and octopus in this respect. At the very least, the question would be which minor Ceph releases are/were in use.

I recall there were some tricks with setting/clearing bluefs_buffered_io somewhere in that period, but I can hardly recall anything else... Curious whether the OS, container, or other third-party software was upgraded along with Ceph as well? Just in case: I presume you were using containers with mimic too, right?
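
For reference, the current and effective values can be checked (and flipped) roughly like this - the osd.959 id is just an example from later in this thread, and the daemon command needs to run where the admin socket lives (e.g. inside the container):

  ceph config get osd bluefs_buffered_io               # value in the mon config store
  ceph daemon osd.959 config get bluefs_buffered_io    # effective value on a running OSD
  ceph config set osd bluefs_buffered_io false         # or true, to compare behaviour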


Instead, I'd rather approach the issue from a different angle:

1) Learn how to [easily] reproduce the issue, preferably in a test lab rather than in the field. Ideally, use exactly the same disk(s), hardware, OS, and software versions as in production.

2) (Can be done in prod as well.) Once an OSD is stuck, set debug-bluefs and debug-bdev to 20 and collect a verbose log to check what's happening there, e.g. along the lines of the commands below. Meanwhile, monitor disk activity: is there any load on the disk at all while in this state? Do disk reads (e.g. via dd) from outside the OSD container succeed at this point?
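
Something along these lines should do for the logging and the disk checks (osd.959 and /dev/sda are illustrative; direct I/O keeps the page cache out of the picture):

  ceph tell osd.959 injectargs '--debug_bluefs 20/20 --debug_bdev 20/20'
  # or persistently, via the mon config store:
  ceph config set osd.959 debug_bluefs 20/20
  ceph config set osd.959 debug_bdev 20/20

  iostat -x 1                                               # any load on the disk at all?
  dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct  # raw reads outside the container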


Thanks,

Igor

On 11/10/2022 5:23 PM, Frank Schilder wrote:
Hi all,

I have some kind of update on the matter of stuck OSDs. It seems not to be an 
LVM issue and it also seems not to be connected to the OSD size.

After moving all data from the tiny 100G OSDs to spare SSDs, I redeployed the 
400G disks with 1 OSD per disk and started to move data from the slow spare 
SSDs back to the fast ones. After moving about 100 out of 1024 PGs of the pool, 
the fast OSDs started failing again. It is much the same observation as before: 
I can't stop a container with a failed OSD. However, when I restart docker, 
everything comes up clean.

When I look at the logs, I see that the OSD aborts with a suicide timeout after 
many osd_op_tp thread timeouts. However, the OSD process, and with it the 
container, does not terminate because of a hanging thread. The syslog has the 
message (all messages below in chronological order):

kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 122 seconds

about 30 seconds before the OSD aborts with the suicide timeout:

ceph-osd: 2022-11-09T17:36:53.691+0100 7f663a23c700 -1 *** Caught signal 
(Aborted) **#012 in thread 7f663a23c700 thread_name:msgr-worker-2#012#012 ceph 
version 15.2.17 ...

What I see in the syslog is that the bstore_kv_sync thread does not seem to be 
terminated by the abort. These messages continue to show up:

kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 368 seconds.
kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 491 seconds.
kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 614 seconds.

On docker stop container, the launch script receives the TERM signal, but the 
OSD cannot be deactivated because of this thread:

journal: osd_lvm_start: deactivating OSD 959
journal: osd_lvm_start: unmounting /var/lib/ceph/osd/ceph-959
journal: umount: /var/lib/ceph/osd/ceph-959: target is busy.

It's probably busy because of the hanging bstore_kv_sync thread. As a consequence, 
the container is still running, has a ceph-osd process shown by docker top, 
and these messages continue to show up:

INFO: task bstore_kv_sync:1283156 blocked for more than 737 seconds.
INFO: task bstore_kv_sync:1283156 blocked for more than 860 seconds.
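
For anyone who wants to confirm what keeps the mount busy in such a state, something like this should show it (the <tid> placeholder comes from the ps output):

  fuser -vm /var/lib/ceph/osd/ceph-959           # which processes hold the mount point busy?
  ps -eLo tid,stat,wchan:32,comm | grep bstore   # thread states and kernel wait channels
  cat /proc/<tid>/stack                          # kernel stack of the stuck thread (as root)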

Although the bstore_kv_sync thread is unkillable, a restart of docker clears everything 
out and the OSD restarts fine. I'm somewhat hesitant to accept the simple "must be 
the firmware" statement, because these disks worked fine for 4 years with mimic. The 
only thing that changed was the Ceph version, from mimic to octopus; everything else 
stayed the same: OS version, kernel version, docker version, firmware version.

Since it happens only on this type of disk, it could very well have to do with 
firmware, but not without Ceph having made a serious change in low-level disk 
access between mimic and octopus. So I'm wondering what features of the 
firmware octopus is using that mimic was not. It would be great if somebody had 
some pointers to which part of the software stack I should look at; I would 
like to avoid hunting ghosts.

Many thanks and best regards!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <[email protected]>
Sent: 10 October 2022 23:33:32
To: Igor Fedotov; [email protected]
Subject: [ceph-users] Re: LVM osds loose connection to disk

Hi Igor.

The problem of OSD crashes was resolved after migrating just a little bit of the meta-data 
pool to other disks (we decided to evacuate the small OSDs onto larger disks to make space). 
Therefore, I don't think it's an LVM or disk issue. The cluster is working perfectly now 
after migrating some data away from the small OSDs. I rather believe it's tightly 
related to "OSD crashes during upgrade mimic->octopus"; it happens only on OSDs 
where the repair command errs out with an abort on enospc.
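
For context, the repair command in question is the ceph-bluestore-tool one, run against a stopped OSD (the id is illustrative):

  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-959 repair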

My hypothesis is now more along the lines of a deadlock occurring as a 
consequence of an aborted daemon thread. Is there any part of the bluestore 
code that acquires an exclusive device lock that gets passed through to the PV 
and can lead to a device freeze if not released? I'm wondering if something 
like this happened here as a consequence of the allocator failure. I saw a lot 
of lock-up warnings related to OSD threads in the syslog.

Regarding the 2-minute time difference and heartbeats: the OSD seems to have 
been responding to heartbeats all the time, even after the suicide timeout; 
see the description below. I executed docker stop container at 16:15:39. Until 
that moment, the OSD was considered up+in by the MONs.
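
For reference, the up/in state can be checked with, e.g.:

  ceph osd dump | grep '^osd.959 '   # up/down and in/out flags for a single OSD
  ceph osd tree down                 # lists only the OSDs currently marked down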

Here is a recollection of events from memory, together with a description of how 
the 4 OSDs on 1 disk are executed in a single container. I will send detailed logs 
and scripts via our private communication. If anyone else is interested as well, 
I'm happy to make them available.

Being in this situation over a weekend and at night, we didn't take precise 
minutes. Our priority was to get everything working again. I'm afraid this is as 
accurate as it gets.

Let's start with how the processes are started inside the container. We have a 
main script M executed as the entry-point to the container. For each OSD found 
on a drive, M forks off a copy Mn of itself, which in turn forks off the OSD 
process:

M -> M1 -> OSD1
M -> M2 -> OSD2
M -> M3 -> OSD3
M -> M4 -> OSD4

In the end, we have 5 instances of the main script and 4 instances of OSDs 
running. This somewhat cumbersome-looking startup is required to be able to 
forward signals sent by the docker daemon, most notably SIGTERM on docker stop 
container. In addition, all instances of M trap a number of signals, including 
SIGCHLD. If just one OSD dies, the entire container should stop and restart. On 
a disk failure, all OSDs on that disk go down and will be rebuilt in the background 
simultaneously.
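
A minimal sketch of this pattern (not our actual script - the real one also handles LVM activation and logging, and discovers the OSD ids itself):

  #!/bin/bash
  # M: one wrapper instance Mn per OSD id given on the command line.
  run_osd() {
      ceph-osd -i "$1" -f &            # OSDn, kept in the foreground
      trap "kill -TERM $!" TERM        # Mn forwards docker's TERM to its OSD
      wait                             # returns when OSDn exits (SIGCHLD)
  }
  for id in "$@"; do
      run_osd "$id" &                  # fork off Mn
  done
  trap 'kill -TERM $(jobs -p)' TERM    # M forwards TERM to all Mn
  wait -n                              # any single Mn exiting ...
  exit 1                               # ... stops the container with a non-zero status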

Executing docker top container in the above situation gives:

M
M1
M2
M3
M4
OSD1
OSD2
OSD3
OSD4

After the crash of, say, OSD1, I saw something like this (docker top container):

M
M1
M2
M3
M4
OSD2
OSD3
OSD4

The OSD processes were reported to be in Sl-state by ps.

At this point, OSD1 was gone from the list but M1 was still running. There was 
no SIGCHLD! At the same time, OSDs 2-4 were marked down by the MONs, but not 
OSD1! Due to this, any IO targeting OSD1 got stuck, and the corresponding 
slow-ops warnings started piling up.

My best bet is that not all threads of OSD1 were terminated and, therefore, no 
SIGCHLD was sent to M1. For some reason OSD1 was not marked down, and I wonder 
if its left-overs might have responded to heartbeats.

At the same time, the disk was no longer accessible to LVM commands. A "ceph-volume 
inventory /dev/sda" got stuck in "lvs" (in D-state). I did not try to access the raw 
device with dd; I was thinking about it, but attended to more pressing issues. I actually 
don't think the raw device was locked up, but that's just a guess.

In an attempt to clean up the OSD-down situation, I executed "docker stop 
container" (to be followed by docker start). The stop took a long time (I use an 
increased SIGKILL timeout) and resulted in this state (docker top container):

OSD2
OSD3
OSD4

The OSD processes were now reported in D-state by ps, and the container was 
still reported as running by docker. However, at this point all 4 OSDs were 
marked down, PGs peered, and IO started again.
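
The increased SIGKILL timeout mentioned above is just docker stop's grace period, e.g.:

  docker stop --time 300 <container>   # wait up to 300 s for a clean exit before SIGKILL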

I'm wondering if a failed allocation attempt led to a device/LVM lock being 
acquired but not released, leading to an LVM device freeze. There were thread-lockup 
messages in the syslog. It smells a lot like a deadlock created by not 
releasing a critical resource on SIGABRT. Unfortunately, there seem to be no 
log messages from the thread that got locked up.

Hope this makes some sense when interpreting the logs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <[email protected]>
Sent: 09 October 2022 22:07:16
To: Frank Schilder; [email protected]
Subject: Re: [ceph-users] LVM osds loose connection to disk

Hi Frank,

I can't advise much on the disk issue - just the obvious thought about
upgrading the firmware and/or contacting the vendor. IIUC the disk is
totally inaccessible at this point, i.e. you're unable to read from it
bypassing LVM as well, right? If so, this definitely looks like a
low-level problem.


As for the OSD-down issue - may I have some clarification, please: did this
osd.975 never go down, or did it go down just a few minutes later? In the log
snippet you shared, I can see a 2-minute gap between the operation-timeout
indications and the final OSD suicide. I presume it had been able to
respond to heartbeats prior to that suicide and hence stayed online... But
I'm mostly speculating so far...


Thanks,

Igor

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
