Hi Igor.

The problem of OSD crashes was resolved after migrating just a little bit of
the meta-data pool to other disks (we decided to evacuate the small OSDs onto
larger disks to make space). Therefore, I don't think it's an LVM or disk
issue. The cluster is working perfectly now after migrating some data away
from the small OSDs. I rather believe that it's tightly related to "OSD crashes
during upgrade mimic->octopus"; it happens only on OSDs where the repair
command errors out with an abort on ENOSPC.

My hypothesis is now more along the lines of a deadlock occurring as a
consequence of an aborted daemon thread. Is there any part of the bluestore
code that acquires an exclusive device lock that gets passed through to the PV
and can lead to a device freeze if not released? I'm wondering if something
like this happened here as a consequence of the allocator failure. I saw a lot
of lock-up warnings related to OSD threads in the syslog.

Regarding the 2-minute time difference and heartbeats: the OSD seems to have
been responding to heartbeats all the time, even after the suicide time-out;
see description below. I executed "docker stop container" at 16:15:39. Until
this moment, the OSD was considered up+in by the MONs.

Here is a recollection of events from memory together with a description of how
the 4 OSDs on 1 disk are executed in a single container. I will send detailed
logs and scripts via our private communication. If anyone else is interested as
well, I'm happy to make them available.

Being in the situation over a weekend and at night, we didn't take precise 
minutes. Our priority was to get everything working. I'm afraid this here is as 
accurate as it gets.

Let's start with how the processes are started inside the container. We have a 
main script M executed as the entry-point to the container. For each OSD found 
on a drive, M forks off a copy Mn of itself, which in turn forks off the OSD 
process:

M -> M1 -> OSD1
M -> M2 -> OSD2
M -> M3 -> OSD3
M -> M4 -> OSD4

At the end, we have 5 instances of the main script and 4 instances of OSDs
running. This somewhat cumbersome-looking startup is required to be able to
forward signals sent by the docker daemon, most notably SIGINT on "docker stop
container". In addition, all instances of M trap a number of signals, including
SIGCHLD. If just one OSD dies, the entire container should stop and restart. On
a disk failure, all OSDs on that disk go down and will be rebuilt in the
background simultaneously.
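
To illustrate the intended behaviour (one child exits, everything else is
stopped), here is a minimal sketch in Python. It is not our actual entry-point
script: the function name supervise and its parameters are made up for this
example, and it polls the children instead of trapping SIGCHLD to keep the
sketch short.

```python
import signal
import subprocess
import time


def supervise(commands, poll_interval=0.1, grace=5.0):
    """Start all commands; once any one exits, stop the rest.

    Hypothetical helper for illustration only. Returns the exit codes
    in the order of `commands` (negative value = killed by that signal).
    """
    children = [subprocess.Popen(cmd) for cmd in commands]

    # Wait until at least one child has exited (stand-in for a SIGCHLD trap).
    while all(child.poll() is None for child in children):
        time.sleep(poll_interval)

    # Forward SIGINT to the survivors, as "docker stop" would trigger.
    for child in children:
        if child.poll() is None:
            child.send_signal(signal.SIGINT)

    # Give the survivors `grace` seconds in total, then fall back to SIGKILL.
    deadline = time.monotonic() + grace
    for child in children:
        try:
            child.wait(timeout=max(0.0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            child.kill()
            # Note: a process stuck in D-state does not react to SIGKILL
            # until its blocking I/O returns, so this wait can hang.
            child.wait()

    return [child.returncode for child in children]
```

For example, supervise([["sleep", "1"], ["sleep", "60"]]) returns after about
one second, with the second sleep terminated by a signal. The sketch also shows
why the scheme depends on the exit notification: if the parent never learns
that a child is gone, nothing else gets stopped.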

Executing docker top container on the above situation gives:

M
M1
M2
M3
M4
OSD1
OSD2
OSD3
OSD4

After the crash of, say, OSD1, I saw something like this (docker top container):

M
M1
M2
M3
M4
OSD2
OSD3
OSD4

The OSD processes were reported to be in Sl-state by ps.

At this point, OSD1 was gone from the list but M1 was still running. There was
no SIGCHLD! At the same time, OSDs 2-4 were marked down by the MONs, but not
OSD1! Due to this, any IO targeting OSD1 got stuck and the corresponding slow
ops warnings started piling up.

My best bet is that not all threads of OSD1 were terminated and, therefore, no 
SIGCHLD was sent to M1. For some reason OSD1 was not marked down and I wonder 
if its left-overs might have responded to heartbeats.
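
One way to check for such left-over threads would be to look at the per-thread
states under /proc on the host. The following is only a sketch of what I have
in mind, not something I ran during the incident; parse_state and thread_states
are names invented for this example. It relies on the documented layout of
/proc/<pid>/task/<tid>/stat, where the state letter (R, S, D, Z, ...) follows
the command name in parentheses.

```python
import os


def parse_state(stat_line):
    """Return the state letter from the contents of a /proc stat file.

    The command name is enclosed in parentheses and may itself contain
    spaces or ')', so split on the last ')' before reading the state.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def thread_states(pid):
    """Map thread ID -> state letter for all threads of `pid` (Linux only)."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as fh:
                states[int(tid)] = parse_state(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            # The thread exited between listdir() and open(); skip it.
            continue
    return states
```

Threads reported with state D are in uninterruptible sleep, typically waiting
for I/O, and such threads would keep the process from being reaped.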

At the same time the disk was not accessible to LVM commands any more. A 
"ceph-volume inventory /dev/sda" got stuck in "lvs" (in D-state). I did not try 
to access the raw device with dd. I was thinking about it, but attended to more 
pressing issues. I actually don't think the raw device was locked up, but 
that's just a guess.

In an attempt to clean up the OSD down situation, I executed "docker stop
container" (to be followed by "docker start"). The stop took a long time (I use
an increased SIGKILL time-out) and resulted in this state (docker top
container):

OSD2
OSD3
OSD4

The OSD processes were now reported in D-state by ps and the container was 
still reported to be running by docker. However, at this point all 4 OSDs were 
marked down, PGs peered and IO started again.

I'm wondering if a failed allocation attempt led to a device/LVM lock being
acquired but not released, leading to an LVM device freeze. There were thread
lockup messages in the syslog. It smells a lot like a deadlock situation
created by not releasing a critical resource on SIGABRT. Unfortunately, there
seem to be no log messages from the thread that got locked up.

Hope this makes some sense when interpreting the logs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <[email protected]>
Sent: 09 October 2022 22:07:16
To: Frank Schilder; [email protected]
Subject: Re: [ceph-users] LVM osds loose connection to disk

Hi Frank,

can't advise much on the disk issue  - just an obvious thought about
upgrading the firmware and/or contacting the vendor. IIUC disk is
totally inaccessible at this point, e.g. you're unable to read from it
bypassing LVM as well, right? If so this definitely looks like a
low-level problem.


As for the OSD down issue - may I have some clarification please - did this
osd.975 never go down, or did it go down just a few minutes later? In the log
snippet you shared I can see a 2 min gap between the operation timeout
indications and the final OSD suicide. I presume it had been able to
respond to heartbeats prior to that suicide and hence stayed online... But
mostly speculating so far...


Thanks,

Igor
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
