Hi Frank,

bfq was definitely broken, deadlocking io for a few CentOS Stream 8
kernels between EL 8.5 and 8.6 -- we also hit that in production and
switched over to `none`.

I don't recall exactly when the upstream kernel was also broken but
apparently this was the fix:
https://marc.info/?l=linux-block&m=164366111512992&w=2

In any case you might want to just use `none` with flash devices -- I'm
not sure fair scheduling and multi-queue arbitration are very meaningful
anymore for high-IOPS devices and ceph.
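
If you want that to persist across reboots, a udev rule along these lines
should do it -- just a sketch, the rule file name is arbitrary and you may
want to match nvme devices too:

# /etc/udev/rules.d/60-io-scheduler.rules
ACTION=="add|change", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"

# apply without a reboot:
udevadm control --reload-rules && udevadm trigger --subsystem-match=block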

Cheers, Dan



On Thu, Nov 17, 2022 at 1:23 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Hi Igor,
>
> I might have a smoking gun. Could it be that ceph (or the kernel?) has issues 
> with certain disk schedulers? There was a recommendation on this list to use 
> bfq with bluestore. This was actually the one change other than the ceph 
> version during the upgrade: making bfq the default. Now, this might be a 
> problem with certain drives whose preferred scheduler is something other than 
> bfq. Here is my observation:
>
> I managed to get one of the OSDs to hang today. It was not the usual abort; I 
> don't know why the op_thread_timeout and suicide_timeout didn't trigger. The 
> OSD's worker thread was unresponsive for a bit more than 10 minutes before I 
> took action. Hence, nothing in the log (I should maybe have sent SIGABRT with 
> kill). This time I tried to check whether I could access the disk with dd. 
> And I could not. A
>
> dd if=/dev/sdn of=disk-dump bs=4096 count=100
>
> got stuck right away in D-state:
>
> 1652472 D+   dd if=/dev/sdn of=disk-dump bs=4096 count=100
>
> This time, since I was curious about the disk scheduler, I went to another 
> terminal on the same machine and did:
>
> # cat /sys/block/sdn/queue/scheduler
> mq-deadline kyber [bfq] none
> # echo none >> /sys/block/sdn/queue/scheduler
> # cat /sys/block/sdn/queue/scheduler
> [none] mq-deadline kyber bfq
>
> Going back to the stuck session, I see now (you can see my attempts to 
> interrupt the dd):
>
> # dd if=/dev/sdn of=disk-dump bs=4096 count=100
> ^C^C3+0 records in
> 2+0 records out
> 8192 bytes (8.2 kB) copied, 336.712 s, 0.0 kB/s
>
> Suddenly, the disk responds again! Also, the ceph container stopped (a docker 
> stop of the container had returned earlier without the container actually 
> stopping - as before in this situation).
>
> Could it be that recommendations for disk scheduler choice should be 
> reconsidered, or is this pointing towards a bug in either how ceph or the 
> kernel schedules disk IO? To confirm this hypothesis, I will retry the stress 
> test with the scheduler set to the default kernel choice.
>
> I did day-long fio benchmarks with all schedulers and all sorts of workloads 
> on our drives and could not find anything like that. It looks like it is very 
> difficult, if not impossible, to reproduce a realistic ceph-osd IO pattern 
> for testing. Is there any tool available for this?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 14 November 2022 13:03:58
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: LVM osds loose connection to disk
>
> I can't reproduce the problem with artificial workloads; I need to get one of 
> these OSDs running in the meta-data pool until it crashes. My plan is to 
> reduce time-outs and increase the log level for these specific OSDs so that 
> the memory log captures what happened before an abort. I can spare about 100G 
> of RAM for log entries. I found the following relevant options with settings 
> I think will work for my case (how I'd apply them is sketched below the list):
>
> osd_op_thread_suicide_timeout 30 # default 150
> osd_op_thread_timeout 10 # default 15
> debug_bluefs 1/20 # default 1/5
> debug_bluestore 1/20 # default 1/5
> bluestore_kv_sync_util_logging_s 3 # default 10
> log_max_recent 100000 # default 10000
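>
> Concretely, something like this at runtime via the mon config db (osd id is 
> just an example; repeat for each suspect OSD):
>
> ceph config set osd.959 osd_op_thread_suicide_timeout 30
> ceph config set osd.959 osd_op_thread_timeout 10
> ceph config set osd.959 debug_bluefs 1/20
> ceph config set osd.959 debug_bluestore 1/20
> ceph config set osd.959 bluestore_kv_sync_util_logging_s 3
> ceph config set osd.959 log_max_recent 100000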
>
> It would be great if someone could confirm that these settings will achieve 
> what I want (or tell me what is missing). I would like to capture at least 1 
> minute's worth of log entries in RAM with high debug settings. Does anyone 
> have a good estimate of how many log entries are created per second with 
> these settings, for tuning log_max_recent?
>
> Thanks for your help!
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <fr...@dtu.dk>
> Sent: 11 November 2022 10:25:17
> To: Igor Fedotov; ceph-users@ceph.io
> Subject: [ceph-users] Re: LVM osds loose connection to disk
>
> Hi Igor,
>
> thanks for your reply. We only swapped the mimic containers for the octopus 
> ones. We didn't even reboot the servers during the upgrade, only later for 
> troubleshooting. The only change since the upgrade is the ceph container.
>
> I'm trying to go down the stack and run a benchmark on the OSD directly. 
> Unfortunately, it seems that osd bench is too simplistic. I don't think we 
> have a problem with the disk; rather, I think there is a race condition that 
> gets the bstore_kv_sync thread stuck. I would need to reproduce a realistic 
> workload on the OSD itself. Standard benchmarks run directly on the disks did 
> not show any problems (even when running for days). Is there a way to address 
> an OSD directly with IO, for example with low-level rados commands?
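>
> Something like rados bench would at least generate load, but only against a 
> whole pool rather than one specific OSD (pool name and values are just an 
> example):
>
> rados bench -p test-pool 60 write -b 4096 -t 16 --no-cleanup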
>
> I'm almost down to thinking that the only way to create a realistic workload 
> is to create a size-1 pool on a single OSD, create an rbd image on this pool 
> and run fio with rbd engine.
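>
> Roughly like this -- a sketch only, the names are made up, pinning the pool 
> to one particular OSD would still need a dedicated crush rule, and on octopus 
> size 1 also requires mon_allow_pool_size_one=true:
>
> ceph osd pool create osd-test 1 1
> ceph osd pool set osd-test size 1 --yes-i-really-mean-it
> rbd create osd-test/bench-img --size 100G
> fio --name=osd-test --ioengine=rbd --clientname=admin --pool=osd-test \
>     --rbdname=bench-img --rw=randwrite --bs=4k --iodepth=16 \
>     --direct=1 --runtime=600 --time_based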
>
> Another question. The osd bench command accepts 4 arguments:
>
> bench [<count:int>] [<size:int>] [<object_size:int>] [<object_num:int>] :  
> OSD benchmark: write <count> <size>-byte objects(with <obj_size> <obj_num>), 
> (default count=1G default size=4MB). Results in log.
>
> Only the first 2 are explained in the docs. Any idea what the other 2 are 
> doing?
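>
> For reference, an invocation with all four arguments would look like this 
> (arbitrary example values, in the order <count> <size> <object_size> 
> <object_num>):
>
> ceph tell osd.959 bench 1073741824 4096 4194304 128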
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedo...@croit.io>
> Sent: 10 November 2022 15:48:23
> To: Frank Schilder; ceph-users@ceph.io
> Subject: Re: [ceph-users] Re: LVM osds loose connection to disk
>
> Hi Frank,
>
> unfortunately IMO it's not an easy task to identify what the relevant
> differences between mimic and octopus are in this respect. At least
> the question would be which minor Ceph releases are/were in use.
>
> I recall there were some tricks with setting/clearing bluefs_buffered_io
> somewhere in that period. But I can hardly recall anything else...
> Curious whether the OS, container runtime or other third-party software was
> upgraded along with Ceph as well? Just in case: I presume you were using
> containers with mimic too, right?
>
>
> Instead I'd rather approach the issue from a different side:
>
> 1) Learn how to [easily] reproduce the issue, preferably in a test lab
> rather than in the field. Better to use exactly the same disk(s), H/W, OS
> and S/W versions as in production.
>
> 2) (can be done in prod as well) - once an OSD is stuck, set debug-bluefs
> and debug-bdev to 20, collect a verbose log and check what's happening
> there (rough command sketch below). Meanwhile monitor disk activity - is
> there any load on the disk in this state at all? Do disk reads (e.g. via dd)
> outside the OSD container succeed at this point?
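>
> Something along these lines (osd id and device are just examples):
>
> ceph tell osd.959 config set debug_bluefs 20
> ceph tell osd.959 config set debug_bdev 20
> # while it is stuck, check whether the raw device still answers reads:
> dd if=/dev/sdn of=/dev/null bs=4096 count=100 iflag=direct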
>
>
> Thanks,
>
> Igor
>
> On 11/10/2022 5:23 PM, Frank Schilder wrote:
> > Hi all,
> >
> > I have some kind of update on the matter of stuck OSDs. It seems not to be 
> > an LVM issue and it also seems not to be connected to the OSD size.
> >
> > After moving all data from the tiny 100G OSDs to spare SSDs, I redeployed 
> > the 400G disks with 1 OSD per disk and started to move data from the slow 
> > spare SSDs back to the fast ones. After moving about 100 out of 1024 PGs of 
> > the pool, fast OSDs started failing again. It is much the same observation 
> > as before: I can't stop a container with a failed OSD. However, when I 
> > restart docker, everything comes up clean.
> >
> > When I look at the logs, I see that the OSD aborts with a suicide timeout 
> > after many osd_op_tp thread timeouts. However, the OSD process, and with it 
> > the container, does not terminate because of a hanging thread. The syslog 
> > has this message (all messages below are in chronological order):
> >
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 122 seconds
> >
> > about 30 seconds before the OSD aborts with suicide timeout with
> >
> > ceph-osd: 2022-11-09T17:36:53.691+0100 7f663a23c700 -1 *** Caught signal 
> > (Aborted) **#012 in thread 7f663a23c700 thread_name:msgr-worker-2#012#012 
> > ceph version 15.2.17 ...
> >
> > What I see in the syslog is that the bstore_kv_sync thread does not seem to 
> > be terminated by the abort. These messages continue to show up:
> >
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 368 seconds.
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 491 seconds.
> > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 614 seconds.
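> >
> > For the record, the kernel stack of such a blocked thread could be dumped 
> > with something like this (pid taken from the messages above, needs root):
> >
> > cat /proc/1283156/stack
> > # or dump all blocked tasks to the kernel log:
> > echo w > /proc/sysrq-trigger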
> >
> > On docker stop of the container the launch script receives the TERM signal, 
> > but the OSD cannot be deactivated because of this thread:
> >
> > journal: osd_lvm_start: deactivating OSD 959
> > journal: osd_lvm_start: unmounting /var/lib/ceph/osd/ceph-959
> > journal: umount: /var/lib/ceph/osd/ceph-959: target is busy.
> >
> > It's probably busy because of the hanging bstore_kv_sync thread. As a 
> > consequence, the container is still running, still has a ceph-osd process 
> > shown by docker top, and these messages continue to show up:
> >
> > INFO: task bstore_kv_sync:1283156 blocked for more than 737 seconds.
> > INFO: task bstore_kv_sync:1283156 blocked for more than 860 seconds.
> >
> > Although the bstore_kv_sync thread is unkillable, a restart of docker 
> > clears everything out and the OSD restarts fine. I'm somewhat hesitant to 
> > accept the simple "must be the firmware" statement, because these disks 
> > worked fine for 4 years with mimic. The only thing that changed was the 
> > ceph version, from mimic to octopus; everything else stayed the same: OS 
> > version, kernel version, docker version, firmware version.
> >
> > Since it happens only on this type of disk, it could very well have to do 
> > with firmware, but not without ceph having made a serious change in 
> > low-level disk access between mimic and octopus. So I'm wondering which 
> > firmware features octopus is using that mimic was not. It would be great if 
> > somebody had some pointers to which part of the software stack I should 
> > look at; I would like to avoid hunting ghosts.
> >
> > Many thanks and best regards!
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Frank Schilder <fr...@dtu.dk>
> > Sent: 10 October 2022 23:33:32
> > To: Igor Fedotov; ceph-users@ceph.io
> > Subject: [ceph-users] Re: LVM osds loose connection to disk
> >
> > Hi Igor.
> >
> > The problem of OSD crashes was resolved after migrating just a little bit 
> > of the meta-data pool to other disks (we decided to evacuate the small OSDs 
> > onto larger disks to make space). Therefore, I don't think it's an LVM or 
> > disk issue. The cluster is working perfectly now after migrating some data 
> > away from the small OSDs. I rather believe that it's tightly related to "OSD 
> > crashes during upgrade mimic->octopus"; it happens only on OSDs where the 
> > repair command errors out with an abort on enospc.
> >
> > My hypothesis is now more along the lines of a deadlock occurring as a 
> > consequence of an aborted daemon thread. Is there any part of the bluestore 
> > code that acquires an exclusive device lock that gets passed through to the 
> > PV and can lead to a device freeze if not released? I'm wondering if 
> > something like this happened here as a consequence of the allocator 
> > failure. I saw a lot of lock-up warnings related to OSD threads in the 
> > syslog.
> >
> > Regarding the 2 minute time difference and heartbeats: the OSD seems to 
> > have been responding to heartbeats the whole time, even after the suicide 
> > time-out; see the description below. I executed docker stop on the 
> > container at 16:15:39. Until that moment, the OSD was considered up+in by 
> > the MONs.
> >
> > Here is a recollection of events from memory, together with a description 
> > of how the 4 OSDs on 1 disk are executed in a single container. I will send 
> > detailed logs and scripts via our private communication. If anyone else is 
> > interested as well, I'm happy to make them available.
> >
> > Being in the situation over a weekend and at night, we didn't take precise 
> > minutes. Our priority was to get everything working. I'm afraid this here 
> > is as accurate as it gets.
> >
> > Let's start with how the processes are started inside the container. We 
> > have a main script M executed as the entry-point to the container. For each 
> > OSD found on a drive, M forks off a copy Mn of itself, which in turn forks 
> > off the OSD process:
> >
> > M -> M1 -> OSD1
> > M -> M2 -> OSD2
> > M -> M3 -> OSD3
> > M -> M4 -> OSD4
> >
> > At the end, we have 5 instances of the main script and 4 OSD instances 
> > running. This somewhat cumbersome-looking startup is required to be able to 
> > forward signals sent by the docker daemon, most notably SIGINT on docker 
> > stop container. In addition, all instances of M trap a number of signals, 
> > including SIGCHLD. If just one OSD dies, the entire container should stop 
> > and restart. On a disk failure, all OSDs on that disk go down and will be 
> > rebuilt in the background simultaneously.
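> >
> > In shell terms the pattern is roughly this (a heavily simplified sketch, 
> > not the actual script):
> >
> > run_osd() {                      # one "Mn" instance
> >     ceph-osd -f --id "$1" &      # the actual "OSDn" process
> >     osd_pid=$!
> >     # forward docker's stop signal to the OSD instead of exiting ourselves:
> >     trap 'kill -TERM "$osd_pid"' TERM INT
> >     # when the OSD exits, wait returns and the whole container goes down:
> >     wait "$osd_pid"
> >     exit 1
> > }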
> >
> > Executing docker top container on the above situation gives:
> >
> > M
> > M1
> > M2
> > M3
> > M4
> > OSD1
> > OSD2
> > OSD3
> > OSD4
> >
> > After the crash of, say, OSD1, I saw something like this (docker top 
> > container):
> >
> > M
> > M1
> > M2
> > M3
> > M4
> > OSD2
> > OSD3
> > OSD4
> >
> > The OSD processes were reported to be in Sl-state by ps.
> >
> > At this point, OSD1 was gone from the list but M1 was still running. There 
> > was no SIGCHLD! At the same time, OSDs 2-4 were marked down by the MONs, 
> > but not OSD1! Due to this, any IO targeting OSD1 got stuck and the 
> > corresponding slow ops warnings started piling up.
> >
> > My best bet is that not all threads of OSD1 were terminated and, therefore, 
> > no SIGCHLD was sent to M1. For some reason OSD1 was not marked down and I 
> > wonder if its left-overs might have responded to heartbeats.
> >
> > At the same time the disk was not accessible to LVM commands any more. A 
> > "ceph-volume inventory /dev/sda" got stuck in "lvs" (in D-state). I did not 
> > try to access the raw device with dd. I was thinking about it, but attended 
> > to more pressing issues. I actually don't think the raw device was locked 
> > up, but that's just a guess.
> >
> > In an attempt to clean up the OSD down situation, I executed "docker stop 
> > container" (to be followed by docker start). The stop took a long time (I 
> > use an increased SIGKILL time-out) and resulted in this state (docker top 
> > container):
> >
> > OSD2
> > OSD3
> > OSD4
> >
> > The OSD processes were now reported in D-state by ps and the container was 
> > still reported to be running by docker. However, at this point all 4 OSDs 
> > were marked down, PGs peered and IO started again.
> >
> > I'm wondering if a failed allocation attempt led to a device/LVM lock 
> > being acquired but not released, leading to an LVM device freeze. There 
> > were thread lockup messages in the syslog. It smells a lot like a deadlock 
> > situation created by not releasing a critical resource on SIGABRT. 
> > Unfortunately, there seem to be no log messages from the thread that got 
> > locked up.
> >
> > Hope this makes some sense when interpreting the logs.
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Igor Fedotov <igor.fedo...@croit.io>
> > Sent: 09 October 2022 22:07:16
> > To: Frank Schilder; ceph-users@ceph.io
> > Subject: Re: [ceph-users] LVM osds loose connection to disk
> >
> > Hi Frank,
> >
> > can't advise much on the disk issue  - just an obvious thought about
> > upgrading the firmware and/or contacting the vendor. IIUC disk is
> > totally inaccessible at this point, e.g. you're unable to read from it
> > bypassing LVM as well, right? If so this definitely looks like a
> > low-level problem.
> >
> >
> > As for the OSD down issue - may I have some clarification please: did this
> > osd.975 never go down, or did it go down just a few minutes later? In the
> > log snippet you shared I can see a 2 min gap between the operation timeout
> > indications and the final OSD suicide. I presume it had been able to
> > respond to heartbeats prior to that suicide and hence stayed online... But
> > I'm mostly speculating so far...
> >
> >
> > Thanks,
> >
> > Igor
>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
