Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block

Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.


Thanks for the clarification!
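For anyone else hitting the same confusion: a quick way to confirm that an
idle kernel client is simply behind on the osdmap is to compare the epoch it
has applied with the cluster's current one. This is only a sketch; the
debugfs path and the host prompts are illustrative, and it assumes debugfs is
mounted on the client:

---cut here---
# on the client: the first line of the libceph osdmap file shows the epoch
# the kernel client is currently working with
client:~ # head -1 /sys/kernel/debug/ceph/*/osdmap

# on a cluster node: the epoch the cluster is actually at
host1:~ # ceph osd dump | head -1
---cut here---

A large gap between the two just means the client has been idle and will
catch up (and print those libceph lines) on its next request.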


Quoting Ilya Dryomov:


On Thu, Aug 30, 2018 at 1:04 PM Eugen Block  wrote:


Hi again,

we still haven't figured out the reason for the flapping, but I wanted
to get back to the dmesg entries.
They just reflect what happened in the past; they're not an indicator
that predicts anything.


The kernel client is just that, a client.  Almost by definition,
everything it sees has already happened.



For example, when I changed the primary-affinity of OSD.24 last week,
one of the clients only noticed it today, 4 days later. If the
clients don't have to communicate with the respective host/OSD in the
meantime, they log those events on the next reconnect.


Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.

Thanks,

Ilya






Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Ilya Dryomov
On Thu, Aug 30, 2018 at 1:04 PM Eugen Block  wrote:
>
> Hi again,
>
> we still haven't figured out the reason for the flapping, but I wanted
> to get back to the dmesg entries.
> They just reflect what happened in the past; they're not an indicator
> that predicts anything.

The kernel client is just that, a client.  Almost by definition,
everything it sees has already happened.

>
> For example, when I changed the primary-affinity of OSD.24 last week,
> one of the clients only noticed it today, 4 days later. If the
> clients don't have to communicate with the respective host/OSD in the
> meantime, they log those events on the next reconnect.

Correct, except it doesn't have to be a specific host or a specific
OSD.  What matters here is whether the client is idle.  As soon as the
client is woken up and sends a request to _any_ OSD, it receives a new
osdmap and applies it, possibly emitting those dmesg entries.

Thanks,

Ilya


Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-30 Thread Eugen Block

Hi again,

we still haven't figured out the reason for the flapping, but I wanted
to get back to the dmesg entries.
They just reflect what happened in the past; they're not an indicator
that predicts anything.


For example, when I changed the primary-affinity of OSD.24 last week,
one of the clients only noticed it today, 4 days later. If the
clients don't have to communicate with the respective host/OSD in the
meantime, they log those events on the next reconnect.
I just wanted to share that in case anybody else is wondering (or  
maybe it was just me).
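(For reference, and only as an illustration rather than the exact commands:
the affinity change itself is the standard primary-affinity command, and the
delayed map update is what eventually appears in the client's dmesg. Host
prompts and values are illustrative.)

---cut here---
host1:~ # ceph osd primary-affinity osd.24 0
# days later, once the idle client talks to any OSD again, the change
# shows up in its kernel log:
client:~ # dmesg -T | grep libceph | tail
---cut here---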


Regards,
Eugen


Quoting Eugen Block:


Update:
I changed the primary affinity of one OSD back to 1.0 to test if  
those metrics change, and indeed they do:

OSD.24 immediately shows values greater than 0.
I guess the metrics are completely unrelated to the flapping.

So the search goes on...


Quoting Eugen Block:

An hour ago host5 started to report the OSDs on host4 as down (still
no clue why), resulting in slow requests. This time no flapping
occurred; the cluster recovered a couple of minutes later. No other
OSDs reported that, only the two on host5. There's nothing in the
logs of the reporting or the affected OSDs.


Then I compared a perf dump of one healthy OSD with one on host4.
There's something strange about the metrics (many of them are 0); I
just can't tell if it's related to the fact that host4 has no primary
OSDs. But even with no primary OSDs I would expect different values
for OSDs that have been running for a week now.


---cut here---
host1:~ # diff -u perfdump.osd1 perfdump.osd24
--- perfdump.osd1   2018-08-23 11:03:03.695927316 +0200
+++ perfdump.osd24  2018-08-23 11:02:09.919927375 +0200
@@ -1,99 +1,99 @@
{
"osd": {
"op_wip": 0,
-"op": 7878594,
-"op_in_bytes": 852767683202,
-"op_out_bytes": 1019871565411,
+"op": 0,
+"op_in_bytes": 0,
+"op_out_bytes": 0,
"op_latency": {
-"avgcount": 7878594,
-"sum": 1018863.131206702,
-"avgtime": 0.129320425
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_process_latency": {
-"avgcount": 7878594,
-"sum": 879970.400440694,
-"avgtime": 0.111691299
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_prepare_latency": {
-"avgcount": 8321733,
-"sum": 41376.442963329,
-"avgtime": 0.004972094
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_r": 3574792,
-"op_r_out_bytes": 1019871565411,
+"op_r": 0,
+"op_r_out_bytes": 0,
"op_r_latency": {
-"avgcount": 3574792,
-"sum": 54750.502669010,
-"avgtime": 0.015315717
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_r_process_latency": {
-"avgcount": 3574792,
-"sum": 34107.703579874,
-"avgtime": 0.009541171
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_r_prepare_latency": {
-"avgcount": 3574817,
-"sum": 34262.515884817,
-"avgtime": 0.009584411
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_w": 4249520,
-"op_w_in_bytes": 847518164870,
+"op_w": 0,
+"op_w_in_bytes": 0,
"op_w_latency": {
-"avgcount": 4249520,
-"sum": 960898.540843217,
-"avgtime": 0.226119312
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_w_process_latency": {
-"avgcount": 4249520,
-"sum": 844398.804808119,
-"avgtime": 0.198704513
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_w_prepare_latency": {
-"avgcount": 4692618,
-"sum": 7032.358957948,
-"avgtime": 0.001498600
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
-"op_rw": 54282,
-"op_rw_in_bytes": 5249518332,
+"op_rw": 0,
+"op_rw_in_bytes": 0,
"op_rw_out_bytes": 0,
"op_rw_latency": {
-"avgcount": 54282,
-"sum": 3214.087694475,
-"avgtime": 0.059210929
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
},
"op_rw_process_latency": {
-"avgcount": 54282,
-"sum": 1463.892052701,
-"avgtime": 0.026968277
+"avgcount": 0,
+"sum": 0.0,
+

Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-24 Thread Eugen Block

Update:
I changed the primary affinity of one OSD back to 1.0 to test if those  
metrics change, and indeed they do:

OSD.24 immediately shows values greater than 0.
I guess the metrics are completely unrelated to the flapping.

So the search goes on...
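That presumably also explains the zeros: the op_* counters seem to be
incremented only for client ops handled as primary, while replication writes
land in the subop_* counters. In case anyone wants to repeat the check,
something along these lines should do (the python3 one-liner is just one way
of pulling a few counters out of the JSON and assumes python3 is installed;
run it on the host carrying the OSD):

---cut here---
host4:~ # ceph osd primary-affinity osd.24 1.0
host4:~ # ceph daemon osd.24 perf dump | \
            python3 -c 'import json,sys; o=json.load(sys.stdin)["osd"]; print(o["op"], o["op_r"], o["op_w"])'
---cut here---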


Quoting Eugen Block:

An hour ago host5 started to report the OSDs on host4 as down (still
no clue why), resulting in slow requests. This time no flapping
occurred; the cluster recovered a couple of minutes later. No other
OSDs reported that, only the two on host5. There's nothing in the
logs of the reporting or the affected OSDs.


Then I compared a perf dump of one healthy OSD with one on host4.
There's something strange about the metrics (many of them are 0); I
just can't tell if it's related to the fact that host4 has no primary
OSDs. But even with no primary OSDs I would expect different values
for OSDs that have been running for a week now.


---cut here---
host1:~ # diff -u perfdump.osd1 perfdump.osd24
--- perfdump.osd1   2018-08-23 11:03:03.695927316 +0200
+++ perfdump.osd24  2018-08-23 11:02:09.919927375 +0200
@@ -1,99 +1,99 @@
 {
 "osd": {
 "op_wip": 0,
-"op": 7878594,
-"op_in_bytes": 852767683202,
-"op_out_bytes": 1019871565411,
+"op": 0,
+"op_in_bytes": 0,
+"op_out_bytes": 0,
 "op_latency": {
-"avgcount": 7878594,
-"sum": 1018863.131206702,
-"avgtime": 0.129320425
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_process_latency": {
-"avgcount": 7878594,
-"sum": 879970.400440694,
-"avgtime": 0.111691299
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_prepare_latency": {
-"avgcount": 8321733,
-"sum": 41376.442963329,
-"avgtime": 0.004972094
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_r": 3574792,
-"op_r_out_bytes": 1019871565411,
+"op_r": 0,
+"op_r_out_bytes": 0,
 "op_r_latency": {
-"avgcount": 3574792,
-"sum": 54750.502669010,
-"avgtime": 0.015315717
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_r_process_latency": {
-"avgcount": 3574792,
-"sum": 34107.703579874,
-"avgtime": 0.009541171
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_r_prepare_latency": {
-"avgcount": 3574817,
-"sum": 34262.515884817,
-"avgtime": 0.009584411
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_w": 4249520,
-"op_w_in_bytes": 847518164870,
+"op_w": 0,
+"op_w_in_bytes": 0,
 "op_w_latency": {
-"avgcount": 4249520,
-"sum": 960898.540843217,
-"avgtime": 0.226119312
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_w_process_latency": {
-"avgcount": 4249520,
-"sum": 844398.804808119,
-"avgtime": 0.198704513
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_w_prepare_latency": {
-"avgcount": 4692618,
-"sum": 7032.358957948,
-"avgtime": 0.001498600
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_rw": 54282,
-"op_rw_in_bytes": 5249518332,
+"op_rw": 0,
+"op_rw_in_bytes": 0,
 "op_rw_out_bytes": 0,
 "op_rw_latency": {
-"avgcount": 54282,
-"sum": 3214.087694475,
-"avgtime": 0.059210929
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_rw_process_latency": {
-"avgcount": 54282,
-"sum": 1463.892052701,
-"avgtime": 0.026968277
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_rw_prepare_latency": {
-"avgcount": 54298,
-"sum": 81.568120564,
-"avgtime": 0.001502230
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_before_queue_op_lat": {
-"avgcount": 25469574,
-"sum": 6654.779033909,
-"avgtime": 0.000261283
+"avgcount": 4307123,
+"sum": 2361.603323307,
+"avgtime": 0.000548301
 },
 "op_before_dequeue_op_lat": {
-   

Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-23 Thread Eugen Block
An hour ago host5 started to report the OSDs on host4 as down (still
no clue why), resulting in slow requests. This time no flapping
occurred; the cluster recovered a couple of minutes later. No other
OSDs reported that, only the two on host5. There's nothing in the
logs of the reporting or the affected OSDs.


Then I compared a perf dump of one healthy OSD with one on host4.
There's something strange about the metrics (many of them are 0); I
just can't tell if it's related to the fact that host4 has no primary
OSDs. But even with no primary OSDs I would expect different values
for OSDs that have been running for a week now.
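For reference, the two dumps were taken via the admin socket on the
respective hosts, roughly like this (file names as in the diff below; the
exact invocation is just a sketch):

---cut here---
host1:~ # ceph daemon osd.1 perf dump > perfdump.osd1
host1:~ # ssh host4 ceph daemon osd.24 perf dump > perfdump.osd24
host1:~ # diff -u perfdump.osd1 perfdump.osd24
---cut here---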


---cut here---
host1:~ # diff -u perfdump.osd1 perfdump.osd24
--- perfdump.osd1   2018-08-23 11:03:03.695927316 +0200
+++ perfdump.osd24  2018-08-23 11:02:09.919927375 +0200
@@ -1,99 +1,99 @@
 {
 "osd": {
 "op_wip": 0,
-"op": 7878594,
-"op_in_bytes": 852767683202,
-"op_out_bytes": 1019871565411,
+"op": 0,
+"op_in_bytes": 0,
+"op_out_bytes": 0,
 "op_latency": {
-"avgcount": 7878594,
-"sum": 1018863.131206702,
-"avgtime": 0.129320425
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_process_latency": {
-"avgcount": 7878594,
-"sum": 879970.400440694,
-"avgtime": 0.111691299
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_prepare_latency": {
-"avgcount": 8321733,
-"sum": 41376.442963329,
-"avgtime": 0.004972094
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_r": 3574792,
-"op_r_out_bytes": 1019871565411,
+"op_r": 0,
+"op_r_out_bytes": 0,
 "op_r_latency": {
-"avgcount": 3574792,
-"sum": 54750.502669010,
-"avgtime": 0.015315717
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_r_process_latency": {
-"avgcount": 3574792,
-"sum": 34107.703579874,
-"avgtime": 0.009541171
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_r_prepare_latency": {
-"avgcount": 3574817,
-"sum": 34262.515884817,
-"avgtime": 0.009584411
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_w": 4249520,
-"op_w_in_bytes": 847518164870,
+"op_w": 0,
+"op_w_in_bytes": 0,
 "op_w_latency": {
-"avgcount": 4249520,
-"sum": 960898.540843217,
-"avgtime": 0.226119312
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_w_process_latency": {
-"avgcount": 4249520,
-"sum": 844398.804808119,
-"avgtime": 0.198704513
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_w_prepare_latency": {
-"avgcount": 4692618,
-"sum": 7032.358957948,
-"avgtime": 0.001498600
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
-"op_rw": 54282,
-"op_rw_in_bytes": 5249518332,
+"op_rw": 0,
+"op_rw_in_bytes": 0,
 "op_rw_out_bytes": 0,
 "op_rw_latency": {
-"avgcount": 54282,
-"sum": 3214.087694475,
-"avgtime": 0.059210929
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_rw_process_latency": {
-"avgcount": 54282,
-"sum": 1463.892052701,
-"avgtime": 0.026968277
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_rw_prepare_latency": {
-"avgcount": 54298,
-"sum": 81.568120564,
-"avgtime": 0.001502230
+"avgcount": 0,
+"sum": 0.0,
+"avgtime": 0.0
 },
 "op_before_queue_op_lat": {
-"avgcount": 25469574,
-"sum": 6654.779033909,
-"avgtime": 0.000261283
+"avgcount": 4307123,
+"sum": 2361.603323307,
+"avgtime": 0.000548301
 },
 "op_before_dequeue_op_lat": {
-"avgcount": 25557113,
-"sum": 390126.982287712,
-"avgtime": 0.015264908
+"avgcount": 4307125,
+"sum": 247993.250531351,
+"avgtime": 0.057577444
 },
-"subop": 6438431,
-"subop_in_bytes": 

Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-23 Thread Eugen Block

Greg, thanks for your reply.


So, this is actually just noisy logging from the client processing an
OSDMap. That should probably be turned down, as it's not really an
indicator of...anything...as far as I can tell.


I usually stick with the defaults:

host4:~ # ceph daemon osd.21 config show | grep debug_client
"debug_client": "0/5",
host4:~ # ceph daemon osd.22 config show | grep debug_client
"debug_client": "0/5",
host4:~ # ceph daemon osd.23 config show | grep debug_client
"debug_client": "0/5",
host4:~ # ceph daemon osd.24 config show | grep debug_client
"debug_client": "0/5",

Should the memory level be turned off as well? I'll give it a try.
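In case it helps anyone else: the levels can be changed at runtime, either
cluster-wide or per daemon via the admin socket. A sketch, not necessarily
the right knob for the dmesg noise Greg mentions (those lines come from the
kernel client itself, not from the OSD daemons):

---cut here---
host1:~ # ceph tell osd.* injectargs '--debug_client 0/0'
# or only for a single daemon:
host4:~ # ceph daemon osd.21 config set debug_client 0/0
---cut here---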

Regards,
Eugen


Quoting Gregory Farnum:


On Wed, Aug 22, 2018 at 6:46 AM Eugen Block  wrote:


Hello *,

we have an issue with a Luminous cluster (all 12.2.5, except one on
12.2.7) for RBD (OpenStack) and CephFS. This is the osd tree:

host1:~ # ceph osd tree
ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
 -1       22.57602 root default
 -4        1.81998     host host5
 14   hdd  0.90999         osd.14      up  0.84999     0.5
 15   hdd  0.90999         osd.15      up  0.84999     0.5
 -2        6.27341     host host1
  1   hdd  0.92429         osd.1       up      1.0     1.0
  4   hdd  0.92429         osd.4       up      1.0     1.0
  6   hdd  0.92429         osd.6       up      1.0     1.0
 13   hdd  0.92429         osd.13      up      1.0     1.0
 16   hdd  0.92429         osd.16      up      1.0     1.0
 18   hdd  0.92429         osd.18      up      1.0     1.0
 10   ssd  0.72769         osd.10      up      1.0     1.0
 -3        6.27341     host host2
  2   hdd  0.92429         osd.2       up      1.0     1.0
  5   hdd  0.92429         osd.5       up      1.0     1.0
  7   hdd  0.92429         osd.7       up      1.0     1.0
 12   hdd  0.92429         osd.12      up      1.0     1.0
 17   hdd  0.92429         osd.17      up      1.0     1.0
 19   hdd  0.92429         osd.19      up      1.0     1.0
  9   ssd  0.72769         osd.9       up      1.0     1.0
 -5        4.57043     host host3
  0   hdd  0.92429         osd.0       up      1.0     1.0
  3   hdd  0.92429         osd.3       up      1.0     1.0
  8   hdd  0.92429         osd.8       up      1.0     1.0
 11   hdd  0.92429         osd.11      up      1.0     1.0
 20   ssd  0.87329         osd.20      up      1.0       0
-16        3.63879     host host4
 21   hdd  0.90970         osd.21      up      1.0       0
 22   hdd  0.90970         osd.22      up      1.0       0
 23   hdd  0.90970         osd.23      up      1.0       0
 24   hdd  0.90970         osd.24      up      1.0       0


A couple of weeks ago a new host was added to the cluster (host4),
containing four bluestore OSDs (HDD) with block.db on LVM (SSD). All
went well and the cluster was in HEALTH_OK state for some time.

Then suddenly we experienced flapping OSDs, first on host3 (MON, MGR,
OSD) for a single OSD (OSD.20 on SSD). Later host4 (OSD only) started
flapping, too; this time all four OSDs (OSD.21 - OSD.24) were
affected. Only after two reboots did the node come back up.

We found segfaults from safe_timer and were pretty sure that the
cluster was hit by [1]; it all sounded very much like our experience.
That's why we upgraded the new host to 12.2.7 first and waited
before upgrading the other nodes, in case other issues came
up. Two days later the same host was flapping again, but without a
segfault or any other trace of the cause. We started to assume that
the segfault could be a result of the flapping, not its cause.

Since it seems impossible to predict the flapping, we don't have debug
logs for those OSDs. But the usual logs don't reveal anything
extraordinary. The cluster has been healthy again for 5 days now.
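(Note for the next occurrence: debug levels can be raised at runtime without
restarting the OSDs, so the data is there when it happens again. The levels
below are only an example:)

---cut here---
host1:~ # ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'
# back to the defaults once enough data has been captured:
host1:~ # ceph tell osd.* injectargs '--debug_osd 1/5 --debug_ms 0/5'
---cut here---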

Then I found some clients (CephFS mounted for home directories and
shared storage for compute nodes) reporting this multiple times:

---cut here---
[Mi Aug 22 10:31:33 2018] libceph: osd21 down
[Mi Aug 22 10:31:33 2018] libceph: osd22 down
[Mi Aug 22 10:31:33 2018] libceph: osd23 down
[Mi Aug 22 10:31:33 2018] libceph: osd24 down
[Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x0 (out)
[Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x1 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd21 up
[Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x1 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd22 up
[Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x1 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd24 up
[Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x1 (in)
[Mi Aug 22 10:31:33 2018] libceph: osd23 up
---cut here---

This output repeats about 20 times per OSD (except for osd20, only one
occurrence). But there's no health warning, no trace of that in the
logs, no flapping (yet?), as if nothing has happened.
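One way to cross-check what the cluster itself recorded at those times is to
fetch the historical osdmaps around the dmesg timestamps and look at the
state of the affected OSDs. A sketch; the epoch below is only a placeholder:

---cut here---
host1:~ # ceph osd getmap 12345 -o /tmp/osdmap.12345
host1:~ # osdmaptool /tmp/osdmap.12345 --print | grep -E 'osd\.2[1-4] '
---cut here---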

Re: [ceph-users] Clients report OSDs down/up (dmesg) nothing in Ceph logs (flapping OSDs)

2018-08-22 Thread Gregory Farnum
On Wed, Aug 22, 2018 at 6:46 AM Eugen Block  wrote:

> Hello *,
>
> we have an issue with a Luminous cluster (all 12.2.5, except one on
> 12.2.7) for RBD (OpenStack) and CephFS. This is the osd tree:
>
> host1:~ # ceph osd tree
> ID  CLASS WEIGHT   TYPE NAME       STATUS REWEIGHT PRI-AFF
>  -1       22.57602 root default
>  -4        1.81998     host host5
>  14   hdd  0.90999         osd.14      up  0.84999     0.5
>  15   hdd  0.90999         osd.15      up  0.84999     0.5
>  -2        6.27341     host host1
>   1   hdd  0.92429         osd.1       up      1.0     1.0
>   4   hdd  0.92429         osd.4       up      1.0     1.0
>   6   hdd  0.92429         osd.6       up      1.0     1.0
>  13   hdd  0.92429         osd.13      up      1.0     1.0
>  16   hdd  0.92429         osd.16      up      1.0     1.0
>  18   hdd  0.92429         osd.18      up      1.0     1.0
>  10   ssd  0.72769         osd.10      up      1.0     1.0
>  -3        6.27341     host host2
>   2   hdd  0.92429         osd.2       up      1.0     1.0
>   5   hdd  0.92429         osd.5       up      1.0     1.0
>   7   hdd  0.92429         osd.7       up      1.0     1.0
>  12   hdd  0.92429         osd.12      up      1.0     1.0
>  17   hdd  0.92429         osd.17      up      1.0     1.0
>  19   hdd  0.92429         osd.19      up      1.0     1.0
>   9   ssd  0.72769         osd.9       up      1.0     1.0
>  -5        4.57043     host host3
>   0   hdd  0.92429         osd.0       up      1.0     1.0
>   3   hdd  0.92429         osd.3       up      1.0     1.0
>   8   hdd  0.92429         osd.8       up      1.0     1.0
>  11   hdd  0.92429         osd.11      up      1.0     1.0
>  20   ssd  0.87329         osd.20      up      1.0       0
> -16        3.63879     host host4
>  21   hdd  0.90970         osd.21      up      1.0       0
>  22   hdd  0.90970         osd.22      up      1.0       0
>  23   hdd  0.90970         osd.23      up      1.0       0
>  24   hdd  0.90970         osd.24      up      1.0       0
>
>
> A couple of weeks ago a new host was added to the cluster (host4),
> containing four bluestore OSDs (HDD) with block.db on LVM (SSD). All
> went well and the cluster was in HEALTH_OK state for some time.
>
> Then suddenly we experienced flapping OSDs, first on host3 (MON, MGR,
> OSD) for a single OSD (OSD.20 on SSD). Later host4 (OSD only) started
> flapping, too; this time all four OSDs (OSD.21 - OSD.24) were
> affected. Only after two reboots did the node come back up.
>
> We found segfaults from safe_timer and were pretty sure that the
> cluster was hit by [1]; it all sounded very much like our experience.
> That's why we upgraded the new host to 12.2.7 first and waited
> before upgrading the other nodes, in case other issues came
> up. Two days later the same host was flapping again, but without a
> segfault or any other trace of the cause. We started to assume that
> the segfault could be a result of the flapping, not its cause.
>
> Since it seems impossible to predict the flapping, we don't have debug
> logs for those OSDs. But the usual logs don't reveal anything
> extraordinary. The cluster has been healthy again for 5 days now.
>
> Then I found some clients (CephFS mounted for home directories and
> shared storage for compute nodes) reporting this multiple times:
>
> ---cut here---
> [Mi Aug 22 10:31:33 2018] libceph: osd21 down
> [Mi Aug 22 10:31:33 2018] libceph: osd22 down
> [Mi Aug 22 10:31:33 2018] libceph: osd23 down
> [Mi Aug 22 10:31:33 2018] libceph: osd24 down
> [Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x0 (out)
> [Mi Aug 22 10:31:33 2018] libceph: osd21 weight 0x1 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd21 up
> [Mi Aug 22 10:31:33 2018] libceph: osd22 weight 0x1 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd22 up
> [Mi Aug 22 10:31:33 2018] libceph: osd24 weight 0x1 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd24 up
> [Mi Aug 22 10:31:33 2018] libceph: osd23 weight 0x1 (in)
> [Mi Aug 22 10:31:33 2018] libceph: osd23 up
> ---cut here---
>
> This output repeats about 20 times per OSD (except for osd20, only one
> occurrence). But there's no health warning, no trace of that in the
> logs, no flapping (yet?), as if nothing has happened. Since these are
> the OSDs that were affected by the flapping, there has to be a
> connection, but I can't seem to find it.
>
> Why isn't there anything in the logs related to these dmesg events?
> Why would a client report OSDs down if they haven't been? We checked
> the disks for errors and searched for network issues; no hint of
> anything going wrong.
>

So, this is actually just noisy logging from the client processing an
OSDMap. That should probably be turned down, as it's not really an
indicator of...anything...as far as I can tell.