On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> We are starting to work on it. First step is getting the structure out and
> dumping the current value as you say.
>
> And you were correct we did not run force_create_pg.
Great.
So, eager to see what the current structure is... please attach once you
have it.
The new replacement one should look like this (when hexdump -C'd):
00000000 02 01 18 00 00 00 10 00 00 00 00 00 00 00 01 00 |................|
00000010 00 00 42 00 00 00 00 00 00 00 00 00 00 00 |..B...........|
0000001e
...except that from byte 6 you want to put in a recent OSDMap epoch, in
hex, little endian (least significant byte first), in place of the 0x10
that is there now. It should dump like this:
$ ceph-dencoder type creating_pgs_t import myfile decode dump_json
{
"last_scan_epoch": 16, <--- but with a recent epoch here
"creating_pgs": [],
"queue": [],
"created_pools": [
66
]
}
sage
>
> > On 3 Oct 2018, at 17:52, Sage Weil <[email protected]> wrote:
> >
> > On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >> Sage,
> >>
> >> Pool 66 is the only pool it shows right now. This a pool created months
> >> ago.
> >> ceph osd lspools
> >> 66 mypool
> >>
> >> As we recreated mon db from OSDs, the pools for MDS was unusable. So we
> >> deleted them.
> >> After we create another cephfs fs and pools we started MDS and it stucked
> >> on creation. So we stopped MDS and removed fs and fs pools. Right now we
> >> do not have MDS running nor we have cephfs related things.
> >>
> >> ceph fs dump
> >> dumped fsmap epoch 1 e1
> >> enable_multiple, ever_enabled_multiple: 0,0
> >> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> >> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> >> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor
> >> table,9=file layout v2,10=snaprealm v2}
> >> legacy client fscid: -1
> >>
> >> No filesystems configured
> >>
> >> ceph fs ls
> >> No filesystems enabled
> >>
> >> Now pool 66 seems to only pool we have and it has been created months ago.
> >> Then I guess there is something hidden out there.
> >>
> >> Is there any way to find and delete it?
> >
> > Ok, I'm concerned that the creating pg is in there if this is an old
> > pool... did you perhaps run force_create_pg at some point? Assuming you
> > didn't, I think this is a bug in the process for rebuilding the mon
> > store.. one that doesn't normally come up because the impact is this
> > osdmap scan that is cheap in our test scenarios but clearly not cheap for
> > your aged cluster.
> >
> > In any case, there is a way to clear those out of the mon, but it's a bit
> > dicey.
> >
> > 1. stop all mons
> > 2. make a backup of all mons
> > 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating
> > key=creating key on one of the mons
> > 4. dump the object with ceph-dencoder type creating_pgs_t import FILE
> > dump_json
> > 5. hex edit the structure to remove all of the creating pgs, and adds pool
> > 66 to the created_pgs member.
> > 6. verify with ceph-dencoder dump that the edit was correct...
> > 7. inject the updated structure into all of the mons
> > 8. start all mons
> >
> > 4-6 will probably be an iterative process... let's start by getting the
> > structure out and dumping the current value?
> >
> > The code to refer to to understand the structure is src/mon/CreatingPGs.h
> > encode/decode methods.
> >
> > sage
> >
> >
> >>
> >>
> >>> On 3 Oct 2018, at 16:46, Sage Weil <[email protected]> wrote:
> >>>
> >>> Oh... I think this is the problem:
> >>>
> >>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883
> >>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196
> >>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196
> >>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>>
> >>> You are in the midst of creating new pgs, and unfortunately pg create is
> >>> one of the last remaining places where the OSDs need to look at a full
> >>> history of map changes between then and the current map epoch. In this
> >>> case, the pool was created in 60196 and it is now 72883, ~12k epochs
> >>> later.
> >>>
> >>> What is this new pool for? Is it still empty, and if so, can we delete
> >>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end
> >>> and restarting the OSDs will make these pg_creates go away.
> >>>
> >>> s
> >>>
> >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> It seems nothing has changed.
> >>>>
> >>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/
> >>>> <https://paste.ubuntu.com/p/MtvTr5HYW4/>
> >>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/
> >>>> <https://paste.ubuntu.com/p/7Sx64xGzkR/>
> >>>>
> >>>>
> >>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <[email protected]> wrote:
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>>
> >>>>> You can also reduce the osd map updates by adding this to your ceph
> >>>>> config file. "osd crush update on start = false". This should remove
> >>>>> and update that is generated when osd starts.
> >>>>>
> >>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 mon.SRV-SBKUARK14@0(leader)
> >>>>> e14 handle_command mon_command({"prefix": "osd crush
> >>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 log_channel(audit) log [INF] :
> >>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>>>> dispatch
> >>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 mon.SRV-SBKUARK14@0(leader)
> >>>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>>>> "root=default"]} v 0) v1
> >>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 log_channel(audit) log [INF] :
> >>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0
> >>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>>>> 'osd.47' initial_weight 3.6396 at location
> >>>>> {host=SRV-SEKUARK8,root=default}
> >>>>> 2018-10-03 14:03:22.250 7fe1615e0700 1
> >>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> >>>>> enabled
> >>>>>
> >>>>>
> >>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi Sage,
> >>>>>>
> >>>>>> Thank you for your response. Now I am sure this incident is going to
> >>>>>> be resolved.
> >>>>>>
> >>>>>> The problem started when 7 server crashed same time and they came back
> >>>>>> after ~5 minutes.
> >>>>>>
> >>>>>> Two of our 3 mon services were restarted in this crash. Since mon
> >>>>>> services are enabled they should be started nearly at the same time. I
> >>>>>> dont know if this makes any difference but some of the guys on IRC
> >>>>>> told it is required that they start in order not at the same time.
> >>>>>> Otherwise it could break things badly.
> >>>>>>
> >>>>>> After 9 days we still see 3400-3500 active+clear PG. But in the end we
> >>>>>> have so many STUCK request and our cluster can not heal itself.
> >>>>>>
> >>>>>> When we set noup flag, OSDs can catch up epoch easily. But when we
> >>>>>> unset the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >>>>>> I/O load on all of my OSD disks are at around %95 utilization and
> >>>>>> never ends. CPU and RAM usage are OK.
> >>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >>>>>>
> >>>>>> Also we tried to change RBD pool replication size 2 to 1. Our goal was
> >>>>>> the eliminate older PG's and leaving cluster with good ones.
> >>>>>> With replication size=1 we saw "%13 PGS not active”. But it didn’t
> >>>>>> solve our problem.
> >>>>>>
> >>>>>> Of course we have to save %100 of data. But we feel like even saving
> >>>>>> %50 of our data will be make us very happy right now.
> >>>>>>
> >>>>>> This is what happens when the cluster starts. I believe it explains
> >>>>>> the whole story very nicely.
> >>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>>>>>
> >>>>>> This is our ceph.conf:
> >>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>>>>>
> >>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph
> >>>>>> health”:
> >>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>>>>>
> >>>>>> This is pg dump:
> >>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/
> >>>>>>
> >>>>>> This is iostat & perf top:
> >>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>>>>>
> >>>>>> This strace output of ceph-osd:
> >>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>>>>>
> >>>>>> This is OSD log (default debug):
> >>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>>>>>
> >>>>>> This is leader MON log (default debug):
> >>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>>>>>
> >>>>>> These are OSDs failed to start. Total number is 58.
> >>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>>>>>
> >>>>>>
> >>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and
> >>>>>> debug_filestore = 20.
> >>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>>>>>
> >>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and
> >>>>>> debug_filestore = 20.
> >>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>>>>>
> >>>>>> As far as I understand OSD catchs up with the mon epoch and exceeds
> >>>>>> mon epoch somehow??
> >>>>>>
> >>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8
> >>>>>> e60196@2018-09-28 23:57:08.251119
> >>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642
> >>>>>> build_initial_pg_history 66.f8 created 60196
> >>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 -
> >>>>>> loading and decoding 0x19da8400
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process
> >>>>>> 66.d8 to_process <> waiting <> waiting_peering {}
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process
> >>>>>> OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested:
> >>>>>> 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process
> >>>>>> 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642
> >>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)>
> >>>>>> waiting <> waiting_peering {}
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process
> >>>>>> OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested:
> >>>>>> 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642
> >>>>>> pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934]
> >>>>>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206
> >>>>>> les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642
> >>>>>> pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={}
> >>>>>> ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642
> >>>>>> NullEvt +create_info
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642
> >>>>>> queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process
> >>>>>> empty q, waiting
> >>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196
> >>>>>> 50012 bytes
> >>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 -
> >>>>>> loading and decoding 0x19da8880
> >>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197
> >>>>>> 50012 bytes
> >>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 -
> >>>>>> loading and decoding 0x19da9180
> >>>>>>
> >>>>>>
> >>>>>> On 3 Oct 2018, at 05:14, Sage Weil <[email protected]> wrote:
> >>>>>>
> >>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you
> >>>>>> should
> >>>>>> only use it in very specific circumstances when directed by a
> >>>>>> developer.
> >>>>>> In such cases it will allow a stuck PG to peer. But you're not
> >>>>>> getting to
> >>>>>> that point...you're seeing some sort of resource exhaustion.
> >>>>>>
> >>>>>> The noup trick works when OSDs are way behind on maps and all need to
> >>>>>> catch up. The way to tell if they are behind is by looking at the
> >>>>>> 'ceph
> >>>>>> daemon osd.NNN status' output and comparing to the latest OSDMap epoch
> >>>>>> tha
> >>>>>> t the mons have. Were they really caught up when you unset noup?
> >>>>>>
> >>>>>> I'm just catching up and haven't read the whole thread but I haven't
> >>>>>> seen
> >>>>>> anything that explains why teh OSDs are dong lots of disk IO.
> >>>>>> Catching up
> >>>>>> on maps could explain it but not why they wouldn't peer once they were
> >>>>>> all
> >>>>>> marked up...
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>>>>>
> >>>>>> Anyone heart about osd_find_best_info_ignore_history_les = true ?
> >>>>>> Is that be usefull here? There is such a less information about it.
> >>>>>>
> >>>>>> Goktug Yildirim <[email protected]> şunları yazdı (2 Eki 2018
> >>>>>> 22:11):
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I
> >>>>>> read somewhere that that will do the proper sizing.
> >>>>>> For the blustore cache size I have plenty of RAM. I will increase 8GB
> >>>>>> for each and decide a more calculated number after cluster settles.
> >>>>>>
> >>>>>> For the osd map loading I’ve also figured it out. And it is in loop.
> >>>>>> For that reason I started cluster with noup flag and waited OSDs to
> >>>>>> reach the uptodate epoch number. After that I unset noup. But I did
> >>>>>> not pay attention to manager logs. Let me check it, thank you!
> >>>>>>
> >>>>>> I am not forcing jmellac or anything else really. I have a very
> >>>>>> standard installation and no tweaks or tunings. All we ask for the
> >>>>>> stability versus speed from the begining. And here we are :/
> >>>>>>
> >>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <[email protected]> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>
> >>>>>> I can see some issues from the osd log file. You have an extremely low
> >>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. I would
> >>>>>> recommend cranking up rocksdb cache size as much as possible. If you
> >>>>>> have RAM you can also increase bluestores cache size for hdd. Default
> >>>>>> is 1GB be as liberal as you can without getting OOM kills. You also
> >>>>>> have lots of osd map loading and decoding in the log. Are you sure all
> >>>>>> monitors/managers/osds are up to date? Plus make sure you aren't
> >>>>>> forcing jemalloc loading. I had a funny interaction after upgrading to
> >>>>>> mimic.
> >>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>> Hello Darius,
> >>>>>>
> >>>>>> Thanks for reply!
> >>>>>>
> >>>>>> The main problem is we can not query PGs. “ceph pg 67.54f query” does
> >>>>>> stucks and wait forever since OSD is unresponsive.
> >>>>>> We are certain that OSD gets unresponsive as soon as it UP. And we are
> >>>>>> certain that OSD responds again after its disk utilization stops.
> >>>>>>
> >>>>>> So we have a small test like that:
> >>>>>> * Stop all OSDs (168 of them)
> >>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8
> >>>>>> mins to finish. Only after that “ceph pg 67.54f query” works!
> >>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & OSD2
> >>>>>> starts %95 disk utilization. This takes 17 minutes to finish.
> >>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it
> >>>>>> takes 25 mins to settle.
> >>>>>> * If you happen to start 5 of them at the same all of the OSDs start
> >>>>>> high I/O again. And it takes 1 hour to finish.
> >>>>>>
> >>>>>> So in the light of these findings we flagged noup, started all OSDs.
> >>>>>> At first there was no I/O. After 10 minutes we unset noup. All of 168
> >>>>>> OSD started to make high I/O. And we thought that if we wait long
> >>>>>> enough it will finish & OSDs will be responsive again. After 24hours
> >>>>>> they did not because I/O did not finish or even slowed down.
> >>>>>> One can think that is a lot of data there to scan. But it is just 33TB.
> >>>>>>
> >>>>>> So at short we dont know which PG is stuck so we can remove it.
> >>>>>>
> >>>>>> However we met an weird thing half an hour ago. We exported the same
> >>>>>> PG from two different OSDs. One was 4.2GB and the other is 500KB! So
> >>>>>> we decided to export all OSDs for backup. Then we will delete strange
> >>>>>> sized ones and start the cluster all over. Maybe then we could solve
> >>>>>> the stucked or unfound PGs as you advise.
> >>>>>>
> >>>>>> Any thought would be greatly appreciated.
> >>>>>>
> >>>>>>
> >>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <[email protected]> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> Currently you have 15 objects missing. I would recommend finding them
> >>>>>> and making backups of them. Ditch all other osds that are failing to
> >>>>>> start and concentrate on bringing online those that have missing
> >>>>>> objects. Then slowly turn off nodown and noout on the cluster and see
> >>>>>> if it stabilises. If it stabilises leave these setting if not turn
> >>>>>> them back on.
> >>>>>> Now get some of the pg's that are blocked and querry the pgs to check
> >>>>>> why they are blocked. Try removing as much blocks as possible and then
> >>>>>> remove the norebalance/norecovery flags and see if it starts to fix
> >>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>
> >>>>>> One of ceph experts indicated that bluestore is somewhat preview tech
> >>>>>> (as for Redhat).
> >>>>>> So it could be best to checkout bluestore and rocksdb. There are some
> >>>>>> tools to check health and also repair. But there are limited
> >>>>>> documentation.
> >>>>>> Anyone who has experince with it?
> >>>>>> Anyone lead/help to a proper check would be great.
> >>>>>> Goktug Yildirim <[email protected]>, 1 Eki 2018 Pzt, 22:55
> >>>>>> tarihinde şunu yazdı:
> >>>>>>
> >>>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> We have recently upgraded from luminous to mimic. It’s been 6 days
> >>>>>> since this cluster is offline. The long short story is here:
> >>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>>>>>
> >>>>>> I’ve also CC’ed developers since I believe this is a bug. If this is
> >>>>>> not to correct way I apology and please let me know.
> >>>>>>
> >>>>>> For the 6 days lots of thing happened and there were some outcomes
> >>>>>> about the problem. Some of them was misjudged and some of them are not
> >>>>>> looked deeper.
> >>>>>> However the most certain diagnosis is this: each OSD causes very high
> >>>>>> disk I/O to its bluestore disk (WAL and DB are fine). After that OSDs
> >>>>>> become unresponsive or very very less responsive. For example "ceph
> >>>>>> tell osd.x version” stucks like for ever.
> >>>>>>
> >>>>>> So due to unresponsive OSDs cluster does not settle. This is our
> >>>>>> problem!
> >>>>>>
> >>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>>>>>
> >>>>>> Here is the latest ceph status:
> >>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/.
> >>>>>>
> >>>>>> This is the status after we started all of the OSDs 24 hours ago.
> >>>>>> Some of the OSDs are not started. However it didnt make any difference
> >>>>>> when all of them was online.
> >>>>>>
> >>>>>> Here is the debug=20 log of an OSD which is same for all others:
> >>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>> As we figure out there is a loop pattern. I am sure it wont caught
> >>>>>> from eye.
> >>>>>>
> >>>>>> This the full log the same OSD.
> >>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>>>>>
> >>>>>> Here is the strace of the same OSD process:
> >>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>>>>>
> >>>>>> Recently we hear more to uprade mimic. I hope none get hurts as we do.
> >>>>>> I am sure we have done lots of mistakes to let this happening. And
> >>>>>> this situation may be a example for other user and could be a
> >>>>>> potential bug for ceph developer.
> >>>>>>
> >>>>>> Any help to figure out what is going on would be great.
> >>>>>>
> >>>>>> Best Regards,
> >>>>>> Goktug Yildirim
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> ceph-users mailing list
> >>>>>> [email protected]
> >>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> _______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com