These are the ceph-objectstore-tool logs for OSD.0: https://paste.ubuntu.com/p/jNwf4DC46H/
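We ran it roughly along these lines, with osd.0 stopped and using its default data path (this is only a sketch; the exact --op may have been different):

    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list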
There is something wrong, but we are not sure whether we are using the tool incorrectly or whether something is wrong with the OSD.

> On 4 Oct 2018, at 06:17, Sage Weil <[email protected]> wrote:
>
> On Thu, 4 Oct 2018, Goktug Yildirim wrote:
>> This is our cluster state right now. I can reach rbd list and that's good!
>> Thanks a lot Sage!!!
>> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/
>
> Progress! Not out of the woods yet, though...
>
>> As you can see we have 2 unfound PGs since some of our OSDs cannot start. 58
>> OSDs give different errors.
>> How can I fix these OSDs? If I remember correctly it should not be so much
>> trouble.
>>
>> These are the failing OSDs' logs.
>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>
> These are both failing in rocksdb code, with something like
>
> Can't access /032949.sst: NotFound:
>
> Can you check whether that .sst file actually exists? Might be a
> weird path issue.
>
>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>
> These are failing in the rocksdb CheckConsistency code. Not sure what to
> make of that.
>
>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> https://paste.ubuntu.com/p/smyT6Y976b/
>
> These are failing in BlueStore code. The ceph-bluestore-tool fsck may help
> here, can you give it a shot?
>
> sage
>
>
>>
>>> On 3 Oct 2018, at 21:37, Sage Weil <[email protected]> wrote:
>>>
>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>>> I'm so sorry, I missed the "out" parameter. My bad..
>>>> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
>>>
>>> Excellent, thanks. That looks like it confirms the problem is that the
>>> recovery tool didn't repopulate the creating pgs properly.
>>>
>>> If you take that 30 byte file I sent earlier (as hex) and update the
>>> osdmap epoch to the latest on the mon, confirm it decodes and dumps
>>> properly, and then inject it on the 3 mons, that should get you past this
>>> hump (and hopefully back up!).
>>>
>>> sage
>>>
>>>
>>>>
>>>> Sage Weil <[email protected]> wrote (3 Oct 2018, 21:13):
>>>>
>>>>> I bet the kvstore output it in a hexdump format? There is another option
>>>>> to get the raw data iirc
>>>>>
>>>>>
>>>>>
>>>>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM
>>>>>> <[email protected]> wrote:
>>>>>> I changed the file name to make it clear.
>>>>>> When I use your command with "+decode" I'm getting an error like this:
>>>>>>
>>>>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>>>>>> error: buffer::malformed_input: void
>>>>>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer
>>>>>> understand old encoding version 2 < 111
>>>>>>
>>>>>> My ceph version: 13.2.2
>>>>>>
>>>>>> On Wed, 3 Oct 2018 at 20:46, Sage Weil <[email protected]> wrote:
>>>>>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>>>>>>> If I didn't do it wrong, I got the output as below.
>>>>>>>>
>>>>>>>> ceph-kvstore-tool rocksdb
>>>>>>>> /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating
>>>>>>>> creating > dump
>>>>>>>> 2018-10-03 20:08:52.070 7f07f5659b80 1 rocksdb: do_open column
>>>>>>>> families: [default]
>>>>>>>>
>>>>>>>> ceph-dencoder type creating_pgs_t import dump dump_json
>>>>>>>
>>>>>>> Sorry, should be
>>>>>>>
>>>>>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>>>>>>
>>>>>>> s
>>>>>>>
>>>>>>>> {
>>>>>>>> "last_scan_epoch": 0,
>>>>>>>> "creating_pgs": [],
>>>>>>>> "queue": [],
>>>>>>>> "created_pools": []
>>>>>>>> }
>>>>>>>>
>>>>>>>> You can find the "dump" link below.
>>>>>>>>
>>>>>>>> dump:
>>>>>>>> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>>>>>>>>
>>>>>>>>
>>>>>>>> Sage Weil <[email protected]> wrote (3 Oct 2018, 18:45):
>>>>>>>>
>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>> We are starting to work on it. First step is getting the structure
>>>>>>>>>> out and dumping the current value as you say.
>>>>>>>>>>
>>>>>>>>>> And you were correct, we did not run force_create_pg.
>>>>>>>>>
>>>>>>>>> Great.
>>>>>>>>>
>>>>>>>>> So, eager to see what the current structure is... please attach once
>>>>>>>>> you have it.
>>>>>>>>>
>>>>>>>>> The new replacement one should look like this (when hexdump -C'd):
>>>>>>>>>
>>>>>>>>> 00000000  02 01 18 00 00 00 10 00 00 00 00 00 00 00 01 00  |................|
>>>>>>>>> 00000010  00 00 42 00 00 00 00 00 00 00 00 00 00 00        |..B...........|
>>>>>>>>> 0000001e
>>>>>>>>>
>>>>>>>>> ...except that from byte 6 you want to put in a recent OSDMap epoch,
>>>>>>>>> in hex, little endian (least significant byte first), in place of the
>>>>>>>>> 0x10 that is there now. It should dump like this:
>>>>>>>>>
>>>>>>>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>>>>>>>> {
>>>>>>>>>     "last_scan_epoch": 16,   <--- but with a recent epoch here
>>>>>>>>>     "creating_pgs": [],
>>>>>>>>>     "queue": [],
>>>>>>>>>     "created_pools": [
>>>>>>>>>         66
>>>>>>>>>     ]
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On 3 Oct 2018, at 17:52, Sage Weil <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>>>>>>>>>>> Sage,
>>>>>>>>>>>>
>>>>>>>>>>>> Pool 66 is the only pool it shows right now. This is a pool created
>>>>>>>>>>>> months ago.
>>>>>>>>>>>> ceph osd lspools
>>>>>>>>>>>> 66 mypool
>>>>>>>>>>>>
>>>>>>>>>>>> As we recreated the mon db from OSDs, the pools for MDS were unusable.
>>>>>>>>>>>> So we deleted them.
>>>>>>>>>>>> After we created another cephfs fs and pools, we started MDS and it
>>>>>>>>>>>> got stuck on creation. So we stopped MDS and removed the fs and fs
>>>>>>>>>>>> pools. Right now we do not have MDS running, nor do we have cephfs
>>>>>>>>>>>> related things.
>>>>>>>>>>>> >>>>>>>>>>>> ceph fs dump >>>>>>>>>>>> dumped fsmap epoch 1 e1 >>>>>>>>>>>> enable_multiple, ever_enabled_multiple: 0,0 >>>>>>>>>>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client >>>>>>>>>>>> writeable ranges,3=default file layouts on dirs,4=dir inode in >>>>>>>>>>>> separate object,5=mds uses versioned encoding,6=dirfrag is stored >>>>>>>>>>>> in omap,8=no anchor table,9=file layout v2,10=snaprealm v2} >>>>>>>>>>>> legacy client fscid: -1 >>>>>>>>>>>> >>>>>>>>>>>> No filesystems configured >>>>>>>>>>>> >>>>>>>>>>>> ceph fs ls >>>>>>>>>>>> No filesystems enabled >>>>>>>>>>>> >>>>>>>>>>>> Now pool 66 seems to only pool we have and it has been created >>>>>>>>>>>> months ago. Then I guess there is something hidden out there. >>>>>>>>>>>> >>>>>>>>>>>> Is there any way to find and delete it? >>>>>>>>>>> >>>>>>>>>>> Ok, I'm concerned that the creating pg is in there if this is an >>>>>>>>>>> old >>>>>>>>>>> pool... did you perhaps run force_create_pg at some point? >>>>>>>>>>> Assuming you >>>>>>>>>>> didn't, I think this is a bug in the process for rebuilding the mon >>>>>>>>>>> store.. one that doesn't normally come up because the impact is >>>>>>>>>>> this >>>>>>>>>>> osdmap scan that is cheap in our test scenarios but clearly not >>>>>>>>>>> cheap for >>>>>>>>>>> your aged cluster. >>>>>>>>>>> >>>>>>>>>>> In any case, there is a way to clear those out of the mon, but it's >>>>>>>>>>> a bit >>>>>>>>>>> dicey. >>>>>>>>>>> >>>>>>>>>>> 1. stop all mons >>>>>>>>>>> 2. make a backup of all mons >>>>>>>>>>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating >>>>>>>>>>> key=creating key on one of the mons >>>>>>>>>>> 4. dump the object with ceph-dencoder type creating_pgs_t import >>>>>>>>>>> FILE dump_json >>>>>>>>>>> 5. hex edit the structure to remove all of the creating pgs, and >>>>>>>>>>> adds pool >>>>>>>>>>> 66 to the created_pgs member. >>>>>>>>>>> 6. verify with ceph-dencoder dump that the edit was correct... >>>>>>>>>>> 7. inject the updated structure into all of the mons >>>>>>>>>>> 8. start all mons >>>>>>>>>>> >>>>>>>>>>> 4-6 will probably be an iterative process... let's start by getting >>>>>>>>>>> the >>>>>>>>>>> structure out and dumping the current value? >>>>>>>>>>> >>>>>>>>>>> The code to refer to to understand the structure is >>>>>>>>>>> src/mon/CreatingPGs.h >>>>>>>>>>> encode/decode methods. >>>>>>>>>>> >>>>>>>>>>> sage >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On 3 Oct 2018, at 16:46, Sage Weil <[email protected]> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Oh... I think this is the problem: >>>>>>>>>>>>> >>>>>>>>>>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op >>>>>>>>>>>>> osd_pg_create(e72883 >>>>>>>>>>>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 >>>>>>>>>>>>> 66.124:60196 >>>>>>>>>>>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 >>>>>>>>>>>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916 >>>>>>>>>>>>> >>>>>>>>>>>>> You are in the midst of creating new pgs, and unfortunately pg >>>>>>>>>>>>> create is >>>>>>>>>>>>> one of the last remaining places where the OSDs need to look at a >>>>>>>>>>>>> full >>>>>>>>>>>>> history of map changes between then and the current map epoch. >>>>>>>>>>>>> In this >>>>>>>>>>>>> case, the pool was created in 60196 and it is now 72883, ~12k >>>>>>>>>>>>> epochs >>>>>>>>>>>>> later. >>>>>>>>>>>>> >>>>>>>>>>>>> What is this new pool for? Is it still empty, and if so, can we >>>>>>>>>>>>> delete >>>>>>>>>>>>> it? 
If yes, I'm ~70% sure that will then get cleaned out at the >>>>>>>>>>>>> mon end >>>>>>>>>>>>> and restarting the OSDs will make these pg_creates go away. >>>>>>>>>>>>> >>>>>>>>>>>>> s >>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>> >>>>>>>>>>>>>> It seems nothing has changed. >>>>>>>>>>>>>> >>>>>>>>>>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ >>>>>>>>>>>>>> <https://paste.ubuntu.com/p/MtvTr5HYW4/> >>>>>>>>>>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ >>>>>>>>>>>>>> <https://paste.ubuntu.com/p/7Sx64xGzkR/> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius >>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> You can also reduce the osd map updates by adding this to your >>>>>>>>>>>>>>> ceph >>>>>>>>>>>>>>> config file. "osd crush update on start = false". This should >>>>>>>>>>>>>>> remove >>>>>>>>>>>>>>> and update that is generated when osd starts. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 >>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader) >>>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush >>>>>>>>>>>>>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1 >>>>>>>>>>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 log_channel(audit) log >>>>>>>>>>>>>>> [INF] : >>>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' >>>>>>>>>>>>>>> cmd=[{"prefix": >>>>>>>>>>>>>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]: >>>>>>>>>>>>>>> dispatch >>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 >>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader) >>>>>>>>>>>>>>> e14 handle_command mon_command({"prefix": "osd crush >>>>>>>>>>>>>>> create-or-move", >>>>>>>>>>>>>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", >>>>>>>>>>>>>>> "root=default"]} v 0) v1 >>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 log_channel(audit) log >>>>>>>>>>>>>>> [INF] : >>>>>>>>>>>>>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' >>>>>>>>>>>>>>> cmd=[{"prefix": >>>>>>>>>>>>>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args": >>>>>>>>>>>>>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch >>>>>>>>>>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 >>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush >>>>>>>>>>>>>>> item name >>>>>>>>>>>>>>> 'osd.47' initial_weight 3.6396 at location >>>>>>>>>>>>>>> {host=SRV-SEKUARK8,root=default} >>>>>>>>>>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700 1 >>>>>>>>>>>>>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full >>>>>>>>>>>>>>> prune >>>>>>>>>>>>>>> enabled >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim >>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi Sage, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thank you for your response. Now I am sure this incident is >>>>>>>>>>>>>>>> going to be resolved. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The problem started when 7 server crashed same time and they >>>>>>>>>>>>>>>> came back after ~5 minutes. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Two of our 3 mon services were restarted in this crash. Since >>>>>>>>>>>>>>>> mon services are enabled they should be started nearly at the >>>>>>>>>>>>>>>> same time. 
I dont know if this makes any difference but some >>>>>>>>>>>>>>>> of the guys on IRC told it is required that they start in >>>>>>>>>>>>>>>> order not at the same time. Otherwise it could break things >>>>>>>>>>>>>>>> badly. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> After 9 days we still see 3400-3500 active+clear PG. But in >>>>>>>>>>>>>>>> the end we have so many STUCK request and our cluster can not >>>>>>>>>>>>>>>> heal itself. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> When we set noup flag, OSDs can catch up epoch easily. But >>>>>>>>>>>>>>>> when we unset the flag we see so many STUCKS and SLOW OPS in 1 >>>>>>>>>>>>>>>> hour. >>>>>>>>>>>>>>>> I/O load on all of my OSD disks are at around %95 utilization >>>>>>>>>>>>>>>> and never ends. CPU and RAM usage are OK. >>>>>>>>>>>>>>>> OSDs get stuck that we even can't run “ceph pg osd.0 query”. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Also we tried to change RBD pool replication size 2 to 1. Our >>>>>>>>>>>>>>>> goal was the eliminate older PG's and leaving cluster with >>>>>>>>>>>>>>>> good ones. >>>>>>>>>>>>>>>> With replication size=1 we saw "%13 PGS not active”. But it >>>>>>>>>>>>>>>> didn’t solve our problem. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Of course we have to save %100 of data. But we feel like even >>>>>>>>>>>>>>>> saving %50 of our data will be make us very happy right now. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is what happens when the cluster starts. I believe it >>>>>>>>>>>>>>>> explains the whole story very nicely. >>>>>>>>>>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is our ceph.conf: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8sQhfPDXnW/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && >>>>>>>>>>>>>>>> ceph health”: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/g5t8xnrjjZ/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is pg dump: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zYqsN5T95h/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is iostat & perf top: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Pgf3mcXXX8/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This strace output of ceph-osd: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/YCdtfh5qX8/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is OSD log (default debug): >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/Z2JrrBzzkM/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is leader MON log (default debug): >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/RcGmsVKmzG/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> These are OSDs failed to start. Total number is 58. >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/ >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/ >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/ >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/ >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/ >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is OSD video with debug osd = 20 and debug ms = 1 and >>>>>>>>>>>>>>>> debug_filestore = 20. >>>>>>>>>>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is OSD logfile with debug osd = 20 and debug ms = 1 and >>>>>>>>>>>>>>>> debug_filestore = 20. >>>>>>>>>>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> As far as I understand OSD catchs up with the mon epoch and >>>>>>>>>>>>>>>> exceeds mon epoch somehow?? 
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg >>>>>>>>>>>>>>>> 66.f8 e60196@2018-09-28 23:57:08.251119 >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 >>>>>>>>>>>>>>>> build_initial_pg_history 66.f8 created 60196 >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>>>> 60196 - loading and decoding 0x19da8400 >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>>>> _process 66.d8 to_process <> waiting <> waiting_peering {} >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 >>>>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 >>>>>>>>>>>>>>>> e72642) queued >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>>>> _process 66.d8 to_process <OpQueueItem(66.d8 >>>>>>>>>>>>>>>> PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 >>>>>>>>>>>>>>>> NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> >>>>>>>>>>>>>>>> waiting_peering {} >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>>>> _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 >>>>>>>>>>>>>>>> epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 >>>>>>>>>>>>>>>> e72642) pg 0xb579400 >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: >>>>>>>>>>>>>>>> 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] >>>>>>>>>>>>>>>> local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c >>>>>>>>>>>>>>>> 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 >>>>>>>>>>>>>>>> lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod >>>>>>>>>>>>>>>> 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: >>>>>>>>>>>>>>>> 72642 epoch_requested: 72642 NullEvt +create_info >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 >>>>>>>>>>>>>>>> queue_want_up_thru want 72642 <= queued 72642, currently 72206 >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) >>>>>>>>>>>>>>>> _process empty q, waiting >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 >>>>>>>>>>>>>>>> add_map_bl 60196 50012 bytes >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>>>> 60197 - loading and decoding 0x19da8880 >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 >>>>>>>>>>>>>>>> add_map_bl 60197 50012 bytes >>>>>>>>>>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map >>>>>>>>>>>>>>>> 60198 - loading and decoding 0x19da9180 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <[email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option >>>>>>>>>>>>>>>> and you should >>>>>>>>>>>>>>>> only use it in very specific circumstances when directed by a >>>>>>>>>>>>>>>> developer. >>>>>>>>>>>>>>>> In such cases it will allow a stuck PG to peer. But you're >>>>>>>>>>>>>>>> not getting to >>>>>>>>>>>>>>>> that point...you're seeing some sort of resource exhaustion. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The noup trick works when OSDs are way behind on maps and all >>>>>>>>>>>>>>>> need to >>>>>>>>>>>>>>>> catch up. 
The way to tell if they are behind is by looking at >>>>>>>>>>>>>>>> the 'ceph >>>>>>>>>>>>>>>> daemon osd.NNN status' output and comparing to the latest >>>>>>>>>>>>>>>> OSDMap epoch tha >>>>>>>>>>>>>>>> t the mons have. Were they really caught up when you unset >>>>>>>>>>>>>>>> noup? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'm just catching up and haven't read the whole thread but I >>>>>>>>>>>>>>>> haven't seen >>>>>>>>>>>>>>>> anything that explains why teh OSDs are dong lots of disk IO. >>>>>>>>>>>>>>>> Catching up >>>>>>>>>>>>>>>> on maps could explain it but not why they wouldn't peer once >>>>>>>>>>>>>>>> they were all >>>>>>>>>>>>>>>> marked up... >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> sage >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Anyone heart about osd_find_best_info_ignore_history_les = >>>>>>>>>>>>>>>> true ? >>>>>>>>>>>>>>>> Is that be usefull here? There is such a less information >>>>>>>>>>>>>>>> about it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Goktug Yildirim <[email protected]> şunları yazdı (2 >>>>>>>>>>>>>>>> Eki 2018 22:11): >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions >>>>>>>>>>>>>>>> when I read somewhere that that will do the proper sizing. >>>>>>>>>>>>>>>> For the blustore cache size I have plenty of RAM. I will >>>>>>>>>>>>>>>> increase 8GB for each and decide a more calculated number >>>>>>>>>>>>>>>> after cluster settles. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> For the osd map loading I’ve also figured it out. And it is in >>>>>>>>>>>>>>>> loop. For that reason I started cluster with noup flag and >>>>>>>>>>>>>>>> waited OSDs to reach the uptodate epoch number. After that I >>>>>>>>>>>>>>>> unset noup. But I did not pay attention to manager logs. Let >>>>>>>>>>>>>>>> me check it, thank you! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I am not forcing jmellac or anything else really. I have a >>>>>>>>>>>>>>>> very standard installation and no tweaks or tunings. All we >>>>>>>>>>>>>>>> ask for the stability versus speed from the begining. And here >>>>>>>>>>>>>>>> we are :/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius >>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I can see some issues from the osd log file. You have an >>>>>>>>>>>>>>>> extremely low >>>>>>>>>>>>>>>> size db and wal partitions. Only 1GB for DB and 576MB for wal. >>>>>>>>>>>>>>>> I would >>>>>>>>>>>>>>>> recommend cranking up rocksdb cache size as much as possible. >>>>>>>>>>>>>>>> If you >>>>>>>>>>>>>>>> have RAM you can also increase bluestores cache size for hdd. >>>>>>>>>>>>>>>> Default >>>>>>>>>>>>>>>> is 1GB be as liberal as you can without getting OOM kills. You >>>>>>>>>>>>>>>> also >>>>>>>>>>>>>>>> have lots of osd map loading and decoding in the log. Are you >>>>>>>>>>>>>>>> sure all >>>>>>>>>>>>>>>> monitors/managers/osds are up to date? Plus make sure you >>>>>>>>>>>>>>>> aren't >>>>>>>>>>>>>>>> forcing jemalloc loading. I had a funny interaction after >>>>>>>>>>>>>>>> upgrading to >>>>>>>>>>>>>>>> mimic. >>>>>>>>>>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim >>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hello Darius, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks for reply! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The main problem is we can not query PGs. 
“ceph pg 67.54f >>>>>>>>>>>>>>>> query” does stucks and wait forever since OSD is unresponsive. >>>>>>>>>>>>>>>> We are certain that OSD gets unresponsive as soon as it UP. >>>>>>>>>>>>>>>> And we are certain that OSD responds again after its disk >>>>>>>>>>>>>>>> utilization stops. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So we have a small test like that: >>>>>>>>>>>>>>>> * Stop all OSDs (168 of them) >>>>>>>>>>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It >>>>>>>>>>>>>>>> takes 8 mins to finish. Only after that “ceph pg 67.54f query” >>>>>>>>>>>>>>>> works! >>>>>>>>>>>>>>>> * While OSD1 is “up" start OSD2. As soon as OSD2 starts OSD1 & >>>>>>>>>>>>>>>> OSD2 starts %95 disk utilization. This takes 17 minutes to >>>>>>>>>>>>>>>> finish. >>>>>>>>>>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O >>>>>>>>>>>>>>>> and it takes 25 mins to settle. >>>>>>>>>>>>>>>> * If you happen to start 5 of them at the same all of the OSDs >>>>>>>>>>>>>>>> start high I/O again. And it takes 1 hour to finish. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So in the light of these findings we flagged noup, started all >>>>>>>>>>>>>>>> OSDs. At first there was no I/O. After 10 minutes we unset >>>>>>>>>>>>>>>> noup. All of 168 OSD started to make high I/O. And we thought >>>>>>>>>>>>>>>> that if we wait long enough it will finish & OSDs will be >>>>>>>>>>>>>>>> responsive again. After 24hours they did not because I/O did >>>>>>>>>>>>>>>> not finish or even slowed down. >>>>>>>>>>>>>>>> One can think that is a lot of data there to scan. But it is >>>>>>>>>>>>>>>> just 33TB. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So at short we dont know which PG is stuck so we can remove it. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> However we met an weird thing half an hour ago. We exported >>>>>>>>>>>>>>>> the same PG from two different OSDs. One was 4.2GB and the >>>>>>>>>>>>>>>> other is 500KB! So we decided to export all OSDs for backup. >>>>>>>>>>>>>>>> Then we will delete strange sized ones and start the cluster >>>>>>>>>>>>>>>> all over. Maybe then we could solve the stucked or unfound PGs >>>>>>>>>>>>>>>> as you advise. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Any thought would be greatly appreciated. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius >>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hello, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Currently you have 15 objects missing. I would recommend >>>>>>>>>>>>>>>> finding them >>>>>>>>>>>>>>>> and making backups of them. Ditch all other osds that are >>>>>>>>>>>>>>>> failing to >>>>>>>>>>>>>>>> start and concentrate on bringing online those that have >>>>>>>>>>>>>>>> missing >>>>>>>>>>>>>>>> objects. Then slowly turn off nodown and noout on the cluster >>>>>>>>>>>>>>>> and see >>>>>>>>>>>>>>>> if it stabilises. If it stabilises leave these setting if not >>>>>>>>>>>>>>>> turn >>>>>>>>>>>>>>>> them back on. >>>>>>>>>>>>>>>> Now get some of the pg's that are blocked and querry the pgs >>>>>>>>>>>>>>>> to check >>>>>>>>>>>>>>>> why they are blocked. Try removing as much blocks as possible >>>>>>>>>>>>>>>> and then >>>>>>>>>>>>>>>> remove the norebalance/norecovery flags and see if it starts >>>>>>>>>>>>>>>> to fix >>>>>>>>>>>>>>>> itself. On Tue, Oct 2, 2018 at 5:14 PM by morphin >>>>>>>>>>>>>>>> <[email protected]> wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> One of ceph experts indicated that bluestore is somewhat >>>>>>>>>>>>>>>> preview tech >>>>>>>>>>>>>>>> (as for Redhat). 
>>>>>>>>>>>>>>>> So it could be best to checkout bluestore and rocksdb. There >>>>>>>>>>>>>>>> are some >>>>>>>>>>>>>>>> tools to check health and also repair. But there are limited >>>>>>>>>>>>>>>> documentation. >>>>>>>>>>>>>>>> Anyone who has experince with it? >>>>>>>>>>>>>>>> Anyone lead/help to a proper check would be great. >>>>>>>>>>>>>>>> Goktug Yildirim <[email protected]>, 1 Eki 2018 Pzt, >>>>>>>>>>>>>>>> 22:55 >>>>>>>>>>>>>>>> tarihinde şunu yazdı: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> We have recently upgraded from luminous to mimic. It’s been 6 >>>>>>>>>>>>>>>> days since this cluster is offline. The long short story is >>>>>>>>>>>>>>>> here: >>>>>>>>>>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’ve also CC’ed developers since I believe this is a bug. If >>>>>>>>>>>>>>>> this is not to correct way I apology and please let me know. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> For the 6 days lots of thing happened and there were some >>>>>>>>>>>>>>>> outcomes about the problem. Some of them was misjudged and >>>>>>>>>>>>>>>> some of them are not looked deeper. >>>>>>>>>>>>>>>> However the most certain diagnosis is this: each OSD causes >>>>>>>>>>>>>>>> very high disk I/O to its bluestore disk (WAL and DB are >>>>>>>>>>>>>>>> fine). After that OSDs become unresponsive or very very less >>>>>>>>>>>>>>>> responsive. For example "ceph tell osd.x version” stucks like >>>>>>>>>>>>>>>> for ever. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So due to unresponsive OSDs cluster does not settle. This is >>>>>>>>>>>>>>>> our problem! >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is the one we are very sure of. But we are not sure of >>>>>>>>>>>>>>>> the reason. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is the latest ceph status: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/2DyZ5YqPjh/. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This is the status after we started all of the OSDs 24 hours >>>>>>>>>>>>>>>> ago. >>>>>>>>>>>>>>>> Some of the OSDs are not started. However it didnt make any >>>>>>>>>>>>>>>> difference when all of them was online. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is the debug=20 log of an OSD which is same for all >>>>>>>>>>>>>>>> others: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/ >>>>>>>>>>>>>>>> As we figure out there is a loop pattern. I am sure it wont >>>>>>>>>>>>>>>> caught from eye. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> This the full log the same OSD. >>>>>>>>>>>>>>>> https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0 >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is the strace of the same OSD process: >>>>>>>>>>>>>>>> https://paste.ubuntu.com/p/8n2kTvwnG6/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Recently we hear more to uprade mimic. I hope none get hurts >>>>>>>>>>>>>>>> as we do. I am sure we have done lots of mistakes to let this >>>>>>>>>>>>>>>> happening. And this situation may be a example for other user >>>>>>>>>>>>>>>> and could be a potential bug for ceph developer. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Any help to figure out what is going on would be great. 
>>>>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>>>>> Goktug Yildirim
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> ceph-users mailing list
>>>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
