On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> I'm so sorry about that, I missed the "out" parameter. My bad..
> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
Excellent, thanks. That looks like it confirms the problem is that the recovery tool didn't repopulate the creating pgs properly. If you take that 30 byte file I sent earlier (as hex) and update the osdmap epoch to the latest on the mon, confirm it decodes and dumps properly, and then inject it on the 3 mons, that should get you past this hump (and hopefully back up!).

sage

> Sage Weil <[email protected]> wrote (3 Oct 2018 21:13):
>
> > I bet the kvstore output it in a hexdump format? There is another option to get the raw data iirc
> >
> >> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM <[email protected]> wrote:
> >> I changed the file name to make it clear.
> >> When I use your command with "+decode" I'm getting an error like this:
> >>
> >> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >> error: buffer::malformed_input: void creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand old encoding version 2 < 111
> >>
> >> My ceph version: 13.2.2
> >>
> >> On Wed, 3 Oct 2018 at 20:46, Sage Weil <[email protected]> wrote:
> >>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>> > If I didn't do it wrong, I got the output as below.
> >>> >
> >>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating > dump
> >>> > 2018-10-03 20:08:52.070 7f07f5659b80 1 rocksdb: do_open column families: [default]
> >>> >
> >>> > ceph-dencoder type creating_pgs_t import dump dump_json
> >>>
> >>> Sorry, should be
> >>>
> >>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>>
> >>> s
> >>>
> >>> > {
> >>> >     "last_scan_epoch": 0,
> >>> >     "creating_pgs": [],
> >>> >     "queue": [],
> >>> >     "created_pools": []
> >>> > }
> >>> >
> >>> > You can find the "dump" link below.
> >>> >
> >>> > dump: https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >>> >
> >>> > Sage Weil <[email protected]> wrote (3 Oct 2018 18:45):
> >>> >
> >>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >> We are starting to work on it. First step is getting the structure out and dumping the current value as you say.
> >>> > >>
> >>> > >> And you were correct, we did not run force_create_pg.
> >>> > >
> >>> > > Great.
> >>> > >
> >>> > > So, eager to see what the current structure is... please attach once you have it.
> >>> > >
> >>> > > The new replacement one should look like this (when hexdump -C'd):
> >>> > >
> >>> > > 00000000  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  |................|
> >>> > > 00000010  00 00 42 00 00 00 00 00  00 00 00 00 00 00        |..B...........|
> >>> > > 0000001e
> >>> > >
> >>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, in hex, little endian (least significant byte first), in place of the 0x10 that is there now.
> >>> > > It should dump like this:
> >>> > >
> >>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>> > > {
> >>> > >     "last_scan_epoch": 16,    <--- but with a recent epoch here
> >>> > >     "creating_pgs": [],
> >>> > >     "queue": [],
> >>> > >     "created_pools": [
> >>> > >         66
> >>> > >     ]
> >>> > > }
> >>> > >
> >>> > > sage
> >>> >
> >>> > >>> On 3 Oct 2018, at 17:52, Sage Weil <[email protected]> wrote:
> >>> > >>>
> >>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >>>> Sage,
> >>> > >>>>
> >>> > >>>> Pool 66 is the only pool it shows right now. This is a pool created months ago.
> >>> > >>>> ceph osd lspools
> >>> > >>>> 66 mypool
> >>> > >>>>
> >>> > >>>> As we recreated the mon db from the OSDs, the pools for MDS were unusable, so we deleted them. After we created another cephfs fs and pools we started the MDS and it got stuck on creation. So we stopped the MDS and removed the fs and fs pools. Right now we do not have an MDS running, nor do we have cephfs related things.
> >>> > >>>>
> >>> > >>>> ceph fs dump
> >>> > >>>> dumped fsmap epoch 1 e1
> >>> > >>>> enable_multiple, ever_enabled_multiple: 0,0
> >>> > >>>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>> > >>>> legacy client fscid: -1
> >>> > >>>>
> >>> > >>>> No filesystems configured
> >>> > >>>>
> >>> > >>>> ceph fs ls
> >>> > >>>> No filesystems enabled
> >>> > >>>>
> >>> > >>>> Now pool 66 seems to be the only pool we have, and it was created months ago. Then I guess there is something hidden out there.
> >>> > >>>>
> >>> > >>>> Is there any way to find and delete it?
> >>> > >>>
> >>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old pool... did you perhaps run force_create_pg at some point? Assuming you didn't, I think this is a bug in the process for rebuilding the mon store.. one that doesn't normally come up because the impact is this osdmap scan that is cheap in our test scenarios but clearly not cheap for your aged cluster.
> >>> > >>>
> >>> > >>> In any case, there is a way to clear those out of the mon, but it's a bit dicey.
> >>> > >>>
> >>> > >>> 1. stop all mons
> >>> > >>> 2. make a backup of all mons
> >>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating key=creating key on one of the mons
> >>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
> >>> > >>> 5. hex edit the structure to remove all of the creating pgs, and add pool 66 to the created_pools member.
> >>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>> > >>> 7. inject the updated structure into all of the mons
> >>> > >>> 8. start all mons
> >>> > >>>
> >>> > >>> 4-6 will probably be an iterative process... let's start by getting the structure out and dumping the current value?
> >>> > >>>
> >>> > >>> The code to refer to to understand the structure is src/mon/CreatingPGs.h encode/decode methods.
> >>> > >>>
> >>> > >>> sage
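For reference, steps 3-7 above, combined with the byte-6 epoch patch described in the quoted message before it, might look roughly like the sketch below. The epoch value 72900, the file names, and the "set ... in" form of ceph-kvstore-tool are assumptions to adapt rather than commands taken from this thread; the mons should be stopped and backed up first, as in steps 1 and 2:

    # step 3: pull the current value off one of the (stopped) mons
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating creating out creating.old

    # step 5: rebuild the 30-byte structure shown in the hexdump above, patching a recent
    # OSDMap epoch into bytes 6-9, little endian (EPOCH=72900 is a placeholder value)
    EPOCH=72900
    printf '02 01 18 00 00 00 %02x %02x %02x %02x 00 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 00 00 00 00' \
        $((EPOCH & 0xff)) $(((EPOCH >> 8) & 0xff)) $(((EPOCH >> 16) & 0xff)) $(((EPOCH >> 24) & 0xff)) \
        | xxd -r -p > creating.new

    # step 6: make sure the result decodes before touching the mons
    ceph-dencoder type creating_pgs_t import creating.new decode dump_json

    # step 7: inject it into the store of each of the 3 mons (still stopped)
    ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ set osd_pg_creating creating in creating.new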
> >>> > >>>>> On 3 Oct 2018, at 16:46, Sage Weil <[email protected]> wrote:
> >>> > >>>>>
> >>> > >>>>> Oh... I think this is the problem:
> >>> > >>>>>
> >>> > >>>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>> > >>>>>
> >>> > >>>>> You are in the midst of creating new pgs, and unfortunately pg create is one of the last remaining places where the OSDs need to look at a full history of map changes between then and the current map epoch. In this case, the pool was created in 60196 and it is now 72883, ~12k epochs later.
> >>> > >>>>>
> >>> > >>>>> What is this new pool for? Is it still empty, and if so, can we delete it? If yes, I'm ~70% sure that will then get cleaned out at the mon end and restarting the OSDs will make these pg_creates go away.
> >>> > >>>>>
> >>> > >>>>> s
> >>> > >>>>>
> >>> > >>>>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >>>>>>
> >>> > >>>>>> Hello,
> >>> > >>>>>>
> >>> > >>>>>> It seems nothing has changed.
> >>> > >>>>>>
> >>> > >>>>>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/
> >>> > >>>>>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/
> >>> > >>>>>>
> >>> > >>>>>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius <[email protected]> wrote:
> >>> > >>>>>>>
> >>> > >>>>>>> Hello,
> >>> > >>>>>>>
> >>> > >>>>>>> You can also reduce the osd map updates by adding this to your ceph config file: "osd crush update on start = false". This should remove an update that is generated when the osd starts.
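For reference, the option mentioned above would normally go into ceph.conf on the OSD hosts; a minimal sketch, with the section placement being an assumption rather than something taken from this cluster's configuration:

    [osd]
    # skip the automatic CRUSH create-or-move/set-device-class update at OSD startup
    osd crush update on start = false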
> >>> > >>>>>>>
> >>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 mon.SRV-SBKUARK14@0(leader) e14 handle_command mon_command({"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>> > >>>>>>> 2018-10-03 14:03:21.534 7fe15eddb700 0 log_channel(audit) log [INF] : from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix": "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]: dispatch
> >>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 mon.SRV-SBKUARK14@0(leader) e14 handle_command mon_command({"prefix": "osd crush create-or-move", "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", "root=default"]} v 0) v1
> >>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 log_channel(audit) log [INF] : from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix": "osd crush create-or-move", "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>> > >>>>>>> 2018-10-03 14:03:21.538 7fe15eddb700 0 mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name 'osd.47' initial_weight 3.6396 at location {host=SRV-SEKUARK8,root=default}
> >>> > >>>>>>> 2018-10-03 14:03:22.250 7fe1615e0700 1 mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune enabled
> >>> > >>>>>>>
> >>> > >>>>>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim <[email protected]> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hi Sage,
> >>> > >>>>>>>>
> >>> > >>>>>>>> Thank you for your response. Now I am sure this incident is going to be resolved.
> >>> > >>>>>>>>
> >>> > >>>>>>>> The problem started when 7 servers crashed at the same time and came back after ~5 minutes.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Two of our 3 mon services were restarted in this crash. Since the mon services are enabled they should have started at nearly the same time. I dont know if this makes any difference, but some of the guys on IRC told us they are required to start in order, not at the same time. Otherwise it could break things badly.
> >>> > >>>>>>>>
> >>> > >>>>>>>> After 9 days we still see 3400-3500 active+clean PGs. But in the end we have so many STUCK requests and our cluster can not heal itself.
> >>> > >>>>>>>>
> >>> > >>>>>>>> When we set the noup flag, OSDs can catch up on epochs easily. But when we unset the flag we see so many STUCKS and SLOW OPS within 1 hour. I/O load on all of my OSD disks is at around %95 utilization and never ends. CPU and RAM usage are OK. OSDs get so stuck that we even can't run "ceph pg osd.0 query".
> >>> > >>>>>>>>
> >>> > >>>>>>>> Also we tried to change the RBD pool replication size from 2 to 1. Our goal was to eliminate older PGs and leave the cluster with good ones. With replication size=1 we saw "%13 PGS not active". But it didn't solve our problem.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Of course we have to save %100 of data. But we feel like even saving %50 of our data would make us very happy right now.
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is what happens when the cluster starts. I believe it explains the whole story very nicely.
> >>> > >>>>>>>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is our ceph.conf: https://paste.ubuntu.com/p/8sQhfPDXnW/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the output of "osd stat && osd epochs && ceph -s && ceph health": https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the pg dump: https://paste.ubuntu.com/p/zYqsN5T95h/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is iostat & perf top: https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the strace output of ceph-osd: https://paste.ubuntu.com/p/YCdtfh5qX8/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the OSD log (default debug): https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the leader MON log (default debug): https://paste.ubuntu.com/p/RcGmsVKmzG/
> >>> > >>>>>>>>
> >>> > >>>>>>>> These are the OSDs that failed to start. The total number is 58.
> >>> > >>>>>>>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/pkRdVjCH4D/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/zJTf2fzSj9/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/xpJRK6YhRX/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/SY3576dNbJ/
> >>> > >>>>>>>> https://paste.ubuntu.com/p/smyT6Y976b/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is an OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>> > >>>>>>>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is an OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
> >>> > >>>>>>>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
> >>> > >>>>>>>>
> >>> > >>>>>>>> As far as I understand, an OSD catches up with the mon epoch and somehow exceeds the mon epoch??
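For reference, one way to answer that question is to compare the newest map an OSD has against the epoch the mons are at before unsetting noup. A rough sketch; osd.150 is just an example id, and the "newest_map" field name in the admin socket output is an assumption to verify on the running version:

    ceph osd set noup              # keep OSDs from being marked up while they catch up on maps
    ceph osd stat                  # current OSDMap epoch according to the mons
    ceph daemon osd.150 status     # run on the OSD's host; compare its "newest_map" with the mon epoch
    ceph osd unset noup            # only once the OSDs have caught up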
> >>> > >>>>>>>>
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 e60196@2018-09-28 23:57:08.251119
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 66.f8 created 60196
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading and decoding 0x19da8400
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <> waiting <> waiting_peering {}
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) queued
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 to_process <OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642)> waiting <> waiting_peering {}
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 NullEvt +create_info
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 72642 <= queued 72642, currently 72206
> >>> > >>>>>>>> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process empty q, waiting
> >>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 10 osd.150 72642 add_map_bl 60196 50012 bytes
> >>> > >>>>>>>> 2018-10-03 14:55:08.665 7f66c0bf9700 20 osd.150 72642 get_map 60197 - loading and decoding 0x19da8880
> >>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 10 osd.150 72642 add_map_bl 60197 50012 bytes
> >>> > >>>>>>>> 2018-10-03 14:55:08.669 7f66c0bf9700 20 osd.150 72642 get_map 60198 - loading and decoding 0x19da9180
> >>> > >>>>>>>>
> >>> > >>>>>>>> On 3 Oct 2018, at 05:14, Sage Weil <[email protected]> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> osd_find_best_info_ignore_history_les is a dangerous option and you should only use it in very specific circumstances when directed by a developer. In such cases it will allow a stuck PG to peer. But you're not getting to that point... you're seeing some sort of resource exhaustion.
> >>> > >>>>>>>>
> >>> > >>>>>>>> The noup trick works when OSDs are way behind on maps and all need to catch up.
> >>> > >>>>>>>> The way to tell if they are behind is by looking at the 'ceph daemon osd.NNN status' output and comparing to the latest OSDMap epoch that the mons have. Were they really caught up when you unset noup?
> >>> > >>>>>>>>
> >>> > >>>>>>>> I'm just catching up and haven't read the whole thread but I haven't seen anything that explains why the OSDs are doing lots of disk IO. Catching up on maps could explain it but not why they wouldn't peer once they were all marked up...
> >>> > >>>>>>>>
> >>> > >>>>>>>> sage
> >>> > >>>>>>>>
> >>> > >>>>>>>> On Tue, 2 Oct 2018, Göktuğ Yıldırım wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> Has anyone heard about osd_find_best_info_ignore_history_les = true? Would that be useful here? There is very little information about it.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Goktug Yildirim <[email protected]> wrote (2 Oct 2018 22:11):
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hi,
> >>> > >>>>>>>>
> >>> > >>>>>>>> Indeed I left ceph-disk to decide the wal and db partitions when I read somewhere that it will do the proper sizing. For the bluestore cache size I have plenty of RAM. I will increase it to 8GB for each and decide on a more calculated number after the cluster settles.
> >>> > >>>>>>>>
> >>> > >>>>>>>> For the osd map loading I've also figured it out. And it is in a loop. For that reason I started the cluster with the noup flag and waited for the OSDs to reach the up-to-date epoch number. After that I unset noup. But I did not pay attention to the manager logs. Let me check it, thank you!
> >>> > >>>>>>>>
> >>> > >>>>>>>> I am not forcing jemalloc or anything else really. I have a very standard installation and no tweaks or tunings. All we asked for was stability over speed from the beginning. And here we are :/
> >>> > >>>>>>>>
> >>> > >>>>>>>> On 2 Oct 2018, at 21:53, Darius Kasparavičius <[email protected]> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hi,
> >>> > >>>>>>>>
> >>> > >>>>>>>> I can see some issues from the osd log file. You have extremely small db and wal partitions: only 1GB for DB and 576MB for WAL. I would recommend cranking up the rocksdb cache size as much as possible. If you have RAM you can also increase bluestore's cache size for hdd. The default is 1GB; be as liberal as you can without getting OOM kills. You also have lots of osd map loading and decoding in the log. Are you sure all monitors/managers/osds are up to date? Plus make sure you aren't forcing jemalloc loading. I had a funny interaction after upgrading to mimic.
> >>> > >>>>>>>> On Tue, Oct 2, 2018 at 9:02 PM Goktug Yildirim <[email protected]> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hello Darius,
> >>> > >>>>>>>>
> >>> > >>>>>>>> Thanks for the reply!
> >>> > >>>>>>>>
> >>> > >>>>>>>> The main problem is that we cannot query PGs.
> >>> > >>>>>>>> "ceph pg 67.54f query" gets stuck and waits forever since the OSD is unresponsive. We are certain that an OSD gets unresponsive as soon as it is UP. And we are certain that the OSD responds again after its disk utilization stops.
> >>> > >>>>>>>>
> >>> > >>>>>>>> So we ran a small test like this:
> >>> > >>>>>>>> * Stop all OSDs (168 of them)
> >>> > >>>>>>>> * Start OSD1. %95 osd disk utilization immediately starts. It takes 8 mins to finish. Only after that does "ceph pg 67.54f query" work!
> >>> > >>>>>>>> * While OSD1 is "up" start OSD2. As soon as OSD2 starts, OSD1 & OSD2 start %95 disk utilization. This takes 17 minutes to finish.
> >>> > >>>>>>>> * Now start OSD3 and it is the same. All OSDs start high I/O and it takes 25 mins to settle.
> >>> > >>>>>>>> * If you happen to start 5 of them at the same time, all of the OSDs start high I/O again. And it takes 1 hour to finish.
> >>> > >>>>>>>>
> >>> > >>>>>>>> So in the light of these findings we set the noup flag and started all OSDs. At first there was no I/O. After 10 minutes we unset noup. All 168 OSDs started to do high I/O. And we thought that if we waited long enough it would finish & the OSDs would be responsive again. After 24 hours they were not, because the I/O did not finish or even slow down. One might think there is a lot of data there to scan. But it is just 33TB.
> >>> > >>>>>>>>
> >>> > >>>>>>>> So in short, we don't know which PG is stuck so we can remove it.
> >>> > >>>>>>>>
> >>> > >>>>>>>> However we ran into a weird thing half an hour ago. We exported the same PG from two different OSDs. One was 4.2GB and the other 500KB! So we decided to export all OSDs for backup. Then we will delete the strangely sized ones and start the cluster all over. Maybe then we could solve the stuck or unfound PGs as you advise.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Any thoughts would be greatly appreciated.
> >>> > >>>>>>>>
> >>> > >>>>>>>> On 2 Oct 2018, at 18:16, Darius Kasparavičius <[email protected]> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hello,
> >>> > >>>>>>>>
> >>> > >>>>>>>> Currently you have 15 objects missing. I would recommend finding them and making backups of them. Ditch all other osds that are failing to start and concentrate on bringing online those that have the missing objects. Then slowly turn off nodown and noout on the cluster and see if it stabilises. If it stabilises leave these settings, if not turn them back on.
> >>> > >>>>>>>> Now get some of the pg's that are blocked and query the pgs to check why they are blocked. Try removing as many blocks as possible and then remove the norebalance/norecovery flags and see if it starts to fix itself.
> >>> > >>>>>>>> On Tue, Oct 2, 2018 at 5:14 PM by morphin <[email protected]> wrote:
> >>> > >>>>>>>>
> >>> > >>>>>>>> One of the ceph experts indicated that bluestore is somewhat preview tech (as for Redhat).
> >>> > >>>>>>>> So it could be best to check out bluestore and rocksdb. There are some tools to check health and also repair. But there is limited documentation. Does anyone have experience with them? Any lead/help towards a proper check would be great.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Goktug Yildirim <[email protected]> wrote (Mon, 1 Oct 2018, 22:55):
> >>> > >>>>>>>>
> >>> > >>>>>>>> Hi all,
> >>> > >>>>>>>>
> >>> > >>>>>>>> We have recently upgraded from luminous to mimic. It's been 6 days since this cluster went offline. The long story, in short, is here:
> >>> > >>>>>>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/030078.html
> >>> > >>>>>>>>
> >>> > >>>>>>>> I've also CC'ed developers since I believe this is a bug. If this is not the correct way, I apologize, and please let me know.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Over the 6 days lots of things happened and there were some findings about the problem. Some of them were misjudged and some of them were not looked into deeper.
> >>> > >>>>>>>> However the most certain diagnosis is this: each OSD causes very high disk I/O to its bluestore disk (WAL and DB are fine). After that the OSDs become unresponsive or very much less responsive. For example "ceph tell osd.x version" gets stuck, seemingly forever.
> >>> > >>>>>>>>
> >>> > >>>>>>>> So due to the unresponsive OSDs the cluster does not settle. This is our problem!
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the one we are very sure of. But we are not sure of the reason.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Here is the latest ceph status: https://paste.ubuntu.com/p/2DyZ5YqPjh/
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the status after we started all of the OSDs 24 hours ago. Some of the OSDs did not start. However it didn't make any difference when all of them were online.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Here is the debug=20 log of an OSD, which is the same for all others: https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> > >>>>>>>> As far as we can tell there is a loop pattern. I am sure it won't be caught by eye.
> >>> > >>>>>>>>
> >>> > >>>>>>>> This is the full log of the same OSD: https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> >>> > >>>>>>>>
> >>> > >>>>>>>> Here is the strace of the same OSD process: https://paste.ubuntu.com/p/8n2kTvwnG6/
> >>> > >>>>>>>>
> >>> > >>>>>>>> Recently we hear more advice to upgrade to mimic. I hope no one gets hurt as we did. I am sure we have made lots of mistakes to let this happen. And this situation may be an example for other users, and could be a potential bug for the ceph developers.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Any help to figure out what is going on would be great.
> >>> > >>>>>>>>
> >>> > >>>>>>>> Best Regards,
> >>> > >>>>>>>> Goktug Yildirim
> >>> > >>>>>>>>
> >>> > >>>>>>>> _______________________________________________
> >>> > >>>>>>>> ceph-users mailing list
> >>> > >>>>>>>> [email protected]
> >>> > >>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
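On the question above about tools for checking bluestore health: a minimal sketch of the usual ceph-bluestore-tool invocation. The OSD id is a placeholder and the OSD has to be stopped first; this is an assumption about workflow, not something that was run in this thread:

    systemctl stop ceph-osd@90
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-90
    # "ceph-bluestore-tool repair --path ..." attempts to fix what fsck reports
    systemctl start ceph-osd@90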
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
