Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
I'd like to see some way to cap recovery IOPS per OSD: don't allow backfill to do more than 50 operations per second. That would slow backfill down, but reserve plenty of IOPS for normal operation. I know that implementing this well is not a simple task.

I know I did some stupid things that caused a lot of my problems. Most of them can be traced back to

    osd mkfs options xfs = -l size=1024m -n size=64k -i size=2048 -s size=4096

and the kernel malloc problems it caused. Reformatting all of the disks fixed a lot of my issues, but it didn't fix them all.

While I was reformatting my secondary cluster, I tested its stability by reformatting all of the disks on the last node at once. I didn't mark them out and wait for the rebuild; I removed the OSDs, reformatted, and added them back to the cluster. That was 10 disks out of 36 total, in a 4-node cluster (I'm waiting for hardware to free up to build the 5th node).

Everything was fine for the first hour or so. After several hours, there was enough latency that the HTTP load balancer was marking RadosGW nodes down. My load balancer has a 30s timeout. Since the latency was cluster-wide, all RadosGW nodes were marked down together; when the latency spike subsided, they'd all get marked up again. This continued until the backfill completed. They were mostly up - I don't have numbers, but I think they were marked down about 5 times an hour, for less than a minute each time. That really messes with radosgw-agent.

I had recovery tuned down:

    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

I have journals on SSD, and single GigE public and cluster networks. This cluster has 2x replication (I'm waiting for the 5th node to go to 3x). The cluster network was pushing 950 Mbps. The SSDs and OSDs had plenty of write bandwidth, but the HDDs were saturating their IOPS. These are consumer-class 7200 RPM SATA disks, so they don't have very many IOPS. The average write latency on these OSDs is normally ~10ms; while this backfill was going on, the average write latency was 100ms, with plenty of times when it reached 200ms. The average read latency increased too, but not as badly: it averaged 50ms, with occasional spikes up to 400ms. Since I reformatted 27% of my cluster, I was seeing higher latency on 55% of my OSDs (readers and writers).

If instead I trickle in the disks, everything works fine. I was able to reformat 2 OSDs at a time without a problem. The cluster latency increase was barely noticeable, even though the IOPS on those two disks were saturated. A bit of latency here and there (5% of the time) doesn't hurt much; when it's 55% of the time, it hurts a lot more.

When I finally get the 5th node and increase replication from 2x to 3x, I expect this cluster to be unusable for about a week.

On Thu, Jul 17, 2014 at 9:02 AM, Andrei Mikhailovsky and...@arhont.com wrote:

Comments inline

From: Sage Weil sw...@redhat.com
To: Quenten Grasso qgra...@onq.com.au
Cc: ceph-users@lists.ceph.com
Sent: Thursday, 17 July, 2014 4:44:45 PM
Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

On Thu, 17 Jul 2014, Quenten Grasso wrote:

Hi Sage & List,

I understand this is probably a hard question to answer. I mentioned previously that our cluster has the MONs co-located on the OSD servers, which are R515s with 1 x AMD 6-core processor and 11 x 3TB OSDs, with dual 10GbE.
When our cluster is doing these heavy operations and IO has stopped, as in the case I mentioned earlier (setting tunables to optimal, or heavy recovery operations), is there a way to ensure the IO in our VMs doesn't get completely blocked/stopped/frozen? Could it be as simple as putting all 3 of our mon servers on bare metal with SSDs? (I recall reading somewhere that a mon disk was doing several thousand IOPS during a recovery operation.) I assume putting just one on bare metal won't help, because our mons will only ever be as fast as our slowest mon server?

I don't think this is related to where the mons are (most likely). The big question for me is whether IO is getting completely blocked, or just slowed enough that the VMs are all timing out.

AM: I was looking at the cluster status while the rebalancing was taking place, and I was seeing very little client IO reported by the ceph -s output. The numbers were around 20-100, whereas our typical IO for the cluster is around 1000. Having said that, this was not enough, as _all_ of our VMs became unresponsive and didn't recover after the rebalancing finished.

What slow request messages did you see during the rebalance?

AM: As I was experimenting with different options while trying to gain some client IO back, I noticed that when I limited the options to 1 per OSD (osd max backfills = 1, osd recovery max active = 1, osd recovery threads = 1), I did not have any slow request messages.
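These per-OSD throttles can also be applied to a running cluster without restarting the OSDs. A minimal sketch using injectargs (values as quoted in this thread; injected settings are not persistent, so mirror them in the [osd] section of ceph.conf as well):

    # throttle recovery on every OSD at once (runtime only, reverts on OSD restart)
    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'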
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Sage, Andrija & List,

I have seen the tunables issue on our cluster when I upgraded to firefly. I ended up going back to legacy settings after about an hour: my cluster is 55 x 3TB OSDs over 5 nodes, and it decided it needed to move around 32% of our data. After an hour all of our VMs were frozen, so I had to revert the change back to legacy settings, wait about the same time again until our cluster had recovered, and reboot our VMs. (Wasn't really expecting that one from the patch notes.)

Also, our CPU usage went through the roof on our nodes. Do you perchance have your metadata servers co-located on your OSD nodes as we do? I've been thinking about trying to move these to dedicated nodes, as it may resolve our issues.

Regards,
Quenten

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrija Panic
Sent: Tuesday, 15 July 2014 8:38 PM
To: Sage Weil
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

[...]
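The revert Quenten describes is a single command, but note that it triggers a second rebalance of roughly the same scale as the first. A minimal sketch:

    # go back to the pre-Firefly CRUSH behaviour (starts another rebalance)
    ceph osd crush tunables legacy
    # watch recovery progress until the cluster returns to HEALTH_OK
    ceph -w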
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Quenten,

We've got two monitors sitting on the OSD servers and one on a different server.

Andrei

--
Andrei Mikhailovsky
Director
Arhont Information Security
Web: http://www.arhont.com http://www.wi-foo.com
Tel: +44 (0)870 4431337
Fax: +44 (0)208 429 3111
PGP: Key ID - 0x2B3438DE
PGP: Server - keyserver.pgp.com

- Original Message -
From: Quenten Grasso qgra...@onq.com.au
To: Andrija Panic andrija.pa...@gmail.com, Sage Weil sw...@redhat.com
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, 16 July, 2014 1:20:19 PM
Subject: Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

[...]
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
For me: 3 nodes, 1 MON + 2 x 2TB OSDs on each node, no MDS used... I went through the pain of waiting for the data rebalancing, and now I'm on optimal tunables...

Cheers

On 16 July 2014 14:29, Andrei Mikhailovsky and...@arhont.com wrote:

[...]
--
Andrija Panić
http://admintweets.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
With 34 x 4TB OSDs over 4 hosts, about half full, I had 30% of objects moved, and it took around 12 hours. Except now I can't use the kclient any more - wish I'd read that first.

On 16 July 2014 13:36, Andrija Panic andrija.pa...@gmail.com wrote:

[...]
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Sage,

Since this problem is tunables-related, do we need to expect the same behavior or not when we do a regular data rebalancing caused by adding or removing OSDs? I guess not, but I would like your confirmation. I'm already on optimal tunables, but I'm afraid to test this by, e.g., shutting down 1 OSD.

Thanks,
Andrija

On 14 July 2014 18:18, Sage Weil sw...@redhat.com wrote:

[...]
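For a test like the one Andrija is afraid of, the usual way to take a single OSD down briefly without triggering any data movement is the noout flag. A minimal sketch (assumes the OSD is brought back up before the flag is cleared):

    ceph osd set noout      # down OSDs will not be marked out, so no rebalancing starts
    # ...stop the OSD, do the maintenance, start it again...
    ceph osd unset noout    # restore the normal out-marking behaviour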
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
On Tue, 15 Jul 2014, Andrija Panic wrote:

Hi Sage, since this problem is tunables-related, do we need to expect the same behavior or not when we do a regular data rebalancing caused by adding or removing OSDs? I guess not, but would like your confirmation. I'm already on optimal tunables, but I'm afraid to test this by, e.g., shutting down 1 OSD.

When you shut down a single OSD, it is a relatively small amount of data that needs to move to do the recovery. The issue with the tunables is just that a huge fraction of the stored data needs to move, and the performance impact is much higher.

sage

[...]
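One way to gauge that fraction before committing is to compare CRUSH mappings offline with crushtool. A sketch, not from the thread, assuming a replicated rule 0 with 2 replicas as in the posters' clusters:

    ceph osd getcrushmap -o crush.old
    crushtool -d crush.old -o crush.txt    # decompile; edit the tunables block in crush.txt
    crushtool -c crush.txt -o crush.new
    # map the same sample inputs through both maps and diff the placements
    crushtool -i crush.old --test --show-mappings --rule 0 --num-rep 2 --min-x 0 --max-x 10000 > old.txt
    crushtool -i crush.new --test --show-mappings --rule 0 --num-rep 2 --min-x 0 --max-x 10000 > new.txt
    diff old.txt new.txt | grep -c '^>'    # rough count of changed placements, out of 10001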
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Andrija,

I've got at least two more stories of a similar nature: one from a friend running a ceph cluster, and one from me. Both of our clusters are pretty small. My cluster has only two OSD servers with 8 OSDs each and 3 mons, with an SSD journal per 4 OSDs. My friend has a cluster of 3 mons and 3 OSD servers with 4 OSDs each, and an SSD per 4 OSDs as well. Both clusters are connected with 40gbit/s IP-over-Infiniband links.

We had the same issue while upgrading to firefly. However, we did not add any new disks; we just ran the ceph osd crush tunables optimal command after the upgrade. Both of our clusters were down as far as the virtual machines were concerned: all VMs crashed because of the lack of IO. That was a bit problematic, taking into account that ceph is typically so great at staying alive during failures and upgrades.

So, there seems to be a problem with the upgrade. I wish the devs had added a big note in red letters saying that if you run this command it will likely affect your cluster performance and most likely all your VMs will die, so please shut down your VMs if you do not want to risk data loss.

I've changed the default values to reduce the load during recovery and also to tune a few things performance-wise. My settings were (collected as a ceph.conf sketch after this message):

    osd recovery max chunk = 8388608
    osd recovery op priority = 2
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery threads = 1
    osd disk threads = 2
    filestore max sync interval = 10
    filestore op threads = 20
    filestore_flusher = false

However, this didn't help much, and I noticed that shortly after I ran the tunables command, iowait in my guest VMs quickly jumped to 50%, and to 99% a minute after. This happened on all VMs at once. During the recovery phase I ran the rbd -p poolname ls -l command several times, and it took between 20 and 40 minutes to complete; it typically takes less than 2 seconds when the cluster is not in recovery mode. My mate's cluster had the same tunables apart from the last three, and he saw exactly the same behaviour.

One other thing I've noticed: somewhere in the docs I've read that running the tunables optimal command should move not more than 10% of your data. However, in both of our cases the status was just over 30% degraded, and it took the good part of 9 hours to complete the data reshuffling.

Any comments from the ceph team or other ceph gurus on:

1. What have we done wrong in our upgrade process?
2. What options should we have used to keep our VMs alive?

Cheers

Andrei

- Original Message -
From: Andrija Panic andrija.pa...@gmail.com
To: ceph-users@lists.ceph.com
Sent: Sunday, 13 July, 2014 9:54:17 PM
Subject: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time

Hi,

After a ceph upgrade (0.72.2 to 0.80.3) I issued ceph osd crush tunables optimal, and after only a few minutes I added 2 more OSDs to the CEPH cluster... So these 2 changes were more or less done at the same time: rebalancing because of tunables optimal, and rebalancing because of adding new OSDs. The result: all VMs living on CEPH storage went mad, effectively with no disk access - blocked, so to speak. Since this rebalancing took 5h-6h, I had a bunch of VMs down for that long...

Did I do wrong by causing 2 rebalancings to happen at the same time? Is this behaviour normal, to cause great load on all VMs because they are effectively unable to access CEPH storage?

Thanks for any input...
--
Andrija Panić
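Andrei's settings above, collected as a ceph.conf sketch (values exactly as he quotes them; these are firefly-era options and live in the [osd] section):

    [osd]
    osd recovery max chunk = 8388608
    osd recovery op priority = 2
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery threads = 1
    osd disk threads = 2
    filestore max sync interval = 10
    filestore op threads = 20
    filestore flusher = false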
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi Andrei, nice to meet you again ;)

Thanks for sharing this info with me. I thought it was my mistake, introducing new OSD components at the same time: I figured that since it was rebalancing anyway, I'd add those new OSDs too, so it would all rebalance at once and I wouldn't cause 2 separate data rebalancings. During a normal OSD restart and data rebalancing (I did not set osd noout etc.) I did have somewhat lower VM performance, but everything stayed UP and fine.

Also, 30% of my data moved during the upgrade/tunables change, although the documents say 10%, as you said. I did not lose any data, but finding all the VMs that use CEPH as storage is somewhat of a PITA...

So, any CEPH developers' input would be greatly appreciated...

Thanks again for such detailed info,
Andrija

On 14 July 2014 10:52, Andrei Mikhailovsky and...@arhont.com wrote:

[...]
--
Andrija Panić
http://admintweets.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
I've added some additional notes/warnings to the upgrade and release notes:

https://github.com/ceph/ceph/commit/fc597e5e3473d7db6548405ce347ca7732832451

If there is somewhere else where you think a warning flag would be useful, let me know!

Generally speaking, we want to be able to cope with huge data rebalances without interrupting service. It's an ongoing process of improving the recovery vs client prioritization, though, and removing sources of overhead related to rebalancing... and it's clearly not perfect yet. :/

sage

On Sun, 13 Jul 2014, Andrija Panic wrote:

[...]
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Perhaps here: http://ceph.com/releases/v0-80-firefly-released/

Thanks

On 14 July 2014 18:18, Sage Weil sw...@redhat.com wrote:

[...]

--
Andrija Panić
http://admintweets.com
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Hi,

Which values in total are changed by ceph osd crush tunables optimal?

Is it perhaps possible to change some of the parameters on the weekends before the upgrade runs, to have more time? (That depends on whether the parameters are available in 0.72...)

The warning says it can take days... We have a cluster with 5 storage nodes and 12 x 4TB OSD disks each (60 OSDs), replica 2. The cluster is 60% filled, and the network connection is 10Gb. Would tunables optimal take one, two, or more days in such a configuration?

Udo

On 14.07.2014 18:18, Sage Weil wrote:

[...]
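For the first part of Udo's question, the active values can be dumped from a running cluster and compared before and after the change. A minimal sketch (assumes a firefly-era ceph CLI):

    ceph osd crush show-tunables    # prints the currently active CRUSH tunable values as JSON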
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
On Mon, 14 Jul 2014, Udo Lembke wrote:

Hi, which values in total are changed by ceph osd crush tunables optimal?

There are some brand new crush tunables that fix... I don't even remember offhand. In general, you probably want to stay away from 'optimal' unless this is a fresh cluster and all clients are librados. Using the 'firefly' tunables is probably the safest bet. Keep in mind that adjusting tunables is going to move a bunch of data, and client performance will be heavily impacted. If that's ok, go for it; otherwise just stick with the bobtail tunables unless/until it becomes a problem.

sage

[...]
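The profiles Sage mentions are applied the same way as 'optimal'. A minimal sketch (each profile switch still triggers its own rebalance, just a smaller one than the jump to optimal):

    ceph osd crush tunables bobtail    # the conservative choice Sage suggests sticking with
    ceph osd crush tunables firefly    # the profile he calls the safest bet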
Re: [ceph-users] ceph osd crush tunables optimal AND add new OSD at the same time
Udo, I had all VMs completely non-operational - so don't set optimal for now...

On 14 July 2014 20:48, Udo Lembke ulem...@polarzone.de wrote:

[...]

--
Andrija Panić
http://admintweets.com