Re: [ceph-users] Ceph RBD bench has a strange behaviour when RBD client caching is active
Ceph package is 0.94.5, which is hammer. So yes, it could very well be this bug. Must I assume then that it only affects rbd bench and not the general functionality of the client?

On 2016-01-25 1:59 PM, Jason Dillaman wrote:
> What release are you testing? You might be hitting this issue [1] where 'rbd
> bench-write' would issue the same IO request repeatedly. With writeback
> cache enabled, this would result in virtually no ops issued to the backend.
>
> [1] http://tracker.ceph.com/issues/14283

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceph RBD bench has a strange behaviour when RBD client caching is active
Hi,

We've run into a weird issue on our current test setup. We're currently testing a small low-cost Ceph setup, with SATA drives, 1 Gbps ethernet and an Intel SSD for journaling per host. We've linked this to an openstack setup. Ceph is the latest Hammer release.

We notice that when we run rbd benchmarks using the rbd bench tool, the benchmark never really completes. It starts, runs for 3 seconds and then stops, despite the rbd image being 10 GB and the tool using a 4k block size. If I set rbd caching to false, the benchmark runs normally and completes after a few minutes.

How can the rbd_cache affect the benchmark tool in this manner, and does it have a direct impact on the openstack cluster running off this ceph setup?

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
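For reference, this is roughly how I run the benchmark (the pool and image names are just placeholders, and I'm quoting the option names from memory, so treat this as a sketch). The cache can also be toggled per run from the command line instead of editing ceph.conf; if the --rbd-cache override doesn't work on your build, setting rbd cache = false under [client] for the duration of the test does the same thing:

    # 10 GB test image, 4k sequential writes (io-total is in bytes here)
    rbd create rbd/benchtest --size 10240
    rbd bench-write rbd/benchtest --io-size 4096 --io-threads 16 --io-total 10737418240 --io-pattern seq

    # same run with the client cache disabled, for comparison
    rbd bench-write rbd/benchtest --io-size 4096 --io-threads 16 --io-total 10737418240 --io-pattern seq --rbd-cache=false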
[ceph-users] Backing up ceph rbd content to an external storage
Hi,

We've been considering periodically backing up rbds from ceph to a different storage backend, just in case. I've thought of a few ways this could be possible, but I am curious if anybody on this list is currently doing that. Are you currently backing up data that is contained in ceph? What do you think is the best way to do it?

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
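To give an idea of what I had in mind, here is a rough sketch of the snapshot/export-diff approach (pool, image and snapshot names are placeholders; this is untested on our side, so feel free to poke holes in it):

    # initial full copy to the external backend
    rbd snap create volumes/volume-0001@base
    rbd export volumes/volume-0001@base /backup/volume-0001.base.img

    # later: export only the blocks changed since the baseline
    rbd snap create volumes/volume-0001@daily1
    rbd export-diff --from-snap base volumes/volume-0001@daily1 /backup/volume-0001.daily1.diff

    # restore: import the full image, recreate the base snapshot, then replay the increments
    rbd import /backup/volume-0001.base.img volumes/volume-0001-restored
    rbd snap create volumes/volume-0001-restored@base
    rbd import-diff /backup/volume-0001.daily1.diff volumes/volume-0001-restored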
[ceph-users] Strange logging behaviour for ceph
Hi,

We're using Ceph Hammer 0.94.1 on CentOS 7. On the monitor, when we set log_to_syslog = true, Ceph starts shooting logs at stdout. I thought at first it might be rsyslog that is wrongly configured, but I did not find a rule that could explain this behavior. Can anybody else replicate this? If it's a bug, has it been fixed in a more recent version? (I couldn't find anything relating to such an issue.)

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
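For context, the monitor section in question looks roughly like this (paraphrased), and the values the running monitor actually loaded can be checked over the admin socket (replace the mon id with whatever yours is called):

    # ceph.conf on the monitor (sketch)
    [mon]
    log_to_syslog = true
    err_to_syslog = true

    # confirm what the running daemon actually uses
    ceph daemon mon.$(hostname -s) config show | grep -E 'log_to_syslog|err_to_syslog|log_file'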
Re: [ceph-users] Bad performances in recovery
Hi,

First of all, we are sure that the return to the default configuration fixed it. As soon as we restarted only one of the ceph nodes with the default configuration, it sped up recovery tremendously. We had already restarted before with the old conf and recovery was never that fast.

Regarding the configuration, here's the old one with comments:

[global]
fsid = *
mon_initial_members = cephmon1
mon_host = ***
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true   // lets you use the xattrs of xfs/ext4/btrfs filesystems
osd_pool_default_pgp_num = 450    // default pgp number for new pools
osd_pg_bits = 12                  // number of bits used to designate pgs; lets you have 2^12 pgs
osd_pool_default_size = 3         // default copy number for new pools
osd_pool_default_pg_num = 450     // default pg number for new pools
public_network = *
cluster_network = ***
osd_pgp_bits = 12                 // number of bits used to designate pgps; lets you have 2^12 pgps

[osd]
filestore_queue_max_ops = 5000    // 500 by default; maximum number of in-progress operations the filestore accepts before blocking on queuing new operations
filestore_fd_cache_random = true  //
journal_queue_max_ops = 100       // 500 by default; number of operations allowed in the journal queue
filestore_omap_header_cache_size = 100   // size of the LRU used to cache object omap headers; larger values use more memory but may reduce lookups on omap
filestore_fd_cache_size = 100     // not in the ceph documentation; seems to be a common tweak for SSD clusters though
max_open_files = 100              // lets ceph set the max file descriptors in the OS to prevent running out of file descriptors
osd_journal_size = 1              // journal max size for each OSD

New conf:

[global]
fsid = *
mon_initial_members = cephmon1
mon_host =
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = **
cluster_network = **

You might notice I have a few undocumented settings in the old configuration. These are settings I took from a certain openstack summit presentation and they may have contributed to this whole problem. Here's a list of settings that I think might be a possible cause for these speed issues:

filestore_fd_cache_random = true
filestore_fd_cache_size = 100

Additionally, my colleague thinks these settings may have contributed:

filestore_queue_max_ops = 5000
journal_queue_max_ops = 100

We will do further tests on these settings once we have our lab ceph test environment, as we are also curious as to exactly what caused this. (There's also a short note after the quoted replies below on how we check which values the OSDs are actually running with.)

On 2015-08-20 11:43 AM, Alex Gorbachev wrote:

Just to update the mailing list, we ended up going back to default ceph.conf without any additional settings than what is mandatory. We are now reaching speeds we never reached before, both in recovery and in regular usage. There was definitely something we set in the ceph.conf bogging everything down.

Could you please share the old and new ceph.conf, or the section that was removed?

Best regards, Alex

On 2015-08-20 4:06 AM, Christian Balzer wrote:

Hello, from all the pertinent points by Somnath, the one about pre-conditioning would be pretty high on my list, especially if this slowness persists and nothing else (scrub) is going on. This might be fixed by doing a fstrim. Additionally the levelDBs per OSD are of course sync'ing heavily during reconstruction, so that might not be the favorite thing for your type of SSDs.
But ultimately situational awareness is very important, as in what is actually going and slowing things down. As usual my recommendations would be to use atop, iostat or similar on all your nodes and see if your OSD SSDs are indeed the bottleneck or if it is maybe just one of them or something else entirely. Christian On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote: Also, check if scrubbing started in the cluster or not. That may considerably slow down the cluster. -Original Message- From: Somnath Roy Sent: Wednesday, August 19, 2015 1:35 PM To: 'J-P Methot'; ceph-us...@ceph.com Subject: RE: [ceph-users] Bad performances in recovery All the writes will go through the journal. It may happen your SSDs are not preconditioned well and after a lot of writes during recovery IOs are stabilized to lower number. This is quite common for SSDs
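A small addendum on how we now verify which of these values a node is actually running with, instead of trusting ceph.conf: we query the admin socket on one OSD per host (osd.0 is just an example id). The injectargs line is only for quick testing, and I'm not certain every filestore option takes effect without a restart:

    # values the running daemon actually loaded
    ceph daemon osd.0 config show | grep -E 'filestore_queue_max_ops|journal_queue_max_ops|filestore_fd_cache|max_open_files'

    # change the queue settings at runtime for testing
    ceph tell osd.* injectargs '--filestore_queue_max_ops 500 --journal_queue_max_ops 500'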
Re: [ceph-users] Bad performances in recovery
Hi, Just to update the mailing list, we ended up going back to default ceph.conf without any additional settings than what is mandatory. We are now reaching speeds we never reached before, both in recovery and in regular usage. There was definitely something we set in the ceph.conf bogging everything down. On 2015-08-20 4:06 AM, Christian Balzer wrote: Hello, from all the pertinent points by Somnath, the one about pre-conditioning would be pretty high on my list, especially if this slowness persists and nothing else (scrub) is going on. This might be fixed by doing a fstrim. Additionally the levelDB's per OSD are of course sync'ing heavily during reconstruction, so that might not be the favorite thing for your type of SSDs. But ultimately situational awareness is very important, as in what is actually going and slowing things down. As usual my recommendations would be to use atop, iostat or similar on all your nodes and see if your OSD SSDs are indeed the bottleneck or if it is maybe just one of them or something else entirely. Christian On Wed, 19 Aug 2015 20:54:11 + Somnath Roy wrote: Also, check if scrubbing started in the cluster or not. That may considerably slow down the cluster. -Original Message- From: Somnath Roy Sent: Wednesday, August 19, 2015 1:35 PM To: 'J-P Methot'; ceph-us...@ceph.com Subject: RE: [ceph-users] Bad performances in recovery All the writes will go through the journal. It may happen your SSDs are not preconditioned well and after a lot of writes during recovery IOs are stabilized to lower number. This is quite common for SSDs if that is the case. Thanks Regards Somnath -Original Message- From: J-P Methot [mailto:jpmet...@gtcomm.net] Sent: Wednesday, August 19, 2015 1:03 PM To: Somnath Roy; ceph-us...@ceph.com Subject: Re: [ceph-users] Bad performances in recovery Hi, Thank you for the quick reply. However, we do have those exact settings for recovery and it still strongly affects client io. I have looked at various ceph logs and osd logs and nothing is out of the ordinary. Here's an idea though, please tell me if I am wrong. We use intel SSDs for journaling and samsung SSDs as proper OSDs. As was explained several times on this mailing list, Samsung SSDs suck in ceph. They have horrible O_dsync speed and die easily, when used as journal. That's why we're using Intel ssds for journaling, so that we didn't end up putting 96 samsung SSDs in the trash. In recovery though, what is the ceph behaviour? What kind of write does it do on the OSD SSDs? Does it write directly to the SSDs or through the journal? Additionally, something else we notice: the ceph cluster is MUCH slower after recovery than before. Clearly there is a bottleneck somewhere and that bottleneck does not get cleared up after the recovery is done. On 2015-08-19 3:32 PM, Somnath Roy wrote: If you are concerned about *client io performance* during recovery, use these settings.. osd recovery max active = 1 osd max backfills = 1 osd recovery threads = 1 osd recovery op priority = 1 If you are concerned about *recovery performance*, you may want to bump this up, but I doubt it will help much from default settings.. Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J-P Methot Sent: Wednesday, August 19, 2015 12:17 PM To: ceph-us...@ceph.com Subject: [ceph-users] Bad performances in recovery Hi, Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total of 60 OSDs. All of these are SSDs with 4 SSD journals on each. 
The ceph version is hammer v0.94.1 . There is a performance overhead because we're using SSDs (I've heard it gets better in infernalis, but we're not upgrading just yet) but we can reach numbers that I would consider alright. Now, the issue is, when the cluster goes into recovery it's very fast at first, but then slows down to ridiculous levels as it moves forward. You can go from 7% to 2% to recover in ten minutes, but it may take 2 hours to recover the last 2%. While this happens, the attached openstack setup becomes incredibly slow, even though there is only a small fraction of objects still recovering (less than 1%). The settings that may affect recovery speed are very low, as they are by default, yet they still affect client io speed way more than it should. Why would ceph recovery become so slow as it progress and affect client io even though it's recovering at a snail's pace? And by a snail's pace, I mean a few kb/second on 10gbps uplinks. -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com
Re: [ceph-users] Bad performances in recovery
Hi, Thank you for the quick reply. However, we do have those exact settings for recovery and it still strongly affects client io. I have looked at various ceph logs and osd logs and nothing is out of the ordinary. Here's an idea though, please tell me if I am wrong. We use intel SSDs for journaling and samsung SSDs as proper OSDs. As was explained several times on this mailing list, Samsung SSDs suck in ceph. They have horrible O_dsync speed and die easily, when used as journal. That's why we're using Intel ssds for journaling, so that we didn't end up putting 96 samsung SSDs in the trash. In recovery though, what is the ceph behaviour? What kind of write does it do on the OSD SSDs? Does it write directly to the SSDs or through the journal? Additionally, something else we notice: the ceph cluster is MUCH slower after recovery than before. Clearly there is a bottleneck somewhere and that bottleneck does not get cleared up after the recovery is done. On 2015-08-19 3:32 PM, Somnath Roy wrote: If you are concerned about *client io performance* during recovery, use these settings.. osd recovery max active = 1 osd max backfills = 1 osd recovery threads = 1 osd recovery op priority = 1 If you are concerned about *recovery performance*, you may want to bump this up, but I doubt it will help much from default settings.. Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J-P Methot Sent: Wednesday, August 19, 2015 12:17 PM To: ceph-us...@ceph.com Subject: [ceph-users] Bad performances in recovery Hi, Our setup is currently comprised of 5 OSD nodes with 12 OSD each, for a total of 60 OSDs. All of these are SSDs with 4 SSD journals on each. The ceph version is hammer v0.94.1 . There is a performance overhead because we're using SSDs (I've heard it gets better in infernalis, but we're not upgrading just yet) but we can reach numbers that I would consider alright. Now, the issue is, when the cluster goes into recovery it's very fast at first, but then slows down to ridiculous levels as it moves forward. You can go from 7% to 2% to recover in ten minutes, but it may take 2 hours to recover the last 2%. While this happens, the attached openstack setup becomes incredibly slow, even though there is only a small fraction of objects still recovering (less than 1%). The settings that may affect recovery speed are very low, as they are by default, yet they still affect client io speed way more than it should. Why would ceph recovery become so slow as it progress and affect client io even though it's recovering at a snail's pace? And by a snail's pace, I mean a few kb/second on 10gbps uplinks. -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Bad performances in recovery
Hi,

Our setup is currently comprised of 5 OSD nodes with 12 OSDs each, for a total of 60 OSDs. All of these are SSDs, with 4 SSD journals on each node. The ceph version is hammer v0.94.1. There is a performance overhead because we're using SSDs (I've heard it gets better in infernalis, but we're not upgrading just yet), but we can reach numbers that I would consider alright.

Now, the issue is that when the cluster goes into recovery, it's very fast at first, but then slows down to ridiculous levels as it moves forward. You can go from 7% to 2% left to recover in ten minutes, but it may take 2 hours to recover the last 2%. While this happens, the attached openstack setup becomes incredibly slow, even though only a small fraction of objects is still recovering (less than 1%). The settings that may affect recovery speed are very low, as they are by default, yet they still affect client io speed way more than they should.

Why would ceph recovery become so slow as it progresses and affect client io even though it's recovering at a snail's pace? And by a snail's pace, I mean a few kB/second on 10 Gbps uplinks.

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
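For anyone who finds this thread later: the throttles mentioned in the replies can be applied on the fly to see whether they change anything for client io, and it's worth ruling out scrubbing at the same time. A rough sketch (the values are the conservative ones quoted in this thread, not a general recommendation):

    # check whether a scrub or deep-scrub is running alongside the recovery
    ceph -s
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # throttle recovery/backfill on all OSDs without restarting them
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'

    # re-enable scrubbing once recovery is done
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub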
Re: [ceph-users] How to backup hundreds or thousands of TB
Case in point, here's a little story as to why backup outside ceph is necessary. I was working on modifying journal locations for a running test ceph cluster when, after bringing back a few OSD nodes, two PGs started being marked as incomplete. That made all operations on the pool hang as, for some reason, rbd clients couldn't read the missing PGs and there was no timeout value for their operation. After spending half a day trying to fix this, I ended up needing to delete the pool and then recreate it. Thankfully that setup was not in production, so it was only a minor setback.

So, when we go into production with our setup, we are planning to have a second ceph cluster for backups, just in case such an issue happens again. I don't want to scare anyone and I'm pretty sure my issue was very exceptional, but no matter how well ceph replicates and ensures data safety, backups are still a good idea, in my humble opinion.

On 5/6/2015 6:35 AM, Mariusz Gronczewski wrote:

A snapshot on the same storage cluster should definitely NOT be treated as a backup. A snapshot as a source for a backup, however, can be a pretty good solution for some cases, but not every case. For example, if using ceph to serve static web files, I'd rather have the possibility to restore a given file from a given path than a snapshot of a whole multiple-TB cluster.

There are 2 cases for backup restore:
* something failed, need to fix it - usually a full restore is needed
* someone accidentally removed a thing, and now they need the thing back

Snapshots fix the first problem, but not the second one; restoring 7TB of data to recover a few GBs is not reasonable. As it is now we just back up from inside the VMs (file-based backup) and have puppet to easily recreate machine configs, but if (or rather when) we use the object store, we would back it up in a way that allows for partial restore.

On Wed, 6 May 2015 10:50:34 +0100, Nick Fisk n...@fisk.me.uk wrote:

For me personally I would always feel more comfortable with backups on a completely different storage technology. Whilst there are many things you can do with snapshots and replication, there is always a small risk that whatever causes data loss on your primary system may affect/replicate to your 2nd copy. I guess it all really depends on what you are trying to protect against, but tape still looks very appealing if you want to maintain a completely isolated copy of data.

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alexandre DERUMIER
Sent: 06 May 2015 10:10
To: Götz Reinicke
Cc: ceph-users
Subject: Re: [ceph-users] How to backup hundreds or thousands of TB

For the moment, you can use snapshots for backup: https://ceph.com/community/blog/tag/backup/
I think that async mirroring is on the roadmap: https://wiki.ceph.com/Planning/Blueprints/Hammer/RBD%3A_Mirroring
If you use qemu, you can do a qemu full backup (qemu incremental backup is coming in qemu 2.4).

----- Original Message -----
From: Götz Reinicke goetz.reini...@filmakademie.de
To: ceph-users ceph-users@lists.ceph.com
Sent: Wednesday, 6 May 2015 10:25:01
Subject: [ceph-users] How to backup hundreds or thousands of TB

Hi folks, beside hardware and performance and failover design: how do you manage to back up hundreds or thousands of TB :) ? Any suggestions? Best practice? A second ceph cluster at a different location? Bigger archive disks in good boxes? Or tape libs? What kind of backup software can handle such volumes nicely?

Thanks and regards. Götz

--
Götz Reinicke
IT-Koordinator
Tel.
+49 7141 969 82 420 E-Mail goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
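Following up on Alexandre's qemu suggestion above: a simple way to get a full image copy onto a non-ceph backend is to let qemu-img read straight from RBD (assuming qemu-img was built with rbd support; the pool/image name, ceph user and paths below are placeholders):

    # full copy of an RBD image to a qcow2 file on external storage
    qemu-img convert -f raw -O qcow2 \
        rbd:volumes/volume-0001:id=cinder:conf=/etc/ceph/ceph.conf \
        /mnt/backup/volume-0001.qcow2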
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Thank you everyone for your replies. We are currently in the process of selecting new drives for journaling to replace the samsung drives. We're running our own tests using DD and the command found here : http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ dd if=randfile of=/dev/sda bs=4k count=10 oflag=direct,dsync I have no trouble believing that this is right, but I was asked to double-check the command's validity. So, does this command emulate properly the way ceph journaling write to SSD? If you want, I can also post the results of our test on different drives once we're done. On 4/21/2015 4:04 AM, Andrei Mikhailovsky wrote: Hi I have been testing the Samsung 840 Pro (128gb) for quite sometime and I can also confirm that this drive is unsuitable for osd journal. The performance and latency that I get from these drives (according to ceph osd perf) are between 10 - 15 times slower compared to the Intel 520. The Intel 530 drives are also pretty awful. They are meant to be a replacement of the 520 drives, but the performance is pretty bad. I have found Intel 520 to be a reasonable drive for performance per price, for a cluster without a great deal of writes. However they do not make those anymore. Otherwise, it seems that the Intel 3600 and 3700 series is a good performer and has a much longer life expectancy. Andrei *From: *Eneko Lacunza elacu...@binovo.es *To: *J-P Methot jpmet...@gtcomm.net, Christian Balzer ch...@gol.com, ceph-users@lists.ceph.com *Sent: *Tuesday, 21 April, 2015 8:18:20 AM *Subject: *Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals) Hi, I'm just writing to you to stress out what others have already said, because it is very important that you take it very seriously. On 20/04/15 19:17, J-P Methot wrote: On 4/20/2015 11:01 AM, Christian Balzer wrote: This is similar to another thread running right now, but since our current setup is completely different from the one described in the other thread, I thought it may be better to start a new one. We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach write speeds of roughly 400 MB/sec, plugged in jbod on a controller that can theoretically transfer at 6gb/sec. All of that is linked to openstack compute nodes on two bonded 10gbps links (so a max transfer rate of 20 gbps). I sure as hell hope you're not planning to write all that much to this cluster. But then again you're worried about write speed, so I guess you do. Those _consumer_ SSDs will be dropping like flies, there are a number of threads about them here. They also might be of the kind that don't play well with O_DSYNC, I can't recall for sure right now, check the archives. Consumer SSDs universally tend to slow down quite a bit when not TRIM'ed and/or subjected to prolonged writes, like those generated by a benchmark. I see, yes it looks like these SSDs are not the best for the job. We will not change them for now, but if they start failing, we will replace them with better ones. I tried to put a Samsung 840 Pro 256GB in a ceph setup. It is supposed to be quite better than the EVO right? It was total crap. No not the best for the job. TOTAL CRAP. :) It can't give any useful write performance for a Ceph OSD. Spec sheet numbers don't matter for this, they don't work for ceph OSD, period. 
And yes, the drive is fine and works like a charm in workstation workloads. I suggest you at least get some intel S3700/S3610 and use them for the journal of those samsung drives, I think that could help performance a lot. Cheers Eneko -- Zuzendari Teknikoa / Director Técnico Binovo IT Human Project, S.L. Telf. 943575997 943493611 Astigarraga bidea 2, planta 6 dcha., ofi. 3-2; 20180 Oiartzun (Gipuzkoa) www.binovo.es ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
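A side note on the dd test quoted above: an equivalent check with fio reports IOPS and latency directly, which makes it easier to compare drives. Something along these lines (it writes to the raw device, so only run it on a drive with no data on it; /dev/sdX is a placeholder):

    # single-threaded 4k sync writes, roughly the filestore journal write pattern
    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based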
Re: [ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
My journals are on-disk, each disk being a SSD. The reason I didn't go with dedicated drives for journals is that when designing the setup, I was told that having dedicated journal SSDs on a full-SSD setup would not give me performance increases. So that makes the journal disk to data disk ratio 1:1. The replication size is 3, yes. The pools are replicated. On 4/20/2015 10:43 AM, Barclay Jameson wrote: Are your journals on separate disks? What is your ratio of journal disks to data disks? Are you doing replication size 3 ? On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot jpmet...@gtcomm.net mailto:jpmet...@gtcomm.net wrote: Hi, This is similar to another thread running right now, but since our current setup is completely different from the one described in the other thread, I thought it may be better to start a new one. We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We have 6 OSD hosts with 16 OSD each (so a total of 96 OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach write speeds of roughly 400 MB/sec, plugged in jbod on a controller that can theoretically transfer at 6gb/sec. All of that is linked to openstack compute nodes on two bonded 10gbps links (so a max transfer rate of 20 gbps). When I run rados bench from the compute nodes, I reach the network cap in read speed. However, write speeds are vastly inferior, reaching about 920 MB/sec. If I have 4 compute nodes running the write benchmark at the same time, I can see the number plummet to 350 MB/sec . For our planned usage, we find it to be rather slow, considering we will run a high number of virtual machines in there. Of course, the first thing to do would be to transfer the journal on faster drives. However, these are SSDs we're talking about. We don't really have access to faster drives. I must find a way to get better write speeds. Thus, I am looking for suggestions as to how to make it faster. I have also thought of options myself like: -Upgrading to the latest stable hammer version (would that really give me a big performance increase?) -Crush map modifications? (this is a long shot, but I'm still using the default crush map, maybe there's a change there I could make to improve performances) Any suggestions as to anything else I can tweak would be strongly appreciated. 
For reference, here's part of my ceph.conf: [global] auth_service_required = cephx filestore_xattr_use_omap = true auth_client_required = cephx auth_cluster_required = cephx osd pool default size = 3 osd pg bits = 12 osd pgp bits = 12 osd pool default pg num = 800 osd pool default pgp num = 800 [client] rbd cache = true rbd cache writethrough until flush = true [osd] filestore_fd_cache_size = 100 filestore_omap_header_cache_size = 100 filestore_fd_cache_random = true filestore_queue_max_ops = 5000 journal_queue_max_ops = 100 max_open_files = 100 osd journal size = 1 -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 tel:1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 tel:1-%28514%29-907-0750 jpmet...@gtcomm.net mailto:jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Possible improvements for a slow write speed (excluding independent SSD journals)
Hi,

This is similar to another thread running right now, but since our current setup is completely different from the one described in the other thread, I thought it may be better to start a new one.

We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We have 6 OSD hosts with 16 OSDs each (so a total of 96 OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach write speeds of roughly 400 MB/sec, plugged in JBOD mode on a controller that can theoretically transfer at 6 Gb/sec. All of that is linked to openstack compute nodes over two bonded 10 Gbps links (so a max transfer rate of 20 Gbps).

When I run rados bench from the compute nodes, I reach the network cap in read speed. However, write speeds are vastly inferior, reaching about 920 MB/sec. If I have 4 compute nodes running the write benchmark at the same time, I can see the number plummet to 350 MB/sec. For our planned usage we find this rather slow, considering we will run a high number of virtual machines on there.

Of course, the first thing to do would be to move the journals to faster drives. However, these are SSDs we're talking about; we don't really have access to faster drives. I must find a way to get better write speeds, so I am looking for suggestions as to how to make it faster. I have also thought of options myself, like:

- Upgrading to the latest stable hammer version (would that really give me a big performance increase?)
- Crush map modifications? (This is a long shot, but I'm still using the default crush map; maybe there's a change there I could make to improve performance.)

Any suggestions as to anything else I can tweak would be strongly appreciated. For reference, here's part of my ceph.conf:

[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
osd pool default size = 3
osd pg bits = 12
osd pgp bits = 12
osd pool default pg num = 800
osd pool default pgp num = 800

[client]
rbd cache = true
rbd cache writethrough until flush = true

[osd]
filestore_fd_cache_size = 100
filestore_omap_header_cache_size = 100
filestore_fd_cache_random = true
filestore_queue_max_ops = 5000
journal_queue_max_ops = 100
max_open_files = 100
osd journal size = 1

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
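For completeness, these are roughly the rados bench invocations behind the numbers above (the pool name is just an example; --no-cleanup keeps the objects around so the read test has something to read):

    # 60 second write test with 4 MB objects and 16 concurrent ops
    rados bench -p bench-pool 60 write -b 4194304 -t 16 --no-cleanup

    # sequential read test against the objects written above
    rados bench -p bench-pool 60 seq -t 16

    # remove the benchmark objects afterwards
    rados -p bench-pool cleanup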
Re: [ceph-users] Migrating objects from one pool to another?
That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Migrating objects from one pool to another?
Hi,

Lately I've been going back to work on one of my first ceph setups, and now I see that I have created way too many placement groups for the pools on that setup (about 10,000 too many). I believe this may impact performance negatively, as the performance of this ceph cluster is abysmal.

Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from one pool to another, right?

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
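For the record, the two low-level approaches I've found so far, in case they help someone else (pool and image names are examples; both need the volumes to be detached or quiesced while copying, and as far as I can tell neither preserves snapshots):

    # copy every object of a pool into a new pool created with a saner pg count
    ceph osd pool create volumes-new 512
    rados cppool volumes volumes-new

    # or per RBD image, which maps better onto openstack volume names
    rbd cp volumes/volume-0001 volumes-new/volume-0001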
Re: [ceph-users] Ceph Cluster Address
I had to go through the same experience of changing the public network address, and it's not easy. Ceph seems to keep a record of which IP address and port is associated with each OSD process. I was never able to find out where this record is kept or how to change it manually. Here's what I did, from memory:

1. Remove the network address I didn't want to use anymore from the ceph.conf and put the one I wanted to use instead. Don't worry, modifying the ceph.conf will not affect a currently running cluster unless you issue a command to it, like adding an OSD.

2. Remove each OSD one by one and then reinitialize it right after. You will lose the data that's on the OSD, but if your cluster is replicated properly and you do this operation one OSD at a time, you should not lose the copies of that data.

3. Check the OSD status to make sure the OSDs use the proper IP. The command "ceph osd dump" will tell you if your OSDs are detected on the proper IP.

4. Remove and reinstall each monitor one by one.

If anybody else has another solution I'd be curious to hear it, but this is how I managed to do it, by basically reinstalling each component one by one.

On 3/3/2015 12:26 PM, Garg, Pankaj wrote:

Hi, I have a ceph cluster that is contained within a rack (1 monitor and 5 OSD nodes). I kept the same public and private address for configuration. I do have 2 NICs and 2 valid IP addresses (one internal only and one external) for each machine. Is it possible now to change the public network address, after the cluster is up and running? I had used ceph-deploy for the cluster. If I change the address of the public network in ceph.conf, do I need to propagate it to all the machines in the cluster, or is just the monitor node enough?

Thanks
Pankaj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
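To add to point 3 above, the two places I checked to confirm the daemons had picked up the new addresses were the OSD map and the monitor map (monitor addresses live in the monmap, which is why the monitors also had to be redone one by one):

    # OSD addresses as currently registered in the cluster map
    ceph osd dump | grep "^osd\."

    # monitor addresses
    ceph mon dump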
[ceph-users] client unable to access files after caching pool addition
Hi, I tried to add a caching pool in front of openstack vms and volumes pools. I believed that the process was transparent, but as soon as I set the caching for both of these pools, the VMs could not find their volumes anymore. Obviously when I undid my changes, everything went back to normal. Could it be an authorization issue? Would the openstack vms need to connect to the caching pool instead of the storage pools to be able to access their volumes? Or is the configuration supposed to stay the same and the process is supposed to be completely transparent? -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
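For reference, this is roughly the sequence I used, from memory, in case someone spots a missing step (pool names are examples). Since my cephx keys for cinder/nova/glance are restricted to specific pools, my current suspicion is that the clients were simply not allowed to talk to the new cache pools:

    ceph osd tier add volumes volumes-cache
    ceph osd tier cache-mode volumes-cache writeback
    ceph osd tier set-overlay volumes volumes-cache
    ceph osd pool set volumes-cache hit_set_type bloom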
[ceph-users] filestore_fiemap and other ceph tweaks
Hi, I've been looking into increasing the performance of my ceph cluster for openstack that will be moved in production soon. It's a full 1TB SSD cluster with 16 OSD per node over 6 nodes. As I searched for possible tweaks to implement, I stumbled upon unitedstack's presentation at the openstack paris summit (video : https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/build-a-high-performance-and-high-durability-block-storage-service-based-on-ceph). Now, before implementing any of the suggested tweaks, I've been reading up on each one. It's not that I don't trust everything that's being said there, but I thought it may be better to inform myself before starting to implement tweaks that may strongly impact the performance and stability of my cluster. One of the suggested tweaks is to set filestore_fiemap to true. The issue is, after some research, I found that there is a rados block device corruption bug linked to setting that option to true (link: http://www.spinics.net/lists/ceph-devel/msg06851.html ). I have not found any trace of that bug being fixed since, despite the mailing list message being fairly old. Is it safe to set filestore_fiemap to true? Additionally, if anybody feels like watching the video or reading the presentation (slides are at http://www.spinics.net/lists/ceph-users/attachments/pdfUlINnd6l8e.pdf ), what do you think of the part about the other tweaks and the data durability part? -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] filestore_fiemap and other ceph tweaks
Thank you very much. Also thank you for the presentation you made in Paris, it was very instructive. So, from what I understand, the fiemap patch is proven to work on kernel 2.6.32 . The good news is that we use the same kernel in our setup. How long have your production cluster been running with fiemap set to true? On 2/2/2015 10:47 AM, Haomai Wang wrote: There exists a more recently discuss in PR(https://github.com/ceph/ceph/pull/1665). On Mon, Feb 2, 2015 at 11:05 PM, J-P Methot jpmet...@gtcomm.net wrote: Hi, I've been looking into increasing the performance of my ceph cluster for openstack that will be moved in production soon. It's a full 1TB SSD cluster with 16 OSD per node over 6 nodes. As I searched for possible tweaks to implement, I stumbled upon unitedstack's presentation at the openstack paris summit (video : https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/build-a-high-performance-and-high-durability-block-storage-service-based-on-ceph). Now, before implementing any of the suggested tweaks, I've been reading up on each one. It's not that I don't trust everything that's being said there, but I thought it may be better to inform myself before starting to implement tweaks that may strongly impact the performance and stability of my cluster. One of the suggested tweaks is to set filestore_fiemap to true. The issue is, after some research, I found that there is a rados block device corruption bug linked to setting that option to true (link: http://www.spinics.net/lists/ceph-devel/msg06851.html ). I have not found any trace of that bug being fixed since, despite the mailing list message being fairly old. Is it safe to set filestore_fiemap to true? Additionally, if anybody feels like watching the video or reading the presentation (slides are at http://www.spinics.net/lists/ceph-users/attachments/pdfUlINnd6l8e.pdf ), what do you think of the part about the other tweaks and the data durability part? -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
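A follow-up for anyone else reading this thread: the way I plan to test it is to flip the setting on a single OSD first and confirm it over the admin socket before rolling it out further (it's an [osd] option and, as far as I know, it only takes effect after an OSD restart):

    # ceph.conf sketch
    [osd]
    filestore_fiemap = true

    # after restarting that osd, confirm the running value
    ceph daemon osd.0 config get filestore_fiemap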
[ceph-users] OSDs not getting mounted back after reboot
Hi,

I'm having an issue quite similar to this old bug: http://tracker.ceph.com/issues/5194, except that I'm using CentOS 6. Basically, I set up a cluster using ceph-deploy to save some time (this is a 90+ OSD cluster). I rebooted a node earlier today and now all the drives are unmounted, and any attempt at mounting them manually returns:

mount: special device /dev/sda1 does not exist

However, those partitions are listed if I do sfdisk -l /dev/sda. I have also tried to do a partprobe on the devices, as was done in the previous bug, to no avail. /usr/sbin/ceph-disk-activate --mount /dev/sda1 tells me that the device does not exist.

Is it a bug, or am I doing something wrong?

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
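For completeness, here is the sequence I've been trying on the affected node (sda is just the example device); the partx and ceph-disk list steps are extra checks beyond what the old bug report mentions, in case I'm missing something obvious:

    # re-read the partition table and re-create the /dev entries
    partprobe /dev/sda
    partx -a /dev/sda

    # see whether ceph-disk/udev now recognizes the partitions, and whether the by-partuuid symlinks exist at all
    ceph-disk list
    ls -l /dev/disk/by-partuuid/

    # then try activating the data partition again
    ceph-disk activate /dev/sda1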
[ceph-users] Ceph configuration on multiple public networks.
Hi,

We've set up ceph and openstack on a fairly peculiar network configuration (or at least I think it is) and I'm looking for information on how to make it work properly. Basically, we have 3 networks: a management network, a storage network and a cluster network. The management network is over a 1 Gbps link, while the storage network is over 2 bonded 10 Gbps links. The cluster network can be ignored for now, as it works well.

Now, the main problem is that the ceph OSD nodes are plugged into the management, storage and cluster networks, but the monitors are only plugged into the management network. When I run tests, I see that all the traffic ends up going through the management network, slowing down ceph's performance. Because of the current network setup, I can't hook up the monitor nodes to the storage network, as we're missing ports on the switch. Would it be possible to maintain access to the management nodes while forcing the ceph cluster to use the storage network for data transfer?

As a reference, here's my ceph.conf:

[global]
osd_pool_default_pgp_num = 800
osd_pg_bits = 12
auth_service_required = cephx
osd_pool_default_size = 3
filestore_xattr_use_omap = true
auth_client_required = cephx
osd_pool_default_pg_num = 800
auth_cluster_required = cephx
mon_host = 10.251.0.51
public_network = 10.251.0.0/24, 10.21.0.0/24
mon_initial_members = cephmon1
cluster_network = 192.168.31.0/24
fsid = 60e1b557-e081-4dab-aa76-e68ba38a159e
osd_pgp_bits = 12

As you can see, I've set up 2 public networks, 10.251.0.0/24 being the management network and 10.21.0.0/24 being the storage network. Would it be possible to maintain cluster functionality and remove 10.251.0.0/24 from the public_network list? For example, if I were to remove it from the public network list and referenced each monitor node IP in the config file, would I be able to maintain connectivity?

--
==
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750
jpmet...@gtcomm.net
http://www.gtcomm.net

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
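To make the question more concrete, what I'd like to end up with is something like the sketch below. My understanding (please correct me if I'm wrong) is that the monitor addresses recorded in mon_host and in the monmap have to sit on the public network themselves, which is exactly what I can't do right now without free ports on the storage switch:

    [global]
    # storage network only as the public network
    public_network = 10.21.0.0/24
    cluster_network = 192.168.31.0/24
    # the monitors would have to be reachable on 10.21.0.x for this to work
    mon_host = 10.21.0.51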
[ceph-users] Increasing osd pg bits and osd pgp bits after cluster has been setup
Hi, I'm trying to reach a number of 4096 pg as suggested in the doc, but I can't have more than 32 pgs per OSD. I suspect this is caused by the default of 6 pg bits (2^5 = 32, the first bit being for 2^0). Is there a command to increase it once the OSDs have been linked to the cluster and the default pool has been created? I have increased the number of bits to 9 in ceph.conf, but it is not taken into account by the OSDs that already exist. -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
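A note in case others land on this thread: if I understand the docs correctly, the pg-bits settings only affect the defaults applied when a pool is created, so for an existing pool the pg count has to be raised directly, e.g. (pool name is an example; on some versions the increase may have to be done in several smaller steps):

    ceph osd pool set rbd pg_num 4096
    # once the new PGs have been created, also raise pgp_num so the data actually rebalances
    ceph osd pool set rbd pgp_num 4096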