Re: [ceph-users] Persistent Write Back Cache
Hi Nick, Christian,

This is something we've discussed a bit but hasn't made it to the top of the list. I think having a single persistent copy on the client has *some* value, although it's limited because it's a single point of failure. The simplest scenario would be to use it as a write-through cache that accelerates reads only. Another option would be to have a shared but local device (like an SSD that is connected to a pair of client hosts, or has fast access within a rack--a scenario that I've heard a few vendors talk about). It still leaves a host pair or rack as a failure zone, but there are times where that's appropriate.

In either case, though, I think the key RBD feature that would make it much more valuable would be if RBD (librbd presumably) could maintain the writeback cache with some sort of checkpoints or journal internally, such that writes that get flushed back to the cluster are always *crash consistent*. So even if you lose the client cache entirely, your disk image still holds a valid file system that just looks a little bit stale. If the client-side writeback cache were structured as a data journal this would be pretty straightforward... it might even mesh well with the RBD mirroring?

sage

On Wed, 4 Mar 2015, Nick Fisk wrote:

Hi Christian,

Yes that's correct, it's on the client side. I don't see this as much different from a battery-backed RAID controller: if you lose power, the data stays in the cache until power resumes, when it is flushed. If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (e.g. DRBD / dual-port SAS), but then something like Pacemaker would be responsible for ensuring the RBD and the cache device are both present before allowing client access. When I wrote this I was thinking more about 2 HA iSCSI servers with RBDs; I can understand that this feature would prove more of a challenge if you are using Qemu and RBD.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer
Sent: 04 March 2015 08:40
To: ceph-users@lists.ceph.com
Cc: Nick Fisk
Subject: Re: [ceph-users] Persistent Write Back Cache

Hello,

If I understand you correctly, you're talking about the rbd cache on the client side. So assume that host, or the cache SSD in it, fails terminally. The client thinks its sync'ed writes are on the permanent storage (the actual Ceph storage cluster), while they are only present locally. Restarting that service or VM on a different host now has to deal with likely crippling data corruption.

Regards,

Christian

On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote:

Hi All,

Is there anything in the pipeline to add the ability to write the librbd cache to SSD so that it can safely ignore sync requests? I have seen a thread a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it.

I've been running lots of tests on our new cluster; buffered/parallel performance is amazing (40K read / 10K write iops), very impressed. However sync writes are actually quite disappointing. Running fio with a 128k block size and depth=1 normally only gives me about 300 iops or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSDs, and from what I hear that's about normal, so I don't think I have a Ceph config problem. For applications which do a lot of syncs, like ESXi over iSCSI or SQL databases, this has a major performance impact.
Traditional storage arrays work around this problem by having a battery-backed cache whose latency is 10-100 times lower than what you can currently achieve with Ceph and an SSD. Whilst librbd does have a writeback cache, from what I understand it will not cache syncs, so in my use case it effectively acts like a write-through cache.

To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k, iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512kb IOs at a high queue depth, which Ceph easily swallows.

If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

Nick

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine
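[Editor's note: for anyone wanting to reproduce the sync-write numbers Nick quotes, a fio job along these lines should do it. This is only a sketch; the device path, runtime and job name are assumptions, not taken from the original mail.]

  ; rbd-sync-write.fio -- hypothetical job file
  [rbd-sync-write]
  ioengine=libaio        ; or fio's rbd engine (pool=/rbdname=) on builds that have it
  direct=1
  sync=1                 ; O_SYNC: every write must be acknowledged as stable
  rw=write
  bs=128k
  iodepth=1
  runtime=60
  filename=/dev/rbd0     ; assumed krbd mapping; adjust to your device

Run with "fio rbd-sync-write.fio"; with sync=1 and iodepth=1 the result is dominated by per-write round-trip latency, which is the behaviour being discussed above.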
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Yes, good idea. I was looking at the «WBThrottle» feature, but I'll go for logging instead.

Le mercredi 04 mars 2015 à 17:10 +0100, Alexandre DERUMIER a écrit :

Only writes ;)

OK, so maybe some background operations (snap trimming, scrubbing...). Maybe debug_osd=20 could give you more logs?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 16:42:13
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Only writes ;)

Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit :

The change is only on the OSDs (and not on the OSD journals).

Do you see twice the iops for both read and write? If only reads, maybe a readahead bug could explain this.

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 15:13:30
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by Ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).

Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit :

The load problem is permanent: I have twice the IO/s on HDD since firefly.

Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in "ceph -w" stats? Is the ceph health OK?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 14:49:41
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Thanks Alexandre. The load problem is permanent: I have twice the IO/s on HDD since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behaviour of the journal, or something like that, but I didn't find anything about that.

Olivier

Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit :

Hi, maybe this is related?:

http://tracker.ceph.com/issues/9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk

http://tracker.ceph.com/issues/9487
dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think it's already backported in dumpling, not sure it's already done for firefly.

Alexandre

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 12:10:30
Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly

Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let them recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling.
The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).

The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1.
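[Editor's note: the settings discussed in this thread can be changed at runtime without restarting the OSDs. A sketch only -- the values are the ones mentioned above, not a recommendation, and osd.70 is used purely because it is the OSD shown in the graphs:]

  # raise the OSD debug level temporarily to capture the extra write activity
  ceph tell osd.70 injectargs '--debug-osd 20'

  # throttle snapshot trimming across all OSDs
  ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.8 --osd-pg-max-concurrent-snap-trims 1'

  # revert the debug level once the logs have been collected
  ceph tell osd.70 injectargs '--debug-osd 0/5'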
Re: [ceph-users] Implement replication network with live cluster
If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
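[Editor's note: for reference, the change being discussed is just the two network directives in ceph.conf, pushed to every node before the rolling OSD restarts. A sketch -- the subnets below are placeholders, not Andrija's actual addressing:]

  [global]
      # existing client-facing network
      public network  = 192.168.1.0/24
      # new dedicated replication/heartbeat network
      cluster network = 10.10.10.0/24

OSDs only pick up the cluster address when their daemon restarts, so restarting them one at a time keeps the cluster available throughout.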
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Only writes ;)

Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit :

The change is only on the OSDs (and not on the OSD journals).

Do you see twice the iops for both read and write? If only reads, maybe a readahead bug could explain this.

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 15:13:30
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by Ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).

Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit :

The load problem is permanent: I have twice the IO/s on HDD since firefly.

Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in "ceph -w" stats? Is the ceph health OK?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 14:49:41
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Thanks Alexandre. The load problem is permanent: I have twice the IO/s on HDD since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behaviour of the journal, or something like that, but I didn't find anything about that.

Olivier

Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit :

Hi, maybe this is related?:

http://tracker.ceph.com/issues/9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk

http://tracker.ceph.com/issues/9487
dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think it's already backported in dumpling, not sure it's already done for firefly.

Alexandre

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 12:10:30
Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly

Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let them recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling.

The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).
The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1.

So. Any idea about what's happening? Thanks for any help,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Only writes ;)

OK, so maybe some background operations (snap trimming, scrubbing...). Maybe debug_osd=20 could give you more logs?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 16:42:13
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Only writes ;)

Le mercredi 04 mars 2015 à 16:19 +0100, Alexandre DERUMIER a écrit :

The change is only on the OSDs (and not on the OSD journals).

Do you see twice the iops for both read and write? If only reads, maybe a readahead bug could explain this.

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 15:13:30
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by Ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).

Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit :

The load problem is permanent: I have twice the IO/s on HDD since firefly.

Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in "ceph -w" stats? Is the ceph health OK?

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: aderumier aderum...@odiso.com
Cc: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 14:49:41
Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly

Thanks Alexandre. The load problem is permanent: I have twice the IO/s on HDD since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behaviour of the journal, or something like that, but I didn't find anything about that.

Olivier

Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit :

Hi, maybe this is related?:

http://tracker.ceph.com/issues/9503
Dumpling: removing many snapshots in a short time makes OSDs go berserk

http://tracker.ceph.com/issues/9487
dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think it's already backported in dumpling, not sure it's already done for firefly.

Alexandre

- Mail original -
De: Olivier Bonvalet ceph.l...@daevel.fr
À: ceph-users ceph-users@lists.ceph.com
Envoyé: Mercredi 4 Mars 2015 12:10:30
Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly

Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had failing OSDs: some of them were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let them recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling.
The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).

The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1.

So. Any idea about what's happening? Thanks for any help,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Implement replication network with live cluster
That was my thought, yes - I found this blog that confirms what you are saying, I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/

I will do that... Thx. I guess it doesn't matter, since my crush map will still reference old OSDs that are stopped (and the cluster resynced after that)?

Thx again for the help

On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote:

If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] CEPH hardware recommendations and cluster design questions
Hi!

I've seen the documentation at http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much...

So, from what I've seen, for mon and mds any cheap 6-core, 16+ GB RAM AMD box would do... what puzzles me is the per-daemon construct... Why would I need/require multiple daemons? With separate servers (3 mon + 1 mds - I understood that this is the requirement) I imagine that each will run a single type of daemon.. did I miss something? (Unless maybe there is a relation between daemons and block devices, and for each block device there should be a daemon?)

For mon and mds: would it help the clients if these are on 10 GbE?

For osd: I plan to use a 36-disk server as the OSD server (ZFS RAIDZ3 across all disks + 2 SSDs mirrored for ZIL and L2ARC) - that would give me ~132 TB. How much RAM would I really need? (128 GB would be way too much I think.) (That RAIDZ3 over 36 disks is just a thought - I also have choices like: 2 x 18-disk RAIDZ2; 34-disk RAIDZ3 + 2 hot spares.)

Regarding journal and scrubbing: by using ZFS I would think that I can safely skip the Ceph ones... is this OK?

Do you have any other advice and recommendations for me? (The read:write ratio will be 10:1.)

Thank you!!
Adrian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph Cluster Address
On Tue, Mar 3, 2015 at 9:26 AM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote:

Hi,

I have a ceph cluster that is contained within a rack (1 monitor and 5 OSD nodes). I kept the same public and private address in the configuration. I do have 2 NICs and 2 valid IP addresses (one internal-only and one external) for each machine. Is it possible now, to change the public network address after the cluster is up and running? I used ceph-deploy for the cluster. If I change the address of the public network in ceph.conf, do I need to propagate it to all the machines in the cluster, or is the monitor node enough?

You'll need to change the config on each node and then restart it so that the OSDs will bind to the new location. The OSDs will let you do this on a rolling basis, but the networks will need to be routable to each other.

Note that changing the addresses on the monitors (I can't tell if you want to do that) is much more difficult; it's probably easiest to remove one at a time from the cluster and then recreate it with its new IP. (There are docs on how to do this.)
-Greg

Thanks
Pankaj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
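[Editor's note: a sketch of the rolling change Greg describes, per OSD node. The subnet, OSD id and init command are placeholders/assumptions -- adjust for your distro (sysvinit shown here) and your actual addressing:]

  # 1. update ceph.conf on the node
  [global]
      public network = 203.0.113.0/24    ; new external subnet (placeholder)

  # 2. restart the OSDs on that node one at a time
  service ceph restart osd.3

  # 3. confirm the OSD re-registered with its new address before moving on
  ceph osd dump | grep "^osd.3 "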
Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?
Just to get more specific: the reason you can apparently write stuff to a file when you can't write to the pool it's stored in is because the file data is initially stored in cache. The flush out to RADOS, when it happens, will fail. It would definitely be preferable if there were some way to immediately return a permission or IO error in this case, but so far we haven't found one; the relevant interfaces just aren't present and it's unclear how to propagate the data back to users in a way that makes sense even if they were. :/
-Greg

On Wed, Mar 4, 2015 at 3:37 AM, SCHAER Frederic frederic.sch...@cea.fr wrote:

Hi,

Many thanks for the explanations. I haven't used the nodcache option when mounting cephfs; it actually got there by default. My mount command is/was:

# mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret

I don't know what causes this option to be the default, maybe it's the kernel module I compiled from git (because there is no kmod-ceph or kmod-rbd in any RHEL-like distribution except RHEV), I'll try to update/check...

Concerning the rados pool ls, indeed: I created empty files in the pool, and they were not showing up, probably because they were just empty - but when I create a non-empty file, I do see things in rados ls...

Thanks again
Frederic

-----Message d'origine-----
De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de John Spray
Envoyé : mardi 3 mars 2015 17:15
À : ceph-users@lists.ceph.com
Objet : Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?

On 03/03/2015 15:21, SCHAER Frederic wrote:

By the way: it looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm):

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ]

(umount /mnt ...)

[root@ceph0 ~]# ceph fs ls
name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ]

This is probably #10288, which was fixed in 0.87.1

So, I have this pool named root that I added to the cephfs filesystem. I then edited the filesystem xattrs:

[root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root
getfattr: Removing leading '/' from absolute path names
# file: mnt/root
ceph.dir.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root"

I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool... but that is not the case. On another machine where I mounted cephfs using the client.puppet key, I can do this (the mount was done with the client.puppet key, not the admin one, which is not deployed on that node):

1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache)

[root@dev7248 ~]# echo not allowed > /mnt/root/secret.notfailed
[root@dev7248 ~]#
[root@dev7248 ~]# cat /mnt/root/secret.notfailed
not allowed

This is data you're seeing from the page cache, it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set.
And I can even see the xattrs inherited from the parent dir:

[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root"

Whereas on the node where I mounted cephfs as the ceph admin, I get nothing:

[root@ceph0 ~]# cat /mnt/root/secret.notfailed
[root@ceph0 ~]# ls -l /mnt/root/secret.notfailed
-rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed

After some time, the file also becomes empty on the puppet client host:

[root@dev7248 ~]# cat /mnt/root/secret.notfailed
[root@dev7248 ~]#

(but the metadata remained ?)

Right -- eventually the cache goes away, and you see the true (empty) state of the file.

Also, as an unprivileged user, I can take ownership of a secret file by changing the extended attribute:

[root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed
[root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed
getfattr: Removing leading '/' from absolute path names
# file: mnt/root/secret.notfailed
ceph.file.layout="stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet"

Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could
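[Editor's note: a quick way to follow John's suggestion and test with O_DIRECT rather than through the page cache -- a sketch only, reusing the path from Frederic's example:]

  # write bypassing the client page cache; with no write caps on the pool this
  # should fail rather than falsely appear to succeed from cache
  dd if=/dev/zero of=/mnt/root/secret.notfailed bs=4M count=1 oflag=direct

  # read back bypassing the cache as well
  dd if=/mnt/root/secret.notfailed of=/dev/null bs=4M iflag=direct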
Re: [ceph-users] Implement replication network with live cluster
On 03/04/2015 05:44 PM, Robert LeBlanc wrote:

If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

In the OSDMap each OSD has a public and a cluster network address. If the cluster network address is not set, replication to that OSD will be done over the public network. So you can push a new configuration to all OSDs and restart them one by one. Make sure the network is of course up and running and it should work.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
You will most likely have a very high relocation percentage. Backfills are always more impactful on smaller clusters, but osd max backfills should be what you need to help reduce the impact. The default is 10; you will want to use 1. I didn't catch which version of Ceph you are running, but I think there was some priority work done in firefly to help make backfills lower priority. I think it has gotten better in later versions.

On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Thank you Robert - I'm wondering, when I remove a total of 7 OSDs from the crush map, whether that will cause more than 37% of the data to be moved (80% or whatever).

I'm also wondering if the throttling that I applied is fine or not - I will introduce osd_recovery_delay_start 10 sec as Irek said. I'm just wondering how big the performance impact will be, because:
- when stopping an OSD, the impact while backfilling was fine, more or less - I can live with this
- when I removed the OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less but still noticeable...

Thanks for the tip of course!

Andrija

On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote:

I would be inclined to shut down both OSDs in a node and let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover, remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything into its final place. If you are still adding new nodes, you can add them in while nobackfill and norecover are set, so that the one big relocation fills the new drives too.

On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Thx Irek. The number of replicas is 3.

I have 3 servers with 2 OSDs each on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers of 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network...

So you suggest removing the whole node with its 2 OSDs manually from the crush map? To my knowledge, Ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So it should anyway be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3...?

Thx again for your time

On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote:

Since you have only three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones.

2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com:

What is your replica count?

2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Hi Irek,

yes, stopping the OSD (or setting it OUT) resulted in only 3% of data degraded and moved/recovered. It was when I afterwards removed it from the crush map (ceph osd crush rm id) that the thing with 37% happened.

And thanks Irek for the help - could you kindly just let me know the preferred steps when removing a whole node? Do you mean I should first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data being misplaced and moved around?

Sorry for bugging you, I really appreciate your help.
Thanks

On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote:

A large percentage of the cluster map gets rebuilt (but a low percentage of degradation). If you had not done ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually.

2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com:

Another question - I mentioned 37% of objects being moved around here - these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the crush map (out of 44 OSDs or so).

Can anybody confirm this is normal behaviour - and are there any workarounds?

I understand this is because of Ceph's object placement algorithm, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large. It seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I can potentially end up with 7 x the same number of misplaced objects...?

Any thoughts?

Thanks

On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote:

Thanks Irek. Does this mean that after peering for each PG, there will be a delay of 10 sec, meaning that every once in a while, I will have 10 sec of the cluster
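[Editor's note: for anyone applying Robert's advice, a sketch of the throttling and flag handling discussed above, run from a monitor node. The numbers are the ones mentioned in the thread, not a tuning recommendation:]

  # throttle recovery/backfill on all OSDs without restarting them
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-delay-start 10'

  # before editing the CRUSH map, stop data movement
  ceph osd set nobackfill
  ceph osd set norecover

  # ... remove the decommissioned hosts/OSDs from the CRUSH map here ...

  # then let the single large rebalance start
  ceph osd unset nobackfill
  ceph osd unset norecover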
Re: [ceph-users] Implement replication network with live cluster
If the data has been replicated to the new OSDs, the cluster will be able to function properly even with the old ones down, or reachable only on the public network.

On Wed, Mar 4, 2015 at 9:49 AM, Andrija Panic andrija.pa...@gmail.com wrote:

I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)?

I wanted to say: it doesn't matter (I guess?) that my crush map is still referencing old OSD nodes that are already stopped. Tired, sorry...

On 4 March 2015 at 17:48, Andrija Panic andrija.pa...@gmail.com wrote:

That was my thought, yes - I found this blog that confirms what you are saying, I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/

I will do that... Thx. I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)?

Thx again for the help

On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote:

If I remember right, someone has done this on a live cluster without any issues. I seem to remember that there is a fallback mechanism: if the OSDs can't be reached on the cluster network, they are contacted on the public network. You could test it pretty easily without much impact: take one OSD that has both networks, configure it, and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time.

On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote:

Hi,

I'm running a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected before the replication network is completely implemented?

Another related question: I'm also demoting some old OSDs, on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem that the replication network WILL work on the new nodes/OSDs, but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I should remove the OSDs from the crush map first, with nobackfill and norecover set (so no rebalancing happens), and then implement the replication network?

Sorry for the long post, but... Thanks,

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Andrija Panić

--
Andrija Panić

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
On Wed, 4 Mar 2015, Adrian Sevcenco wrote:

Hi! I've seen the documentation at http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what I've seen, for mon and mds any cheap 6-core, 16+ GB RAM AMD box would do ... what puzzles me is the per-daemon construct ... Why would I need/require multiple daemons? With separate servers (3 mon + 1 mds - I understood that this is the requirement) I imagine that each will run a single type of daemon.. did I miss something? (Unless maybe there is a relation between daemons and block devices, and for each block device there should be a daemon?)

There is normally a ceph-osd daemon per disk.

for mon and mds : would it help the clients if these are on 10 GbE?

For the MDS latency is important, so possibly!

for osd : I plan to use a 36-disk server as the OSD server (ZFS RAIDZ3 across all disks + 2 SSDs mirrored for ZIL and L2ARC) - that would give me ~132 TB. How much RAM would I really need? (128 GB would be way too much I think.) (That RAIDZ3 over 36 disks is just a thought - I also have choices like: 2 x 18-disk RAIDZ2; 34-disk RAIDZ3 + 2 hot spares.)

Usually Ceph is deployed without RAID underneath. You can use it, though--Ceph doesn't really care. Performance just tends to be lower compared to a ceph-osd daemon per disk. Note that there is some support for ZFS but it is not tested by us at all, so you'll be mostly on your own. I know a few users have had success here but I have no idea how busy their clusters are. Be careful!

Regarding journal and scrubbing : by using ZFS I would think that I can safely skip the Ceph ones ... is this OK?

You still want Ceph scrubbing, as it verifies that the replicas don't get out of sync. Maybe you could forgo deep scrubbing, but it may make more sense to disable ZFS scrubbing and let Ceph drive it, so you get things verified through the whole stack...

sage

Do you have any other advice and recommendations for me? (The read:write ratio will be 10:1.)

Thank you!!
Adrian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
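[Editor's note: if someone does want to experiment with the "forgo deep scrubbing" idea, a sketch of the knobs involved -- these are examples of the options, not a recommendation from the thread, and the interval value is arbitrary:]

  # disable deep scrubs cluster-wide (regular scrubs keep running)
  ceph osd set nodeep-scrub

  # or, instead, stretch the deep-scrub interval via ceph.conf, e.g.:
  [osd]
      osd deep scrub interval = 2592000    ; 30 days in seconds (default is weekly)

  # re-enable later with
  ceph osd unset nodeep-scrub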
Re: [ceph-users] Persistent Write Back Cache
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of John Spray
Sent: 04 March 2015 11:34
To: Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Persistent Write Back Cache

On 04/03/2015 08:26, Nick Fisk wrote:

To illustrate the difference a proper write-back cache can make, I put a 1GB (512MB dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of the SSD used by flashcache, as everything is stored as 4k blocks on the SSD. In fact, since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to the max speed of the SSD. Looking at iostat I can see all the IOs are getting coalesced into nice large 512kb IOs at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance, and having it integrated would be really neat.

What are you hoping to gain from building something into Ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like since you would anyway need to handle your HA configuration, setting up the actual cache device would be the simple part.

Cheers,
John

Hi John,

I guess it's to make things easier, rather than having to run a huge stack of different technologies to achieve the same goal, especially when half of the caching logic is already in Ceph. It would be really nice, and would drive adoption, if you could add an SSD, set a config option, and suddenly have a storage platform that performs 10x faster.

Another way of handling it might be for librbd to be pointed at a UUID instead of a /dev/sd* device. That way librbd knows which cache device to look for and will error out if the cache device is missing. These cache devices could then be presented to all the necessary servers via iSCSI or something similar if the RBD needs to move around.

Nick

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
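[Editor's note: for anyone wanting to reproduce Nick's flashcache experiment, the setup is roughly as follows. This is only a sketch: the device names are placeholders, the sysctl naming pattern is from memory of the flashcache docs (check your version), and the exact flush/dirty tuning Nick used is not given in the thread.]

  # create a 1GB writeback cache on an SSD partition in front of a mapped RBD
  flashcache_create -p back -s 1g rbd_cache /dev/sdb1 /dev/rbd0
  # the cached device then appears as /dev/mapper/rbd_cache

  # raise the dirty threshold towards the ~50% (512MB) figure Nick mentions;
  # sysctl nodes are named dev.flashcache.<ssd>+<disk>.* in flashcache
  sysctl -w dev.flashcache.sdb1+rbd0.dirty_thresh_pct=50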
Re: [ceph-users] Implement replication network with live cluster
Thx Wido, I needed this confirmations - thanks! On 4 March 2015 at 17:49, Wido den Hollander w...@42on.com wrote: On 03/04/2015 05:44 PM, Robert LeBlanc wrote: If I remember right, someone has done this on a live cluster without any issues. I seem to remember that it had a fallback mechanism if the OSDs couldn't be reached on the cluster network to contact them on the public network. You could test it pretty easily without much impact. Take one OSD that has both networks and configure it and restart the process. If all the nodes (specifically the old ones with only one network) is able to connect to it, then you are good to go by restarting one OSD at a time. In the OSDMap each OSD has a public and cluster network address. If the cluster network address is not set, replication to that OSD will be done over the public network. So you can push a new configuration to all OSDs and restart them one by one. Make sure the network is ofcourse up and running and it should work. On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote: Hi, I'm having a live cluster with only public network (so no explicit network configuraion in the ceph.conf file) I'm wondering what is the procedure to implement dedicated Replication/Private and Public network. I've read the manual, know how to do it in ceph.conf, but I'm wondering since this is already running cluster - what should I do after I change ceph.conf on all nodes ? Restarting OSDs one by one, or... ? Is there any downtime expected ? - for the replication network to actually imlemented completely. Another related quetion: Also, I'm demoting some old OSDs, on old servers, I will have them all stoped, but would like to implement replication network before actually removing old OSDs from crush map - since lot of data will be moved arround. My old nodes/OSDs (that will be stoped before I implement replication network) - do NOT have dedicated NIC for replication network, in contrast to new nodes/OSDs. So there will be still reference to these old OSD in the crush map. Will this be a problem - me changing/implementing replication network that WILL work on new nodes/OSDs, but not on old ones since they don't have dedicated NIC ? I guess not since old OSDs are stoped anyway, but would like opinion. Or perhaps i might remove OSD from crush map with prior seting of nobackfill and norecover (so no rebalancing happens) and then implement replication netwotk? Sorry for old post, but... Thanks, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Ceph trainer and consultant Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Implement replication network with live cluster
Thx again - I really appreciatethe help guys ! On 4 March 2015 at 17:51, Robert LeBlanc rob...@leblancnet.us wrote: If the data have been replicated to new OSDs, it will be able to function properly even them them down or only on the public network. On Wed, Mar 4, 2015 at 9:49 AM, Andrija Panic andrija.pa...@gmail.com wrote: I guess it doesnt matter, since my Crush Map will still refernce old OSDs, that are stoped (and cluster resynced after that) ? I wanted to say: it doesnt matter (I guess?) that my Crush map is still referencing old OSD nodes that are already stoped. Tired, sorry... On 4 March 2015 at 17:48, Andrija Panic andrija.pa...@gmail.com wrote: That was my thought, yes - I found this blog that confirms what you are saying I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/ I will do that... Thx I guess it doesnt matter, since my Crush Map will still refernce old OSDs, that are stoped (and cluster resynced after that) ? Thx again for the help On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote: If I remember right, someone has done this on a live cluster without any issues. I seem to remember that it had a fallback mechanism if the OSDs couldn't be reached on the cluster network to contact them on the public network. You could test it pretty easily without much impact. Take one OSD that has both networks and configure it and restart the process. If all the nodes (specifically the old ones with only one network) is able to connect to it, then you are good to go by restarting one OSD at a time. On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote: Hi, I'm having a live cluster with only public network (so no explicit network configuraion in the ceph.conf file) I'm wondering what is the procedure to implement dedicated Replication/Private and Public network. I've read the manual, know how to do it in ceph.conf, but I'm wondering since this is already running cluster - what should I do after I change ceph.conf on all nodes ? Restarting OSDs one by one, or... ? Is there any downtime expected ? - for the replication network to actually imlemented completely. Another related quetion: Also, I'm demoting some old OSDs, on old servers, I will have them all stoped, but would like to implement replication network before actually removing old OSDs from crush map - since lot of data will be moved arround. My old nodes/OSDs (that will be stoped before I implement replication network) - do NOT have dedicated NIC for replication network, in contrast to new nodes/OSDs. So there will be still reference to these old OSD in the crush map. Will this be a problem - me changing/implementing replication network that WILL work on new nodes/OSDs, but not on old ones since they don't have dedicated NIC ? I guess not since old OSDs are stoped anyway, but would like opinion. Or perhaps i might remove OSD from crush map with prior seting of nobackfill and norecover (so no rebalancing happens) and then implement replication netwotk? Sorry for old post, but... Thanks, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Andrija Panić -- Andrija Panić -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.93 Hammer release candidate released
On Wed, 4 Mar 2015, Thomas Lemarchand wrote: Thanks to all Ceph developers for the good work ! I see some love given to CephFS. When will you consider CephFS to be production ready ? The key missing piece is fsck (check and repair). That's where our efforts are focused now. I think infernalis will have something pretty reasonable? I use CephFS in production since Giant, and apart from the cache pressure health warning bug, now resolved, I didn't have a single problem. That's great to hear! sage -- Thomas Lemarchand Cloud Solutions SAS - Responsable des systèmes d'information On ven., 2015-02-27 at 14:10 -0800, Sage Weil wrote: This is the first release candidate for Hammer, and includes all of the features that will be present in the final release. We welcome and encourage any and all testing in non-production clusters to identify any problems with functionality, stability, or performance before the final Hammer release. We suggest some caution in one area: librbd. There is a lot of new functionality around object maps and locking that is disabled by default but may still affect stability for existing images. We are continuing to shake out those bugs so that the final Hammer release (probably v0.94) will be stable. Major features since Giant include: * cephfs: journal scavenger repair tool (John Spray) * crush: new and improved straw2 bucket type (Sage Weil, Christina Anderson, Xiaoxi Chen) * doc: improved guidance for CephFS early adopters (John Spray) * librbd: add per-image object map for improved performance (Jason Dillaman) * librbd: copy-on-read (Min Chen, Li Wang, Yunchuan Wen, Cheng Cheng) * librados: fadvise-style IO hints (Jianpeng Ma) * mds: many many snapshot-related fixes (Yan, Zheng) * mon: new 'ceph osd df' command (Mykola Golub) * mon: new 'ceph pg ls ...' command (Xinxin Shu) * osd: improved performance for high-performance backends * osd: improved recovery behavior (Samuel Just) * osd: improved cache tier behavior with reads (Zhiqiang Wang) * rgw: S3-compatible bucket versioning support (Yehuda Sadeh) * rgw: large bucket index sharding (Guang Yang, Yehuda Sadeh) * RDMA xio messenger support (Matt Benjamin, Vu Pham) Upgrading - * No special restrictions when upgrading from firefly or giant Notable Changes --- * build: CMake support (Ali Maredia, Casey Bodley, Adam Emerson, Marcus Watts, Matt Benjamin) * ceph-disk: do not re-use partition if encryption is required (Loic Dachary) * ceph-disk: support LUKS for encrypted partitions (Andrew Bartlett, Loic Dachary) * ceph-fuse,libcephfs: add support for O_NOFOLLOW and O_PATH (Greg Farnum) * ceph-fuse,libcephfs: resend requests before completing cap reconnect (#10912 Yan, Zheng) * ceph-fuse: select kernel cache invalidation mechanism based on kernel version (Greg Farnum) * ceph-objectstore-tool: improved import (David Zafman) * ceph-objectstore-tool: misc improvements, fixes (#9870 #9871 David Zafman) * ceph: add 'ceph osd df [tree]' command (#10452 Mykola Golub) * ceph: fix 'ceph tell ...' 
command validation (#10439 Joao Eduardo Luis) * ceph: improve 'ceph osd tree' output (Mykola Golub) * cephfs-journal-tool: add recover_dentries function (#9883 John Spray) * common: add newline to flushed json output (Sage Weil) * common: filtering for 'perf dump' (John Spray) * common: fix Formatter factory breakage (#10547 Loic Dachary) * common: make json-pretty output prettier (Sage Weil) * crush: new and improved straw2 bucket type (Sage Weil, Christina Anderson, Xiaoxi Chen) * crush: update tries stats for indep rules (#10349 Loic Dachary) * crush: use larger choose_tries value for erasure code rulesets (#10353 Loic Dachary) * debian,rpm: move RBD udev rules to ceph-common (#10864 Ken Dreyer) * debian: split python-ceph into python-{rbd,rados,cephfs} (Boris Ranto) * doc: CephFS disaster recovery guidance (John Spray) * doc: CephFS for early adopters (John Spray) * doc: fix OpenStack Glance docs (#10478 Sebastien Han) * doc: misc updates (#9793 #9922 #10204 #10203 Travis Rhoden, Hazem, Ayari, Florian Coste, Andy Allan, Frank Yu, Baptiste Veuillez-Mainard, Yuan Zhou, Armando Segnini, Robert Jansen, Tyler Brekke, Viktor Suprun) * doc: replace cloudfiles with swiftclient Python Swift example (Tim Freund) * erasure-code: add mSHEC erasure code support (Takeshi Miyamae) * erasure-code: improved docs (#10340 Loic Dachary) * erasure-code: set max_size to 20 (#10363 Loic Dachary) * libcephfs,ceph-fuse: fix getting zero-length xattr (#10552 Yan, Zheng) * librados: add blacklist_add convenience method (Jason Dillaman) * librados: expose rados_{read|write}_op_assert_version in C API (Kim Vandry) * librados: fix pool name caching (#10458 Radoslaw Zarzynski) * librados: fix resource leak, misc bugs
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
Hi, for hardware, Inktank has good guides here: http://www.inktank.com/resource/inktank-hardware-selection-guide/ http://www.inktank.com/resource/inktank-hardware-configuration-guide/ Ceph works well with multiple OSD daemons (one OSD per disk), so you should not use RAID. (xfs is the recommended fs for osd daemons.) You don't need spare disks either, just enough disk space to handle a disk failure. (Data is replicated/rebalanced onto the other disks/OSDs in case of a disk failure.)
- Mail original - De: Adrian Sevcenco adrian.sevce...@cern.ch À: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 18:30:31 Objet: [ceph-users] CEPH hardware recommendations and cluster design questions Hi! I seen the documentation http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what i seen for mon and mds any cheap 6 core 16+ gb ram amd would do ... what puzzles me is that per daemon construct ... Why would i need/require to have multiple daemons? with separate servers (3 mon + 1 mds - i understood that this is the requirement) i imagine that each will run a single type of daemon.. did i miss something? (beside that maybe is a relation between daemons and block devices and for each block device should be a daemon?) for mon and mds : would help the clients if these are on 10 GbE? for osd : i plan to use a 36 disk server as osd server (ZFS RAIDZ3 all disks + 2 ssds mirror for ZIL and L2ARC) - that would give me ~ 132 TB how much ram i would really need? (128 gb would be way to much i think) (that RAIDZ3 for 36 disks is just a thought - i have also choices like: 2 X 18 RAIDZ2 ; 34 disks RAIDZ3 + 2 hot spare) Regarding journal and scrubbing : by using ZFS i would think that i can safely not use the CEPH ones ... is this ok? Do you have some other advises and recommendations for me? (the read:writes ratios will be 10:1) Thank you!! Adrian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Persistent Write Back Cache
On 03/04/2015 05:34 AM, John Spray wrote: On 04/03/2015 08:26, Nick Fisk wrote: To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO’s are getting coalesced into nice large 512kb IO’s at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat. What are you hoping to gain from building something into ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like since you would anyway need to handle your HA configuration, setting up the actual cache device would be the simple part. Agreed regarding flashcache/bcache/dm-cache. I suspect improving an existing project rather than reinventing it ourselves would be the way to go. It may also be worth looking at Luis's work, though I note that he specifically says write-through: http://vault2015.sched.org/event/6cc56a5b8a95ead46961697028b59c39#.VPc0uX-etWQ https://github.com/pblcache/pblcache Cheers, John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Implement replication network with live cluster
I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)? I wanted to say: it doesn't matter (I guess?) that my Crush map is still referencing old OSD nodes that are already stopped. Tired, sorry... On 4 March 2015 at 17:48, Andrija Panic andrija.pa...@gmail.com wrote: That was my thought, yes - I found this blog that confirms what you are saying I guess: http://www.sebastien-han.fr/blog/2012/07/29/tip-ceph-public-slash-private-network-configuration/ I will do that... Thx I guess it doesn't matter, since my Crush Map will still reference old OSDs that are stopped (and the cluster resynced after that)? Thx again for the help On 4 March 2015 at 17:44, Robert LeBlanc rob...@leblancnet.us wrote: If I remember right, someone has done this on a live cluster without any issues. I seem to remember that it had a fallback mechanism if the OSDs couldn't be reached on the cluster network to contact them on the public network. You could test it pretty easily without much impact. Take one OSD that has both networks, configure it and restart the process. If all the nodes (specifically the old ones with only one network) are able to connect to it, then you are good to go by restarting one OSD at a time. On Wed, Mar 4, 2015 at 4:17 AM, Andrija Panic andrija.pa...@gmail.com wrote: Hi, I have a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated Replication/Private and Public network. I've read the manual and know how to do it in ceph.conf, but I'm wondering, since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart OSDs one by one, or...? Is there any downtime expected for the replication network to actually be implemented completely? Another related question: I'm also demoting some old OSDs on old servers. I will have them all stopped, but would like to implement the replication network before actually removing the old OSDs from the crush map - since a lot of data will be moved around. My old nodes/OSDs (that will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will this be a problem - me changing/implementing a replication network that WILL work on the new nodes/OSDs, but not on the old ones since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I might remove the OSDs from the crush map with prior setting of nobackfill and norecover (so no rebalancing happens) and then implement the replication network? Sorry for the long post, but... Thanks, -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Andrija Panić -- Andrija Panić ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
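For reference, a minimal sketch of the kind of ceph.conf change being discussed (the subnets below are made-up placeholders, not taken from Andrija's setup); once it is in place on every node, OSDs can be restarted one at a time so they re-bind to the cluster network:

    [global]
        public network  = 192.168.0.0/24
        cluster network = 10.10.0.0/24

OSDs with no NIC on the cluster network should, per the fallback behaviour Robert describes above, still be reachable over the public network, but as he suggests it is worth testing on a single OSD first.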
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Hi Robert, I already have this stuff set. Ceph is 0.87.0 now... Thanks, will schedule this for the weekend; with the 10G network and 36 OSDs it should move the data in less than 8h - per my last experience it was around 8h, but some 1G OSDs were included... Thx! On 4 March 2015 at 17:49, Robert LeBlanc rob...@leblancnet.us wrote: You will most likely have a very high relocation percentage. Backfills are always more impactful on smaller clusters, but osd max backfills should be what you need to help reduce the impact. The default is 10, you will want to use 1. I didn't catch which version of Ceph you are running, but I think there was some priority work done in firefly to help make backfills lower priority. I think it has gotten better in later versions. On Wed, Mar 4, 2015 at 1:35 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thank you Robert - I'm wondering, when I remove a total of 7 OSDs from the crush map, whether that will cause more than 37% of data to be moved (80% or whatever). I'm also wondering if the throttling that I applied is fine or not - I will introduce osd_recovery_delay_start 10sec as Irek said. I'm just wondering how much the performance impact will be, because: - when stopping an OSD, the impact while backfilling was fine more or less - I can live with this - when I removed an OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less but still noticeable... Thanks for the tip of course! Andrija On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote: I would be inclined to shut down both OSDs in a node, let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover. Then remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything in the final place. If you are still adding new nodes, when nobackfill and norecover is set, you can add them in so that the one big relocate fills the new drives too. On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thx Irek. The number of replicas is 3. I have 3 servers with 2 OSDs on them on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network... So you suggest removing the whole node with 2 OSDs manually from the crush map? To my knowledge, ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So anyway it should be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3...? Thx again for your time On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote: Since you have only three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones. 2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com: How many replicas do you have? 2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Hi Irek, yes, stopping an OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. When I then removed it from the Crush map with ceph osd crush rm id, that's when the stuff with 37% happened. And thanks Irek for the help - could you kindly just let me know the preferred steps when removing a whole node?
Do you mean I first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover? Do you think this would result in less data misplaced and moved around? Sorry for bugging you, I really appreciate your help. Thanks On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote: A large percentage of the rebuild comes from the change to the cluster map (but a low percentage of degradation). If you had not done ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com : Another question - I mentioned here 37% of objects being moved around - these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the crush map (out of 44 OSDs or so). Can anybody confirm this is normal behaviour - and are there any workarounds? I understand this is because of the object placement algorithm of CEPH, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large? Seems not good to
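For anyone wanting to try the throttling discussed in this thread, a rough sketch of the knobs and flags mentioned above (the values are the ones from the thread, not general recommendations; injectargs applies them at runtime):

    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-recovery-delay-start 10'
    # before removing whole hosts from the CRUSH map:
    ceph osd set nobackfill
    ceph osd set norecover
    # ... remove the stopped OSDs/hosts from the crush map ...
    ceph osd unset nobackfill
    ceph osd unset norecover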
Re: [ceph-users] CEPH hardware recommendations and cluster design questions
To expand upon this, the very nature and existence of Ceph is to replace RAID. The FS itself replicates data and handles the HA functionality that you're looking for. If you're going to build a single server with all those disks, backed by a ZFS RAID setup, you're going to be much better suited with an iSCSI setup. The idea of ceph is that it takes the place of all the ZFS bells and whistles. A CEPH cluster that only has one OSD backed by that huge ZFS setup becomes just a wire-protocol to speak to the server. The magic in ceph comes from the replication and distribution of the data across many OSDs, hopefully living in many hosts. My own setup for instance uses 96 OSDs that are spread across 4 hosts (I know I know guys - CPU is a big deal with SSDs so 24 per host is a tall order - didn't know that when we built it - been working ok so far) that is then distributed between 2 cabinets on 2 separate cooling/power/data zones in our datacenter. My CRUSH map is currently setup for 3 copies of all data, and laid out so that at least one copy is located in each cabinet, and then the cab that gets the 2 copies also makes sure that each copy is on a different host. No RAID needed because ceph makes sure that I have a safe amount of copies of the data, in a distribution layout that allows us to sleep at night. In my opinion, ceph is much more pleasant, powerful, and versatile to deal with than both hardware RAID and ZFS (both of which we have instances of deployed as well from previous iterations of infrastructure deployments). Now, you could always create small little zRAID clusters using ZFS, and then give an OSD to each of those, if you wanted even an additional layer of safety. Heck, you could even have hardware RAID behind the zRAID, for even another layer. Where YOU need to make the decision is the trade-off between HA functionality/peace of mind, performance, and usability/maintainability. Would be happy to answer any questions you still have... Cheers, -- Stephen Mercier Senior Systems Architect Attainia, Inc. Phone: 866-288-2464 ext. 727 Email: stephen.merc...@attainia.com Web: www.attainia.com Capital equipment lifecycle planning budgeting solutions for healthcare On Mar 4, 2015, at 10:42 AM, Alexandre DERUMIER wrote: Hi for hardware, inktank have good guides here: http://www.inktank.com/resource/inktank-hardware-selection-guide/ http://www.inktank.com/resource/inktank-hardware-configuration-guide/ ceph works well with multiple osd daemon (1 osd by disk), so you should not use raid. (xfs is the recommended fs for osd daemons). you don't need disk spare too, juste enough disk space to handle a disk failure. (datas are replicated-rebalanced on other disks/osd in case of disk failure) - Mail original - De: Adrian Sevcenco adrian.sevce...@cern.ch À: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 18:30:31 Objet: [ceph-users] CEPH hardware recommendations and cluster design questions Hi! I seen the documentation http://ceph.com/docs/master/start/hardware-recommendations/ but those minimum requirements without some recommendations don't tell me much ... So, from what i seen for mon and mds any cheap 6 core 16+ gb ram amd would do ... what puzzles me is that per daemon construct ... Why would i need/require to have multiple daemons? with separate servers (3 mon + 1 mds - i understood that this is the requirement) i imagine that each will run a single type of daemon.. did i miss something?
(beside that maybe is a relation between daemons and block devices and for each block device should be a daemon?) for mon and mds : would help the clients if these are on 10 GbE? for osd : i plan to use a 36 disk server as osd server (ZFS RAIDZ3 all disks + 2 ssds mirror for ZIL and L2ARC) - that would give me ~ 132 TB how much ram i would really need? (128 gb would be way to much i think) (that RAIDZ3 for 36 disks is just a thought - i have also choices like: 2 X 18 RAIDZ2 ; 34 disks RAIDZ3 + 2 hot spare) Regarding journal and scrubbing : by using ZFS i would think that i can safely not use the CEPH ones ... is this ok? Do you have some other advises and recommendations for me? (the read:writes ratios will be 10:1) Thank you!! Adrian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
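As a rough illustration of the layout Stephen describes above (one copy in each of two cabinets, with the cabinet that holds two copies keeping them on separate hosts), a CRUSH rule along these lines would produce it. This is only a sketch, assuming a bucket type for the cabinet level (called rack here) exists in the hierarchy; it is not Stephen's actual map:

    rule replicated_two_cabs {
        ruleset 1
        type replicated
        min_size 3
        max_size 3
        step take default
        step choose firstn 2 type rack
        step chooseleaf firstn 2 type host
        step emit
    }

With pool size 3, CRUSH picks two racks, then two hosts in each, and uses the first three OSDs of that list: two copies on different hosts in the first rack and one copy in the second.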
Re: [ceph-users] v0.80.8 and librbd performance
On 03/03/2015 03:28 PM, Ken Dreyer wrote: On 03/03/2015 04:19 PM, Sage Weil wrote: Hi, This is just a heads up that we've identified a performance regression in v0.80.8 from previous firefly releases. A v0.80.9 is working its way through QA and should be out in a few days. If you haven't upgraded yet you may want to wait. Thanks! sage Hi Sage, I've seen a couple of Redmine tickets on this (e.g. http://tracker.ceph.com/issues/9854 , http://tracker.ceph.com/issues/10956). It's not totally clear to me which of the 70+ unreleased commits on the firefly branch fix this librbd issue. Is it only the three commits in https://github.com/ceph/ceph/pull/3410 , or are there more? Those are the only ones needed to fix the librbd performance regression, yes. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] The project of ceph client file system porting from Linux to AIX
I'd like to see a Solaris client. -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dennis Chen Sent: Wednesday, March 04, 2015 2:00 AM To: ceph-devel; ceph-users; Sage Weil; Loic Dachary Subject: [ceph-users] The project of ceph client file system porting from Linux to AIX Hello, The ceph cluster can currently only be used from Linux systems AFAICT, so I planned to port the ceph client file system from Linux to AIX as a tiered storage solution on that platform. Below is the source code repository I've been working on, which is still in progress. There are 3 important modules: 1. aixker: maintains a uniform kernel API between Linux and AIX 2. net: acts as the data transfer layer between the client and the cluster 3. fs: acts as an adaptor so that AIX can recognize the Linux file system. https://github.com/Dennis-Chen1977/aix-cephfs Any comments are welcome... -- Den ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] New EC pool undersized
Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it now looks like https://dpaste.de/OLEa I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did:
ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack
ceph osd pool create ec44pool 8192 8192 erasure ec44profile
After settling for a bit 'ceph status' gives
    cluster 196e5eb8-d6a7-4435-907e-ea028e946923
     health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized
     monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14
     osdmap e409: 144 osds: 144 up, 144 in
      pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects
            90598 MB used, 640 TB / 640 TB avail
                   7 active+undersized+degraded
               12281 active+clean
So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck'
ok
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
1.d77 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752
1.10fa 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571
1.1271 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135 [135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555
1.2b5 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 11:33:42.079673
1.7ae 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832
1.1a97 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850
1.10a6 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871
This appears to have a number on all these (2147483647) that is way out of line from what I would expect. Thoughts? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New EC pool undersized
Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:14 To: Kyle Hutson; Ceph Users Subject: Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kyle Hutson Sent: 04 March, 2015 12:06 To: Ceph Users Subject: [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it down looks like https://dpaste.de/OLEahttps://urldefense.proofpoint.com/v1/url?u=https://dpaste.de/OLEak=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=c1bd46dcd96e656554817882d7f6581903b1e3c6a50313f4bf7494acfd12b442 I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=6fe07b47a00235857630057e09cfb702dcddcea1d3f98d81a574020ee95dee44}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e409: 144 osds: 144 up, 144 in pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects 90598 MB used, 640 TB / 640 TB avail 7 active+undersized+degraded 12281 active+clean So to troubleshoot the undersized pgs, I issued 'ceph pg dump_stuck' ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 1.d77 00000000 active+undersized+degraded 2015-03-04 11:33:57.502849 0'0 408:12 [15,95,58,73,52,31,116,2147483647] 15 [15,95,58,73,52,31,116,2147483647] 15 0'0 2015-03-04 11:33:42.100752 0'0 2015-03-04 11:33:42.100752 1.10fa00000000 active+undersized+degraded 2015-03-04 11:34:29.362554 0'0 408:12 [23,12,99,114,132,53,56,2147483647] 23 [23,12,99,114,132,53,56,2147483647] 23 0'0 2015-03-04 11:33:42.168571 0'0 2015-03-04 11:33:42.168571 1.127100000000 active+undersized+degraded 2015-03-04 11:33:48.795742 0'0 408:12 [135,112,69,4,22,95,2147483647,83] 135 [135,112,69,4,22,95,2147483647,83] 135 0'0 2015-03-04 11:33:42.139555 0'0 2015-03-04 11:33:42.139555 1.2b5 00000000 active+undersized+degraded 2015-03-04 11:34:32.189738 0'0 408:12 [11,115,139,19,76,52,94,2147483647] 11 [11,115,139,19,76,52,94,2147483647] 11 0'0 2015-03-04 11:33:42.079673 0'0 2015-03-04 
11:33:42.079673 1.7ae 00000000 active+undersized+degraded 2015-03-04 11:34:26.848344 0'0 408:12 [27,5,132,119,94,56,52,2147483647] 27 [27,5,132,119,94,56,52,2147483647] 27 0'0 2015-03-04 11:33:42.109832 0'0 2015-03-04 11:33:42.109832 1.1a9700000000 active+undersized+degraded 2015-03-04 11:34:25.457454 0'0 408:12 [20,53,14,54,102,118,2147483647,72] 20 [20,53,14,54,102,118,2147483647,72] 20 0'0 2015-03-04 11:33:42.833850 0'0 2015-03-04 11:33:42.833850 1.10a600000000 active+undersized+degraded 2015-03-04 11:34:30.059936 0'0 408:12 [136,22,4,2147483647,72,52,101,55] 136 [136,22,4,2147483647,72,52,101,55] 136 0'0 2015-03-04 11:33:42.125871 0'0 2015-03-04 11:33:42.125871 This appears to have a number on all these (2147483647) that is way out of line from what I would expect. Thoughts? The information contained in
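For reference, the rule of thumb being applied here, worked through for this cluster (assuming the usual target of roughly 100 PGs per OSD):

    PGs ~ (number of OSDs x 100) / (K+M)
        = (144 x 100) / (4+4)
        = 14400 / 8
        = 1800, rounded up to the next power of two: 2048

The same arithmetic with the 384 OSDs mentioned later in the thread gives 38400 / 8 = 4800, which is where the 8192 figure further down comes from.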
[ceph-users] Ceph User Teething Problems
I have been following ceph for a long time. I have yet to put it into service, and I keep coming back as btrfs improves and ceph reaches higher version numbers. I am now trying ceph 0.93 and kernel 4.0-rc1. Q1) Is it still considered that btrfs is not robust enough, and that xfs should be used instead? [I am trying with btrfs]. I followed the manual deployment instructions on the web site (http://ceph.com/docs/master/install/manual-deployment/) and I managed to get a monitor and several osds running and apparently working. The instructions fizzle out without explaining how to set up mds. I went back to mkcephfs and got things set up that way. The mds starts. [Please don't mention ceph-deploy] The first thing that I noticed is that (whether I set up mon and osds by following the manual deployment, or using mkcephfs), the correct default pools were not created. bash-4.3# ceph osd lspools 0 rbd, bash-4.3# I get only 'rbd' created automatically. I deleted this pool, and re-created data, metadata and rbd manually. When doing this, I had to juggle with the pg-num in order to avoid the 'too many pgs for osd' warning. I have three osds running at the moment, but intend to add to these when I have some experience of things working reliably. I am puzzled, because I seem to have to set the pg-num for the pool to a number that makes (N-pools x pg-num)/N-osds come to the right kind of number. So this implies that I can't really expand a set of pools by adding osds at a later date. Q2) Is there any obvious reason why my default pools are not getting created automatically as expected? Q3) Can pg-num be modified for a pool later? (If the number of osds is increased dramatically). Finally, when I try to mount cephfs, I get a mount 5 error. A mount 5 error typically occurs if an MDS server is laggy or if it crashed. Ensure at least one MDS is up and running, and the cluster is active + healthy. My mds is running, but its log is not terribly active: 2015-03-04 17:47:43.177349 7f42da2c47c0 0 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-mds, pid 4110 2015-03-04 17:47:43.182716 7f42da2c47c0 -1 mds.-1.0 log_to_monitors {default=true} (This is all there is in the log). I think that a key indicator of the problem must be this from the monitor log: 2015-03-04 16:53:20.715132 7f3cd0014700 1 mon.ceph-mon-00@0(leader).mds e1 warning, MDS mds.? [2001:8b0::5fb3::1fff::9054]:6800/4036 up but filesystem disabled (I have added the '' sections to obscure my ip address) Q4) Can you give me an idea of what is wrong that causes the mds to not play properly? I think that there are some typos on the manual deployment pages, for example: ceph-osd id={osd-num} This is not right. As far as I am aware it should be: ceph-osd -i {osd-num} An observation. In principle, setting things up manually is not all that complicated, provided that clear and unambiguous instructions are provided. This simple piece of documentation is very important. My view is that the existing manual deployment instructions get a bit confused and confusing when they get to the osd setup, and the mds setup is completely absent. For someone who knows, it would be a fairly simple and fairly quick operation to review and revise this part of the documentation. I suspect that this part suffers from being really obvious stuff to the well initiated. For those of us closer to the start, this forms the ends of the threads that have to be picked up before the journey can be made.
Very best regards, David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/03/2015 05:53 PM, Jason Dillaman wrote: Your procedure appears correct to me. Would you mind re-running your cloned image VM with the following ceph.conf properties: [client] rbd cache off debug rbd = 20 log file = /path/writeable/by/qemu.$pid.log If you recreate the issue, would you mind opening a ticket at http://tracker.ceph.com/projects/rbd/issues? Jason, Thanks for the reply. Recreating the issue is not a problem, I can reproduce it any time. The log file was getting a bit large, I destroyed the guest after letting it thrash for about ~3 minutes, plenty of time to hit the problem. I've uploaded it at: http://paste.scsys.co.uk/468868 (~19MB) Do you really think this is a bug and not an err on my side? -K. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New EC pool undersized
Sorry, I missed your other questions, down at the bottom. See herehttp://ceph.com/docs/master/rados/operations/placement-groups/ (look for “number of replicas for replicated pools or the K+M sum for erasure coded pools”) for the formula; 38400/8 probably implies 8192. The thing is, you’ve got to think about how many ways you can form combinations of 8 unique OSDs (with replacement) that match your failure domain rules. If you’ve only got 8 hosts, and your failure domain is hosts, it severely limits this number. And I have read that too many isn’t good either – a serialization issue, I believe. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:49 To: Kyle Hutson Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] New EC pool undersized Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 12:43 To: Don Doerner Cc: Ceph Users Subject: Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0As=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63,2147483647]12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 00000000 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 
2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.commailto:don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:14 To: Kyle Hutson; Ceph Users Subject: Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kyle Hutson Sent: 04 March, 2015 12:06 To: Ceph Users Subject: [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this
Re: [ceph-users] New EC pool undersized
So it sounds like I should figure out at 'how many nodes' do I need to increase pg_num to 4096, and again for 8192, and increase those incrementally when as I add more hosts, correct? On Wed, Mar 4, 2015 at 3:04 PM, Don Doerner don.doer...@quantum.com wrote: Sorry, I missed your other questions, down at the bottom. See here http://ceph.com/docs/master/rados/operations/placement-groups/ (look for “number of replicas for replicated pools or the K+M sum for erasure coded pools”) for the formula; 38400/8 probably implies 8192. The thing is, you’ve got to think about how many ways you can form combinations of 8 unique OSDs (with replacement) that match your failure domain rules. If you’ve only got 8 hosts, and your failure domain is hosts, it severely limits this number. And I have read that too many isn’t good either – a serialization issue, I believe. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:49 *To:* Kyle Hutson *Cc:* ceph-users@lists.ceph.com *Subject:* Re: [ceph-users] New EC pool undersized Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- *From:* Kyle Hutson [mailto:kylehut...@ksu.edu kylehut...@ksu.edu] *Sent:* 04 March, 2015 12:43 *To:* Don Doerner *Cc:* Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0 https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0As=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 50'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647 ,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63, 2147483647]12 0'0 2015-03-04 14:33:15.6524800'0 2015-03-04 14:33:15.652480 2.5f7 00000000 
active+undersized+degraded2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647 ,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:14 *To:* Kyle Hutson; Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need
Re: [ceph-users] New EC pool undersized
That did it. 'step set_choose_tries 200' fixed the problem right away. Thanks Yann! On Wed, Mar 4, 2015 at 2:59 PM, Yann Dupont y...@objoo.org wrote: Le 04/03/2015 21:48, Don Doerner a écrit : Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- Hello, I think I already had this problem. It's explained here http://tracker.ceph.com/issues/10350 And solution is probably here : http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ Section : CRUSH gives up too soon Cheers, Yann ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
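For anyone hitting the same thing, a rough sketch of how that tunable gets applied, following the troubleshooting doc linked above (rule and file names here are illustrative):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt
    # in the erasure rule, before the 'step take' line, add e.g.:
    #     step set_choose_tries 200
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new

The new mappings can also be sanity-checked offline first with something like 'crushtool -i crushmap.new --test --show-bad-mappings --rule <ruleset> --num-rep 8'.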
Re: [ceph-users] Ceph User Teething Problems
On 04/03/2015 20:27, Datatone Lists wrote: I have been following ceph for a long time. I have yet to put it into service, and I keep coming back as btrfs improves and ceph reaches higher version numbers. I am now trying ceph 0.93 and kernel 4.0-rc1. Q1) Is it still considered that btrfs is not robust enough, and that xfs should be used instead? [I am trying with btrfs]. XFS is still the recommended default backend (http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/#filesystems) I followed the manual deployment instructions on the web site (http://ceph.com/docs/master/install/manual-deployment/) and I managed to get a monitor and several osds running and apparently working. The instructions fizzle out without explaining how to set up mds. I went back to mkcephfs and got things set up that way. The mds starts. [Please don't mention ceph-deploy] This kind of comment isn't very helpful unless there is a specific issue with ceph-deploy that is preventing you from using it, and causing you to resort to manual steps. I happen to find ceph-deploy very useful, so I'm afraid I'm going to mention it anyway :-) The first thing that I noticed is that (whether I set up mon and osds by following the manual deployment, or using mkcephfs), the correct default pools were not created. This is not a bug. The 'data' and 'metadata' pools are no longer created by default. http://docs.ceph.com/docs/master/cephfs/createfs/ I get only 'rbd' created automatically. I deleted this pool, and re-created data, metadata and rbd manually. When doing this, I had to juggle with the pg-num in order to avoid the 'too many pgs for osd'. I have three osds running at the moment, but intend to add to these when I have some experience of things working reliably. I am puzzled, because I seem to have to set the pg-num for the pool to a number that makes (N-pools x pg-num)/N-osds come to the right kind of number. So this implies that I can't really expand a set of pools by adding osds at a later date. You should pick an appropriate number of PGs for the number of OSDs you have at the present time. When you add more OSDs, you can increase the number of PGs. You would not want to create the larger number of PGs initially, as they could exceed the resources available on your initial small number of OSDs. Q4) Can you give me an idea of what is wrong that causes the mds to not play properly? You have to explicitly enable the filesystem now (also http://docs.ceph.com/docs/master/cephfs/createfs/) I think that there are some typos on the manual deployment pages, for example: ceph-osd id={osd-num} This is not right. As far as I am aware it should be: ceph-osd -i {osd-num} ceph-osd id={osd-num} is an upstart invocation (i.e. it's prefaced with sudo start on the manual deployment page). In that context it's correct afaik, unless you're finding otherwise? John ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
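For completeness, a minimal sketch of the explicit filesystem creation step John refers to (pool names and PG counts here are just placeholders; see the createfs doc he links for details):

    ceph osd pool create cephfs_data 256
    ceph osd pool create cephfs_metadata 256
    ceph fs new cephfs cephfs_metadata cephfs_data
    ceph mds stat    # the MDS should now go active instead of sitting idle

Once the filesystem exists, the 'up but filesystem disabled' warning in the monitor log should go away, and with it the mount 5 error.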
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
The change is only on the OSDs (and not on the OSD journal). Do you see twice the IOPS for both reads and writes? If only reads, maybe a read-ahead bug could explain this. - Mail original - De: Olivier Bonvalet ceph.l...@daevel.fr À: aderumier aderum...@odiso.com Cc: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 15:13:30 Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph is about IO stats seen by ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journal). Le mercredi 04 mars 2015 à 15:05 +0100, Alexandre DERUMIER a écrit : The load problem is permanent: I have twice the IO/s on the HDDs since firefly. Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the IOs/ops in the ceph -w stats? Is the ceph health OK? - Mail original - De: Olivier Bonvalet ceph.l...@daevel.fr À: aderumier aderum...@odiso.com Cc: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 14:49:41 Objet: Re: [ceph-users] Perf problem after upgrade from dumpling to firefly Thanks Alexandre. The load problem is permanent: I have twice the IO/s on the HDDs since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behavior of the journal, or something like that. But I didn't find anything about that. Olivier Le mercredi 04 mars 2015 à 14:44 +0100, Alexandre DERUMIER a écrit : Hi, maybe this is related?: http://tracker.ceph.com/issues/9503 Dumpling: removing many snapshots in a short time makes OSDs go berserk http://tracker.ceph.com/issues/9487 dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html I think it's already backported in dumpling, not sure it's already done for firefly Alexandre - Mail original - De: Olivier Bonvalet ceph.l...@daevel.fr À: ceph-users ceph-users@lists.ceph.com Envoyé: Mercredi 4 Mars 2015 12:10:30 Objet: [ceph-users] Perf problem after upgrade from dumpling to firefly Hi, last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had OSDs falling over: some of them were marked as down by Ceph, probably because of IO starvation. I marked the cluster «noout», started the downed OSDs, then let it recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling. The main problem seems to be that the OSDs have twice as many write operations per second: https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png But the journal doesn't change (SSD dedicated to OSD70+71+72): https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png Nor does node bandwidth: https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png Or whole-cluster IO activity: https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png Some background: the cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDDs and 3 SSDs), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs). The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations).
osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1. So... any idea about what's happening? Thanks for any help, Olivier ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
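For reference, a rough sketch of where the settings Olivier mentions live (the values are the ones he quotes, not recommendations; osd_disk_threads generally needs an OSD restart to take effect, while the snap trim settings can usually be changed at runtime):

    [osd]
        osd snap trim sleep = 0.8
        osd pg max concurrent snap trims = 1
        osd disk threads = 1

or applied on a running cluster:

    ceph tell osd.* injectargs '--osd-snap-trim-sleep 0.8 --osd-pg-max-concurrent-snap-trims 1'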
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/02/2015 04:16 AM, koukou73gr wrote: Hello, Today I thought I'd experiment with snapshots and cloning. So I did: rbd import --image-format=2 vm-proto.raw rbd/vm-proto rbd snap create rbd/vm-proto@s1 rbd snap protect rbd/vm-proto@s1 rbd clone rbd/vm-proto@s1 rbd/server And then proceeded to create a qemu-kvm guest with rbd/server as its backing store. The guest booted but as soon as it got to mount the root fs, things got weird: What does the qemu command line look like? [...] scsi2 : Virtio SCSI HBA scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0 ANSI: 5 sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sd 2:0:0:0: [sda] Attached SCSI disk dracut: Scanning devices sda2 for LVM logical volumes vg_main/lv_swap vg_main/lv_root dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit EXT4-fs (dm-1): INFO: recovery required on readonly filesystem This suggests the disk is being exposed as read-only via QEMU, perhaps via qemu's snapshot or other options. You can use a clone in exactly the same way as any other rbd image. If you're running QEMU manually, for example, something like: qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback is fine for using the clone. QEMU is supposed to be unaware of any snapshots, parents, etc. at the rbd level. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] New EC pool undersized
It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 0 0 0 0 0 0 0 0 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:14 *To:* Kyle Hutson; Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *Kyle Hutson *Sent:* 04 March, 2015 12:06 *To:* Ceph Users *Subject:* [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it down looks like https://dpaste.de/OLEa https://urldefense.proofpoint.com/v1/url?u=https://dpaste.de/OLEak=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=c1bd46dcd96e656554817882d7f6581903b1e3c6a50313f4bf7494acfd12b442 I currently have 144 OSDs on 8 nodes. 
After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 7 pgs degraded; 7 pgs stuck degraded; 7 pgs stuck unclean; 7 pgs stuck undersized; 7 pgs undersized monmap e1: 4 mons at {hobbit01= 10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0 https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=JSfAuDHRgKln0yM%2FQGMT3hZb3rVLUpdn2wGdV3C0Rbk%3D%0As=6fe07b47a00235857630057e09cfb702dcddcea1d3f98d81a574020ee95dee44}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e409: 144 osds: 144 up, 144 in pgmap v6763: 12288 pgs, 2 pools, 0 bytes data, 0 objects 90598 MB used, 640 TB / 640 TB avail 7 active+undersized+degraded 12281 active+clean So to
Re: [ceph-users] New EC pool undersized
Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 12:43 To: Don Doerner Cc: Ceph Users Subject: Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0https://urldefense.proofpoint.com/v1/url?u=http://10.5.38.1:6789/0%2Chobbit02%3D10.5.38.2:6789/0%2Chobbit13%3D10.5.38.13:6789/0%2Chobbit14%3D10.5.38.14:6789/0k=8F5TVnBDKF32UabxXsxZiA%3D%3D%0Ar=klXZewu0kUquU7GVFsSHwpsWEaffmLRymeSfL%2FX1EJo%3D%0Am=fHQcjtxx3uADdikQAQAh65Z0s%2FzNFIj544bRY5zThgI%3D%0As=01b7463be37041310163f5d75abc634fab3280633eaef2158ed6609c6f3978d8}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stampvreported up up_primary actingacting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647]12 [12,108,136,104,52,18,63,2147483647]12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 00000000 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs? On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.commailto:don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.commailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:14 To: Kyle Hutson; Ceph Users Subject: Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? 
You'll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kyle Hutson Sent: 04 March, 2015 12:06 To: Ceph Users Subject: [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it now looks like https://dpaste.de/OLEa I currently have 144 OSDs on 8 nodes. After increasing pg_num and pgp_num to a more suitable 1024 (due to the high number of OSDs), everything looked happy. So, now I'm trying to play with an erasure-coded pool. I did: ceph osd erasure-code-profile set ec44profile k=4 m=4 ruleset-failure-domain=rack ceph osd pool create ec44pool 8192 8192 erasure ec44profile After settling for a bit 'ceph status' gives cluster
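For anyone following along, the sizing rule Don is quoting works out like this (a rough sketch, assuming the usual target of around 100 PGs per OSD from the placement-groups documentation; the exact target is a judgment call):

  total PGs ~= (number of OSDs x 100) / (k + m)
  144 OSDs: (144 x 100) / (4 + 4) = 1800, rounded up to the next power of 2 = 2048
  384 OSDs: (384 x 100) / (4 + 4) = 4800, rounded up to the next power of 2 = 8192

which matches the 14400/8 and 38400/8 figures used elsewhere in the thread.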
Re: [ceph-users] Ceph User Teething Problems
I can't help much on the MDS front, but here is some answers and my view on some of it. On Wed, Mar 4, 2015 at 1:27 PM, Datatone Lists li...@datatone.co.uk wrote: I have been following ceph for a long time. I have yet to put it into service, and I keep coming back as btrfs improves and ceph reaches higher version numbers. I am now trying ceph 0.93 and kernel 4.0-rc1. Q1) Is it still considered that btrfs is not robust enough, and that xfs should be used instead? [I am trying with btrfs]. We are moving forward with btrfs on our production cluster aware that there may be performance issues. So far, it seems the later kernels have resolved the issues we've seen with snapshots. As the system grows we will keep an eye on it and are prepared to move to XFS if needed. I followed the manual deployment instructions on the web site (http://ceph.com/docs/master/install/manual-deployment/) and I managed to get a monitor and several osds running and apparently working. The instructions fizzle out without explaining how to set up mds. I went back to mkcephfs and got things set up that way. The mds starts. [Please don't mention ceph-deploy] The first thing that I noticed is that (whether I set up mon and osds by following the manual deployment, or using mkcephfs), the correct default pools were not created. bash-4.3# ceph osd lspools 0 rbd, bash-4.3# I get only 'rbd' created automatically. I deleted this pool, and re-created data, metadata and rbd manually. When doing this, I had to juggle with the pg- num in order to avoid the 'too many pgs for osd'. I have three osds running at the moment, but intend to add to these when I have some experience of things working reliably. I am puzzled, because I seem to have to set the pg-num for the pool to a number that makes (N-pools x pg-num)/N-osds come to the right kind of number. So this implies that I can't really expand a set of pools by adding osds at a later date. Q2) Is there any obvious reason why my default pools are not getting created automatically as expected? Since Giant, these pools are not automatically created, only the rbd pool is. Q3) Can pg-num be modified for a pool later? (If the number of osds is increased dramatically). pg_num and pgp_num can be increased (not decreased) on the fly later to expand with more OSDs. Finally, when I try to mount cephfs, I get a mount 5 error. A mount 5 error typically occurs if a MDS server is laggy or if it crashed. Ensure at least one MDS is up and running, and the cluster is active + healthy. My mds is running, but its log is not terribly active: 2015-03-04 17:47:43.177349 7f42da2c47c0 0 ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4), process ceph-mds, pid 4110 2015-03-04 17:47:43.182716 7f42da2c47c0 -1 mds.-1.0 log_to_monitors {default=true} (This is all there is in the log). I think that a key indicator of the problem must be this from the monitor log: 2015-03-04 16:53:20.715132 7f3cd0014700 1 mon.ceph-mon-00@0(leader).mds e1 warning, MDS mds.? [2001:8b0::5fb3::1fff::9054]:6800/4036 up but filesystem disabled (I have added the '' sections to obscure my ip address) Q4) Can you give me an idea of what is wrong that causes the mds to not play properly? I think that there are some typos on the manual deployment pages, for example: ceph-osd id={osd-num} This is not right. As far as I am aware it should be: ceph-osd -i {osd-num} There are a few of these, usually running --help for the command gives you the right syntax needed for the version you have installed. But it is still very confusing. 
An observation. In principle, setting things up manually is not all that complicated, provided that clear and unambiguous instructions are provided. This simple piece of documentation is very important. My view is that the existing manual deployment instructions gets a bit confused and confusing when it gets to the osd setup, and the mds setup is completely absent. For someone who knows, this would be a fairly simple and fairly quick operation to review and revise this part of the documentation. I suspect that this part suffers from being really obvious stuff to the well initiated. For those of us closer to the start, this forms the ends of the threads that have to be picked up before the journey can be made. Very best regards, David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
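One note on Q4 above: on current releases (the default data/metadata pools are no longer created since Giant, as mentioned earlier), the monitor message "up but filesystem disabled" normally just means that no CephFS filesystem has been created yet, so the MDS has nothing to serve and clients get mount error 5. A minimal sketch of the missing step, assuming the 'metadata' and 'data' pools that were recreated by hand earlier in this message (the filesystem name 'cephfs' is arbitrary):

  ceph fs new cephfs metadata data
  ceph mds stat   # should eventually report the MDS as up:active

mkcephfs and the manual deployment page pre-date this requirement, which is probably why the step is missing from both.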
Re: [ceph-users] New EC pool undersized
Le 04/03/2015 21:48, Don Doerner a écrit : Hmmm, I just struggled through this myself.How many racks do you have?If not more than 8, you might want to make your failure domain smaller?I.e., maybe host?That, at least, would allow you to debug the situation… -don- Hello, I think I already had this problem. It's explained here http://tracker.ceph.com/issues/10350 And solution is probably here : http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ Section : CRUSH gives up too soon Cheers, Yann ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
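For reference, the "CRUSH gives up too soon" workaround from that troubleshooting page boils down to raising the number of placement retries in the erasure-coded rule. A rough sketch (the value 100 is the one suggested in the documentation, not something tuned for this cluster):

  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt
  # edit the ec44pool rule and add, before the 'step take' line:
  #   step set_choose_tries 100
  crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
  ceph osd setcrushmap -i /tmp/crushmap.new

When the number of failure-domain buckets is equal to (or barely larger than) the number of chunks, CRUSH occasionally fails to find a distinct bucket for every chunk within the default number of tries, which is what produces the handful of undersized PGs and the 2147483647 (ITEM_NONE) entries in the acting sets.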
Re: [ceph-users] New EC pool undersized
My lowest level (other than OSD) is 'disktype' (based on the crushmaps at http://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/ ) since I have SSDs and HDDs on the same host. I just made that change (deleted the pool, deleted the profile, deleted the crush ruleset), then re-created using ruleset-failure-domain=disktype. Very similar results. health HEALTH_WARN 3 pgs degraded; 3 pgs stuck unclean; 3 pgs undersized 'ceph pg dump stuck' looks very similar to the last one I posted. On Wed, Mar 4, 2015 at 2:48 PM, Don Doerner don.doer...@quantum.com wrote: Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host? That, at least, would allow you to debug the situation… -don- *From:* Kyle Hutson [mailto:kylehut...@ksu.edu] *Sent:* 04 March, 2015 12:43 *To:* Don Doerner *Cc:* Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000 active+undersized+degraded 2015-03-04 14:33:21.937620 0'0 412:9 [12,108,136,104,52,18,63,2147483647] 12 [12,108,136,104,52,18,63,2147483647] 12 0'0 2015-03-04 14:33:15.652480 0'0 2015-03-04 14:33:15.652480 2.5f7 00000000 active+undersized+degraded 2015-03-04 14:33:26.169242 0'0 412:9 [94,128,73,22,4,60,2147483647,113] 94 [94,128,73,22,4,60,2147483647,113] 94 0'0 2015-03-04 14:33:15.687695 0'0 2015-03-04 14:33:15.687695 I do have questions for you, even at this point, though. 1) Where did you find the formula (14400/(k+m))? 2) I was really trying to size this for when it goes to production, at which point it may have as many as 384 OSDs. Doesn't that imply I should have even more pgs?
On Wed, Mar 4, 2015 at 2:15 PM, Don Doerner don.doer...@quantum.com wrote: Oh duh… OK, then given a 4+4 erasure coding scheme, 14400/8 is 1800, so try 2048. -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf Of *Don Doerner *Sent:* 04 March, 2015 12:14 *To:* Kyle Hutson; Ceph Users *Subject:* Re: [ceph-users] New EC pool undersized In this case, that number means that there is not an OSD that can be assigned. What’s your k, m from you erasure coded pool? You’ll need approximately (14400/(k+m)) PGs, rounded up to the next power of 2… -don- *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *Kyle Hutson *Sent:* 04 March, 2015 12:06 *To:* Ceph Users *Subject:* [ceph-users] New EC pool undersized Last night I blew away my previous ceph configuration (this environment is pre-production) and have 0.87.1 installed. I've manually edited the crushmap so it down looks like https://dpaste.de/OLEa
Re: [ceph-users] Hammer sharded radosgw bucket indexes question
- Original Message - From: Ben Hines bhi...@gmail.com To: ceph-users ceph-users@lists.ceph.com Sent: Wednesday, March 4, 2015 1:03:16 PM Subject: [ceph-users] Hammer sharded radosgw bucket indexes question Hi, These questions were asked previously but perhaps lost: We have some large buckets. - When upgrading to Hammer (0.93 or later), is it necessary to recreate the buckets to get a sharded index? - What parameters does the system use for deciding when to shard the index? The system does not re-shard the bucket index, it will only affect new buckets. There is a per-zone configurable that specifies num of shards for buckets created in that zone (by default it's disabled). There's also a ceph.conf configurable that can be set to override that value. Yehuda ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
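To make that concrete, the two knobs Yehuda describes look roughly like this (a sketch only; the client section name and the value 8 are just examples, and as he says they only affect buckets created after the change). The global override in ceph.conf:

  [client.radosgw.gateway]
      rgw override bucket index max shards = 8

The per-zone equivalent is the bucket_index_max_shards field in the zone entries of the region configuration, which can be edited by dumping the config with 'radosgw-admin region get', changing the value, and loading it back with 'radosgw-admin region set'.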
Re: [ceph-users] Ceph User Teething Problems
On Wed, Mar 4, 2015 at 4:43 PM, Lionel Bouton lionel-subscript...@bouton.name wrote: On 03/04/15 22:18, John Spray wrote: On 04/03/2015 20:27, Datatone Lists wrote: [...] [Please don't mention ceph-deploy] This kind of comment isn't very helpful unless there is a specific issue with ceph-deploy that is preventing you from using it, and causing you to resort to manual steps. As a new maintainer of ceph-deploy, I'm happy to hear all gripes. :) ceph-deploy is a subject I never took the time to give feedback on. We can't use it (we use Gentoo which isn't supported by ceph-deploy) and even if we could I probably wouldn't allow it: I believe that for important pieces of infrastructure like Ceph you have to understand its inner workings to the point where you can hack your way out in cases of problems and build tools to integrate them better with your environment (you can understand one of the reasons why we use Gentoo in production with other distributions...). I believe using ceph-deploy makes it more difficult to acquire the knowledge to do so. For example we have a script to replace a defective OSD (destroying an existing one and replacing with a new one) locking data in place as long as we can to avoid crush map changes to trigger movements until the map reaches its original state again which minimizes the total amount of data copied around. It might have been possible to achieve this with ceph-deploy, but I doubt we would have achieved it as easily (from understanding the causes of data movements through understanding the osd identifiers allocation process to implementing the script) if we hadn't created the OSD by hand repeatedly before scripting some processes. Thanks for this feedback. I share a lot of your sentiments, especially that it is good to understand as much of the system as you can. Everyone's skill level and use-case is different, and ceph-deploy is targeted more towards PoC use-cases. It tries to make things as easy as possible, but that necessarily abstracts most of the details away. Last time I searched for documentation on manual configuration it was much harder to find (mds manual configuration was indeed something I didn't find at all too). Best regards, Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] qemu-kvm and cloned rbd image
Hi Josh, Thanks for taking a look at this. I 'm answering your questions inline. On 03/04/2015 10:01 PM, Josh Durgin wrote: [...] And then proceeded to create a qemu-kvm guest with rbd/server as its backing store. The guest booted but as soon as it got to mount the root fs, things got weird: What does the qemu command line look like? I am using libvirt, so I'll be copy-pasting from the log file: LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin /usr/libexec/qemu-kvm -name server -S -machine rhel6.5.0,accel=kvm,usb=off -cpu Penryn,+dca,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 1024 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid ee13f9a0-b7eb-93fd-aa8c-18da9e23ba5c -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/server.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=nc,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -drive file=rbd:libvirt-pool/server:id=libvirt:key=AQAeDqRTQEknIhAA5Gqfl/CkWIfh+nR01hEgzA==:auth_supported=cephx\;none,if=none,id=drive-scsi0-0-0-0 -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:73:98:a9,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 [...] scsi2 : Virtio SCSI HBA scsi 2:0:0:0: Direct-Access QEMU QEMU HARDDISK1.5. PQ: 0 ANSI: 5 sd 2:0:0:0: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB) sd 2:0:0:0: [sda] Write Protect is off sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA sda: sda1 sda2 sd 2:0:0:0: [sda] Attached SCSI disk dracut: Scanning devices sda2 for LVM logical volumes vg_main/lv_swap vg_main/lv_root dracut: inactive '/dev/vg_main/lv_swap' [1.00 GiB] inherit dracut: inactive '/dev/vg_main/lv_root' [6.50 GiB] inherit EXT4-fs (dm-1): INFO: recovery required on readonly filesystem This suggests the disk is being exposed as read-only via QEMU, perhaps via qemu's snapshot or other options. You're right, the disk does seem R/O but also corrupt. The disk image was cleanly unmounted before creating the snapshot and cloning it. What is more, if I just flatten the image and start the guest again it boots fine and there is no recovery needed on the fs. There are also a some: block I/O error in device 'drive-scsi0-0-0-0': Operation not permitted (1) messages logged in /var/log/libvirt/qemu/server.log You can use a clone in exactly the same way as any other rbd image. If you're running QEMU manually, for example, something like: qemu-kvm -drive file=rbd:rbd/server,format=raw,cache=writeback is fine for using the clone. QEMU is supposed to be unaware of any snapshots, parents, etc. at the rbd level. In a sense, the parameters passed to QEMU from libvirt boil down to your suggested command line. I think it should work as well, it is written all over the place :) I'm a still a newbie wrt ceph, maybe I am missing something flat-out obvious. Thanks for your time, -K. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph User Teething Problems
On 03/04/15 22:50, Travis Rhoden wrote: [...] Thanks for this feedback. I share a lot of your sentiments, especially that it is good to understand as much of the system as you can. Everyone's skill level and use-case is different, and ceph-deploy is targeted more towards PoC use-cases. It tries to make things as easy as possible, but that necessarily abstracts most of the details away. To follow up on this subject, assuming ceph-deploy worked with Gentoo, one feature which would make it really useful to us would be for it to dump each and every one of the commands it uses so that they might be replicated manually. Documentation might be inaccurate or hard to browse for various reasons, but a tool which achieves its purpose can't be wrong about the command it uses (assuming it simply calls standard command-line tools and not some API over a socket...). There might be a way to do it already (seems something you would want at least when developing it) but obviously I didn't check. Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] qemu-kvm and cloned rbd image
On 03/04/2015 01:36 PM, koukou73gr wrote: On 03/03/2015 05:53 PM, Jason Dillaman wrote: Your procedure appears correct to me. Would you mind re-running your cloned image VM with the following ceph.conf properties: [client] rbd cache off debug rbd = 20 log file = /path/writeable/by/qemu.$pid.log If you recreate the issue, would you mind opening a ticket at http://tracker.ceph.com/projects/rbd/issues? Jason, Thanks for the reply. Recreating the issue is not a problem, I can reproduce it any time. The log file was getting a bit large, I destroyed the guest after letting it thrash for about ~3 minutes, plenty of time to hit the problem. I've uploaded it at: http://paste.scsys.co.uk/468868 (~19MB) It looks like your libvirt rados user doesn't have access to whatever pool the parent image is in: librbd::AioRequest: write 0x7f1ec6ad6960 rbd_data.24413d1b58ba.0186 1523712~4096 should_complete: r = -1 -1 is EPERM, for operation not permitted. Check the libvirt user capabilites shown in ceph auth list - it should have at least r and class-read access to the pool storing the parent image. You can update it via the 'ceph auth caps' command. Josh ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
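For anyone hitting the same thing, the fix looks something like the following (a sketch only; take the existing caps from your own 'ceph auth list' output and adjust pool names, here assuming the parent image lives in the default 'rbd' pool while the clone lives in 'libvirt-pool' as in the qemu command line earlier in this thread):

  ceph auth caps client.libvirt mon 'allow r' osd 'allow class-read object_prefix rbd_children, allow rwx pool=libvirt-pool, allow r class-read pool=rbd'

Without at least read and class-read on the parent's pool, reads that fall through the clone to the parent fail with EPERM, which is exactly the -1 seen in the librbd log.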
Re: [ceph-users] New EC pool undersized
I don't know – I am playing with crush; someday I may fully comprehend it. Not today. I think you have to look at it like this: if your possible failure domain options are OSDs, hosts, racks, …, and you choose racks as your failure domain, and you have exactly as many racks as your pool size (and it can't be any smaller, right?), then each PG has to have an OSD from each rack. If your 144 OSDs are split evenly across 8 racks, then you have 18 OSDs in each rack (presumably distributed over the hosts in that rack, though I don't think that distribution is important for this calculation). And so your total number of choices is 18 to the 8th power, or just over 11 billion (actually, 11,019,960,576 :-) ). So probably the only thing you have to worry about is "crush giving up too soon", and Yann's resolution. -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 13:15 To: Don Doerner Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] New EC pool undersized So it sounds like I should figure out at 'how many nodes' do I need to increase pg_num to 4096, and again for 8192, and increase those incrementally as I add more hosts, correct? On Wed, Mar 4, 2015 at 3:04 PM, Don Doerner don.doer...@quantum.com wrote: Sorry, I missed your other questions, down at the bottom. See here: http://ceph.com/docs/master/rados/operations/placement-groups/ (look for "number of replicas for replicated pools or the K+M sum for erasure coded pools") for the formula; 38400/8 probably implies 8192. The thing is, you've got to think about how many ways you can form combinations of 8 unique OSDs (with replacement) that match your failure domain rules. If you've only got 8 hosts, and your failure domain is hosts, it severely limits this number. And I have read that too many isn't good either – a serialization issue, I believe. -don- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Don Doerner Sent: 04 March, 2015 12:49 To: Kyle Hutson Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] New EC pool undersized Hmmm, I just struggled through this myself. How many racks do you have? If not more than 8, you might want to make your failure domain smaller? I.e., maybe host?
That, at least, would allow you to debug the situation… -don- From: Kyle Hutson [mailto:kylehut...@ksu.edu] Sent: 04 March, 2015 12:43 To: Don Doerner Cc: Ceph Users Subject: Re: [ceph-users] New EC pool undersized It wouldn't let me simply change the pg_num, giving Error EEXIST: specified pg_num 2048 = current 8192 But that's not a big deal, I just deleted the pool and recreated with 'ceph osd pool create ec44pool 2048 2048 erasure ec44profile' ...and the result is quite similar: 'ceph status' is now ceph status cluster 196e5eb8-d6a7-4435-907e-ea028e946923 health HEALTH_WARN 4 pgs degraded; 4 pgs stuck unclean; 4 pgs undersized monmap e1: 4 mons at {hobbit01=10.5.38.1:6789/0,hobbit02=10.5.38.2:6789/0,hobbit13=10.5.38.13:6789/0,hobbit14=10.5.38.14:6789/0}, election epoch 6, quorum 0,1,2,3 hobbit01,hobbit02,hobbit13,hobbit14 osdmap e412: 144 osds: 144 up, 144 in pgmap v6798: 6144 pgs, 2 pools, 0 bytes data, 0 objects 90590 MB used, 640 TB / 640 TB avail 4 active+undersized+degraded 6140 active+clean 'ceph pg dump_stuck results' in ok pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 2.296 00000000 active+undersized+degraded 2015-03-04 14:33:26.672224 0'0 412:9 [5,55,91,2147483647,83,135,53,26] 5 [5,55,91,2147483647,83,135,53,26] 5 0'0 2015-03-04 14:33:15.649911 0'0 2015-03-04 14:33:15.649911 2.69c 00000000 active+undersized+degraded 2015-03-04 14:33:24.984802 0'0 412:9 [93,134,1,74,112,28,2147483647,60] 93 [93,134,1,74,112,28,2147483647,60] 93 0'0 2015-03-04 14:33:15.695747 0'0 2015-03-04 14:33:15.695747 2.36d 00000000
Re: [ceph-users] Persistent Write Back Cache
Hello Nick, On Wed, 4 Mar 2015 08:49:22 - Nick Fisk wrote: Hi Christian, Yes that's correct, it's on the client side. I don't see this much different to a battery backed Raid controller, if you lose power, the data is in the cache until power resumes when it is flushed. If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (eg DRBD / Dual Port SAS). But then something like pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access. Which is pretty much any and all use cases I can think about. Because it's not only concurrent (active/active) accesses, but you really need to have things consistent across all possible client hosts in case of a node failure. I'm no stranger to DRBD and Pacemaker (which incidentally didn't make it into Debian Jessie, queue massive laughter and ridicule), btw. When I wrote this I was thinking more about 2 HA iSCSI servers with RBD's, however I can understand that this feature would prove more of a challenge if you are using Qemu and RBD. One of the reasons I'm using Ceph/RBD instead of DRBD (which is vastly more suited for some use cases) is that it allows me n+1 instead of n+n redundancy when it comes to consumers (compute nodes in my case). Now for your iSCSI head (looking forward to your results and any config recipes) that limitation to a pair may be just as well, but as others wrote it might be best to go forward with this outside of Ceph. Especially since you're already dealing with a HA cluster/pacemaker in that scenario. Christian Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: 04 March 2015 08:40 To: ceph-users@lists.ceph.com Cc: Nick Fisk Subject: Re: [ceph-users] Persistent Write Back Cache Hello, If I understand you correctly, you're talking about the rbd cache on the client side. So assume that host or the cache SSD on if fail terminally. The client thinks its sync'ed are on the permanent storage (the actual ceph storage cluster), while they are only present locally. So restarting that service or VM on a different host now has to deal with likely crippling data corruption. Regards, Christian On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote: Hi All, Is there anything in the pipeline to add the ability to write the librbd cache to ssd so that it can safely ignore sync requests? I have seen a thread a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it. I've been running lots of tests on our new cluster, buffered/parallel performance is amazing (40K Read 10K write iops), very impressed. However sync writes are actually quite disappointing. Running fio with 128k block size and depth=1, normally only gives me about 300iops or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSD's and from what I hear that's about normal, so I don't think I have a ceph config problem. For applications which do a lot of sync's, like ESXi over iSCSI or SQL databases, this has a major performance impact. Traditional storage arrays work around this problem by having a battery backed cache which has latency 10-100 times less than what you can currently achieve with Ceph and an SSD . Whilst librbd does have a writeback cache, from what I understand it will not cache syncs and so in my usage case, it effectively acts like a write through cache. 
To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO's are getting coalesced into nice large 512kb IO's at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat. Nick -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] v0.93: Bucket removal with data purge
Ah, never mind - I had to pass the --bucket=bucketname argument. You'd think the command would print an error if the critical argument is missing. -Ben On Wed, Mar 4, 2015 at 6:06 PM, Ben Hines bhi...@gmail.com wrote: One of the release notes says: rgw: fix bucket removal with data purge (Yehuda Sadeh) Just tried this and it didn't seem to work: bash-4.1$ time radosgw-admin bucket rm mike-cache2 --purge-objects real 0m7.711s user 0m0.109s sys 0m0.072s Yet the bucket was not deleted, nor purged: -bash-4.1$ radosgw-admin bucket stats [ mike-cache2, { bucket: mike-cache2, pool: .rgw.buckets, index_pool: .rgw.buckets.index, id: default.2769570.4, marker: default.2769570.4, owner: smbuildmachine, ver: 0#329, master_ver: 0#0, mtime: 2014-11-11 16:10:31.00, max_marker: 0#, usage: { rgw.main: { size_kb: 223355, size_kb_actual: 223768, num_objects: 164 } }, bucket_quota: { enabled: false, max_size_kb: -1, max_objects: -1 } }, ] -Ben ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] v0.93: Bucket removal with data purge
One of the release notes says: rgw: fix bucket removal with data purge (Yehuda Sadeh) Just tried this and it didn't seem to work: bash-4.1$ time radosgw-admin bucket rm mike-cache2 --purge-objects real 0m7.711s user 0m0.109s sys 0m0.072s Yet the bucket was not deleted, nor purged: -bash-4.1$ radosgw-admin bucket stats [ mike-cache2, { bucket: mike-cache2, pool: .rgw.buckets, index_pool: .rgw.buckets.index, id: default.2769570.4, marker: default.2769570.4, owner: smbuildmachine, ver: 0#329, master_ver: 0#0, mtime: 2014-11-11 16:10:31.00, max_marker: 0#, usage: { rgw.main: { size_kb: 223355, size_kb_actual: 223768, num_objects: 164 } }, bucket_quota: { enabled: false, max_size_kb: -1, max_objects: -1 } }, ] -Ben ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] pool distribution quality report script
Hi All, Recently some folks showed interest in gathering pool distribution statistics and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added in calculation of expected max and min PGs per OSD and std deviation. The script is available here: https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py Some general comments: 1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem ala Raab Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf 2) You can invoke it either by passing it a file or stdout, ie: ceph pg dump -f json | ./readpgdump.py or ./readpgdump.py ~/pgdump.out 3) Here's a snippet of some of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful? [nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out ++ | Detected input as plain| ++ ++ | Pool ID: 681 | ++ | Participating OSDs: 210| | Participating PGs: 4096| ++ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2| | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Acting)| | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57)| | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Acting) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6| | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | ++ | OSDs in Primary Role (Up) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2| | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Up)| | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57)| | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Up) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6| | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% |
Re: [ceph-users] Unexpected OSD down during deep-scrub
New issue created - http://tracker.ceph.com/issues/11027 Regards. Italo Santos http://italosantos.com.br/ On Tuesday, March 3, 2015 at 9:23 PM, Loic Dachary wrote: Hi Yann, That seems related to http://tracker.ceph.com/issues/10536 which seems to be resolved. Could you create a new issue with a link to 10536 ? More logs and ceph report would also be useful to figure out why it resurfaced. Thanks ! On 04/03/2015 00:04, Yann Dupont wrote: Le 03/03/2015 22:03, Italo Santos a écrit : I realised that when the first OSD goes down, the cluster was performing a deep-scrub and I found the bellow trace on the logs of osd.8, anyone can help me understand why the osd.8, and other osds, unexpected goes down? I'm afraid I've seen this this afternoon too on my test cluster, just after upgrading from 0.87 to 0.93. After an initial migration success, some OSD started to go down : All presented similar stack traces , with magic word scrub in it : ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4) 1: /usr/bin/ceph-osd() [0xbeb3dc] 2: (()+0xf0a0) [0x7f8f3ca130a0] 3: (gsignal()+0x35) [0x7f8f3b37d165] 4: (abort()+0x180) [0x7f8f3b3803e0] 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d] 6: (()+0x63996) [0x7f8f3bbd1996] 7: (()+0x639c3) [0x7f8f3bbd19c3] 8: (()+0x63bee) [0x7f8f3bbd1bee] 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0] 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c] 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a] 12: (ReplicatedPG::_scrub(ScrubMap, std::maphobject_t, std::pairunsigned int, unsigned int, std::lesshobject_t, std::allocatorstd::pairhobject_t const, std::pa irunsigned int, unsigned intconst)+0x2e4d) [0x9a5ded] 13: (PG::scrub_compare_maps()+0x658) [0x916378] 14: (PG::chunky_scrub(ThreadPool::TPHandle)+0x202) [0x917ee2] 15: (PG::scrub(ThreadPool::TPHandle)+0x3a3) [0x919f83] 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0x13) [0x7eff93] 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49] 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40] 19: (()+0x6b50) [0x7f8f3ca0ab50] 20: (clone()+0x6d) [0x7f8f3b42695d] As a temporary measure, noscrub and nodeep-scrub are now set for this cluster, and all is working fine right now. So there is probably something wrong here. Need to investigate further. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com) http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Loïc Dachary, Artisan Logiciel Libre ___ ceph-users mailing list ceph-users@lists.ceph.com (mailto:ceph-users@lists.ceph.com) http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
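For anyone needing the same stop-gap, the flags Yann mentions are cluster-wide and are set and cleared like this (a temporary measure only; scrubbing should be re-enabled once the underlying bug is fixed):

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # later, to re-enable:
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub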
Re: [ceph-users] pool distribution quality report script
Hi Mark, Cool, that looks handy. Though it'd be even better if it could go a step further and recommend re-weighting values to balance things out (or increased PG counts where needed). Cheers, On 5 March 2015 at 15:11, Mark Nelson mnel...@redhat.com wrote: Hi All, Recently some folks showed interest in gathering pool distribution statistics and I remembered I wrote a script to do that a while back. It was broken due to a change in the ceph pg dump output format that was committed a while back, so I cleaned the script up, added detection of header fields, automatic json support, and also added in calculation of expected max and min PGs per OSD and std deviation. The script is available here: https://github.com/ceph/ceph-tools/blob/master/cbt/tools/readpgdump.py Some general comments: 1) Expected numbers are derived by treating PGs and OSDs as a balls-in-buckets problem ala Raab Steger: http://www14.in.tum.de/personen/raab/publ/balls.pdf 2) You can invoke it either by passing it a file or stdout, ie: ceph pg dump -f json | ./readpgdump.py or ./readpgdump.py ~/pgdump.out 3) Here's a snippet of some of some sample output from a 210 OSD cluster. Does this output make sense to people? Is it useful? [nhm@burnupiX tools]$ ./readpgdump.py ~/pgdump.out ++ | Detected input as plain | ++ ++ | Pool ID: 681 | ++ | Participating OSDs: 210 | | Participating PGs: 4096 | ++ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 | | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Acting) | | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) | | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Acting) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6 | | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | ++ | OSDs in Primary Role (Up) | | Expected PGs Per OSD: Min: 4, Max: 33, Mean: 19.5, Std Dev: 7.2 | | Actual PGs Per OSD: Min: 7, Max: 43, Mean: 19.5, Std Dev: 6.5 | | 5 Most Subscribed OSDs: 199(43), 175(36), 149(34), 167(32), 20(31) | | 5 Least Subscribed OSDs: 121(7), 46(7), 70(8), 94(9), 122(9) | | Avg Deviation from Most Subscribed OSD: 54.6% | ++ | OSDs in Secondary Role (Up) | | Expected PGs Per OSD: Min: 18, Max: 59, Mean: 39.0, Std Dev: 10.2 | | Actual PGs Per OSD: Min: 17, Max: 61, Mean: 39.0, Std Dev: 9.7 | | 5 Most Subscribed OSDs: 44(61), 14(60), 2(59), 167(59), 164(57) | | 5 Least Subscribed OSDs: 35(17), 31(20), 37(20), 145(20), 16(20) | | Avg Deviation from Most Subscribed OSD: 36.0% | ++ | OSDs in All Roles (Up) | | Expected PGs Per OSD: Min: 32, Max: 83, Mean: 58.5, Std Dev: 12.5 | | Actual PGs Per OSD: Min: 29, Max: 93, Mean: 58.5, Std Dev: 14.6 | | 5 Most Subscribed OSDs: 199(93), 175(92), 44(92), 167(91), 14(91) | | 5 Least Subscribed OSDs: 121(29), 35(30), 47(30), 131(32), 145(32) | | Avg Deviation from Most Subscribed OSD: 37.1% | ++ This is shown for all 
the pools, followed by the totals: ++ | Pool ID: Totals (All Pools) | ++ | Participating OSDs: 210 | | Participating PGs: 131072 | ++ | OSDs in Primary Role (Acting) | | Expected PGs Per OSD: Min: 542, Max: 705, Mean:
[ceph-users] Inkscope packages and blog
Hi everyone, I'm proud to announce that DEB and RPM packages for Inkscope V1.1 are available on github (https://github.com/inkscope/inkscope-packaging). Inkscope has also its blog : http://inkscope.blogspot.fr. You will find there how to install Inkscope on debian servers (http://inkscope.blogspot.fr/2015/03/inkscope-installation-on-debian-servers.html) Feedback is welcome. Cheers, Alain ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Rebalance/Backfill Throtling - anything missing here?
Thank you Robert - I'm wondering, when I do remove the total of 7 OSDs from the crush map, whether that will cause more than 37% of data to be moved (80% or whatever). I'm also wondering if the throttling that I applied is fine or not - I will introduce the osd_recovery_delay_start 10sec as Irek said. I'm just wondering how much the performance impact will be, because: - when stopping an OSD, the impact while backfilling was fine more or less - I can live with this - when I removed an OSD from the crush map - for the first 1h or so the impact was tremendous, and later on during the recovery process the impact was much less but still noticeable... Thanks for the tip of course ! Andrija On 3 March 2015 at 18:34, Robert LeBlanc rob...@leblancnet.us wrote: I would be inclined to shut down both OSDs in a node, let the cluster recover. Once it is recovered, shut down the next two, let it recover. Repeat until all the OSDs are taken out of the cluster. Then I would set nobackfill and norecover. Then remove the hosts/disks from the CRUSH map, then unset nobackfill and norecover. That should give you a few small changes (when you shut down OSDs) and then one big one to get everything in the final place. If you are still adding new nodes, when nobackfill and norecover is set, you can add them in so that the one big relocate fills the new drives too. On Tue, Mar 3, 2015 at 5:58 AM, Andrija Panic andrija.pa...@gmail.com wrote: Thx Irek. Number of replicas is 3. I have 3 servers with 2 OSDs on them on a 1G switch (1 OSD already decommissioned), which is further connected to a new 10G switch/network with 3 servers on it with 12 OSDs each. I'm decommissioning the old 3 nodes on the 1G network... So you suggest removing the whole node with 2 OSDs manually from the crush map? To my knowledge, ceph never places 2 replicas on 1 node; all 3 replicas were originally distributed over all 3 nodes. So anyway it should be safe to remove 2 OSDs at once together with the node itself... since the replica count is 3... ? Thx again for your time On Mar 3, 2015 1:35 PM, Irek Fasikhov malm...@gmail.com wrote: Once you have only three nodes in the cluster, I recommend you add the new nodes to the cluster first, and then delete the old ones. 2015-03-03 15:28 GMT+03:00 Irek Fasikhov malm...@gmail.com: What is your number of replicas? 2015-03-03 15:14 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Hi Irek, yes, stopping an OSD (or setting it to OUT) resulted in only 3% of data degraded and moved/recovered. When I afterwards removed it from the crush map with ceph osd crush rm id, that's when the stuff with 37% happened. And thanks Irek for the help - could you kindly just let me know the preferred steps when removing a whole node? Do you mean I first stop all OSDs again, or just remove each OSD from the crush map, or perhaps just decompile the crush map, delete the node completely, compile it back in, and let it heal/recover ? Do you think this would result in less data misplaced and moved around ? Sorry for bugging you, I really appreciate your help. Thanks On 3 March 2015 at 12:58, Irek Fasikhov malm...@gmail.com wrote: A large percentage of the rebuild of the cluster map (but a low percentage of degradation). If you had not run ceph osd crush rm id, the percentage would be low. In your case, the correct option is to remove the entire node, rather than each disk individually 2015-03-03 14:27 GMT+03:00 Andrija Panic andrija.pa...@gmail.com: Another question - I mentioned here 37% of objects being moved around - these are MISPLACED objects (degraded objects were 0.001%), after I removed 1 OSD from the crush map (out of 44 OSDs or so).
Can anybody confirm this is normal behaviour - and are there any workarounds ? I understand this is because of the object placement algorithm of CEPH, but still, 37% of objects misplaced just by removing 1 OSD out of 44 from the crush map makes me wonder why the percentage is so large ? Seems not good to me, and I have to remove another 7 OSDs (we are demoting some old hardware nodes). This means I could potentially see 7 x the same number of misplaced objects...? Any thoughts ? Thanks On 3 March 2015 at 12:14, Andrija Panic andrija.pa...@gmail.com wrote: Thanks Irek. Does this mean that after peering for each PG there will be a delay of 10sec, meaning that every once in a while I will have 10sec of the cluster NOT being stressed/overloaded, and then the recovery takes place for that PG, and then for another 10sec the cluster is fine, and then stressed again ? I'm trying to understand the process before actually doing stuff (the config reference is there on ceph.com but I don't fully understand the process) Thanks, Andrija On 3 March 2015 at 11:32, Irek Fasikhov malm...@gmail.com wrote: Hi. Use value osd_recovery_delay_start example: [root@ceph08 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.94.asok config show |
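Putting Robert's and Irek's suggestions together, the sequence looks roughly like this (a sketch; the throttle values are only examples, and osd_max_backfills / osd_recovery_max_active are the other knobs commonly lowered alongside osd_recovery_delay_start):

  ceph tell osd.* injectargs '--osd-recovery-delay-start 10'
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
  ceph osd set nobackfill
  ceph osd set norecover
  ceph osd crush rm osd.N      # repeat for every OSD being retired; N is a placeholder
  ceph osd unset nobackfill
  ceph osd unset norecover

so that all the CRUSH removals are merged into one large data movement instead of seven separate ones.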
Re: [ceph-users] Persistent Write Back Cache
Hi Christian, Yes that's correct, it's on the client side. I don't see this much different to a battery backed Raid controller, if you lose power, the data is in the cache until power resumes when it is flushed. If you are going to have the same RBD accessed by multiple servers/clients then you need to make sure the SSD is accessible to both (eg DRBD / Dual Port SAS). But then something like pacemaker would be responsible for ensuring the RBD and cache device are both present before allowing client access. When I wrote this I was thinking more about 2 HA iSCSI servers with RBD's, however I can understand that this feature would prove more of a challenge if you are using Qemu and RBD. Nick -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer Sent: 04 March 2015 08:40 To: ceph-users@lists.ceph.com Cc: Nick Fisk Subject: Re: [ceph-users] Persistent Write Back Cache Hello, If I understand you correctly, you're talking about the rbd cache on the client side. So assume that host or the cache SSD on if fail terminally. The client thinks its sync'ed are on the permanent storage (the actual ceph storage cluster), while they are only present locally. So restarting that service or VM on a different host now has to deal with likely crippling data corruption. Regards, Christian On Wed, 4 Mar 2015 08:26:52 - Nick Fisk wrote: Hi All, Is there anything in the pipeline to add the ability to write the librbd cache to ssd so that it can safely ignore sync requests? I have seen a thread a few years back where Sage was discussing something similar, but I can't find anything more recent discussing it. I've been running lots of tests on our new cluster, buffered/parallel performance is amazing (40K Read 10K write iops), very impressed. However sync writes are actually quite disappointing. Running fio with 128k block size and depth=1, normally only gives me about 300iops or 30MB/s. I'm seeing 2-3ms latency writing to SSD OSD's and from what I hear that's about normal, so I don't think I have a ceph config problem. For applications which do a lot of sync's, like ESXi over iSCSI or SQL databases, this has a major performance impact. Traditional storage arrays work around this problem by having a battery backed cache which has latency 10-100 times less than what you can currently achieve with Ceph and an SSD . Whilst librbd does have a writeback cache, from what I understand it will not cache syncs and so in my usage case, it effectively acts like a write through cache. To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO's are getting coalesced into nice large 512kb IO's at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat. 
Nick -- Christian BalzerNetwork/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ?
Hi, Many thanks for the explanations. I haven't used the nodcache option when mounting cephfs, it actually got there by default My mount command is/was : # mount -t ceph 1.2.3.4:6789:/ /mnt -o name=puppet,secretfile=./puppet.secret I don't know what causes this option to be default, maybe it's the kernel module I compiled from git (because there is no kmod-ceph or kmod-rbd in any RHEL-like distributions except RHEV), I'll try to update/check ... Concerning the rados pool ls, indeed : I created empty files in the pool, and they were not showing up probably because they were just empty - but when I create a non empty file, I see things in rados ls... Thanks again Frederic -Message d'origine- De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de John Spray Envoyé : mardi 3 mars 2015 17:15 À : ceph-users@lists.ceph.com Objet : Re: [ceph-users] cephfs filesystem layouts : authentication gotchas ? On 03/03/2015 15:21, SCHAER Frederic wrote: By the way : looks like the ceph fs ls command is inconsistent when the cephfs is mounted (I used a locally compiled kmod-ceph rpm): [root@ceph0 ~]# ceph fs ls name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet ] (umount /mnt .) [root@ceph0 ~]# ceph fs ls name: cephfs_puppet, metadata pool: puppet_metadata, data pools: [puppet root ] This is probably #10288, which was fixed in 0.87.1 So, I have this pool named root that I added in the cephfs filesystem. I then edited the filesystem xattrs : [root@ceph0 ~]# getfattr -n ceph.dir.layout /mnt/root getfattr: Removing leading '/' from absolute path names # file: mnt/root ceph.dir.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root I'm therefore assuming client.puppet should not be allowed to write or read anything in /mnt/root, which belongs to the root pool. but that is not the case. On another machine where I mounted cephfs using the client.puppet key, I can do this : The mount was done with the client.puppet key, not the admin one that is not deployed on that node : 1.2.3.4:6789:/ on /mnt type ceph (rw,relatime,name=puppet,secret=hidden,nodcache) [root@dev7248 ~]# echo not allowed /mnt/root/secret.notfailed [root@dev7248 ~]# [root@dev7248 ~]# cat /mnt/root/secret.notfailed not allowed This is data you're seeing from the page cache, it hasn't been written to RADOS. You have used the nodcache setting, but that doesn't mean what you think it does (it was about caching dentries, not data). It's actually not even used in recent kernels (http://tracker.ceph.com/issues/11009). You could try the nofsc option, but I don't know exactly how much caching that turns off -- the safer approach here is probably to do your testing using I/Os that have O_DIRECT set. And I can even see the xattrs inherited from the parent dir : [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed getfattr: Removing leading '/' from absolute path names # file: mnt/root/secret.notfailed ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=root Whereas on the node where I mounted cephfs as ceph admin, I get nothing : [root@ceph0 ~]# cat /mnt/root/secret.notfailed [root@ceph0 ~]# ls -l /mnt/root/secret.notfailed -rw-r--r-- 1 root root 12 Mar 3 15:27 /mnt/root/secret.notfailed After some time, the file also gets empty on the puppet client host : [root@dev7248 ~]# cat /mnt/root/secret.notfailed [root@dev7248 ~]# (but the metadata remained ?) Right -- eventually the cache goes away, and you see the true (empty) state of the file. 
Also, as an unpriviledged user, I can get ownership of a secret file by changing the extended attribute : [root@dev7248 ~]# setfattr -n ceph.file.layout.pool -v puppet /mnt/root/secret.notfailed [root@dev7248 ~]# getfattr -n ceph.file.layout /mnt/root/secret.notfailed getfattr: Removing leading '/' from absolute path names # file: mnt/root/secret.notfailed ceph.file.layout=stripe_unit=4194304 stripe_count=1 object_size=4194304 pool=puppet Well, you're not really getting ownership of anything here: you're modifying the file's metadata, which you are entitled to do (pool permissions have nothing to do with file metadata). There was a recent bug where a file's pool layout could be changed even if it had data, but that was about safety rather than permissions. Final question for those that read down here : it appears that before creating the cephfs filesystem, I used the puppet pool to store a test rbd instance. And it appears I cannot get the list of cephfs objects in that pool, whereas I can get those that are on the newly created root pool : [root@ceph0 ~]# rados -p puppet ls test.rbd rbd_directory [root@ceph0 ~]# rados -p root ls 10a. 10b. Bug, or feature ? I didn't see anything in your earlier steps that would have led to any objects in
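As a concrete way to apply John's O_DIRECT suggestion above, a test along these lines bypasses the client page cache, so the write should fail promptly for a client that lacks write access to the pool named in the file's layout (a sketch only; block size and count are arbitrary):

  dd if=/dev/zero of=/mnt/root/secret.notfailed bs=4M count=1 oflag=direct
  dd if=/mnt/root/secret.notfailed of=/dev/null bs=4M count=1 iflag=direct

With buffered I/O the write appears to succeed because it only lands in the page cache, which is the behaviour described earlier in this thread.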
Re: [ceph-users] Fail to bring OSD back to cluster
Hi Luke, Maybe you can set these flags: ceph osd set nodown ceph osd set noout Regards Sahana On Wed, Mar 4, 2015 at 2:32 PM, Luke Kao luke@mycom-osi.com wrote: Hello ceph community, We need some immediate help: our cluster is in a very strange and bad state after an unexpected reboot of many OSD nodes in a very short time frame. We have a cluster with 195 OSDs configured on 9 different OSD nodes, original version 0.80.5. After an issue in the datacenter, at least 5 OSD nodes rebooted, and after the reboot not all OSDs came up, which triggered a lot of recovery; many PGs also went into a dead / incomplete state. Then we tried to restart the OSDs and found that they keep crashing with the error FAILED assert(log.head >= olog.tail && olog.head >= log.tail), so we upgraded to 0.80.7 which covers the fix for #9482; however we still see the error, with different behavior: 0.80.5: once an OSD crashes with this error, any attempt to restart the OSD ends in the same crash 0.80.7: the OSD can be restarted, but after some time another OSD will crash with this error We also tried to set the nobackfill and norecover flags but it doesn't help. So the cluster is stuck and we cannot bring more OSDs back. Any suggestion that may give us a chance to recover the cluster? Many thanks, Luke Kao MYCOM-OSI http://www.mycom-osi.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Implement replication network with live cluster
Hi, I have a live cluster with only a public network (so no explicit network configuration in the ceph.conf file). I'm wondering what the procedure is to implement a dedicated replication/private network alongside the public one. I've read the manual and know how to do it in ceph.conf, but since this is an already running cluster - what should I do after I change ceph.conf on all nodes? Restart the OSDs one by one, or...? Is any downtime expected before the replication network is actually implemented completely?
Another related question: I'm also demoting some old OSDs on old servers. I will have them all stopped, but I would like to implement the replication network before actually removing the old OSDs from the crush map, since a lot of data will be moved around. My old nodes/OSDs (which will be stopped before I implement the replication network) do NOT have a dedicated NIC for the replication network, in contrast to the new nodes/OSDs. So there will still be references to these old OSDs in the crush map. Will it be a problem if I implement a replication network that WILL work on the new nodes/OSDs but not on the old ones, since they don't have a dedicated NIC? I guess not, since the old OSDs are stopped anyway, but I would like an opinion. Or perhaps I could remove the OSDs from the crush map after first setting nobackfill and norecover (so no rebalancing happens) and then implement the replication network? Sorry for the long post, but... Thanks, -- Andrija Panić
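For reference, the change being discussed usually amounts to something like the following in ceph.conf on every node, followed by rolling OSD restarts (the subnets and the osd id are examples only; monitors keep using the public network):
[global]
    public network  = 192.168.0.0/24
    cluster network = 10.10.10.0/24
# then, one OSD at a time:
service ceph restart osd.<id>
ceph -s     # wait for HEALTH_OK before moving on to the next OSD
Done this way, the move to a dedicated cluster network should not require client downtime, as long as the cluster is allowed to return to HEALTH_OK between restarts.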
Re: [ceph-users] Persistent Write Back Cache
On 04/03/2015 08:26, Nick Fisk wrote: To illustrate the difference a proper write back cache can make, I put a 1GB (512mb dirty threshold) flashcache in front of my RBD and tweaked the flush parameters to flush dirty blocks at a large queue depth. The same fio test (128k iodepth=1) now runs at 120MB/s and is limited by the performance of SSD used by flashcache, as everything is stored as 4k blocks on the ssd. In fact since everything is stored as 4k blocks, pretty much all IO sizes are accelerated to max speed of the SSD. Looking at iostat I can see all the IO’s are getting coalesced into nice large 512kb IO’s at a high queue depth, which Ceph easily swallows. If librbd could support writing its cache out to SSD it would hopefully achieve the same level of performance and having it integrated would be really neat.
What are you hoping to gain from building something into ceph instead of using flashcache/bcache/dm-cache on top of it? It seems like since you would anyway need to handle your HA configuration, setting up the actual cache device would be the simple part. Cheers, John
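To make the comparison concrete, the kind of client-side setup John is pointing at looks roughly like this (a sketch only; the device names, mount point and writeback policy are examples, and flashcache is just one of the three options he lists):
# put a writeback flashcache device, backed by a local SSD partition, in front of a mapped RBD
flashcache_create -p back rbdcache /dev/sdb1 /dev/rbd0
mkfs.xfs /dev/mapper/rbdcache
mount /dev/mapper/rbdcache /srv/data
The cache device itself is the easy part; as John notes, the hard part is HA - the SSD now holds acknowledged writes, so it has to be shared or mirrored between whichever hosts might take over the RBD.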
[ceph-users] Perf problem after upgrade from dumpling to firefly
Hi, last Saturday I upgraded my production cluster from dumpling to emperor (since we were successfully using it on a test cluster). A couple of hours later, we had falling OSDs: some of them were marked as down by Ceph, probably because of IO starvation. I set the cluster to «noout», started the downed OSDs, then let it recover. 24h later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is maintained. Things are better, but the cluster is slower than with dumpling. The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png
But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png
Neither does the node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png
Or the whole cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png
Some background: The cluster is split into pools with «full SSD» OSDs and «HDD+SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDD and 3 SSD), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs). The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried to reduce osd_pg_max_concurrent_snap_trims to 1. It doesn't seem to really help. The only thing which seems to help is to reduce osd_disk_threads from 8 to 1. So... Any idea about what's happening? Thanks for any help, Olivier
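One way to see what the snap-trim and disk-thread settings actually are on a running OSD, and to change them without a restart, is the admin socket plus injectargs (a sketch; osd.70 is just an example id and the values simply mirror the ones mentioned above):
# check the values the running daemon is using
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | grep -E 'snap_trim|disk_threads'
# adjust at runtime on all OSDs
ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.8'
ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1'
Whether these settings help at all here is of course exactly the open question in this thread.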
Re: [ceph-users] Rbd image's data deletion
An RBD image is split up into (by default 4MB) objects within the OSDs. When you delete an RBD image, all the objects associated with the image are removed from the OSDs. The objects are not securely erased from the OSDs, if that is what you are asking. -- Jason Dillaman Red Hat dilla...@redhat.com http://www.redhat.com
- Original Message - From: Giuseppe Civitella giuseppe.civite...@gmail.com To: ceph-users ceph-us...@ceph.com Sent: Tuesday, March 3, 2015 11:36:46 AM Subject: [ceph-users] Rbd image's data deletion
Hi all, what happens to the data contained in an rbd image when the image itself gets deleted? Is the data just unlinked, or is it destroyed in a way that makes it unreadable? Thanks, Giuseppe
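A simple way to watch this happen (a sketch; the pool name, image name and prefix value are examples) is to look up the image's object name prefix and list the backing objects before and after the delete:
# the prefix that every data object of this image shares
rbd -p rbd info myimage | grep block_name_prefix
# objects currently backing the image
rados -p rbd ls | grep rbd_data.10074b0dc51
rbd -p rbd rm myimage
# after the rm completes, the same listing should come back empty
rados -p rbd ls | grep rbd_data.10074b0dc51
As Jason says, removal just deletes these objects; the freed space on the OSDs is not overwritten, so this is not a secure erase.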
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Hi, maybe this is related?:
http://tracker.ceph.com/issues/9503 Dumpling: removing many snapshots in a short time makes OSDs go berserk
http://tracker.ceph.com/issues/9487 dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
I think it's already backported in dumpling, not sure it's already done for firefly. Alexandre
[ceph-users] Firefly, cephfs issues: different unix rights depending on the client and ls are slow
Hi, I'm trying cephfs and I have some problems. Here is the context:
All the nodes (in the cluster and the clients) are Ubuntu 14.04 with a 3.16 kernel (after apt-get install linux-generic-lts-utopic and a reboot). The cluster:
- one server with just one monitor daemon (RAM 2GB)
- 2 servers (RAM 24GB) with one monitor daemon, ~10 OSD daemons (one per disk of 275 GB), and one mds daemon (I use the default active/standby mode and the pools for cephfs are data and metadata)
The cluster is totally unused (the servers are idle as regards the RAM, the load average etc), it's a little cluster for testing, the raw space is 5172G, the number of replicas is 2. Another remark, facing my problem, I have put mds cache size = 100 in my ceph conf but without a lot of effect (or else I would not be posting this message). Initially, the cephfs is completely empty. The clients, test-cephfs and test-cephfs2, have 512MB of RAM. On these clients, I mount the cephfs like this (with the root account):
~# mkdir /cephfs
~# mount -t ceph 10.0.2.150,10.0.2.151,10.0.2.152:/ /cephfs/ -o name=cephfs,secretfile=/etc/ceph/ceph.client.cephfs.secret
Then on test-cephfs, I do:
root@test-cephfs:~# mkdir /cephfs/d1
root@test-cephfs:~# ll /cephfs/
total 4
drwxr-xr-x 1 root root 0 Mar 4 11:45 ./
drwxr-xr-x 24 root root 4096 Mar 4 11:42 ../
drwxr-xr-x 1 root root 0 Mar 4 11:45 d1/
Then, on test-cephfs2, I do:
root@test-cephfs2:~# ll /cephfs/
total 4
drwxr-xr-x 1 root root 0 Mar 4 11:45 ./
drwxr-xr-x 24 root root 4096 Mar 4 11:42 ../
drwxrwxrwx 1 root root 0 Mar 4 11:45 d1/
1) Why are the unix rights of d1/ different on test-cephfs and on test-cephfs2? They should be the same, shouldn't they?
2) If I create 100 files in /cephfs/d1/ on test-cephfs:
for i in $(seq 100)
do
    echo $(date +%s.%N) > /cephfs/d1/f_$i
done
then sometimes, on test-cephfs2, when I do a simple:
root@test-cephfs2:~# time \ls -la /cephfs
the command can take 2 or 3 seconds, which seems to me very long for a directory with just 100 files. Generally, if I repeat the command on test-cephfs2 just after, it's immediate, but not always. I cannot reproduce the problem in a deterministic way. Sometimes, to reproduce the problem, I must remove all the files in /cephfs/ on test-cephfs and recreate them. It's very strange. Sometimes and randomly, something seems to be stalled but I don't know what. I suspect a problem of mds tuning but, in fact, I don't know what to do. Do you have an idea of the problem?
3) I plan to use cephfs in production in a project of web servers (which share a cephfs storage) but I would like to solve the issue above first. If you have any suggestion about cephfs and mds tuning, I am highly interested. Thanks in advance for your help. -- François Lafont
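Two things commonly looked at in this situation (only a sketch; the cache value and the mds id are example placeholders, not a recommendation for this cluster) are the size of the MDS inode cache and the MDS-side counters during a slow ls:
# in ceph.conf on the MDS nodes, takes effect after an MDS restart;
# the default is 100000 inodes, and a larger cache uses more RAM on the MDS
[mds]
    mds cache size = 300000
# on the active MDS, the admin socket exposes cache and request counters
ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok perf dump
Comparing the perf dump output before and after a slow `ls -la` can show whether the delay is spent in the MDS at all.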
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Thanks Alexandre. The load problem is permanent: I have twice the IO/s on the HDDs since firefly. And yes, the problem hangs production at night during snap trimming. I suppose there is a new OSD parameter which changes the behavior of the journal, or something like that. But I didn't find anything about that. Olivier
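One way to test the "new OSD parameter" theory (a sketch; the osd id is an example, and it assumes you still have a dumpling OSD, or a saved dump from before the upgrade, to compare against) is to dump the full running configuration and diff it:
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | sort > osd70-firefly.txt
# diff osd70-firefly.txt osd70-dumpling.txt
Any option whose default changed between the two releases shows up directly in the diff; the filestore and journal sections are the interesting ones for a write-amplification problem like this.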
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
The load problem is permanent: I have twice the IO/s on HDD since firefly.
Oh, permanent, that's strange. (If you don't see more traffic coming from clients, I don't understand...) Do you also see twice the ios/ops in the ceph -w stats? Is the ceph health ok?
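A quick way to answer that from the cluster side (plain status commands, nothing specific to this cluster):
# cluster-wide client IO, updated live - compare the op/s figures with the pre-upgrade graphs
ceph -w
# or a one-shot, per-pool view of client read/write rates
ceph osd pool stats
If the client op/s are unchanged while the HDD OSDs show twice the disk writes, the extra IO is being generated on the OSD side rather than by the clients.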
Re: [ceph-users] Perf problem after upgrade from dumpling to firefly
Ceph health is OK, yes. The «firefly-upgrade-cluster-IO.png» graph shows the IO stats seen by ceph: there is no change between dumpling and firefly. The change is only on the OSDs (and not on the OSD journals).
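To pin down where the extra writes on the HDD OSDs come from, one option (a sketch; osd.70 is an example id, and which counters matter depends on the version) is to capture the OSD's internal counters twice over a fixed interval and compare them:
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump | python -mjson.tool > perf_t0.json
sleep 60
ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok perf dump | python -mjson.tool > perf_t1.json
diff perf_t0.json perf_t1.json
# the filestore and journal sections show how many ops and bytes the backend actually issued,
# which can be set against the client op rate reported by ceph -w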