Re: [ceph-users] anti-cephalopod question
Christian, I'll start out with 4 nodes. I understand rebalancing takes time. [Eventually I'll need to swap out one of the nodes with a host I'm using for production, but that'll be on a Saturday afternoon.] However, I do not fully get this:

"No, the default is to split at host level. So once you have enough nodes in one room to fulfill the replication level (3), some PGs will be all in that location."

Can you please send the non-default Firefly ceph.conf settings for a 4-node anti-cephalopod cluster? I want to start my testing with close-to-ideal Ceph settings, then do a lot of testing of noout and other things. After I'm done I'll document what was done and post it in a few places. I appreciate the suggestions you've sent.

kind regards, Rob Fantini

On Tue, Jul 29, 2014 at 9:49 PM, Christian Balzer ch...@gol.com wrote:

Hello,

On Tue, 29 Jul 2014 06:33:14 -0400 Robert Fantini wrote:

Christian - Thank you for the answer. I'll get around to reading 'CRUSH Maps' a few times; it is important to have a good understanding of Ceph's parts. So another question: as long as I keep the same number of nodes in both rooms, will Firefly defaults keep data balanced?

No, the default is to split at host level. So once you have enough nodes in one room to fulfill the replication level (3), some PGs will be all in that location.

If not, I'll stick with 2 in each room until I understand how to configure things.

That will work, but I would strongly advise you to get it right from the start, as in configure the CRUSH map to your needs (split on room or such). Because if you introduce this change later, your data will be rebalanced.

Christian

On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer ch...@gol.com wrote:

On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:

"target replication level of 3 with a min of 1 across the node level"

After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/ , I assume that to accomplish that I should set these in ceph.conf?
osd pool default size = 3
osd pool default min size = 1

Not really. The min size specifies how few replicas need to be online for Ceph to accept I/O. These settings (the current Firefly defaults) with the default CRUSH map will have 3 sets of data spread over 3 OSDs and not use the same node (host) more than once. So with 2 nodes in each location, a replica will always be in both locations. However, if you add more nodes, all of them could wind up in the same building. To prevent this, you have location qualifiers beyond host, and you can modify the CRUSH map to enforce that at least one replica is in a different rack, row, room, region, etc. Advanced material, but one really needs to understand this: http://ceph.com/docs/master/rados/operations/crush-map/

Christian

On Mon, Jul 28, 2014 at 2:56 PM, Michael mich...@onlinefusion.co.uk wrote:

If you've two rooms then I'd go for two OSD nodes in each room, a target replication level of 3 with a min of 1 across the node level, then have 5 monitors and put the last monitor outside of either room (the other MONs can share with the OSD nodes if needed). Then you've got 'safe' replication for OSD/node replacement on failure, with some 'shuffle' room for when it's needed, and either room can be down while the external last monitor allows the decisions required for a single room to operate. There's no way you can do a 3/2 MON split that doesn't risk the two nodes being up but unable to serve data while the three are down, so you'd need to find a way to make it a 2/2/1 split instead.

-Michael

On 28/07/2014 18:41, Robert Fantini wrote:

OK, for higher availability 5 nodes is better than 3, so we'll run 5. However, we want normal operations with just 2 nodes. Is that possible? Eventually 2 nodes will be in the next building 10 feet away, with a brick wall in between, connected with InfiniBand or better. So one room can go offline and the other will stay on. The flip of the coin means the 3-node room will probably go down.
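Christian's point about location qualifiers beyond host can be sketched as a CRUSH rule. This is only an illustrative fragment for a two-room split with size 3 -- the rule name and bucket names are made up, and any real rule should be built against your actual CRUSH hierarchy:

```
# Hypothetical CRUSH rule: spread size-3 replicas across two rooms
# (2 copies in one room, 1 in the other). Names are assumptions.
rule replicated_across_rooms {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step choose firstn 2 type room        # use both rooms
    step chooseleaf firstn 2 type host    # up to 2 distinct hosts per room
    step emit
}
```

With this shape, no single room ever holds all three replicas, which is exactly the failure Christian warns about once one room has 3+ hosts.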
All systems will have dual power supplies connected to different UPSes. In addition we have a power generator; later we'll have a 2nd generator, and then the UPSes will use different lines attached to those generators somehow. Also, of course, we never count on one cluster to have our data. We have 2 co-locations with backups going on often, using zfs send/receive and/or rsync.

So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or is that a bad idea?

PS: any other ideas on how to increase availability are welcome.
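For reference, the handful of non-default-relevant options actually named in this thread can be collected into a ceph.conf sketch. This is not a vetted "anti-cephalopod" configuration -- just the settings discussed above, under Firefly:

```
# Sketch only: the options mentioned in this thread, not a complete config.
[global]
osd pool default size = 3            # three replicas (Firefly default)
osd pool default min size = 1        # accept I/O with as few as 1 replica up

[mon]
mon osd downout subtree limit = host # don't auto-out a whole failed host
```

Room-level placement cannot be expressed in ceph.conf at all; that part lives in the CRUSH map, per the link Christian gives above.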
Re: [ceph-users] anti-cephalopod question
Christian - Thank you for the answer. I'll get around to reading 'CRUSH Maps' a few times; it is important to have a good understanding of Ceph's parts.

So another question: as long as I keep the same number of nodes in both rooms, will Firefly defaults keep data balanced? If not, I'll stick with 2 in each room until I understand how to configure things.

On Mon, Jul 28, 2014 at 9:19 PM, Christian Balzer ch...@gol.com wrote:

On Mon, 28 Jul 2014 18:11:33 -0400 Robert Fantini wrote:

"target replication level of 3 with a min of 1 across the node level"

After reading http://ceph.com/docs/master/rados/configuration/ceph-conf/ , I assume that to accomplish that I should set these in ceph.conf?

osd pool default size = 3
osd pool default min size = 1

Not really. The min size specifies how few replicas need to be online for Ceph to accept I/O. These settings (the current Firefly defaults) with the default CRUSH map will have 3 sets of data spread over 3 OSDs and not use the same node (host) more than once. So with 2 nodes in each location, a replica will always be in both locations. However, if you add more nodes, all of them could wind up in the same building. To prevent this, you have location qualifiers beyond host, and you can modify the CRUSH map to enforce that at least one replica is in a different rack, row, room, region, etc. Advanced material, but one really needs to understand this: http://ceph.com/docs/master/rados/operations/crush-map/

Christian

On Mon, Jul 28, 2014 at 2:56 PM, Michael mich...@onlinefusion.co.uk wrote:

If you've two rooms then I'd go for two OSD nodes in each room, a target replication level of 3 with a min of 1 across the node level, then have 5 monitors and put the last monitor outside of either room (the other MONs can share with the OSD nodes if needed).
Then you've got 'safe' replication for OSD/node replacement on failure, with some 'shuffle' room for when it's needed, and either room can be down while the external last monitor allows the decisions required for a single room to operate. There's no way you can do a 3/2 MON split that doesn't risk the two nodes being up but unable to serve data while the three are down, so you'd need to find a way to make it a 2/2/1 split instead.

-Michael

On 28/07/2014 18:41, Robert Fantini wrote:

OK, for higher availability 5 nodes is better than 3, so we'll run 5. However, we want normal operations with just 2 nodes. Is that possible? Eventually 2 nodes will be in the next building 10 feet away, with a brick wall in between, connected with InfiniBand or better. So one room can go offline and the other will stay on. The flip of the coin means the 3-node room will probably go down.

All systems will have dual power supplies connected to different UPSes. In addition we have a power generator; later we'll have a 2nd generator, and then the UPSes will use different lines attached to those generators somehow. Also, of course, we never count on one cluster to have our data. We have 2 co-locations with backups going on often, using zfs send/receive and/or rsync.

So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or is that a bad idea?

PS: any other ideas on how to increase availability are welcome.

On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer ch...@gol.com wrote:

On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:

On 07/28/2014 08:49 AM, Christian Balzer wrote:

Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

Hello Christian, Let me supply more info and answer some questions.

* Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing and email.
* DRBD has been very reliable, but I am the SPOF. Meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try to explain how to deal with split brain in advance... For the future, Ceph looks like it will be easier to maintain.

The DRBD people would of course tell you to configure things in a way that a split brain can't happen. ^o^ Note that given the right circumstances (too many OSDs down, MONs down) Ceph can wind up in a similar state.

I am not sure what you mean by Ceph winding up in a similar state. If you mean 'split brain' in the usual sense of the term, it does not occur in Ceph. If it does, you have surely found a bug and you should let us know with lots of CAPS. What you can incur though, if you have too many monitors
Re: [ceph-users] anti-cephalopod question
Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

Hello Christian, Let me supply more info and answer some questions.

* Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing and email.

* DRBD has been very reliable, but I am the SPOF. Meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try to explain how to deal with split brain in advance... For the future, Ceph looks like it will be easier to maintain.

The DRBD people would of course tell you to configure things in a way that a split brain can't happen. ^o^ Note that given the right circumstances (too many OSDs down, MONs down) Ceph can wind up in a similar state.

* We use Proxmox, so Ceph and MONs will share each node. I've used Proxmox for a few years and like the KVM/OpenVZ management.

I tried it some time ago, but at that time it was still stuck with 2.6.32 due to OpenVZ and that wasn't acceptable to me for various reasons. I think it still is, too.

* Ceph hardware: four hosts, 8 drives each.
OS: RAID-1 on SSD.

Good, that should be sufficient for running MONs (you will want 3).

OSD: four-disk RAID10 array using 2TB drives. Two of the systems will use Seagate Constellation ES.3 2TB 7200 RPM 128MB Cache SAS 6Gb/s; the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB Cache SATA 6.0Gb/s drives.
Journal: 200GB Intel DC S3700 Series.
Spare disk for RAID.

* More questions. You wrote: "In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things because at some point a recovery might be unavoidable."
"To prevent recoveries based on failed disks, use RAID; for node failures you could permanently set OSD noout or have monitoring software do that when it detects a node failure."

I'll research 'OSD noout'.

You might be happy with "mon osd downout subtree limit = host" as well. In that case you will need to manually trigger a rebuild (set that node/OSD to out) if you can't repair a failed node in a short time and want to keep your redundancy levels.

Are there other settings I should read up on / consider? For node reboots due to kernel upgrades - how is that handled? Of course that would be scheduled for off hours.

Set noout before a planned downtime, or live dangerously and assume it comes back within the timeout period (5 minutes IIRC).

Any other suggestions?

Test your cluster extensively before going into production. Fill it with enough data to be close to what you're expecting and fail one node/OSD. See how bad things become; try to determine where any bottlenecks are with tools like atop. While you've done pretty much everything to prevent that scenario from a disk failure with the RAID10, and by keeping nodes from being set out by whatever means you choose ("mon osd downout subtree limit = host" seems to work, I just tested it), having a cluster that doesn't melt down when recovering, or at least knowing how bad things will be in such a scenario, helps a lot.

Regards, Christian

thanks for the suggestions, Rob

On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer ch...@gol.com wrote:

Hello,

actually replying in the other thread was fine by me; it was after all relevant in a sense to it. And you mentioned something important there which you didn't mention below: that you're coming from DRBD with a lot of experience there. So do I, and Ceph/RBD simply isn't (and probably never will be) an adequate replacement for DRBD in some use cases. I certainly plan to keep deploying DRBD where it makes more sense (IOPS/speed), while migrating everything else to Ceph.
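The planned-maintenance workflow Christian describes (set noout, reboot, unset) comes down to two commands. A sketch of the sequence -- adapt to your own maintenance windows:

```
# Before a planned reboot (e.g. kernel upgrade):
ceph osd set noout      # down OSDs stay "in"; no rebalance is triggered

# ... reboot the node, wait for its OSDs to report up again ...

ceph osd unset noout    # restore normal down -> out behaviour
```

Without the flag, an OSD that stays down past the timeout is marked out and the cluster starts shuffling data; with it, the cluster simply runs degraded until the node returns.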
Anyway, let's look at your mail:

On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

I've a question regarding advice from these threads:
https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html

Our current setup has 4 OSDs per node. When a drive fails, the cluster is almost unusable for data entry. I want to change our setup so that under no circumstances that ever happens.

While you can pretty much avoid this from happening, your cluster should be able to handle a recovery. While Ceph is a bit more hamfisted than DRBD and definitely needs more controls and tuning to make recoveries have less of an impact, you would see something similar with DRBD and badly configured recovery speeds. In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You
Re: [ceph-users] anti-cephalopod question
If you've two rooms then I'd go for two OSD nodes in each room, a target replication level of 3 with a min of 1 across the node level, then have 5 monitors and put the last monitor outside of either room (the other MONs can share with the OSD nodes if needed). Then you've got 'safe' replication for OSD/node replacement on failure, with some 'shuffle' room for when it's needed, and either room can be down while the external last monitor allows the decisions required for a single room to operate.

There's no way you can do a 3/2 MON split that doesn't risk the two nodes being up but unable to serve data while the three are down, so you'd need to find a way to make it a 2/2/1 split instead.

-Michael

On 28/07/2014 18:41, Robert Fantini wrote:

OK, for higher availability 5 nodes is better than 3, so we'll run 5. However, we want normal operations with just 2 nodes. Is that possible? Eventually 2 nodes will be in the next building 10 feet away, with a brick wall in between, connected with InfiniBand or better. So one room can go offline and the other will stay on. The flip of the coin means the 3-node room will probably go down.

All systems will have dual power supplies connected to different UPSes. In addition we have a power generator; later we'll have a 2nd generator, and then the UPSes will use different lines attached to those generators somehow. Also, of course, we never count on one cluster to have our data. We have 2 co-locations with backups going on often, using zfs send/receive and/or rsync.

So for the 5-node cluster, how do we set it so that 2 nodes up = OK? Or is that a bad idea?

PS: any other ideas on how to increase availability are welcome.
On Mon, Jul 28, 2014 at 12:29 PM, Christian Balzer ch...@gol.com wrote:

On Mon, 28 Jul 2014 11:22:38 +0100 Joao Eduardo Luis wrote:

On 07/28/2014 08:49 AM, Christian Balzer wrote:

Hello,

On Sun, 27 Jul 2014 18:20:43 -0400 Robert Fantini wrote:

Hello Christian, Let me supply more info and answer some questions.

* Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing and email.

* DRBD has been very reliable, but I am the SPOF. Meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try to explain how to deal with split brain in advance... For the future, Ceph looks like it will be easier to maintain.

The DRBD people would of course tell you to configure things in a way that a split brain can't happen. ^o^ Note that given the right circumstances (too many OSDs down, MONs down) Ceph can wind up in a similar state.

I am not sure what you mean by Ceph winding up in a similar state. If you mean 'split brain' in the usual sense of the term, it does not occur in Ceph. If it does, you have surely found a bug and you should let us know with lots of CAPS. What you can incur though, if you have too many monitors down, is cluster downtime. The monitors ensure you need a strict majority of monitors up in order to operate the cluster, and they will not serve requests if said majority is not in place. The monitors will only serve requests when there's a formed 'quorum', and a quorum is only formed by (N/2)+1 monitors, N being the total number of monitors in the cluster (via the monitor map -- monmap). This said, if out of 3 monitors you have 2 monitors down, your cluster will cease functioning (no admin commands, no writes or reads served).
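Joao's (N/2)+1 quorum rule is easy to check for the MON layouts discussed in this thread. A small sketch (the function names are my own, not Ceph API):

```python
def quorum_size(n_mons: int) -> int:
    """Monitors needed for a strict majority: floor(N/2) + 1."""
    return n_mons // 2 + 1

def has_quorum(n_mons: int, n_up: int) -> bool:
    """True if the surviving monitors can still form a quorum."""
    return n_up >= quorum_size(n_mons)

# 5 MONs in Michael's 2/2/1 split: losing either room (2 MONs) leaves 3 up.
print(quorum_size(5))    # 3
print(has_quorum(5, 3))  # True -> the surviving room keeps operating

# Joao's example: 3 MONs with 2 down leaves only 1 up.
print(has_quorum(3, 1))  # False -> cluster stops serving
```

This is why a 3/2 split across two rooms cannot work: the two-MON room alone can never reach 3 of 5, and with the three-MON room down the cluster halts.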
As there is no configuration in which you can have two strict majorities, and thus no two partitions of the cluster are able to function at the same time, you do not incur split brain.

I wrote "similar state", not "same state". From a user perspective it is purely semantics how and why your shared storage has seized up; the end result is the same. And yes, that MON example was exactly what I was aiming for: your cluster might still have all the data (another potential failure mode, of course), but it is inaccessible. DRBD will see and call it a split brain, Ceph will call it a Paxos voting failure; it doesn't matter one iota to the poor sod relying on that particular storage. My point was and is: when you design a cluster of whatever flavor, make sure you understand how it can (and WILL) fail, how to prevent that from happening if at all possible, and how to recover from it if not. Potentially (hopefully) in the case of Ceph it would be
Re: [ceph-users] anti-cephalopod question
Hello Christian, Let me supply more info and answer some questions.

* Our main concern is high availability, not speed. Our storage requirements are not huge. However, we want good keyboard response 99.99% of the time. We mostly do data entry and reporting: 20-25 users doing mostly order and invoice processing and email.

* DRBD has been very reliable, but I am the SPOF. Meaning that when split brain occurs [every 18-24 months] it is me or no one who knows what to do. Try to explain how to deal with split brain in advance... For the future, Ceph looks like it will be easier to maintain.

* We use Proxmox, so Ceph and MONs will share each node. I've used Proxmox for a few years and like the KVM/OpenVZ management.

* Ceph hardware: four hosts, 8 drives each.
OS: RAID-1 on SSD.
OSD: four-disk RAID10 array using 2TB drives. Two of the systems will use Seagate Constellation ES.3 2TB 7200 RPM 128MB Cache SAS 6Gb/s; the other two hosts use Western Digital RE WD2000FYYZ 2TB 7200 RPM 64MB Cache SATA 6.0Gb/s drives.
Journal: 200GB Intel DC S3700 Series.
Spare disk for RAID.

* More questions. You wrote: "In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things because at some point a recovery might be unavoidable. To prevent recoveries based on failed disks, use RAID; for node failures you could permanently set OSD noout or have monitoring software do that when it detects a node failure."

I'll research 'OSD noout'. Are there other settings I should read up on / consider? For node reboots due to kernel upgrades - how is that handled? Of course that would be scheduled for off hours. Any other suggestions?

thanks for the suggestions, Rob

On Sat, Jul 26, 2014 at 1:47 AM, Christian Balzer ch...@gol.com wrote:

Hello,

actually replying in the other thread was fine by me; it was after all relevant in a sense to it.
And you mentioned something important there which you didn't mention below: that you're coming from DRBD with a lot of experience there. So do I, and Ceph/RBD simply isn't (and probably never will be) an adequate replacement for DRBD in some use cases. I certainly plan to keep deploying DRBD where it makes more sense (IOPS/speed), while migrating everything else to Ceph.

Anyway, let's look at your mail:

On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

I've a question regarding advice from these threads:
https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html

Our current setup has 4 OSDs per node. When a drive fails, the cluster is almost unusable for data entry. I want to change our setup so that under no circumstances that ever happens.

While you can pretty much avoid this from happening, your cluster should be able to handle a recovery. While Ceph is a bit more hamfisted than DRBD and definitely needs more controls and tuning to make recoveries have less of an impact, you would see something similar with DRBD and badly configured recovery speeds. In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things because at some point a recovery might be unavoidable. To prevent recoveries based on failed disks, use RAID; for node failures you could permanently set OSD noout or have monitoring software do that when it detects a node failure.

Network: we use 2 IB switches and bonding in failover mode. Systems are two Dell PowerEdge R720 and Supermicro X8DT3.

I'm confused. Those Dells tend to have 8 drive bays normally, don't they? So you're just using 4 HDDs for OSDs? No SSD journals? Just 2 storage nodes? Note that unless you do use RAIDed OSDs this leaves you vulnerable to dual disk failures. Which will happen.
Also, that SM product number is for a motherboard, not a server; is that your monitor host? Anything production with data on it that you value should have 3 MON hosts. If you can't afford dedicated ones, sharing them on an OSD node (preferably with the OS on SSDs to keep leveldb happy) is better than just one, because if that one dies or gets corrupted, your data is inaccessible.

So looking at how to do things better we will try '#4 - anti-cephalopod'.

That is a seriously funny phrase!

We'll switch to using RAID10 or RAID6 and have one OSD per node, using high-end RAID controllers, hot spares, etc.

Are you still talking about the same hardware as above, just 4 HDDs for storage? With 4 HDDs I'd go for RAID10 (definitely want a hot spare there); if you have more bays, use up to 12 for RAID6 with a high-performance, large-HW-cache controller.

And use one Intel 200gb
Re: [ceph-users] anti-cephalopod question
Hello,

actually replying in the other thread was fine by me; it was after all relevant in a sense to it. And you mentioned something important there which you didn't mention below: that you're coming from DRBD with a lot of experience there. So do I, and Ceph/RBD simply isn't (and probably never will be) an adequate replacement for DRBD in some use cases. I certainly plan to keep deploying DRBD where it makes more sense (IOPS/speed), while migrating everything else to Ceph.

Anyway, let's look at your mail:

On Fri, 25 Jul 2014 14:33:56 -0400 Robert Fantini wrote:

I've a question regarding advice from these threads:
https://mail.google.com/mail/u/0/#label/ceph/1476b93097673ad7?compose=1476ec7fef10fd01
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg11011.html

Our current setup has 4 OSDs per node. When a drive fails, the cluster is almost unusable for data entry. I want to change our setup so that under no circumstances that ever happens.

While you can pretty much avoid this from happening, your cluster should be able to handle a recovery. While Ceph is a bit more hamfisted than DRBD and definitely needs more controls and tuning to make recoveries have less of an impact, you would see something similar with DRBD and badly configured recovery speeds. In essence, if your current setup can't handle the loss of a single disk, what happens if a node fails? You will need to design (HW) and configure (various Ceph options) your cluster to handle these things because at some point a recovery might be unavoidable. To prevent recoveries based on failed disks, use RAID; for node failures you could permanently set OSD noout or have monitoring software do that when it detects a node failure.

Network: we use 2 IB switches and bonding in failover mode. Systems are two Dell PowerEdge R720 and Supermicro X8DT3.

I'm confused. Those Dells tend to have 8 drive bays normally, don't they? So you're just using 4 HDDs for OSDs? No SSD journals? Just 2 storage nodes?
Note that unless you do use RAIDed OSDs this leaves you vulnerable to dual disk failures. Which will happen.

Also, that SM product number is for a motherboard, not a server; is that your monitor host? Anything production with data on it that you value should have 3 MON hosts. If you can't afford dedicated ones, sharing them on an OSD node (preferably with the OS on SSDs to keep leveldb happy) is better than just one, because if that one dies or gets corrupted, your data is inaccessible.

So looking at how to do things better we will try '#4 - anti-cephalopod'.

That is a seriously funny phrase!

We'll switch to using RAID10 or RAID6 and have one OSD per node, using high-end RAID controllers, hot spares, etc.

Are you still talking about the same hardware as above, just 4 HDDs for storage? With 4 HDDs I'd go for RAID10 (definitely want a hot spare there); if you have more bays, use up to 12 for RAID6 with a high-performance, large-HW-cache controller.

And use one Intel 200GB S3700 per node for journal.

That's barely enough for 4 HDDs at 365MB/s write speed, but it will do nicely if those are in a RAID10 (half the speed of the individual drives). Keep in mind that your node will never be able to write faster than the speed of your journal.

My questions: is there a minimum number of OSDs which should be used?

If you have one OSD per node and the disks are RAIDed, 2 OSDs aka 2 nodes is sufficient to begin with. However, your performance might not be what you expect (an OSD process seems to be incapable of doing more than 800 write IOPS). But with a 4-disk RAID10 (essentially 2 HDDs, so about 200 IOPS) that's not so much of an issue. In my case, with an 11-disk RAID6 AND a 4GB HW cache Areca controller, it certainly is rather frustrating. In short, the more nodes (OSDs) you can deploy, the better the performance will be.
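Christian's journal-sizing point can be put into numbers. The figures below are rough assumptions (365 MB/s for the S3700 comes from the thread; ~150 MB/s per 7200 RPM HDD is my own ballpark), so treat this as back-of-the-envelope arithmetic, not a benchmark:

```python
# Assumed sustained sequential write speeds (MB/s).
journal_mb_s = 365   # one Intel DC S3700 200GB, per the thread
hdd_mb_s = 150       # a single 7200 RPM 2TB HDD, rough estimate

# 4 independent OSD HDDs could together absorb 4 * 150 = 600 MB/s.
raw_demand = 4 * hdd_mb_s
# A 4-disk RAID10 writes at roughly the speed of 2 drives: ~300 MB/s.
raid10_demand = 2 * hdd_mb_s

print(journal_mb_s >= raw_demand)     # False: journal caps 4 separate OSDs
print(journal_mb_s >= raid10_demand)  # True: fine for the RAID10 layout
```

This is the arithmetic behind "barely enough for 4 HDDs, but will do nicely in a RAID10": the node can never write faster than the journal device, so the journal only needs to match the RAID10's effective write speed.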
And of course, in case a node dies and you don't think it can be brought back in a sensibly short time frame, having more than 2 nodes will enable you to do a recovery/rebalance and restore your redundancy to the desired level.

Should OSDs per node be the same?

It is advantageous to have identical disks and OSD sizes; it makes the whole thing more predictable and you don't have to play with weights. As for having different numbers of OSDs per node, consider this example: 4 nodes with 1 OSD each, and one node with 4 OSDs (all OSDs of the same size). What will happen here is that all the replicas from the single-OSD nodes might wind up on the 4-OSD node, so it had better have more power in all aspects than the single-OSD nodes. Now that node fails and you decide to let things rebalance, as it can't be repaired shortly. But your cluster was half full, and now it will be 100% full and become unusable (for writes). So the moral of the story: deploy as much identical HW as possible.

Christian

best regards, Rob

PS: I had asked the above in the middle of another thread... please ignore it there.

--
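Christian's "half full becomes 100% full" example above checks out arithmetically. A sketch with equal-sized OSDs, each of 1 unit of capacity (a simplification that ignores CRUSH weights and replica placement detail):

```python
# 4 nodes with 1 OSD each, plus one node with 4 OSDs of the same size.
osds_per_node = [1, 1, 1, 1, 4]
total_capacity = sum(osds_per_node)      # 8 units

# Cluster is half full (counting all replicas as stored data).
data = total_capacity * 0.5              # 4 units

# The 4-OSD node dies and the cluster rebalances onto the survivors.
remaining_capacity = total_capacity - 4  # 4 units

fill_after_rebalance = data / remaining_capacity
print(fill_after_rebalance)              # 1.0 -> 100% full, unwritable
```

Losing half the raw capacity while half full leaves nowhere to rebalance to, which is exactly why identical per-node capacity (and some headroom) matters.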