Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
Hello, On Thu, 10 Mar 2016 22:25:10 -0500 Alex Gorbachev wrote: > Reviving an old thread: > > On Sunday, July 12, 2015, Lionel Bouton wrote: > > > On 07/12/15 05:55, Alex Gorbachev wrote: > > > FWIW. Based on the excellent research by Mark Nelson > > > ( > > http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/ > > ) > > > we have dropped SSD journals altogether, and instead went for the > > > battery protected controller writeback cache. > > > > Note that this has limitations (and the research is nearly 2 years > > old): - the controller writeback caches are relatively small (often less than > > 4GB, 2GB is common on the controller, a small portion is not usable, > > and 10% of the rest is often used for readahead/read cache) and this is > > shared by all of your drives. If your workload is not "write spikes" > > oriented, but nearly constant writes this won't help as you will be > > limited on each OSD by roughly half of the disk IOPS. With journals on > > SSDs when you hit their limit (which is ~5GB of buffer for 10GB > > journals and not <2GB divided by the amount of OSDs per controller), > > the limit is the raw disk IOPS. > > - you *must* make sure the controller is configured to switch to > > write-through when the battery/capacitor fails (or a power failure on > > hardware from the same generation could make you lose all of the OSDs > > connected to them in a single event which means data loss), > > - you should monitor the battery/capacitor status to trigger > > maintenance (and your cluster will slow down while the > > battery/capacitor is waiting for a replacement, you might want to down > > the associated OSDs depending on your cluster configuration). 
We > > mostly eliminated this problem by replacing the whole chassis of the > > servers we lease for new generations every 2 or 3 years: if you time > > the hardware replacement to match a fresh chassis generation this > > means fresh capacitors and they shouldn't fail you (ours are rated for > > 3 years). > > > > We just ordered Intel S3710 SSDs even though we have battery/capacitor > > backed caches on the controllers: the latencies have started to rise > > nevertheless when there are long periods of write intensive activity. > > I'm currently pondering if we should bypass the write-cache for the > > SSDs. The cache is obviously less effective on them and might be more > > useful overall if it is dedicated to the rotating disks. Does anyone > > have test results with cache active/inactive on SSD journals with HP > > Smart Array p420 or p840 controllers? > > > We have come to the same conclusion once we started seeing some more > constant write loads. Thank you for the great info - question: have you > tried SSD journals with and without additional controller cache? Any > benefit? > Haven't tried that with journal SSDs, simply because I basically tend to use DC S3700s there, which would benefit little considering the cost of a fast enough controller with ample cache. That said, I've done this both with HDDs with on-disk journals (with the expected results as detailed above) and with consumer Intel 530 SSDs on some Twin servers that came with LSI 2108 controllers. In the latter case these are OS disks, nothing Ceph related. But the HW controller cache nicely masks the garbage-collection spikes and the slowness of SYNC writes of these SSDs in medium-load scenarios. In short, HW cache should always help, but it can only do so much (for so long), so unless you already have HW with it or can get it dirt cheap, it's not particularly economical once you reach its limits. 
Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
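To put rough numbers on the buffer argument above (a shared controller cache vs. a dedicated journal partition per OSD), here is a back-of-the-envelope sketch; every figure is an illustrative assumption lifted from the discussion, not a measurement:

```python
# Rough burst-buffer comparison: shared HW controller cache vs. per-OSD
# SSD journal space. All numbers are assumptions for illustration.

controller_cache_gb = 2.0        # typical battery/flash-backed cache size
read_cache_fraction = 0.10       # slice often reserved for readahead/read cache
osds_per_controller = 12

usable_write_cache_gb = controller_cache_gb * (1 - read_cache_fraction)
cache_per_osd_gb = usable_write_cache_gb / osds_per_controller

journal_buffer_per_osd_gb = 5.0  # ~half of a 10 GB journal, per the post above

print(f"HW cache per OSD:    {cache_per_osd_gb:.2f} GB")   # 0.15 GB
print(f"SSD journal per OSD: {journal_buffer_per_osd_gb:.1f} GB")
print(f"ratio: {journal_buffer_per_osd_gb / cache_per_osd_gb:.0f}x")
```

With these assumptions each OSD effectively gets ~0.15 GB of shared cache versus ~5 GB of journal, which is why constant-write workloads blow through the controller cache so much sooner than through SSD journals.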
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
Reviving an old thread: On Sunday, July 12, 2015, Lionel Bouton wrote: > On 07/12/15 05:55, Alex Gorbachev wrote: > > FWIW. Based on the excellent research by Mark Nelson > > ( > http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/ > ) > > we have dropped SSD journals altogether, and instead went for the > > battery protected controller writeback cache. > > Note that this has limitations (and the research is nearly 2 years old): > - the controller writeback caches are relatively small (often less than > 4GB, 2GB is common on the controller, a small portion is not usable, and > 10% of the rest is often used for readahead/read cache) and this is > shared by all of your drives. If your workload is not "write spikes" > oriented, but nearly constant writes this won't help as you will be > limited on each OSD by roughly half of the disk IOPS. With journals on > SSDs when you hit their limit (which is ~5GB of buffer for 10GB journals > and not <2GB divided by the amount of OSDs per controller), the limit is > the raw disk IOPS. > - you *must* make sure the controller is configured to switch to > write-through when the battery/capacitor fails (or a power failure on > hardware from the same generation could make you lose all of the OSDs > connected to them in a single event which means data loss), > - you should monitor the battery/capacitor status to trigger maintenance > (and your cluster will slow down while the battery/capacitor is waiting > for a replacement, you might want to down the associated OSDs depending > on your cluster configuration). We mostly eliminated this problem by > replacing the whole chassis of the servers we lease for new generations > every 2 or 3 years: if you time the hardware replacement to match a > fresh chassis generation this means fresh capacitors and they shouldn't > fail you (ours are rated for 3 years). 
> > We just ordered Intel S3710 SSDs even though we have battery/capacitor > backed caches on the controllers: the latencies have started to rise > nevertheless when there are long periods of write intensive activity. > I'm currently pondering if we should bypass the write-cache for the > SSDs. The cache is obviously less effective on them and might be more > useful overall if it is dedicated to the rotating disks. Does anyone > have test results with cache active/inactive on SSD journals with HP > Smart Array p420 or p840 controllers? We have come to the same conclusion once we started seeing some more constant write loads. Thank you for the great info - question: have you tried SSD journals with and without additional controller cache? Any benefit? Thank you, Alex > > Lionel > -- Alex Gorbachev Storcium
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
Hello, thanks to Lionel for writing pretty much what I was going to, in particular cache sizes and read-ahead cache allocations. In addition to this, keep in mind that all writes still have to happen twice per drive, journal and actual OSD. So when the cache is too busy to merge writes nicely, your HDD IOPS are being halved again. Christian On Sun, 12 Jul 2015 14:33:03 +0200 Lionel Bouton wrote: On 07/12/15 05:55, Alex Gorbachev wrote: FWIW. Based on the excellent research by Mark Nelson (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/) we have dropped SSD journals altogether, and instead went for the battery protected controller writeback cache. Note that this has limitations (and the research is nearly 2 years old): - the controller writeback caches are relatively small (often less than 4GB, 2GB is common on the controller, a small portion is not usable, and 10% of the rest is often used for readahead/read cache) and this is shared by all of your drives. If your workload is not "write spikes" oriented, but nearly constant writes this won't help as you will be limited on each OSD by roughly half of the disk IOPS. With journals on SSDs when you hit their limit (which is ~5GB of buffer for 10GB journals and not <2GB divided by the amount of OSDs per controller), the limit is the raw disk IOPS. - you *must* make sure the controller is configured to switch to write-through when the battery/capacitor fails (or a power failure on hardware from the same generation could make you lose all of the OSDs connected to them in a single event which means data loss), - you should monitor the battery/capacitor status to trigger maintenance (and your cluster will slow down while the battery/capacitor is waiting for a replacement, you might want to down the associated OSDs depending on your cluster configuration). 
We mostly eliminated this problem by replacing the whole chassis of the servers we lease for new generations every 2 or 3 years: if you time the hardware replacement to match a fresh chassis generation this means fresh capacitors and they shouldn't fail you (ours are rated for 3 years). We just ordered Intel S3710 SSDs even though we have battery/capacitor backed caches on the controllers: the latencies have started to rise nevertheless when there are long periods of write intensive activity. I'm currently pondering if we should bypass the write-cache for the SSDs. The cache is obviously less effective on them and might be more useful overall if it is dedicated to the rotating disks. Does anyone have test results with cache active/inactive on SSD journals with HP Smart Array p420 or p840 controllers? Lionel -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/
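Christian's point that every write hits each drive twice (journal plus data) can be folded into the usual rule-of-thumb estimate for client write IOPS. This is a sketch with assumed numbers, not a benchmark:

```python
# Rule-of-thumb client write IOPS for a filestore cluster.
# Co-locating the journal on the HDD doubles the writes each disk absorbs.
# Disk IOPS and cluster size below are illustrative assumptions.

def client_write_iops(n_osds, iops_per_disk, replication, journal_on_disk):
    journal_penalty = 2 if journal_on_disk else 1
    return n_osds * iops_per_disk / (replication * journal_penalty)

hdd_iops = 150  # assumed ~7.2k RPM SATA drive

print(client_write_iops(96, hdd_iops, 3, journal_on_disk=True))   # 2400.0
print(client_write_iops(96, hdd_iops, 3, journal_on_disk=False))  # 4800.0
```

The HW cache hides the co-location penalty only while it can still coalesce the journal writes; once the cache is saturated, sustained throughput falls back to the first figure.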
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
On 07/12/15 05:55, Alex Gorbachev wrote: FWIW. Based on the excellent research by Mark Nelson (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/) we have dropped SSD journals altogether, and instead went for the battery protected controller writeback cache. Note that this has limitations (and the research is nearly 2 years old): - the controller writeback caches are relatively small (often less than 4GB, 2GB is common on the controller, a small portion is not usable, and 10% of the rest is often used for readahead/read cache) and this is shared by all of your drives. If your workload is not "write spikes" oriented, but nearly constant writes this won't help as you will be limited on each OSD by roughly half of the disk IOPS. With journals on SSDs when you hit their limit (which is ~5GB of buffer for 10GB journals and not <2GB divided by the amount of OSDs per controller), the limit is the raw disk IOPS. - you *must* make sure the controller is configured to switch to write-through when the battery/capacitor fails (or a power failure on hardware from the same generation could make you lose all of the OSDs connected to them in a single event which means data loss), - you should monitor the battery/capacitor status to trigger maintenance (and your cluster will slow down while the battery/capacitor is waiting for a replacement, you might want to down the associated OSDs depending on your cluster configuration). We mostly eliminated this problem by replacing the whole chassis of the servers we lease for new generations every 2 or 3 years: if you time the hardware replacement to match a fresh chassis generation this means fresh capacitors and they shouldn't fail you (ours are rated for 3 years). We just ordered Intel S3710 SSDs even though we have battery/capacitor backed caches on the controllers: the latencies have started to rise nevertheless when there are long periods of write intensive activity. 
I'm currently pondering if we should bypass the write-cache for the SSDs. The cache is obviously less effective on them and might be more useful overall if it is dedicated to the rotating disks. Does anyone have test results with cache active/inactive on SSD journals with HP Smart Array p420 or p840 controllers? Lionel
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
FWIW. Based on the excellent research by Mark Nelson ( http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/) we have dropped SSD journals altogether, and instead went for the battery protected controller writeback cache. Benefits: - No negative force multiplier with one SSD failure taking down multiple OSDs - OSD portability: move OSD drives across nodes - OSD recovery: stick them into a surviving OSD node and they keep working I agree on size=3, seems to be safest in all situations. Regards, Alex On Thu, Jul 9, 2015 at 6:38 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: So, I was running with size=2, until we had a network interface on an OSD node go faulty and start corrupting data. Because ceph couldn't tell which copy was right it caused all sorts of trouble. I might have been able to recover more gracefully had I caught the problem sooner and been able to identify the root right away, but as it was, we ended up labeling every VM in the cluster suspect, destroying the whole thing and restoring from backups. I didn't end up managing to find the root of the problem until I was rebuilding the cluster and noticed one node felt weird when I was ssh'd into it. It was painful. We are currently running important vms from a ceph pool with size=3, and more disposable ones from a size=2 pool, and that seems to be a reasonable tradeoff so far, giving us a bit more IO overhead than we would have running 3 for everything, but still having safety where we need it. QH On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke goetz.reini...@filmakademie.de wrote: Hi Warren, thanks for that feedback. regarding the 2 or 3 copies we had a lot of internal discussions and lots of pros and cons on 2 and 3 :) … and finally decided to give 2 copies in the first - now called evaluation cluster - a chance to prove. I bet in 2016 we will see, if that was a good decision or bad and data loss is in that scenario ok. We evaluate. 
:) Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700 fails, all 12 OSDs are lost…? So that looks like a bigger risk to me from my current knowledge. Or are the P3700 so much more reliable than e.g. the S3500 or S3700? Or is the suggestion with the P3700 for when we go in the direction of 20+ nodes, and until then stay without SSDs for journaling? I really appreciate your thoughts and feedback and I'm aware of the fact that building a ceph cluster is some sort of knowing the specs, configuration options, math, experience, modification and feedback from best practices of real world clusters. Finally, all clusters are unique in some way and what works for one will not work for another. Thanks for feedback, 100 kowtows. Götz Am 09.07.2015 um 16:58 schrieb Wang, Warren warren_w...@cable.comcast.com: You'll take a noticeable hit on write latency. Whether or not it's tolerable will be up to you and the workload you have to capture. Large file operations are throughput efficient without an SSD journal, as long as you have enough spindles. About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. The 400 GB is probably okay if you keep the journal sizes small, but the 800 is probably safer if you plan on leaving these in production for a few years. Depends on the turnover of data on the servers. The dual disk failure comment is pointing out that you are more exposed for data loss with 2 copies. You do need to understand that there is a possibility for 2 drives to fail either simultaneously, or one before the cluster is repaired. As usual, this is going to be a decision you need to make about whether it's acceptable or not. We have many clusters, and some are 2, and others are 3. If your data resides nowhere else, then 3 copies is the safe thing to do. That's getting harder and harder to justify though, when the price of other storage solutions using erasure coding continues to plummet. 
Warren -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz Reinicke - IT Koordinator Sent: Thursday, July 09, 2015 4:47 AM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster Hi Christian, Am 09.07.15 um 09:36 schrieb Christian Balzer: Hello, On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote: Hi again, time is passing, so is my budget :-/ and I have to recheck the options for a starter cluster. An expansion next year for maybe an OpenStack installation or more performance if the demands rise is possible. The starter could always be used as test or slow dark archive. At the beginning I was at 16 SATA OSDs with 4 SSDs for journal per node, but now I'm looking for 12 SATA OSDs without SSD journal. Less performance, less capacity I know. But that's ok! Leave the space to upgrade
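On the shared-journal failure domain raised in this message (one P3700 behind all of a node's OSDs): when a journal device dies, every OSD journaling on it goes with it, and their placement groups must re-replicate from surviving copies. A tiny sketch of the blast radius, with cluster numbers assumed from this thread:

```python
# Blast radius of losing one shared journal device.
# Cluster shape and data volume are assumptions based on this thread.

total_osds = 8 * 12          # 8 nodes x 12 OSDs each
osds_per_journal = 12        # all OSDs of one node behind a single P3700
cluster_data_tb = 80         # projected data set discussed here

lost_fraction = osds_per_journal / total_osds
rebalance_tb = cluster_data_tb * lost_fraction

print(f"OSDs lost: {lost_fraction:.1%}")               # 12.5%
print(f"data to re-replicate: ~{rebalance_tb:.0f} TB") # ~10 TB
```

So a journal failure is an availability and rebalance cost rather than data loss, as long as the remaining replicas hold — but with size=2 the window during those ~10 TB of recovery is exactly when a second failure hurts.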
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
You'll take a noticeable hit on write latency. Whether or not it's tolerable will be up to you and the workload you have to capture. Large file operations are throughput efficient without an SSD journal, as long as you have enough spindles. About the Intel P3700, you will only need 1 to keep up with 12 SATA drives. The 400 GB is probably okay if you keep the journal sizes small, but the 800 is probably safer if you plan on leaving these in production for a few years. Depends on the turnover of data on the servers. The dual disk failure comment is pointing out that you are more exposed for data loss with 2 copies. You do need to understand that there is a possibility for 2 drives to fail either simultaneously, or one before the cluster is repaired. As usual, this is going to be a decision you need to make about whether it's acceptable or not. We have many clusters, and some are 2, and others are 3. If your data resides nowhere else, then 3 copies is the safe thing to do. That's getting harder and harder to justify though, when the price of other storage solutions using erasure coding continues to plummet. Warren -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Götz Reinicke - IT Koordinator Sent: Thursday, July 09, 2015 4:47 AM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster Hi Christian, Am 09.07.15 um 09:36 schrieb Christian Balzer: Hello, On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote: Hi again, time is passing, so is my budget :-/ and I have to recheck the options for a starter cluster. An expansion next year for maybe an OpenStack installation or more performance if the demands rise is possible. The starter could always be used as test or slow dark archive. At the beginning I was at 16 SATA OSDs with 4 SSDs for journal per node, but now I'm looking for 12 SATA OSDs without SSD journal. Less performance, less capacity I know. 
But that's ok! Leave the space to upgrade these nodes with SSDs in the future. If your cluster grows large enough (more than 20 nodes) even a single P3700 might do the trick and will need only a PCIe slot. If I get you right, the 12-disk setup is not a bad idea, and if there is a need for SSD journals I can add the PCIe P3700. In the 12 OSD setup I should get 2 P3700s, one per 6 OSDs. Good or bad idea? There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2. Danger, Will Robinson. This is essentially a RAID5 and you're plain asking for a double disk failure to happen. Maybe I do not understand that. size = 2 I think is more sort of raid1 ... ? And why am I asking for a double disk failure? Too few nodes or OSDs, or because of the size = 2? See this recent thread: calculating maximum number of disk and node failure that can be handled by cluster with out data loss for some discussion and python script which you will need to modify for 2 disk replication. With a RAID5 failure calculator you're at 1 data loss event per 3.5 years... Thanks for that thread, but I don't get the point out of it for me. I see that calculating the reliability is some sort of complex math ... The workload I expect is more writes of maybe some GB of Office files per day and some TB of larger video Files from a few users per week. At the end of this year we calculate to have +- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time. Any suggestion on the drop of ssd journals? You will miss them when the cluster does write, be it from clients or when re-balancing a lost OSD. I can imagine, that I might miss the SSD Journal, but if I can add the P3700 later I feel comfy with it for now. Budget and evaluation related. Thanks for your helpful input and feedback. /Götz -- Götz Reinicke IT-Koordinator Tel. 
+49 7141 969 82420 E-Mail goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
If you can accept the failure domain, we find a 12:1 ratio of SATA spinners to a 400GB P3700 is reasonable. Benchmarks can saturate it, but it is entirely bored in our real-world workload and only 30-50% utilized during backfills. I am sure one could go even further than 12:1 if they wanted, but we haven't tested. On Thu, Jul 9, 2015 at 4:47 AM, Götz Reinicke - IT Koordinator goetz.reini...@filmakademie.de wrote: Hi Christian, Am 09.07.15 um 09:36 schrieb Christian Balzer: Hello, On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote: Hi again, time is passing, so is my budget :-/ and I have to recheck the options for a starter cluster. An expansion next year for maybe an OpenStack installation or more performance if the demands rise is possible. The starter could always be used as test or slow dark archive. At the beginning I was at 16 SATA OSDs with 4 SSDs for journal per node, but now I'm looking for 12 SATA OSDs without SSD journal. Less performance, less capacity I know. But that's ok! Leave the space to upgrade these nodes with SSDs in the future. If your cluster grows large enough (more than 20 nodes) even a single P3700 might do the trick and will need only a PCIe slot. If I get you right, the 12-disk setup is not a bad idea, and if there is a need for SSD journals I can add the PCIe P3700. In the 12 OSD setup I should get 2 P3700s, one per 6 OSDs. Good or bad idea? There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2. Danger, Will Robinson. This is essentially a RAID5 and you're plain asking for a double disk failure to happen. Maybe I do not understand that. size = 2 I think is more sort of raid1 ... ? And why am I asking for a double disk failure? Too few nodes or OSDs, or because of the size = 2? See this recent thread: calculating maximum number of disk and node failure that can be handled by cluster with out data loss for some discussion and python script which you will need to modify for 2 disk replication. 
With a RAID5 failure calculator you're at 1 data loss event per 3.5 years... Thanks for that thread, but I don't get the point out of it for me. I see that calculating the reliability is some sort of complex math ... The workload I expect is more writes of maybe some GB of Office files per day and some TB of larger video Files from a few users per week. At the end of this year we calculate to have +- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time. Any suggestion on the drop of ssd journals? You will miss them when the cluster does write, be it from clients or when re-balancing a lost OSD. I can imagine, that I might miss the SSD Journal, but if I can add the P3700 later I feel comfy with it for now. Budget and evaluation related. Thanks for your helpful input and feedback. /Götz -- Götz Reinicke IT-Koordinator Tel. +49 7141 969 82420 E-Mail goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt -- David Burley NOC Manager, Sr. Systems Programmer/Analyst Slashdot Media e: da...@slashdotmedia.com
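David's 12:1 observation can be sanity-checked against throughput: with one shared journal, every client byte funnels through the NVMe before reaching the spinners, so sustained ingest is capped by the slowest stage. The P3700 and HDD figures below are assumed, datasheet-style numbers, not measurements:

```python
# Bottleneck check for 12 SATA spinners behind one 400 GB P3700 journal.
# All throughput figures (GB/s) are rough assumptions.

network = 10 / 8             # 10 GbE link, ~1.25 GB/s
p3700_seq_write = 1.08       # assumed ~1080 MB/s sequential write
hdds_combined = 12 * 0.15    # 12 spinners at ~150 MB/s each

# With a single shared journal, sustained ingest is min of the stages.
bottleneck = min(network, p3700_seq_write, hdds_combined)
print(f"sustained ingest cap: {bottleneck:.2f} GB/s")
```

Under these assumptions the journal is the sequential bottleneck on paper, but real OSD workloads run far below line rate, which matches the "entirely bored" observation above.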
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
Hello, On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote: Hi again, time is passing, so is my budget :-/ and I have to recheck the options for a starter cluster. An expansion next year for maybe an OpenStack installation or more performance if the demands rise is possible. The starter could always be used as test or slow dark archive. At the beginning I was at 16 SATA OSDs with 4 SSDs for journal per node, but now I'm looking for 12 SATA OSDs without SSD journal. Less performance, less capacity I know. But that's ok! Leave the space to upgrade these nodes with SSDs in the future. If your cluster grows large enough (more than 20 nodes) even a single P3700 might do the trick and will need only a PCIe slot. There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2. Danger, Will Robinson. This is essentially a RAID5 and you're plain asking for a double disk failure to happen. See this recent thread: calculating maximum number of disk and node failure that can be handled by cluster with out data loss for some discussion and python script which you will need to modify for 2 disk replication. With a RAID5 failure calculator you're at 1 data loss event per 3.5 years... The workload I expect is more writes of maybe some GB of Office files per day and some TB of larger video Files from a few users per week. At the end of this year we calculate to have +- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time. Any suggestion on the drop of ssd journals? You will miss them when the cluster does write, be it from clients or when re-balancing a lost OSD. Christian Thanks as always for your feedback. Götz -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/
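The "python script which you will need to modify" from that thread is not reproduced here, but a deliberately crude stand-in shows the shape of the size=2 calculation: after one disk fails, data is lost if any disk holding the only remaining copy of some PG also fails before recovery completes. It ignores node failures, latent read errors and recovery slowdowns, and every parameter is an assumption, so treat the output as illustrative only; it will not reproduce the 3.5-year figure quoted above:

```python
# Crude size=2 data-loss estimate: one disk fails, and a second disk
# holding the only other copy fails inside the recovery window.
# All parameters are illustrative assumptions.

n_disks = 96                 # 8 nodes x 12 OSDs
afr = 0.04                   # annual failure rate per disk (assumed)
recovery_hours = 8           # time to re-replicate a lost OSD (assumed)

hours_per_year = 24 * 365
p_window = afr * recovery_hours / hours_per_year   # one disk failing in window

# With CRUSH, the second copies are spread over (almost) all other disks,
# so any of them failing during recovery can lose some PG.
p_loss_per_failure = 1 - (1 - p_window) ** (n_disks - 1)

failures_per_year = n_disks * afr
loss_events_per_year = failures_per_year * p_loss_per_failure

print(f"~{1 / loss_events_per_year:.0f} years between loss events")
```

Pushing recovery_hours up (whole-node rebuilds, throttled backfill) or afr up shrinks the result quickly, which is the practical argument behind size=3.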
[ceph-users] Real world benefit from SSD Journals for a more read than write cluster
Hi again, time is passing, so is my budget :-/ and I have to recheck the options for a starter cluster. An expansion next year for maybe an OpenStack installation or more performance if the demands rise is possible. The starter could always be used as test or slow dark archive. At the beginning I was at 16 SATA OSDs with 4 SSDs for journal per node, but now I'm looking for 12 SATA OSDs without SSD journal. Less performance, less capacity I know. But that's ok! There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2. The workload I expect is more writes of maybe some GB of Office files per day and some TB of larger video Files from a few users per week. At the end of this year we calculate to have +- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time. Any suggestion on the drop of ssd journals? Thanks as always for your feedback. Götz -- Götz Reinicke IT-Koordinator Tel. +49 7141 969 82420 E-Mail goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt
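For the journal-sizing side of this question, the old Ceph filestore documentation gives a rule of thumb: the journal should hold at least twice the data that can arrive between filestore syncs. A quick sketch with an assumed per-OSD throughput:

```python
# Filestore journal sizing rule of thumb from the Ceph docs:
#   osd journal size >= 2 * expected throughput * filestore max sync interval
# The per-OSD throughput below is an assumption for illustration.

expected_throughput_mb_s = 150    # assumed sustained write rate per OSD
filestore_max_sync_interval = 5   # seconds (the long-standing default)

journal_mb = 2 * expected_throughput_mb_s * filestore_max_sync_interval
print(f"minimum journal per OSD: {journal_mb} MB")               # 1500 MB
print(f"12 OSDs on one device:   {12 * journal_mb / 1024:.1f} GB")
```

Even a 400 GB NVMe leaves enormous headroom for 12 such journals; capacity is rarely the constraint — endurance and the shared failure domain are.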
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
Hi Christian, Am 09.07.15 um 09:36 schrieb Christian Balzer: Hello, On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote: Hi again, time is passing, so is my budget :-/ and I have to recheck the options for a starter cluster. An expansion next year for maybe an OpenStack installation or more performance if the demands rise is possible. The starter could always be used as test or slow dark archive. At the beginning I was at 16 SATA OSDs with 4 SSDs for journal per node, but now I'm looking for 12 SATA OSDs without SSD journal. Less performance, less capacity I know. But that's ok! Leave the space to upgrade these nodes with SSDs in the future. If your cluster grows large enough (more than 20 nodes) even a single P3700 might do the trick and will need only a PCIe slot. If I get you right, the 12-disk setup is not a bad idea, and if there is a need for SSD journals I can add the PCIe P3700. In the 12 OSD setup I should get 2 P3700s, one per 6 OSDs. Good or bad idea? There should be 6, maybe 8 nodes with 12 OSDs each, with a repl. of 2. Danger, Will Robinson. This is essentially a RAID5 and you're plain asking for a double disk failure to happen. Maybe I do not understand that. size = 2 I think is more sort of raid1 ... ? And why am I asking for a double disk failure? Too few nodes or OSDs, or because of the size = 2? See this recent thread: calculating maximum number of disk and node failure that can be handled by cluster with out data loss for some discussion and python script which you will need to modify for 2 disk replication. With a RAID5 failure calculator you're at 1 data loss event per 3.5 years... Thanks for that thread, but I don't get the point out of it for me. I see that calculating the reliability is some sort of complex math ... The workload I expect is more writes of maybe some GB of Office files per day and some TB of larger video Files from a few users per week. 
At the end of this year we calculate to have +- 60 to 80 TB of larger video files in that cluster, which are accessed from time to time. Any suggestion on the drop of ssd journals? You will miss them when the cluster does write, be it from clients or when re-balancing a lost OSD. I can imagine, that I might miss the SSD Journal, but if I can add the P3700 later I feel comfy with it for now. Budget and evaluation related. Thanks for your helpful input and feedback. /Götz -- Götz Reinicke IT-Koordinator Tel. +49 7141 969 82420 E-Mail goetz.reini...@filmakademie.de Filmakademie Baden-Württemberg GmbH Akademiehof 10 71638 Ludwigsburg www.filmakademie.de Eintragung Amtsgericht Stuttgart HRB 205016 Vorsitzender des Aufsichtsrats: Jürgen Walter MdL Staatssekretär im Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg Geschäftsführer: Prof. Thomas Schadt
Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster
So, I was running with size=2 until we had a network interface on an OSD node go faulty and start corrupting data. Because Ceph couldn't tell which copy was right, it caused all sorts of trouble. I might have been able to recover more gracefully had I caught the problem sooner and been able to identify the root cause right away, but as it was, we ended up labeling every VM in the cluster suspect, destroying the whole thing and restoring from backups. I didn't manage to find the root of the problem until I was rebuilding the cluster and noticed one node felt weird when I was ssh'd into it. It was painful. We are currently running important VMs from a Ceph pool with size=3, and more disposable ones from a size=2 pool, and that seems to be a reasonable tradeoff so far, giving us a bit more IO headroom than we would have running 3 for everything, but still having safety where we need it. QH On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke goetz.reini...@filmakademie.de wrote: Hi Warren, thanks for that feedback. Regarding the 2 or 3 copies: we had a lot of internal discussions and lots of pros and cons on 2 and 3 :) … and finally decided to give 2 copies a chance to prove itself in the first - now called evaluation - cluster. I bet in 2016 we will see if that was a good decision or a bad one, and data loss is in that scenario OK. We evaluate. :) Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700 fails, all 12 OSDs are lost… ? So that looks like a bigger risk to me from my current knowledge. Or are the P3700 so much more reliable than e.g. the S3500 or S3700? Or is the suggestion to use the P3700 only if we go in the direction of 20+ nodes, and until then stay without SSDs for journaling? I really appreciate your thoughts and feedback, and I'm aware of the fact that building a Ceph cluster is some sort of knowing the specs, configuration options, math, experience, modification and feedback from best-practice real-world clusters.
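QH's corruption story earlier in this message is essentially a quorum problem: with only two replicas there is no majority to break a tie between divergent copies. Ceph's actual scrub/repair logic is more involved than a simple vote (it uses checksums and an authoritative copy), but a toy sketch of the tie-breaking intuition, purely for illustration:

```python
from collections import Counter

def majority(copies):
    """Return the strict-majority value among replica copies, or None on a tie."""
    value, votes = Counter(copies).most_common(1)[0]
    return value if votes * 2 > len(copies) else None

# size=2: one silently corrupted replica -> no way to tell which copy is right
print(majority(["good", "corrupt"]))          # None
# size=3: the two intact replicas outvote the corrupted one
print(majority(["good", "good", "corrupt"]))  # good
```

This is not how Ceph repairs objects; it only illustrates why a third copy gives the operator (or the system) something to break ties with.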
Finally, all clusters are unique in some way, and what works for one will not work for another. Thanks for feedback, 100 kowtows. Götz On 09.07.2015 at 16:58, Wang, Warren warren_w...@cable.comcast.com wrote: You'll take a noticeable hit on write latency. Whether or not it's tolerable will be up to you and the workload you have to capture. Large file operations are throughput-efficient without an SSD journal, as long as you have enough spindles. About the Intel P3700: you will only need 1 to keep up with 12 SATA drives. The 400 GB is probably okay if you keep the journal sizes small, but the 800 is probably safer if you plan on leaving these in production for a few years. Depends on the turnover of data on the servers. The dual disk failure comment is pointing out that you are more exposed to data loss with 2 copies. You do need to understand that there is a possibility for 2 drives to fail either simultaneously, or one before the cluster is repaired. As usual, this is a decision you need to make about whether it's acceptable or not. We have many clusters, and some are 2, and others are 3. If your data resides nowhere else, then 3 copies is the safe thing to do. That's getting harder and harder to justify though, when the price of other storage solutions using erasure coding continues to plummet. Warren
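Warren's "one P3700 per 12 SATA drives" sizing can be sanity-checked with the rule of thumb from the Ceph documentation (osd journal size >= 2 * expected throughput * filestore max sync interval). The per-disk throughput and sync interval below are assumptions for the sketch, not figures from this thread:

```python
disk_mb_s = 100        # assumed sustained write throughput per SATA disk (MB/s)
sync_interval_s = 5    # assumed filestore max sync interval (seconds)
n_osds = 12

journal_mb_per_osd = 2 * disk_mb_s * sync_interval_s   # journal partition per OSD
total_journal_gb = journal_mb_per_osd * n_osds / 1024  # total space needed on the NVMe
aggregate_write_mb_s = disk_mb_s * n_osds              # sustained rate the journal must absorb

print(journal_mb_per_osd, round(total_journal_gb, 1), aggregate_write_mb_s)
```

Under these assumptions the journals need only about 12 GB in total, so space is trivial even on the 400 GB model; the real constraint is whether the device can sustain roughly 1.2 GB/s of journal writes, which is why the larger P3700 (with higher write throughput) may be the safer pick.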