Re: [ceph-users] Again - state of Ceph NVMe and SSDs
It sounds like you're just assuming these drives don't perform well...

- Original Message -
From: "Mark Nelson" <mnel...@redhat.com>
To: ceph-users@lists.ceph.com
Sent: Monday, January 18, 2016 2:17:19 PM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

Take Greg's comments to heart, because he's absolutely correct here. [...] As such, don't trust any one benchmark. Wait until it's independently verified, possibly by multiple sources, before putting too much weight into it.

Mark
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
Not at all! For all we know, the drives may be the fastest ones on the planet. My comment still stands though: be skeptical of any one benchmark that shows something unexpected, and 90% of native SSD/NVMe IOPS performance in a distributed storage system is just such a number. Look for test repeatability. Look for independent verification. The memory allocator tests that Sandisk initially performed, and that were later independently verified by Intel, Red Hat, and others, are a great example of this kind of process.

Mark

On 01/19/2016 07:01 AM, Tyler Bishop wrote:
> It sounds like you're just assuming these drives don't perform well... [...]
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
Check these out too: http://www.seagate.com/internal-hard-drives/solid-state-hybrid/1200-ssd/

- Original Message -
From: "Christian Balzer" <ch...@gol.com>
To: "ceph-users" <ceph-users@lists.ceph.com>
Sent: Sunday, January 17, 2016 10:45:56 PM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

Hello,

On Sat, 16 Jan 2016 19:06:07 +0100 David wrote: [...]
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
On 01/16/2016 12:06 PM, David wrote:
> Is anyone running Hammer or Infernalis with a setup like this?
> Is it a sane setup?
> Will we become CPU constrained or can we just throw more RAM on it? :D

On the write side you can pretty quickly hit CPU limits, though if you upgrade to tcmalloc 2.4 and set a high thread cache, or switch to jemalloc, it will help dramatically. There have been various posts and threads about it here on the mailing list, but generally in CPU-constrained scenarios people are seeing pretty dramatic improvements (like 4x the IOPS on the write side with SSDs/NVMes). We are also seeing a dramatic improvement in small random write performance with bluestore, but that's only going to be a tech preview in Jewel.
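In practice that allocator change usually comes down to a couple of lines of configuration. A rough sketch, assuming a packaged install that sources /etc/sysconfig/ceph; the paths, the 128MB value and the restart command are illustrative, not gospel:

    # /etc/sysconfig/ceph (Debian/Ubuntu: /etc/default/ceph)
    # raise tcmalloc's thread cache well above the 32MB default
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

    # or, alternatively, preload jemalloc for the daemons
    # (assumes libjemalloc is installed; exact .so path varies by distro)
    # LD_PRELOAD=/usr/lib64/libjemalloc.so.1

    # restart the OSDs so the allocator settings take effect
    systemctl restart ceph-osd.target     # or /etc/init.d/ceph restart osd on sysvinit

Either way the change is per-host and needs an OSD restart, so it can be rolled out one node at a time while watching recovery.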
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
One of the other guys on the list here benchmarked them. They spanked every other SSD on the *recommended* tree...

- Original Message -
From: "Gregory Farnum" <gfar...@redhat.com>
To: "Tyler Bishop" <tyler.bis...@beyondhosting.net>
Cc: "David" <da...@visions.se>, "Ceph Users" <ceph-users@lists.ceph.com>
Sent: Monday, January 18, 2016 2:01:44 PM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

Mmmm, they've gotten some very impressive numbers, but most people shouldn't be expecting 90% of an SSD's throughput out of their workloads. These tests are *very* parallel and tend to run multiple OSD processes on a single SSD, IIRC.
-Greg
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
On Sun, Jan 17, 2016 at 12:34 PM, Tyler Bishop <tyler.bis...@beyondhosting.net> wrote:
> The changes you are looking for are coming from Sandisk in the ceph "Jewel"
> release coming up.
>
> Based on benchmarks and testing, Sandisk has really contributed heavily on
> the tuning aspects and is promising 90%+ of the native IOPS of a drive in
> the cluster.

Mmmm, they've gotten some very impressive numbers, but most people shouldn't be expecting 90% of an SSD's throughput out of their workloads. These tests are *very* parallel and tend to run multiple OSD processes on a single SSD, IIRC.
-Greg
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
Take Greg's comments to heart, because he's absolutely correct here. Distributed storage systems almost as a rule love parallelism, and if you have enough you can often hide other issues. Latency is probably the more interesting question, and frankly that's where you'll often start seeing the kernel, ceph code, drivers, random acts of god, etc. get in the way. It's very easy for any one of these things to destroy your performance, so you have to be *very* *very* careful to understand exactly what you are seeing.

As such, don't trust any one benchmark. Wait until it's independently verified, possibly by multiple sources, before putting too much weight into it.

Mark

On 01/18/2016 01:02 PM, Tyler Bishop wrote:
> One of the other guys on the list here benchmarked them. They spanked every
> other SSD on the *recommended* tree... [...]
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
Hello,

On Sat, 16 Jan 2016 19:06:07 +0100 David wrote:

> Our needs:
> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
> servers)
> * Pool for storage of many small files, rbd (probably dovecot maildir
> and dovecot index etc)
>
I'm running dovecot for several 100k users on 2-node DRBD clusters, and for a mail archive server for a few hundred users backed by Ceph/RBD. The latter works fine (it's not that busy), but I wouldn't consider replacing the DRBD clusters with Ceph/RBD at this time (higher investment in storage, 3x vs 2x, and lower performance of course).
Depending on your use case you may be just fine of course.

> We’re planning something like 5 OSD servers, with:
>
> * 4x 1.2TB Intel S3510
I'd be wary of that. As in, you're spec'ing the best Intel SSDs money can buy below for journals, but the least write-endurable Intel DC SSDs for OSDs here.
Note that write amplification (beyond Ceph and FS journals) is very much a thing, especially with small files. There's a mail about this by me in the ML archives somewhere:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
Unless you're very sure about this being a read-mostly environment, I'd go with 3610s at least.

> * 8x 4TB HDD
> * 2x Intel P3700 Series HHHL PCIe 400GB (one for SSD Pool Journal and
> one for HDD pool journal)
You may be better off (cost and SPOF wise) with 2x 200GB S3700 (not 3710) for the HDD journals, but then again that won't fit into your case, will it...
Given the IOPS limits in Ceph as it is, you're unlikely to see much of a difference if you forgo a journal for the SSDs and use shared journals with DC S3610 or 3710 OSD SSDs.
Note that as far as pure throughput is concerned (in most operations the least critical factor), your single journal SSD will limit things to the speed of 2 (of your 4) storage SSDs. But then again, your network is probably saturated before that.

> * 2x 80GB Intel S3510 raid1 for system
> * 256GB RAM
Plenty. ^o^

> * 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better
>
Not sure about Jewel, but SSD OSDs will eat pretty much any and all CPU cycles you can throw at them.
This also boils down to the question of whether having mixed HDD/SSD storage nodes (with the fun of having to set "osd crush update on start = false") is a good idea or not, as opposed to nodes that are optimized for their respective storage hardware (CPU, RAM, network wise).

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
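To make the mixed-node point concrete, a rough sketch of splitting SSD and HDD OSDs into separate CRUSH roots; the bucket names, weights and OSD IDs below are invented for illustration, so check the commands against your Ceph version before using them:

    # ceph.conf on the mixed nodes, so restarted OSDs don't jump back
    # under their default host bucket
    [osd]
    osd crush update on start = false

    # create (hypothetical) separate roots and per-node host buckets
    ceph osd crush add-bucket ssd root
    ceph osd crush add-bucket node1-ssd host
    ceph osd crush move node1-ssd root=ssd

    # place each SSD OSD under the SSD hierarchy (weight and ID made up)
    ceph osd crush create-or-move osd.0 1.09 root=ssd host=node1-ssd

Pools then get rulesets that start from root=ssd or root=hdd, so the MySQL/maildir pools only land on the SSD OSDs.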
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
That is indeed great news! :) Thanks for the heads up.

Kind Regards,
David Majchrzak

On 17 Jan 2016, at 21:34, Tyler Bishop <tyler.bis...@beyondhosting.net> wrote:

> The changes you are looking for are coming from Sandisk in the ceph "Jewel"
> release coming up. [...]
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
The changes you are looking for are coming from Sandisk in the ceph "Jewel" release coming up.

Based on benchmarks and testing, Sandisk has really contributed heavily on the tuning aspects and is promising 90%+ of the native IOPS of a drive in the cluster.

The biggest changes will come from the memory allocation with writes. Latency is going to be a lot lower.

- Original Message -
From: "David" <da...@visions.se>
To: "Wido den Hollander" <w...@42on.com>
Cc: ceph-users@lists.ceph.com
Sent: Sunday, January 17, 2016 6:49:25 AM
Subject: Re: [ceph-users] Again - state of Ceph NVMe and SSDs

Thanks Wido, those are good pointers indeed :) [...]
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
Thanks Wido, those are good pointers indeed :)
So we just have to make sure the backend storage (SSD/NVMe journals) won't be saturated (or the controllers) and then go with as many RBDs per VM as possible.

Kind Regards,
David Majchrzak

On 16 Jan 2016, at 22:26, Wido den Hollander wrote:

> Not completely NVMe related, but in this case, make sure you use
> multiple disks. [...]
[ceph-users] Again - state of Ceph NVMe and SSDs
Hi!

We're planning our third ceph cluster and have been trying to find out how to maximize IOPS on this one.

Our needs:
* Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM servers)
* Pool for storage of many small files, rbd (probably dovecot maildir and dovecot index etc)

So I've been reading up on:

https://communities.intel.com/community/itpeernetwork/blog/2015/11/20/the-future-ssd-is-here-pcienvme-boosts-ceph-performance

and ceph-users from October 2015:

http://www.spinics.net/lists/ceph-users/msg22494.html

We're planning something like 5 OSD servers, with:

* 4x 1.2TB Intel S3510
* 8x 4TB HDD
* 2x Intel P3700 Series HHHL PCIe 400GB (one for the SSD pool journals and one for the HDD pool journals)
* 2x 80GB Intel S3510 raid1 for system
* 256GB RAM
* 2x 8 core CPU Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz or better

This cluster will probably run Hammer LTS unless there are huge improvements in Infernalis when dealing with 4K IOPS.

The first link above hints at awesome performance. The second one from the list not so much yet...

Is anyone running Hammer or Infernalis with a setup like this?
Is it a sane setup?
Will we become CPU constrained or can we just throw more RAM on it? :D

Kind Regards,
David Majchrzak
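Since the whole question hinges on 4K IOPS, it helps to agree on how to measure them once a test pool exists. Something along these lines is a reasonable starting sketch; the pool and image names are placeholders, and the fio invocation assumes fio was built with the rbd engine:

    # cluster-level 4K write test with the built-in tool
    rados -p bench-pool bench 60 write -b 4096 -t 32

    # closer to the VM view: 4K random writes against an RBD image via fio
    fio --name=4k-randwrite --ioengine=rbd --clientname=admin \
        --pool=bench-pool --rbdname=bench-img \
        --rw=randwrite --bs=4k --iodepth=32 --runtime=120 --time_based

Running the same command against each candidate configuration gives numbers that are at least comparable with each other, which matters more than any single headline figure.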
Re: [ceph-users] Again - state of Ceph NVMe and SSDs
On 01/16/2016 07:06 PM, David wrote:
> Our needs:
> * Pool for MySQL, rbd (mounted as /var/lib/mysql or equivalent on KVM
> servers)
> * Pool for storage of many small files, rbd (probably dovecot maildir
> and dovecot index etc)
>

Not completely NVMe related, but in this case, make sure you use multiple disks.

For MySQL for example:

- Root disk for OS
- Disk for /var/lib/mysql (data)
- Disk for /var/log/mysql (binary log)
- Maybe even an InnoDB logfile disk

With RBD you gain more performance by sending I/O into the cluster in parallel. So whenever you can, do so!

Regarding small files, it might be interesting to play with the stripe count and stripe size there. By default this is 1 and 4MB, but maybe 16 and 256k work better here.

With Dovecot as well, use a different RBD disk for the indexes and a different one for the Maildir itself.

Ceph excels at parallel performance. That is what you want to aim for.

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
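As a sketch of what that striping advice looks like in practice: the pool and image names below are made up, fancy striping needs format 2 images, and --stripe-unit is given in bytes (sizes in MB) on Hammer-era rbd:

    # default layout: stripe count 1, 4MB objects (fine for the data disk)
    rbd create rbd-ssd/mysql-data --size 204800

    # finer striping for the small-file / index image: 16 stripes of 256KB
    rbd create rbd-ssd/dovecot-index --size 102400 --image-format 2 \
        --stripe-unit 262144 --stripe-count 16

    # separate image for the binary log, so the VM can drive them in parallel
    rbd create rbd-ssd/mysql-binlog --size 51200

Each image is then attached to the VM as its own virtio disk and mounted on /var/lib/mysql, /var/log/mysql, and so on, which is exactly the parallelism described above.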