Re: [ceph-users] how to judge the results? - rados bench comparison
On Wed, 17 Apr 2019 16:08:34 +0200 Lars Täuber wrote:

> Wed, 17 Apr 2019 20:01:28 +0900
> Christian Balzer ==> Ceph Users :
> > On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:
> > > Wed, 17 Apr 2019 10:47:32 +0200
> > > Paul Emmerich ==> Lars Täuber :
> > > > The standard argument that it helps prevent recovery traffic from
> > > > clogging the network and impacting client traffic is misleading:
> > >
> > > What do you mean by "it"? I don't know the standard argument.
> > > Do you mean separating the networks or do you mean having both together
> > > in one switched network?
> > >
> > He means separated networks, obviously.
> >
> > > > * write client traffic relies on the backend network for replication
> > > >   operations: your client (write) traffic is impacted anyway if the
> > > >   backend network is full
> > >
> > > This I understand as an argument for separating the networks and for the
> > > backend network being faster than the frontend network.
> > > So in case of reconstruction there should be some bandwidth left in the
> > > backend for the traffic that is used for the client IO.
> > >
> > You need to run the numbers and look at the big picture.
> > As mentioned already, this is all moot in your case.
> >
> > 6 HDDs at realistically 150 MB/s each, if they were all doing sequential
> > I/O, which they aren't.
> > But for the sake of argument let's say that one of your nodes can read
> > (or write, not both at the same time) 900 MB/s.
> > That's still less than half of a single 25 Gb/s link.
>
> Is this really true also with the WAL device (combined with the DB device),
> which is a (fast) SSD in our setup?
> reading: 2150 MB/s
> writing: 2120 MB/s
> IOPS 4K reading/writing: 440k/320k
>
> If so, the next version of the OSD host will be adjusted in HW requirements.
>
Yes. Read up on how the WAL/DB is involved in client data writes (only small
ones) and reads (not at all).
Small writes will incur CPU (Ceph) and latency penalties (Ceph and network),
and on top of that your WAL will run out of space quickly, too.
It's nice for small bursts, but nothing sustained.

> > And that very hypothetical data rate (it's not sequential, you will have
> > concurrent operations and thus seeks) is all your node can handle; if it
> > is all going into recovery/rebalancing, your clients are starved because
> > of that, not bandwidth exhaustion.
>
> If it is like this also with our SSD WAL, the next version of the OSD host
> will be adjusted in HW requirements.
>
Most people trim down the recovery/backfill settings so that they don't
impact client I/O. Which again makes the network separation less useful.
And the WAL/DB is not involved in these activities at all, as they happen on
an object (4 MB default) level.

If I had 25 Gb/s and 10 Gb/s ports and your nodes and were dead set on
separating networks (I'm not), I'd give the faster one to the clients so they
can benefit from cached reads while replication and recovery still wouldn't
be limited by the 10 Gb/s network.

Christian

> Thanks
> Lars

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
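For reference, "trimming down the recovery/backfill settings" usually means
adjusting the standard OSD throttles along these lines. The option names are
real Ceph settings, but the values are only illustrative, and this assumes a
Mimic-or-later cluster using the central config store (on older releases the
same options go into ceph.conf or are injected with injectargs):

    # limit concurrent backfills and recovery ops per OSD (illustrative values)
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1
    # add a short pause between recovery ops on spinners so client I/O gets a share
    ceph config set osd osd_recovery_sleep_hdd 0.1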
Re: [ceph-users] how to judge the results? - rados bench comparison
Wed, 17 Apr 2019 20:01:28 +0900
Christian Balzer ==> Ceph Users :
> On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:
> > Wed, 17 Apr 2019 10:47:32 +0200
> > Paul Emmerich ==> Lars Täuber :
> > > The standard argument that it helps prevent recovery traffic from
> > > clogging the network and impacting client traffic is misleading:
> >
> > What do you mean by "it"? I don't know the standard argument.
> > Do you mean separating the networks or do you mean having both together
> > in one switched network?
> >
> He means separated networks, obviously.
>
> > > * write client traffic relies on the backend network for replication
> > >   operations: your client (write) traffic is impacted anyway if the
> > >   backend network is full
> >
> > This I understand as an argument for separating the networks and for the
> > backend network being faster than the frontend network.
> > So in case of reconstruction there should be some bandwidth left in the
> > backend for the traffic that is used for the client IO.
> >
> You need to run the numbers and look at the big picture.
> As mentioned already, this is all moot in your case.
>
> 6 HDDs at realistically 150 MB/s each, if they were all doing sequential
> I/O, which they aren't.
> But for the sake of argument let's say that one of your nodes can read
> (or write, not both at the same time) 900 MB/s.
> That's still less than half of a single 25 Gb/s link.

Is this really true also with the WAL device (combined with the DB device),
which is a (fast) SSD in our setup?
reading: 2150 MB/s
writing: 2120 MB/s
IOPS 4K reading/writing: 440k/320k

If so, the next version of the OSD host will be adjusted in HW requirements.

> And that very hypothetical data rate (it's not sequential, you will have
> concurrent operations and thus seeks) is all your node can handle; if it
> is all going into recovery/rebalancing, your clients are starved because
> of that, not bandwidth exhaustion.

If it is like this also with our SSD WAL, the next version of the OSD host
will be adjusted in HW requirements.

Thanks
Lars
Re: [ceph-users] how to judge the results? - rados bench comparison
On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:

> Wed, 17 Apr 2019 10:47:32 +0200
> Paul Emmerich ==> Lars Täuber :
> > The standard argument that it helps prevent recovery traffic from
> > clogging the network and impacting client traffic is misleading:
>
> What do you mean by "it"? I don't know the standard argument.
> Do you mean separating the networks or do you mean having both together
> in one switched network?
>
He means separated networks, obviously.

> > * write client traffic relies on the backend network for replication
> >   operations: your client (write) traffic is impacted anyway if the
> >   backend network is full
>
> This I understand as an argument for separating the networks and for the
> backend network being faster than the frontend network.
> So in case of reconstruction there should be some bandwidth left in the
> backend for the traffic that is used for the client IO.
>
You need to run the numbers and look at the big picture.
As mentioned already, this is all moot in your case.

6 HDDs at realistically 150 MB/s each, if they were all doing sequential
I/O, which they aren't.
But for the sake of argument let's say that one of your nodes can read
(or write, not both at the same time) 900 MB/s.
That's still less than half of a single 25 Gb/s link.

And that very hypothetical data rate (it's not sequential, you will have
concurrent operations and thus seeks) is all your node can handle; if it
is all going into recovery/rebalancing, your clients are starved because
of that, not bandwidth exhaustion.

> > * you are usually not limited by network speed for recovery (except
> >   for 1 gbit networks), and if you are you probably want to reduce
> >   recovery speed anyway if you would run into that limit
> >
> > Paul
>
> Lars

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
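The back-of-the-envelope numbers behind that claim, as a quick sanity check
(assuming 6 OSD HDDs per node at roughly 150 MB/s sequential each, and
ignoring protocol overhead):

    # per-node HDD throughput vs. capacity of a single 25 Gb/s link
    echo "6 * 150" | bc           # -> 900  MB/s of combined HDD bandwidth
    echo "25 * 1000 / 8" | bc     # -> 3125 MB/s usable on one 25 Gb/s link

So even the best case fills well under half of one 25 Gb/s port, before
seeks and mixed read/write traffic make it worse.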
Re: [ceph-users] how to judge the results? - rados bench comparison
Wed, 17 Apr 2019 10:47:32 +0200
Paul Emmerich ==> Lars Täuber :
> The standard argument that it helps prevent recovery traffic from
> clogging the network and impacting client traffic is misleading:

What do you mean by "it"? I don't know the standard argument.
Do you mean separating the networks or do you mean having both together
in one switched network?

> * write client traffic relies on the backend network for replication
>   operations: your client (write) traffic is impacted anyway if the
>   backend network is full

This I understand as an argument for separating the networks and for the
backend network being faster than the frontend network.
So in case of reconstruction there should be some bandwidth left in the
backend for the traffic that is used for the client IO.

> * you are usually not limited by network speed for recovery (except
>   for 1 gbit networks), and if you are you probably want to reduce
>   recovery speed anyway if you would run into that limit
>
> Paul

Lars
Re: [ceph-users] how to judge the results? - rados bench comparison
Quoting Lars Täuber (taeu...@bbaw.de):
> > > This is something I was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.
> >
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult-to-debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running, etc.
>
> This I was not aware of.

It's real. I've been bitten by this several times in a PoC cluster while
playing around with networking ... make sure you have proper monitoring
checks on all network interfaces when running this setup.

> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
>
> The reason for the choice of the 25 GBit network was a remark by someone
> that the latency in this Ethernet is way below that of 10 GBit. I never
> double-checked this.

This is probably true. 25 Gb/s is a single lane (SerDes) which is used in
50 Gb/s / 100 Gb/s / 200 Gb/s connections. It operates on ~ 2.5 times the
clock rate of 10 Gb/s / 40 Gb/s. But for clients to fully benefit from this
lower latency, they should be on 25 Gb/s as well. If you can afford to
redesign your cluster (and low latency is important) ...

Then again ... the latency your spinners introduce is a few orders of
magnitude higher than the network latency ... I would then (also) invest in
NVMe drives for (at least) metadata ... and switch to 3x replication ...
but that might be too much to ask for.

TL;DR: when designing clusters, try to think about the "weakest" link
(bottleneck) ... most probably this will be disk speed / Ceph overhead.

Gr. Stefan

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl
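A quick manual version of such a check, just as a sketch (the interface name
is a placeholder; in practice the same counters would be wired into the
regular monitoring system rather than checked by hand):

    # look for errors/drops on the cluster-network interface of an OSD node
    ip -s link show dev ens2f1
    ethtool -S ens2f1 | grep -Ei 'err|drop|crc'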
Re: [ceph-users] how to judge the results? - rados bench comparison
On Wed, Apr 17, 2019 at 7:56 AM Lars Täuber wrote:
>
> Thanks Paul for the judgement.
>
> Tue, 16 Apr 2019 10:13:03 +0200
> Paul Emmerich ==> Lars Täuber :
> > Seems in line with what I'd expect for the hardware.
> >
> > Your hardware seems to be way overspecced, you'd be fine with half the
> > RAM, half the CPU and way cheaper disks.
>
> Do you mean all the components of the cluster or only the OSD nodes?
> Before making the requirements I only read about mirroring clusters. I was
> afraid of the CPUs being too slow to calculate the erasure codes we planned
> to use.

Erasure coding is quite fast, you are not running into a CPU bottleneck
anytime soon on HDDs. I don't have the numbers in my head, but just try
running perf top on an erasure coded OSD while it's recovering; erasure
coding is really insignificant here.

Paul

> > In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.
>
> This is a really good hint, because we just started to plan the extension.
>
> > I'd probably only use the 25G network for both networks instead of
> > using both. Splitting the network usually doesn't help.
>
> This is something I was told to do, because a reconstruction of failed
> OSDs/disks would have a heavy impact on the backend network.
>
> > Paul
>
> Thanks again.
> Lars
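Something along these lines works for that check; the OSD id and the pgrep
pattern are placeholders and depend on how your OSDs are started (the
pattern below assumes the usual systemd unit command line):

    # watch where one OSD of the EC pool spends CPU while it recovers
    perf top -p "$(pgrep -f 'ceph-osd .*--id 12')"

If the erasure code plugin barely shows up among the hot functions, CPU is
not your recovery bottleneck.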
Re: [ceph-users] how to judge the results? - rados bench comparison
25 Gbit/s doesn't have a significant latency advantage over 10 Gbit/s.

For reference: a point-to-point 10 Gbit/s fiber link takes around 300 ns of
processing for rx+tx on standard Intel X520 NICs (measured it), so not much
to save here.
Then there's serialization latency, which changes from 0.8 ns/byte to
0.32 ns/byte, i.e., for a small 4 kb IO there's an advantage of only about
2 µs. That's not really significant unless you run all your storage on
NVDIMMs or in RAM or something.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Apr 17, 2019 at 10:52 AM Christian Balzer wrote:
>
> On Wed, 17 Apr 2019 10:39:10 +0200 Lars Täuber wrote:
> > Wed, 17 Apr 2019 09:52:29 +0200
> > Stefan Kooman ==> Lars Täuber :
> > > Quoting Lars Täuber (taeu...@bbaw.de):
> > > > > I'd probably only use the 25G network for both networks instead of
> > > > > using both. Splitting the network usually doesn't help.
> > > >
> > > > This is something I was told to do, because a reconstruction of failed
> > > > OSDs/disks would have a heavy impact on the backend network.
> > >
> > > Opinions vary on running "public" only versus "public" / "backend".
> > > Having a separate "backend" network might lead to difficult-to-debug
> > > issues when the "public" network is working fine, but the "backend" is
> > > having issues and OSDs can't peer with each other, while the clients
> > > can talk to all OSDs. You will get slow requests and OSDs marking each
> > > other down while they are still running, etc.
> >
> > This I was not aware of.
> >
> Split networks are usually more trouble than they're worth and as stated
> only help when your OSD speeds exceed the network bandwidth _and_ you
> can't do CLAG bonding over switches that support it, gaining both
> additional bandwidth and redundancy.
>
> > > In your case with only 6 spinners max per server there is no way you
> > > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > > both OSD replication traffic and client IO.
> >
> > The reason for the choice of the 25 GBit network was a remark by someone
> > that the latency in this Ethernet is way below that of 10 GBit. I never
> > double-checked this.
> >
> Correct, 25 Gb/s is a split of 100 Gb/s, inheriting the latency advantages
> from it.
> So if you do a lot of small IOPS, this will help.
>
> But only completely so if everything is in the same boat.
> So if your clients (or most of them at least) can be on 25 Gb/s as well,
> that would be the best situation, with a non-split network.
>
> Christian
>
> > > My 2 cents,
> > >
> > > Gr. Stefan
> >
> > Cheers,
> > Lars
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Rakuten Communications
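Those per-byte serialization figures line up with the link speeds
(8 bits / 10 Gbit/s = 0.8 ns per byte, 8 bits / 25 Gbit/s = 0.32 ns per
byte), and the ~2 µs saving for a 4 KiB IO follows directly:

    # serialization time saved for a 4 KiB payload when moving from 10G to 25G
    echo "4096 * (0.8 - 0.32)" | bc    # -> ~1966 ns, i.e. roughly 2 microseconds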
Re: [ceph-users] how to judge the results? - rados bench comparison
On Wed, 17 Apr 2019 10:39:10 +0200 Lars Täuber wrote:

> Wed, 17 Apr 2019 09:52:29 +0200
> Stefan Kooman ==> Lars Täuber :
> > Quoting Lars Täuber (taeu...@bbaw.de):
> > > > I'd probably only use the 25G network for both networks instead of
> > > > using both. Splitting the network usually doesn't help.
> > >
> > > This is something I was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.
> >
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult-to-debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running, etc.
>
> This I was not aware of.
>
Split networks are usually more trouble than they're worth and as stated
only help when your OSD speeds exceed the network bandwidth _and_ you
can't do CLAG bonding over switches that support it, gaining both
additional bandwidth and redundancy.

> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
>
> The reason for the choice of the 25 GBit network was a remark by someone
> that the latency in this Ethernet is way below that of 10 GBit. I never
> double-checked this.
>
Correct, 25 Gb/s is a split of 100 Gb/s, inheriting the latency advantages
from it.
So if you do a lot of small IOPS, this will help.

But only completely so if everything is in the same boat.
So if your clients (or most of them at least) can be on 25 Gb/s as well,
that would be the best situation, with a non-split network.

Christian

> > My 2 cents,
> >
> > Gr. Stefan
>
> Cheers,
> Lars

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
Re: [ceph-users] how to judge the results? - rados bench comparison
The standard argument that it helps prevent recovery traffic from clogging
the network and impacting client traffic is misleading:

* write client traffic relies on the backend network for replication
  operations: your client (write) traffic is impacted anyway if the
  backend network is full

* you are usually not limited by network speed for recovery (except
  for 1 gbit networks), and if you are you probably want to reduce
  recovery speed anyway if you would run into that limit

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Apr 17, 2019 at 10:39 AM Lars Täuber wrote:
>
> Wed, 17 Apr 2019 09:52:29 +0200
> Stefan Kooman ==> Lars Täuber :
> > Quoting Lars Täuber (taeu...@bbaw.de):
> > > > I'd probably only use the 25G network for both networks instead of
> > > > using both. Splitting the network usually doesn't help.
> > >
> > > This is something I was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.
> >
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult-to-debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running, etc.
>
> This I was not aware of.
>
> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
>
> The reason for the choice of the 25 GBit network was a remark by someone
> that the latency in this Ethernet is way below that of 10 GBit. I never
> double-checked this.
>
> > My 2 cents,
> >
> > Gr. Stefan
>
> Cheers,
> Lars
Re: [ceph-users] how to judge the results? - rados bench comparison
Wed, 17 Apr 2019 09:52:29 +0200
Stefan Kooman ==> Lars Täuber :
> Quoting Lars Täuber (taeu...@bbaw.de):
> > > I'd probably only use the 25G network for both networks instead of
> > > using both. Splitting the network usually doesn't help.
> >
> > This is something I was told to do, because a reconstruction of failed
> > OSDs/disks would have a heavy impact on the backend network.
>
> Opinions vary on running "public" only versus "public" / "backend".
> Having a separate "backend" network might lead to difficult-to-debug
> issues when the "public" network is working fine, but the "backend" is
> having issues and OSDs can't peer with each other, while the clients can
> talk to all OSDs. You will get slow requests and OSDs marking each other
> down while they are still running, etc.

This I was not aware of.

> In your case with only 6 spinners max per server there is no way you
> will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> (for large spinners) should be just enough to fill a 10 Gb/s link. A
> redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> both OSD replication traffic and client IO.

The reason for the choice of the 25 GBit network was a remark by someone
that the latency in this Ethernet is way below that of 10 GBit. I never
double-checked this.

> My 2 cents,
>
> Gr. Stefan

Cheers,
Lars
Re: [ceph-users] how to judge the results? - rados bench comparison
Quoting Lars Täuber (taeu...@bbaw.de):
> > I'd probably only use the 25G network for both networks instead of
> > using both. Splitting the network usually doesn't help.
>
> This is something I was told to do, because a reconstruction of failed
> OSDs/disks would have a heavy impact on the backend network.

Opinions vary on running "public" only versus "public" / "backend".
Having a separate "backend" network might lead to difficult-to-debug issues
when the "public" network is working fine, but the "backend" is having
issues and OSDs can't peer with each other, while the clients can talk to
all OSDs. You will get slow requests and OSDs marking each other down while
they are still running, etc.

There might also be pros for running a separate "backend" network, anyone?

In your case with only 6 spinners max per server there is no way you will
ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s (for
large spinners) should be just enough to fill a 10 Gb/s link. A redundant
25 Gb/s link would provide 50 Gb/s of bandwidth, enough for both OSD
replication traffic and client IO.

My 2 cents,

Gr. Stefan

--
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl
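For anyone following along, the "public"/"backend" split being discussed is
the usual public/cluster network pair in ceph.conf; a minimal sketch, with
made-up example subnets standing in for the 10 Gb/s and 25 Gb/s networks:

    [global]
        # client <-> MON/OSD traffic ("frontend"/"public")
        public network  = 192.0.2.0/24
        # OSD <-> OSD replication and recovery traffic ("backend")
        cluster network = 198.51.100.0/24

Leaving "cluster network" unset simply runs everything over the public
network, which is the single-network setup Paul recommends here.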
Re: [ceph-users] how to judge the results? - rados bench comparison
Thanks Paul for the judgement.

Tue, 16 Apr 2019 10:13:03 +0200
Paul Emmerich ==> Lars Täuber :
> Seems in line with what I'd expect for the hardware.
>
> Your hardware seems to be way overspecced, you'd be fine with half the
> RAM, half the CPU and way cheaper disks.

Do you mean all the components of the cluster or only the OSD nodes?
Before making the requirements I only read about mirroring clusters. I was
afraid of the CPUs being too slow to calculate the erasure codes we planned
to use.

> In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

This is a really good hint, because we just started to plan the extension.

> I'd probably only use the 25G network for both networks instead of
> using both. Splitting the network usually doesn't help.

This is something I was told to do, because a reconstruction of failed
OSDs/disks would have a heavy impact on the backend network.

> Paul

Thanks again.
Lars
Re: [ceph-users] how to judge the results? - rados bench comparison
Seems in line with what I'd expect for the hardware.

Your hardware seems to be way overspecced, you'd be fine with half the RAM,
half the CPU and way cheaper disks.
In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

I'd probably only use the 25G network for both networks instead of using
both. Splitting the network usually doesn't help.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Apr 8, 2019 at 2:16 PM Lars Täuber wrote:
>
> Hi there,
>
> I'm new to Ceph and just got my first cluster running.
> Now I'd like to know if the performance we get is what can be expected.
>
> Is there a website with benchmark results somewhere where I could have a
> look to compare with our HW and our results?
>
> These are the results:
>
> rados bench, single threaded:
> # rados bench 10 write --rbd-cache=false -t 1
>
> Object size:            4194304
> Bandwidth (MB/sec):     53.7186
> Stddev Bandwidth:       3.86437
> Max bandwidth (MB/sec): 60
> Min bandwidth (MB/sec): 48
> Average IOPS:           13
> Stddev IOPS:            0.966092
> Average Latency(s):     0.0744599
> Stddev Latency(s):      0.00911778
>
> nearly maxing out one (idle) client with 28 threads:
> # rados bench 10 write --rbd-cache=false -t 28
>
> Bandwidth (MB/sec):     850.451
> Stddev Bandwidth:       40.6699
> Max bandwidth (MB/sec): 904
> Min bandwidth (MB/sec): 748
> Average IOPS:           212
> Stddev IOPS:            10.1675
> Average Latency(s):     0.131309
> Stddev Latency(s):      0.0318489
>
> four concurrent benchmarks on four clients, each with 24 threads:
> Bandwidth (MB/sec):     396    376    381    389
> Stddev Bandwidth:       30     25     22     22
> Max bandwidth (MB/sec): 440    420    416    428
> Min bandwidth (MB/sec): 352    348    344    364
> Average IOPS:           99     94     95     97
> Stddev IOPS:            7.5    6.3    5.6    5.6
> Average Latency(s):     0.24   0.25   0.25   0.24
> Stddev Latency(s):      0.12   0.15   0.15   0.14
>
> Summing up, write mode:
> ~1500 MB/sec bandwidth
> ~385 IOPS
> ~0.25 s latency
>
> rand mode:
> ~3500 MB/sec
> ~920 IOPS
> ~0.154 s latency
>
> Maybe someone could judge our numbers. I am actually very satisfied with
> the values.
>
> The (mostly idle) cluster is built from these components:
> * 10Gb frontend network, bonding two connections to mon, mds and osd nodes
> ** no bonding to clients
> * 25Gb backend network, bonding two connections to osd nodes
>
> cluster:
> * 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
> * 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128GB RAM
> * 7x OSD node, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
> ** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per
>    chassis for later growth)
> ** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM ~116 GiB per
>    OSD for DB and WAL
>
> erasure coded pool (made for CephFS):
> * plugin=clay k=5 m=2 d=6 crush-failure-domain=host
>
> Thanks and best regards
> Lars
[ceph-users] how to judge the results? - rados bench comparison
Hi there,

I'm new to Ceph and just got my first cluster running.
Now I'd like to know if the performance we get is what can be expected.

Is there a website with benchmark results somewhere where I could have a
look to compare with our HW and our results?

These are the results:

rados bench, single threaded:
# rados bench 10 write --rbd-cache=false -t 1

Object size:            4194304
Bandwidth (MB/sec):     53.7186
Stddev Bandwidth:       3.86437
Max bandwidth (MB/sec): 60
Min bandwidth (MB/sec): 48
Average IOPS:           13
Stddev IOPS:            0.966092
Average Latency(s):     0.0744599
Stddev Latency(s):      0.00911778

nearly maxing out one (idle) client with 28 threads:
# rados bench 10 write --rbd-cache=false -t 28

Bandwidth (MB/sec):     850.451
Stddev Bandwidth:       40.6699
Max bandwidth (MB/sec): 904
Min bandwidth (MB/sec): 748
Average IOPS:           212
Stddev IOPS:            10.1675
Average Latency(s):     0.131309
Stddev Latency(s):      0.0318489

four concurrent benchmarks on four clients, each with 24 threads:
Bandwidth (MB/sec):     396    376    381    389
Stddev Bandwidth:       30     25     22     22
Max bandwidth (MB/sec): 440    420    416    428
Min bandwidth (MB/sec): 352    348    344    364
Average IOPS:           99     94     95     97
Stddev IOPS:            7.5    6.3    5.6    5.6
Average Latency(s):     0.24   0.25   0.25   0.24
Stddev Latency(s):      0.12   0.15   0.15   0.14

Summing up, write mode:
~1500 MB/sec bandwidth
~385 IOPS
~0.25 s latency

rand mode:
~3500 MB/sec
~920 IOPS
~0.154 s latency

Maybe someone could judge our numbers. I am actually very satisfied with
the values.

The (mostly idle) cluster is built from these components:
* 10Gb frontend network, bonding two connections to mon, mds and osd nodes
** no bonding to clients
* 25Gb backend network, bonding two connections to osd nodes

cluster:
* 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
* 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128GB RAM
* 7x OSD node, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per
   chassis for later growth)
** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM ~116 GiB per
   OSD for DB and WAL

erasure coded pool (made for CephFS):
* plugin=clay k=5 m=2 d=6 crush-failure-domain=host

Thanks and best regards
Lars
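For context, a data pool with that profile would typically be created along
these lines; the profile name, pool name and PG count here are placeholders,
not necessarily what the cluster above actually uses:

    # erasure code profile matching the description above
    ceph osd erasure-code-profile set clay_5_2 \
        plugin=clay k=5 m=2 d=6 crush-failure-domain=host
    # CephFS data pool on that profile; EC overwrites are required for
    # CephFS (and RBD) data on erasure coded BlueStore pools
    ceph osd pool create cephfs_data 256 256 erasure clay_5_2
    ceph osd pool set cephfs_data allow_ec_overwrites true

With k=5, m=2 the pool survives two failed hosts at a storage overhead of
7/5 = 1.4x, which is the trade-off behind choosing that profile.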