Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Christian Balzer
On Wed, 17 Apr 2019 16:08:34 +0200 Lars Täuber wrote:

> Wed, 17 Apr 2019 20:01:28 +0900
> Christian Balzer  ==> Ceph Users  :
> > On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:
> >   
> > > Wed, 17 Apr 2019 10:47:32 +0200
> > > Paul Emmerich  ==> Lars Täuber  
> > > :
> > > > The standard argument that it helps prevent recovery traffic from
> > > > clogging the network and impacting client traffic is misleading:
> > > 
> > > What do you mean by "it"? I don't know the standard argument.
> > > Do you mean separating the networks or do you mean having both together 
> > > in one switched network?
> > > 
> > He means separated networks, obviously.
> >   
> > > > 
> > > > * write client traffic relies on the backend network for replication
> > > > operations: your client (write) traffic is impacted anyways if the
> > > > backend network is full  
> > > 
> > > This I understand as an argument for separating the networks and the 
> > > backend network being faster than the frontend network.
> > > So in case of reconstruction there should be some bandwidth left in the 
> > > backend for the traffic that is used for the client IO.
> > > 
> > You need to run the numbers and look at the big picture.
> > As mentioned already, this is all moot in your case.
> > 
> > 6 HDDs at realistically 150MB/s each, if they were all doing sequential
> > I/O, which they aren't.
> > But for the sake of argument let's say that one of your nodes can read
> > (or write, not both at the same time) 900MB/s.
> > That's still less than half of a single 25Gb/s link.
> 
> Is this really still true with the WAL device (combined with the DB device),
> which is a (fast) SSD in our setup?
> reading:  2150 MB/s
> writing:  2120 MB/s
> IOPS 4K reading/writing:  440k/320k
> 
> If so, the HW requirements for the next version of the OSD hosts will be adjusted.
> 
Yes.
Read up on how WAL/DB is involved in client data writes (only small ones)
and reads (not at all). 

Small writes will incur CPU (Ceph) and latency penalties (Ceph and
network) and on top of that your WAL will run out of space quickly, too.
It's nice for small bursts, but nothing sustained.
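If you want to see the cutoff yourself, the deferred-write threshold can be
read from a running OSD's admin socket (a sketch, assuming an OSD with id 0 on
the local host; this is the BlueStore option for HDD-backed OSDs):

  # writes smaller than this go through the WAL/RocksDB first ("deferred"),
  # larger writes and all reads go straight to the data device
  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd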

> 
> > And that very hypothetical data rate (it's not sequential, you will have
> > concurrent operations and thus seeks) is all your node can handle; if it is
> > all going into recovery/rebalancing, your clients are starved because of
> > that, not because of bandwidth exhaustion.
> 
> If it is like this even with our SSD WAL, the HW requirements for the next
> version of the OSD hosts will be adjusted.
> 
Most people trim down the recovery/backfill settings so that they don't impact
client I/O, which again makes the network separation less useful.
And the WAL/DB is not involved in these activities at all, as they happen
on an object (4MB default) level.
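For reference, "trimming down" usually just means something like the following
(a sketch; the values are illustrative and some may already be the defaults on
your release):

  # limit concurrent backfills and recovery ops per OSD, and add a small
  # per-op sleep on HDD OSDs so client I/O gets a chance in between
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config set osd osd_recovery_sleep_hdd 0.1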

If I had 25Gb/s and 10Gb/s ports and your nodes and were dead set on
separating networks (I'm not), I'd give the faster one to the clients so
they can benefit from cached reads while replication and recovery still
wouldn't be limited by the 10Gb/s network.


Christian

> Thanks
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Lars Täuber
Wed, 17 Apr 2019 20:01:28 +0900
Christian Balzer  ==> Ceph Users  :
> On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:
> 
> > Wed, 17 Apr 2019 10:47:32 +0200
> > Paul Emmerich  ==> Lars Täuber  :  
> > > The standard argument that it helps prevent recovery traffic from
> > > clogging the network and impacting client traffic is misleading:
> > 
> > What do you mean by "it"? I don't know the standard argument.
> > Do you mean separating the networks or do you mean having both together in 
> > one switched network?
> >   
> He means separated networks, obviously.
> 
> > > 
> > > * write client traffic relies on the backend network for replication
> > > operations: your client (write) traffic is impacted anyways if the
> > > backend network is full
> > 
> > This I understand as an argument for separating the networks and the 
> > backend network being faster than the frontend network.
> > So in case of reconstruction there should be some bandwidth left in the 
> > backend for the traffic that is used for the client IO.
> >   
> You need to run the numbers and look at the big picture.
> As mentioned already, this is all moot in your case.
> 
> 6 HDDs at realistically 150MB/s each, if they were all doing sequential
> I/O, which they aren't.
> But for the sake of argument let's say that one of your nodes can read
> (or write, not both at the same time) 900MB/s.
> That's still less than half of a single 25Gb/s link.

Is this really still true with the WAL device (combined with the DB device),
which is a (fast) SSD in our setup?
reading:  2150 MB/s
writing:  2120 MB/s
IOPS 4K reading/writing:  440k/320k

If so, the HW requirements for the next version of the OSD hosts will be adjusted.


> And that very hypothetical data rate (it's not sequential, you will have
> concurrent operations and thus seeks) is all your node can handle; if it is
> all going into recovery/rebalancing, your clients are starved because of
> that, not because of bandwidth exhaustion.

If it is like this even with our SSD WAL, the HW requirements for the next
version of the OSD hosts will be adjusted.

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Christian Balzer
On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:

> Wed, 17 Apr 2019 10:47:32 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > The standard argument that it helps prevent recovery traffic from
> > clogging the network and impacting client traffic is misleading:
> 
> What do you mean by "it"? I don't know the standard argument.
> Do you mean separating the networks or do you mean having both together in 
> one switched network?
> 
He means separated networks, obviously.

> > 
> > * write client traffic relies on the backend network for replication
> > operations: your client (write) traffic is impacted anyways if the
> > backend network is full  
> 
> This I understand as an argument for separating the networks and the backend 
> network being faster than the frontend network.
> So in case of reconstruction there should be some bandwidth left in the 
> backend for the traffic that is used for the client IO.
> 
You need to run the numbers and look at the big picture.
As mentioned already, this is all moot in your case.

6 HDDs at realistically 150MB/s each, if they were all doing sequential
I/O, which they aren't.
But for the sake of argument let's say that one of your nodes can read
(or write, not both at the same time) 900MB/s.
That's still less than half of a single 25Gb/s link.

And that very hypothetical data rate (it's not sequential, you will have
concurrent operations and thus seeks) is all your node can handle; if it is
all going into recovery/rebalancing, your clients are starved because of
that, not because of bandwidth exhaustion.
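The back-of-the-envelope math, for anyone who wants to redo it (plain shell,
nothing Ceph-specific):

  echo "node HDD throughput: $((6 * 150)) MB/s"        # 900 MB/s, best case
  echo "25Gb/s link:         $((25 * 1000 / 8)) MB/s"  # ~3125 MB/s
  # 900 MB/s is well under half of ~3125 MB/s, so the disks, not the
  # network, are the limit on such a node.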

> 
> > * you are usually not limited by network speed for recovery (except
> > for 1 gbit networks), and if you are you probably want to reduce
> > recovery speed anyways if you would run into that limit
> > 
> > Paul
> >   
> 
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Lars Täuber
Wed, 17 Apr 2019 10:47:32 +0200
Paul Emmerich  ==> Lars Täuber  :
> The standard argument that it helps prevent recovery traffic from
> clogging the network and impacting client traffic is misleading:

What do you mean by "it"? I don't know the standard argument.
Do you mean separating the networks or do you mean having both together in one 
switched network?

> 
> * write client traffic relies on the backend network for replication
> operations: your client (write) traffic is impacted anyways if the
> backend network is full

This I understand as an argument for separating the networks and the backend 
network being faster than the frontend network.
So in case of reconstruction there should be some bandwidth left in the backend 
for the traffic that is used for the client IO.


> * you are usually not limited by network speed for recovery (except
> for 1 gbit networks), and if you are you probably want to reduce
> recovery speed anyways if you would run into that limit
> 
> Paul
> 

Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Stefan Kooman
Quoting Lars Täuber (taeu...@bbaw.de):
> > > This is something i was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.  
> > 
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult to debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running etc.
> 
> This I was not aware of.

It's real. I've been bitten by this several times in a PoC cluster while
playing around with networking ... make sure you have proper monitoring checks
on all network interfaces when running this setup.

> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
> 
> The reason for the choice of the 25GBit network was a remark by someone
> that the latency of this Ethernet is way below that of 10GBit. I never
> double-checked this.

This is probably true. 25 Gb/s is a single lane (SerDes) which is used in
50/100/200 Gb/s connections. It operates at ~2.5 times the clock rate of
10 Gb/s / 40 Gb/s. But for clients to fully benefit from this lower latency,
they should be on 25 Gb/s as well. If you can afford to redesign your cluster
(and low latency is important) ... Then again ... the latency your spinners
introduce is a few orders of magnitude higher than the network latency ... I
would then (also) invest in NVMe drives for (at least) metadata ... and switch
to 3 x replication ... but that might be too much to ask for.

TL;DR: when designing clusters, try to think about the "weakest" link
(bottleneck) ... most probably this will be disk speed / Ceph overhead.

Gr. Stefan

-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Paul Emmerich
On Wed, Apr 17, 2019 at 7:56 AM Lars Täuber  wrote:
>
> Thanks Paul for the judgement.
>
> Tue, 16 Apr 2019 10:13:03 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > Seems in line with what I'd expect for the hardware.
> >
> > Your hardware seems to be way overspecced, you'd be fine with half the
> > RAM, half the CPU and way cheaper disks.
>
> Do you mean all the components of the cluster or only the OSD-nodes?
> Before making the requirements I only read about mirroring clusters. I was
> afraid of the CPUs being too slow to calculate the erasure codes we planned
> to use.

Erasure coding is quite fast; you are not running into a CPU
bottleneck anytime soon on HDDs.
I don't have the numbers in my head, but just try running perf top on
an erasure-coded OSD while it's recovering; erasure coding is really
insignificant there.
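Something along these lines is enough to check it yourself (a sketch, assuming
the perf tool is installed on an OSD host; pgrep -o simply picks one ceph-osd
process):

  # profile a running OSD during recovery and look for the erasure-code
  # plugin symbols (libec_*.so) in the output; they normally account for
  # only a small share of the samples
  perf top -p "$(pgrep -o ceph-osd)"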


Paul


>
>
> > In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.
>
> This is a really good hint, because we just started to plan the extension.
>
> >
> > I'd probably only use the 25G network for both networks instead of
> > using both. Splitting the network usually doesn't help.
>
> This is something i was told to do, because a reconstruction of failed 
> OSDs/disks would have a heavy impact on the backend network.
>
>
> >
> > Paul
> >
>
> Thanks again.
>
> Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Paul Emmerich
25 Gbit/s doesn't have a significant latency advantage over 10 Gbit/s.

For reference: a point-to-point 10 Gbit/s fiber link takes around 300 ns
of processing for rx+tx on standard Intel X520 NICs (I measured it),
so there is not much to save here.
Then there's serialization latency, which changes from 0.8 ns/byte to
0.32 ns/byte, i.e., for a small 4 kB IO there's an advantage of only
about 2 µs.

That's not really significant unless you run all your storage on
NVDIMMs or in RAM or something.
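For anyone who wants to redo that arithmetic (using the per-byte numbers
above, 4 kB = 4096 bytes):

  awk 'BEGIN {
      b = 4096
      t10 = b * 0.80   # ~3277 ns serialization at 10 Gbit/s
      t25 = b * 0.32   # ~1311 ns serialization at 25 Gbit/s
      printf "saved: %.1f us per 4 kB IO\n", (t10 - t25) / 1000
  }'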


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Apr 17, 2019 at 10:52 AM Christian Balzer  wrote:
>
> On Wed, 17 Apr 2019 10:39:10 +0200 Lars Täuber wrote:
>
> > Wed, 17 Apr 2019 09:52:29 +0200
> > Stefan Kooman  ==> Lars Täuber  :
> > > Quoting Lars Täuber (taeu...@bbaw.de):
> > > > > I'd probably only use the 25G network for both networks instead of
> > > > > using both. Splitting the network usually doesn't help.
> > > >
> > > > This is something i was told to do, because a reconstruction of failed
> > > > OSDs/disks would have a heavy impact on the backend network.
> > >
> > > Opinions vary on running "public" only versus "public" / "backend".
> > > Having a separate "backend" network might lead to difficult to debug
> > > issues when the "public" network is working fine, but the "backend" is
> > > having issues and OSDs can't peer with each other, while the clients can
> > > talk to all OSDs. You will get slow requests and OSDs marking each other
> > > down while they are still running etc.
> >
> > This I was not aware of.
> >
> Split networks are usually more trouble than they're worth and as stated
> only help when your OSD speeds exceed the network bandwidth _and_ you
> can't do CLAG (MLAG) bonding across switches that support it, gaining both
> additional bandwidth and redundancy.
>
> >
> > > In your case with only 6 spinners max per server there is no way you
> > > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > > both OSD replication traffic and client IO.
> >
> > The reason for the choice of the 25GBit network was a remark by someone
> > that the latency of this Ethernet is way below that of 10GBit. I never
> > double-checked this.
> >
> Correct, 25Gb/s is a split of 100Gb/s, inheriting the latency advantages
> from it.
> So if you do a lot of small IOPS, this will help.
>
> But it only helps completely if everything is in the same boat.
>
> So if your clients (or most of them at least) can be on 25Gb/s as well,
> that would be the best situation, with a non-split network.
>
> Christian
>
> >
> > >
> > > My 2 cents,
> > >
> > > Gr. Stefan
> > >
> >
> > Cheers,
> > Lars
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Christian Balzer
On Wed, 17 Apr 2019 10:39:10 +0200 Lars Täuber wrote:

> Wed, 17 Apr 2019 09:52:29 +0200
> Stefan Kooman  ==> Lars Täuber  :
> > Quoting Lars Täuber (taeu...@bbaw.de):  
> > > > I'd probably only use the 25G network for both networks instead of
> > > > using both. Splitting the network usually doesn't help.
> > > 
> > > This is something i was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.
> > 
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult to debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running etc.  
> 
> This I was not aware of.
> 
Split networks are usually more trouble than they're worth and as stated
only help when your OSD speeds exceed the network bandwidth _and_ you
can't do CLAG (MLAG) bonding across switches that support it, gaining both
additional bandwidth and redundancy.
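For completeness, the host side of such a setup is just an ordinary LACP
(802.3ad) bond; a minimal iproute2 sketch with made-up interface names and
addressing (the switch pair has to present the matching MLAG/CLAG side of the
same bond):

  ip link add bond0 type bond mode 802.3ad xmit_hash_policy layer3+4
  ip link set enp94s0f0 down && ip link set enp94s0f0 master bond0
  ip link set enp94s0f1 down && ip link set enp94s0f1 master bond0
  ip link set bond0 up
  ip addr add 192.168.25.11/24 dev bond0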

> 
> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
> 
> The reason for the choice of the 25GBit network was a remark by someone
> that the latency of this Ethernet is way below that of 10GBit. I never
> double-checked this.
> 
Correct, 25Gb/s is a split of 100Gb/s, inheriting the latency advantages
from it.
So if you do a lot of small IOPS, this will help.

But it only helps completely if everything is in the same boat.

So if your clients (or most of them at least) can be on 25Gb/s as well,
that would be the best situation, with a non-split network.

Christian

> 
> > 
> > My 2 cents,
> > 
> > Gr. Stefan
> >   
> 
> Cheers,
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Paul Emmerich
The standard argument that it helps prevent recovery traffic from
clogging the network and impacting client traffic is misleading:

* write client traffic relies on the backend network for replication
operations: your client (write) traffic is impacted anyways if the
backend network is full
* you are usually not limited by network speed for recovery (except
for 1 gbit networks), and if you are you probably want to reduce
recovery speed anyways if you would run into that limit

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Apr 17, 2019 at 10:39 AM Lars Täuber  wrote:
>
> Wed, 17 Apr 2019 09:52:29 +0200
> Stefan Kooman  ==> Lars Täuber  :
> > Quoting Lars Täuber (taeu...@bbaw.de):
> > > > I'd probably only use the 25G network for both networks instead of
> > > > using both. Splitting the network usually doesn't help.
> > >
> > > This is something i was told to do, because a reconstruction of failed
> > > OSDs/disks would have a heavy impact on the backend network.
> >
> > Opinions vary on running "public" only versus "public" / "backend".
> > Having a separate "backend" network might lead to difficult to debug
> > issues when the "public" network is working fine, but the "backend" is
> > having issues and OSDs can't peer with each other, while the clients can
> > talk to all OSDs. You will get slow requests and OSDs marking each other
> > down while they are still running etc.
>
> This I was not aware of.
>
>
> > In your case with only 6 spinners max per server there is no way you
> > will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> > (for large spinners) should be just enough to fill a 10 Gb/s link. A
> > redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> > both OSD replication traffic and client IO.
>
> The reason for the choice of the 25GBit network was a remark by someone
> that the latency of this Ethernet is way below that of 10GBit. I never
> double-checked this.
>
>
> >
> > My 2 cents,
> >
> > Gr. Stefan
> >
>
> Cheers,
> Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Lars Täuber
Wed, 17 Apr 2019 09:52:29 +0200
Stefan Kooman  ==> Lars Täuber  :
> Quoting Lars Täuber (taeu...@bbaw.de):
> > > I'd probably only use the 25G network for both networks instead of
> > > using both. Splitting the network usually doesn't help.  
> > 
> > This is something i was told to do, because a reconstruction of failed
> > OSDs/disks would have a heavy impact on the backend network.  
> 
> Opinions vary on running "public" only versus "public" / "backend".
> Having a separate "backend" network might lead to difficult to debug
> issues when the "public" network is working fine, but the "backend" is
> having issues and OSDs can't peer with each other, while the clients can
> talk to all OSDs. You will get slow requests and OSDs marking each other
> down while they are still running etc.

This I was not aware of.


> In your case with only 6 spinners max per server there is no way you
> will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> (for large spinners) should be just enough to fill a 10 Gb/s link. A
> redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
> both OSD replication traffic and client IO.

The reason for the choice of the 25GBit network was a remark by someone that
the latency of this Ethernet is way below that of 10GBit. I never
double-checked this.


> 
> My 2 cents,
> 
> Gr. Stefan
> 

Cheers,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Stefan Kooman
Quoting Lars Täuber (taeu...@bbaw.de):
> > I'd probably only use the 25G network for both networks instead of
> > using both. Splitting the network usually doesn't help.
> 
> This is something i was told to do, because a reconstruction of failed
> OSDs/disks would have a heavy impact on the backend network.

Opinions vary on running "public" only versus "public" / "backend".
Having a separate "backend" network might lead to difficult to debug
issues when the "public" network is working fine, but the "backend" is
having issues and OSDs can't peer with each other, while the clients can
talk to all OSDs. You will get slow requests and OSDs marking each other
down while they are still running etc.

There might also be pros for running a separate "backend" network;
anyone?

In your case with only 6 spinners max per server there is no way you
will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
(for large spinners) should be just enough to fill a 10 Gb/s link. A
redundant 25 Gb/s link would provide 50 Gb/s of bandwidth, enough for
both OSD replication traffic and client IO.
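Checking those numbers (MB/s to Gb/s is just *8/1000); six large spinners
would actually slightly exceed a single 10 Gb/s link, but stay far below the
bonded pair:

  echo "6 spinners:       $((6 * 250)) MB/s = $((6 * 250 * 8 / 1000)) Gb/s"  # 1500 MB/s = 12 Gb/s
  echo "2x 25 Gb/s bond:  $((2 * 25 * 1000 / 8)) MB/s"                       # 6250 MB/s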

My 2 cents,

Gr. Stefan

-- 
| BIT BV   http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-16 Thread Lars Täuber
Thanks Paul for the judgement.

Tue, 16 Apr 2019 10:13:03 +0200
Paul Emmerich  ==> Lars Täuber  :
> Seems in line with what I'd expect for the hardware.
> 
> Your hardware seems to be way overspecced, you'd be fine with half the
> RAM, half the CPU and way cheaper disks.

Do you mean all the components of the cluster or only the OSD-nodes?
Before making the requirements I only read about mirroring clusters. I was
afraid of the CPUs being too slow to calculate the erasure codes we planned
to use.


> In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

This is a really good hint, because we just started to plan the extension.

> 
> I'd probably only use the 25G network for both networks instead of
> using both. Splitting the network usually doesn't help.

This is something I was told to do, because a reconstruction of failed
OSDs/disks would have a heavy impact on the backend network.


> 
> Paul
> 

Thanks again.

Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-16 Thread Paul Emmerich
Seems in line with what I'd expect for the hardware.

Your hardware seems to be way overspecced; you'd be fine with half the
RAM, half the CPU and way cheaper disks.
In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

I'd probably only use the 25G network for both networks instead of
using both. Splitting the network usually doesn't help.
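In ceph.conf terms that just means defining only a public network and leaving
cluster_network out; a sketch with a made-up subnet (normally you would edit
the existing [global] section rather than append a new one):

  # with no cluster_network defined, replication and recovery traffic
  # simply shares the public network
  printf '%s\n' '[global]' 'public_network = 192.168.25.0/24' \
      | sudo tee -a /etc/ceph/ceph.conf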


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Mon, Apr 8, 2019 at 2:16 PM Lars Täuber  wrote:
>
> Hi there,
>
> i'm new to ceph and just got my first cluster running.
> Now i'd like to know if the performance we get is expectable.
>
> Is there a website with benchmark results somewhere where i could have a look 
> to compare with our HW and our results?
>
> These are the results:
> rados bench single threaded:
> # rados bench 10 write --rbd-cache=false -t 1
>
> Object size:4194304
> Bandwidth (MB/sec): 53.7186
> Stddev Bandwidth:   3.86437
> Max bandwidth (MB/sec): 60
> Min bandwidth (MB/sec): 48
> Average IOPS:   13
> Stddev IOPS:0.966092
> Average Latency(s): 0.0744599
> Stddev Latency(s):  0.00911778
>
> nearly maxing out one (idle) client with 28 threads
> # rados bench 10 write --rbd-cache=false -t 28
>
> Bandwidth (MB/sec): 850.451
> Stddev Bandwidth:   40.6699
> Max bandwidth (MB/sec): 904
> Min bandwidth (MB/sec): 748
> Average IOPS:   212
> Stddev IOPS:10.1675
> Average Latency(s): 0.131309
> Stddev Latency(s):  0.0318489
>
> four concurrent benchmarks on four clients each with 24 threads:
> Bandwidth (MB/sec): 396 376 381 389
> Stddev Bandwidth:   30  25  22  22
> Max bandwidth (MB/sec): 440 420 416 428
> Min bandwidth (MB/sec): 352 348 344 364
> Average IOPS:   99  94  95  97
> Stddev IOPS:7.5 6.3 5.6 5.6
> Average Latency(s): 0.24    0.25    0.25    0.24
> Stddev Latency(s):  0.12    0.15    0.15    0.14
>
> summing up: write mode
> ~1500 MB/sec Bandwidth
> ~385 IOPS
> ~0.25s Latency
>
> rand mode:
> ~3500 MB/sec
> ~920 IOPS
> ~0.154s Latency
>
>
>
> Maybe someone could judge our numbers. I am actually very satisfied with the 
> values.
>
> The (mostly idle) cluster is built from these components:
> * 10GBit frontend network, bonding two connections to mon-, mds- and osd-nodes
> ** no bonding to clients
> * 25GBit backend network, bonding two connections to osd-nodes
>
>
> cluster:
> * 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
> * 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128GB RAM
> * 7x OSD-nodes, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
> ** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per chassis 
> for later growth)
> ** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM ~116 GiB per OSD 
> for DB and WAL
>
> erasure encoded pool: (made for CephFS)
> * plugin=clay k=5 m=2 d=6 crush-failure-domain=host
>
> Thanks and best regards
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to judge the results? - rados bench comparison

2019-04-08 Thread Lars Täuber
Hi there,

I'm new to Ceph and just got my first cluster running.
Now I'd like to know whether the performance we get is what should be expected.

Is there a website with benchmark results somewhere where I could have a look
to compare with our HW and our results?

These are the results:
rados bench single threaded:
# rados bench 10 write --rbd-cache=false -t 1

Object size:            4194304
Bandwidth (MB/sec):     53.7186
Stddev Bandwidth:       3.86437
Max bandwidth (MB/sec): 60
Min bandwidth (MB/sec): 48
Average IOPS:           13
Stddev IOPS:            0.966092
Average Latency(s):     0.0744599
Stddev Latency(s):      0.00911778

nearly maxing out one (idle) client with 28 threads
# rados bench 10 write --rbd-cache=false -t 28

Bandwidth (MB/sec):     850.451
Stddev Bandwidth:       40.6699
Max bandwidth (MB/sec): 904
Min bandwidth (MB/sec): 748
Average IOPS:           212
Stddev IOPS:            10.1675
Average Latency(s):     0.131309
Stddev Latency(s):      0.0318489

four concurrent benchmarks on four clients each with 24 threads:
Bandwidth (MB/sec):     396   376   381   389
Stddev Bandwidth:       30    25    22    22
Max bandwidth (MB/sec): 440   420   416   428
Min bandwidth (MB/sec): 352   348   344   364
Average IOPS:           99    94    95    97
Stddev IOPS:            7.5   6.3   5.6   5.6
Average Latency(s):     0.24  0.25  0.25  0.24
Stddev Latency(s):      0.12  0.15  0.15  0.14
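For reference, the four concurrent runs can be driven with something like this
(hostnames and pool name are made up; --run-name keeps the clients' benchmark
objects apart):

  for h in client1 client2 client3 client4; do
      ssh "$h" "rados bench -p bench 10 write --rbd-cache=false -t 24 \
          --run-name bench_$h --no-cleanup" &
  done
  wait   # then collect the per-client summaries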

summing up: write mode
~1500 MB/sec Bandwidth
~385 IOPS
~0.25s Latency

rand mode:
~3500 MB/sec
~920 IOPS
~0.154s Latency
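For context, rand mode reads back objects left behind by a previous write run,
roughly like this (pool name made up):

  rados bench -p bench 10 write -t 24 --no-cleanup   # leave the objects in place
  rados bench -p bench 10 rand -t 24                 # random-read them back
  rados -p bench cleanup                             # remove the benchmark objects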



Maybe someone could judge our numbers. I am actually very satisfied with the 
values.

The (mostly idle) cluster is built from these components:
* 10GBit frontend network, bonding two connections to mon-, mds- and osd-nodes
** no bonding to clients
* 25GBit backend network, bonding two connections to osd-nodes


cluster:
* 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
* 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128GB RAM
* 7x OSD-nodes, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per chassis 
for later growth)
** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM ~116 GiB per OSD for 
DB and WAL

erasure-coded pool (made for CephFS):
* plugin=clay k=5 m=2 d=6 crush-failure-domain=host
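For anyone who wants to reproduce the layout, such a profile and pool can be
created along these lines (profile, pool and filesystem names are made up
here; pg numbers are placeholders):

  ceph osd erasure-code-profile set clay_5_2 \
      plugin=clay k=5 m=2 d=6 crush-failure-domain=host
  ceph osd pool create cephfs_data_ec 128 128 erasure clay_5_2
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true  # required for CephFS data on EC
  ceph fs add_data_pool cephfs cephfs_data_ec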

Thanks and best regards
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com