Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-11 Thread David Clarke
On 9/03/19 10:07 PM, Victor Hooi wrote:
> Hi,
> 
> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
> based around Intel Optane 900P drives (which are meant to be the bee's
> knees), and I'm seeing pretty low IOPS/bandwidth.

We found that CPU settings, specifically power (C-state) settings, play
a large part in latency, and therefore IOPS.  This wasn't too evident
with spinning disks, but it makes a large percentage difference in our
NVMe-based clusters.

You may want to investigate setting processor.max_cstate=1 or
intel_idle.max_cstate=1 (whichever is appropriate for your CPUs and
kernel) on the kernel boot command line.
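
A rough sketch of how that can look on a GRUB-based system (the exact
file, flags and commands depend on your distro - treat this as an
example only):

-
# /etc/default/grub  (append to the existing GRUB_CMDLINE_LINUX_DEFAULT)
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_idle.max_cstate=1 processor.max_cstate=1"

update-grub        # or grub2-mkconfig -o /boot/grub2/grub.cfg, then reboot

# afterwards, verify what the kernel is actually using
cat /sys/module/intel_idle/parameters/max_cstate
cpupower idle-info | head
-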



-- 
David Clarke
Systems Architect
Catalyst IT





Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-11 Thread Vitaliy Filippov
These options aren't needed: numjobs is 1 by default, and RBD has no "sync"
concept at all - operations are always "sync" by default.

In fact, even --direct=1 may be redundant because there's no page cache
involved. However, I keep it just in case - there is the RBD cache, and what
if one day fio gets it enabled? :)
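
In other words, with -ioengine=rbd these two invocations should behave the
same (assuming fio simply ignores --sync for the rbd engine, as described
above):

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=bench -rbdname=testimg
fio -ioengine=rbd -direct=1 -sync=1 -numjobs=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=bench -rbdname=testimg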



how about adding:  --sync=1 --numjobs=1  to the command as well?


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-11 Thread solarflow99
how about adding:  --sync=1 --numjobs=1  to the command as well?



On Sat, Mar 9, 2019 at 12:09 PM Vitaliy Filippov  wrote:

> There are 2:
>
> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite
> -pool=bench -rbdname=testimg
>
> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite
> -pool=bench -rbdname=testimg
>
> The first measures your min possible latency - it does not scale with the
> number of OSDs at all, but it's usually what real applications like
> DBMSes
> need.
>
> The second measures your max possible random write throughput which you
> probably won't be able to utilize if you don't have enough VMs all
> writing
> in parallel.
>
> --
> With best regards,
>Vitaliy Filippov
>


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Виталий Филиппов
Is that a question for me or for Victor? :-)

I did test my drives; Intel NVMes are capable of something like 95100
single-thread iops.

On 10 March 2019, 01:31:15 GMT+03:00, Martin Verges wrote:
>Hello,
>
>did you test the performance of your individual drives?
>
>Here is a small snippet:
>-
>DRIVE=/dev/XXX
>smartctl -a $DRIVE
>for i in 1 2 4 8 16; do echo "Test $i"; fio --filename=$DRIVE
>--direct=1
>--sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1 --runtime=60
>--time_based --group_reporting --name=journal-test; done
>-
>
>Please share the results so that we know what's possible with your
>hardware.
>
>--
>Martin Verges
>Managing director
>
>Mobile: +49 174 9335695
>E-Mail: martin.ver...@croit.io
>Chat: https://t.me/MartinVerges
>
>croit GmbH, Freseniusstr. 31h, 81247 Munich
>CEO: Martin Verges - VAT-ID: DE310638492
>Com. register: Amtsgericht Munich HRB 231263
>
>Web: https://croit.io
>YouTube: https://goo.gl/PGE1Bx
>
>Vitaliy Filippov  schrieb am Sa., 9. März 2019,
>21:09:
>
>> There are 2:
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1
>-rw=randwrite
>> -pool=bench -rbdname=testimg
>>
>> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128
>-rw=randwrite
>> -pool=bench -rbdname=testimg
>>
>> The first measures your min possible latency - it does not scale with
>the
>> number of OSDs at all, but it's usually what real applications like
>> DBMSes
>> need.
>>
>> The second measures your max possible random write throughput which
>you
>> probably won't be able to utilize if you don't have enough VMs all
>> writing
>> in parallel.
>>
>> --
>> With best regards,
>>Vitaliy Filippov
>>

-- 
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Martin Verges
Hello,

did you test the performance of your individual drives?

Here is a small snippet:
-
DRIVE=/dev/XXX
smartctl -a $DRIVE
for i in 1 2 4 8 16; do echo "Test $i"; fio --filename=$DRIVE --direct=1
--sync=1 --rw=write --bs=4k --numjobs=$i --iodepth=1 --runtime=60
--time_based --group_reporting --name=journal-test; done
-

Please share the results so that we know what's possible with your hardware.
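
If you prefer not to overwrite the device, a variation of the same test can
be pointed at a file on a mounted filesystem instead (the path and size below
are only placeholders):

-
# non-destructive variant: measures filesystem + drive sync write latency
fio --filename=/mnt/test/journal-test.tmp --size=4G --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test
-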

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx

Vitaliy Filippov  schrieb am Sa., 9. März 2019, 21:09:

> There are 2:
>
> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite
> -pool=bench -rbdname=testimg
>
> fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite
> -pool=bench -rbdname=testimg
>
> The first measures your min possible latency - it does not scale with the
> number of OSDs at all, but it's usually what real applications like
> DBMSes
> need.
>
> The second measures your max possible random write throughput which you
> probably won't be able to utilize if you don't have enough VMs all
> writing
> in parallel.
>
> --
> With best regards,
>Vitaliy Filippov
>


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Vitaliy Filippov

There are 2:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite  
-pool=bench -rbdname=testimg


fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite  
-pool=bench -rbdname=testimg


The first measures your min possible latency - it does not scale with the  
number of OSDs at all, but it's usually what real applications like DBMSes  
need.


The second measures your max possible random write throughput which you  
probably won't be able to utilize if you don't have enough VMs all writing  
in parallel.


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Victor Hooi
Hi,

I have retested with 4K blocks - results are below.

I am currently using 4 OSDs per Optane 900P drive. This was based on some
posts I found on Proxmox Forums, and what seems to be "tribal knowledge"
there.

I also saw this presentation, which mentions on page 14:

2-4 OSDs/NVMe SSD and 4-6 NVMe SSDs per node are sweet spots


Has anybody done much testing with pure Optane drives for Ceph? (The paper
above seems to use them mixed with traditional SSDs.)

Would increasing the number of OSDs help in this scenario? I am happy to
try that - I assume I will need to blow away all the existing OSDs/Ceph
setup and start again, of course.
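
(For what it's worth, I assume the redeploy would be something along these
lines with ceph-volume - the device name and OSD count here are just my
guesses, and Proxmox's pveceph tooling may wrap this differently:

# WARNING: wipes the device
ceph-volume lvm zap /dev/nvme0n1 --destroy
ceph-volume lvm batch --osds-per-device 4 /dev/nvme0n1

Please correct me if that is not the right approach.)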

Here are the rados bench results with 4K - the write IOPS are still a tad
short of 15,000 - is that what I should be aiming for?

Write result:

# rados bench -p proxmox_vms 60 write -b 4K -t 16 --no-cleanup
Total time run: 60.001016
Total writes made:  726749
Write size: 4096
Object size:4096
Bandwidth (MB/sec): 47.3136
Stddev Bandwidth:   2.16408
Max bandwidth (MB/sec): 48.7344
Min bandwidth (MB/sec): 38.5078
Average IOPS:   12112
Stddev IOPS:554
Max IOPS:   12476
Min IOPS:   9858
Average Latency(s): 0.00132019
Stddev Latency(s):  0.000670617
Max latency(s): 0.065541
Min latency(s): 0.000689406


Sequential read result:

# rados bench -p proxmox_vms  60 seq -t 16
Total time run:   17.098593
Total reads made: 726749
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   166.029
Average IOPS: 42503
Stddev IOPS:  218
Max IOPS: 42978
Min IOPS: 42192
Average Latency(s):   0.000369021
Max latency(s):   0.00543175
Min latency(s):   0.000170024


Random read result:

# rados bench -p proxmox_vms 60 rand -t 16
Total time run:   60.000282
Total reads made: 2708799
Read size:4096
Object size:  4096
Bandwidth (MB/sec):   176.353
Average IOPS: 45146
Stddev IOPS:  310
Max IOPS: 45754
Min IOPS: 44506
Average Latency(s):   0.000347637
Max latency(s):   0.00457886
Min latency(s):   0.000138381


I am happy to try with fio -ioengine=rbd (the reason I used rados bench is
that it is what the Proxmox Ceph benchmark paper used). However, is there a
common, community-suggested starting command line that makes it easy to
compare results? (fio seems quite complex in terms of options.)
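
(I am guessing the starting point would be something like Vitaliy's two
commands from earlier in the thread, perhaps with a fixed runtime added - the
-runtime/-time_based parts are my own addition, so please correct me:

fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -time_based -pool=bench -rbdname=testimg
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -time_based -pool=bench -rbdname=testimg

Is that roughly what people use?)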

Thanks,
Victor

On Sun, Mar 10, 2019 at 6:15 AM Vitaliy Filippov  wrote:

> Welcome to our "slow ceph" party :)))
>
> However I have to note that:
>
> 1) 500000 iops is for 4 KB blocks. You're testing it with 4 MB ones.
> That's kind of an unfair comparison.
>
> 2) fio -ioengine=rbd is better than rados bench for testing.
>
> 3) You can't "compensate" for Ceph's overhead even by having infinitely
> fast disks.
>
> At its simplest, imagine that disk I/O takes X microseconds and Ceph's
> overhead is Y for a single operation.
>
> Suppose there is no parallelism. Then raw disk IOPS = 1000000/X and Ceph
> IOPS = 1000000/(X+Y). Y is currently quite long, something around 400-800
> microseconds or so. So the best IOPS number you can squeeze out of a
> single client thread (a DBMS, for example) is 1000000/400 = only ~2500
> iops.
>
> Parallel iops are of course better, but still you won't get anything close
> to 500000 iops from a single OSD. The expected number is around 15000.
> Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you
> want better results.
>
> --
> With best regards,
>Vitaliy Filippov
>


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Vitaliy Filippov

Welcome to our "slow ceph" party :)))

However I have to note that:

1) 500000 iops is for 4 KB blocks. You're testing it with 4 MB ones.
That's kind of an unfair comparison.


2) fio -ioengine=rbd is better than rados bench for testing.

3) You can't "compensate" for Ceph's overhead even by having infinitely  
fast disks.


At its simplest, imagine that disk I/O takes X microseconds and Ceph's  
overhead is Y for a single operation.


Suppose there is no parallelism. Then raw disk IOPS = 1000000/X and Ceph
IOPS = 1000000/(X+Y). Y is currently quite long, something around 400-800
microseconds or so. So the best IOPS number you can squeeze out of a
single client thread (a DBMS, for example) is 1000000/400 = only ~2500
iops.

Parallel iops are of course better, but still you won't get anything close
to 500000 iops from a single OSD. The expected number is around 15000.
Create multiple OSDs on a single NVMe and sacrifice your CPU usage if you
want better results.
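
As a rough worked example (the disk latency here is only an assumption for an
Optane-class drive): with X = 20 microseconds and Y = 500 microseconds, the
raw drive alone could do 1000000/20 = 50000 single-thread iops, but through
Ceph you get 1000000/(20+500) ≈ 1900 iops - the drive latency is almost
irrelevant.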


--
With best regards,
  Vitaliy Filippov


Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Victor Hooi
Hi Ashley,

Right - so the 50% bandwidth is OK, I guess, but it was more the drop in
IOPS that was concerning (hence the subject line about 200 IOPS) *sad face*.

That, and the Optane drives weren't exactly cheap, and I was hoping they
would compensate for the overhead of Ceph.

Each Optane drive is rated for around 550,000 IOPS (random read) and 500,000
IOPS (random write). Yet we're seeing it drop to around 0.04% of that in
testing (200 IOPS). Is that sort of drop in IOPS normal for Ceph?

Each node can take up to 8 x 2.5" drives. If I loaded up, say, 4 cheap SSDs
in each (e.g. Intel S3700 SSDs) instead of one Optane drive per node, would
that give better performance with 4 x 3 = 12 drives? (Would I still put 4
OSDs per physical drive?) Or is there some way to supplement the Optanes with
SSDs? (Although I would assume any SSD I get is going to be slower than an
Optane drive.)

Or are there tweaks I can do to either configuration, or our layout that
could eke out more IOPS?

(This is going to be used for VM hosting, so IOPS is definitely a concern).

Thanks,
Victor

On Sat, Mar 9, 2019 at 9:27 PM Ashley Merrick 
wrote:

> What kind of results are you expecting?
>
> Looking at the specs they are "up to" 2000 MB/s write and 2500 MB/s read, so
> you're at around 50-60% of the "up to" speed, which I wouldn't say is too bad
> given that Ceph/BlueStore has overhead, especially when using a single disk
> for DB & WAL & content.
>
> Remember Ceph scales with the number of physical disks you have. As you only
> have 3 disks, every piece of I/O is hitting all 3 disks; if you had 6 disks,
> for example, and still did replication of 3, then only 50% of I/O would be
> hitting each disk, therefore I'd expect to see performance jump.
>
> On Sat, Mar 9, 2019 at 5:08 PM Victor Hooi  wrote:
>
>> Hi,
>>
>> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
>> based around Intel Optane 900P drives (which are meant to be the bee's
>> knees), and I'm seeing pretty low IOPS/bandwidth.
>>
>>    - 3 nodes, each running a Ceph monitor daemon, and OSDs.
>>    - Node 1 has 48 GB of RAM and 10 cores (Intel 4114), and Nodes 2 and 3
>>      have 32 GB of RAM and 4 cores (Intel E3-1230V6).
>>    - Each node has an Intel Optane 900p (480GB) NVMe dedicated to Ceph.
>>    - 4 OSDs per node (total of 12 OSDs).
>>    - NICs are Intel X520-DA2, with 10GBASE-LR going to a Unifi US-XG-16.
>>    - The first 10GB port is for Proxmox VM traffic, the second 10GB port is
>>      for Ceph traffic.
>>
>> I created a new Ceph pool specifically for benchmarking with 128 PGs.
>>
>> Write results:
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16
>> --no-cleanup
>> 
>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
>> lat(s)
>>60  16 12258 12242   816.055   788   0.0856726
>> 0.0783458
>> Total time run: 60.069008
>> Total writes made:  12258
>> Write size: 4194304
>> Object size:4194304
>> Bandwidth (MB/sec): 816.261
>> Stddev Bandwidth:   17.4584
>> Max bandwidth (MB/sec): 856
>> Min bandwidth (MB/sec): 780
>> Average IOPS:   204
>> Stddev IOPS:4
>> Max IOPS:   214
>> Min IOPS:   195
>> Average Latency(s): 0.0783801
>> Stddev Latency(s):  0.0468404
>> Max latency(s): 0.437235
>> Min latency(s): 0.0177178
>>
>>
>> Sequential read results - I don't know why this only ran for 32 seconds?
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
>> 
>> Total time run:   32.608549
>> Total reads made: 12258
>> Read size:4194304
>> Object size:  4194304
>> Bandwidth (MB/sec):   1503.65
>> Average IOPS: 375
>> Stddev IOPS:  22
>> Max IOPS: 410
>> Min IOPS: 326
>> Average Latency(s):   0.0412777
>> Max latency(s):   0.498116
>> Min latency(s):   0.00447062
>>
>>
>> Random read result:
>>
>> root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
>> 
>> Total time run:   60.066384
>> Total reads made: 22819
>> Read size:4194304
>> Object size:  4194304
>> Bandwidth (MB/sec):   1519.59
>> Average IOPS: 379
>> Stddev IOPS:  21
>> Max IOPS: 424
>> Min IOPS: 320
>> Average Latency(s):   0.0408697
>> Max latency(s):   0.662955

Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Konstantin Shalygin

These results (800 MB/s writes, 1500 MB/s reads, and 200 write IOPS, 400
read IOPS) seem incredibly low - particularly considering what the Optane
900p is meant to be capable of.

Is this in line with what you might expect on this hardware with Ceph,
though?

Or is there some way to find out the source of the bottleneck?


4 MByte * 200 IOPS = 800 MB/s. What bottleneck exactly do you mean?

Try using 4K instead of 4M for an IOPS-oriented load.
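
For example (pool name taken from your original post):

rados bench -p benchmarking 60 write -b 4K -t 16 --no-cleanup
rados bench -p benchmarking 60 rand -t 16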


k



Re: [ceph-users] 3-node cluster with 3 x Intel Optane 900P - very low benchmarked performance (200 IOPS)?

2019-03-09 Thread Ashley Merrick
What kind of results are you expecting?

Looking at the specs they are "up to" 2000 MB/s write and 2500 MB/s read, so
you're at around 50-60% of the "up to" speed, which I wouldn't say is too bad
given that Ceph/BlueStore has overhead, especially when using a single disk
for DB & WAL & content.

Remember Ceph scales with the number of physical disks you have. As you only
have 3 disks, every piece of I/O is hitting all 3 disks; if you had 6 disks,
for example, and still did replication of 3, then only 50% of I/O would be
hitting each disk, therefore I'd expect to see performance jump.

On Sat, Mar 9, 2019 at 5:08 PM Victor Hooi  wrote:

> Hi,
>
> I'm setting up a 3-node Proxmox cluster with Ceph as the shared storage,
> based around Intel Optane 900P drives (which are meant to be the bee's
> knees), and I'm seeing pretty low IOPS/bandwidth.
>
>    - 3 nodes, each running a Ceph monitor daemon, and OSDs.
>    - Node 1 has 48 GB of RAM and 10 cores (Intel 4114), and Nodes 2 and 3
>      have 32 GB of RAM and 4 cores (Intel E3-1230V6).
>    - Each node has an Intel Optane 900p (480GB) NVMe dedicated to Ceph.
>    - 4 OSDs per node (total of 12 OSDs).
>    - NICs are Intel X520-DA2, with 10GBASE-LR going to a Unifi US-XG-16.
>    - The first 10GB port is for Proxmox VM traffic, the second 10GB port is
>      for Ceph traffic.
>
> I created a new Ceph pool specifically for benchmarking with 128 PGs.
>
> Write results:
>
> root@vwnode1:~# rados bench -p benchmarking 60 write -b 4M -t 16
> --no-cleanup
> 
>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg
> lat(s)
>60  16 12258 12242   816.055   788   0.0856726
> 0.0783458
> Total time run: 60.069008
> Total writes made:  12258
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 816.261
> Stddev Bandwidth:   17.4584
> Max bandwidth (MB/sec): 856
> Min bandwidth (MB/sec): 780
> Average IOPS:   204
> Stddev IOPS:4
> Max IOPS:   214
> Min IOPS:   195
> Average Latency(s): 0.0783801
> Stddev Latency(s):  0.0468404
> Max latency(s): 0.437235
> Min latency(s): 0.0177178
>
>
> Sequential read results - I don't know why this only ran for 32 seconds?
>
> root@vwnode1:~# rados bench -p benchmarking 60 seq -t 16
> 
> Total time run:   32.608549
> Total reads made: 12258
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1503.65
> Average IOPS: 375
> Stddev IOPS:  22
> Max IOPS: 410
> Min IOPS: 326
> Average Latency(s):   0.0412777
> Max latency(s):   0.498116
> Min latency(s):   0.00447062
>
>
> Random read result:
>
> root@vwnode1:~# rados bench -p benchmarking 60 rand -t 16
> 
> Total time run:   60.066384
> Total reads made: 22819
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1519.59
> Average IOPS: 379
> Stddev IOPS:  21
> Max IOPS: 424
> Min IOPS: 320
> Average Latency(s):   0.0408697
> Max latency(s):   0.662955
> Min latency(s):   0.00172077
>
>
> I then cleaned-up with:
>
> root@vwnode1:~# rados -p benchmarking cleanup
> Removed 12258 objects
>
>
> I then tested with another Ceph pool, with 512 PGs (originally created for
> Proxmox VMs) - results seem quite similar:
>
> root@vwnode1:~# rados bench -p proxmox_vms 60 write -b 4M -t 16
> --no-cleanup
> 
> Total time run: 60.041712
> Total writes made:  12132
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 808.238
> Stddev Bandwidth:   20.7444
> Max bandwidth (MB/sec): 860
> Min bandwidth (MB/sec): 744
> Average IOPS:   202
> Stddev IOPS:5
> Max IOPS:   215
> Min IOPS:   186
> Average Latency(s): 0.0791746
> Stddev Latency(s):  0.0432707
> Max latency(s): 0.42535
> Min latency(s): 0.0200791
>
>
> Sequential read result - once again, only ran for 32 seconds:
>
> root@vwnode1:~# rados bench -p proxmox_vms 60 seq -t 16
> 
> Total time run:   31.249274
> Total reads made: 12132
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   1552.93
> Average IOPS: 388
> Stddev IOPS:  30
>