Re: [ceph-users] BAD nvme SSD performance

2015-10-27 Thread Christian Balzer

Hello,

On Tue, 27 Oct 2015 11:37:42 + Matteo Dacrema wrote:

> Hi,
> 
> thanks for all the replies.
> 
> I've found the issue: 
> The Samsung NVMe SSD has poor performance with sync=1. It reaches only
> 4-5k IOPS with randwrite ops.
> 
> Using Intel DC S3700 SSDs I'm able to saturate the CPU.
> 
That's what I thought; also keep that CPU saturation in mind for any
further test subjects.

> I'm using hammer v 0.94.5 on Ubuntu 14.04 and 3.19.0-31 kernel
> 
> What do you think about Intel 750 series :
> http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-750-series.html
> 
> I plan to use it for the cache layer (one per host - is it a problem?).
> Behind the cache layer I plan to use mechanical HDDs with the journal on
> SSD drives.
> 

That SSD has been mentioned in this ML pretty much the day it was
released. 
It certainly looks fast, but so did your Samsung one.

Whether or not it is actually fast with sync writes I can't tell, and
probably nobody has actually tested or deployed it with Ceph yet.

Why?

Because at 70GB of write endurance per day, it is very hard to come up with a
use case where this kind of performance would be required while still fitting
into such a small daily write volume.

This is particularly the case in a cache tier situation, because unless
your working set (hot data) fits into your cache tier, Ceph will
constantly promote/evict objects to and from it.
That means READS will frequently result in WRITES to the cache
tier before they can be served to the client.

With a purely SSD-based Ceph pool I would take my estimated writes per day
and multiply them by 5-10; if that is still within the endurance envelope of
the SSD, fine.
But with a cache tier the multiplier becomes even higher and much less
predictable.
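
To put rough numbers on that (the client write figure below is purely an
illustrative assumption, not something measured in this thread):

  assumed client writes per day:               20 GB
  x 5-10 (journal double-writes, replication,
          cache tier promotions/evictions):    100-200 GB/day per SSD
  rated endurance of the 750:                  ~70 GB/day

Even a fairly light write load blows past the rating several times over once
Ceph's write amplification and cache tier churn are factored in.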

Regards,

Christian

> What do you think about it?
> 
> Thanks
> Regards,
> Matteo
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Somnath Roy Sent: Monday, 26 October 2015 17:45
> To: Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] BAD nvme SSD performance
> 
> Another point,
> As Christian mentioned, try to evaluate the O_DIRECT|O_DSYNC performance of
> an SSD before choosing it for Ceph. Try running fio with direct=1 and
> sync=1 against the raw SSD device.
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Somnath Roy Sent: Monday, October 26, 2015 9:20 AM
> To: Christian Balzer; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] BAD nvme SSD performance
> 
> One thing: *don't* trust iostat's disk util% in the case of SSDs. 100%
> doesn't mean you are saturating the SSD there; I have seen a large
> performance delta even when iostat was reporting 100% disk util in both
> cases.
> Also, the ceph.conf file you are using is not optimal. Try adding these:
> 
> debug_lockdep = 0/0
> debug_context = 0/0
> debug_crush = 0/0
> debug_buffer = 0/0
> debug_timer = 0/0
> debug_filer = 0/0
> debug_objecter = 0/0
> debug_rados = 0/0
> debug_rbd = 0/0
> debug_journaler = 0/0
> debug_objectcatcher = 0/0
> debug_client = 0/0
> debug_osd = 0/0
> debug_optracker = 0/0
> debug_objclass = 0/0
> debug_filestore = 0/0
> debug_journal = 0/0
> debug_ms = 0/0
> debug_monc = 0/0
> debug_tp = 0/0
> debug_auth = 0/0
> debug_finisher = 0/0
> debug_heartbeatmap = 0/0
> debug_perfcounter = 0/0
> debug_asok = 0/0
> debug_throttle = 0/0
> debug_mon = 0/0
> debug_paxos = 0/0
> debug_rgw = 0/0
> 
> You didn't mention anything about your CPU; assuming you have a powerful
> CPU complex for SSDs, tweak these to a high number of shards. It also
> depends on the number of OSDs per box:
> 
> osd_op_num_threads_per_shard
> osd_op_num_shards
> 
> 
> Don't need to change the following..
> 
> osd_disk_threads
> osd_op_threads
> 
> 
> Instead, try increasing..
> 
> filestore_op_threads
> 
> Use the following in the global section..
> 
> ms_dispatch_throttle_bytes = 0
> throttler_perf_counter = false
> 
> Change the following..
> filestore_max_sync_interval = 1   (or even lower, need to lower
> filestore_min_sync_interval as well)
> 
> 
> I am assuming you are using hammer or newer..
> 
> Thanks & Regards
> Somnath
> 
> Try increasing the following to very big numbers..
> 
> > > filestore_queue_max_ops = 2000
> > >
> > > filestore_queue_max_bytes = 536870912
> > >
> > > filestore_queue_committing_max_ops = 500
> > >
> > > filestore_queue_committing_max_bytes = 268435456
> 
> Use the following..
> 
> osd_enable_op_tracker = false

Re: [ceph-users] BAD nvme SSD performance

2015-10-27 Thread Matteo Dacrema
Hi,

thanks for all the replies.

I've found the issue: 
The Samsung NVMe SSD has poor performance with sync=1. It reaches only 4-5k
IOPS with randwrite ops.

Using Intel DC S3700 SSDs I'm able to saturate the CPU.

I'm using hammer v 0.94.5 on Ubuntu 14.04 and 3.19.0-31 kernel

What do you think about Intel 750 series : 
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-750-series.html

I plan to use it for the cache layer (one per host - is it a problem?).
Behind the cache layer I plan to use mechanical HDDs with the journal on SSD drives.

What do you think about it?

Thanks
Regards,
Matteo

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, 26 October 2015 17:45
To: Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

Another point,
As Christian mentioned, try to evaluate the O_DIRECT|O_DSYNC performance of an
SSD before choosing it for Ceph.
Try running fio with direct=1 and sync=1 against the raw SSD device.
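
For reference, a minimal single-job sync write test along those lines might
look like the following. The device path is a placeholder for the NVMe drive
under test, and the run writes to the raw device, so only point it at a disk
you can wipe:

fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=sync-write-test

A data center SSD like the S3700 mentioned above holds up well under this
test, while many client drives collapse to a few thousand IOPS, which matches
the 4-5k figure reported for the Samsung.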

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, October 26, 2015 9:20 AM
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

One thing: *don't* trust iostat's disk util% in the case of SSDs. 100% doesn't
mean you are saturating the SSD there; I have seen a large performance delta
even when iostat was reporting 100% disk util in both cases.
Also, the ceph.conf file you are using is not optimal. Try adding these:

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your CPU; assuming you have a powerful CPU
complex for SSDs, tweak these to a high number of shards. It also depends on
the number of OSDs per box (a consolidated example snippet follows these
settings):

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer or newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..

> > filestore_queue_max_ops = 2000
> >
> > filestore_queue_max_bytes = 536870912
> >
> > filestore_queue_committing_max_ops = 500
> >
> > filestore_queue_committing_max_bytes = 268435456

Use the following..

osd_enable_op_tracker = false
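
Pulled together, the non-debug suggestions above might land in ceph.conf
roughly like this. It is only a sketch; the numeric values are illustrative
placeholders that depend on your CPU core count and OSDs per box, not values
taken from this thread:

[global]
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

[osd]
osd_enable_op_tracker = false
# shard counts are example values; size them to your CPU and OSD count
osd_op_num_shards = 10
osd_op_num_threads_per_shard = 2
# example value, raised from the default
filestore_op_threads = 8
# keep the min interval below the max
filestore_min_sync_interval = 0.1
filestore_max_sync_interval = 1
# example "very big" queue limits
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1073741824
filestore_queue_committing_max_ops = 5000
filestore_queue_committing_max_bytes = 1073741824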


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

>
>
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> >
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> >
> >
> > With iostat I noticed that the disk is fully utilized and writes per
> > second from iostat match the fio output.
> >
>
> Ceph isn't fully optimized to get the maximum potential out of NVME 
> SSDs yet.
>
Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.
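
A minimal way to check both at once (assuming atop is installed) is to run it
with a short sampling interval while the fio test is going:

atop 2

The DSK line for the NVMe device shows its busy percentage, and the per-core
cpu lines make it obvious if a single core is pegged even when the average
CPU figure looks fine.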

My guess would be these particular NVMe SSDs might just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is reaffirmed by seeing Samsung list them as client SSDs,
not data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVMe SSDs work best with very high queue depths and
> parallel IOPS.
>
> Also, be aware that Ceph adds multiple layers to the whole I/O
> subsystem and that there will be a performance impact when Ceph is used in
> between.
>
> Wido
>
> >
> >
> > Matteo
> >
> >
> >
> > *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> > *Sent:* Monday, 26 October 2015 

Re: [ceph-users] BAD nvme SSD performance

2015-10-27 Thread Mark Nelson

On 10/27/2015 06:37 AM, Matteo Dacrema wrote:

Hi,

thanks for all the replies.

I've found the issue:
The Samsung NVMe SSD has poor performance with sync=1. It reaches only 4-5k
IOPS with randwrite ops.

Using Intel DC S3700 SSDs I'm able to saturate the CPU.

I'm using hammer v 0.94.5 on Ubuntu 14.04 and 3.19.0-31 kernel

What do you think about Intel 750 series : 
http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-750-series.html


I briefly considered 750s for some test boxes that wouldn't actually
hold real data, but even in that situation the write endurance is pretty
scary looking.  I imagine they are probably going to do better than the
very low rating Intel gives them (they sure look a lot like rebadged
P3500s), but in the end I went with P3700s.  The rated write
endurance is just so much higher that it was worth the extra price (to
us at least).




I plan to use it for the cache layer (one per host - is it a problem?).
Behind the cache layer I plan to use mechanical HDDs with the journal on SSD drives.

What do you think about it?

Thanks
Regards,
Matteo

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, 26 October 2015 17:45
To: Christian Balzer <ch...@gol.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

Another point,
As Christian mentioned, try to evaluate the O_DIRECT|O_DSYNC performance of an
SSD before choosing it for Ceph.
Try running fio with direct=1 and sync=1 against the raw SSD device.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, October 26, 2015 9:20 AM
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

One thing: *don't* trust iostat's disk util% in the case of SSDs. 100% doesn't
mean you are saturating the SSD there; I have seen a large performance delta
even when iostat was reporting 100% disk util in both cases.
Also, the ceph.conf file you are using is not optimal. Try adding these:

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your CPU; assuming you have a powerful CPU
complex for SSDs, tweak these to a high number of shards. It also depends on
the number of OSDs per box:

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer or newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..


filestore_queue_max_ops = 2000

filestore_queue_max_bytes = 536870912

filestore_queue_committing_max_ops = 500

filestore_queue_committing_max_bytes = 268435456


Use the following..

osd_enable_op_tracker = false


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:




On 26-10-15 14:29, Matteo Dacrema wrote:

Hi Nick,



I also tried to increase iodepth but nothing has changed.



With iostat I noticed that the disk is fully utilized and writes per
second from iostat match the fio output.



Ceph isn't fully optimized to get the maximum potential out of NVME
SSDs yet.


Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.

My guess would be these particular NVMe SSDs might just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is reaffirmed by seeing Samsung list them as client SSDs,
not data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian


For example, NVMe SSDs work best with very high queue depths and
parallel IOPS.

Also, be aware that Ceph adds multiple layers to the whole

Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Nick Fisk
Hi Matteo,

 

Ceph introduces latency into the write path and so what you are seeing is
typical. If you increase the iodepth of the fio test you should get higher
results though, until you start maxing out your CPU.
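
Concretely, the fio command quoted below could be changed from 200 jobs at the
default queue depth of 1 to a few jobs with a deeper queue, for example (the
depth and job count here are just illustrative starting points, not tested
values):

fio --ioengine=libaio --direct=1 --name=test --filename=test --bs=4k \
    --size=100M --readwrite=randwrite --iodepth=32 --numjobs=4 --group_reporting

Keep raising the iodepth until client IOPS stop scaling or the OSD host CPUs
saturate.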

 

Nick

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
Matteo Dacrema
Sent: 26 October 2015 11:20
To: ceph-us...@ceph.com
Subject: [ceph-users] BAD nvme SSD performance

 

Hi all,

 

I've recently bought two Samsung SM951 256GB NVMe PCIe SSDs and built a 2 OSD
ceph cluster with min_size = 1.

I've tested them with fio and obtained two very different results in these
two situations.

This is the command : fio  --ioengine=libaio --direct=1  --name=test
--filename=test --bs=4k  --size=100M --readwrite=randwrite  --numjobs=200
--group_reporting

 

On the OSD host I've obtained this result:

bw=575493KB/s, iops=143873

 

On the client host with a mounted volume I've obtained this result:

 

Fio executed on the client osd with a mounted volume:

bw=9288.1KB/s, iops=2322

 

I've obtained these results with the journal and data on the same disk and also
with the journal on a separate SSD.

 

I've two OSD host with 64GB of RAM and 2x Intel Xeon E5-2620 @ 2.00GHz and
one MON host with 128GB of RAM and 2x Intel Xeon E5-2620 @ 2.00 GHz.

I'm using 10G mellanox NIC and Switch with jumbo frames.

 

I also did other tests with this configuration (see the attached Excel
workbook).

Hardware configuration for each of the two OSD nodes:

3x  100GB Intel SSD DC S3700 with 3 * 30 GB partition for
every SSD 

9x  1TB Seagate HDD

Results: about 12k IOPS with 4k bs and same fio test.

 

I can't understand where the problem with the NVMe SSDs is.

Can anyone help me?

 

Here the ceph.conf:

[global]

fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7

mon_initial_members = mon1

mon_host = 192.168.1.3

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

osd_pool_default_size = 2

mon_client_hung_interval = 1.0

mon_client_ping_interval = 5.0

public_network = 192.168.1.0/24

cluster_network = 192.168.1.0/24

mon_osd_full_ratio = .90

mon_osd_nearfull_ratio = .85

 

[mon]

mon_warn_on_legacy_crush_tunables = false

 

[mon.1]

host = mon1

mon_addr = 192.168.1.3:6789

 

[osd]

osd_journal_size = 3

journal_dio = true

journal_aio = true

osd_op_threads = 24

osd_op_thread_timeout = 60

osd_disk_threads = 8

osd_recovery_threads = 2

osd_recovery_max_active = 1

osd_max_backfills = 2

osd_mkfs_type = xfs

osd_mkfs_options_xfs = "-f -i size=2048"

osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"

filestore_xattr_use_omap = false

filestore_max_inline_xattr_size = 512

filestore_max_sync_interval = 10

filestore_merge_threshold = 40

filestore_split_multiple = 8

filestore_flusher = false

filestore_queue_max_ops = 2000

filestore_queue_max_bytes = 536870912

filestore_queue_committing_max_ops = 500

filestore_queue_committing_max_bytes = 268435456

filestore_op_threads = 2

 

Best regards,

Matteo

 




 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Wido den Hollander


On 26-10-15 14:29, Matteo Dacrema wrote:
> Hi Nick,
> 
>  
> 
> I also tried to increase iodepth but nothing has changed.
> 
>  
> 
> With iostat I noticed that the disk is fully utilized and writes per
> second from iostat match the fio output.
> 

Ceph isn't fully optimized to get the maximum potential out of NVME SSDs
yet.

For example, NVMe SSDs work best with very high queue depths and
parallel IOPS.

Also, be aware that Ceph adds multiple layers to the whole I/O subsystem
and that there will be a performance impact when Ceph is used in between.

Wido

>  
> 
> Matteo
> 
>  
> 
> *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> *Sent:* Monday, 26 October 2015 13:06
> *To:* Matteo Dacrema ; ceph-us...@ceph.com
> *Subject:* RE: BAD nvme SSD performance
> 
>  
> 
> Hi Matteo,
> 
>  
> 
> Ceph introduces latency into the write path and so what you are seeing
> is typical. If you increase the iodepth of the fio test you should get
> higher results though, until you start maxing out your CPU.
> 
>  
> 
> Nick
> 
>  
> 
> *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Matteo Dacrema
> *Sent:* 26 October 2015 11:20
> *To:* ceph-us...@ceph.com 
> *Subject:* [ceph-users] BAD nvme SSD performance
> 
>  
> 
> Hi all,
> 
>  
> 
> I’ve recently bought two Samsung SM951 256GB NVMe PCIe SSDs and built a 2
> OSD ceph cluster with min_size = 1.
> 
> I’ve tested them with fio and obtained two very different results in these
> two situations.
> 
> This is the command : *fio  --ioengine=libaio --direct=1  --name=test
> --filename=test --bs=4k  --size=100M --readwrite=randwrite 
> --numjobs=200  --group_reporting*
> 
>  
> 
> On the OSD host I’ve obtained this result:
> 
> *bw=575493KB/s, iops=143873*
> 
> * *
> 
> On the client host with a mounted volume I’ve obtained this result:
> 
>  
> 
> Fio executed on the client osd with a mounted volume:
> 
> *bw=9288.1KB/s, iops=2322*
> 
> * *
> 
> I’ve obtained these results with the journal and data on the same disk and
> also with the journal on a separate SSD.
> 
> * *
> 
> I’ve two OSD host with 64GB of RAM and 2x Intel Xeon E5-2620 @ 2.00GHz
> and one MON host with 128GB of RAM and 2x Intel Xeon E5-2620 @ 2.00 GHz.
> 
> I’m using 10G mellanox NIC and Switch with jumbo frames.
> 
>  
> 
> I also did other tests with this configuration (see the attached Excel
> workbook).
> 
> Hardware configuration for each of the two OSD nodes:
> 
> 3x  100GB Intel SSD DC S3700 with 3 * 30 GB partition
> for every SSD
> 
> 9x  1TB Seagate HDD
> 
> Results: about *12k* IOPS with 4k bs and same fio test.
> 
>  
> 
> I can’t understand where the problem with the NVMe SSDs is.
> 
> Can anyone help me?
> 
>  
> 
> Here the *ceph.conf:*
> 
> [global]
> 
> fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7
> 
> mon_initial_members = mon1
> 
> mon_host = 192.168.1.3
> 
> auth_cluster_required = cephx
> 
> auth_service_required = cephx
> 
> auth_client_required = cephx
> 
> osd_pool_default_size = 2
> 
> mon_client_hung_interval = 1.0
> 
> mon_client_ping_interval = 5.0
> 
> public_network = 192.168.1.0/24
> 
> cluster_network = 192.168.1.0/24
> 
> mon_osd_full_ratio = .90
> 
> mon_osd_nearfull_ratio = .85
> 
>  
> 
> [mon]
> 
> mon_warn_on_legacy_crush_tunables = false
> 
>  
> 
> [mon.1]
> 
> host = mon1
> 
> mon_addr = 192.168.1.3:6789
> 
>  
> 
> [osd]
> 
> osd_journal_size = 3
> 
> journal_dio = true
> 
> journal_aio = true
> 
> osd_op_threads = 24
> 
> osd_op_thread_timeout = 60
> 
> osd_disk_threads = 8
> 
> osd_recovery_threads = 2
> 
> osd_recovery_max_active = 1
> 
> osd_max_backfills = 2
> 
> osd_mkfs_type = xfs
> 
> osd_mkfs_options_xfs = "-f -i size=2048"
> 
> osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
> 
> filestore_xattr_use_omap = false
> 
> filestore_max_inline_xattr_size = 512
> 
> filestore_max_sync_interval = 10
> 
> filestore_merge_threshold = 40
> 
> filestore_split_multiple = 8
> 
> filestore_flusher = false
> 
> filestore_queue_max_ops = 2000
> 
> filestore_queue_max_bytes = 536870912
> 
> filestore_queue_committing_max_ops = 500
> 
> filestore_queue_committing_max_bytes = 268435456
> 
> filestore_op_threads = 2
> 
>  
> 
> Best regards,
> 
> Matteo
> 
>  
> 
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Christian Balzer

Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

> 
> 
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> > 
> >  
> > 
> > I also tried to increase iodepth but nothing has changed.
> > 
> >  
> > 
> > With iostat I noticed that the disk is fully utilized and writes per
> > second from iostat match the fio output.
> > 
> 
> Ceph isn't fully optimized to get the maximum potential out of NVME SSDs
> yet.
> 
Indeed. Don't expect Ceph to be near raw SSD performance. 

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck. 

My guess would be these particular NVMe SSDs might just suffer from the
same direct sync I/O deficiencies as other Samsung SSDs.
This feeling is reaffirmed by seeing Samsung list them as client
SSDs, not data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVMe SSDs work best with very high queue depths and
> parallel IOPS.
> 
> Also, be aware that Ceph adds multiple layers to the whole I/O subsystem
> and that there will be a performance impact when Ceph is used in between.
> 
> Wido
> 
> >  
> > 
> > Matteo
> > 
> >  
> > 
> > *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> > *Sent:* Monday, 26 October 2015 13:06
> > *To:* Matteo Dacrema ; ceph-us...@ceph.com
> > *Subject:* RE: BAD nvme SSD performance
> > 
> >  
> > 
> > Hi Matteo,
> > 
> >  
> > 
> > Ceph introduces latency into the write path and so what you are seeing
> > is typical. If you increase the iodepth of the fio test you should get
> > higher results though, until you start maxing out your CPU.
> > 
> >  
> > 
> > Nick
> > 
> >  
> > 
> > *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> > Of *Matteo Dacrema
> > *Sent:* 26 October 2015 11:20
> > *To:* ceph-us...@ceph.com 
> > *Subject:* [ceph-users] BAD nvme SSD performance
> > 
> >  
> > 
> > Hi all,
> > 
> >  
> > 
> > I’ve recently bought two Samsung SM951 256GB NVMe PCIe SSDs and built a 2
> > OSD ceph cluster with min_size = 1.
> > 
> > I’ve tested them with fio and obtained two very different results in
> > these two situations.
> > 
> > This is the command : *fio  --ioengine=libaio --direct=1  --name=test
> > --filename=test --bs=4k  --size=100M --readwrite=randwrite 
> > --numjobs=200  --group_reporting*
> > 
> >  
> > 
> > On the OSD host I’ve obtained this result:
> > 
> > *bw=575493KB/s, iops=143873*
> > 
> > * *
> > 
> > On the client host with a mounted volume I’ve obtained this result:
> > 
> >  
> > 
> > Fio executed on the client osd with a mounted volume:
> > 
> > *bw=9288.1KB/s, iops=2322*
> > 
> > * *
> > 
> > I’ve obtained these results with the journal and data on the same disk and
> > also with the journal on a separate SSD.
> > 
> > * *
> > 
> > I’ve two OSD host with 64GB of RAM and 2x Intel Xeon E5-2620 @ 2.00GHz
> > and one MON host with 128GB of RAM and 2x Intel Xeon E5-2620 @ 2.00
> > GHz.
> > 
> > I’m using 10G mellanox NIC and Switch with jumbo frames.
> > 
> >  
> > 
> > I also did other tests with this configuration (see the attached Excel
> > workbook).
> > 
> > Hardware configuration for each of the two OSD nodes:
> > 
> > 3x  100GB Intel SSD DC S3700 with 3 * 30 GB partition
> > for every SSD
> > 
> > 9x  1TB Seagate HDD
> > 
> > Results: about *12k* IOPS with 4k bs and same fio test.
> > 
> >  
> > 
> > I can’t understand where the problem with the NVMe SSDs is.
> > 
> > Can anyone help me?
> > 
> >  
> > 
> > Here the *ceph.conf:*
> > 
> > [global]
> > 
> > fsid = 3392a053-7b48-49d3-8fc9-50f245513cc7
> > 
> > mon_initial_members = mon1
> > 
> > mon_host = 192.168.1.3
> > 
> > auth_cluster_required = cephx
> > 
> > auth_service_required = cephx
> > 
> > auth_client_required = cephx
> > 
> > osd_pool_default_size = 2
> > 
> > mon_client_hung_interval = 1.0
> > 
> > mon_client_ping_interval = 5.0
> > 
> > public_network = 192.168.1.0/24
> > 
> > cluster_network = 192.168.1.0/24
> > 
> > mon_osd_full_ratio = .90
> > 
> > mon_osd_nearfull_ratio = .85
> > 
> >  
> > 
> > [mon]
> > 
> > mon_warn_on_legacy_crush_tunables = false
> > 
> >  
> > 
> > [mon.1]
> > 
> > host = mon1
> > 
> > mon_addr = 192.168.1.3:6789
> > 
> >  
> > 
> > [osd]
> > 
> > osd_journal_size = 3
> > 
> > journal_dio = true
> > 
> > journal_aio = true
> > 
> > osd_op_threads = 24
> > 
> > osd_op_thread_timeout = 60
> > 
> > osd_disk_threads = 8
> > 
> > osd_recovery_threads = 2
> > 
> > osd_recovery_max_active = 1
> > 
> > osd_max_backfills = 2
> > 
> > osd_mkfs_type = xfs
> > 
> > osd_mkfs_options_xfs = "-f -i size=2048"
> > 
> > osd_mount_options_xfs = "rw,noatime,inode64,logbsize=256k,delaylog"
> > 
> > filestore_xattr_use_omap = false
> > 
> > filestore_max_inline_xattr_size = 512
> > 
> > filestore_max_sync_interval = 10
> 

Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Somnath Roy
One thing: *don't* trust iostat's disk util% in the case of SSDs. 100% doesn't
mean you are saturating the SSD there; I have seen a large performance delta
even when iostat was reporting 100% disk util in both cases.
Also, the ceph.conf file you are using is not optimal. Try adding these:

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your CPU; assuming you have a powerful CPU
complex for SSDs, tweak these to a high number of shards. It also depends on
the number of OSDs per box:

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer or newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..

> > filestore_queue_max_ops = 2000
> >
> > filestore_queue_max_bytes = 536870912
> >
> > filestore_queue_committing_max_ops = 500
> >
> > filestore_queue_committing_max_bytes = 268435456

Use the following..

osd_enable_op_tracker = false


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

>
>
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> >
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> >
> >
> > With iostat I noticed that the disk is fully utilized and writes per
> > second from iostat match the fio output.
> >
>
> Ceph isn't fully optimized to get the maximum potential out of NVME
> SSDs yet.
>
Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.

My guess would be these particular NVMe SSDs might just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is reaffirmed by seeing Samsung list them as client SSDs,
not data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVMe SSDs work best with very high queue depths and
> parallel IOPS.
> 
> Also, be aware that Ceph adds multiple layers to the whole I/O
> subsystem and that there will be a performance impact when Ceph is used in
> between.
>
> Wido
>
> >
> >
> > Matteo
> >
> >
> >
> > *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> > *Sent:* Monday, 26 October 2015 13:06
> > *To:* Matteo Dacrema <mdacr...@enter.it>; ceph-us...@ceph.com
> > *Subject:* RE: BAD nvme SSD performance
> >
> >
> >
> > Hi Matteo,
> >
> >
> >
> > Ceph introduces latency into the write path and so what you are
> > seeing is typical. If you increase the iodepth of the fio test you
> > should get higher results though, until you start maxing out your CPU.
> >
> >
> >
> > Nick
> >
> >
> >
> > *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
> > Behalf Of *Matteo Dacrema
> > *Sent:* 26 October 2015 11:20
> > *To:* ceph-us...@ceph.com <mailto:ceph-us...@ceph.com>
> > *Subject:* [ceph-users] BAD nvme SSD performance
> >
> >
> >
> > Hi all,
> >
> >
> >
> > I’ve recently bought two Samsung SM951 256GB NVMe PCIe SSDs and built a
> > 2 OSD ceph cluster with min_size = 1.
> > 
> > I’ve tested them with fio and obtained two very different results in
> > these two situations.
> >
> > This is the command : *fio  --ioengine=libaio --direct=1
> > --name=test --filename=test --bs=4k  --size=100M
> > --readwrite=randwrite
> > --numjobs=200  --group_reporting*
> >
> >
>

Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Somnath Roy
Another point,
As Christian mentioned, try to evaluate the O_DIRECT|O_DSYNC performance of an
SSD before choosing it for Ceph.
Try running fio with direct=1 and sync=1 against the raw SSD device.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, October 26, 2015 9:20 AM
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

One thing: *don't* trust iostat's disk util% in the case of SSDs. 100% doesn't
mean you are saturating the SSD there; I have seen a large performance delta
even when iostat was reporting 100% disk util in both cases.
Also, the ceph.conf file you are using is not optimal. Try adding these:

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your CPU; assuming you have a powerful CPU
complex for SSDs, tweak these to a high number of shards. It also depends on
the number of OSDs per box:

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer or newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..

> > filestore_queue_max_ops = 2000
> >
> > filestore_queue_max_bytes = 536870912
> >
> > filestore_queue_committing_max_ops = 500
> >
> > filestore_queue_committing_max_bytes = 268435456

Use the following..

osd_enable_op_tracker = false


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

>
>
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> >
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> >
> >
> > With iostat I noticed that the disk is fully utilized and writes per
> > second from iostat match the fio output.
> >
>
> Ceph isn't fully optimized to get the maximum potential out of NVME 
> SSDs yet.
>
Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.

My guess would be these particular NVMe SSDs might just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is reaffirmed by seeing Samsung list them as client SSDs,
not data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVMe SSDs work best with very high queue depths and
> parallel IOPS.
> 
> Also, be aware that Ceph adds multiple layers to the whole I/O
> subsystem and that there will be a performance impact when Ceph is used in
> between.
>
> Wido
>
> >
> >
> > Matteo
> >
> >
> >
> > *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> > *Sent:* Monday, 26 October 2015 13:06
> > *To:* Matteo Dacrema <mdacr...@enter.it>; ceph-us...@ceph.com
> > *Subject:* RE: BAD nvme SSD performance
> >
> >
> >
> > Hi Matteo,
> >
> >
> >
> > Ceph introduces latency into the write path and so what you are 
> > seeing is typical. If you increase the iodepth of the fio test you 
> > should get higher results though, until you start maxing out your CPU.
> >
> >
> >
> > Nick
> >
> >
> >
> > *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On 
> > Behalf Of *Matteo Dacrema
> > *Sent:* 26 October 2015 11:20
> > *To:* ceph-us...@ceph.com <mailto:ceph-us...@ceph.com>
> > *Subject:* [ceph-users] BAD nvme SSD performance
> >
> >
> >
> > Hi all,
> >
> >
> >