Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Willem Jan Withagen


> On 24 Jun 2017, at 14:17, Maged Mokhtar wrote:
> 
> My understanding was that this test targets latency more than IOPS. That is 
> probably why it was run using QD=1. It also makes sense that CPU frequency 
> will be more important than core count. 
> 

But then it is not generic enough to be used as general advice!
It is just a line in 3D space, as there are so many other variables in play.

--WjW
>> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>> 
>>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>>> The general advice floating around is that you want CPUs with high
>>> clock speeds rather than more cores to reduce latency and increase IOPS
>>> for SSD setups (see also
>>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/), so
>>> something like an E5-2667V4 might bring better results in that situation.
>>> Also, there was some talk about disabling the processor C-states in order
>>> to bring latency down (something like this should be easy to test:
>>> https://stackoverflow.com/a/22482722/220986)
>> 
>> I would be very careful to call this general advice...
>> 
>> Although the article is interesting, it is rather one-sided.
>> 
>> The only thing it shows is that there is a linear relation between
>> clock speed and write or read speed.
>> The article is rather vague on how and what is actually tested.
>> 
>> By running just a single OSD with no replication, a lot of the
>> functionality is left out of the equation.
>> Nobody runs just 1 OSD on a box in a normal cluster host.
>> 
>> Not using a serious SSD is another source of noise in the conclusion.
>> A higher queue depth can/will certainly have an impact on concurrency.
>> 
>> I would call this an observation, and nothing more.
>> 
>> --WjW
>>> 
>>> 
>>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos wrote:
>>> 
>>> Hello,
>>> 
>>> We are in the process of evaluating the performance of a testing
>>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>>> 3 monitors (VMs)
>>> 2 physical servers each connected with 1 JBOD running Ubuntu Server
>>> 16.04
>>> 
>>> Each server has 32 threads @2.1GHz and 128GB RAM.
>>> The disk distribution per server is:
>>> 38 * HUS726020ALS210 (SAS rotational)
>>> 2 * HUSMH8010BSS200 (SAS SSD for journals)
>>> 2 * ST1920FM0043 (SAS SSD for data)
>>> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>>> 
>>> Since we don't currently have a 10Gbit switch, we test the performance
>>> with the cluster in a degraded state, the noout flag set, and we mount
>>> RBD images on the powered-on OSD node. We confirmed that the network
>>> is not saturated during the tests.
>>> 
>>> We ran tests on the NVMe disk and the pool created on this disk, where
>>> we hoped to get the most performance without being limited by the
>>> hardware specs, since we have more disks than CPU threads.
>>> 
>>> The NVMe disk was at first partitioned with one partition and the
>>> journal on the same disk. Performance on random 4K reads topped out
>>> at 50K IOPS. We then removed the OSD and repartitioned with 4
>>> data partitions and 4 journals on the same disk. Performance
>>> didn't increase significantly. Also, since we run read tests, the
>>> journals shouldn't cause performance issues.
>>> 
>>> We then ran 4 fio processes in parallel on the same mounted RBD image
>>> and the total reached 100K IOPS. More parallel fio processes didn't
>>> increase the measured IOPS.
>>> 
>>> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
>>> the CRUSH map just defines the different buckets/rules for the disk
>>> separation (rotational, SSD, NVMe) in order to create the required
>>> pools.
>>> 
>>> Is 100,000 IOPS for random 4K reads normal for a disk that, in the
>>> same benchmark, delivers more than 300K IOPS on the same hardware,
>>> or are we missing something?
>>> 
>>> Best regards,
>>> Kostas


Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Nick Fisk
Apologies for the top post, I can't seem to break indents on my phone.

Anyway, the point of that test was, as Maged suggests, to show the effect of 
serial CPU speed on latency. IO is effectively serialised by the PG lock, and 
so reducing the time spent in this area is key. A fast CPU, fast network, 
and fast journals are what matter here.

This is particularly important for databases, where the small log area, 
which may occupy only a small number of PGs, can cause contention. The same 
applies to the XFS journal.

Higher queue depths will start to show similar behaviour if you go high enough 
and start waiting for PGs to unlock.

Further tests on proper hardware with 3x replication over the network have 
shown average latency figures of around 600 µs at QD=1.
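
For reference, a QD=1 4K random-read run of the kind discussed here can be
reproduced with fio's rbd engine. A minimal sketch, assuming a pool named
rbd and a pre-created test image named bench (both names are placeholders):

    # 4K random reads, one outstanding IO, 60 s run; latency is the number to watch
    fio --name=qd1-randread --ioengine=rbd --clientname=admin \
        --pool=rbd --rbdname=bench \
        --rw=randread --bs=4k --iodepth=1 --numjobs=1 \
        --time_based --runtime=60

At QD=1 the achievable IOPS is roughly the inverse of the per-IO latency, so an
average latency of about 600 µs works out to roughly 1,600-1,700 IOPS per
client thread.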


From: Maged Mokhtar 
Sent: 24 Jun 2017 1:17 p.m.
To: Willem Jan Withagen
Cc: Ceph Users
Subject: Re: [ceph-users] Ceph random read IOPS

> My understanding was that this test targets latency more than IOPS. That is 
> probably why it was run using QD=1. It also makes sense that CPU frequency 
> will be more important than core count. 
>
>  
>
>
> On 2017-06-24 12:52, Willem Jan Withagen wrote:
>>
>> On 24-6-2017 05:30, Christian Wuerdig wrote:
>>>
>>> The general advice floating around is that you want CPUs with high
>>> clock speeds rather than more cores to reduce latency and increase IOPS
>>> for SSD setups (see also
>>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/), so
>>> something like an E5-2667V4 might bring better results in that situation.
>>> Also, there was some talk about disabling the processor C-states in order
>>> to bring latency down (something like this should be easy to test:
>>> https://stackoverflow.com/a/22482722/220986)
>>
>>
>> I would be very careful to call this general advice...
>>
>> Although the article is interesting, it is rather one-sided.
>>
>> The only thing it shows is that there is a linear relation between
>> clock speed and write or read speed.
>> The article is rather vague on how and what is actually tested.
>>
>> By running just a single OSD with no replication, a lot of the
>> functionality is left out of the equation.
>> Nobody runs just 1 OSD on a box in a normal cluster host.
>>
>> Not using a serious SSD is another source of noise in the conclusion.
>> A higher queue depth can/will certainly have an impact on concurrency.
>>
>> I would call this an observation, and nothing more.
>>
>> --WjW
>>>
>>>
>>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos wrote:
>>>
>>> Hello,
>>>
>>> We are in the process of evaluating the performance of a testing
>>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>>> 3 monitors (VMs)
>>> 2 physical servers each connected with 1 JBOD running Ubuntu Server
>>> 16.04
>>>
>>> Each server has 32 threads @2.1GHz and 128GB RAM.
>>> The disk distribution per server is:
>>> 38 * HUS726020ALS210 (SAS rotational)
>>> 2 * HUSMH8010BSS200 (SAS SSD for journals)
>>> 2 * ST1920FM0043 (SAS SSD for data)
>>> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>>>
>>> Since we don't currently have a 10Gbit switch, we test the performance
>>> with the cluster in a degraded state, the noout flag set, and we mount
>>> RBD images on the powered-on OSD node. We confirmed that the network
>>> is not saturated during the tests.
>>>
>>> We ran tests on the NVMe disk and the pool created on this disk, where
>>> we hoped to get the most performance without being limited by the
>>> hardware specs, since we have more disks than CPU threads.
>>>
>>> The NVMe disk was at first partitioned with one partition and the
>>> journal on the same disk. Performance on random 4K reads topped out
>>> at 50K IOPS. We then removed the OSD and repartitioned with 4
>>> data partitions and 4 journals on the same disk. Performance
>>> didn't increase significantly. Also, since we run read tests, the
>>> journals shouldn't cause performance issues.
>>>
>>> We then ran 4 fio processes in parallel on the same mounted RBD image
>>> and the total reached 100K IOPS. More parallel fio processes didn't
>>> increase the measured IOPS.
>>>
>>> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
>>> the CRUSH map just defines the different buckets/rules for the disk
>>> separation (rotational, SSD, NVMe) in order to create the required
>>> pools.
>>>
>>> Is 100,000 IOPS for random 4K reads normal for a disk that, in the
>>> same benchmark, delivers more than 300K IOPS on the same hardware,
>>> or are we missing something?
>>>
>>> Best regards,
>>> Kostas

Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Maged Mokhtar
My understanding was that this test targets latency more than IOPS. That
is probably why it was run using QD=1. It also makes sense that CPU
frequency will be more important than core count.

On 2017-06-24 12:52, Willem Jan Withagen wrote:

> On 24-6-2017 05:30, Christian Wuerdig wrote: 
> 
>> The general advice floating around is that you want CPUs with high
>> clock speeds rather than more cores to reduce latency and increase IOPS
>> for SSD setups (see also
>> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/), so
>> something like an E5-2667V4 might bring better results in that situation.
>> Also, there was some talk about disabling the processor C-states in order
>> to bring latency down (something like this should be easy to test:
>> https://stackoverflow.com/a/22482722/220986)
> 
> I would be very careful to call this general advice...
> 
> Although the article is interesting, it is rather one-sided.
> 
> The only thing it shows is that there is a linear relation between
> clock speed and write or read speed.
> The article is rather vague on how and what is actually tested.
> 
> By running just a single OSD with no replication, a lot of the
> functionality is left out of the equation.
> Nobody runs just 1 OSD on a box in a normal cluster host.
> 
> Not using a serious SSD is another source of noise in the conclusion.
> A higher queue depth can/will certainly have an impact on concurrency.
> 
> I would call this an observation, and nothing more.
> 
> --WjW 
> 
>> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos wrote:
>> 
>> Hello,
>> 
>> We are in the process of evaluating the performance of a testing
>> cluster (3 nodes) with ceph jewel. Our setup consists of:
>> 3 monitors (VMs)
>> 2 physical servers each connected with 1 JBOD running Ubuntu Server
>> 16.04
>> 
>> Each server has 32 threads @2.1GHz and 128GB RAM.
>> The disk distribution per server is:
>> 38 * HUS726020ALS210 (SAS rotational)
>> 2 * HUSMH8010BSS200 (SAS SSD for journals)
>> 2 * ST1920FM0043 (SAS SSD for data)
>> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
>> 
>> Since we don't currently have a 10Gbit switch, we test the performance
>> with the cluster in a degraded state, the noout flag set, and we mount
>> RBD images on the powered-on OSD node. We confirmed that the network
>> is not saturated during the tests.
>> 
>> We ran tests on the NVMe disk and the pool created on this disk, where
>> we hoped to get the most performance without being limited by the
>> hardware specs, since we have more disks than CPU threads.
>> 
>> The NVMe disk was at first partitioned with one partition and the
>> journal on the same disk. Performance on random 4K reads topped out
>> at 50K IOPS. We then removed the OSD and repartitioned with 4
>> data partitions and 4 journals on the same disk. Performance
>> didn't increase significantly. Also, since we run read tests, the
>> journals shouldn't cause performance issues.
>> 
>> We then ran 4 fio processes in parallel on the same mounted RBD image
>> and the total reached 100K IOPS. More parallel fio processes didn't
>> increase the measured IOPS.
>> 
>> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
>> the CRUSH map just defines the different buckets/rules for the disk
>> separation (rotational, SSD, NVMe) in order to create the required
>> pools.
>> 
>> Is 100,000 IOPS for random 4K reads normal for a disk that, in the
>> same benchmark, delivers more than 300K IOPS on the same hardware,
>> or are we missing something?
>> 
>> Best regards,
>> Kostas


Re: [ceph-users] Ceph random read IOPS

2017-06-24 Thread Willem Jan Withagen
On 24-6-2017 05:30, Christian Wuerdig wrote:
> The general advice floating around is that you want CPUs with high
> clock speeds rather than more cores to reduce latency and increase IOPS
> for SSD setups (see also
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/), so
> something like an E5-2667V4 might bring better results in that situation.
> Also, there was some talk about disabling the processor C-states in order
> to bring latency down (something like this should be easy to test:
> https://stackoverflow.com/a/22482722/220986)

I would be very careful to call this general advice...

Although the article is interesting, it is rather one-sided.

The only thing it shows is that there is a linear relation between
clock speed and write or read speed.
The article is rather vague on how and what is actually tested.

By running just a single OSD with no replication, a lot of the
functionality is left out of the equation.
Nobody runs just 1 OSD on a box in a normal cluster host.

Not using a serious SSD is another source of noise in the conclusion.
A higher queue depth can/will certainly have an impact on concurrency.

I would call this an observation, and nothing more.

--WjW
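
For completeness, the C-state test referenced in the quoted advice comes down
to checking which idle states the CPUs expose and holding the allowed wake-up
latency at zero while a benchmark runs. A rough sketch, assuming a Linux host
with root access (sysfs paths and idle-state names vary by kernel and driver):

    # list the idle (C-)states the kernel exposes for CPU 0
    grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

    # keep deep C-states off while this shell is running; the request is
    # dropped as soon as the descriptor on /dev/cpu_dma_latency is closed
    sudo sh -c 'exec 3>/dev/cpu_dma_latency; echo -n 0 >&3; sleep 3600'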
> 
> On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos wrote:
> 
> Hello,
> 
> We are in the process of evaluating the performance of a testing
> cluster (3 nodes) with ceph jewel. Our setup consists of:
> 3 monitors (VMs)
> 2 physical servers each connected with 1 JBOD running Ubuntu Server
> 16.04
> 
> Each server has 32 threads @2.1GHz and 128GB RAM.
> The disk distribution per server is:
> 38 * HUS726020ALS210 (SAS rotational)
> 2 * HUSMH8010BSS200 (SAS SSD for journals)
> 2 * ST1920FM0043 (SAS SSD for data)
> 1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
> 
> Since we don't currently have a 10Gbit switch, we test the performance
> with the cluster in a degraded state, the noout flag set, and we mount
> RBD images on the powered-on OSD node. We confirmed that the network
> is not saturated during the tests.
> 
> We ran tests on the NVMe disk and the pool created on this disk, where
> we hoped to get the most performance without being limited by the
> hardware specs, since we have more disks than CPU threads.
> 
> The NVMe disk was at first partitioned with one partition and the
> journal on the same disk. Performance on random 4K reads topped out
> at 50K IOPS. We then removed the OSD and repartitioned with 4
> data partitions and 4 journals on the same disk. Performance
> didn't increase significantly. Also, since we run read tests, the
> journals shouldn't cause performance issues.
> 
> We then ran 4 fio processes in parallel on the same mounted RBD image
> and the total reached 100K IOPS. More parallel fio processes didn't
> increase the measured IOPS.
> 
> Our ceph.conf is pretty basic (debug is set to 0/0 for everything) and
> the CRUSH map just defines the different buckets/rules for the disk
> separation (rotational, SSD, NVMe) in order to create the required
> pools.
> 
> Is 100,000 IOPS for random 4K reads normal for a disk that, in the
> same benchmark, delivers more than 300K IOPS on the same hardware,
> or are we missing something?
> 
> Best regards,
> Kostas


Re: [ceph-users] Help needed rbd feature enable

2017-06-24 Thread Mykola Golub
The immutable features are features that can be set only at image
creation time. These features are mutable (can be dynamically
enabled/disabled after image creation):

  exclusive-lock, object-map, fast-diff, journaling

Also, the deep-flatten feature can be dynamically disabled.

So all other features (like layering) are immutable. To add an
immutable feature you have to use the `rbd copy` command (specifying
the features you want for the destination image) and then `rbd rename` to
give it the original name, or use export/import as has already been
suggested.
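
A minimal sketch of that copy-and-rename route, assuming the default rbd pool,
an existing image called myimage, and that layering plus the mutable features
are wanted (verify the feature list against your clients before removing
anything):

    # copy to a new image that carries the desired feature set
    rbd copy myimage myimage.new \
        --image-feature layering,exclusive-lock,object-map,fast-diff

    # once the copy has been verified, swap it into place
    rbd rm myimage
    rbd rename myimage.new myimage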

To tell why your current images have no features, we would need to
know how they were created and ceph.conf at that moment.
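
For reference, the Jewel defaults can also be pinned explicitly on the clients
that create images. A sketch of a ceph.conf fragment, assuming the stock Jewel
feature set is what you want; it only affects newly created images:

    [client]
    # 1 + 4 + 8 + 16 + 32 = layering, exclusive-lock, object-map, fast-diff, deep-flatten
    rbd default features = 61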

On Sat, Jun 24, 2017 at 12:14:03AM +0200, Massimiliano Cuttini wrote:
> What seems strange is that features are *all disabled* when I create
> some images, while Ceph should at least use the Jewel default settings.
> 
> Do I need to put something in ceph.conf in order to use the default settings?
> 
> 
> 
> 
> On 23/06/2017 23:43, Massimiliano Cuttini wrote:
> >
> >I guess you updated those features before the commit that fixes this:
> >
> >https://github.com/ceph/ceph/blob/master/src/include/rbd/features.h
> >
> >As stated:
> >
> >/// features that make an image inaccessible for read or write by
> >/// clients that don't understand them
> >#define RBD_FEATURES_INCOMPATIBLE   (RBD_FEATURE_LAYERING       | \
> >                                     RBD_FEATURE_STRIPINGV2     | \
> >                                     RBD_FEATURE_DATA_POOL)
> >
> >/// features that make an image unwritable by clients that don't
> >/// understand them
> >#define RBD_FEATURES_RW_INCOMPATIBLE (RBD_FEATURES_INCOMPATIBLE  | \
> >                                      RBD_FEATURE_EXCLUSIVE_LOCK | \
> >                                      RBD_FEATURE_OBJECT_MAP     | \
> >                                      RBD_FEATURE_FAST_DIFF      | \
> >                                      RBD_FEATURE_DEEP_FLATTEN   | \
> >                                      RBD_FEATURE_JOURNALING)
> >
> >
> >Some features are one-way: if you downgrade them you cannot restore
> >them anymore.
> >As stated here: https://bugzilla.redhat.com/show_bug.cgi?id=1326645
> >this is the intended behaviour and Red Hat will not fix it.
> >
> >So don't downgrade your features, or you'll have to export/import all the
> >images.
> >It's driving me crazy.
> >
> >
> >
> >On 23/06/2017 22:36, David Turner wrote:
> >>
> >>I upgraded to Jewel from Hammer and was able to enable those features on
> >>all of my RBDs that were format 2, which yours is. Just test it on some
> >>non-customer data and see how it goes.
> >>
> >>
> >>On Fri, Jun 23, 2017, 4:33 PM Massimiliano Cuttini wrote:
> >>
> >>Ok,
> >>
> >>At the moment my clients use only nbd-rbd; can I use all these features,
> >>or is this something unavoidable?
> >>I guess it's ok.
> >>
> >>Reading around, it seems that a lost feature cannot be re-enabled due
> >>to backward compatibility with old clients.
> >>... I guess I'll need to export and import into a new, fully featured
> >>image.
> >>Is that right?
> >>
> >>
> >>
> >>
> >>On 23/06/2017 22:25, David Turner wrote:
> >>>All of the features you are talking about likely require
> >>>exclusive-lock, which requires the 4.9 Linux kernel.  You cannot
> >>>map any RBDs that have these features enabled with any kernel
> >>>older than that.
> >>>
> >>>The features you can enable are layering, exclusive-lock,
> >>>object-map, and fast-diff.  You cannot enable deep-flatten on
> >>>any RBD ever; RBDs can only be created with that feature.  You
> >>>may need to enable these in a specific order.  I believe the
> >>>order I have listed the features in is the order you need to
> >>>enable them in; at least that order should work.
> >>>
> >>>On Fri, Jun 23, 2017 at 3:41 PM Massimiliano Cuttini wrote:
> >>>
> >>>Hi everybody,
> >>>
> >>>I just realized that all my images are completely without
> >>>features:
> >>>
> >>>rbd info VHD-4c7ebb38-b081-48da-9b57-aac14bdf88c4
> >>>rbd image 'VHD-4c7ebb38-b081-48da-9b57-aac14bdf88c4':
> >>> size 102400 MB in 51200 objects
> >>> order 21 (2048 kB objects)
> >>> block_name_prefix: rbd_data.5fde2ae8944a
> >>> format: 2
> >>> features:
> >>> flags:
> >>>
> >>>Trying to enable them gives this error:
> >>>
> >>>rbd: failed to update image features: (22) Invalid argument
> >>>2017-06-23 21:20:03.748746 7fdec1b34d80 -1 librbd: cannot update immutable features
> >>>
> >>>I read in the guide that I should have set rbd_default_features in the
> >>>config.
> >>>
> >>>What can I do now to enable all these features?
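
For the mutable features, the enable sequence David Turner describes comes down
to something like the following. A sketch against the image from the rbd info
output above, assuming the default pool; exclusive-lock has to come first, and
layering itself is immutable, so it still needs the copy or export/import route
Mykola describes:

    rbd feature enable VHD-4c7ebb38-b081-48da-9b57-aac14bdf88c4 exclusive-lock
    rbd feature enable VHD-4c7ebb38-b081-48da-9b57-aac14bdf88c4 object-map
    rbd feature enable VHD-4c7ebb38-b081-48da-9b57-aac14bdf88c4 fast-diff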