Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-24 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Christian Balzer
> Sent: 24 October 2016 02:30
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg
> queue size
> 
> 
> Hello,
> 
> On Fri, 21 Oct 2016 17:44:25 + Jim Kilborn wrote:
> 
> > Reed/Christian,
> >
> > So if I put the OSD journals on an SSD that has power loss protection 
> > (Samsung SM863), all the writes then go through those journals. Can I then 
> > leave write caching turned on for the spinner OSDs, even without a BBU 
> > caching controller? In the event of a power outage past our UPS time, I want 
> > to ensure all the OSDs aren't corrupt after bringing the nodes back up.

Just to clarify here: I think the poster who had data loss when enabling write 
cache on their disks had the disk write cache option enabled on a RAID 
controller. This is different to having write cache enabled (via hdparm) when 
the disk is presented directly to Linux, as the kernel can (hopefully!!!) force 
the disk to flush its write cache. RAID controllers hide this capability, so the 
kernel cannot communicate with the disk and tell it to flush its cache; it only 
sees the virtual RAID disk(s).
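
For anyone who wants to check this on their own nodes: on directly-attached 
disks, something like the following should work (device names are examples only, 
and the setting does not survive a reboot unless you persist it via udev or 
similar):

# Query the current state of the disk's volatile write cache
hdparm -W /dev/sdb

# Disable the on-disk write cache
hdparm -W0 /dev/sdb

# Re-enable it, relying on the kernel/filesystem to issue cache flushes
hdparm -W1 /dev/sdb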


> >
> 
> Nope, as the corruption happened on the actual OSD, the LevelDB.
> 
> At this point I'd like to point out the last paragraph here:
> https://en.wikipedia.org/wiki/LevelDB
> 
> Which matches my experience, on a MON node with non-power fail safe SSDs we 
> lost power (by human error, not actual DC
> problems) twice and had the levelDB corrupted both times, while the 
> filesystem was fine thanks to proper journaling with barriers and
> SYNC points.
> 
> If you have direct control/communications with your UPS, I'd recommend using 
> that info to shut down things before power runs out.
> 
> Alternatively, have ALL your data on PLP SSDs or with lots of manual effort, 
> symlink to the leveldb to such a device.
> 
> 
> Christian
> 
> > Secondly, Seagate 8TB enterprise drives say they employ power loss 
> > protection as well. Apparently, in your case, this turned out to
> be untrue?
> >
> >
> >
> >
> > Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for
> > Windows 10
> >
> > From: Reed Dier<mailto:reed.d...@focusvq.com>
> > Sent: Friday, October 21, 2016 10:06 AM
> > To: Christian Balzer<mailto:ch...@gol.com>
> > Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> > Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel
> > - cache pressure, capability release, poor iostat await avg queue size
> >
> >
> > On Oct 19, 2016, at 7:54 PM, Christian Balzer 
> > <ch...@gol.com<mailto:ch...@gol.com>> wrote:
> >
> >
> > Hello,
> >
> > On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> >
> > I have setup a new linux cluster to allow migration from our old SAN based 
> > cluster to a new cluster with ceph.
> > All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> > As others mentioned, not a good choice, but also not the (main) cause
> > of your problems.
> >
> > I am basically running stock ceph settings, with just turning the write 
> > cache off via hdparm on the drives, and temporarily turning off 
> > scrubbing.
> >
> > The former is bound to kill performance, if you care that much for
> > your data but can't guarantee constant power (UPS, dual PSUs, etc),
> > consider using a BBU caching controller.
> >
> > I wanted to comment on this small bolded bit, in the early days of my ceph 
> > cluster, testing resiliency to power failure (worst case
> scenario), when the on-disk write cache was enabled on my drives, I would 
> lose that OSD to leveldb corruption, even with BBU.
> >
> > With BBU + no disk-level cache, the OSD would come back, with no data
> > loss, however performance would be significantly degraded. (xfsaild
> > process with 99% iowait, cured by zapping disk and recreating OSD)
> >
> > For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, 
> > with the OSD set as a single RAID0 VD. On disk
> journaling.
> >
> > There was a decent enough hit to write performance after disabling write 
> > caching at the disk layer, but write-back caching at the
> controller layer provided enough of a negating increase, that the data 
> security was an acceptable trade off.
> >
> > Was a tough way to learn how important this was after data center was struck 
> > by lightning two weeks after initial ceph cluster install and one phase of 
> > power was knocked out for 15 minutes, taking half the non-dual-PSU nodes with it.

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-23 Thread Christian Balzer

Hello,

On Fri, 21 Oct 2016 17:44:25 + Jim Kilborn wrote:

> Reed/Christian,
> 
> So if I put the OSD journals on an SSD that has power loss protection 
> (Samsung SM863), all the writes then go through those journals. Can I then 
> leave write caching turned on for the spinner OSDs, even without a BBU caching 
> controller? In the event of a power outage past our UPS time, I want to 
> ensure all the OSDs aren't corrupt after bringing the nodes back up.
> 

Nope, as the corruption happened on the actual OSD, the LevelDB.

At this point I'd like to point out the last paragraph here:
https://en.wikipedia.org/wiki/LevelDB

Which matches my experience, on a MON node with non-power fail safe SSDs
we lost power (by human error, not actual DC problems) twice and had the
levelDB corrupted both times, while the filesystem was fine thanks to
proper journaling with barriers and SYNC points.

If you have direct control/communications with your UPS, I'd recommend
using that info to shut down things before power runs out.
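
As a minimal sketch (assuming a NUT-monitored UPS named "myups" and that a clean
node shutdown is acceptable), a periodically-run check could look like this;
NUT's upsmon can also do this natively:

#!/bin/sh
# Power the node off cleanly once the UPS battery drops below a threshold.
# "myups" and the 25% threshold are examples only.
CHARGE=$(upsc myups battery.charge 2>/dev/null)
if [ -n "$CHARGE" ] && [ "$CHARGE" -lt 25 ]; then
    logger "UPS at ${CHARGE}%, shutting down cleanly before power runs out"
    systemctl stop ceph.target    # stop mon/osd/mds daemons gracefully
    shutdown -h now
fi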

Alternatively, have ALL your data on PLP SSDs or, with lots of manual effort, 
symlink the leveldb to such a device.


Christian

> Secondly, Seagate 8TB enterprise drives say they employ power loss protection 
> as well. Apparently, in your case, this turned out to be untrue?
> 
> 
> 
> 
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> 
> From: Reed Dier<mailto:reed.d...@focusvq.com>
> Sent: Friday, October 21, 2016 10:06 AM
> To: Christian Balzer<mailto:ch...@gol.com>
> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
> 
> 
> On Oct 19, 2016, at 7:54 PM, Christian Balzer 
> <ch...@gol.com<mailto:ch...@gol.com>> wrote:
> 
> 
> Hello,
> 
> On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> 
> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> As others mentioned, not a good choice, but also not the (main) cause of
> your problems.
> 
> I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning off scrubbing.
> 
> The former is bound to kill performance, if you care that much for your
> data but can't guarantee constant power (UPS, dual PSUs, etc), consider
> using a BBU caching controller.
> 
> I wanted to comment on this small bolded bit, in the early days of my ceph 
> cluster, testing resiliency to power failure (worst case scenario), when the 
> on-disk write cache was enabled on my drives, I would lose that OSD to 
> leveldb corruption, even with BBU.
> 
> With BBU + no disk-level cache, the OSD would come back, with no data loss, 
> however performance would be significantly degraded. (xfsaild process with 
> 99% iowait, cured by zapping disk and recreating OSD)
> 
> For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, 
> with the OSD set as a single RAID0 VD. On disk journaling.
> 
> There was a decent enough hit to write performance after disabling write 
> caching at the disk layer, but write-back caching at the controller layer 
> provided enough of a negating increase, that the data security was an 
> acceptable trade off.
> 
> Was a tough way to learn how important this was after data center was struck 
> by lightning two weeks after initial ceph cluster install and one phase of 
> power was knocked out for 15 minutes, taking half the non-dual-PSU nodes with 
> it.
> 
> Just want to make sure that people learn from that painful experience.
> 
> Reed
> 
> 
> The latter, I venture, you did because performance was abysmal with scrubbing
> enabled.
> Which is always a good indicator that your cluster needs tuning, improving.
> 
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
> Memory is fine, CPU I can't tell from the model number and I'm not
> inclined to look up or guess, but that usually only becomes a bottleneck
> when dealing with all SSD setup and things requiring the lowest latency
> possible.
> 
> 
> Since I am running cephfs, I have tiering setup.
> That should read "on top of EC pools", and as John said, not a good idea
> at all, both EC pools and cache-tiering.
> 
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-21 Thread Jim Kilborn
Reed/Christian,

So if I put the OSD journals on an SSD that has power loss protection (Samsung 
SM863), all the writes then go through those journals. Can I then leave write 
caching turned on for the spinner OSDs, even without a BBU caching controller? In 
the event of a power outage past our UPS time, I want to ensure all the OSDs 
aren't corrupt after bringing the nodes back up.

Secondly, Seagate 8TB enterprise drives say they employ power loss protection 
as well. Apparently, in your case, this turned out to be untrue?




Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: Reed Dier<mailto:reed.d...@focusvq.com>
Sent: Friday, October 21, 2016 10:06 AM
To: Christian Balzer<mailto:ch...@gol.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size


On Oct 19, 2016, at 7:54 PM, Christian Balzer 
<ch...@gol.com<mailto:ch...@gol.com>> wrote:


Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

I have setup a new linux cluster to allow migration from our old SAN based 
cluster to a new cluster with ceph.
All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning off scrubbing.

The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

I wanted to comment on this small bolded bit, in the early days of my ceph 
cluster, testing resiliency to power failure (worst case scenario), when the 
on-disk write cache was enabled on my drives, I would lose that OSD to leveldb 
corruption, even with BBU.

With BBU + no disk-level cache, the OSD would come back, with no data loss, 
however performance would be significantly degraded. (xfsaild process with 99% 
iowait, cured by zapping disk and recreating OSD)

For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with 
the OSD set as a single RAID0 VD. On disk journaling.

There was a decent enough hit to write performance after disabling write 
caching at the disk layer, but write-back caching at the controller layer 
provided enough of a negating increase, that the data security was an 
acceptable trade off.

Was a tough way to learn how important this was after data center was struck by 
lightning two weeks after initial ceph cluster install and one phase of power 
was knocked out for 15 minutes, taking half the non-dual-PSU nodes with it.

Just want to make sure that people learn from that painful experience.

Reed


The latter, I venture, you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So 
the idea is to ensure a single host failure.
Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

My crush map is setup to ensure the cache pool uses only the 4 850 pro and the 
erasure code uses only the 16 spinning 4TB drives.

The problems that I am seeing is that I start copying data from our old san to 
the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 
TB, I start seeing:

HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
26 ops are blocked > 65.536 sec on osd.0
37 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

osd.0 is the cache ssd

If I watch iostat on the cache ssd, I see the queue lengths are high and the 
await are high

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-21 Thread Reed Dier

> On Oct 19, 2016, at 7:54 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:
> 
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> As others mentioned, not a good choice, but also not the (main) cause of
> your problems.
> 
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning off scrubbing.
>> 
> The former is bound to kill performance, if you care that much for your
> data but can't guarantee constant power (UPS, dual PSUs, etc), consider
> using a BBU caching controller.

I wanted to comment on this small bolded bit. In the early days of my ceph 
cluster, while testing resiliency to power failure (worst case scenario), with 
the on-disk write cache enabled on my drives I would lose that OSD to leveldb 
corruption, even with a BBU.

With BBU + no disk-level cache, the OSD would come back, with no data loss, 
however performance would be significantly degraded. (xfsaild process with 99% 
iowait, cured by zapping disk and recreating OSD)

For reference, these were Seagate ST8000NM0065, backed by an LSI 3108 RoC, with 
the OSD set as a single RAID0 VD. On disk journaling.
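
(For anyone on similar hardware: with an LSI/Avago 3108 managed via storcli, the
combination described above, controller write-back on and drive cache off, would
look roughly like the following; controller and VD indices are examples:)

# Enable write-back caching on the virtual drive (needs a healthy BBU/CacheVault)
storcli /c0/v0 set wrcache=wb

# Turn off the physical drives' own volatile write cache for that VD
storcli /c0/v0 set pdcache=off

# Verify the cache settings
storcli /c0/v0 show all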

There was a decent enough hit to write performance after disabling write 
caching at the disk layer, but write-back caching at the controller layer 
provided enough of an offsetting increase that the data security was an 
acceptable trade-off.

It was a tough way to learn how important this was: the data center was struck by 
lightning two weeks after the initial ceph cluster install and one phase of power 
was knocked out for 15 minutes, taking half the non-dual-PSU nodes with it.

Just want to make sure that people learn from that painful experience.

Reed

> 
> The latter, I venture, you did because performance was abysmal with scrubbing
> enabled.
> Which is always a good indicator that your cluster needs tuning, improving.
> 
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  
> Memory is fine, CPU I can't tell from the model number and I'm not
> inclined to look up or guess, but that usually only becomes a bottleneck
> when dealing with all SSD setup and things requiring the lowest latency
> possible.
> 
> 
>> Since I am running cephfs, I have tiering setup.
> That should read "on top of EC pools", and as John said, not a good idea
> at all, both EC pools and cache-tiering.
> 
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
> 
> This isn't a Seagate, you mean Samsung. And that's a consumer model,
> ill suited for this task, even with the DC level SSDs below as journals.
> 
> And as such a replication of 2 is also ill advised, I've seen these SSDs
> die w/o ANY warning whatsoever and long before their (abysmal) endurance
> was exhausted.
> 
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
> 
> Those are fine. If you re-do your cluster, don't put more than 4-5 journals
> on them.
> 
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>> 
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>> 
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 
>> osd.0 is the cache ssd
>> 
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>> 
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>  0.00 0.339.00   84.33 0.9620.11   462.40
>> 75.92  397.56  125.67  426.58  10.70  99.90
>>  0.00 0.67   30.00   87.33 5.9621.03   471.20
>> 67.86  910.95   87.00 1193.99   8.27  97.07
>>  0.0016.67   33.00  289.33 4.2118.80   146.20
>> 29.83   88.99   93.91   88.43   3.10  99.83
>>  0.00 7.337.67  261.67 1.9219.63   163.81   
>> 117.42  331.97  182.04  336.36   3.71 100.00
>> 
>> 
>> If I look at the iostat for all the drives, only the cache ssd drive is 
>> backed up

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Christian Balzer

Hello,

On Thu, 20 Oct 2016 15:45:34 + Jim Kilborn wrote:

Good to know.

You may be able to squeeze some more 4K write IOPS out of this by cranking
the CPUs to full speed, see the relevant recent threads about this.

As for the 120GB (there is no 128GB SM863 model according to Samsung) SSDs
as journals, keep in mind that in your current cluster their endurance limits
you to about 1TBW/day if you want them to survive 5 years.
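
The arithmetic is straightforward; with a purely illustrative endurance rating:

# sustainable writes per day = rated endurance / desired lifetime in days
# (1800 TBW below is an example figure, not the SM863 spec)
echo "scale=2; 1800 / (5 * 365)" | bc    # ~0.99 TB of writes per day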

Christian
> The chart obviously didn’t go well. Here it is again
> 
> 
> 
> fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
> --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
> --name=journal-test
> 
> 
> 
> FIO Test          Local disk              SAN/NFS                  Ceph size=3/SSD journal
> 
> 4M Writes         53 MB/sec    12 IOPS    62 MB/sec     15 IOPS    151 MB/sec    37 IOPS
> 4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec     15 IOPS    155 MB/sec    37 IOPS
> 4M Read           66 MB/sec    15 IOPS    102 MB/sec    25 IOPS    662 MB/sec   161 IOPS
> 4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec    25 IOPS    670 MB/sec   163 IOPS
> 4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS    2.3 MB/sec   571 IOPS
> 4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS    2.0 MB/sec   501 IOPS
> 4K Read           28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS    13 MB/sec   3256 IOPS
> 4K Rand Read      263 KB/sec              5 MB/sec    1246 IOPS    8 MB/sec    2015 IOPS
> 
> 
> 
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
> 
> 
> 
> From: Jim Kilborn<mailto:j...@kilborns.com>
> Sent: Thursday, October 20, 2016 10:20 AM
> To: Christian Balzer<mailto:ch...@gol.com>; 
> ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
> 
> 
> 
> Thanks Christian for the additional information and comments.
> 
> 
> 
> · upgraded the kernels, but still had poor performance
> 
> · Removed all the pools and recreated with just a replication of 3, 
> with the two pool for the data and metadata. No cache tier pool
> 
> · Turned back on the write caching with hdparm. We do have a Large 
> UPS and dual power supplies in the ceph unit. If we get a long power outage, 
> everything will go down anyway.
> 
> 
> 
> I am no longer seeing the issue of the slow requests, ops blocked, etc.
> 
> 
> 
> I think I will push for the following design per ceph server
> 
> 
> 
> 8  4TB sata drives
> 
> 2 Samsung 128GB SM863 SSD each holding 4 osd journals
> 
> 
> 
> With 4 hosts, and a replication of 3 to start with
> 
> 
> 
> I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD 
> holding the  4 osd journals, with 4 hosts in the cluster over infiniband.
> 
> 
> 
> At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 
> 5.5 Gb/sec over InfiniBand, which is around 600 MB/sec and translates well to 
> the FIO number.
> 
> 
> 
> fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
> --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
> --name=journal-test
> 
> 
> 
> FIO Test          Local disk              SAN/NFS                  Ceph w/Repl/SSD journal
> 
> 4M Writes         53 MB/sec    12 IOPS    62 MB/sec     15 IOPS    151 MB/sec    37 IOPS
> 4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec     15 IOPS    155 MB/sec    37 IOPS
> 4M Read           66 MB/sec    15 IOPS    102 MB/sec    25 IOPS    662 MB/sec   161 IOPS
> 4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec    25 IOPS    670 MB/sec   163 IOPS
> 4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS    2.3 MB/sec   571 IOPS
> 4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS    2.0 MB/sec   501 IOPS
> 4K Read           28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS    13 MB/sec   3256 IOPS

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Jim Kilborn
The chart obviously didn’t go well. Here it is again



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test          Local disk              SAN/NFS                  Ceph size=3/SSD journal

4M Writes         53 MB/sec    12 IOPS    62 MB/sec     15 IOPS    151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec     15 IOPS    155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS    102 MB/sec    25 IOPS    662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec    25 IOPS    670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS    2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS    2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS    13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec              5 MB/sec    1246 IOPS    8 MB/sec    2015 IOPS
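
(Note that the {...} in the fio line above is shorthand for the matrix of runs
rather than literal fio syntax; spelled out as a loop it would be roughly:)

# Run each block size / access pattern combination as a separate fio job
for bs in 4M 4k; do
  for rw in write randwrite read randread; do
    fio --name=journal-test --direct=1 --sync=1 --rw=$rw --bs=$bs \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --size=5G --group_reporting
  done
done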



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: Jim Kilborn<mailto:j...@kilborns.com>
Sent: Thursday, October 20, 2016 10:20 AM
To: Christian Balzer<mailto:ch...@gol.com>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Thanks Christian for the additional information and comments.



· upgraded the kernels, but still had poor performance

· Removed all the pools and recreated with just a replication of 3, 
with the two pool for the data and metadata. No cache tier pool

· Turned back on the write caching with hdparm. We do have a Large UPS 
and dual power supplies in the ceph unit. If we get a long power outage, 
everything will go down anyway.



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8  4TB sata drives

2 Samsung 128GB SM863 SSD each holding 4 osd journals



With 4 hosts, and a replication of 3 to start with



I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding 
the  4 osd journals, with 4 hosts in the cluster over infiniband.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 
5.5 Gb/sec over InfiniBand, which is around 600 MB/sec and translates well to 
the FIO number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test          Local disk              SAN/NFS                  Ceph w/Repl/SSD journal

4M Writes         53 MB/sec    12 IOPS    62 MB/sec     15 IOPS    151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec     15 IOPS    155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS    102 MB/sec    25 IOPS    662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec    25 IOPS    670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS    2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS    2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS    13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec              5 MB/sec    1246 IOPS    8 MB/sec    2015 IOPS




That performance is fine for our needs

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer<mailto:ch...@gol.com>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Cc: Jim Kilborn<mailto:j...@kilborns.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning off scrubbing.
>
The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

The latter, I venture, you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-20 Thread Jim Kilborn
Thanks Christian for the additional information and comments.



· upgraded the kernels, but still had poor performance

· Removed all the pools and recreated with just a replication of 3, 
with the two pool for the data and metadata. No cache tier pool

· Turned back on the write caching with hdparm. We do have a Large UPS 
and dual power supplies in the ceph unit. If we get a long power outage, 
everything will go down anyway.



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8  4TB sata drives

2 Samsung 128GB SM863 SSD each holding 4 osd journals



With 4 hosts, and a replication of 3 to start with



I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding 
the  4 osd journals, with 4 hosts in the cluster over infiniband.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 
5.5 Gb/sec over InfiniBand, which is around 600 MB/sec and translates well to 
the FIO number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} 
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting 
--name=journal-test



FIO Test          Local disk              SAN/NFS                  Ceph w/Repl/SSD journal

4M Writes         53 MB/sec    12 IOPS    62 MB/sec     15 IOPS    151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS    63 MB/sec     15 IOPS    155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS    102 MB/sec    25 IOPS    662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS    103 MB/sec    25 IOPS    670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS    3.8 MB/sec   952 IOPS    2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS    3.6 MB/sec   911 IOPS    2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS    8 MB/sec    1945 IOPS    13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec              5 MB/sec    1246 IOPS    8 MB/sec    2015 IOPS




That performance is fine for our needs

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer<mailto:ch...@gol.com>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Cc: Jim Kilborn<mailto:j...@kilborns.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning off scrubbing.
>
The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

The latter, I venture, you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Christian Balzer

Hello,

On Wed, 19 Oct 2016 12:28:28 + Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning off scrubbing.
> 
The former is bound to kill performance, if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc), consider
using a BBU caching controller.

The latter, I venture, you did because performance was abysmal with scrubbing
enabled.
Which is always a good indicator that your cluster needs tuning, improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  
Memory is fine, CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with all SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

And as such a replication of 2 is also ill advised, I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
> 
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
> 
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 
> osd.0 is the cache ssd
> 
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
> 
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
> 
> 
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
> 
Yes, consumer SSDs on top of a design that channels everything through
them.

Rebuild your cluster along more conventional and conservative lines, don't
use the 850 PROs. 
Feel free to run any new design by us.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
John,



Updating to the latest mainline kernel from elrepo (4.8.2-1) on all 4 ceph 
servers, and the ceph client that I am testing with, still didn’t fix the 
issues.

Still getting “Failing to respond to Cache Pressure”, and ops blocked currently 
hovering between 100-300 requests > 32 sec



This is just from doing an rsync from one ceph client (reading data from an old 
san over nfs, and writing to the ceph cluster over infiniband).



I guess I’ll try getting rid of the EC pool and the cache tier, and just using 
replication with size 3 and see if it works better



Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: John Spray<mailto:jsp...@redhat.com>
Sent: Wednesday, October 19, 2016 12:16 PM
To: Jim Kilborn<mailto:j...@kilborns.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn <j...@kilborns.com> wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size of 
> 3 is minimum recommended.
>
> If I move to no EC, I will have to have 9 4TB spinners on each of the 4 
> servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
> or is that too many osds per journal, creating a hot spot for writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
>
>
> From: John Spray<mailto:jsp...@redhat.com>
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn<mailto:j...@kilborns.com>
> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn <j...@kilborns.com> wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning off scrubbing.
>>
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  Since I am running cephfs, I have 
>> tiering setup.
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>>
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
>>
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>   0.00 0.339.00   84.33 0.9620.11   462.40   
>>  75.92  397.56  125.67  426.58  10.70  99.90
>>   0.00 0.67   30.00   87.

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread John Spray
On Wed, Oct 19, 2016 at 5:17 PM, Jim Kilborn <j...@kilborns.com> wrote:
> John,
>
>
>
> Thanks for the tips….
>
> Unfortunately, I was looking at this page 
> http://docs.ceph.com/docs/jewel/start/os-recommendations/

OK, thanks - I've pushed an update to clarify that
(https://github.com/ceph/ceph/pull/11564).

> I’ll consider either upgrading the kernels or using the fuse client, but will 
> likely go the kernel 4.4 route
>
>
>
> As for moving to just a replicated pool, I take it that a replication size of 
> 3 is minimum recommended.
>
> If I move to no EC, I will have to have 9 4TB spinners on each of the 4 
> servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
> or is that too many osds per journal, creating a hot spot for writes?

That sounds like a lot of journals on one SSD, but people other than
me have more empirical experience in hardware selection.

John

>
>
>
> Thanks!!
>
>
>
>
>
>
>
> Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
>
>
> From: John Spray<mailto:jsp...@redhat.com>
> Sent: Wednesday, October 19, 2016 9:10 AM
> To: Jim Kilborn<mailto:j...@kilborns.com>
> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - 
> cache pressure, capability release, poor iostat await avg queue size
>
>
>
> On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn <j...@kilborns.com> wrote:
>> I have setup a new linux cluster to allow migration from our old SAN based 
>> cluster to a new cluster with ceph.
>> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
>> I am basically running stock ceph settings, with just turning the write 
>> cache off via hdparm on the drives, and temporarily turning off scrubbing.
>>
>> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
>> Server performance should be good.  Since I am running cephfs, I have 
>> tiering setup.
>> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
>> So the idea is to ensure a single host failure.
>> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
>> replicated set with size=2
>> The cache tier also has a 128GB SM863 SSD that is being used as a journal 
>> for the cache SSD. It has power loss protection
>> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
>> the erasure code uses only the 16 spinning 4TB drives.
>>
>> The problems that I am seeing is that I start copying data from our old san 
>> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
>> 1.4 TB, I start seeing:
>>
>> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>> 26 ops are blocked > 65.536 sec on osd.0
>> 37 ops are blocked > 32.768 sec on osd.0
>> 1 osds have slow requests
>> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>>
>> osd.0 is the cache ssd
>>
>> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
>> await are high
>> Below is the iostat on the cache drive (osd.0) on the first host. The 
>> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> sdb
>>   0.00 0.339.00   84.33 0.9620.11   462.40   
>>  75.92  397.56  125.67  426.58  10.70  99.90
>>   0.00 0.67   30.00   87.33 5.9621.03   471.20   
>>  67.86  910.95   87.00 1193.99   8.27  97.07
>>   0.0016.67   33.00  289.33 4.2118.80   146.20   
>>  29.83   88.99   93.91   88.43   3.10  99.83
>>   0.00 7.337.67  261.67 1.9219.63   163.81   
>> 117.42  331.97  182.04  336.36   3.71 100.00
>>
>>
>> If I look at the iostat for all the drives, only the cache ssd drive is 
>> backed up
>>
>> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
>> avgqu-sz   await r_await w_await  svctm  %util
>> Sdg (journal for cache drive)
>>   0.00 6.330.008.00 0.00 0.0719.04   
>>   0.000.330.000.33   0.33   0.27
>> Sdb (cache drive)
>>   0.00 0.333.33   82.00 0.8320.07   501.68   
>> 106.75 1057.81  269.40 1089.86  11.72 100.00
>> Sda (4TB EC)
>> 

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
John,



Thanks for the tips….

Unfortunately, I was looking at this page 
http://docs.ceph.com/docs/jewel/start/os-recommendations/

I’ll consider either upgrading the kernels or using the fuse client, but will 
likely go the kernel 4.4 route
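
(For reference, the two client options look roughly like this; monitor address,
user name and paths are examples only:)

# Kernel client (the >= 4.4 recommendation applies here)
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret

# FUSE client (follows the Ceph userspace code instead of the kernel)
ceph-fuse -m 10.0.0.1:6789 /mnt/cephfs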



As for moving to just a replicated pool, I take it that a replication size of 3 
is minimum recommended.

If I move to no EC, I will have to have 9 4TB spinners on each of the 4 
servers. Can I put the 9 journals on the one 128GB ssd with 10GB per journal, 
or is that too many osds per journal, creating a hot spot for writes?
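
(For reference, 10GB journals on a shared SSD would be set up roughly like this;
device names are examples and ceph-disk carves the journal partition itself:)

# ceph.conf on the OSD hosts; the size is in MB, so ~10GB per journal:
#   [osd]
#   osd journal size = 10240
#
# Prepare each OSD with data on a spinner and its journal on the shared SSD
ceph-disk prepare /dev/sdc /dev/sdb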



Thanks!!







Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10



From: John Spray<mailto:jsp...@redhat.com>
Sent: Wednesday, October 19, 2016 9:10 AM
To: Jim Kilborn<mailto:j...@kilborns.com>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache 
pressure, capability release, poor iostat await avg queue size



On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn <j...@kilborns.com> wrote:
> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning off scrubbing.
>
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  Since I am running cephfs, I have tiering 
> setup.
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2
> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection
> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> Sdg (journal for cache drive)
>   0.00 6.330.008.00 0.00 0.0719.04
>  0.000.330.000.33   0.33   0.27
> Sdb (cache drive)
>   0.00 0.333.33   82.00 0.8320.07   501.68   
> 106.75 1057.81  269.40 1089.86  11.72 100.00
> Sda (4TB EC)
>   0.00 0.000.004.00 0.00 0.02 9.33
>  0.000.000.000.00   0.00   0.00
> Sdd (4TB EC)
>   0.00 0.000.002.33 0.00 0.45   392.00
>  0.08   34.000.00   34.00   6.86   1.60
> Sdf (4TB EC)
>   0.0014.000.00   26.00 0.00 0.2217.71
>  1.00   38.550.00   38.55   0.68   1.77
> Sdc (4TB EC)
>   0.00 0.000.001.33 0.00 0.01 8.75
>  0.02   12.250.00   12.25  12.25   1.63
>
> While at this time is just complaining about slow osd.0, sometimes the other 
> cache tier ssds show some slow response, but not as frequently.
>
>
> I 

Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread John Spray
On Wed, Oct 19, 2016 at 1:28 PM, Jim Kilborn  wrote:
> I have setup a new linux cluster to allow migration from our old SAN based 
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> I am basically running stock ceph settings, with just turning the write cache 
> off via hdparm on the drives, and temporarily turning off scrubbing.
>
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
> Server performance should be good.  Since I am running cephfs, I have tiering 
> setup.
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. 
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
> replicated set with size=2
> The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
> the cache SSD. It has power loss protection
> My crush map is setup to ensure the cache pool uses only the 4 850 pro and 
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san 
> to the ceph volume, and once the cache tier gets to my  target_max_bytes of 
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the 
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The 
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> sdb
>   0.00 0.339.00   84.33 0.9620.11   462.40
> 75.92  397.56  125.67  426.58  10.70  99.90
>   0.00 0.67   30.00   87.33 5.9621.03   471.20
> 67.86  910.95   87.00 1193.99   8.27  97.07
>   0.0016.67   33.00  289.33 4.2118.80   146.20
> 29.83   88.99   93.91   88.43   3.10  99.83
>   0.00 7.337.67  261.67 1.9219.63   163.81   
> 117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is 
> backed up
>
> Device:   rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz avgqu-sz 
>   await r_await w_await  svctm  %util
> Sdg (journal for cache drive)
>   0.00 6.330.008.00 0.00 0.0719.04
>  0.000.330.000.33   0.33   0.27
> Sdb (cache drive)
>   0.00 0.333.33   82.00 0.8320.07   501.68   
> 106.75 1057.81  269.40 1089.86  11.72 100.00
> Sda (4TB EC)
>   0.00 0.000.004.00 0.00 0.02 9.33
>  0.000.000.000.00   0.00   0.00
> Sdd (4TB EC)
>   0.00 0.000.002.33 0.00 0.45   392.00
>  0.08   34.000.00   34.00   6.86   1.60
> Sdf (4TB EC)
>   0.0014.000.00   26.00 0.00 0.2217.71
>  1.00   38.550.00   38.55   0.68   1.77
> Sdc (4TB EC)
>   0.00 0.000.001.33 0.00 0.01 8.75
>  0.02   12.250.00   12.25  12.25   1.63
>
> While at this time is just complaining about slow osd.0, sometimes the other 
> cache tier ssds show some slow response, but not as frequently.
>
>
> I occasionally see complaints about a client not responding to cache 
> pressure, and yesterday while copying serveral terabytes, the client doing 
> the copy was noted for failing to respond to capability release, and I ended 
> up rebooting it.
>
> I just seems the cluster isn’t handling large amounts of data copies, like 
> and nfs or san based volume would, and I am worried about moving our users to 
> a cluster that already is showing signs of performance issues, even when I am 
> just doing a copy with no other users. I am doing only one rsync at a time.
>
> Is the problem that I need to user a later kernel for the clients mounting 
> the volume ? I have read some posts about that, but the docs say centos 7 
> with 3.10 is ok.

Which docs say to use the stock centos kernel?  >4.4 is recommended
here: http://docs.ceph.com/docs/master/cephfs/best-practices/

> Do I need more drives in my cache pool? I only have 4 ssd drive in the cache 
> pool (one on each host), with each having a separate journal drive.
> But is that too much of a hot spot since all i/o has to go to the cache layer?
> It seems like my ssds should be able to keep up with a single rsync copy.
> Is there something set wrong on my ssds that they cant keep up?

Cache tiering is pretty sensitive to 

[ceph-users] New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

2016-10-19 Thread Jim Kilborn
I have setup a new linux cluster to allow migration from our old SAN based 
cluster to a new cluster with ceph.
All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
I am basically running stock ceph settings, with just turning the write cache 
off via hdparm on the drives, and temporarily turning off scrubbing.
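
(For the record, those two tweaks amount to something like the following; device 
names are examples:)

# Disable the volatile write cache on the OSD data disks
hdparm -W0 /dev/sda /dev/sdb /dev/sdc /dev/sdd

# Temporarily stop scrubbing cluster-wide (undo later with "ceph osd unset ...")
ceph osd set noscrub
ceph osd set nodeep-scrub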

The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So 
Server performance should be good.  Since I am running cephfs, I have tiering 
setup.
Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So 
the idea is to ensure a single host failure.
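
(A rough sketch of how that profile and pool would be created under Jewel; pool 
name and PG counts are examples only:)

ceph osd erasure-code-profile set ec-k3m1 k=3 m=1 ruleset-failure-domain=host
ceph osd pool create cephfs_data_ec 256 256 erasure ec-k3m1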
Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a 
replicated set with size=2
The cache tier also has a 128GB SM863 SSD that is being used as a journal for 
the cache SSD. It has power loss protection
My crush map is setup to ensure the cache pool uses only the 4 850 pro and the 
erasure code uses only the 16 spinning 4TB drives.
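
(Pre-Luminous there are no device classes, so that separation is typically done 
with a dedicated CRUSH root for the SSDs and a rule that only draws from it; a 
minimal sketch from the CLI, with bucket/rule/pool names made up:)

# Dedicated root and per-host buckets holding only the cache SSD OSDs
ceph osd crush add-bucket ssd-cache root
ceph osd crush add-bucket ceph01-ssd host
ceph osd crush move ceph01-ssd root=ssd-cache
ceph osd crush create-or-move osd.0 0.87 root=ssd-cache host=ceph01-ssd   # repeat per host/SSD

# Replicated rule restricted to that root, then point the cache pool at it
ceph osd crush rule create-simple ssd-cache-rule ssd-cache host
ceph osd pool set cache-pool crush_ruleset 1   # id from "ceph osd crush rule dump"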

The problems that I am seeing is that I start copying data from our old san to 
the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 
TB, I start seeing:

HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; 
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
26 ops are blocked > 65.536 sec on osd.0
37 ops are blocked > 32.768 sec on osd.0
1 osds have slow requests
noout,noscrub,nodeep-scrub,sortbitwise flag(s) set

osd.0 is the cache ssd
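
(For reference, the thresholds that decide when the tier starts flushing dirty 
objects and evicting clean ones, relative to target_max_bytes, are per-pool 
settings; example values on a cache pool named "cache":)

ceph osd pool set cache target_max_bytes 1400000000000
ceph osd pool set cache cache_target_dirty_ratio 0.4
ceph osd pool set cache cache_target_full_ratio 0.8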

If I watch iostat on the cache ssd, I see the queue lengths are high and the 
await are high
Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz 
is between 87 and 182 and the await is between 88ms and 1193ms

Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await  w_await  svctm  %util
sdb        0.00    0.33   9.00   84.33   0.96  20.11    462.40     75.92   397.56   125.67   426.58  10.70  99.90
sdb        0.00    0.67  30.00   87.33   5.96  21.03    471.20     67.86   910.95    87.00  1193.99   8.27  97.07
sdb        0.00   16.67  33.00  289.33   4.21  18.80    146.20     29.83    88.99    93.91    88.43   3.10  99.83
sdb        0.00    7.33   7.67  261.67   1.92  19.63    163.81    117.42   331.97   182.04   336.36   3.71 100.00


If I look at the iostat for all the drives, only the cache ssd drive is backed 
up

Device                         rrqm/s  wrqm/s   r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz    await  r_await  w_await  svctm  %util
Sdg (journal for cache drive)    0.00    6.33  0.00   8.00   0.00   0.07     19.04      0.00     0.33     0.00     0.33   0.33   0.27
Sdb (cache drive)                0.00    0.33  3.33  82.00   0.83  20.07    501.68    106.75  1057.81   269.40  1089.86  11.72 100.00
Sda (4TB EC)                     0.00    0.00  0.00   4.00   0.00   0.02      9.33      0.00     0.00     0.00     0.00   0.00   0.00
Sdd (4TB EC)                     0.00    0.00  0.00   2.33   0.00   0.45    392.00      0.08    34.00     0.00    34.00   6.86   1.60
Sdf (4TB EC)                     0.00   14.00  0.00  26.00   0.00   0.22     17.71      1.00    38.55     0.00    38.55   0.68   1.77
Sdc (4TB EC)                     0.00    0.00  0.00   1.33   0.00   0.01      8.75      0.02    12.25     0.00    12.25  12.25   1.63

While at this time it is just complaining about slow osd.0, sometimes the other 
cache tier ssds show some slow response, but not as frequently.


I occasionally see complaints about a client not responding to cache pressure, 
and yesterday while copying several terabytes, the client doing the copy was 
noted for failing to respond to capability release, and I ended up rebooting it.
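
(Before rebooting a client in that state, it can be worth checking what the MDS 
thinks each session is holding; the MDS name below is a placeholder:)

# List client sessions as seen by the MDS (run on the MDS host)
ceph daemon mds.ceph01 session ls

# Overall cluster / MDS health, including which clients are being flagged
ceph -s
ceph health detail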

It just seems the cluster isn't handling large amounts of data copies like an 
NFS or SAN based volume would, and I am worried about moving our users to a 
cluster that is already showing signs of performance issues, even when I am 
just doing a copy with no other users. I am doing only one rsync at a time.

Is the problem that I need to use a later kernel for the clients mounting the 
volume? I have read some posts about that, but the docs say centos 7 with 3.10 
is ok.
Do I need more drives in my cache pool? I only have 4 ssd drives in the cache 
pool (one on each host), with each having a separate journal drive.
But is that too much of a hot spot since all i/o has to go to the cache layer?
It seems like my ssds should be able to keep up with a single rsync copy.
Is there something set wrong on my ssds that they can't keep up?
I put the metadata pool on the ssd cache tier drives as well.

Any ideas where the problem is or what I need to change to make this stable?


Thanks. Additional details below

The ceph osd tier drives are osd 0,  5,  10, 15

ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    63155G     50960G     12195G       19.31
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS