Thanks Christian for the additional information and comments.
· Upgraded the kernels, but still had poor performance.
· Removed all the pools and recreated them with just a replication of 3,
with two pools for data and metadata. No cache tier pool.
· Turned the write caching back on with hdparm. We have a large UPS
and dual power supplies in the ceph unit; if we get a long power outage,
everything will go down anyway.
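For reference, the pool recreation and write-cache steps above look roughly like this (a sketch only; the pool names, PG counts, and device names are my assumptions for illustration, not taken from the actual cluster):

```shell
# Sketch only -- pool names, PG counts, and device names are assumptions.
# Recreate the data and metadata pools as plain replicated pools, size 3.
ceph osd pool create cephfs_data 512 512 replicated
ceph osd pool create cephfs_metadata 128 128 replicated
ceph osd pool set cephfs_data size 3
ceph osd pool set cephfs_metadata size 3

# Re-enable the on-drive write cache (acceptable here given the UPS and
# dual power supplies).
for dev in /dev/sd{b,c,d,e}; do
    hdparm -W1 "$dev"
done
```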
I am no longer seeing the slow requests, blocked ops, etc.
I think I will push for the following design per ceph server:
8 4TB sata drives
2 Samsung 128GB SM863 SSDs, each holding 4 osd journals
With 4 hosts, and a replication of 3 to start with
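As a quick sanity check on that layout, the usable capacity works out roughly as follows (a sketch that ignores full/near-full ratios and filesystem overhead):

```shell
# Capacity sketch: 4 hosts x 8 drives x 4 TB raw, replication size=3.
hosts=4; drives=8; tb=4; repl=3
raw=$((hosts * drives * tb))
echo "raw: ${raw} TB"                                 # 128 TB raw
awk -v r="$raw" -v p="$repl" \
    'BEGIN { printf "usable: %.1f TB\n", r / p }'     # ~42.7 TB usable
```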
I did a quick test with 4 x 4TB spinners and 1 Samsung 128GB SM863 SSD holding
the 4 osd journals, with 4 hosts in the cluster over InfiniBand.
At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and
5.5 Gb/sec over InfiniBand, which is around 600 MB/sec and translates well to
the fio number.
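The link-rate arithmetic can be double-checked with a one-liner (assuming 1 Gb/sec = 1000 Mb/sec and 8 bits per byte):

```shell
# Convert the observed InfiniBand rates from Gb/sec to MB/sec.
for gbps in 4.5 5.5; do
    awk -v g="$gbps" \
        'BEGIN { printf "%.1f Gb/sec ~= %.1f MB/sec\n", g, g * 1000 / 8 }'
done
# 4.5-5.5 Gb/sec works out to 562.5-687.5 MB/sec, which brackets the
# ~662 MB/sec fio 4M read result below.
```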
fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K}
--numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting
--name=journal-test
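The brace notation above is shorthand for the test matrix; a hypothetical wrapper that expands it into the eight separate fio runs (echoing the commands rather than executing them, so it can be reviewed first) might look like:

```shell
# Expand the {rw} x {bs} matrix into eight fio invocations.
# Drop the leading "echo" to actually run the benchmarks.
for rw in write randwrite read randread; do
    for bs in 4M 4K; do
        echo fio --direct=1 --sync=1 --rw="$rw" --bs="$bs" --numjobs=1 \
             --iodepth=1 --runtime=60 --size=5G --time_based \
             --group_reporting --name="journal-test-$rw-$bs"
    done
done
```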
FIO Test         Local disk            SAN/NFS               Ceph w/Repl/SSD journal
4M Writes        53 MB/sec 12 IOPS     62 MB/sec 15 IOPS     151 MB/sec 37 IOPS
4M Rand Writes   34 MB/sec 8 IOPS      63 MB/sec 15 IOPS     155 MB/sec 37 IOPS
4M Read          66 MB/sec 15 IOPS     102 MB/sec 25 IOPS    662 MB/sec 161 IOPS
4M Rand Read     73 MB/sec 17 IOPS     103 MB/sec 25 IOPS    670 MB/sec 163 IOPS
4K Writes        2.9 MB/sec 738 IOPS   3.8 MB/sec 952 IOPS   2.3 MB/sec 571 IOPS
4K Rand Writes   551 KB/sec 134 IOPS   3.6 MB/sec 911 IOPS   2.0 MB/sec 501 IOPS
4K Read          28 MB/sec 7001 IOPS   8 MB/sec 1945 IOPS    13 MB/sec 3256 IOPS
4K Rand Read     263 KB/sec            5 MB/sec 1246 IOPS    8 MB/sec 2015 IOPS
That performance is fine for our needs.
Again, thanks for the help guys.
Regards,
Jim
From: Christian Balzer<mailto:[email protected]>
Sent: Wednesday, October 19, 2016 7:54 PM
To: [email protected]<mailto:[email protected]>
Cc: Jim Kilborn<mailto:[email protected]>
Subject: Re: [ceph-users] New cephfs cluster performance issues- Jewel - cache
pressure, capability release, poor iostat await avg queue size
Hello,
On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:
> I have setup a new linux cluster to allow migration from our old SAN based
> cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.
> I am basically running stock ceph settings, with just turning the write cache
> off via hdparm on the drives, and temporarily turning off scrubbing.
>
The former is bound to kill performance; if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.
The latter I venture you did because performance was abysmal with scrubbing
enabled, which is always a good indicator that your cluster needs tuning
and improvement.
> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So
> Server performance should be good.
Memory is fine; the CPU I can't tell from the model number, and I'm not
inclined to look it up or guess, but that usually only becomes a bottleneck
when dealing with an all-SSD setup and things requiring the lowest latency
possible.
> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.
> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1.
> So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a
> replicated set with size=2
This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill-suited for this task, even with the DC-level SSDs below as journals.
As such, a replication of 2 is also ill-advised; I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.
> The cache tier also has a 128GB SM863 SSD that is being used as a journal for
> the cache SSD. It has power loss protection
Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.
> My crush map is setup to ensure the cache pool uses only the 4 850 pro and
> the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san
> to the ceph volume, and once the cache tier gets to my target_max_bytes of
> 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests;
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the
> await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device: rrqm/s wrqm/s   r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
> sdb       0.00   0.33  9.00   84.33   0.96  20.11   462.40    75.92 397.56  125.67  426.58 10.70  99.90
> sdb       0.00   0.67 30.00   87.33   5.96  21.03   471.20    67.86 910.95   87.00 1193.99  8.27  97.07
> sdb       0.00  16.67 33.00  289.33   4.21  18.80   146.20    29.83  88.99   93.91   88.43  3.10  99.83
> sdb       0.00   7.33  7.67  261.67   1.92  19.63   163.81   117.42 331.97  182.04  336.36  3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is
> backed up
>
Yes, consumer SSDs on top of a design that channels everything through
them.
Rebuild your cluster along more conventional and conservative lines; don't
use the 850 PROs.
Feel free to run any new design by us.
Christian
--
Christian Balzer Network/Systems Engineer
[email protected] Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com