On the machine in question, the 2nd newest, we are using the LSI MegaRAID
SAS-3 3008 [Fury], which offers a "Non-RAID" option and has no battery.
The older two machines use the LSI MegaRAID SAS 2208 [Thunderbolt] I
reported earlier, with each single drive configured as RAID0.

Thanks for everyone's help.
I am going to run a 32-thread bench test after taking the 2nd machine out
of the cluster with noout.
Once it is out of the cluster, I am expecting the slow write issue will
not surface.
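
For reference, here is a minimal sketch of how I plan to take it out
(standard Ceph commands, assuming systemd-managed OSDs on that host):

ceph osd set noout              # keep CRUSH from marking the down OSDs out
systemctl stop ceph-osd.target  # on the 2nd machine: stop all of its OSDs
# ... run the 32-thread rados bench from a client ...
systemctl start ceph-osd.target # bring the OSDs back
ceph osd unset noout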


On Fri, Oct 20, 2017 at 5:27 AM, David Turner <[email protected]> wrote:

> I can attest that the battery in the RAID controller is a thing. I'm used
> to LSI controllers, but my current position has HP RAID controllers, and we
> just tracked down 10 of our nodes that pretty much always had >100ms await;
> they were the only 10 nodes in the cluster with failed batteries on their
> RAID controllers.
>
> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <[email protected]> wrote:
>
>>
>> Hello,
>>
>> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
>>
>> > That is a good idea.
>> > However, a previous rebalancing process has brought performance of our
>> > Guest VMs to a slow drag.
>> >
>>
>> Never mind that I'm not sure these SSDs are particularly well suited
>> for Ceph; your problem is clearly located on that one node.
>>
>> Not that I think it's the case, but make sure your PG distribution is not
>> skewed with many more PGs per OSD on that node.
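>>
>> A quick way to check the per-OSD PG counts, assuming a ceph CLI recent
>> enough to have it, is:
>>
>> ceph osd df tree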
>>
>> Once you rule that out, my first guess is the RAID controller; you're
>> running the SSDs as single RAID0s, I presume?
>> If so, either a configuration difference or a failed BBU on the controller
>> could result in the writeback cache being disabled, which would explain
>> things beautifully.
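>>
>> On LSI controllers, something along these lines should show both the BBU
>> state and the cache policy (a sketch assuming the MegaCli utility is
>> installed; storcli syntax differs):
>>
>> MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL   # BBU health
>> MegaCli64 -LDGetProp -Cache -LAll -aALL    # WriteBack vs WriteThrough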
>>
>> As for a temporary test/fix (with reduced redundancy of course), set noout
>> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow host
>> off.
>>
>> This should result in much better performance than you have now and of
>> course be the final confirmation of that host being the culprit.
>>
>> Christian
>>
>> >
>> > On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez <[email protected]> wrote:
>> >
>> > > Hi Russell,
>> > >
>> > > as you have 4 servers, assuming you are not doing EC pools, just stop
>> > > all the OSDs on the second questionable server, mark the OSDs on that
>> > > server as out, let the cluster rebalance, and when all PGs are
>> > > active+clean just replay the test.
>> > >
>> > > All IOs should then go only to the other 3 servers.
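>> > >
>> > > Roughly, assuming systemd units and with the OSD id range below as a
>> > > placeholder for whatever is on that server:
>> > >
>> > > systemctl stop ceph-osd.target    # on the questionable server
>> > > for i in $(seq 9 17); do ceph osd out $i; done
>> > > ceph -w    # watch until all PGs are active+clean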
>> > >
>> > > JC
>> > >
>> > > On Oct 19, 2017, at 13:49, Russell Glaue <[email protected]> wrote:
>> > >
>> > > No, I have not ruled out the disk controller and backplane making the
>> > > disks slower.
>> > > Is there a way I could test that theory, other than swapping out
>> > > hardware?
>> > > -RG
>> > >
>> > > On Thu, Oct 19, 2017 at 3:44 PM, David Turner <[email protected]>
>> > > wrote:
>> > >
>> > >> Have you ruled out the disk controller and backplane in the server
>> > >> running slower?
>> > >>
>> > >> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <[email protected]> wrote:
>> > >>
>> > >>> I ran the test on the Ceph pool, and ran atop on all 4 storage
>> > >>> servers, as suggested.
>> > >>>
>> > >>> Out of the 4 servers:
>> > >>> 3 of them performed with 17% to 30% disk %busy and 11% CPU wait,
>> > >>> momentarily spiking up to 50% on one server and 80% on another.
>> > >>> The 2nd newest server averaged almost 90% disk %busy and 150% CPU
>> > >>> wait, and more than momentarily spiked to 101% disk busy and 250%
>> > >>> CPU wait.
>> > >>> For this 2nd newest server, these were the statistics for about 8 of
>> > >>> its 9 disks, with the 9th disk not far behind the others.
>> > >>>
>> > >>> I cannot believe all 9 disks are bad.
>> > >>> They are the same disks as in the newest, 1st server,
>> > >>> Crucial_CT960M500SSD1, and the same exact server hardware too.
>> > >>> They were purchased at the same time, in the same purchase order, and
>> > >>> arrived at the same time.
>> > >>> So I cannot believe I just happened to put 9 bad disks in one server
>> > >>> and 9 good ones in the other.
>> > >>>
>> > >>> I know I have Ceph configured exactly the same on all servers.
>> > >>> And I am sure I have the hardware settings configured exactly the
>> > >>> same on the 1st and 2nd servers.
>> > >>> So if I were someone else, I would say it is maybe bad hardware on
>> > >>> the 2nd server.
>> > >>> But the 2nd server is running very well, without any hint of a
>> > >>> problem.
>> > >>>
>> > >>> Any other ideas or suggestions?
>> > >>>
>> > >>> -RG
>> > >>>
>> > >>>
>> > >>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar <[email protected]> wrote:
>> > >>>
>> > >>>> just run the same 32-threaded rados test as you did before, and
>> > >>>> this time run atop while the test is running, looking for %busy of
>> > >>>> cpu/disks. It should give an idea if there is a bottleneck in them.
>> > >>>>
>> > >>>> On 2017-10-18 21:35, Russell Glaue wrote:
>> > >>>>
>> > >>>> I cannot run the write test reviewed at the
>> > >>>> ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog.
>> > >>>> The tests write directly to the raw disk device.
>> > >>>> Reading an infile (created with urandom) on one SSD and writing the
>> > >>>> outfile to another OSD yields about 17MB/s.
>> > >>>> But isn't this write speed limited by the speed at which the dd
>> > >>>> infile can be read?
>> > >>>> And I assume the best test should be run with no other load.
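>> > >>>>
>> > >>>> A variant I could try, which avoids the infile read bottleneck, is
>> > >>>> to write zeros with direct, synchronous IO to a scratch file on an
>> > >>>> OSD's filesystem (the path here is just an example):
>> > >>>>
>> > >>>> dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/scratch bs=4k count=25600 oflag=direct,dsync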
>> > >>>>
>> > >>>> How does one run the rados bench "as stress"?
>> > >>>>
>> > >>>> -RG
>> > >>>>
>> > >>>>
>> > >>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar <[email protected]> wrote:
>> > >>>>
>> > >>>>> measuring resource load as outlined earlier will show if the
>> > >>>>> drives are performing well or not. Also, how many OSDs do you have?
>> > >>>>>
>> > >>>>> On 2017-10-18 19:26, Russell Glaue wrote:
>> > >>>>>
>> > >>>>> The SSD drives are Crucial M500
>> > >>>>> A Ceph user did some benchmarks and found it had good performance
>> > >>>>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
>> > >>>>>
>> > >>>>> However, a user comment from 3 years ago on the blog post you
>> > >>>>> linked to says to avoid the Crucial M500
>> > >>>>>
>> > >>>>> Yet this performance post says the Crucial M500 is good.
>> > >>>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>> > >>>>>
>> > >>>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar <[email protected]> wrote:
>> > >>>>>
>> > >>>>>> Check out the following link: some SSDs perform badly in Ceph due
>> > >>>>>> to sync writes to journal
>> > >>>>>>
>> > >>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>> > >>>>>>
>> > >>>>>> Another thing that can help is to re-run the rados 32-thread test
>> > >>>>>> as stress and view resource usage using atop (or collectl/sar) to
>> > >>>>>> check %busy cpu and %busy disks, to give you an idea of what is
>> > >>>>>> holding down your cluster. For example: if cpu/disk % are all low,
>> > >>>>>> then check your network/switches. If disk %busy is high (90%) for
>> > >>>>>> all disks, then your disks are the bottleneck: which either means
>> > >>>>>> you have SSDs that are not suitable for Ceph or you have too few
>> > >>>>>> disks (which I doubt is the case). If only 1 disk's %busy is high,
>> > >>>>>> there may be something wrong with this disk and it should be
>> > >>>>>> removed.
>> > >>>>>>
>> > >>>>>> Maged
>> > >>>>>>
>> > >>>>>> On 2017-10-18 18:13, Russell Glaue wrote:
>> > >>>>>>
>> > >>>>>> In my previous post, in one of my points, I was wondering if the
>> > >>>>>> request size would increase if I enabled jumbo packets. Currently
>> > >>>>>> it is disabled.
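>> > >>>>>>
>> > >>>>>> If I do test jumbo frames, the checks would be roughly as follows
>> > >>>>>> (eth2 stands in for my storage NIC; every host and switch port in
>> > >>>>>> the path must agree on the MTU):
>> > >>>>>>
>> > >>>>>> ip link show eth2                        # current MTU
>> > >>>>>> ip link set eth2 mtu 9000                # enable jumbo frames
>> > >>>>>> ping -M do -s 8972 <peer-storage-node>   # 9000 minus 28 bytes of headers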
>> > >>>>>>
>> > >>>>>> @jdillama: The qemu settings for both of these guest machines,
>> > >>>>>> with RAID/LVM and Ceph/rbd images, are the same. I am not thinking
>> > >>>>>> that changing the qemu settings of "min_io_size=<limited to
>> > >>>>>> 16bits>,opt_io_size=<RBD image object size>" will directly address
>> > >>>>>> the issue.
>> > >>>>>>
>> > >>>>>> @mmokhtar: OK. So you suggest the request size is the result of
>> > >>>>>> the problem and not the cause of it, meaning I should go after a
>> > >>>>>> different issue.
>> > >>>>>>
>> > >>>>>> I have been trying to get write speeds up to what people on this
>> > >>>>>> mailing list are discussing.
>> > >>>>>> It seems that for our configuration, as it matches others, we
>> > >>>>>> should be getting about 70MB/s write speed.
>> > >>>>>> But we are not getting that.
>> > >>>>>> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are
>> > >>>>>> typically 1MB/s to 2MB/s.
>> > >>>>>> Monitoring the entire Ceph cluster (using
>> > >>>>>> http://cephdash.crapworks.de/), I have seen very rare momentary
>> > >>>>>> spikes up to 30MB/s.
>> > >>>>>>
>> > >>>>>> My storage network is connected via a 10Gb switch.
>> > >>>>>> I have 4 storage servers with an LSI Logic MegaRAID SAS 2208
>> > >>>>>> controller.
>> > >>>>>> Each storage server has 9 1TB SSD drives, each drive as 1 OSD (no
>> > >>>>>> RAID).
>> > >>>>>> Each drive is one LVM group with two volumes: one volume for the
>> > >>>>>> OSD, one volume for the journal.
>> > >>>>>> Each OSD is formatted with xfs.
>> > >>>>>> The crush map is simple: default->rack->[host[1..4]->osd], with an
>> > >>>>>> evenly distributed weight.
>> > >>>>>> The redundancy is triple replication.
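>> > >>>>>>
>> > >>>>>> For reference, the crush hierarchy and weights can be
>> > >>>>>> double-checked with:
>> > >>>>>>
>> > >>>>>> ceph osd tree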
>> > >>>>>>
>> > >>>>>> While I have read comments that having the OSD and journal on the
>> > >>>>>> same disk decreases write speed, I have also read that once past 8
>> > >>>>>> OSDs per node this is the recommended configuration; this is also
>> > >>>>>> the reason why SSD drives are used exclusively for OSDs in the
>> > >>>>>> storage nodes.
>> > >>>>>> Nonetheless, I was still expecting write speeds to be above
>> > >>>>>> 30MB/s, not below 6MB/s.
>> > >>>>>> Even at 12x slower than the RAID, using my previously posted
>> > >>>>>> iostat data set, I should be seeing write speeds that average
>> > >>>>>> 10MB/s, not 2MB/s.
>> > >>>>>>
>> > >>>>>> In regards to the rados benchmark tests you asked me to run, here
>> > >>>>>> is the output:
>> > >>>>>>
>> > >>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 1
>> > >>>>>> Maintaining 1 concurrent writes of 4096 bytes to objects of size
>> > >>>>>> 4096 for up to 30 seconds or 0 objects
>> > >>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
>> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>> > >>>>>>     0       0         0         0         0         0           -           0
>> > >>>>>>     1       1       201       200   0.78356   0.78125  0.00522307  0.00496574
>> > >>>>>>     2       1       469       468  0.915303   1.04688  0.00437497  0.00426141
>> > >>>>>>     3       1       741       740  0.964371    1.0625  0.00512853   0.0040434
>> > >>>>>>     4       1       888       887  0.866739  0.574219  0.00307699  0.00450177
>> > >>>>>>     5       1      1147      1146  0.895725   1.01172  0.00376454   0.0043559
>> > >>>>>>     6       1      1325      1324  0.862293  0.695312  0.00459443    0.004525
>> > >>>>>>     7       1      1494      1493   0.83339  0.660156  0.00461002  0.00458452
>> > >>>>>>     8       1      1736      1735  0.847369  0.945312  0.00253971  0.00460458
>> > >>>>>>     9       1      1998      1997  0.866922   1.02344  0.00236573  0.00450172
>> > >>>>>>    10       1      2260      2259  0.882563   1.02344  0.00262179  0.00442152
>> > >>>>>>    11       1      2526      2525  0.896775   1.03906  0.00336914  0.00435092
>> > >>>>>>    12       1      2760      2759  0.898203  0.914062  0.00351827  0.00434491
>> > >>>>>>    13       1      3016      3015  0.906025         1  0.00335703  0.00430691
>> > >>>>>>    14       1      3257      3256  0.908545  0.941406  0.00332344  0.00429495
>> > >>>>>>    15       1      3490      3489  0.908644  0.910156  0.00318815  0.00426387
>> > >>>>>>    16       1      3728      3727  0.909952  0.929688   0.0032881  0.00428895
>> > >>>>>>    17       1      3986      3985  0.915703   1.00781  0.00274809   0.0042614
>> > >>>>>>    18       1      4250      4249  0.922116   1.03125  0.00287411  0.00423214
>> > >>>>>>    19       1      4505      4504  0.926003  0.996094  0.00375435  0.00421442
>> > >>>>>> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
>> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>> > >>>>>>    20       1      4757      4756  0.928915  0.984375  0.00463972  0.00420118
>> > >>>>>>    21       1      5009      5008   0.93155  0.984375  0.00360065  0.00418937
>> > >>>>>>    22       1      5235      5234  0.929329  0.882812  0.00626214    0.004199
>> > >>>>>>    23       1      5500      5499  0.933925   1.03516  0.00466584  0.00417836
>> > >>>>>>    24       1      5708      5707  0.928861    0.8125  0.00285727  0.00420146
>> > >>>>>>    25       0      5964      5964  0.931858   1.00391  0.00417383   0.0041881
>> > >>>>>>    26       1      6216      6215  0.933722  0.980469   0.0041009  0.00417915
>> > >>>>>>    27       1      6481      6480  0.937474   1.03516  0.00307484  0.00416118
>> > >>>>>>    28       1      6745      6744  0.940819   1.03125  0.00266329  0.00414777
>> > >>>>>>    29       1      7003      7002  0.943124   1.00781  0.00305905  0.00413758
>> > >>>>>>    30       1      7271      7270  0.946578   1.04688  0.00391017  0.00412238
>> > >>>>>> Total time run:         30.006060
>> > >>>>>> Total writes made:      7272
>> > >>>>>> Write size:             4096
>> > >>>>>> Object size:            4096
>> > >>>>>> Bandwidth (MB/sec):     0.946684
>> > >>>>>> Stddev Bandwidth:       0.123762
>> > >>>>>> Max bandwidth (MB/sec): 1.0625
>> > >>>>>> Min bandwidth (MB/sec): 0.574219
>> > >>>>>> Average IOPS:           242
>> > >>>>>> Stddev IOPS:            31
>> > >>>>>> Max IOPS:               272
>> > >>>>>> Min IOPS:               147
>> > >>>>>> Average Latency(s):     0.00412247
>> > >>>>>> Stddev Latency(s):      0.00648437
>> > >>>>>> Max latency(s):         0.270553
>> > >>>>>> Min latency(s):         0.00175318
>> > >>>>>> Cleaning up (deleting benchmark objects)
>> > >>>>>> Clean up completed and total clean up time :29.069423
>> > >>>>>>
>> > >>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 32
>> > >>>>>> Maintaining 32 concurrent writes of 4096 bytes to objects of size
>> > >>>>>> 4096 for up to 30 seconds or 0 objects
>> > >>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
>> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>> > >>>>>>     0       0         0         0         0         0           -           0
>> > >>>>>>     1      32      3013      2981   11.6438   11.6445  0.00247906  0.00572026
>> > >>>>>>     2      32      5349      5317   10.3834     9.125  0.00246662  0.00932016
>> > >>>>>>     3      32      5707      5675    7.3883   1.39844  0.00389774   0.0156726
>> > >>>>>>     4      32      5895      5863   5.72481  0.734375     1.13137   0.0167946
>> > >>>>>>     5      32      6869      6837   5.34068   3.80469   0.0027652   0.0226577
>> > >>>>>>     6      32      8901      8869   5.77306    7.9375   0.0053211   0.0216259
>> > >>>>>>     7      32     10800     10768   6.00785   7.41797  0.00358187   0.0207418
>> > >>>>>>     8      32     11825     11793   5.75728   4.00391  0.00217575   0.0215494
>> > >>>>>>     9      32     12941     12909    5.6019   4.35938  0.00278512   0.0220567
>> > >>>>>>    10      32     13317     13285   5.18849   1.46875   0.0034973   0.0240665
>> > >>>>>>    11      32     16189     16157   5.73653   11.2188  0.00255841   0.0212708
>> > >>>>>>    12      32     16749     16717   5.44077    2.1875  0.00330334   0.0215915
>> > >>>>>>    13      32     16756     16724   5.02436 0.0273438  0.00338994    0.021849
>> > >>>>>>    14      32     17908     17876   4.98686       4.5  0.00402598   0.0244568
>> > >>>>>>    15      32     17936     17904   4.66171  0.109375  0.00375799   0.0245545
>> > >>>>>>    16      32     18279     18247   4.45409   1.33984  0.00483873   0.0267929
>> > >>>>>>    17      32     18372     18340   4.21346  0.363281  0.00505187   0.0275887
>> > >>>>>>    18      32     19403     19371   4.20309   4.02734  0.00545154    0.029348
>> > >>>>>>    19      31     19845     19814   4.07295   1.73047  0.00254726   0.0306775
>> > >>>>>> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
>> > >>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>> > >>>>>>    20      31     20401     20370   3.97788   2.17188  0.00307238   0.0307559
>> > >>>>>>    21      32     21338     21306   3.96254   3.65625  0.00464563   0.0312288
>> > >>>>>>    22      32     23057     23025    4.0876   6.71484  0.00296295   0.0299267
>> > >>>>>>    23      32     23057     23025   3.90988         0           -   0.0299267
>> > >>>>>>    24      32     23803     23771   3.86837   1.45703  0.00301471   0.0312804
>> > >>>>>>    25      32     24112     24080   3.76191   1.20703  0.00191063   0.0331462
>> > >>>>>>    26      31     25303     25272   3.79629   4.65625  0.00794399   0.0329129
>> > >>>>>>    27      32     28803     28771   4.16183    13.668   0.0109817   0.0297469
>> > >>>>>>    28      32     29592     29560   4.12325   3.08203  0.00188185   0.0301911
>> > >>>>>>    29      32     30595     30563   4.11616   3.91797  0.00379099   0.0296794
>> > >>>>>>    30      32     31031     30999   4.03572   1.70312  0.00283347   0.0302411
>> > >>>>>> Total time run:         30.822350
>> > >>>>>> Total writes made:      31032
>> > >>>>>> Write size:             4096
>> > >>>>>> Object size:            4096
>> > >>>>>> Bandwidth (MB/sec):     3.93282
>> > >>>>>> Stddev Bandwidth:       3.66265
>> > >>>>>> Max bandwidth (MB/sec): 13.668
>> > >>>>>> Min bandwidth (MB/sec): 0
>> > >>>>>> Average IOPS:           1006
>> > >>>>>> Stddev IOPS:            937
>> > >>>>>> Max IOPS:               3499
>> > >>>>>> Min IOPS:               0
>> > >>>>>> Average Latency(s):     0.0317779
>> > >>>>>> Stddev Latency(s):      0.164076
>> > >>>>>> Max latency(s):         2.27707
>> > >>>>>> Min latency(s):         0.0013848
>> > >>>>>> Cleaning up (deleting benchmark objects)
>> > >>>>>> Clean up completed and total clean up time :20.166559
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <[email protected]> wrote:
>> > >>>>>>
>> > >>>>>>> First a general comment: local RAID will be faster than Ceph for
>> > >>>>>>> a single-threaded (queue depth=1) io operation test. A
>> > >>>>>>> single-threaded Ceph client will see at best the same speed as a
>> > >>>>>>> single disk for reads, and 4-6 times slower than a single disk
>> > >>>>>>> for writes. Not to mention the latency of local disks will be
>> > >>>>>>> much better. Where Ceph shines is when you have many concurrent
>> > >>>>>>> ios: it scales, whereas RAID will decrease speed per client as
>> > >>>>>>> you add more.
>> > >>>>>>>
>> > >>>>>>> Having said that, I would recommend running rados/rbd
>> > >>>>>>> bench-write and measuring 4k iops at 1 and 32 threads to get a
>> > >>>>>>> better idea of how your cluster performs:
>> > >>>>>>>
>> > >>>>>>> ceph osd pool create testpool 256 256
>> > >>>>>>> rados bench -p testpool -b 4096 30 write -t 1
>> > >>>>>>> rados bench -p testpool -b 4096 30 write -t 32
>> > >>>>>>> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>> > >>>>>>>
>> > >>>>>>> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
>> > >>>>>>> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
>> > >>>>>>>
>> > >>>>>>> I think the request size difference you see is due to the io
>> > >>>>>>> scheduler: in the case of local disks it has more ios to
>> > >>>>>>> re-group, so it has a better chance of generating larger
>> > >>>>>>> requests. Depending on your kernel, the io scheduler may be
>> > >>>>>>> different for rbd (blk-mq) vs sdX (cfq), but again I would think
>> > >>>>>>> the request size is a result, not a cause.
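>> > >>>>>>>
>> > >>>>>>> You can check which scheduler each device is using via sysfs, for
>> > >>>>>>> example (device names are just examples):
>> > >>>>>>>
>> > >>>>>>> cat /sys/block/sda/queue/scheduler
>> > >>>>>>> cat /sys/block/rbd0/queue/scheduler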
>> > >>>>>>>
>> > >>>>>>> Maged
>> > >>>>>>>
>> > >>>>>>> On 2017-10-17 23:12, Russell Glaue wrote:
>> > >>>>>>>
>> > >>>>>>> I am running ceph jewel on 5 nodes with SSD OSDs.
>> > >>>>>>> I have an LVM image on a local RAID of spinning disks.
>> > >>>>>>> I have an RBD image in a pool of SSD disks.
>> > >>>>>>> Both disks are used to run an almost identical CentOS 7 system.
>> > >>>>>>> Both systems were installed with the same kickstart, though the
>> > >>>>>>> disk partitioning is different.
>> > >>>>>>>
>> > >>>>>>> I want to make writes on the ceph image faster. For example,
>> > >>>>>>> lots of writes to MySQL (via MySQL replication) on a ceph SSD
>> > >>>>>>> image are about 10x slower than on a spindle RAID disk image. The
>> > >>>>>>> MySQL server on the ceph rbd image has a hard time keeping up in
>> > >>>>>>> replication.
>> > >>>>>>>
>> > >>>>>>> So I wanted to test writes on these two systems.
>> > >>>>>>> I have a 10GB compressed (gzip) file on both servers.
>> > >>>>>>> I simply gunzip the file on both systems, while running iostat.
>> > >>>>>>>
>> > >>>>>>> The primary difference I see in the results is the average size
>> > >>>>>>> of the request to the disk.
>> > >>>>>>> CentOS7-lvm-raid-sata writes a lot faster to disk; the size of the
>> > >>>>>>> request is about 40x larger, but the number of writes per second
>> > >>>>>>> is about the same.
>> > >>>>>>> This makes me want to conclude that the smaller request size on
>> > >>>>>>> the CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> How can I make the size of the request larger for ceph rbd
>> > >>>>>>> images, so I can increase the write throughput?
>> > >>>>>>> Would this be related to having jumbo packets enabled in my ceph
>> > >>>>>>> storage network?
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>> Here is a sample of the results:
>> > >>>>>>>
>> > >>>>>>> [CentOS7-lvm-raid-sata]
>> > >>>>>>> $ gunzip large10gFile.gz &
>> > >>>>>>> $ iostat -x vg_root-lv_var -d 5 -m -N
>> > >>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > >>>>>>> ...
>> > >>>>>>> vg_root-lv_var    0.00     0.00   30.60  452.20    13.60   222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
>> > >>>>>>> vg_root-lv_var    0.00     0.00   88.20  182.00    39.20    89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
>> > >>>>>>> vg_root-lv_var    0.00     0.00   75.45  278.24    33.53   136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
>> > >>>>>>> vg_root-lv_var    0.00     0.00  111.60  181.80    49.60    89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
>> > >>>>>>> vg_root-lv_var    0.00     0.00   68.40  109.60    30.40    53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
>> > >>>>>>> ...
>> > >>>>>>>
>> > >>>>>>> [CentOS7-ceph-rbd-ssd]
>> > >>>>>>> $ gunzip large10gFile.gz &
>> > >>>>>>> $ iostat -x vg_root-lv_data -d 5 -m -N
>> > >>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> > >>>>>>> ...
>> > >>>>>>> vg_root-lv_data   0.00     0.00   46.40  167.80     0.88     1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
>> > >>>>>>> vg_root-lv_data   0.00     0.00   16.60   55.20     0.36     0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
>> > >>>>>>> vg_root-lv_data   0.00     0.00   69.00  173.80     1.34     1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
>> > >>>>>>> vg_root-lv_data   0.00     0.00   74.40  293.40     1.37     1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
>> > >>>>>>> vg_root-lv_data   0.00     0.00   90.80  359.00     1.96     3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
>> > >>>>>>> ...
>> > >>>>>>>
>> > >>>>>>> [iostat key]
>> > >>>>>>> w/s == The number (after merges) of write requests completed per
>> > >>>>>>> second for the device.
>> > >>>>>>> wMB/s == The number of megabytes written to the device per second.
>> > >>>>>>> avgrq-sz == The average size (in sectors) of the requests that
>> > >>>>>>> were issued to the device.
>> > >>>>>>> avgqu-sz == The average queue length of the requests that were
>> > >>>>>>> issued to the device.
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>
>> > >>>
>> > >>
>> > >
>> > >
>> > >
>>
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> [email protected]           Rakuten Communications
>>
>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
