I wanted to report an update.

We added more ceph storage nodes, so we can take the problem OSDs out.
Speeds are faster.

I found a way to monitor OSD latency in ceph, using "ceph pg dump osds".
The commit latency is always "0" for us:
  fs_perf_stat/commit_latency_ms
But the apply latency shows us the slow OSDs:
  fs_perf_stat/apply_latency_ms

The latest ceph has a prometheus plugin
(http://docs.ceph.com/docs/master/mgr/prometheus/), so this information can
be stored and monitored (e.g. with Grafana). Then, over time, we can see
which OSDs are the problem, so I don't have to deal with atop or run
lots of benchmark tests. (Use this for older ceph versions:
https://github.com/digitalocean/ceph_exporter)
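
A minimal Python sketch of pulling per-OSD apply latency out of the exporter's text output, for a quick check before any dashboards exist. The metric name ceph_osd_apply_latency_ms and the ceph_daemon label are assumptions about what the plugin exports, and the sample text is made up for illustration; compare against your own /metrics output before relying on it.

```python
import re

# Assumed metric name; verify it against your own exporter output.
METRIC = "ceph_osd_apply_latency_ms"

# Matches exposition lines of the form: name{label="value",...} number
LINE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def apply_latency_by_osd(metrics_text, metric=METRIC):
    """Return {daemon_label: latency_ms} for the chosen metric."""
    result = {}
    for line in metrics_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group(1) != metric:
            continue  # comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group(2)))
        # "ceph_daemon" is an assumed label name.
        result[labels.get("ceph_daemon", "unknown")] = float(match.group(3))
    return result


# Hypothetical sample, not captured from a real cluster.
SAMPLE = """# TYPE ceph_osd_apply_latency_ms gauge
ceph_osd_apply_latency_ms{ceph_daemon="osd.8"} 49.0
ceph_osd_apply_latency_ms{ceph_daemon="osd.9"} 3.0
ceph_osd_commit_latency_ms{ceph_daemon="osd.8"} 0.0
"""

if __name__ == "__main__":
    for daemon, ms in sorted(apply_latency_by_osd(SAMPLE).items()):
        print(daemon, ms)
```

In practice the text would come from the plugin's HTTP endpoint rather than a string, and Prometheus itself would do the long-term storage.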

It turns out we had about 5 problem SSD drives in the slowest ceph node,
and about 2 in the second slowest. All the other OSDs in those two machines
(the Crucial drives I reported earlier) are running below a max of 0.02
milliseconds, so I just had a few bad drives. For the newest ceph nodes we
added, we purchased the Kingston drives, and their latency is below a max
of 0.001 milliseconds; none are bad drives. I now see up to 28MBps
write speeds, and 260MBps read speeds.

-
# ceph pg dump osds -f json-pretty
dumped osds in format json-pretty

[
    {
        "osd": 8,
        "kb": 1952015104,
        "kb_used": 1331273140,
        "kb_avail": 620741964,
        "hb_in": [
            0,
            1,
            2,
            3,
            5,
            6,
            11,
            12,
            13,
            16,
            17,
            18,
            19,
            20,
            21
        ],
        "hb_out": [],
        "snap_trim_queue_len": 0,
        "num_snap_trimming": 0,
        "op_queue_age_hist": {
            "histogram": [],
            "upper_bound": 1
        },
        "fs_perf_stat": {
            "commit_latency_ms": 0,
            "apply_latency_ms": 49
        }
    },
...
-
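
To rank the OSDs without reading the whole dump by eye, something like the following Python sketch could work. It assumes the JSON layout shown above (a top-level list with an "osd" id and an "fs_perf_stat" block per entry); other ceph versions may lay this out differently. In the sample, OSD 8 is taken from the dump above and OSD 9 is a made-up entry for illustration.

```python
import json


def slowest_osds(pg_dump_osds_json, top=5):
    """Return (osd_id, apply_latency_ms) pairs, slowest first."""
    osds = json.loads(pg_dump_osds_json)
    latencies = [
        (entry["osd"], entry["fs_perf_stat"]["apply_latency_ms"])
        for entry in osds
    ]
    latencies.sort(key=lambda pair: pair[1], reverse=True)
    return latencies[:top]


# OSD 8 comes from the dump above; OSD 9 is hypothetical.
SAMPLE = """
[
    {"osd": 9, "fs_perf_stat": {"commit_latency_ms": 0, "apply_latency_ms": 3}},
    {"osd": 8, "fs_perf_stat": {"commit_latency_ms": 0, "apply_latency_ms": 49}}
]
"""

if __name__ == "__main__":
    for osd_id, latency_ms in slowest_osds(SAMPLE):
        print("osd.%d apply_latency_ms=%d" % (osd_id, latency_ms))
```

The JSON text would be the output of "ceph pg dump osds -f json-pretty", saved to a file or captured from the command.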




On Fri, Dec 8, 2017 at 9:20 AM, Russell Glaue <[email protected]> wrote:

> Here are some random samples I recorded in the past 30 minutes.
>
>  11 K blocks   10542 kB/s   909 op/s
>  12 K blocks   15397 kB/s  1247 op/s
>  26 K blocks   34306 kB/s  1307 op/s
>  33 K blocks   48509 kB/s  1465 op/s
>  59 K blocks   59333 kB/s   999 op/s
> 172 K blocks  101939 kB/s   590 op/s
> 104 K blocks   82605 kB/s   788 op/s
> 128 K blocks   77454 kB/s   601 op/s
> 136 K blocks   47526 kB/s   348 op/s
>
>
>
> On Fri, Dec 8, 2017 at 2:04 AM, Maged Mokhtar <[email protected]>
> wrote:
>
>> 4M block sizes you will only need 22.5 iops
>>
>> On 2017-12-08 09:59, Maged Mokhtar wrote:
>>
>> Hi Russell,
>>
>> It is probably due to the difference in block sizes used in the test vs
>> your cluster load. You have a latency problem which is limiting your max
>> write iops to around 2.5K. For large block sizes you do not need that many
>> iops, for example if you write in 4M block sizes you will only need 12.5
>> iops to reach your bandwidth of 90 MB/s; in that case your latency problem
>> will not affect your bandwidth. The reason I had suggested you run the
>> original test in 4k size was because this was the original problem subject
>> of this thread, the gunzip test and the small block sizes you were getting
>> with iostat.
>>
>> If you want to know a "rough" ballpark on what block sizes you currently
>> see on your cluster, get the total bandwidth and iops as reported by ceph (
>> ceph status should give you this ) and divide the first by the second.
>>
>> I still think you have a significant latency/iops issue: a cluster of 36
>> SSDs should give much higher than 2.5K iops.
>>
>> Maged
>>
>>
>> On 2017-12-07 23:57, Russell Glaue wrote:
>>
>> I want to provide an update to my interesting situation.
>> (New storage nodes were purchased and are going into the cluster soon)
>>
>> I have been monitoring the ceph storage nodes with atop and read/write
>> throughput with ceph-dash for the last month.
>> I am regularly seeing 80-90MB/s of write throughput (140MB/s read) on the
>> ceph cluster. At these moments, the problem ceph node I have been speaking
>> of shows 101% disk busy on the same 3 to 4 (of the 9) OSDs. So I am getting
>> the throughput that I want on the cluster, despite the OSDs in
>> question.
>>
>> However, when I run the bench tests described in this thread, I do not
>> see the write throughput go above 5MB/s.
>> When I take the problem node out, and run the bench tests, I see the
>> throughput double, but not over 10MB/s.
>>
>> Why is the ceph cluster getting up to 90MB/s write in the wild, but not
>> when running the bench tests ?
>>
>> -RG
>>
>>
>>
>>
>> On Fri, Oct 27, 2017 at 4:21 PM, Russell Glaue <[email protected]> wrote:
>>
>>> Yes, several have recommended the fio test now.
>>> I cannot perform a fio test at this time. Because the post referred to
>>> directs us to write the fio test data directly to the disk device, e.g.
>>> /dev/sdj. I'd have to take an OSD completely out in order to perform the
>>> test. And I am not ready to do that at this time. Perhaps after I attempt
>>> the hardware firmware updates, and still do not have an answer, I would
>>> then take an OSD out of the cluster to run the fio test.
>>> Also, our M500 disks on the two newest machines are all running version
>>> MU05, the latest firmware. On the older two, they are behind a RAID0,
>>> but I suspect they might be on MU03 firmware.
>>> -RG
>>>
>>>
>>> On Fri, Oct 27, 2017 at 4:12 PM, Brian Andrus <
>>> [email protected]> wrote:
>>>
>>>> I would be interested in seeing the results from the post mentioned by
>>>> an earlier contributor:
>>>>
>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>
>>>> Test an "old" M500 and a "new" M500 and see if the performance is A)
>>>> acceptable and B) comparable. Find hardware revision or firmware revision
>>>> in case of A=Good and B=different.
>>>>
>>>> If the "old" device doesn't test well in fio/dd testing, then the
>>>> drives are (as expected) not a great choice for journals and you might want
>>>> to look at hardware/backplane/RAID configuration differences that are
>>>> somehow allowing them to perform adequately.
>>>>
>>>> On Fri, Oct 27, 2017 at 12:36 PM, Russell Glaue <[email protected]>
>>>> wrote:
>>>>
>>>>> Yes, all the M500s we use are both journal and OSD, even the older
>>>>> ones. We have a 3 year lifecycle and move older nodes from one ceph
>>>>> cluster to another.
>>>>> On old systems with 3 year old M500s, they run as RAID0, and run
>>>>> faster than our current problem system with 1 year old M500s, run as
>>>>> non-RAID pass-through on the controller.
>>>>>
>>>>> All disks are SATA and are connected to a SAS controller. We were
>>>>> wondering if the SAS/SATA conversion is an issue. Yet, the older systems
>>>>> don't exhibit a problem.
>>>>>
>>>>> I found what I wanted to know from a colleague, that when the current
>>>>> ceph cluster was put together, the SSDs tested at 300+MB/s, and ceph
>>>>> cluster writes at 30MB/s.
>>>>>
>>>>> Using SMART tools, the reserved cells in all drives are at nearly 100%.
>>>>>
>>>>> Restarting the OSDs slightly improved performance. Still betting on
>>>>> hardware issues that a firmware upgrade may resolve.
>>>>>
>>>>> -RG
>>>>>
>>>>>
>>>>> On Oct 27, 2017 1:14 PM, "Brian Andrus" <[email protected]>
>>>>> wrote:
>>>>>
>>>>> @Russell, are your "older Crucial M500"s being used as journals?
>>>>>
>>>>> Crucial M500s are not to be used as a Ceph journal in my last
>>>>> experience with them. They make good OSDs with an NVMe in front of them
>>>>> perhaps, but not much else.
>>>>>
>>>>> Ceph uses O_DSYNC for journal writes and these drives do not handle
>>>>> them as expected. It's been many years since I've dealt with the M500s
>>>>> specifically, but it has to do with the capacitor/power save feature and
>>>>> how it handles those types of writes. I'm sorry I don't have the emails
>>>>> with specifics around anymore, but last I remember, this was a hardware
>>>>> issue and could not be resolved with firmware.
>>>>>
>>>>> Paging Kyle Bader...
>>>>>
>>>>> On Fri, Oct 27, 2017 at 9:24 AM, Russell Glaue <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> We have older Crucial M500 disks operating without such problems. So,
>>>>>> I have to believe it is a hardware firmware issue.
>>>>>> And it's peculiar to see performance improve slightly, even 24 hours
>>>>>> later, when I stop then start the OSDs.
>>>>>>
>>>>>> Our actual writes are low, as most of our Ceph Cluster based images
>>>>>> are low-write, high-memory. So a 20GB/day life/write capacity is a
>>>>>> non-issue for us. Only write speed is the concern. Our write-intensive
>>>>>> images are locked on non-ceph disks.
>>>>>>
>>>>>> What are others using for SSD drives in their Ceph cluster?
>>>>>> With 0.50+ DWPD (Drive Writes Per Day), the Kingston SEDC400S37
>>>>>> model seems to be the best for the price today.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 27, 2017 at 6:34 AM, Maged Mokhtar <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> It is quite likely related; things are pointing to bad disks.
>>>>>>> Probably the best thing is to plan for disk replacement, the sooner the
>>>>>>> better, as it could get worse.
>>>>>>>
>>>>>>>
>>>>>>> On 2017-10-27 02:22, Christian Wuerdig wrote:
>>>>>>>
>>>>>>> Hm, not necessarily directly related to your performance problem,
>>>>>>> however: these SSDs have a listed endurance of 72TB total data written.
>>>>>>> Over a 5 year period that's 40GB a day or approx 0.04 DWPD. Given
>>>>>>> that you run the journal for each OSD on the same disk, that's
>>>>>>> effectively at most 0.02 DWPD (about 20GB per day per disk). I don't
>>>>>>> know many who'd run a cluster on disks like those. Also it means these
>>>>>>> are pure consumer drives which have a habit of exhibiting random
>>>>>>> performance at times (based on unquantified anecdotal personal
>>>>>>> experience with other consumer model SSDs). I wouldn't touch these
>>>>>>> with a long stick for anything but small toy-test clusters.
>>>>>>>
>>>>>>> On Fri, Oct 27, 2017 at 3:44 AM, Russell Glaue <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 25, 2017 at 7:09 PM, Maged Mokhtar <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> It depends on what stage you are in:
>>>>>>> in production, probably the best thing is to set up a monitoring tool
>>>>>>> (collectd/graphite/prometheus/grafana) to monitor both ceph stats as
>>>>>>> well as resource load. This will, among other things, show you if you
>>>>>>> have slowing disks.
>>>>>>>
>>>>>>>
>>>>>>> I am monitoring Ceph performance with ceph-dash
>>>>>>> (http://cephdash.crapworks.de/), that is why I knew to look into the
>>>>>>> slow writes issue. And I am using Monitorix (http://www.monitorix.org/)
>>>>>>> to monitor system resources, including Disk I/O.
>>>>>>>
>>>>>>> However, though I can monitor individual disk performance at the system
>>>>>>> level, it seems Ceph does not tax any disk more than the worst disk. So
>>>>>>> in my monitoring charts, all disks have the same performance.
>>>>>>> All four nodes are base-lining at 50 writes/sec during the cluster's
>>>>>>> normal load, with the non-problem hosts spiking up to 150, and the
>>>>>>> problem host only spikes up to 100.
>>>>>>> But during the window of time I took the problem host OSDs down to run
>>>>>>> the bench tests, the OSDs on the other nodes increased to 300-500
>>>>>>> writes/sec.
>>>>>>> Otherwise, the chart looks the same for all disks on all ceph
>>>>>>> nodes/hosts.
>>>>>>>
>>>>>>> Before production you should first make sure your SSDs are suitable for
>>>>>>> Ceph, either by being recommended by other Ceph users or by testing them
>>>>>>> yourself for sync write performance using the fio tool as outlined
>>>>>>> earlier. Then after you build your cluster you can use rados and/or rbd
>>>>>>> benchmark tests to benchmark your cluster and find bottlenecks using
>>>>>>> atop/sar/collectl, which will help you tune your cluster.
>>>>>>>
>>>>>>>
>>>>>>> All 36 OSDs are: Crucial_CT960M500SSD1
>>>>>>>
>>>>>>> Rados bench tests were done at the beginning. The speed was much faster
>>>>>>> than it is now. I cannot recall the test results; someone else on my
>>>>>>> team ran them. Recently, I had thought the slow disk problem was a
>>>>>>> configuration issue with Ceph - before I posted here. Now we are hoping
>>>>>>> it may be resolved with a firmware update. (If it is firmware related,
>>>>>>> rebooting the problem node may temporarily resolve this)
>>>>>>>
>>>>>>>
>>>>>>> Though you did see better improvements, your cluster with 27 SSDs should
>>>>>>> give much higher numbers than 3k iops. If you are running rados bench
>>>>>>> while you have other client ios, then obviously the number reported by
>>>>>>> the tool will be less than what the cluster is actually giving, which
>>>>>>> you can find out via the ceph status command; it will print the total
>>>>>>> cluster throughput and iops. If the total is still low I would recommend
>>>>>>> running the fio raw disk test; maybe the disks are not suitable. When you
>>>>>>> removed your 9 bad disks from 36 and your performance doubled, you still
>>>>>>> had 2 other disks slowing you, meaning near 100% busy? It makes me feel
>>>>>>> the disk type used is not good. For these near 100% busy disks can you
>>>>>>> also measure their raw disk iops at that load (I am not sure atop shows
>>>>>>> this; if not, use sar/sysstat/iostat/collectl).
>>>>>>>
>>>>>>>
>>>>>>> I ran another bench test today with all 36 OSDs up. The overall
>>>>>>> performance was improved slightly compared to the original tests. Only 3
>>>>>>> OSDs on the problem host were increasing to 101% disk busy.
>>>>>>> The iops reported from ceph status during this bench test ranged from
>>>>>>> 1.6k to 3.3k, the test yielding 4k iops.
>>>>>>>
>>>>>>> Yes, the two other OSDs/disks that were the bottleneck were at 101% disk
>>>>>>> busy. The other OSD disks on the same host were sailing along at like
>>>>>>> 50-60% busy.
>>>>>>>
>>>>>>> All 36 OSD disks are exactly the same disk. They were all purchased at
>>>>>>> the same time. All were installed at the same time.
>>>>>>> I cannot believe it is a problem with the disk model. A failed/bad disk
>>>>>>> is perhaps possible. But the disk model itself cannot be the problem
>>>>>>> based on what I am seeing. If I am seeing bad performance on all disks on
>>>>>>> one ceph node/host, but not on another ceph node with these same disks,
>>>>>>> it has to be some other factor. This is why I am now guessing a firmware
>>>>>>> upgrade is needed.
>>>>>>>
>>>>>>> Also, as I alluded to here earlier, I took down all 9 OSDs in the
>>>>>>> problem host yesterday to run the bench test.
>>>>>>> Today, with those 9 OSDs back online, I reran the bench test, and I am
>>>>>>> seeing 2-3 OSD disks with 101% busy on the problem host, and the other
>>>>>>> disks are lower than 80%. So, for whatever reason, shutting down the
>>>>>>> OSDs and starting them back up allowed the performance of many (not all)
>>>>>>> of the OSDs on the problem host to improve.
>>>>>>>
>>>>>>>
>>>>>>> Maged
>>>>>>>
>>>>>>> On 2017-10-25 23:44, Russell Glaue wrote:
>>>>>>>
>>>>>>> Thanks to all.
>>>>>>> I took the OSDs down in the problem host, without shutting down the
>>>>>>> machine.
>>>>>>> As predicted, our MB/s about doubled.
>>>>>>> Using this bench/atop procedure, I found two other OSDs on another
>>>>>>> host
>>>>>>> that are the next bottlenecks.
>>>>>>>
>>>>>>> Is this the only good way to really test the performance of the
>>>>>>> drives as
>>>>>>> OSDs? Is there any other way?
>>>>>>>
>>>>>>> While running the bench on all 36 OSDs, the 9 problem OSDs stuck out.
>>>>>>> But the two new problem OSDs I just discovered in this recent test of 27
>>>>>>> OSDs did not stick out at all, because ceph bench distributes the load,
>>>>>>> making only the very worst performers show up in atop. So ceph is as
>>>>>>> slow as your slowest drive.
>>>>>>>
>>>>>>> It would be really great if I could run the bench test and somehow get
>>>>>>> the bench to use only certain OSDs during the test. Then I could run the
>>>>>>> test, avoiding the OSDs that I already know are a problem, so I can find
>>>>>>> the next worst OSD.
>>>>>>>
>>>>>>>
>>>>>>> [ the bench test ]
>>>>>>> rados bench -p scbench -b 4096 30 write -t 32
>>>>>>>
>>>>>>> [ original results with all 36 OSDs ]
>>>>>>> Total time run:         30.822350
>>>>>>> Total writes made:      31032
>>>>>>> Write size:             4096
>>>>>>> Object size:            4096
>>>>>>> Bandwidth (MB/sec):     3.93282
>>>>>>> Stddev Bandwidth:       3.66265
>>>>>>> Max bandwidth (MB/sec): 13.668
>>>>>>> Min bandwidth (MB/sec): 0
>>>>>>> Average IOPS:           1006
>>>>>>> Stddev IOPS:            937
>>>>>>> Max IOPS:               3499
>>>>>>> Min IOPS:               0
>>>>>>> Average Latency(s):     0.0317779
>>>>>>> Stddev Latency(s):      0.164076
>>>>>>> Max latency(s):         2.27707
>>>>>>> Min latency(s):         0.0013848
>>>>>>> Cleaning up (deleting benchmark objects)
>>>>>>> Clean up completed and total clean up time :20.166559
>>>>>>>
>>>>>>> [ after stopping all of the OSDs (9) on the problem host ]
>>>>>>> Total time run:         32.586830
>>>>>>> Total writes made:      59491
>>>>>>> Write size:             4096
>>>>>>> Object size:            4096
>>>>>>> Bandwidth (MB/sec):     7.13131
>>>>>>> Stddev Bandwidth:       9.78725
>>>>>>> Max bandwidth (MB/sec): 29.168
>>>>>>> Min bandwidth (MB/sec): 0
>>>>>>> Average IOPS:           1825
>>>>>>> Stddev IOPS:            2505
>>>>>>> Max IOPS:               7467
>>>>>>> Min IOPS:               0
>>>>>>> Average Latency(s):     0.0173691
>>>>>>> Stddev Latency(s):      0.21634
>>>>>>> Max latency(s):         6.71283
>>>>>>> Min latency(s):         0.00107473
>>>>>>> Cleaning up (deleting benchmark objects)
>>>>>>> Clean up completed and total clean up time :16.269393
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 20, 2017 at 1:35 PM, Russell Glaue <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On the machine in question, the 2nd newest, we are using the LSI
>>>>>>> MegaRAID
>>>>>>> SAS-3 3008 [Fury], which allows us a "Non-RAID" option, and has no
>>>>>>> battery.
>>>>>>> The older two use the LSI MegaRAID SAS 2208 [Thunderbolt] I reported
>>>>>>> earlier, each single drive configured as RAID0.
>>>>>>>
>>>>>>> Thanks for everyone's help.
>>>>>>> I am going to run a 32 thread bench test after taking the 2nd
>>>>>>> machine out
>>>>>>> of the cluster with noout.
>>>>>>> After it is out of the cluster, I am expecting the slow write issue
>>>>>>> will
>>>>>>> not surface.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 20, 2017 at 5:27 AM, David Turner <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> I can attest that the battery in the raid controller is a thing. I'm
>>>>>>> used to using lsi controllers, but my current position has hp raid
>>>>>>> controllers, and we just tracked down 10 of our nodes that had >100ms
>>>>>>> await pretty much always. They were the only 10 nodes in the cluster
>>>>>>> with failed batteries on the raid controllers.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 19, 2017, 8:15 PM Christian Balzer <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:
>>>>>>>
>>>>>>> That is a good idea.
>>>>>>> However, a previous rebalancing process has brought performance of our
>>>>>>> Guest VMs to a slow drag.
>>>>>>>
>>>>>>>
>>>>>>> Never mind that I'm not sure that these SSDs are particularly well
>>>>>>> suited for Ceph, your problem is clearly located on that one node.
>>>>>>>
>>>>>>> Not that I think it's the case, but make sure your PG distribution is
>>>>>>> not skewed with many more PGs per OSD on that node.
>>>>>>>
>>>>>>> Once you rule that out, my first guess is the RAID controller; you're
>>>>>>> running the SSDs as single RAID0s, I presume?
>>>>>>> If so, either a configuration difference or a failed BBU on the
>>>>>>> controller could result in the writeback cache being disabled, which
>>>>>>> would explain things beautifully.
>>>>>>>
>>>>>>> As for a temporary test/fix (with reduced redundancy of course), set
>>>>>>> noout
>>>>>>> (or mon_osd_down_out_subtree_limit accordingly) and turn the slow
>>>>>>> host
>>>>>>> off.
>>>>>>>
>>>>>>> This should result in much better performance than you have now and
>>>>>>> of
>>>>>>> course be the final confirmation of that host being the culprit.
>>>>>>>
>>>>>>> Christian
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Russell,
>>>>>>>
>>>>>>> as you have 4 servers, assuming you are not doing EC pools, just
>>>>>>> stop all
>>>>>>> the OSDs on the second questionable server, mark the OSDs on that
>>>>>>> server as
>>>>>>> out, let the cluster rebalance and when all PGs are active+clean
>>>>>>> just
>>>>>>> replay the test.
>>>>>>>
>>>>>>> All IOs should then go only to the other 3 servers.
>>>>>>>
>>>>>>> JC
>>>>>>>
>>>>>>> On Oct 19, 2017, at 13:49, Russell Glaue <[email protected]> wrote:
>>>>>>>
>>>>>>> No, I have not ruled out the disk controller and backplane making
>>>>>>> the
>>>>>>> disks slower.
>>>>>>> Is there a way I could test that theory, other than swapping out
>>>>>>> hardware?
>>>>>>> -RG
>>>>>>>
>>>>>>> On Thu, Oct 19, 2017 at 3:44 PM, David Turner
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Have you ruled out the disk controller and backplane in the server
>>>>>>> running slower?
>>>>>>>
>>>>>>> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I ran the test on the Ceph pool, and ran atop on all 4 storage
>>>>>>> servers,
>>>>>>> as suggested.
>>>>>>>
>>>>>>> Out of the 4 servers:
>>>>>>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait,
>>>>>>> momentarily spiking up to 50% on one server, and 80% on another.
>>>>>>> The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
>>>>>>> wait, and more than momentarily spiking to 101% disk busy and 250% CPU
>>>>>>> wait.
>>>>>>> For this 2nd newest server, these were the statistics for about 8 of 9
>>>>>>> disks, with the 9th disk not far behind the others.
>>>>>>>
>>>>>>> I cannot believe all 9 disks are bad
>>>>>>> They are the same disks as the newest 1st server,
>>>>>>> Crucial_CT960M500SSD1,
>>>>>>> and same exact server hardware too.
>>>>>>> They were purchased at the same time in the same purchase order
>>>>>>> and
>>>>>>> arrived at the same time.
>>>>>>> So I cannot believe I just happened to put 9 bad disks in one
>>>>>>> server,
>>>>>>> and 9 good ones in the other.
>>>>>>>
>>>>>>> I know I have Ceph configured exactly the same on all servers
>>>>>>> And I am sure I have the hardware settings configured exactly the
>>>>>>> same
>>>>>>> on the 1st and 2nd servers.
>>>>>>> So if I were someone else, I would say it maybe is bad hardware
>>>>>>> on the
>>>>>>> 2nd server.
>>>>>>> But the 2nd server is running very well without any hint of a
>>>>>>> problem.
>>>>>>>
>>>>>>> Any other ideas or suggestions?
>>>>>>>
>>>>>>> -RG
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> just run the same 32 threaded rados test as you did before and
>>>>>>> this
>>>>>>> time run atop while the test is running looking for %busy of
>>>>>>> cpu/disks. It
>>>>>>> should give an idea if there is a bottleneck in them.
>>>>>>>
>>>>>>> On 2017-10-18 21:35, Russell Glaue wrote:
>>>>>>>
>>>>>>> I cannot run the write test reviewed at the
>>>>>>> ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The
>>>>>>> tests write directly to the raw disk device.
>>>>>>> Reading an infile (created with urandom) on one SSD, writing the
>>>>>>> outfile to another osd, yields about 17MB/s.
>>>>>>> But isn't this write speed limited by the speed at which the dd
>>>>>>> infile can be read?
>>>>>>> And I assume the best test should be run with no other load.
>>>>>>>
>>>>>>> How does one run the rados bench "as stress"?
>>>>>>>
>>>>>>> -RG
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> measuring resource load as outlined earlier will show if the
>>>>>>> drives
>>>>>>> are performing well or not. Also how many osds do you have  ?
>>>>>>>
>>>>>>> On 2017-10-18 19:26, Russell Glaue wrote:
>>>>>>>
>>>>>>> The SSD drives are Crucial M500
>>>>>>> A Ceph user did some benchmarks and found it had good
>>>>>>> performance
>>>>>>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
>>>>>>>
>>>>>>> However, a user comment from 3 years ago on the blog post you
>>>>>>> linked
>>>>>>> to says to avoid the Crucial M500
>>>>>>>
>>>>>>> Yet, this performance posting tells that the Crucial M500 is
>>>>>>> good.
>>>>>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>>>>>>>
>>>>>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar
>>>>>>> <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Check out the following link: some SSDs perform badly in Ceph due to
>>>>>>> sync writes to the journal
>>>>>>>
>>>>>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>>>
>>>>>>> Another thing that can help is to re-run the rados 32 threads test as
>>>>>>> stress and view resource usage using atop (or collectl/sar) to check
>>>>>>> for %busy cpu and %busy disks to give you an idea of what is holding
>>>>>>> down your cluster. For example: if cpu/disk % are all low then check
>>>>>>> your network/switches. If disk %busy is high (90%) for all disks then
>>>>>>> your disks are the bottleneck, which either means you have SSDs that
>>>>>>> are not suitable for Ceph or you have too few disks (which I doubt is
>>>>>>> the case). If only 1 disk %busy is high, there may be something wrong
>>>>>>> with this disk and it should be removed.
>>>>>>>
>>>>>>> Maged
>>>>>>>
>>>>>>> On 2017-10-18 18:13, Russell Glaue wrote:
>>>>>>>
>>>>>>> In my previous post, in one of my points I was wondering if the
>>>>>>> request size would increase if I enabled jumbo packets. Currently they
>>>>>>> are disabled.
>>>>>>>
>>>>>>> @jdillama: The qemu settings for both these two guest
>>>>>>> machines, with
>>>>>>> RAID/LVM and Ceph/rbd images, are the same. I am not thinking
>>>>>>> that changing
>>>>>>> the qemu settings of "min_io_size=<limited to
>>>>>>> 16bits>,opt_io_size=<RBD
>>>>>>> image object size>" will directly address the issue.
>>>>>>>
>>>>>>> @mmokhtar: Ok. So you suggest the request size is the result of the
>>>>>>> problem and not the cause of the problem, meaning I should go after a
>>>>>>> different issue.
>>>>>>>
>>>>>>> I have been trying to get write speeds up to what people on this mail
>>>>>>> list are discussing.
>>>>>>> It seems that for our configuration, as it matches others, we should
>>>>>>> be getting about 70MB/s write speed.
>>>>>>> But we are not getting that.
>>>>>>> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are
>>>>>>> typically 1MB/s to 2MB/s.
>>>>>>> Monitoring the entire Ceph cluster (using
>>>>>>> http://cephdash.crapworks.de/), I have seen very rare momentary
>>>>>>> spikes up to 30MB/s.
>>>>>>>
>>>>>>> My storage network is connected via a 10Gb switch
>>>>>>> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller
>>>>>>> Each storage server has 9 1TB SSD drives, each drive as 1 osd (no RAID)
>>>>>>> Each drive is one LVM group, with two volumes - one volume for the
>>>>>>> osd, one volume for the journal
>>>>>>> Each osd is formatted with xfs
>>>>>>> The crush map is simple: default->rack->[host[1..4]->osd] with an
>>>>>>> evenly distributed weight
>>>>>>> The redundancy is triple replication
>>>>>>>
>>>>>>> While I have read comments that having the osd and journal on the
>>>>>>> same disk decreases write speed, I have also read that once past 8
>>>>>>> OSDs per node this is the recommended configuration; however, this is
>>>>>>> also the reason why SSD drives are used exclusively for OSDs in the
>>>>>>> storage nodes.
>>>>>>> Nonetheless, I was still expecting write speeds to be above 30MB/s,
>>>>>>> not below 6MB/s.
>>>>>>> Even at 12x slower than the RAID, using my previously posted iostat
>>>>>>> data set, I should be seeing write speeds that average 10MB/s, not
>>>>>>> 2MB/s.
>>>>>>>
>>>>>>> In regards to the rados benchmark tests you asked me to run,
>>>>>>> here is
>>>>>>> the output:
>>>>>>>
>>>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 1
>>>>>>> Maintaining 1 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
>>>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_85049
>>>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>>>>>     0       0         0         0         0         0           -           0
>>>>>>>     1       1       201       200   0.78356   0.78125  0.00522307  0.00496574
>>>>>>>     2       1       469       468  0.915303   1.04688  0.00437497  0.00426141
>>>>>>>     3       1       741       740  0.964371    1.0625  0.00512853   0.0040434
>>>>>>>     4       1       888       887  0.866739  0.574219  0.00307699  0.00450177
>>>>>>>     5       1      1147      1146  0.895725   1.01172  0.00376454   0.0043559
>>>>>>>     6       1      1325      1324  0.862293  0.695312  0.00459443    0.004525
>>>>>>>     7       1      1494      1493   0.83339  0.660156  0.00461002  0.00458452
>>>>>>>     8       1      1736      1735  0.847369  0.945312  0.00253971  0.00460458
>>>>>>>     9       1      1998      1997  0.866922   1.02344  0.00236573  0.00450172
>>>>>>>    10       1      2260      2259  0.882563   1.02344  0.00262179  0.00442152
>>>>>>>    11       1      2526      2525  0.896775   1.03906  0.00336914  0.00435092
>>>>>>>    12       1      2760      2759  0.898203  0.914062  0.00351827  0.00434491
>>>>>>>    13       1      3016      3015  0.906025         1  0.00335703  0.00430691
>>>>>>>    14       1      3257      3256  0.908545  0.941406  0.00332344  0.00429495
>>>>>>>    15       1      3490      3489  0.908644  0.910156  0.00318815  0.00426387
>>>>>>>    16       1      3728      3727  0.909952  0.929688   0.0032881  0.00428895
>>>>>>>    17       1      3986      3985  0.915703   1.00781  0.00274809   0.0042614
>>>>>>>    18       1      4250      4249  0.922116   1.03125  0.00287411  0.00423214
>>>>>>>    19       1      4505      4504  0.926003  0.996094  0.00375435  0.00421442
>>>>>>> 2017-10-18 10:56:31.267173 min lat: 0.00181259 max lat: 0.270553 avg lat: 0.00420118
>>>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>>>>>    20       1      4757      4756  0.928915  0.984375  0.00463972  0.00420118
>>>>>>>    21       1      5009      5008   0.93155  0.984375  0.00360065  0.00418937
>>>>>>>    22       1      5235      5234  0.929329  0.882812  0.00626214    0.004199
>>>>>>>    23       1      5500      5499  0.933925   1.03516  0.00466584  0.00417836
>>>>>>>    24       1      5708      5707  0.928861    0.8125  0.00285727  0.00420146
>>>>>>>    25       0      5964      5964  0.931858   1.00391  0.00417383   0.0041881
>>>>>>>    26       1      6216      6215  0.933722  0.980469   0.0041009  0.00417915
>>>>>>>    27       1      6481      6480  0.937474   1.03516  0.00307484  0.00416118
>>>>>>>    28       1      6745      6744  0.940819   1.03125  0.00266329  0.00414777
>>>>>>>    29       1      7003      7002  0.943124   1.00781  0.00305905  0.00413758
>>>>>>>    30       1      7271      7270  0.946578   1.04688  0.00391017  0.00412238
>>>>>>> Total time run:         30.006060
>>>>>>> Total writes made:      7272
>>>>>>> Write size:             4096
>>>>>>> Object size:            4096
>>>>>>> Bandwidth (MB/sec):     0.946684
>>>>>>> Stddev Bandwidth:       0.123762
>>>>>>> Max bandwidth (MB/sec): 1.0625
>>>>>>> Min bandwidth (MB/sec): 0.574219
>>>>>>> Average IOPS:           242
>>>>>>> Stddev IOPS:            31
>>>>>>> Max IOPS:               272
>>>>>>> Min IOPS:               147
>>>>>>> Average Latency(s):     0.00412247
>>>>>>> Stddev Latency(s):      0.00648437
>>>>>>> Max latency(s):         0.270553
>>>>>>> Min latency(s):         0.00175318
>>>>>>> Cleaning up (deleting benchmark objects)
>>>>>>> Clean up completed and total clean up time :29.069423
>>>>>>>
>>>>>>> [centos7]# rados bench -p scbench -b 4096 30 write -t 32
>>>>>>> Maintaining 32 concurrent writes of 4096 bytes to objects of size 4096 for up to 30 seconds or 0 objects
>>>>>>> Object prefix: benchmark_data_hamms.sys.cu.cait.org_86076
>>>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>>>>>     0       0         0         0         0         0           -           0
>>>>>>>     1      32      3013      2981   11.6438   11.6445  0.00247906  0.00572026
>>>>>>>     2      32      5349      5317   10.3834     9.125  0.00246662  0.00932016
>>>>>>>     3      32      5707      5675    7.3883   1.39844  0.00389774   0.0156726
>>>>>>>     4      32      5895      5863   5.72481  0.734375     1.13137   0.0167946
>>>>>>>     5      32      6869      6837   5.34068   3.80469   0.0027652   0.0226577
>>>>>>>     6      32      8901      8869   5.77306    7.9375   0.0053211   0.0216259
>>>>>>>     7      32     10800     10768   6.00785   7.41797  0.00358187   0.0207418
>>>>>>>     8      32     11825     11793   5.75728   4.00391  0.00217575   0.0215494
>>>>>>>     9      32     12941     12909    5.6019   4.35938  0.00278512   0.0220567
>>>>>>>    10      32     13317     13285   5.18849   1.46875   0.0034973   0.0240665
>>>>>>>    11      32     16189     16157   5.73653   11.2188  0.00255841   0.0212708
>>>>>>>    12      32     16749     16717   5.44077    2.1875  0.00330334   0.0215915
>>>>>>>    13      32     16756     16724   5.02436 0.0273438  0.00338994    0.021849
>>>>>>>    14      32     17908     17876   4.98686       4.5  0.00402598   0.0244568
>>>>>>>    15      32     17936     17904   4.66171  0.109375  0.00375799   0.0245545
>>>>>>>    16      32     18279     18247   4.45409   1.33984  0.00483873   0.0267929
>>>>>>>    17      32     18372     18340   4.21346  0.363281  0.00505187   0.0275887
>>>>>>>    18      32     19403     19371   4.20309   4.02734  0.00545154    0.029348
>>>>>>>    19      31     19845     19814   4.07295   1.73047  0.00254726   0.0306775
>>>>>>> 2017-10-18 10:57:58.160536 min lat: 0.0015005 max lat: 2.27707 avg lat: 0.0307559
>>>>>>>   sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>>>>>    20      31     20401     20370   3.97788   2.17188  0.00307238   0.0307559
>>>>>>>    21      32     21338     21306   3.96254   3.65625  0.00464563   0.0312288
>>>>>>>    22      32     23057     23025    4.0876   6.71484  0.00296295   0.0299267
>>>>>>>    23      32     23057     23025   3.90988         0           -   0.0299267
>>>>>>>    24      32     23803     23771   3.86837   1.45703  0.00301471   0.0312804
>>>>>>>    25      32     24112     24080   3.76191   1.20703  0.00191063   0.0331462
>>>>>>>    26      31     25303     25272   3.79629   4.65625  0.00794399   0.0329129
>>>>>>>    27      32     28803     28771   4.16183    13.668   0.0109817   0.0297469
>>>>>>>    28      32     29592     29560   4.12325   3.08203  0.00188185   0.0301911
>>>>>>>    29      32     30595     30563   4.11616   3.91797  0.00379099   0.0296794
>>>>>>>    30      32     31031     30999   4.03572   1.70312  0.00283347   0.0302411
>>>>>>> Total time run:         30.822350
>>>>>>> Total writes made:      31032
>>>>>>> Write size:             4096
>>>>>>> Object size:            4096
>>>>>>> Bandwidth (MB/sec):     3.93282
>>>>>>> Stddev Bandwidth:       3.66265
>>>>>>> Max bandwidth (MB/sec): 13.668
>>>>>>> Min bandwidth (MB/sec): 0
>>>>>>> Average IOPS:           1006
>>>>>>> Stddev IOPS:            937
>>>>>>> Max IOPS:               3499
>>>>>>> Min IOPS:               0
>>>>>>> Average Latency(s):     0.0317779
>>>>>>> Stddev Latency(s):      0.164076
>>>>>>> Max latency(s):         2.27707
>>>>>>> Min latency(s):         0.0013848
>>>>>>> Cleaning up (deleting benchmark objects)
>>>>>>> Clean up completed and total clean up time :20.166559
>>>>>>>
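As a sanity check on the two rados bench summaries above, the reported bandwidth and average IOPS can be recomputed from "Total writes made", "Write size" and "Total time run". This is only a rough sketch of the arithmetic, not anything taken from the rados source. The function name is made up for illustration, and I am assuming that "MB" means 2^20 bytes and that the IOPS figure is truncated to an integer, because that is what matches the numbers in the output.

```python
MIB = 1024 * 1024  # assumption: rados bench "MB" is 2**20 bytes

def bench_rates(total_writes, write_size_bytes, seconds):
    """Return (bandwidth in MB/s, average IOPS) from rados bench totals."""
    bandwidth = total_writes * write_size_bytes / MIB / seconds
    iops = total_writes / seconds
    return bandwidth, iops

# Totals copied from the two summaries quoted above.
runs = {
    "-t 1": (7272, 4096, 30.006060),
    "-t 32": (31032, 4096, 30.822350),
}

for name, totals in runs.items():
    bandwidth, iops = bench_rates(*totals)
    # int() truncates, which is an assumption that happens to match the output
    print(f"{name}: {bandwidth:.4f} MB/s, {int(iops)} IOPS")
```

Both runs come out within rounding of the reported values (about 0.9467 MB/s and 242 IOPS with one thread, about 3.9328 MB/s and 1006 IOPS with 32 threads), so the summaries are at least internally consistent.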
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 18, 2017 at 8:51 AM, Maged Mokhtar <[email protected]> wrote:
>>>>>>>
>>>>>>> First a general comment: local RAID will be faster than Ceph for a
>>>>>>> single-threaded (queue depth=1) IO test. A single-threaded Ceph
>>>>>>> client will see at best the same speed as a single disk for reads,
>>>>>>> and writes 4-6 times slower than a single disk. Not to mention that
>>>>>>> the latency of local disks will be much better. Where Ceph shines is
>>>>>>> when you have many concurrent IOs: it scales, whereas RAID will
>>>>>>> decrease speed per client as you add more.
>>>>>>>
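The queue depth point can be made concrete with Little's law: with a fixed number of IOs in flight, the achievable IOPS is roughly the queue depth divided by the average latency per IO. The sketch below is my own illustration (the function name is hypothetical) and uses the average latencies from the two rados bench runs quoted earlier in this thread.

```python
def iops_ceiling(queue_depth, avg_latency_s):
    """Little's law estimate: IOs in flight divided by average latency per IO."""
    return queue_depth / avg_latency_s

# Average latencies taken from the quoted rados bench summaries.
single_thread = iops_ceiling(1, 0.00412247)   # the "-t 1" run
many_threads = iops_ceiling(32, 0.0317779)    # the "-t 32" run

print(f"-t 1  estimate: {single_thread:.1f} IOPS")
print(f"-t 32 estimate: {many_threads:.1f} IOPS")
```

The estimates land at about 242.6 and 1007 IOPS, which is essentially what rados bench reported (242 and 1006). At queue depth 1 the only way to get more IOPS is lower per-IO latency, which fits the comment about single-threaded tests.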
>>>>>>> Having said that, I would recommend running rados/rbd bench-write
>>>>>>> and measuring 4k IOPS at 1 and 32 threads to get a better idea of
>>>>>>> how your cluster performs:
>>>>>>>
>>>>>>> ceph osd pool create testpool 256 256
>>>>>>> rados bench -p testpool -b 4096 30 write -t 1
>>>>>>> rados bench -p testpool -b 4096 30 write -t 32
>>>>>>> ceph osd pool delete testpool testpool --yes-i-really-really-mean-it
>>>>>>>
>>>>>>> rbd bench-write test-image --io-threads=1 --io-size 4096 --io-pattern rand --rbd_cache=false
>>>>>>> rbd bench-write test-image --io-threads=32 --io-size 4096 --io-pattern rand --rbd_cache=false
>>>>>>>
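When comparing several of these runs it can help to pull the summary block into a dictionary instead of reading it by eye. The following is a minimal sketch that assumes the "Key: value" summary layout seen in the rados bench output quoted in this thread. The function name and the regular expression are my own, and other Ceph releases may print the summary differently.

```python
import re

# Matches summary lines such as "Average IOPS:           242".
SUMMARY_LINE = re.compile(r"^([A-Za-z][^:]*?):\s+(-?\d+(?:\.\d+)?)$")

def parse_bench_summary(text):
    """Collect numeric 'Key: value' lines, ignoring any '>' quote prefixes."""
    summary = {}
    for line in text.splitlines():
        line = line.lstrip("> ").strip()
        match = SUMMARY_LINE.match(line)
        if match:
            summary[match.group(1)] = float(match.group(2))
    return summary

# A few lines copied from the "-t 1" run quoted in this thread.
SAMPLE = """
>>>>>>> Total time run:         30.006060
>>>>>>> Total writes made:      7272
>>>>>>> Bandwidth (MB/sec):     0.946684
>>>>>>> Average IOPS:           242
>>>>>>> Average Latency(s):     0.00412247
>>>>>>> Cleaning up (deleting benchmark objects)
"""

summary = parse_bench_summary(SAMPLE)
print(summary["Average IOPS"], summary["Average Latency(s)"])
```

Lines without a numeric value, such as the "Cleaning up" message, are simply skipped.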
>>>>>>> I think the request size difference you see is due to the IO
>>>>>>> scheduler: in the case of local disks it has more IOs to re-group,
>>>>>>> so it has a better chance of generating larger requests. Depending
>>>>>>> on your kernel, the IO scheduler may be different for rbd (blk-mq)
>>>>>>> vs sdx (cfq), but again I would think the request size is a result,
>>>>>>> not a cause.
>>>>>>>
>>>>>>> Maged
>>>>>>>
>>>>>>> On 2017-10-17 23:12, Russell Glaue wrote:
>>>>>>>
>>>>>>> I am running ceph jewel on 5 nodes with SSD OSDs.
>>>>>>> I have an LVM image on a local RAID of spinning disks.
>>>>>>> I have an RBD image in a pool of SSD disks.
>>>>>>> Both disks are used to run an almost identical CentOS 7 system.
>>>>>>> Both systems were installed with the same kickstart, though the
>>>>>>> disk partitioning is different.
>>>>>>>
>>>>>>> I want to make writes on the ceph image faster. For example, lots
>>>>>>> of writes to MySQL (via MySQL replication) on a ceph SSD image are
>>>>>>> about 10x slower than on a spindle RAID disk image. The MySQL
>>>>>>> server on the ceph rbd image has a hard time keeping up in
>>>>>>> replication.
>>>>>>>
>>>>>>> So I wanted to test writes on these two systems.
>>>>>>> I have a 10GB compressed (gzip) file on both servers.
>>>>>>> I simply gunzip the file on both systems, while running iostat.
>>>>>>>
>>>>>>> The primary difference I see in the results is the average size of
>>>>>>> the request to the disk.
>>>>>>> CentOS7-lvm-raid-sata writes a lot faster to disk, and the size of
>>>>>>> the request is about 40x larger, but the number of writes per
>>>>>>> second is about the same.
>>>>>>> This makes me want to conclude that the smaller size of the request
>>>>>>> for the CentOS7-ceph-rbd-ssd system is the cause of it being slow.
>>>>>>>
>>>>>>>
>>>>>>> How can I make the size of the request larger for ceph rbd images,
>>>>>>> so I can increase the write throughput?
>>>>>>> Would this be related to having jumbo packets enabled in my ceph
>>>>>>> storage network?
>>>>>>>
>>>>>>>
>>>>>>> Here is a sample of the results:
>>>>>>>
>>>>>>> [CentOS7-lvm-raid-sata]
>>>>>>> $ gunzip large10gFile.gz &
>>>>>>> $ iostat -x vg_root-lv_var -d 5 -m -N
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> ...
>>>>>>> vg_root-lv_var     0.00     0.00   30.60  452.20    13.60   222.15  1000.04     8.69   14.05    0.99   14.93   2.07 100.04
>>>>>>> vg_root-lv_var     0.00     0.00   88.20  182.00    39.20    89.43   974.95     4.65    9.82    0.99   14.10   3.70 100.00
>>>>>>> vg_root-lv_var     0.00     0.00   75.45  278.24    33.53   136.70   985.73     4.36   33.26    1.34   41.91   0.59  20.84
>>>>>>> vg_root-lv_var     0.00     0.00  111.60  181.80    49.60    89.34   969.84     2.60    8.87    0.81   13.81   0.13   3.90
>>>>>>> vg_root-lv_var     0.00     0.00   68.40  109.60    30.40    53.63   966.87     1.51    8.46    0.84   13.22   0.80  14.16
>>>>>>> ...
>>>>>>>
>>>>>>> [CentOS7-ceph-rbd-ssd]
>>>>>>> $ gunzip large10gFile.gz &
>>>>>>> $ iostat -x vg_root-lv_data -d 5 -m -N
>>>>>>> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>>>>>> ...
>>>>>>> vg_root-lv_data     0.00     0.00   46.40  167.80     0.88     1.46    22.36     1.23    5.66    2.47    6.54   4.52  96.82
>>>>>>> vg_root-lv_data     0.00     0.00   16.60   55.20     0.36     0.14    14.44     0.99   13.91    9.12   15.36  13.71  98.46
>>>>>>> vg_root-lv_data     0.00     0.00   69.00  173.80     1.34     1.32    22.48     1.25    5.19    3.77    5.75   3.94  95.68
>>>>>>> vg_root-lv_data     0.00     0.00   74.40  293.40     1.37     1.47    15.83     1.22    3.31    2.06    3.63   2.54  93.26
>>>>>>> vg_root-lv_data     0.00     0.00   90.80  359.00     1.96     3.41    24.45     1.63    3.63    1.94    4.05   2.10  94.38
>>>>>>> ...
>>>>>>>
>>>>>>> [iostat key]
>>>>>>> w/s == The number (after merges) of write requests completed per
>>>>>>> second for the device.
>>>>>>> wMB/s == The number of megabytes written to the device per second.
>>>>>>> avgrq-sz == The average size (in sectors) of the requests that were
>>>>>>> issued to the device.
>>>>>>> avgqu-sz == The average queue length of the requests that were
>>>>>>> issued to the device.
>>>>>>>
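The avgrq-sz column can be cross-checked against the throughput and request-rate columns of the same iostat row. The sketch below is my own arithmetic, not part of sysstat. It assumes 512-byte sectors and that "MB" means 2^20 bytes, and under those assumptions the recomputed sizes match the avgrq-sz values in the samples above, which suggests the column is reported in sectors.

```python
SECTOR_BYTES = 512   # assumption: avgrq-sz is counted in 512-byte sectors
MIB = 1024 * 1024    # assumption: iostat -m reports units of 2**20 bytes

def avg_request_sectors(r_per_s, w_per_s, r_mb_per_s, w_mb_per_s):
    """Average request size in sectors, from throughput and request rate."""
    bytes_per_s = (r_mb_per_s + w_mb_per_s) * MIB
    return bytes_per_s / (r_per_s + w_per_s) / SECTOR_BYTES

# First data row of each iostat sample quoted above.
lvm = avg_request_sectors(30.60, 452.20, 13.60, 222.15)
rbd = avg_request_sectors(46.40, 167.80, 0.88, 1.46)

print(f"lvm-raid-sata: {lvm:.1f} sectors, about {lvm * SECTOR_BYTES / 1024:.0f} KiB per request")
print(f"ceph-rbd-ssd:  {rbd:.1f} sectors, about {rbd * SECTOR_BYTES / 1024:.0f} KiB per request")
```

This gives roughly 1000 sectors (about 500 KiB) per request on the LVM volume against roughly 22 sectors (about 11 KiB) on the RBD volume, in line with the reported 1000.04 and 22.36.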
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> ceph-users mailing list
>>>>>>> [email protected]
>>>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>>>
>>>>>>> --
>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>> [email protected]           Rakuten Communications
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Brian Andrus | Cloud Systems Engineer | DreamHost
>>>>> [email protected] | www.dreamhost.com
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Brian Andrus | Cloud Systems Engineer | DreamHost
>>>> [email protected] | www.dreamhost.com
>>>>
>>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
