On Mon, Mar 26, 2018 at 11:40 PM, Sam Huracan <[email protected]> wrote:
> Hi,
>
> We are using RAID cache mode Writeback for the SSD journal; I suspect this
> is the reason the SSD journal's utilization is so low.
> Is that true? If anybody has experience with this matter, please confirm.

I turn writeback mode off for the SSDs, as this seems to make the
controller cache a bottleneck.

--
Alex Gorbachev
Storcium

> Thanks
>
> 2018-03-26 23:00 GMT+07:00 Sam Huracan <[email protected]>:
>
>> Thanks for your information.
>> Here is the result when I run atop on one Ceph HDD host:
>> http://prntscr.com/iwmc86
>>
>> Some disks are busy at over 100%, but the journal SSD is only at 3% --
>> is that normal? Is there any way to optimize the use of the SSD journal?
>> Could you give me some keywords?
>>
>> Here is the configuration of a Ceph HDD host (Dell PowerEdge R730xd,
>> quantity per item):
>> PE R730/xd Motherboard: 1
>> Intel Xeon E5-2620 v4 2.1GHz, 20M Cache, 8.0GT/s QPI, Turbo, HT,
>> 8C/16T (85W), Max Mem 2133MHz: 1
>> 16GB RDIMM, 2400MT/s, Dual Rank, x8 Data Width: 2
>> 300GB 15K RPM SAS 12Gbps 2.5in Flex Bay Hard Drive - OS Drive (RAID 1): 2
>> 4TB 7.2K RPM NLSAS 12Gbps 512n 3.5in Hot-plug Hard Drive - OSD Drive: 7
>> 200GB Solid State Drive SATA Mix Use MLC 6Gbps 2.5in Hot-plug Drive -
>> Journal Drive (RAID 1, we are using Writeback mode): 2
>> PERC H730 Integrated RAID Controller, 1GB Cache: 1
>> Dual, Hot-plug, Redundant Power Supply (1+1), 750W: 1
>> Broadcom 5720 QP 1Gb Network Daughter Card: 1
>> QLogic 57810 Dual Port 10Gb Direct Attach/SFP+ Network Adapter: 1
>>
>> For some reasons, we can't configure jumbo frames in this cluster. We'll
>> follow your suggestion about scrubbing.
>>
>> 2018-03-26 7:41 GMT+07:00 Christian Balzer <[email protected]>:
>>
>>> Hello,
>>>
>>> In general, and as a reminder for others: the more information you
>>> supply, the more likely people are to answer, and to answer with
>>> actually pertinent information.
>>> Since you haven't mentioned the hardware (actual HDD/SSD models,
>>> CPU/RAM, controllers, etc.) we're still missing a piece of the puzzle
>>> that could be relevant.
>>>
>>> But given what we have, some things are more likely than others.
>>> Also, an inline 90KB screenshot of a TEXT iostat output is a bit of a
>>> no-no, never mind that atop instead of top from the start would have
>>> given you and us much more insight.
>>>
>>> On Sun, 25 Mar 2018 14:35:57 +0700 Sam Huracan wrote:
>>>
>>> > Thank you all.
>>> >
>>> > 1. Here is my ceph.conf file:
>>> > https://pastebin.com/xpF2LUHs
>>> >
>>> As Laszlo noted (and it matches your iostat output beautifully), tuning
>>> down scrubs is likely to have an immediate beneficial impact, as
>>> deep-scrubs in particular are VERY disruptive and I/O-intense
>>> operations.
>>>
>>> However, "osd scrub sleep = 0.1" may make things worse in certain Jewel
>>> versions: in those, all operations went through the unified queue, so
>>> this would cause a sleep for ALL operations, not just the scrub ones.
>>> I can't remember when this was fixed, and the changelog is of no help,
>>> so hopefully somebody who knows will pipe up.
>>> If in doubt, of course, experiment.
>>>
>>> In addition to that, if you have low-usage times, set your
>>> osd_scrub_(begin|end)_hour accordingly and also check the ML archives
>>> for other scrub scheduling tips.
>>>
>>> I'd also leave these:
>>>   filestore max sync interval = 100
>>>   filestore min sync interval = 50
>>>   filestore queue max ops = 5000
>>>   filestore queue committing max ops = 5000
>>>   journal max write entries = 1000
>>>   journal queue max ops = 5000
>>>
>>> at their defaults; playing with those parameters requires a good
>>> understanding of how Ceph filestore works AND usually only makes sense
>>> with SSD/NVMe setups.
>>> Especially the first two could lead to quite the I/O pileup.
>>>
>>> > 2.
>>> > Here is the result of "ceph -s":
>>> > root@ceph1:/etc/ceph# ceph -s
>>> >     cluster 31154d30-b0d3-4411-9178-0bbe367a5578
>>> >      health HEALTH_OK
>>> >      monmap e3: 3 mons at {ceph1=
>>> > 10.0.30.51:6789/0,ceph2=10.0.30.52:6789/0,ceph3=10.0.30.53:6789/0}
>>> >             election epoch 18, quorum 0,1,2 ceph1,ceph2,ceph3
>>> >      osdmap e2473: 63 osds: 63 up, 63 in
>>> >             flags sortbitwise,require_jewel_osds
>>> >       pgmap v34069952: 4096 pgs, 6 pools, 21534 GB data, 5696 kobjects
>>> >             59762 GB used, 135 TB / 194 TB avail
>>> >                 4092 active+clean
>>> >                    2 active+clean+scrubbing
>>> >                    2 active+clean+scrubbing+deep
>>> >   client io 36096 kB/s rd, 41611 kB/s wr, 1643 op/s rd, 1634 op/s wr
>>> >
>>> See above about deep-scrub, which will read ALL the objects of the PG
>>> being scrubbed and thus not only saturates the OSDs involved with reads
>>> but ALSO dirties the pagecache with cold objects, making other reads on
>>> the nodes slow by requiring them to hit the disks, too.
>>>
>>> It would be interesting to see a "ceph -s" when your cluster is busy
>>> but NOT scrubbing; 1600 write op/s are about what 21 HDDs can handle.
>>> So for the time being, disable scrubs entirely and see if your problems
>>> go away.
>>> If so, you now know the limits of your current setup and will want to
>>> avoid hitting them again.
>>>
>>> Having a dedicated SSD pool for high-end VMs or a cache tier (if it is
>>> a fit, not likely in your case) would be a way forward if your client
>>> demands are still growing.
>>>
>>> Christian
>>>
>>> >
>>> > 3.
>>> > We use 1 SSD (/dev/sdi) for journaling 7 HDDs, and I set 16GB for
>>> > each journal. Here is the result of the "ceph-disk list" command:
>>> >
>>> > /dev/sda :
>>> >  /dev/sda1 ceph data, active, cluster ceph, osd.0, journal /dev/sdi1
>>> > /dev/sdb :
>>> >  /dev/sdb1 ceph data, active, cluster ceph, osd.1, journal /dev/sdi2
>>> > /dev/sdc :
>>> >  /dev/sdc1 ceph data, active, cluster ceph, osd.2, journal /dev/sdi3
>>> > /dev/sdd :
>>> >  /dev/sdd1 ceph data, active, cluster ceph, osd.3, journal /dev/sdi4
>>> > /dev/sde :
>>> >  /dev/sde1 ceph data, active, cluster ceph, osd.4, journal /dev/sdi5
>>> > /dev/sdf :
>>> >  /dev/sdf1 ceph data, active, cluster ceph, osd.5, journal /dev/sdi6
>>> > /dev/sdg :
>>> >  /dev/sdg1 ceph data, active, cluster ceph, osd.6, journal /dev/sdi7
>>> > /dev/sdh :
>>> >  /dev/sdh3 other, LVM2_member
>>> >  /dev/sdh1 other, vfat, mounted on /boot/efi
>>> > /dev/sdi :
>>> >  /dev/sdi1 ceph journal, for /dev/sda1
>>> >  /dev/sdi2 ceph journal, for /dev/sdb1
>>> >  /dev/sdi3 ceph journal, for /dev/sdc1
>>> >  /dev/sdi4 ceph journal, for /dev/sdd1
>>> >  /dev/sdi5 ceph journal, for /dev/sde1
>>> >  /dev/sdi6 ceph journal, for /dev/sdf1
>>> >  /dev/sdi7 ceph journal, for /dev/sdg1
>>> >
>>> > 4. With iostat, we just ran "iostat -x 2"; /dev/sdi is the journal
>>> > SSD, /dev/sdh is the OS disk, and the rest are OSD disks.
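The scrub-window tuning Christian recommends above maps to a ceph.conf fragment like the following. The hours here are placeholders to adapt to your own quiet period; the option names are the Jewel-era ones, and the last two lines simply restate defaults worth keeping:

```ini
[osd]
# Only start new (deep-)scrubs between 01:00 and 06:00 local time
# (example hours -- pick your own low-usage window)
osd scrub begin hour = 1
osd scrub end hour = 6
# Defaults worth keeping: at most one concurrent scrub per OSD, and
# skip scrubbing when the node is already loaded
osd max scrubs = 1
osd scrub load threshold = 0.5
```

For Christian's temporary "disable scrubs entirely" experiment, the cluster-wide flags avoid config changes altogether: "ceph osd set noscrub" and "ceph osd set nodeep-scrub" stop new scrubs, and the matching "unset" commands re-enable them.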
>>> > root@ceph1:/etc/ceph# lsblk
>>> > NAME                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>>> > sda                                8:0    0   3.7T  0 disk
>>> > └─sda1                             8:1    0   3.7T  0 part /var/lib/ceph/osd/ceph-0
>>> > sdb                                8:16   0   3.7T  0 disk
>>> > └─sdb1                             8:17   0   3.7T  0 part /var/lib/ceph/osd/ceph-1
>>> > sdc                                8:32   0   3.7T  0 disk
>>> > └─sdc1                             8:33   0   3.7T  0 part /var/lib/ceph/osd/ceph-2
>>> > sdd                                8:48   0   3.7T  0 disk
>>> > └─sdd1                             8:49   0   3.7T  0 part /var/lib/ceph/osd/ceph-3
>>> > sde                                8:64   0   3.7T  0 disk
>>> > └─sde1                             8:65   0   3.7T  0 part /var/lib/ceph/osd/ceph-4
>>> > sdf                                8:80   0   3.7T  0 disk
>>> > └─sdf1                             8:81   0   3.7T  0 part /var/lib/ceph/osd/ceph-5
>>> > sdg                                8:96   0   3.7T  0 disk
>>> > └─sdg1                             8:97   0   3.7T  0 part /var/lib/ceph/osd/ceph-6
>>> > sdh                                8:112  0 278.9G  0 disk
>>> > ├─sdh1                             8:113  0   512M  0 part /boot/efi
>>> > └─sdh3                             8:115  0 278.1G  0 part
>>> >   ├─hnceph--hdd1--vg-swap (dm-0) 252:0    0  59.6G  0 lvm  [SWAP]
>>> >   └─hnceph--hdd1--vg-root (dm-1) 252:1    0 218.5G  0 lvm  /
>>> > sdi                                8:128  0 185.8G  0 disk
>>> > ├─sdi1                             8:129  0  16.6G  0 part
>>> > ├─sdi2                             8:130  0  16.6G  0 part
>>> > ├─sdi3                             8:131  0  16.6G  0 part
>>> > ├─sdi4                             8:132  0  16.6G  0 part
>>> > ├─sdi5                             8:133  0  16.6G  0 part
>>> > ├─sdi6                             8:134  0  16.6G  0 part
>>> > └─sdi7                             8:135  0  16.6G  0 part
>>> >
>>> > Could you give me some ideas to continue checking?
>>> >
>>> > 2018-03-25 12:25 GMT+07:00 Budai Laszlo <[email protected]>:
>>> >
>>> > > Could you post the result of "ceph -s"? Besides the health status
>>> > > there are other details that could help, like the status of your
>>> > > PGs. Also the result of "ceph-disk list" would be useful to
>>> > > understand how your disks are organized. For instance, with 1 SSD
>>> > > for 7 HDDs the SSD could be the bottleneck.
>>> > > From the outputs you gave us we don't know which are the spinning
>>> > > disks and which is the SSD (looking at the numbers I suspect that
>>> > > sdi is your SSD).
>>> > > We also don't know what parameters you were using when you ran the
>>> > > iostat command.
>>> > >
>>> > > Unfortunately it's difficult to help you without knowing more about
>>> > > your system.
>>> > >
>>> > > Kind regards,
>>> > > Laszlo
>>> > >
>>> > > On 24.03.2018 20:19, Sam Huracan wrote:
>>> > > > This is from iostat:
>>> > > >
>>> > > > I'm using Ceph Jewel, with no HW errors.
>>> > > > Ceph health is OK, and we've only used 50% of the total volume.
>>> > > >
>>> > > > 2018-03-24 22:20 GMT+07:00 <[email protected]>:
>>> > > >
>>> > > >     I would also check the utilization of your disks with tools
>>> > > >     like atop. Perhaps there is something related in dmesg or
>>> > > >     thereabouts?
>>> > > >
>>> > > >     - Mehmet
>>> > > >
>>> > > >     Am 24. März 2018 08:17:44 MEZ schrieb Sam Huracan
>>> > > >     <[email protected]>:
>>> > > >
>>> > > >         Hi guys,
>>> > > >         We are running a production OpenStack backend on Ceph.
>>> > > >
>>> > > >         At present, we are facing an issue with high iowait in
>>> > > >         VMs. In some MySQL VMs, we sometimes see iowait reach
>>> > > >         abnormally high peaks, which leads to an increase in slow
>>> > > >         queries even though load is stable (we test with a script
>>> > > >         simulating real load), as you can see in the graph:
>>> > > >         https://prnt.sc/ivndni
>>> > > >
>>> > > >         The MySQL VMs are placed on the Ceph HDD cluster, with 1
>>> > > >         SSD journal per 7 HDDs. In this cluster, iowait on each
>>> > > >         Ceph host is about 20%.
>>> > > >         https://prnt.sc/ivne08
>>> > > >
>>> > > >         Can you guys help me find the root cause of this issue,
>>> > > >         and how to eliminate this high iowait?
>>> > > >
>>> > > >         Thanks in advance.
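The per-disk utilization check Mehmet and Laszlo ask about doesn't require screenshots: a small awk filter over "iostat -x" output flags saturated disks directly. A sketch, where the sample output is fabricated for illustration only:

```shell
# Flag devices whose %util (the last column of "iostat -x") exceeds 90%.
# The two device lines below are made-up sample data, not real output.
iostat_sample='Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 4.50 55.0 92.0 3300.0 4100.0 98.0 8.10 54.2 60.1 50.8 6.5 96.3
sdi 0.00 0.00 0.0 210.0 0.0 9800.0 93.3 0.05 0.2 0.0 0.2 0.1 3.1'

# NR > 1 skips the header line; $NF is the %util column
echo "$iostat_sample" | awk 'NR > 1 && $NF + 0 > 90 { print $1, "busy:", $NF "%" }'
```

The same filter can be applied to real "iostat -x" output on a host, which makes it easy to paste a plain-text summary of the busy disks instead of an image.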
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> [email protected]        Rakuten Communications

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
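Alex's suggestion near the top of the thread, taking the controller's writeback cache out of the journal SSD's write path, would look roughly like this on an LSI-based controller such as the PERC. This is only a sketch: the adapter and logical-drive numbers are placeholders for your own layout, and Dell's perccli tool exposes the same setting with different syntax.

```
# Show the current cache policy of all logical drives on adapter 0
MegaCli64 -LDGetProp -Cache -LAll -a0

# Switch logical drive 1 (placeholder: the SSD journal volume) to
# write-through (WT), bypassing the controller's writeback cache
MegaCli64 -LDSetProp WT -L1 -a0
```

As with any cache-policy change, verify afterwards with the -LDGetProp query and benchmark before and after, since the effect depends on the workload and the SSD's own write latency.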
