On Mon, Mar 26, 2018 at 11:40 PM, Sam Huracan <[email protected]> wrote:
> Hi,
>
> We are using RAID cache mode Writeback for the SSD journal; I suspect this
> is the reason the SSD journal's utilization is so low.
> Is that true? If anybody has experience with this matter, please confirm.

I turn writeback mode off for the SSDs, as this seems to make the
controller cache a bottleneck.

--
Alex Gorbachev
Storcium

> Thanks
>
> 2018-03-26 23:00 GMT+07:00 Sam Huracan <[email protected]>:
>
>> Thanks for your information.
>> Here is the result when I run atop on one Ceph HDD host:
>> http://prntscr.com/iwmc86
>>
>> Some disks are busy at over 100%, but the journal SSD is only at 3% --
>> is that normal? Is there any way to optimize the use of the SSD journal?
>> Could you give me some keywords?
>>
>> Here is the configuration of a Ceph HDD host (Dell PowerEdge R730xd,
>> quantity per item):
>> PE R730/xd Motherboard: 1
>> Intel Xeon E5-2620 v4 2.1GHz, 20M Cache, 8.0GT/s QPI, Turbo, HT,
>> 8C/16T (85W), Max Mem 2133MHz: 1
>> 16GB RDIMM, 2400MT/s, Dual Rank, x8 Data Width: 2
>> 300GB 15K RPM SAS 12Gbps 2.5in Flex Bay Hard Drive - OS Drive (RAID 1): 2
>> 4TB 7.2K RPM NLSAS 12Gbps 512n 3.5in Hot-plug Hard Drive - OSD Drive: 7
>> 200GB Solid State Drive SATA Mix Use MLC 6Gbps 2.5in Hot-plug Drive -
>> Journal Drive (RAID 1, we are using Writeback mode): 2
>> PERC H730 Integrated RAID Controller, 1GB Cache: 1
>> Dual, Hot-plug, Redundant Power Supply (1+1), 750W: 1
>> Broadcom 5720 QP 1Gb Network Daughter Card: 1
>> QLogic 57810 Dual Port 10Gb Direct Attach/SFP+ Network Adapter: 1
>>
>> For some reasons, we can't configure jumbo frames in this cluster. We'll
>> follow your suggestion about scrubbing.
>>
>> 2018-03-26 7:41 GMT+07:00 Christian Balzer <[email protected]>:
>>
>>> Hello,
>>>
>>> In general, and as a reminder for others: the more information you
>>> supply, the more likely people are to answer, and to answer with
>>> actually pertinent information.
>>> Since you haven't mentioned the hardware (actual HDD/SSD models,
>>> CPU/RAM, controllers, etc.) we're still missing a piece of the puzzle
>>> that could be relevant.
>>>
>>> But given what we have, some things are more likely than others.
>>> Also, an inline 90KB screenshot of a TEXT iostat output is a bit of a
>>> no-no, never mind that atop instead of top from the start would have
>>> given you and us much more insight.
>>>
>>> On Sun, 25 Mar 2018 14:35:57 +0700 Sam Huracan wrote:
>>>
>>> > Thank you all.
>>> >
>>> > 1. Here is my ceph.conf file:
>>> > https://pastebin.com/xpF2LUHs
>>> >
>>> As Laszlo noted (and it matches your iostat output beautifully), tuning
>>> down scrubs is likely to have an immediate beneficial impact, as
>>> deep-scrubs in particular are VERY disruptive and I/O-intense
>>> operations.
>>>
>>> However, "osd scrub sleep = 0.1" may make things worse in certain Jewel
>>> versions: in those, all operations went through the unified queue, so
>>> this would cause a sleep for ALL operations, not just the scrub ones.
>>> I can't remember when this was fixed, and the changelog is of no help,
>>> so hopefully somebody who knows will pipe up.
>>> If in doubt, of course, experiment.
>>>
>>> In addition to that, if you have low-usage times, set your
>>> osd_scrub_(begin|end)_hour accordingly and also check the ML archives
>>> for other scrub scheduling tips.
>>>
>>> I'd also leave these:
>>>   filestore max sync interval = 100
>>>   filestore min sync interval = 50
>>>   filestore queue max ops = 5000
>>>   filestore queue committing max ops = 5000
>>>   journal max write entries = 1000
>>>   journal queue max ops = 5000
>>>
>>> at their defaults; playing with those parameters requires a good
>>> understanding of how Ceph filestore works AND usually only makes sense
>>> with SSD/NVMe setups.
>>> Especially the first two could lead to quite the I/O pileup.
>>>
>>> > 2.
>>> > Here is the result of "ceph -s":
>>> > root@ceph1:/etc/ceph# ceph -s
>>> >     cluster 31154d30-b0d3-4411-9178-0bbe367a5578
>>> >      health HEALTH_OK
>>> >      monmap e3: 3 mons at {ceph1=
>>> > 10.0.30.51:6789/0,ceph2=10.0.30.52:6789/0,ceph3=10.0.30.53:6789/0}
>>> >             election epoch 18, quorum 0,1,2 ceph1,ceph2,ceph3
>>> >      osdmap e2473: 63 osds: 63 up, 63 in
>>> >             flags sortbitwise,require_jewel_osds
>>> >       pgmap v34069952: 4096 pgs, 6 pools, 21534 GB data, 5696 kobjects
>>> >             59762 GB used, 135 TB / 194 TB avail
>>> >                 4092 active+clean
>>> >                    2 active+clean+scrubbing
>>> >                    2 active+clean+scrubbing+deep
>>> >   client io 36096 kB/s rd, 41611 kB/s wr, 1643 op/s rd, 1634 op/s wr
>>> >
>>> See above about deep-scrub, which will read ALL the objects of the PG
>>> being scrubbed and thus not only saturates the OSDs involved with reads
>>> but ALSO dirties the pagecache with cold objects, making other reads on
>>> the nodes slow by requiring them to hit the disks, too.
>>>
>>> It would be interesting to see a "ceph -s" when your cluster is busy
>>> but NOT scrubbing; 1600 write op/s are about what 21 HDDs can handle.
>>> So for the time being, disable scrubs entirely and see if your problems
>>> go away.
>>> If so, you now know the limits of your current setup and will want to
>>> avoid hitting them again.
>>>
>>> Having a dedicated SSD pool for high-end VMs or a cache tier (if it is
>>> a fit, not likely in your case) would be a way forward if your client
>>> demands are still growing.
>>>
>>> Christian
>>>
>>> >
>>> > 3.
>>> > We use 1 SSD (/dev/sdi) for journaling 7 HDDs, and I set 16GB for
>>> > each journal. Here is the result of the "ceph-disk list" command:
>>> >
>>> > /dev/sda :
>>> >  /dev/sda1 ceph data, active, cluster ceph, osd.0, journal /dev/sdi1
>>> > /dev/sdb :
>>> >  /dev/sdb1 ceph data, active, cluster ceph, osd.1, journal /dev/sdi2
>>> > /dev/sdc :
>>> >  /dev/sdc1 ceph data, active, cluster ceph, osd.2, journal /dev/sdi3
>>> > /dev/sdd :
>>> >  /dev/sdd1 ceph data, active, cluster ceph, osd.3, journal /dev/sdi4
>>> > /dev/sde :
>>> >  /dev/sde1 ceph data, active, cluster ceph, osd.4, journal /dev/sdi5
>>> > /dev/sdf :
>>> >  /dev/sdf1 ceph data, active, cluster ceph, osd.5, journal /dev/sdi6
>>> > /dev/sdg :
>>> >  /dev/sdg1 ceph data, active, cluster ceph, osd.6, journal /dev/sdi7
>>> > /dev/sdh :
>>> >  /dev/sdh3 other, LVM2_member
>>> >  /dev/sdh1 other, vfat, mounted on /boot/efi
>>> > /dev/sdi :
>>> >  /dev/sdi1 ceph journal, for /dev/sda1
>>> >  /dev/sdi2 ceph journal, for /dev/sdb1
>>> >  /dev/sdi3 ceph journal, for /dev/sdc1
>>> >  /dev/sdi4 ceph journal, for /dev/sdd1
>>> >  /dev/sdi5 ceph journal, for /dev/sde1
>>> >  /dev/sdi6 ceph journal, for /dev/sdf1
>>> >  /dev/sdi7 ceph journal, for /dev/sdg1
>>> >
>>> > 4. With iostat, we just ran "iostat -x 2"; /dev/sdi is the journal
>>> > SSD, /dev/sdh is the OS disk, and the rest are OSD disks.
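The scrub-window tuning Christian recommends above maps to a ceph.conf fragment like the following. The hours here are placeholders to adapt to your own quiet period; the option names are the Jewel-era ones, and the last two lines simply restate defaults worth keeping:

```ini
[osd]
# Only start new (deep-)scrubs between 01:00 and 06:00 local time
# (example hours -- pick your own low-usage window)
osd scrub begin hour = 1
osd scrub end hour = 6
# Defaults worth keeping: at most one concurrent scrub per OSD, and
# skip scrubbing when the node is already loaded
osd max scrubs = 1
osd scrub load threshold = 0.5
```

For Christian's temporary "disable scrubs entirely" experiment, the cluster-wide flags avoid config changes altogether: "ceph osd set noscrub" and "ceph osd set nodeep-scrub" stop new scrubs, and the matching "unset" commands re-enable them.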
>>> > root@ceph1:/etc/ceph# lsblk
>>> > NAME                             MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>>> > sda                                8:0    0   3.7T  0 disk
>>> > └─sda1                             8:1    0   3.7T  0 part /var/lib/ceph/osd/ceph-0
>>> > sdb                                8:16   0   3.7T  0 disk
>>> > └─sdb1                             8:17   0   3.7T  0 part /var/lib/ceph/osd/ceph-1
>>> > sdc                                8:32   0   3.7T  0 disk
>>> > └─sdc1                             8:33   0   3.7T  0 part /var/lib/ceph/osd/ceph-2
>>> > sdd                                8:48   0   3.7T  0 disk
>>> > └─sdd1                             8:49   0   3.7T  0 part /var/lib/ceph/osd/ceph-3
>>> > sde                                8:64   0   3.7T  0 disk
>>> > └─sde1                             8:65   0   3.7T  0 part /var/lib/ceph/osd/ceph-4
>>> > sdf                                8:80   0   3.7T  0 disk
>>> > └─sdf1                             8:81   0   3.7T  0 part /var/lib/ceph/osd/ceph-5
>>> > sdg                                8:96   0   3.7T  0 disk
>>> > └─sdg1                             8:97   0   3.7T  0 part /var/lib/ceph/osd/ceph-6
>>> > sdh                                8:112  0 278.9G  0 disk
>>> > ├─sdh1                             8:113  0   512M  0 part /boot/efi
>>> > └─sdh3                             8:115  0 278.1G  0 part
>>> >   ├─hnceph--hdd1--vg-swap (dm-0) 252:0    0  59.6G  0 lvm  [SWAP]
>>> >   └─hnceph--hdd1--vg-root (dm-1) 252:1    0 218.5G  0 lvm  /
>>> > sdi                                8:128  0 185.8G  0 disk
>>> > ├─sdi1                             8:129  0  16.6G  0 part
>>> > ├─sdi2                             8:130  0  16.6G  0 part
>>> > ├─sdi3                             8:131  0  16.6G  0 part
>>> > ├─sdi4                             8:132  0  16.6G  0 part
>>> > ├─sdi5                             8:133  0  16.6G  0 part
>>> > ├─sdi6                             8:134  0  16.6G  0 part
>>> > └─sdi7                             8:135  0  16.6G  0 part
>>> >
>>> > Could you give me some ideas to continue checking?
>>> >
>>> > 2018-03-25 12:25 GMT+07:00 Budai Laszlo <[email protected]>:
>>> >
>>> > > Could you post the result of "ceph -s"? Besides the health status
>>> > > there are other details that could help, like the status of your
>>> > > PGs. Also the result of "ceph-disk list" would be useful to
>>> > > understand how your disks are organized. For instance, with 1 SSD
>>> > > for 7 HDDs the SSD could be the bottleneck.
>>> > > From the outputs you gave us we don't know which are the spinning
>>> > > disks and which is the SSD (looking at the numbers I suspect that
>>> > > sdi is your SSD).
>>> > > We also don't know what parameters you were using when you ran the
>>> > > iostat command.
>>> > >
>>> > > Unfortunately it's difficult to help you without knowing more about
>>> > > your system.
>>> > >
>>> > > Kind regards,
>>> > > Laszlo
>>> > >
>>> > > On 24.03.2018 20:19, Sam Huracan wrote:
>>> > > > This is from iostat:
>>> > > >
>>> > > > I'm using Ceph Jewel, with no HW errors.
>>> > > > Ceph health is OK, and we've only used 50% of the total volume.
>>> > > >
>>> > > > 2018-03-24 22:20 GMT+07:00 <[email protected]>:
>>> > > >
>>> > > >     I would also check the utilization of your disks with tools
>>> > > >     like atop. Perhaps there is something related in dmesg or
>>> > > >     thereabouts?
>>> > > >
>>> > > >     - Mehmet
>>> > > >
>>> > > >     Am 24. März 2018 08:17:44 MEZ schrieb Sam Huracan
>>> > > >     <[email protected]>:
>>> > > >
>>> > > >         Hi guys,
>>> > > >         We are running a production OpenStack backend on Ceph.
>>> > > >
>>> > > >         At present, we are facing an issue with high iowait in
>>> > > >         VMs. In some MySQL VMs, we sometimes see iowait reach
>>> > > >         abnormally high peaks, which leads to an increase in slow
>>> > > >         queries even though load is stable (we test with a script
>>> > > >         simulating real load), as you can see in the graph:
>>> > > >         https://prnt.sc/ivndni
>>> > > >
>>> > > >         The MySQL VMs are placed on the Ceph HDD cluster, with 1
>>> > > >         SSD journal per 7 HDDs. In this cluster, iowait on each
>>> > > >         Ceph host is about 20%.
>>> > > >         https://prnt.sc/ivne08
>>> > > >
>>> > > >         Can you guys help me find the root cause of this issue,
>>> > > >         and how to eliminate this high iowait?
>>> > > >
>>> > > >         Thanks in advance.
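The per-disk utilization check Mehmet and Laszlo ask about doesn't require screenshots: a small awk filter over "iostat -x" output flags saturated disks directly. A sketch, where the sample output is fabricated for illustration only:

```shell
# Flag devices whose %util (the last column of "iostat -x") exceeds 90%.
# The two device lines below are made-up sample data, not real output.
iostat_sample='Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 4.50 55.0 92.0 3300.0 4100.0 98.0 8.10 54.2 60.1 50.8 6.5 96.3
sdi 0.00 0.00 0.0 210.0 0.0 9800.0 93.3 0.05 0.2 0.0 0.2 0.1 3.1'

# NR > 1 skips the header line; $NF is the %util column
echo "$iostat_sample" | awk 'NR > 1 && $NF + 0 > 90 { print $1, "busy:", $NF "%" }'
```

The same filter can be applied to real "iostat -x" output on a host, which makes it easy to paste a plain-text summary of the busy disks instead of an image.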
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> [email protected]        Rakuten Communications

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
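Alex's suggestion near the top of the thread, taking the controller's writeback cache out of the journal SSD's write path, would look roughly like this on an LSI-based controller such as the PERC. This is only a sketch: the adapter and logical-drive numbers are placeholders for your own layout, and Dell's perccli tool exposes the same setting with different syntax.

```
# Show the current cache policy of all logical drives on adapter 0
MegaCli64 -LDGetProp -Cache -LAll -a0

# Switch logical drive 1 (placeholder: the SSD journal volume) to
# write-through (WT), bypassing the controller's writeback cache
MegaCli64 -LDSetProp WT -L1 -a0
```

As with any cache-policy change, verify afterwards with the -LDGetProp query and benchmark before and after, since the effect depends on the workload and the SSD's own write latency.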
