I would start by defragmenting the drives. The nice part is that you can just
run the defrag with a time limit and it will work through all the XFS drives
on its own.
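
For example, assuming xfs_fsr is the defrag tool meant here (a minimal sketch;
the two-hour limit and verbose flag are just illustrative):

# reorganize for at most 7200 seconds; with no filesystem arguments
# xfs_fsr walks every mounted XFS filesystem and picks up on the next
# run where the previous one stopped
xfs_fsr -t 7200 -v

That way it can run in short maintenance windows without having to finish a
whole drive in one pass.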
On 4 Oct 2015 6:13 pm, "Robert LeBlanc" <[email protected]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> These are Toshiba MG03ACA400 drives.
>
> sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series
> chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
> sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series
> chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
> sde is SATADOM with OS install
> sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI Logic /
> Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
> sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI
> controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2
> (rev 05)
>
> There is probably some performance optimization that we can do in this area;
> however, unless I'm missing something, I don't see anything that should cause
> I/O to take 30-60+ seconds to complete from a disk standpoint.
>
> [root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: ";
> xfs_db -c frag -r /dev/sd${i}1; done
> sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
> sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
> sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
> sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
> sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
> sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
> sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
> sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
> sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
> sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
>
>
> [root@ceph1 ~]# iostat -xd 2
> Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1) 10/04/2015 _x86_64_ (16
> CPU)
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.09 2.06 9.24 36.18 527.28 1743.71 100.00
> 8.96 197.32 17.50 243.23 4.07 18.47
> sdb 0.17 3.61 16.70 74.44 949.65 2975.30 86.13
> 6.74 73.95 23.94 85.16 4.31 39.32
> sdc 0.14 4.67 15.69 87.80 818.02 3860.11 90.41
> 9.56 92.38 26.73 104.11 4.44 45.91
> sdd 0.17 3.43 7.16 69.13 480.96 2847.42 87.25
> 4.80 62.89 30.00 66.30 4.33 33.00
> sde 0.01 1.13 0.34 0.99 8.35 12.01 30.62
> 0.01 7.37 2.64 9.02 1.64 0.22
> sdj 0.00 1.22 0.01 348.22 0.03 11302.65 64.91
> 0.23 0.66 0.14 0.66 0.15 5.15
> sdk 0.00 1.99 0.01 369.94 0.03 12876.74 69.61
> 0.26 0.71 0.13 0.71 0.16 5.75
> sdf 0.01 1.79 1.55 31.12 39.64 1431.37 90.06
> 4.07 124.67 16.25 130.05 3.11 10.17
> sdi 0.22 3.17 23.92 72.90 1386.45 2676.28 83.93
> 7.75 80.00 24.31 98.27 4.31 41.77
> sdm 0.16 3.10 17.63 72.84 986.29 2767.24 82.98
> 6.57 72.64 23.67 84.50 4.23 38.30
> sdl 0.11 3.01 12.10 55.14 660.85 2361.40 89.89
> 17.87 265.80 21.64 319.36 4.08 27.45
> sdg 0.08 2.45 9.75 53.90 489.67 1929.42 76.01
> 17.27 271.30 20.77 316.61 3.98 25.33
> sdh 0.10 2.76 11.28 60.97 600.10 2114.48 75.14
> 1.70 23.55 22.92 23.66 4.10 29.60
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 0.50 0.00 146.00 0.00 584.00
> 0.01 16.00 16.00 0.00 16.00 0.80
> sdb 0.00 0.50 9.00 119.00 2036.00 2578.00 72.09
> 0.68 5.50 7.06 5.39 2.36 30.25
> sdc 0.00 4.00 34.00 129.00 494.00 6987.75 91.80
> 1.70 10.44 17.00 8.72 4.44 72.40
> sdd 0.00 1.50 1.50 95.50 74.00 2396.50 50.94
> 0.85 8.75 23.33 8.52 7.53 73.05
> sde 0.00 37.00 11.00 1.00 46.00 152.00 33.00
> 0.01 1.00 0.64 5.00 0.54 0.65
> sdj 0.00 0.50 0.00 970.50 0.00 12594.00 25.95
> 0.09 0.09 0.00 0.09 0.08 8.20
> sdk 0.00 0.00 0.00 977.50 0.00 12016.00 24.59
> 0.10 0.10 0.00 0.10 0.09 8.90
> sdf 0.00 0.50 0.50 37.50 2.00 230.25 12.22
> 9.63 10.58 8.00 10.61 1.79 6.80
> sdi 2.00 0.00 10.50 0.00 2528.00 0.00 481.52
> 0.10 9.33 9.33 0.00 7.76 8.15
> sdm 0.00 0.50 15.00 116.00 546.00 833.25 21.06
> 0.94 7.17 14.03 6.28 4.13 54.15
> sdl 0.00 0.00 3.00 0.00 26.00 0.00 17.33
> 0.02 7.50 7.50 0.00 7.50 2.25
> sdg 0.00 3.50 1.00 64.50 4.00 2929.25 89.56
> 0.40 6.04 9.00 5.99 3.42 22.40
> sdh 0.50 0.50 4.00 64.00 770.00 1105.00 55.15
> 4.96 189.42 21.25 199.93 4.21 28.60
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 8.50 0.00 110.00 0.00 25.88
> 0.01 1.59 1.59 0.00 1.53 1.30
> sdb 0.00 4.00 6.50 117.50 494.00 4544.50 81.27
> 0.87 6.99 11.62 6.73 3.28 40.70
> sdc 0.00 0.50 5.50 202.50 526.00 4123.00 44.70
> 1.80 8.66 18.73 8.39 2.08 43.30
> sdd 0.00 3.00 2.50 227.00 108.00 6952.00 61.53
> 46.10 197.44 30.20 199.29 3.86 88.60
> sde 0.00 0.00 0.00 1.50 0.00 6.00 8.00
> 0.00 2.33 0.00 2.33 1.33 0.20
> sdj 0.00 0.00 0.00 834.00 0.00 9912.00 23.77
> 0.08 0.09 0.00 0.09 0.08 6.75
> sdk 0.00 0.00 0.00 777.00 0.00 12318.00 31.71
> 0.12 0.15 0.00 0.15 0.10 7.70
> sdf 0.00 1.00 4.50 117.00 198.00 693.25 14.67
> 34.86 362.88 84.33 373.60 3.59 43.65
> sdi 0.00 0.00 1.50 0.00 6.00 0.00 8.00
> 0.01 9.00 9.00 0.00 9.00 1.35
> sdm 0.50 3.00 3.50 143.00 1014.00 4205.25 71.25
> 0.93 5.95 20.43 5.59 3.08 45.15
> sdl 0.50 0.00 8.00 148.50 1578.00 2128.50 47.37
> 0.82 5.27 6.44 5.21 3.40 53.20
> sdg 1.50 2.00 10.50 100.50 2540.00 2039.50 82.51
> 0.77 7.00 14.19 6.25 5.42 60.20
> sdh 0.50 0.00 5.00 0.00 1050.00 0.00 420.00
> 0.04 7.10 7.10 0.00 7.10 3.55
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 6.00 0.00 604.00 0.00 201.33
> 0.03 5.58 5.58 0.00 5.58 3.35
> sdb 0.00 6.00 7.00 236.00 132.00 8466.00 70.77
> 45.48 186.59 31.79 191.18 1.62 39.45
> sdc 2.00 0.00 19.50 46.50 6334.00 686.00 212.73
> 0.39 5.96 7.97 5.12 3.57 23.55
> sdd 0.00 1.00 3.00 20.00 72.00 1527.25 139.07
> 0.31 47.67 6.17 53.90 3.11 7.15
> sde 0.00 17.00 0.00 4.50 0.00 184.00 81.78
> 0.01 2.33 0.00 2.33 2.33 1.05
> sdj 0.00 0.00 0.00 805.50 0.00 12760.00 31.68
> 0.21 0.27 0.00 0.27 0.09 7.35
> sdk 0.00 0.00 0.00 438.00 0.00 14300.00 65.30
> 0.24 0.54 0.00 0.54 0.13 5.65
> sdf 0.00 0.00 1.00 0.00 6.00 0.00 12.00
> 0.00 2.50 2.50 0.00 2.50 0.25
> sdi 0.00 5.50 14.50 27.50 394.00 6459.50 326.36
> 0.86 20.18 11.00 25.02 7.42 31.15
> sdm 0.00 1.00 9.00 175.00 554.00 3173.25 40.51
> 1.12 6.38 7.22 6.34 2.41 44.40
> sdl 0.00 2.00 2.50 100.50 26.00 2483.00 48.72
> 0.77 7.47 11.80 7.36 2.10 21.65
> sdg 0.00 4.50 9.00 214.00 798.00 7417.00 73.68
> 66.56 298.46 28.83 309.80 3.35 74.70
> sdh 0.00 0.00 16.50 0.00 344.00 0.00 41.70
> 0.09 5.61 5.61 0.00 4.55 7.50
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 1.00 0.00 9.00 0.00 3162.00 0.00 702.67
> 0.07 8.06 8.06 0.00 6.06 5.45
> sdb 0.50 0.00 12.50 13.00 1962.00 298.75 177.31
> 0.63 30.00 4.84 54.19 9.96 25.40
> sdc 0.00 0.50 3.50 131.00 18.00 1632.75 24.55
> 0.87 6.48 16.86 6.20 3.51 47.25
> sdd 0.00 0.00 4.00 0.00 72.00 16.00 44.00
> 0.26 10.38 10.38 0.00 23.38 9.35
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 843.50 0.00 16334.00 38.73
> 0.19 0.23 0.00 0.23 0.11 9.10
> sdk 0.00 0.00 0.00 803.00 0.00 10394.00 25.89
> 0.07 0.08 0.00 0.08 0.08 6.25
> sdf 0.00 4.00 11.00 90.50 150.00 2626.00 54.70
> 0.59 5.84 3.82 6.08 4.06 41.20
> sdi 0.00 3.50 17.50 130.50 2132.00 6309.50 114.07
> 1.84 12.55 25.60 10.80 5.76 85.30
> sdm 0.00 4.00 2.00 139.00 44.00 1957.25 28.39
> 0.89 6.28 14.50 6.17 3.55 50.10
> sdl 0.00 0.50 12.00 101.00 334.00 1449.75 31.57
> 0.94 8.28 10.17 8.06 2.11 23.85
> sdg 0.00 0.00 2.50 3.00 204.00 17.00 80.36
> 0.02 5.27 4.60 5.83 3.91 2.15
> sdh 0.00 0.50 9.50 32.50 1810.00 199.50 95.69
> 0.28 6.69 3.79 7.54 5.12 21.50
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.50 0.50 25.00 24.50 1248.00 394.25 66.35
> 0.76 15.30 11.62 19.06 5.25 26.00
> sdb 1.50 0.00 13.50 30.00 2628.00 405.25 139.46
> 0.27 5.94 8.19 4.93 5.31 23.10
> sdc 0.00 6.00 3.00 163.00 60.00 9889.50 119.87
> 1.66 9.83 28.67 9.48 5.95 98.70
> sdd 0.00 11.00 5.50 353.50 50.00 2182.00 12.43
> 118.42 329.26 30.27 333.91 2.78 99.90
> sde 0.00 5.50 0.00 1.50 0.00 28.00 37.33
> 0.00 2.33 0.00 2.33 2.33 0.35
> sdj 0.00 0.00 0.00 1227.50 0.00 22064.00 35.95
> 0.50 0.41 0.00 0.41 0.10 12.50
> sdk 0.00 0.50 0.00 1073.50 0.00 19248.00 35.86
> 0.24 0.23 0.00 0.23 0.10 10.40
> sdf 0.00 4.00 0.00 109.00 0.00 4145.00 76.06
> 0.59 5.44 0.00 5.44 3.63 39.55
> sdi 0.00 1.00 8.50 95.50 218.00 2091.75 44.42
> 1.06 9.70 18.71 8.90 7.00 72.80
> sdm 0.00 0.00 8.00 177.50 82.00 3173.00 35.09
> 1.24 6.65 14.31 6.30 3.53 65.40
> sdl 0.00 3.50 3.00 187.50 32.00 2175.25 23.17
> 1.47 7.68 18.50 7.50 3.85 73.35
> sdg 0.00 0.00 1.00 0.00 12.00 0.00 24.00
> 0.00 1.50 1.50 0.00 1.50 0.15
> sdh 0.50 1.00 14.00 169.50 2364.00 4568.00 75.55
> 1.50 8.12 21.25 7.03 4.91 90.10
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 4.00 3.00 60.00 212.00 2542.00 87.43
> 0.58 8.02 15.50 7.64 7.95 50.10
> sdb 0.00 0.50 2.50 98.00 682.00 1652.00 46.45
> 0.51 5.13 6.20 5.10 3.05 30.65
> sdc 0.00 2.50 4.00 146.00 16.00 4623.25 61.86
> 1.07 7.33 13.38 7.17 2.22 33.25
> sdd 0.00 0.50 9.50 30.00 290.00 358.00 32.81
> 0.84 32.22 49.16 26.85 12.28 48.50
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.50 0.00 530.00 0.00 7138.00 26.94
> 0.06 0.11 0.00 0.11 0.09 4.65
> sdk 0.00 0.00 0.00 625.00 0.00 8254.00 26.41
> 0.07 0.12 0.00 0.12 0.09 5.75
> sdf 0.00 0.00 0.00 4.00 0.00 18.00 9.00
> 0.01 3.62 0.00 3.62 3.12 1.25
> sdi 0.00 2.50 8.00 61.00 836.00 2681.50 101.96
> 0.58 9.25 15.12 8.48 6.71 46.30
> sdm 0.00 4.50 11.00 273.00 2100.00 8562.00 75.08
> 13.49 47.53 24.95 48.44 1.83 52.00
> sdl 0.00 1.00 0.50 49.00 2.00 1038.00 42.02
> 0.23 4.83 14.00 4.73 2.45 12.15
> sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdh 1.00 1.00 9.00 109.00 2082.00 2626.25 79.80
> 0.85 7.34 7.83 7.30 3.83 45.20
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 1.50 10.00 177.00 284.00 4857.00 54.98
> 36.26 194.27 21.85 204.01 3.53 66.00
> sdb 1.00 0.50 39.50 119.50 1808.00 2389.25 52.80
> 1.58 9.96 12.32 9.18 2.42 38.45
> sdc 0.00 2.00 15.00 200.50 116.00 4951.00 47.03
> 14.37 66.70 73.87 66.16 2.23 47.95
> sdd 0.00 3.50 6.00 54.50 180.00 2360.50 83.98
> 0.69 11.36 20.42 10.36 7.99 48.35
> sde 0.00 7.50 0.00 32.50 0.00 160.00 9.85
> 1.64 50.51 0.00 50.51 1.48 4.80
> sdj 0.00 0.00 0.00 835.00 0.00 10198.00 24.43
> 0.07 0.09 0.00 0.09 0.08 6.50
> sdk 0.00 0.00 0.00 802.00 0.00 12534.00 31.26
> 0.23 0.29 0.00 0.29 0.10 8.05
> sdf 0.00 2.50 2.00 133.50 14.00 5272.25 78.03
> 4.37 32.21 4.50 32.63 1.73 23.40
> sdi 0.00 4.50 17.00 125.50 2676.00 8683.25 159.43
> 1.86 13.02 27.97 11.00 4.95 70.55
> sdm 0.00 0.00 7.00 0.50 540.00 32.00 152.53
> 0.05 7.07 7.57 0.00 7.07 5.30
> sdl 0.00 7.00 27.00 276.00 2374.00 11955.50 94.58
> 25.87 85.36 15.20 92.23 1.84 55.90
> sdg 0.00 0.00 45.00 0.00 828.00 0.00 36.80
> 0.07 1.62 1.62 0.00 0.68 3.05
> sdh 0.00 0.50 0.50 65.50 2.00 1436.25 43.58
> 0.51 7.79 16.00 7.73 3.61 23.80
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 8.00 14.50 150.00 122.00 929.25 12.78
> 20.65 70.61 7.55 76.71 1.46 24.05
> sdb 0.00 5.00 8.00 283.50 86.00 2757.50 19.51
> 69.43 205.40 51.75 209.73 2.40 69.85
> sdc 0.00 0.00 12.50 1.50 350.00 48.25 56.89
> 0.25 17.75 17.00 24.00 4.75 6.65
> sdd 0.00 3.50 36.50 141.00 394.00 2338.75 30.79
> 1.50 8.42 16.16 6.41 4.56 80.95
> sde 0.00 1.50 0.00 1.00 0.00 10.00 20.00
> 0.00 2.00 0.00 2.00 2.00 0.20
> sdj 0.00 0.00 0.00 1059.00 0.00 18506.00 34.95
> 0.19 0.18 0.00 0.18 0.10 10.75
> sdk 0.00 0.00 0.00 1103.00 0.00 14220.00 25.78
> 0.09 0.08 0.00 0.08 0.08 8.35
> sdf 0.00 5.50 2.00 19.50 8.00 5158.75 480.63
> 0.17 8.05 6.50 8.21 6.95 14.95
> sdi 0.00 5.50 28.00 224.50 2210.00 8971.75 88.57
> 122.15 328.47 27.43 366.02 3.71 93.70
> sdm 0.00 0.00 13.00 4.00 718.00 16.00 86.35
> 0.15 3.76 4.23 2.25 3.62 6.15
> sdl 0.00 0.00 16.50 0.00 832.00 0.00 100.85
> 0.02 1.12 1.12 0.00 1.09 1.80
> sdg 0.00 2.50 17.00 23.50 1032.00 3224.50 210.20
> 0.25 6.25 2.56 8.91 3.41 13.80
> sdh 0.00 10.50 4.50 241.00 66.00 7252.00 59.62
> 23.00 91.66 4.22 93.29 2.11 51.85
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.50 3.50 91.00 92.00 552.75 13.65
> 36.27 479.41 81.57 494.71 5.65 53.35
> sdb 0.00 1.00 6.00 168.00 224.00 962.50 13.64
> 83.35 533.92 62.00 550.77 5.75 100.00
> sdc 0.00 1.00 3.00 171.00 16.00 1640.00 19.03
> 1.08 6.18 11.83 6.08 3.15 54.80
> sdd 0.00 5.00 5.00 107.50 132.00 6576.75 119.27
> 0.79 7.06 18.80 6.51 5.13 57.70
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 1111.50 0.00 22346.00 40.21
> 0.27 0.24 0.00 0.24 0.11 12.10
> sdk 0.00 0.00 0.00 1022.00 0.00 33040.00 64.66
> 0.68 0.67 0.00 0.67 0.13 13.60
> sdf 0.00 5.50 2.50 91.00 12.00 4977.25 106.72
> 2.29 24.48 14.40 24.76 2.42 22.60
> sdi 0.00 0.00 10.00 69.50 368.00 858.50 30.86
> 7.40 586.41 5.50 669.99 4.21 33.50
> sdm 0.00 4.00 8.00 210.00 944.00 5833.50 62.18
> 1.57 7.62 18.62 7.20 4.57 99.70
> sdl 0.00 0.00 7.50 22.50 104.00 253.25 23.82
> 0.14 4.82 5.07 4.73 4.03 12.10
> sdg 0.00 4.00 1.00 84.00 4.00 3711.75 87.43
> 0.58 6.88 12.50 6.81 5.75 48.90
> sdh 0.00 3.50 7.50 44.00 72.00 2954.25 117.52
> 1.54 39.50 61.73 35.72 6.40 32.95
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 1.00 0.00 20.00 0.00 40.00
> 0.01 14.50 14.50 0.00 14.50 1.45
> sdb 0.00 7.00 10.50 198.50 2164.00 6014.75 78.27
> 1.94 9.29 28.90 8.25 4.77 99.75
> sdc 0.00 2.00 4.00 95.50 112.00 5152.25 105.81
> 0.94 9.46 24.25 8.84 4.68 46.55
> sdd 0.00 1.00 2.00 131.00 10.00 7167.25 107.93
> 4.55 34.23 83.25 33.48 2.52 33.55
> sde 0.00 0.00 0.00 0.50 0.00 2.00 8.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 541.50 0.00 6468.00 23.89
> 0.05 0.10 0.00 0.10 0.09 5.00
> sdk 0.00 0.00 0.00 509.00 0.00 7704.00 30.27
> 0.07 0.14 0.00 0.14 0.10 4.85
> sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdi 0.00 0.00 3.50 0.00 90.00 0.00 51.43
> 0.04 10.14 10.14 0.00 10.14 3.55
> sdm 0.00 2.00 5.00 102.50 1186.00 4583.00 107.33
> 0.81 7.56 23.20 6.80 2.78 29.85
> sdl 0.00 14.00 10.00 216.00 112.00 3645.50 33.25
> 73.45 311.05 46.30 323.31 3.51 79.35
> sdg 0.00 1.00 0.00 52.50 0.00 240.00 9.14
> 0.25 4.76 0.00 4.76 4.48 23.50
> sdh 0.00 0.00 3.50 0.00 18.00 0.00 10.29
> 0.02 7.00 7.00 0.00 7.00 2.45
>
> Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz
> avgqu-sz await r_await w_await svctm %util
> sda 0.00 0.00 1.00 0.00 4.00 0.00 8.00
> 0.01 14.50 14.50 0.00 14.50 1.45
> sdb 0.00 9.00 2.00 292.00 192.00 10925.75 75.63
> 36.98 100.27 54.75 100.58 2.95 86.60
> sdc 0.00 9.00 10.50 151.00 78.00 6771.25 84.82
> 36.06 94.60 26.57 99.33 3.77 60.85
> sdd 0.00 0.00 5.00 1.00 74.00 24.00 32.67
> 0.03 5.00 6.00 0.00 5.00 3.00
> sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00
> 0.00 0.00 0.00 0.00 0.00 0.00
> sdj 0.00 0.00 0.00 787.50 0.00 9418.00 23.92
> 0.07 0.10 0.00 0.10 0.09 6.70
> sdk 0.00 0.00 0.00 766.50 0.00 9400.00 24.53
> 0.08 0.11 0.00 0.11 0.10 7.70
> sdf 0.00 0.00 0.50 41.50 6.00 391.00 18.90
> 0.24 5.79 9.00 5.75 5.50 23.10
> sdi 0.00 10.00 9.00 268.00 92.00 1618.75 12.35
> 68.20 150.90 15.50 155.45 2.36 65.30
> sdm 0.00 11.50 10.00 330.50 72.00 3201.25 19.23
> 68.83 139.38 37.45 142.46 1.84 62.80
> sdl 0.00 2.50 2.50 228.50 14.00 2526.00 21.99
> 90.42 404.71 242.40 406.49 4.33 100.00
> sdg 0.00 5.50 7.50 298.00 68.00 5275.25 34.98
> 75.31 174.85 26.73 178.58 2.67 81.60
> sdh 0.00 0.00 2.50 2.00 28.00 24.00 23.11
> 0.01 2.78 5.00 0.00 2.78 1.25
>
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
> On Sun, Oct 4, 2015 at 12:16 AM, Josef Johansson wrote:
> Hi,
>
> I don't know what brand those 4TB spindles are, but I know that mine are very
> bad at doing writes at the same time as reads, especially small mixed read/write.
>
> This has an absurdly bad effect when doing maintenance on Ceph. That being
> said, we see a big difference between dumpling and hammer in performance on
> these drives, most likely because hammer is able to read/write degraded PGs.
>
> We have run into two different problems along the way. The first was blocked
> requests, where we had to upgrade from 64GB of memory on each node to 256GB.
> We thought it was the only safe buy that would make things better.
>
> I believe it worked because more reads were cached, so we had less mixed
> read/write on the nodes, giving the spindles more room to breathe. It was a
> shot in the dark at the time, but the price is not that high even just to try
> it out, compared to 6 people working on the problem. I believe the IO on the
> disks was not huge either, but what kills the disks is high latency. How much
> bandwidth are the disks using? Ours was very low, 3-5 MB/s.
>
> The second problem was fragmentation hitting 70%; lowering that to 6% made a
> lot of difference. Depending on the IO pattern it grows back at different rates.
>
> TL;DR read kills the 4TB spindles.
>
> Hope you guys get clear of the woods.
> /Josef
>
> On 3 Oct 2015 10:10 pm, "Robert LeBlanc" wrote:
> - -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
>
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are any
> certified Ceph consultants in the US who can do both remote and
> on-site engagements, please let us know.
>
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers and the number
> of TCP sockets, and reduced the FIN_WAIT2 timeout. There is about 25% idle
> on the boxes; the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation for why I/O is
> blocked pretty frequently for longer than 30 seconds. We have verified
> jumbo frames by pinging from/to each node with 9000-byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels, including
> the recent Google patch to cubic. This is showing up on three clusters
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
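>
> (For concreteness, the checks and tuning above look roughly like this; the
> peer hostname and the values are illustrative, not our exact settings:)
>
> # verify jumbo frames end to end: 8972 bytes of payload plus 28 bytes of
> # IP/ICMP headers gives 9000-byte packets, with fragmentation forbidden
> ping -M do -s 8972 -c 3 ceph2
> # example TCP buffer and FIN_WAIT2 tuning
> sysctl -w net.core.rmem_max=16777216
> sysctl -w net.core.wmem_max=16777216
> sysctl -w net.ipv4.tcp_fin_timeout=10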
>
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
>
> I don't know what "no flag points reached" means.
>
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13): we will have hundreds of I/Os blocked, sometimes for up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy, normally doing 8,000 IOPS and peaking
> at 15,000. This is all 4TB spindles with SSD journals, and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time because we get blocked I/O. We have max_backfills and
> max_recovery set to 1, and client op priority is set higher than recovery
> priority. We tried increasing the number of op threads, but this didn't
> seem to help. It seems that as soon as PGs are finished being checked, they
> become active, and that could be the cause of slow I/O while the other PGs
> are being checked.
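>
> (For reference, those throttles correspond to settings along these lines,
> shown as runtime injection; this is a sketch that assumes "max_recovery"
> means osd_recovery_max_active, and the priority values are illustrative:)
>
> ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63 --osd-recovery-op-priority 1'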
>
> What I don't understand is that the messages are delayed. As soon as
> the message is received by the Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD, which is received very quickly as well. I've adjusted
> min_free_kbytes and it seems to keep the OSDs from crashing, but it
> doesn't solve the main problem. We don't have swap, and there is 64 GB
> of RAM per node for 10 OSDs.
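>
> (The min_free_kbytes adjustment is just a sysctl; the value below is
> illustrative, not what we settled on:)
>
> sysctl -w vm.min_free_kbytes=262144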
>
> Is there something that could cause the kernel to receive a packet but not
> be able to dispatch it to Ceph, which could explain why we are seeing this
> blocked I/O for 30+ seconds? Are there any pointers on tracing Ceph messages
> from the network buffer through the kernel to the Ceph process?
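>
> (Something along these lines is what we are after: capture the OSD traffic
> and line the packet timestamps up against the messenger debug log. The
> interface name is illustrative; 6800-7300 is the default OSD port range:)
>
> tcpdump -i eth0 -w osd_traffic.pcap 'tcp portrange 6800-7300'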
>
> We could really use some pointers, no matter how outrageous. We've had
> over 6 people looking into this for weeks now and just can't think of
> anything else.
>
> Thanks,
> - -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWEDY1CRDmVDuy+mK58QAARgoP/RcoL1qVmg7qbQrzStar
> NK80bqYGeYHb26xHbt1fZVgnZhXU0nN0Dv4ew0e/cYJLELSO2KCeXNfXN6F1
> prZuzYagYEyj1Q1TOo+4h/nOQRYsTwQDdFzbHb/OUDN55C0QGZ29DjEvrqP6
> K5l6sAQzvQDpUEEIiOCkS6pH59ira740nSmnYkEWhr1lxF/hMjb6fFlfCFe2
> h1djM0GfY7vBHFGgI3jkw0BL5AQnWe+SCcCiKZmxY6xiR70FWl3XqK5M+nxm
> iq74y7Dv6cpenit6boMr6qtOeIt+8ko85hVMh09Hkaqz/m2FzxAKLcahzkGF
> Fh/M6YBzgnX7QBURTC4YQT/FVyDTW3JMuT3RKQdaX6c0iiOsVdkE+iyidWyY
> Hr1KzWU23Ur9yBfZ39Y43jrsSiAEwHnKjSqMowSGljdTysNEAAZQhlqZIoHb
> JlgpB39ugkHI1H5fZ5b2SIDz32/d5ywG4Gay9Rk6hp8VanvIrBbev+JYEoYT
> 8/WX+fhueHt4dqUYWIl3HZ0CEzbXbug0xmFvhrbmL2f3t9XOkDZRbAjlYrGm
> lswiJMDueY8JkxSnPvCQrHXqjbCcy9rMG7nTnLFz98rTcHNCwtpv0qVYhheg
> 4YRNRVMbfNP/6xsJvG1wVOSQPwxZSPqJh42pDqMRePJl3Zn66MTx5wvdNDpk
> l7OF
> =OI++
> - -----END PGP SIGNATURE-----
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc wrote:
> > We dropped the replication on our cluster from 4 to 3 and it looks
> > like all the blocked I/O has stopped (no entries in the log for the
> > last 12 hours). This makes me believe that there is some issue with
> > the number of sockets or some other TCP problem. We have not messed with
> > ephemeral ports and TIME_WAIT at this point. There are 130 OSDs and 8 KVM
> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> > processes and 16K system wide.
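> >
> > (The sort of things we have been checking there, as a sketch; the OSD pid
> > is a placeholder:)
> >
> > sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_fin_timeout
> > ss -s                                   # socket summary, incl. TIME_WAIT
> > grep 'open files' /proc/<osd-pid>/limits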
> >
> > Does this seem like the right spot to be looking? What are some
> > configuration items we should be looking at?
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA256
> >>
> >> We were only able to get ~17Gb out of the XL710 (heavily tweaked)
> >> until we went to the 4.x kernel, where we got ~36Gb (no tweaking). It
> >> seems that there were some major reworks of the network handling in
> >> the kernel to efficiently handle that network rate. If I remember
> >> right, we also saw a drop in CPU utilization. I'm starting to think
> >> that we did see packet loss while congesting our ISLs in our initial
> >> testing, but we could not tell where the dropping was happening. We
> >> saw some on the switches, but it didn't seem to be bad if we weren't
> >> trying to congest things. We probably already saw this issue, we just
> >> didn't know it.
> >> - ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson wrote:
> >>> FWIW, we've got some 40GbE Intel cards in the community performance
> >>> cluster
> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>> with 3.10.0-229.7.2.el7.x86_64. We did get feedback from Intel that older
> >>> drivers might cause problems though.
> >>>
> >>> Here's ifconfig from one of the nodes:
> >>>
> >>> ens513f1: flags=4163 mtu 1500
> >>> inet 10.0.10.101 netmask 255.255.255.0 broadcast 10.0.10.255
> >>> inet6 fe80::6a05:caff:fe2b:7ea1 prefixlen 64 scopeid 0x20
> >>> ether 68:05:ca:2b:7e:a1 txqueuelen 1000 (Ethernet)
> >>> RX packets 169232242875 bytes 229346261232279 (208.5 TiB)
> >>> RX errors 0 dropped 0 overruns 0 frame 0
> >>> TX packets 153491686361 bytes 203976410836881 (185.5 TiB)
> >>> TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
> >>>
> >>> Mark
> >>>
> >>>
> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>
> >>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>> Hash: SHA256
> >>>>
> >>>> OK, here is the update on the saga...
> >>>>
> >>>> I traced some more of the blocked I/Os, and it seems that communication
> >>>> between two hosts was worse than between the others. I did a two-way ping
> >>>> flood between the two hosts using max packet sizes (1500). After 1.5M
> >>>> packets, no lost pings. Then I had the ping flood running while I
> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
> >>>> after stopping the Ceph workload the pings stopped dropping.
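> >>>>
> >>>> (The flood test was along these lines, as a sketch; 1472 bytes of payload
> >>>> plus 28 bytes of headers gives the 1500-byte packets mentioned above:)
> >>>>
> >>>> ping -f -s 1472 <peer-host>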
> >>>>
> >>>> I then ran iperf between all the nodes with the same results, so that
> >>>> ruled out Ceph to a large degree. I then booted into the
> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and in an hour of testing so far there
> >>>> haven't been any dropped pings or blocked I/O. Our 40 Gb NICs really
> >>>> need the network enhancements in the 4.x series to work well.
> >>>>
> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>>> kernel to see where this issue was introduced. Both of the clusters
> >>>> with this issue are running 4.x; other than that, they have pretty
> >>>> different hardware and network configs.
> >>>>
> >>>> Thanks,
> >>>> -----BEGIN PGP SIGNATURE-----
> >>>> Version: Mailvelope v1.1.0
> >>>> Comment: https://www.mailvelope.com
> >>>>
> >>>> wsFcBAEBCAAQBQJWAvOzCRDmVDuy+mK58QAApOMP/1xmCtW++G11qcE8y/sr
> >>>> RkXguqZJLc4czdOwV/tjUvhVsm5qOl4wvQCtABFZpc6t4+m5nzE3LkA1rl2l
> >>>> AnARPOjh61TO6cV0CT8O0DlqtHmSd2y0ElgAUl0594eInEn7eI7crz8R543V
> >>>> 7I68XU5zL/vNJ9IIx38UqdhtSzXQQL664DGq3DLINK0Yb9XRVBlFip+Slt+j
> >>>> cB64TuWjOPLSH09pv7SUyksodqrTq3K7p6sQkq0MOzBkFQM1FHfOipbo/LYv
> >>>> F42iiQbCvFizArMu20WeOSQ4dmrXT/iecgTfEag/Zxvor2gOi/J6d2XS9ckW
> >>>> byEC5/rbm4yDBua2ZugeNxQLWq0Oa7spZnx7usLsu/6YzeDNI6kmtGURajdE
> >>>> /XC8bESWKveBzmGDzjff5oaMs9A1PZURYnlYADEODGAt6byoaoQEGN6dlFGe
> >>>> LwQ5nOdQYuUrWpJzTJBN3aduOxursoFY8S0eR0uXm0l1CHcp22RWBDvRinok
> >>>> UWk5xRBgjDCD2gIwc+wpImZbCtiTdf0vad1uLvdxGL29iFta4THzJgUGrp98
> >>>> sUqM3RaTRdJYjFcNP293H7/DC0mqpnmo0Clx3jkdHX+x1EXpJUtocSeI44LX
> >>>> KWIMhe9wXtKAoHQFEcJ0o0+wrXWMevvx33HPC4q1ULrFX0ILNx5Mo0Rp944X
> >>>> 4OEo
> >>>> =P33I
> >>>> -----END PGP SIGNATURE-----
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>>> wrote:
> >>>>>
> >>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>> Hash: SHA256
> >>>>>
> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>>>> found that an MTU under 32K worked reliably for ping, but I still had
> >>>>> the blocked I/O.
> >>>>>
> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>>>> the blocked I/O.
> >>>>> - ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil wrote:
> >>>>>>
> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>>
> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>>>>>> osd.13 started reading it, which is pretty weird. Sage, didn't we
> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>>>>>> delayed for many 10s of seconds?
> >>>>>>
> >>>>>>
> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>>>>> has
> >>>>>> been a network misconfiguration. Usually related to jumbo frames.
> >>>>>>
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> What kernel are you running?
> >>>>>>> -Sam
> >>>>>>>
> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc wrote:
> >>>>>>>>
> >>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>> Hash: SHA256
> >>>>>>>>
> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>>>>>>> extracted what I think are the important entries from the logs for the
> >>>>>>>> first blocked request. NTP is running on all the servers, so the logs
> >>>>>>>> should be close in terms of time. Logs for 12:50 to 13:00 are
> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
> >>>>>>>>
> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual
> >>>>>>>> data transfer).
> >>>>>>>>
> >>>>>>>> It looks like osd.17 is receiving responses to start the communication
> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>>>>>>> later. To me it seems that the message is being received but not
> >>>>>>>> passed to another thread right away, or something like that. This test
> >>>>>>>> was done with an idle cluster and a single fio client (rbd engine)
> >>>>>>>> with a single thread.
> >>>>>>>>
> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>>>>>>> some help.
> >>>>>>>>
> >>>>>>>> Single Test started about
> >>>>>>>> 2015-09-22 12:52:36
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.439150 secs
> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>>>>>>> osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>> currently waiting for subops from 13,16
> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 :
> >>>>>>>> cluster
> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>>>>>>> 30.379680 secs
> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 :
> >>>>>>>> cluster
> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.406303:
> >>>>>>>> osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>> currently waiting for subops from 13,17
> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 :
> >>>>>>>> cluster
> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.318144:
> >>>>>>>> osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>> currently waiting for subops from 13,14
> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.954212 secs
> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>>>>>>> osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>> currently waiting for subops from 16,17
> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.704367 secs
> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>>>>>>> osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>> currently waiting for subops from 13,17
> >>>>>>>>
> >>>>>>>> Server IP addr OSD
> >>>>>>>> nodev - 192.168.55.11 - 12
> >>>>>>>> nodew - 192.168.55.12 - 13
> >>>>>>>> nodex - 192.168.55.13 - 16
> >>>>>>>> nodey - 192.168.55.14 - 17
> >>>>>>>> nodez - 192.168.55.15 - 14
> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>>>>>>>
> >>>>>>>> fio job:
> >>>>>>>> [rbd-test]
> >>>>>>>> readwrite=write
> >>>>>>>> blocksize=4M
> >>>>>>>> #runtime=60
> >>>>>>>> name=rbd-test
> >>>>>>>> #readwrite=randwrite
> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>>>>>>> #rwmixread=72
> >>>>>>>> #norandommap
> >>>>>>>> #size=1T
> >>>>>>>> #blocksize=4k
> >>>>>>>> ioengine=rbd
> >>>>>>>> rbdname=test2
> >>>>>>>> pool=rbd
> >>>>>>>> clientname=admin
> >>>>>>>> iodepth=8
> >>>>>>>> #numjobs=4
> >>>>>>>> #thread
> >>>>>>>> #group_reporting
> >>>>>>>> #time_based
> >>>>>>>> #direct=1
> >>>>>>>> #ramp_time=60
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> -----BEGIN PGP SIGNATURE-----
> >>>>>>>> Version: Mailvelope v1.1.0
> >>>>>>>> Comment: https://www.mailvelope.com
> >>>>>>>>
> >>>>>>>> wsFcBAEBCAAQBQJWAcaKCRDmVDuy+mK58QAAPMsQAKBnS94fwuw0OqpPU3/z
> >>>>>>>> tL8Z6TVRxrNigf721+2ClIu4LIH71bupDc3DgrrysQmmqGuvEMn68spmasWu
> >>>>>>>> h9I/CqqgRpHqe4lUVoUEjyWA9/6Dbb6NiHSdpJ6p5jpGc8kZCvNS+ocDgFOl
> >>>>>>>> 903i0M0E9eEMeci5O/hrMrx1FG8SN2LS8nI261aNHMOwQK0bw8wWiCJEvqVB
> >>>>>>>> sz1/+jK1BJoeIYfaT9HfUXBAvfo/W3tY/vj9KbJuZJ5AMpeYPvEHu/LAr1N7
> >>>>>>>> FzzUc7a6EMlaxmSd0ML49JbV0cY9BMDjfrkKEQNKlzszlEHm3iif98QtsxbF
> >>>>>>>> pPJ0hZ0G53BY3k976OWVMFm3WFRWUVOb/oiLF8H6PCm59b4LBNAg6iPNH1AI
> >>>>>>>> 5XhEcPpg06M03vqUaIiY9P1kQlvnn0yCXf82IUEgmg///vhxDsHWmcwClLEn
> >>>>>>>> B0VszouStTzlMYnc/2vlUiI4gFVeilWLMW00VGTWV+7V1oIzIYvWHyl2QpBq
> >>>>>>>> 4/ZwVjQ43qLfuDTS4o+IJ4ztOMd26vIv6Mn6WVwKCjoCXJc8ajywR9Dy+6lL
> >>>>>>>> o8oJ+tn7hMc9Qy1iBhu3/QIP4WCsUf9RVeu60oahNEpde89qW32S9CZlrJDO
> >>>>>>>> gf4iTryRjkAhdmZIj9JiaE8jQ6dvN817D9cqs/CXKV9vhzYoM7p5YWHghBKB
> >>>>>>>> J3hS
> >>>>>>>> =0J7F
> >>>>>>>> -----END PGP SIGNATURE-----
> >>>>>>>> ----------------
> >>>>>>>> Robert LeBlanc
> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc wrote:
> >>>>>>>>>>
> >>>>>>>>>> -----BEGIN PGP SIGNED MESSAGE-----
> >>>>>>>>>> Hash: SHA256
> >>>>>>>>>>
> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>>>>>>>>>
> >>>>>>>>>> I'm not
> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way to
> >>>>>>>>>> prevent the splitting? Is there a negative side effect to doing so?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>>>>>>>> this, it was discussed not too long ago.
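> >>>>>>>>>
> >>>>>>>>> (Roughly, that means something like this in ceph.conf on the OSD
> >>>>>>>>> nodes; the values here are only illustrative:)
> >>>>>>>>>
> >>>>>>>>> [osd]
> >>>>>>>>> filestore split multiple = 8
> >>>>>>>>> filestore merge threshold = 40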
> >>>>>>>>>
> >>>>>>>>>> We've had I/O block for over 900 seconds, and as soon as the sessions
> >>>>>>>>>> are aborted, they are reestablished and complete immediately.
> >>>>>>>>>>
> >>>>>>>>>> The fio test is just a sequential write, and starting it over
> >>>>>>>>>> (rewriting from the beginning) is still causing the issue. I suspect
> >>>>>>>>>> that it is not having to create new files, and therefore not split
> >>>>>>>>>> collections. This is on my test cluster with no other load.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hmm, that does make it seem less likely if you're really not
> >>>>>>>>> creating
> >>>>>>>>> new objects, if you're actually running fio in such a way that it's
> >>>>>>>>> not allocating new FS blocks (this is probably hard to set up?).
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you want to go log diving, "debug osd = 20", "debug filestore = 20",
> >>>>>>>>> and "debug ms = 1" are what the OSD guys like to see. That should spit
> >>>>>>>>> out everything you need to track exactly what each Op is doing.
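> >>>>>>>>> (At runtime that is roughly: ceph tell osd.* injectargs
> >>>>>>>>> '--debug-osd 20 --debug-filestore 20 --debug-ms 1', as a sketch of the
> >>>>>>>>> equivalent runtime injection; the same settings can go in ceph.conf.)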
> >>>>>>>>> -Greg
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> -----BEGIN PGP SIGNATURE-----
> >>>>> Version: Mailvelope v1.1.0
> >>>>> Comment: https://www.mailvelope.com
> >>>>>
> >>>>> wsFcBAEBCAAQBQJWAdMSCRDmVDuy+mK58QAAoEgP/AqpH7i1BLpoz6fTlfWG
> >>>>> a6swvF8xvsyR15PDiPINYT0N7MgoikikGrMmhWpJ6utEr1XPW0MPFgzvNIsf
> >>>>> a1eMtNzyww4rAo6JCq6BtjmUsSKmOrBNhRNr6It9v4Nv+biqZHkiY8x/rRtV
> >>>>> s9z0cv3Q9Wqa6y/zKZg3H1XtbtUAx0r/DUwzSsP3omupZgNyaKkCgdkil9Vc
> >>>>> iyzBxFZU4+qXNT2FBG4dYDjxSHQv4psjvKR3AWXSN4yEn286KyMDjFrsDY5B
> >>>>> izS3h603QPoErqsUQngDE8COcaTAHHrV7gNJTikmGoNW6oQBjFq/z/zindTz
> >>>>> caXshVQQ+OTLo/qzJM8QPswh0TGU74SVbDkTq+eTOb5pBhQbp+42Pkkqh7jj
> >>>>> efyyYgDzpB1WrWRbUlWMNqmnjq7DT3lnAtuHyKbkwVs8x3JMPEiCl6PBvJbx
> >>>>> GnNSCqgDJrpb4fHQ2iqfQeh8Ai6AL1C1Ai19RZPrAUhpDW0/DbUvuoKSR8m7
> >>>>> glYYuH3hpy+oPYRhFcHm2fpNJ3u9npyk2Dai9RpzQ+mWmp3xi7becYmL482H
> >>>>> +WyvLeY+8AiJQDpA0CdD8KeSlOC9bw5TPmihAIn9dVTJ1O2RlapCLqL3YAJg
> >>>>> pGyDs8ercTEJLmvEyElj5XWh5DarsGscd2LELNS/UpyuYurbPcyPKUQ0uPjp
> >>>>> gcZm
> >>>>> =CjwB
> >>>>> -----END PGP SIGNATURE-----
> >>>>
> >>>>
> >>>
> >>
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: Mailvelope v1.1.0
> >> Comment: https://www.mailvelope.com
> >>
> >> wsFcBAEBCAAQBQJWAv3QCRDmVDuy+mK58QAABr4QAJcQj8zjl606aMdkmQG7
> >> S46iMXVav/Tv2os9GCUsQmMPx2u1w3/WmPfjByd6Divczfo0JLDDqrbsqre2
> >> lq0GNK6e8fq6FXHhPpnL+t4uFV4UZ289cma3yklRqEBDXWHlP59Hu7VpxC5l
> >> 0MIcCg4wM5VM/LkrfcMven5em5CnjyFJYbActGzw9043rZoyUwCM+eL7sotl
> >> JYHMcNWnqwdt8TLFDhUfVGiAQyV8/6E33CuCNUEuFGdtiBKzs9IZadOI8Ce0
> >> dod2DQNyFSvomqNq6t0DuTCSA+pT8uuks2O0NcrHjoqwIWVkxQGPYlpbpckf
> >> nxQdVM7vkqapVeQ0qUZx43Db9A5wDTC3PaEfVJZPZzWsSDjh9z7o6qHs3Kvp
> >> krfyS+dJaZ3tOYAP1VFDfasj06sOTFu3mfGYToKA75zz5HN7QZ13Zau/qhDu
> >> FHxsgk4oIXJsjj22LiSpoiigH5Ls+aVqtIbg8/vWp+EO6pK1fovEtJVeGAfE
> >> tLOdxfJJLVjMCAScFG9BRl1ePPLeptivKV0v9ruWsTpn+Q96VtqAR5GQCkYE
> >> hFrlxM+oIzHeArhhiIxSPCYLlnzxoD5IYXmTrWUYBCGvlY1mrI3j80mZ4VTj
> >> BErsSlqnjUyFKmaI7YNKyARCloMroz3wqdy/wpg/63Io62nmh5IyY+WO8hPo
> >> ae22
> >> =AX+L
> >> -----END PGP SIGNATURE-----
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> -----BEGIN PGP SIGNATURE-----
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWEVA5CRDmVDuy+mK58QAAY4EP/2jTEGPrbR3KDOC1d6FU
> 7TkVeFtow7UCe9/TwArLtcEVTr8rdaXNWRi7gat99zbL5pw+96Sj6bGqpKVz
> ZBSHcBlLIl42Hj10Ju7Svpwn7Q9RnSGOvjEdghEKsTxnf37gZD/KjvMbidJu
> jlPGEfnGEdYbQ+vDYoCoUIuvUNPbCvWQTjJpnTXrMZfhhEBoOepMzF9s6L6B
> AWR9WUrtz4HtGSMT42U1gd3LDOUh/5Ioy6FuhJe04piaf3ikRg+pjX47/WBd
> mQupmKJOblaULCswOrMLTS9R2+p6yaWj0zlUb3OAOErO7JR8OWZ2H7tYjkQN
> rGPsIRNv4yKw2Z5vJdHLksVdYhBQY1I4N1GO3+hf+j/yotPC9Ay4BYLZrQwf
> 3L+uhqSEu80erZjsJF4lilmw0l9nbDSoXc0MqRoXrpUIqyVtmaCBynv5Xq7s
> L5idaH6iVPBwy4Y6qzVuQpP0LaHp48ojIRx7likQJt0MSeDzqnslnp5B/9nb
> Ppu3peRUKf5GEKISRQ6gOI3C4gTSSX6aBatWdtpm01Et0T6ysoxAP/VoO3Nb
> 0PDsuYYT0U1MYqi0USouiNc4yRWNb9hkkBHEJrwjtP52moL1WYdYleL6w+FS
> Y1YQ1DU8YsEtVniBmZc4TBQJRRIS6SaQjH108JCjUcy9oVNwRtOqbcT1aiI6
> EP/Q
> =efx7
> -----END PGP SIGNATURE-----
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com