Hi,

Looking over the disks and comparing them to our setup: we have somewhat
different hardware, but it should be comparable. We are running Hitachi 4TB
drives (HUS724040AL), Intel DC S3700 SSDs, and a SAS3008 controller instead.

In our old cluster (the new and old clusters use almost the same hardware) we
overloaded the cluster and had to wait three nights before the last new disk
was added; next time we'll turn down the recovery ratios and let it run during
the day. For now we use the nobackfill flag so backfill only runs at night,
and we also had to disable deep-scrub during the day to leave more headroom
for client IO.
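The night-only scheme above can be scripted; here is a minimal sketch, assuming a 22:00-06:00 night window (our convention, not a figure from this thread). The ceph commands are echoed so the script is safe to dry-run; drop the leading "echo" to apply them for real.

```shell
#!/bin/sh
# Gate backfill and deep-scrub by time of day. nobackfill and
# nodeep-scrub are standard Ceph OSD flags; the 22:00-06:00 window
# is an example.
phase() {
    # Print "night" or "day" for an hour given as 00-23.
    if [ "$1" -ge 22 ] || [ "$1" -lt 6 ]; then echo night; else echo day; fi
}

if [ "$(phase "$(date +%H)")" = night ]; then
    echo ceph osd unset nobackfill      # drop "echo" to apply
    echo ceph osd unset nodeep-scrub
else
    echo ceph osd set nobackfill
    echo ceph osd set nodeep-scrub
fi
```

Run from cron once an hour and the flags track the window automatically.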

On the new cluster we still run everything during the day, but the article
Intel wrote, which specified how many clients a cluster can handle, seems to
hold up pretty well. What I couldn't figure out was why the system was not
loaded more than it was; it _feels_ as if mixed reads and writes push latency
past a certain point where the clients start to suffer, and that this is not
visible through iostat and friends.
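One way to make that mixed-load latency visible is to run a synthetic mixed workload and look at tail latency rather than the averages iostat reports. A hypothetical fio invocation; the file name, size, and the 70/30 read/write mix are assumptions, not numbers from this thread:

```shell
#!/bin/sh
# Small-block mixed random read/write, direct I/O, with explicit
# completion-latency percentiles. Echoed here; drop the echo (or run
# the command directly) on a test node with fio installed.
fio_cmd="fio --name=mixed --filename=/tmp/fio.test --size=1G \
--rw=randrw --rwmixread=70 --bs=4k --iodepth=16 --direct=1 \
--ioengine=libaio --runtime=60 --time_based \
--percentile_list=50:95:99:99.9"
echo "$fio_cmd"
```

The clat p99/p99.9 numbers in the output are much closer to what the clients actually feel than the averages are.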

Hopefully the debug logs can pin down what the latency is due to. I'll keep
tabs on how this turns out.

I believe these guys do Ceph consulting as well:
https://www.hastexo.com/knowledge/storage-io/ceph-rados

Regards,
Josef

> On 04 Oct 2015, at 18:13, Robert LeBlanc <rob...@leblancnet.us> wrote:
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA256
> 
> These are Toshiba MG03ACA400 drives.
> 
> sd{a,b} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series 
> chipset 6-Port SATA AHCI Controller (rev 05) at 3.0 Gb
> sd{c,d} are 4TB on 00:1f.2 SATA controller: Intel Corporation C600/X79 series 
> chipset 6-Port SATA AHCI Controller (rev 05) at 6.0 Gb
> sde is SATADOM with OS install
> sd{f..i,l,m} are 4TB on 01:00.0 Serial Attached SCSI controller: LSI Logic / 
> Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
> sd{j,k} are 240 GB Intel SSDSC2BB240G4 on 01:00.0 Serial Attached SCSI 
> controller: LSI Logic / Symbios Logic SAS2308 PCI-Express Fusion-MPT SAS-2 
> (rev 05)
> 
> There is probably some performance optimization that we can do in this area, 
> however unless I'm missing something, I don't see anything that should cause 
> I/O to take 30-60+ seconds to complete from a disk standpoint.
> 
> [root@ceph1 ~]# for i in {{a..d},{f..i},{l,m}}; do echo -n "sd${i}1: "; \
>     xfs_db -c frag -r /dev/sd${i}1; done
> sda1: actual 924229, ideal 414161, fragmentation factor 55.19%
> sdb1: actual 1703083, ideal 655321, fragmentation factor 61.52%
> sdc1: actual 2161827, ideal 746418, fragmentation factor 65.47%
> sdd1: actual 1807008, ideal 654214, fragmentation factor 63.80%
> sdf1: actual 735471, ideal 311837, fragmentation factor 57.60%
> sdg1: actual 1463859, ideal 507362, fragmentation factor 65.34%
> sdh1: actual 1684905, ideal 556571, fragmentation factor 66.97%
> sdi1: actual 1833980, ideal 608499, fragmentation factor 66.82%
> sdl1: actual 1641128, ideal 554364, fragmentation factor 66.22%
> sdm1: actual 2032644, ideal 697129, fragmentation factor 65.70%
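Given fragmentation factors like these, an online defrag pass may help; a sketch using xfs_fsr over the same set of devices, echoed as a dry run (drop the leading "echo" to execute, ideally in a quiet window, since defragmentation adds disk I/O of its own):

```shell
#!/bin/sh
# Report fragmentation (same command as above) and optionally run an
# online defrag pass with xfs_fsr. Device letters follow the listing
# in this message.
for d in a b c d f g h i l m; do
    dev=/dev/sd${d}1
    echo xfs_db -c frag -r "$dev"   # fragmentation report
    echo xfs_fsr -v "$dev"          # online defragmentation pass
done
```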
> 
> 
> [root@ceph1 ~]# iostat -xd 2
> Linux 4.2.1-1.el7.elrepo.x86_64 (ceph1)   10/04/2015      _x86_64_        (16 
> CPU)
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.09     2.06    9.24   36.18   527.28  1743.71   100.00    
>  8.96  197.32   17.50  243.23   4.07  18.47
> sdb               0.17     3.61   16.70   74.44   949.65  2975.30    86.13    
>  6.74   73.95   23.94   85.16   4.31  39.32
> sdc               0.14     4.67   15.69   87.80   818.02  3860.11    90.41    
>  9.56   92.38   26.73  104.11   4.44  45.91
> sdd               0.17     3.43    7.16   69.13   480.96  2847.42    87.25    
>  4.80   62.89   30.00   66.30   4.33  33.00
> sde               0.01     1.13    0.34    0.99     8.35    12.01    30.62    
>  0.01    7.37    2.64    9.02   1.64   0.22
> sdj               0.00     1.22    0.01  348.22     0.03 11302.65    64.91    
>  0.23    0.66    0.14    0.66   0.15   5.15
> sdk               0.00     1.99    0.01  369.94     0.03 12876.74    69.61    
>  0.26    0.71    0.13    0.71   0.16   5.75
> sdf               0.01     1.79    1.55   31.12    39.64  1431.37    90.06    
>  4.07  124.67   16.25  130.05   3.11  10.17
> sdi               0.22     3.17   23.92   72.90  1386.45  2676.28    83.93    
>  7.75   80.00   24.31   98.27   4.31  41.77
> sdm               0.16     3.10   17.63   72.84   986.29  2767.24    82.98    
>  6.57   72.64   23.67   84.50   4.23  38.30
> sdl               0.11     3.01   12.10   55.14   660.85  2361.40    89.89    
> 17.87  265.80   21.64  319.36   4.08  27.45
> sdg               0.08     2.45    9.75   53.90   489.67  1929.42    76.01    
> 17.27  271.30   20.77  316.61   3.98  25.33
> sdh               0.10     2.76   11.28   60.97   600.10  2114.48    75.14    
>  1.70   23.55   22.92   23.66   4.10  29.60
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    0.50    0.00   146.00     0.00   584.00    
>  0.01   16.00   16.00    0.00  16.00   0.80
> sdb               0.00     0.50    9.00  119.00  2036.00  2578.00    72.09    
>  0.68    5.50    7.06    5.39   2.36  30.25
> sdc               0.00     4.00   34.00  129.00   494.00  6987.75    91.80    
>  1.70   10.44   17.00    8.72   4.44  72.40
> sdd               0.00     1.50    1.50   95.50    74.00  2396.50    50.94    
>  0.85    8.75   23.33    8.52   7.53  73.05
> sde               0.00    37.00   11.00    1.00    46.00   152.00    33.00    
>  0.01    1.00    0.64    5.00   0.54   0.65
> sdj               0.00     0.50    0.00  970.50     0.00 12594.00    25.95    
>  0.09    0.09    0.00    0.09   0.08   8.20
> sdk               0.00     0.00    0.00  977.50     0.00 12016.00    24.59    
>  0.10    0.10    0.00    0.10   0.09   8.90
> sdf               0.00     0.50    0.50   37.50     2.00   230.25    12.22    
>  9.63   10.58    8.00   10.61   1.79   6.80
> sdi               2.00     0.00   10.50    0.00  2528.00     0.00   481.52    
>  0.10    9.33    9.33    0.00   7.76   8.15
> sdm               0.00     0.50   15.00  116.00   546.00   833.25    21.06    
>  0.94    7.17   14.03    6.28   4.13  54.15
> sdl               0.00     0.00    3.00    0.00    26.00     0.00    17.33    
>  0.02    7.50    7.50    0.00   7.50   2.25
> sdg               0.00     3.50    1.00   64.50     4.00  2929.25    89.56    
>  0.40    6.04    9.00    5.99   3.42  22.40
> sdh               0.50     0.50    4.00   64.00   770.00  1105.00    55.15    
>  4.96  189.42   21.25  199.93   4.21  28.60
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    8.50    0.00   110.00     0.00    25.88    
>  0.01    1.59    1.59    0.00   1.53   1.30
> sdb               0.00     4.00    6.50  117.50   494.00  4544.50    81.27    
>  0.87    6.99   11.62    6.73   3.28  40.70
> sdc               0.00     0.50    5.50  202.50   526.00  4123.00    44.70    
>  1.80    8.66   18.73    8.39   2.08  43.30
> sdd               0.00     3.00    2.50  227.00   108.00  6952.00    61.53    
> 46.10  197.44   30.20  199.29   3.86  88.60
> sde               0.00     0.00    0.00    1.50     0.00     6.00     8.00    
>  0.00    2.33    0.00    2.33   1.33   0.20
> sdj               0.00     0.00    0.00  834.00     0.00  9912.00    23.77    
>  0.08    0.09    0.00    0.09   0.08   6.75
> sdk               0.00     0.00    0.00  777.00     0.00 12318.00    31.71    
>  0.12    0.15    0.00    0.15   0.10   7.70
> sdf               0.00     1.00    4.50  117.00   198.00   693.25    14.67    
> 34.86  362.88   84.33  373.60   3.59  43.65
> sdi               0.00     0.00    1.50    0.00     6.00     0.00     8.00    
>  0.01    9.00    9.00    0.00   9.00   1.35
> sdm               0.50     3.00    3.50  143.00  1014.00  4205.25    71.25    
>  0.93    5.95   20.43    5.59   3.08  45.15
> sdl               0.50     0.00    8.00  148.50  1578.00  2128.50    47.37    
>  0.82    5.27    6.44    5.21   3.40  53.20
> sdg               1.50     2.00   10.50  100.50  2540.00  2039.50    82.51    
>  0.77    7.00   14.19    6.25   5.42  60.20
> sdh               0.50     0.00    5.00    0.00  1050.00     0.00   420.00    
>  0.04    7.10    7.10    0.00   7.10   3.55
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    6.00    0.00   604.00     0.00   201.33    
>  0.03    5.58    5.58    0.00   5.58   3.35
> sdb               0.00     6.00    7.00  236.00   132.00  8466.00    70.77    
> 45.48  186.59   31.79  191.18   1.62  39.45
> sdc               2.00     0.00   19.50   46.50  6334.00   686.00   212.73    
>  0.39    5.96    7.97    5.12   3.57  23.55
> sdd               0.00     1.00    3.00   20.00    72.00  1527.25   139.07    
>  0.31   47.67    6.17   53.90   3.11   7.15
> sde               0.00    17.00    0.00    4.50     0.00   184.00    81.78    
>  0.01    2.33    0.00    2.33   2.33   1.05
> sdj               0.00     0.00    0.00  805.50     0.00 12760.00    31.68    
>  0.21    0.27    0.00    0.27   0.09   7.35
> sdk               0.00     0.00    0.00  438.00     0.00 14300.00    65.30    
>  0.24    0.54    0.00    0.54   0.13   5.65
> sdf               0.00     0.00    1.00    0.00     6.00     0.00    12.00    
>  0.00    2.50    2.50    0.00   2.50   0.25
> sdi               0.00     5.50   14.50   27.50   394.00  6459.50   326.36    
>  0.86   20.18   11.00   25.02   7.42  31.15
> sdm               0.00     1.00    9.00  175.00   554.00  3173.25    40.51    
>  1.12    6.38    7.22    6.34   2.41  44.40
> sdl               0.00     2.00    2.50  100.50    26.00  2483.00    48.72    
>  0.77    7.47   11.80    7.36   2.10  21.65
> sdg               0.00     4.50    9.00  214.00   798.00  7417.00    73.68    
> 66.56  298.46   28.83  309.80   3.35  74.70
> sdh               0.00     0.00   16.50    0.00   344.00     0.00    41.70    
>  0.09    5.61    5.61    0.00   4.55   7.50
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               1.00     0.00    9.00    0.00  3162.00     0.00   702.67    
>  0.07    8.06    8.06    0.00   6.06   5.45
> sdb               0.50     0.00   12.50   13.00  1962.00   298.75   177.31    
>  0.63   30.00    4.84   54.19   9.96  25.40
> sdc               0.00     0.50    3.50  131.00    18.00  1632.75    24.55    
>  0.87    6.48   16.86    6.20   3.51  47.25
> sdd               0.00     0.00    4.00    0.00    72.00    16.00    44.00    
>  0.26   10.38   10.38    0.00  23.38   9.35
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00  843.50     0.00 16334.00    38.73    
>  0.19    0.23    0.00    0.23   0.11   9.10
> sdk               0.00     0.00    0.00  803.00     0.00 10394.00    25.89    
>  0.07    0.08    0.00    0.08   0.08   6.25
> sdf               0.00     4.00   11.00   90.50   150.00  2626.00    54.70    
>  0.59    5.84    3.82    6.08   4.06  41.20
> sdi               0.00     3.50   17.50  130.50  2132.00  6309.50   114.07    
>  1.84   12.55   25.60   10.80   5.76  85.30
> sdm               0.00     4.00    2.00  139.00    44.00  1957.25    28.39    
>  0.89    6.28   14.50    6.17   3.55  50.10
> sdl               0.00     0.50   12.00  101.00   334.00  1449.75    31.57    
>  0.94    8.28   10.17    8.06   2.11  23.85
> sdg               0.00     0.00    2.50    3.00   204.00    17.00    80.36    
>  0.02    5.27    4.60    5.83   3.91   2.15
> sdh               0.00     0.50    9.50   32.50  1810.00   199.50    95.69    
>  0.28    6.69    3.79    7.54   5.12  21.50
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.50     0.50   25.00   24.50  1248.00   394.25    66.35    
>  0.76   15.30   11.62   19.06   5.25  26.00
> sdb               1.50     0.00   13.50   30.00  2628.00   405.25   139.46    
>  0.27    5.94    8.19    4.93   5.31  23.10
> sdc               0.00     6.00    3.00  163.00    60.00  9889.50   119.87    
>  1.66    9.83   28.67    9.48   5.95  98.70
> sdd               0.00    11.00    5.50  353.50    50.00  2182.00    12.43   
> 118.42  329.26   30.27  333.91   2.78  99.90
> sde               0.00     5.50    0.00    1.50     0.00    28.00    37.33    
>  0.00    2.33    0.00    2.33   2.33   0.35
> sdj               0.00     0.00    0.00 1227.50     0.00 22064.00    35.95    
>  0.50    0.41    0.00    0.41   0.10  12.50
> sdk               0.00     0.50    0.00 1073.50     0.00 19248.00    35.86    
>  0.24    0.23    0.00    0.23   0.10  10.40
> sdf               0.00     4.00    0.00  109.00     0.00  4145.00    76.06    
>  0.59    5.44    0.00    5.44   3.63  39.55
> sdi               0.00     1.00    8.50   95.50   218.00  2091.75    44.42    
>  1.06    9.70   18.71    8.90   7.00  72.80
> sdm               0.00     0.00    8.00  177.50    82.00  3173.00    35.09    
>  1.24    6.65   14.31    6.30   3.53  65.40
> sdl               0.00     3.50    3.00  187.50    32.00  2175.25    23.17    
>  1.47    7.68   18.50    7.50   3.85  73.35
> sdg               0.00     0.00    1.00    0.00    12.00     0.00    24.00    
>  0.00    1.50    1.50    0.00   1.50   0.15
> sdh               0.50     1.00   14.00  169.50  2364.00  4568.00    75.55    
>  1.50    8.12   21.25    7.03   4.91  90.10
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     4.00    3.00   60.00   212.00  2542.00    87.43    
>  0.58    8.02   15.50    7.64   7.95  50.10
> sdb               0.00     0.50    2.50   98.00   682.00  1652.00    46.45    
>  0.51    5.13    6.20    5.10   3.05  30.65
> sdc               0.00     2.50    4.00  146.00    16.00  4623.25    61.86    
>  1.07    7.33   13.38    7.17   2.22  33.25
> sdd               0.00     0.50    9.50   30.00   290.00   358.00    32.81    
>  0.84   32.22   49.16   26.85  12.28  48.50
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.50    0.00  530.00     0.00  7138.00    26.94    
>  0.06    0.11    0.00    0.11   0.09   4.65
> sdk               0.00     0.00    0.00  625.00     0.00  8254.00    26.41    
>  0.07    0.12    0.00    0.12   0.09   5.75
> sdf               0.00     0.00    0.00    4.00     0.00    18.00     9.00    
>  0.01    3.62    0.00    3.62   3.12   1.25
> sdi               0.00     2.50    8.00   61.00   836.00  2681.50   101.96    
>  0.58    9.25   15.12    8.48   6.71  46.30
> sdm               0.00     4.50   11.00  273.00  2100.00  8562.00    75.08    
> 13.49   47.53   24.95   48.44   1.83  52.00
> sdl               0.00     1.00    0.50   49.00     2.00  1038.00    42.02    
>  0.23    4.83   14.00    4.73   2.45  12.15
> sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdh               1.00     1.00    9.00  109.00  2082.00  2626.25    79.80    
>  0.85    7.34    7.83    7.30   3.83  45.20
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     1.50   10.00  177.00   284.00  4857.00    54.98    
> 36.26  194.27   21.85  204.01   3.53  66.00
> sdb               1.00     0.50   39.50  119.50  1808.00  2389.25    52.80    
>  1.58    9.96   12.32    9.18   2.42  38.45
> sdc               0.00     2.00   15.00  200.50   116.00  4951.00    47.03    
> 14.37   66.70   73.87   66.16   2.23  47.95
> sdd               0.00     3.50    6.00   54.50   180.00  2360.50    83.98    
>  0.69   11.36   20.42   10.36   7.99  48.35
> sde               0.00     7.50    0.00   32.50     0.00   160.00     9.85    
>  1.64   50.51    0.00   50.51   1.48   4.80
> sdj               0.00     0.00    0.00  835.00     0.00 10198.00    24.43    
>  0.07    0.09    0.00    0.09   0.08   6.50
> sdk               0.00     0.00    0.00  802.00     0.00 12534.00    31.26    
>  0.23    0.29    0.00    0.29   0.10   8.05
> sdf               0.00     2.50    2.00  133.50    14.00  5272.25    78.03    
>  4.37   32.21    4.50   32.63   1.73  23.40
> sdi               0.00     4.50   17.00  125.50  2676.00  8683.25   159.43    
>  1.86   13.02   27.97   11.00   4.95  70.55
> sdm               0.00     0.00    7.00    0.50   540.00    32.00   152.53    
>  0.05    7.07    7.57    0.00   7.07   5.30
> sdl               0.00     7.00   27.00  276.00  2374.00 11955.50    94.58    
> 25.87   85.36   15.20   92.23   1.84  55.90
> sdg               0.00     0.00   45.00    0.00   828.00     0.00    36.80    
>  0.07    1.62    1.62    0.00   0.68   3.05
> sdh               0.00     0.50    0.50   65.50     2.00  1436.25    43.58    
>  0.51    7.79   16.00    7.73   3.61  23.80
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     8.00   14.50  150.00   122.00   929.25    12.78    
> 20.65   70.61    7.55   76.71   1.46  24.05
> sdb               0.00     5.00    8.00  283.50    86.00  2757.50    19.51    
> 69.43  205.40   51.75  209.73   2.40  69.85
> sdc               0.00     0.00   12.50    1.50   350.00    48.25    56.89    
>  0.25   17.75   17.00   24.00   4.75   6.65
> sdd               0.00     3.50   36.50  141.00   394.00  2338.75    30.79    
>  1.50    8.42   16.16    6.41   4.56  80.95
> sde               0.00     1.50    0.00    1.00     0.00    10.00    20.00    
>  0.00    2.00    0.00    2.00   2.00   0.20
> sdj               0.00     0.00    0.00 1059.00     0.00 18506.00    34.95    
>  0.19    0.18    0.00    0.18   0.10  10.75
> sdk               0.00     0.00    0.00 1103.00     0.00 14220.00    25.78    
>  0.09    0.08    0.00    0.08   0.08   8.35
> sdf               0.00     5.50    2.00   19.50     8.00  5158.75   480.63    
>  0.17    8.05    6.50    8.21   6.95  14.95
> sdi               0.00     5.50   28.00  224.50  2210.00  8971.75    88.57   
> 122.15  328.47   27.43  366.02   3.71  93.70
> sdm               0.00     0.00   13.00    4.00   718.00    16.00    86.35    
>  0.15    3.76    4.23    2.25   3.62   6.15
> sdl               0.00     0.00   16.50    0.00   832.00     0.00   100.85    
>  0.02    1.12    1.12    0.00   1.09   1.80
> sdg               0.00     2.50   17.00   23.50  1032.00  3224.50   210.20    
>  0.25    6.25    2.56    8.91   3.41  13.80
> sdh               0.00    10.50    4.50  241.00    66.00  7252.00    59.62    
> 23.00   91.66    4.22   93.29   2.11  51.85
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.50    3.50   91.00    92.00   552.75    13.65    
> 36.27  479.41   81.57  494.71   5.65  53.35
> sdb               0.00     1.00    6.00  168.00   224.00   962.50    13.64    
> 83.35  533.92   62.00  550.77   5.75 100.00
> sdc               0.00     1.00    3.00  171.00    16.00  1640.00    19.03    
>  1.08    6.18   11.83    6.08   3.15  54.80
> sdd               0.00     5.00    5.00  107.50   132.00  6576.75   119.27    
>  0.79    7.06   18.80    6.51   5.13  57.70
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00 1111.50     0.00 22346.00    40.21    
>  0.27    0.24    0.00    0.24   0.11  12.10
> sdk               0.00     0.00    0.00 1022.00     0.00 33040.00    64.66    
>  0.68    0.67    0.00    0.67   0.13  13.60
> sdf               0.00     5.50    2.50   91.00    12.00  4977.25   106.72    
>  2.29   24.48   14.40   24.76   2.42  22.60
> sdi               0.00     0.00   10.00   69.50   368.00   858.50    30.86    
>  7.40  586.41    5.50  669.99   4.21  33.50
> sdm               0.00     4.00    8.00  210.00   944.00  5833.50    62.18    
>  1.57    7.62   18.62    7.20   4.57  99.70
> sdl               0.00     0.00    7.50   22.50   104.00   253.25    23.82    
>  0.14    4.82    5.07    4.73   4.03  12.10
> sdg               0.00     4.00    1.00   84.00     4.00  3711.75    87.43    
>  0.58    6.88   12.50    6.81   5.75  48.90
> sdh               0.00     3.50    7.50   44.00    72.00  2954.25   117.52    
>  1.54   39.50   61.73   35.72   6.40  32.95
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    1.00    0.00    20.00     0.00    40.00    
>  0.01   14.50   14.50    0.00  14.50   1.45
> sdb               0.00     7.00   10.50  198.50  2164.00  6014.75    78.27    
>  1.94    9.29   28.90    8.25   4.77  99.75
> sdc               0.00     2.00    4.00   95.50   112.00  5152.25   105.81    
>  0.94    9.46   24.25    8.84   4.68  46.55
> sdd               0.00     1.00    2.00  131.00    10.00  7167.25   107.93    
>  4.55   34.23   83.25   33.48   2.52  33.55
> sde               0.00     0.00    0.00    0.50     0.00     2.00     8.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00  541.50     0.00  6468.00    23.89    
>  0.05    0.10    0.00    0.10   0.09   5.00
> sdk               0.00     0.00    0.00  509.00     0.00  7704.00    30.27    
>  0.07    0.14    0.00    0.14   0.10   4.85
> sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdi               0.00     0.00    3.50    0.00    90.00     0.00    51.43    
>  0.04   10.14   10.14    0.00  10.14   3.55
> sdm               0.00     2.00    5.00  102.50  1186.00  4583.00   107.33    
>  0.81    7.56   23.20    6.80   2.78  29.85
> sdl               0.00    14.00   10.00  216.00   112.00  3645.50    33.25    
> 73.45  311.05   46.30  323.31   3.51  79.35
> sdg               0.00     1.00    0.00   52.50     0.00   240.00     9.14    
>  0.25    4.76    0.00    4.76   4.48  23.50
> sdh               0.00     0.00    3.50    0.00    18.00     0.00    10.29    
>  0.02    7.00    7.00    0.00   7.00   2.45
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz 
> avgqu-sz   await r_await w_await  svctm  %util
> sda               0.00     0.00    1.00    0.00     4.00     0.00     8.00    
>  0.01   14.50   14.50    0.00  14.50   1.45
> sdb               0.00     9.00    2.00  292.00   192.00 10925.75    75.63    
> 36.98  100.27   54.75  100.58   2.95  86.60
> sdc               0.00     9.00   10.50  151.00    78.00  6771.25    84.82    
> 36.06   94.60   26.57   99.33   3.77  60.85
> sdd               0.00     0.00    5.00    1.00    74.00    24.00    32.67    
>  0.03    5.00    6.00    0.00   5.00   3.00
> sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00    
>  0.00    0.00    0.00    0.00   0.00   0.00
> sdj               0.00     0.00    0.00  787.50     0.00  9418.00    23.92    
>  0.07    0.10    0.00    0.10   0.09   6.70
> sdk               0.00     0.00    0.00  766.50     0.00  9400.00    24.53    
>  0.08    0.11    0.00    0.11   0.10   7.70
> sdf               0.00     0.00    0.50   41.50     6.00   391.00    18.90    
>  0.24    5.79    9.00    5.75   5.50  23.10
> sdi               0.00    10.00    9.00  268.00    92.00  1618.75    12.35    
> 68.20  150.90   15.50  155.45   2.36  65.30
> sdm               0.00    11.50   10.00  330.50    72.00  3201.25    19.23    
> 68.83  139.38   37.45  142.46   1.84  62.80
> sdl               0.00     2.50    2.50  228.50    14.00  2526.00    21.99    
> 90.42  404.71  242.40  406.49   4.33 100.00
> sdg               0.00     5.50    7.50  298.00    68.00  5275.25    34.98    
> 75.31  174.85   26.73  178.58   2.67  81.60
> sdh               0.00     0.00    2.50    2.00    28.00    24.00    23.11    
>  0.01    2.78    5.00    0.00   2.78   1.25
> 
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> On Sun, Oct 4, 2015 at 12:16 AM, Josef Johansson  wrote:
> Hi,
> 
> I don't know what brand those 4TB spindles are, but I know that mine are very
> bad at writing at the same time as reading, especially small mixed reads and writes.
> 
> This has an absurdly bad effect when doing maintenance on Ceph. That being
> said, we see a big difference between Dumpling and Hammer in performance on
> these drives, most likely because Hammer is able to read and write degraded PGs.
> 
> We have run into two different problems along the way. The first was blocked
> requests, where we had to upgrade from 64GB of memory on each node to 256GB;
> we thought that it was the only safe buy to make things better.
> 
> I believe it worked because more reads were cached, so we had less mixed
> read/write on the nodes, giving the spindles more room to breathe. It was a
> shot in the dark at the time, but the price is not that high just to try it
> out, compared to six people working on the problem. The IO on the disks was
> not huge either; what kills the disks is high latency. How much bandwidth are
> the disks using? Ours was very low, 3-5MB/s.
> 
> The second problem was fragmentation hitting 70%; lowering it to 6% made a
> lot of difference. How quickly it grows back depends on the IO pattern.
> 
> TL;DR: reads kill the 4TB spindles.
> 
> Hope you guys get out of the woods.
> /Josef
> 
> On 3 Oct 2015 10:10 pm, "Robert LeBlanc"  wrote:
> 
> We are still struggling with this and have tried a lot of different
> things. Unfortunately, Inktank (now Red Hat) no longer provides
> consulting services for non-Red Hat systems. If there are any
> certified Ceph consultants in the US with whom we can do both remote and
> on-site engagements, please let us know.
> 
> This certainly seems to be network related, but somewhere in the
> kernel. We have tried increasing the network and TCP buffers, number
> of TCP sockets, reduced the FIN_WAIT2 state. There is about 25% idle
> on the boxes, the disks are busy, but not constantly at 100% (they
> cycle from <10% up to 100%, but not 100% for more than a few seconds
> at a time). There seems to be no reasonable explanation why I/O is
> blocked pretty frequently longer than 30 seconds. We have verified
> Jumbo frames by pinging from/to each node with 9000 byte packets. The
> network admins have verified that packets are not being dropped in the
> switches for these nodes. We have tried different kernels including
> the recent Google patch to cubic. This is showing up on three clusters
> (two Ethernet and one IPoIB). I booted one cluster into Debian Jessie
> (from CentOS 7.1) with similar results.
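On the jumbo-frame check: a plain 9000-byte ping can succeed even when fragmentation is quietly papering over a smaller path MTU. A stricter sketch, with a placeholder peer address:

```shell
#!/bin/sh
# -M do forbids fragmentation, so a silently lowered MTU shows up as
# an error instead of a fragmented success. For a 9000-byte MTU the
# largest unfragmented ICMP payload is 9000 - 20 (IP header) - 8
# (ICMP header) = 8972 bytes. PEER is a placeholder address.
PEER=${PEER:-192.0.2.10}
mtu=9000
payload=$((mtu - 20 - 8))
echo ping -M do -s "$payload" -c 3 "$PEER"   # drop "echo" to run
```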
> 
> The messages seem slightly different:
> 2015-10-03 14:38:23.193082 osd.134 10.208.16.25:6800/1425 439 :
> cluster [WRN] 14 slow requests, 1 included below; oldest blocked for >
> 100.087155 secs
> 2015-10-03 14:38:23.193090 osd.134 10.208.16.25:6800/1425 440 :
> cluster [WRN] slow request 30.041999 seconds old, received at
> 2015-10-03 14:37:53.151014: osd_op(client.1328605.0:7082862
> rbd_data.13fdcb2ae8944a.000000000001264f [read 975360~4096]
> 11.6d19c36f ack+read+known_if_redirected e10249) currently no flag
> points reached
> 
> I don't know what "no flag points reached" means.
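One way to get past that log line: the OSD admin socket records per-stage timestamps ("events") for recent slow ops, which shows where an op actually stalled. A sketch, echoed as a dry run, using osd.134 from the log excerpt above; run it on the node hosting that OSD and drop the "echo" to execute:

```shell
#!/bin/sh
# Inspect slow ops via the OSD admin socket. dump_historic_ops lists
# the slowest recent ops with per-stage timestamps; dump_ops_in_flight
# lists ops currently blocked.
osd=134
echo ceph daemon osd.$osd dump_historic_ops
echo ceph daemon osd.$osd dump_ops_in_flight
```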
> 
> The problem is most pronounced when we have to reboot an OSD node (1
> of 13): we will have hundreds of blocked I/Os, sometimes for up to 300
> seconds. It takes a good 15 minutes for things to settle down. The
> production cluster is very busy, normally doing 8,000 IOPS and peaking
> at 15,000. This is all 4TB spindles with SSD journals, and the disks
> are between 25-50% full. We are currently splitting PGs to distribute
> the load better across the disks, but we are having to do this 10 PGs
> at a time as we get blocked I/O. We have max_backfills and
> max_recovery set to 1, client op priority is set higher than recovery
> priority. We tried increasing the number of op threads but this didn't
> seem to help. It seems as soon as PGs are finished being checked, they
> become active and could be the cause for slow I/O while the other PGs
> are being checked.
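For reference, the throttles described above expressed as runtime injectargs, echoed as a dry run (option names are the Hammer-era ones; the backfill/recovery values are the ones mentioned in the text, while the priority values are examples, not figures from this thread):

```shell
#!/bin/sh
# Recovery/backfill throttling and op priorities, applied at runtime.
# Drop the leading "echo" to apply; injectargs changes revert when the
# OSDs restart, so mirror them in ceph.conf to make them permanent.
echo ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
echo ceph tell 'osd.*' injectargs '--osd-recovery-max-active 1'
echo ceph tell 'osd.*' injectargs '--osd-client-op-priority 63'
echo ceph tell 'osd.*' injectargs '--osd-recovery-op-priority 1'
```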
> 
> What I don't understand is that the messages are delayed. As soon as
> the message is received by Ceph OSD process, it is very quickly
> committed to the journal and a response is sent back to the primary
> OSD, which is received very quickly as well. I've adjusted
> min_free_kbytes and it seems to keep the OSDs from crashing, but
> doesn't solve the main problem. We don't have swap, and there is 64 GB
> of RAM per node for 10 OSDs.
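The min_free_kbytes change as a sketch, echoed as a dry run; 2 GB on a 64 GB node is an example value, not a recommendation from this thread:

```shell
#!/bin/sh
# Reserve more free memory for atomic kernel allocations (e.g. network
# receive under pressure). Drop the "echo" to apply; the second line is
# what you would put in /etc/sysctl.d/ to persist it.
kbytes=$((2 * 1024 * 1024))                 # 2 GB expressed in kB
echo sysctl -w vm.min_free_kbytes=$kbytes   # runtime change
echo "vm.min_free_kbytes = $kbytes"         # /etc/sysctl.d/90-ceph.conf entry
```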
> 
> Is there something that could cause the kernel to receive a packet but
> not be able to dispatch it to Ceph, which would explain why we are
> seeing this I/O blocked for 30+ seconds? Are there any pointers on
> tracing Ceph messages from the network buffer through the kernel to
> the Ceph process?
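One concrete place to look for exactly that: the per-CPU softnet backlog counters. A sketch; the second column of /proc/net/softnet_stat (hex, one row per CPU) counts packets dropped because the backlog was full before dispatch to the process, so a rising total would implicate the kernel side rather than Ceph:

```shell
#!/bin/sh
# Sum the per-CPU "dropped" column of /proc/net/softnet_stat. A total
# that grows over time means the kernel is discarding packets before
# any userspace process (including the Ceph OSD) ever sees them.
softnet_drops() {
    total=0
    while read -r first dropped rest; do
        total=$((total + 0x$dropped))    # column 2 is hex
    done < "${1:-/proc/net/softnet_stat}"
    echo "$total"
}

if [ -r /proc/net/softnet_stat ]; then
    echo "softnet backlog drops: $(softnet_drops)"
fi
```

Sample it before and after a blocked-I/O episode; if it doesn't move, the stall is elsewhere.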
> 
> We could really use some pointers, no matter how outrageous. We've had
> over six people looking into this for weeks now and just can't think of
> anything else.
> 
> Thanks,
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Fri, Sep 25, 2015 at 2:40 PM, Robert LeBlanc  wrote:
> > We dropped the replication on our cluster from 4 to 3 and it looks
> > like all the blocked I/O has stopped (no entries in the log for the
> > last 12 hours). This makes me believe that there is some issue with
> > the number of sockets or some other TCP issue. We have not messed with
> > Ephemeral ports and TIME_WAIT at this point. There are 130 OSDs, 8 KVM
> > hosts hosting about 150 VMs. Open files is set at 32K for the OSD
> > processes and 16K system wide.
> >
> > Does this seem like the right spot to be looking? What are some
> > configuration items we should be looking at?
> >
> > Thanks,
> > ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >
> >
> > On Wed, Sep 23, 2015 at 1:30 PM, Robert LeBlanc  wrote:
> >>
> >> We were only able to get ~17Gb/s out of the XL710 (heavily tweaked)
> >> until we went to the 4.x kernel, where we got ~36Gb/s (no tweaking). It
> >> seems that there were some major reworks of the network handling in
> >> the kernel to efficiently handle that rate. If I remember
> >> right, we also saw a drop in CPU utilization. I'm starting to think
> >> that we did see packet loss while congesting our ISLs in our initial
> >> testing, but we could not tell where the dropping was happening. We
> >> saw some on the switches, but it didn't seem too bad unless we were
> >> trying to congest things. We probably already saw this issue, we just
> >> didn't know it.
> >> ----------------
> >> Robert LeBlanc
> >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>
> >>
> >> On Wed, Sep 23, 2015 at 1:10 PM, Mark Nelson  wrote:
> >>> FWIW, we've got some 40GbE Intel cards in the community performance 
> >>> cluster
> >>> on a Mellanox 40GbE switch that appear (knock on wood) to be running fine
> >>> with 3.10.0-229.7.2.el7.x86_64.  We did get feedback from Intel that older
> >>> drivers might cause problems though.
> >>>
> >>> Here's ifconfig from one of the nodes:
> >>>
> >>> ens513f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >>>         inet 10.0.10.101  netmask 255.255.255.0  broadcast 10.0.10.255
> >>>         inet6 fe80::6a05:caff:fe2b:7ea1  prefixlen 64  scopeid 0x20<link>
> >>>         ether 68:05:ca:2b:7e:a1  txqueuelen 1000  (Ethernet)
> >>>         RX packets 169232242875  bytes 229346261232279 (208.5 TiB)
> >>>         RX errors 0  dropped 0  overruns 0  frame 0
> >>>         TX packets 153491686361  bytes 203976410836881 (185.5 TiB)
> >>>         TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> >>>
> >>> Mark
> >>>
> >>>
> >>> On 09/23/2015 01:48 PM, Robert LeBlanc wrote:
> >>>>
> >>>>
> >>>> OK, here is the update on the saga...
> >>>>
> >>>> I traced some more of the blocked I/Os, and it seems that communication
> >>>> between two hosts was worse than between the others. I ran a two-way
> >>>> ping flood between the two hosts using max packet sizes (1500). After
> >>>> 1.5M packets, no lost pings. I then had the ping flood running while I
> >>>> put Ceph load on the cluster, and the dropped pings started increasing;
> >>>> after stopping the Ceph workload, the pings stopped dropping.
> >>>>
> >>>> I then ran iperf between all the nodes with the same results, so that
> >>>> ruled out Ceph to a large degree. I then booted into the
> >>>> 3.10.0-229.14.1.el7.x86_64 kernel, and with an hour of testing so far
> >>>> there haven't been any dropped pings or blocked I/O. Our 40 Gb NICs
> >>>> really need the network enhancements in the 4.x series to work well.
> >>>>
> >>>> Does this sound familiar to anyone? I'll probably start bisecting the
> >>>> kernel to see where this issue is introduced. Both of the clusters
> >>>> with this issue are running 4.x; other than that, they have pretty
> >>>> different hardware and network configs.
> >>>>
> >>>> Thanks,
> >>>> ----------------
> >>>> Robert LeBlanc
> >>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>
> >>>>
> >>>> On Tue, Sep 22, 2015 at 4:15 PM, Robert LeBlanc
> >>>> wrote:
> >>>>>
> >>>>>
> >>>>> This is IPoIB and we have the MTU set to 64K. There were some issues
> >>>>> pinging hosts with "No buffer space available" (hosts are currently
> >>>>> configured for 4GB to test SSD caching rather than page cache). I
> >>>>> found that an MTU under 32K worked reliably for ping, but we still
> >>>>> had the blocked I/O.
> >>>>>
> >>>>> I reduced the MTU to 1500 and checked pings (OK), but I'm still seeing
> >>>>> the blocked I/O.
> >>>>> ----------------
> >>>>> Robert LeBlanc
> >>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 22, 2015 at 3:52 PM, Sage Weil  wrote:
> >>>>>>
> >>>>>> On Tue, 22 Sep 2015, Samuel Just wrote:
> >>>>>>>
> >>>>>>> I looked at the logs, it looks like there was a 53 second delay
> >>>>>>> between when osd.17 started sending the osd_repop message and when
> >>>>>>> osd.13 started reading it, which is pretty weird.  Sage, didn't we
> >>>>>>> once see a kernel issue which caused some messages to be mysteriously
> >>>>>>> delayed for many 10s of seconds?
> >>>>>>
> >>>>>>
> >>>>>> Every time we have seen this behavior and diagnosed it in the wild it
> >>>>>> has
> >>>>>> been a network misconfiguration.  Usually related to jumbo frames.
> >>>>>>
> >>>>>> sage
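One standard check for the jumbo-frame misconfiguration Sage describes is to ping with the don't-fragment bit set at the largest payload the MTU allows: the payload is the MTU minus the 20-byte IPv4 header and 8-byte ICMP header. A small sketch computing those sizes:

```python
# Largest ICMP echo payload that fits a given MTU without fragmenting:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
def max_ping_payload(mtu: int) -> int:
    return mtu - 20 - 8

for mtu in (1500, 9000):
    size = max_ping_payload(mtu)
    # Verify the path end-to-end with fragmentation disallowed, e.g.:
    #   ping -M do -s <size> <peer>
    print(f"MTU {mtu}: max unfragmented ping payload {size} bytes")
```

If any switch or NIC along the path is not configured for jumbo frames, the 8972-byte ping fails while the 1472-byte one succeeds, which pinpoints this class of misconfiguration quickly.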
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> What kernel are you running?
> >>>>>>> -Sam
> >>>>>>>
> >>>>>>> On Tue, Sep 22, 2015 at 2:22 PM, Robert LeBlanc  wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> OK, looping in ceph-devel to see if I can get some more eyes. I've
> >>>>>>>> extracted what I think are important entries from the logs for the
> >>>>>>>> first blocked request. NTP is running on all the servers, so the
> >>>>>>>> logs should be close in terms of time. Logs for 12:50 to 13:00 are
> >>>>>>>> available at http://162.144.87.113/files/ceph_block_io.logs.tar.xz
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:06.500374 - osd.17 gets I/O from client
> >>>>>>>> 2015-09-22 12:55:06.557160 - osd.17 submits I/O to osd.13
> >>>>>>>> 2015-09-22 12:55:06.557305 - osd.17 submits I/O to osd.16
> >>>>>>>> 2015-09-22 12:55:06.573711 - osd.16 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:06.595716 - osd.17 gets ondisk result=0 from osd.16
> >>>>>>>> 2015-09-22 12:55:06.640631 - osd.16 reports to osd.17 ondisk result=0
> >>>>>>>> 2015-09-22 12:55:36.926691 - osd.17 reports slow I/O > 30.439150 sec
> >>>>>>>> 2015-09-22 12:55:59.790591 - osd.13 gets I/O from osd.17
> >>>>>>>> 2015-09-22 12:55:59.812405 - osd.17 gets ondisk result=0 from osd.13
> >>>>>>>> 2015-09-22 12:56:02.941602 - osd.13 reports to osd.17 ondisk result=0
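The 53-second gap is visible directly in the timestamps above; as a quick sketch, computing the delivery delay of the replica op from osd.17 to osd.13:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps taken from the timeline above.
sent     = datetime.strptime("2015-09-22 12:55:06.557160", FMT)  # osd.17 submits I/O to osd.13
received = datetime.strptime("2015-09-22 12:55:59.790591", FMT)  # osd.13 gets I/O from osd.17

gap = (received - sent).total_seconds()
print(f"osd.17 -> osd.13 delivery gap: {gap:.6f} s")  # 53.233431 s
```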
> >>>>>>>>
> >>>>>>>> In the logs I can see that osd.17 dispatches the I/O to osd.13 and
> >>>>>>>> osd.16 almost simultaneously. osd.16 seems to get the I/O right away,
> >>>>>>>> but for some reason osd.13 doesn't get the message until 53 seconds
> >>>>>>>> later. osd.17 seems happy to just wait and doesn't resend the data
> >>>>>>>> (well, I'm not 100% sure how to tell which entries are the actual
> >>>>>>>> data transfer).
> >>>>>>>>
> >>>>>>>> It looks like osd.17 is receiving responses to start the 
> >>>>>>>> communication
> >>>>>>>> with osd.13, but the op is not acknowledged until almost a minute
> >>>>>>>> later. To me it seems that the message is getting received but not
> >>>>>>>> passed to another thread right away or something. This test was done
> >>>>>>>> with an idle cluster, a single fio client (rbd engine) with a single
> >>>>>>>> thread.
> >>>>>>>>
> >>>>>>>> The OSD servers are almost 100% idle during these blocked I/O
> >>>>>>>> requests. I think I'm at the end of my troubleshooting, so I can use
> >>>>>>>> some help.
> >>>>>>>>
> >>>>>>>> Single Test started about
> >>>>>>>> 2015-09-22 12:52:36
> >>>>>>>>
> >>>>>>>> 2015-09-22 12:55:36.926680 osd.17 192.168.55.14:6800/16726 56 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.439150 secs
> >>>>>>>> 2015-09-22 12:55:36.926699 osd.17 192.168.55.14:6800/16726 57 :
> >>>>>>>> cluster [WRN] slow request 30.439150 seconds old, received at
> >>>>>>>> 2015-09-22 12:55:06.487451:
> >>>>>>>>   osd_op(client.250874.0:1388 rbd_data.3380e2ae8944a.0000000000000545
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.bbf3e8ff ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,16
> >>>>>>>> 2015-09-22 12:55:36.697904 osd.16 192.168.55.13:6800/29410 7 : cluster
> >>>>>>>> [WRN] 2 slow requests, 2 included below; oldest blocked for >
> >>>>>>>> 30.379680 secs
> >>>>>>>> 2015-09-22 12:55:36.697918 osd.16 192.168.55.13:6800/29410 8 : cluster
> >>>>>>>> [WRN] slow request 30.291520 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.406303:
> >>>>>>>>   osd_op(client.250874.0:1384 rbd_data.3380e2ae8944a.0000000000000541
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.5fb2123f ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>> 2015-09-22 12:55:36.697927 osd.16 192.168.55.13:6800/29410 9 : cluster
> >>>>>>>> [WRN] slow request 30.379680 seconds old, received at 2015-09-22
> >>>>>>>> 12:55:06.318144:
> >>>>>>>>   osd_op(client.250874.0:1382 rbd_data.3380e2ae8944a.000000000000053f
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.312e69ca ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,14
> >>>>>>>> 2015-09-22 12:58:03.998275 osd.13 192.168.55.12:6804/4574 130 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.954212 secs
> >>>>>>>> 2015-09-22 12:58:03.998286 osd.13 192.168.55.12:6804/4574 131 :
> >>>>>>>> cluster [WRN] slow request 30.954212 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.044003:
> >>>>>>>>   osd_op(client.250874.0:1873 rbd_data.3380e2ae8944a.000000000000070d
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.e69870d4 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 16,17
> >>>>>>>> 2015-09-22 12:58:03.759826 osd.16 192.168.55.13:6800/29410 10 :
> >>>>>>>> cluster [WRN] 1 slow requests, 1 included below; oldest blocked for >
> >>>>>>>> 30.704367 secs
> >>>>>>>> 2015-09-22 12:58:03.759840 osd.16 192.168.55.13:6800/29410 11 :
> >>>>>>>> cluster [WRN] slow request 30.704367 seconds old, received at
> >>>>>>>> 2015-09-22 12:57:33.055404:
> >>>>>>>>   osd_op(client.250874.0:1874 rbd_data.3380e2ae8944a.000000000000070e
> >>>>>>>> [set-alloc-hint object_size 4194304 write_size 4194304,write
> >>>>>>>> 0~4194304] 8.f7635819 ack+ondisk+write+known_if_redirected e56785)
> >>>>>>>>   currently waiting for subops from 13,17
> >>>>>>>>
> >>>>>>>> Server   IP addr              OSD
> >>>>>>>> nodev  - 192.168.55.11 - 12
> >>>>>>>> nodew  - 192.168.55.12 - 13
> >>>>>>>> nodex  - 192.168.55.13 - 16
> >>>>>>>> nodey  - 192.168.55.14 - 17
> >>>>>>>> nodez  - 192.168.55.15 - 14
> >>>>>>>> nodezz - 192.168.55.16 - 15
> >>>>>>>>
> >>>>>>>> fio job:
> >>>>>>>> [rbd-test]
> >>>>>>>> readwrite=write
> >>>>>>>> blocksize=4M
> >>>>>>>> #runtime=60
> >>>>>>>> name=rbd-test
> >>>>>>>> #readwrite=randwrite
> >>>>>>>> #bssplit=4k/85:32k/11:512/3:1m/1,4k/89:32k/10:512k/1
> >>>>>>>> #rwmixread=72
> >>>>>>>> #norandommap
> >>>>>>>> #size=1T
> >>>>>>>> #blocksize=4k
> >>>>>>>> ioengine=rbd
> >>>>>>>> rbdname=test2
> >>>>>>>> pool=rbd
> >>>>>>>> clientname=admin
> >>>>>>>> iodepth=8
> >>>>>>>> #numjobs=4
> >>>>>>>> #thread
> >>>>>>>> #group_reporting
> >>>>>>>> #time_based
> >>>>>>>> #direct=1
> >>>>>>>> #ramp_time=60
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> ----------------
> >>>>>>>> Robert LeBlanc
> >>>>>>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Sep 22, 2015 at 8:31 AM, Gregory Farnum  wrote:
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 22, 2015 at 7:24 AM, Robert LeBlanc  wrote:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Is there some way to tell in the logs that this is happening?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> You can search for the (mangled) name _split_collection
> >>>>>>>>>>
> >>>>>>>>>> I'm not
> >>>>>>>>>> seeing much I/O or CPU usage during these times. Is there some way
> >>>>>>>>>> to prevent the splitting? Is there a negative side effect to doing
> >>>>>>>>>> so?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Bump up the split and merge thresholds. You can search the list for
> >>>>>>>>> this, it was discussed not too long ago.
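For reference, the thresholds Greg refers to are the FileStore split/merge settings; a hypothetical ceph.conf fragment (the values here are illustrative only, not recommendations, and should be tuned to your object counts):

```ini
[osd]
# Larger values delay PG directory splitting (and merging) in FileStore.
# Example values only.
filestore merge threshold = 40
filestore split multiple = 8
```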
> >>>>>>>>>
> >>>>>>>>>> We've had I/O block for over 900 seconds, and as soon as the
> >>>>>>>>>> sessions are aborted, they are reestablished and complete
> >>>>>>>>>> immediately.
> >>>>>>>>>>
> >>>>>>>>>> The fio test is just a seq write; starting it over (rewriting from
> >>>>>>>>>> the beginning) is still causing the issue. I suspected that it
> >>>>>>>>>> would not have to create new files and therefore not split
> >>>>>>>>>> collections. This is on my test cluster with no other load.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Hmm, that does make it seem less likely if you're really not
> >>>>>>>>> creating new objects, i.e. if you're actually running fio in such a
> >>>>>>>>> way that it's not allocating new FS blocks (this is probably hard
> >>>>>>>>> to set up?).
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I'll be doing a lot of testing today. Which log options and depths
> >>>>>>>>>> would be the most helpful for tracking this issue down?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> If you want to go log diving, "debug osd = 20", "debug filestore =
> >>>>>>>>> 20", and "debug ms = 1" are what the OSD guys like to see. That
> >>>>>>>>> should spit out everything you need to track exactly what each Op
> >>>>>>>>> is doing.
> >>>>>>>>> -Greg
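The settings Greg lists map to a ceph.conf fragment like this (expect very large logs at these levels):

```ini
[osd]
debug osd = 20
debug filestore = 20
debug ms = 1
```

They can also be applied at runtime without restarting the OSDs, e.g. `ceph tell osd.* injectargs '--debug_osd 20 --debug_filestore 20 --debug_ms 1'`.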
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>>>>> in the body of a message to majord...@vger.kernel.org
> >>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
