Hi,

I don’t want to hijack this topic but the behavior described below is the same 
as what I am seeing:


[root@osd5-freu ~]# ceph daemonperf /var/run/ceph/ceph-mds.osd5-freu.asok
-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0  8.0M 3.4M|  0    0    3 |  0    0    0 |  0    0  2.8M   0 | 33   24k   5
  5  8.0M 3.4M|  0    0  230 | 24    1    0 |  0    0  2.8M   0 | 33   24k 381
  1  8.0M 3.4M|  0    0  130 | 16    0    0 |  0    0  2.8M   0 | 34   24k 366
  4  8.0M 3.4M|  0    0   63 | 96    1    2 |  0    0  2.8M   0 | 34   25k 129
30  7.9M 3.4M|  0    1  190 |479    0    3 |  0    0  2.8M  77 | 34   23k 414
  0  7.9M 3.4M|  0    0  194 | 21    0    0 |  0    0  2.8M   1 | 32   23k 132
  0  8.0M 3.4M|  0    0  181 | 27    0    0 |  0    0  2.8M   0 | 33   24k 400
  0  8.0M 3.4M|  0    0  129 | 16    0    1 |  0    0  2.8M   0 | 33   24k 109
  3  7.9M 3.4M|  0    0  110 |194    0    3 |  0    0  2.8M  56 | 32   22k 261
  0  8.0M 3.4M|  0    0  128 | 20    0    1 |  0    0  2.8M   1 | 32   22k 102
  0  8.0M 3.4M|  0    0  164 | 25    0    0 |  0    0  2.8M   1 | 32   23k 137
  1  8.0M 3.4M|  0    0  157 | 13    0    0 |  0    0  2.8M   0 | 32   23k 515
  0  8.0M 3.4M|  0    0  137 | 24    2    0 |  0    0  2.8M   0 | 32   23k 107


[root@osd5-freu ~]# ceph daemon mds.osd5-freu perf dump | grep -i 'stray\|purge'
        "num_strays": 2829348,
        "num_strays_purging": 1,
        "num_strays_delayed": 1,
        "num_purge_ops": 11,
        "strays_created": 3302679,
        "strays_purged": 463340,
        "strays_reintegrated": 9994,
        "strays_migrated": 0,


[root@osd5-freu ~]# ceph daemon mds.osd5-freu config show|grep purge
    "filer_max_purge_ops": "10",
    "mds_max_purge_files": "1000",
    "mds_max_purge_ops": "8192",
    "mds_max_purge_ops_per_pg": "2",


After upgrading from Infernalis to Jewel, I was getting "No space left on device" 
when deleting files (creating them was not a problem). After raising the default 
of mds_bal_fragment_size_max to 3000000, the problem was resolved (I was able to 
delete data again), but I am left with 2.8M strays that don’t get purged. I’m 
using 2 CephFS kernel (4.7.5) clients at the same time. While trying to resolve 
this, I stumbled upon http://tracker.ceph.com/issues/13777 . I have tried the 
following:


- Re-mounted the clients

- Flushed the journal

- Restarted the MDS servers (active/passive)

- Changed the caps for the MDS cephx user from “ceph auth caps 
mds.osd5-freu mds 'allow' mon 'allow profile mds' osd 'allow rwx'” to “ceph 
auth caps mds.osd5-freu mds 'allow' mon 'allow rwx' osd 'allow rwx'”

- Tweaked mds_max_purge_files and mds_max_purge_ops_per_pg up from their 
defaults

Unfortunately, none of this helped. The MDS strays are still at 2.8M and 
growing. We do use hardlinks on our system. Any tips on what I can do to 
resolve this?
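
For what it's worth, the back-of-the-envelope math behind raising 
mds_bal_fragment_size_max looks like this (a sketch only: the ten-way stray 
directory spread and the 100000 default are my assumptions about the Jewel MDS):

```shell
# Sketch: why ~2.8M strays trip the default fragment size limit.
# Assumptions: strays are spread roughly evenly over 10 stray dirs
# (~mds0/stray0..stray9) and mds_bal_fragment_size_max defaults to 100000.
num_strays=2829348        # "num_strays" from the perf dump above
num_stray_dirs=10
default_frag_max=100000

per_dir=$(( num_strays / num_stray_dirs ))
echo "strays per stray dir: ${per_dir}"

# Once a stray dir exceeds the fragment limit, further unlinks fail with
# ENOSPC ("No space left on device"), which matches what I saw.
if [ "${per_dir}" -ge "${default_frag_max}" ]; then
  echo "over the default limit of ${default_frag_max}"
fi
```

Raising the limit to 3000000 puts each stray dir well back under the cap, which 
is consistent with deletes starting to work again.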

Kind regards,

Davie De Smet


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mykola 
Dvornik
Sent: Tuesday, October 4, 2016 2:07 PM
To: John Spray <jsp...@redhat.com>
Cc: ceph-users <ceph-us...@ceph.com>
Subject: Re: [ceph-users] CephFS: No space left on device

To the best of my knowledge, nobody has used hardlinks within the fs.

So I have unmounted everything to see what would happen:

[root@005-s-ragnarok ragnarok]# ceph daemon mds.fast-test session ls
[]
-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0
  0   99k   0 |  0    0    0 |  0    0    0 |  0    0   27k   0 | 31   27k   0

The number of objects in stry stays the same over time.
Then I mount one of the clients and start deleting again:

ceph tell mds.fast-test injectargs --mds-max-purge-files 64 (default):
2016-10-04 13:58:13.754666 7f39e0010700  0 client.1522041 ms_handle_reset on 
XXX.XXX.XXX.XXX:6800/5261
2016-10-04 13:58:13.773739 7f39e0010700  0 client.1522042 ms_handle_reset on 
XXX.XXX.XXX.XXX:6800/5261
mds_max_purge_files = '64' (unchangeable)
-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0  100k  40k|  0    0  1.1k| 50    0   68 |  0    0   40k  46 | 35   21k 635
  0  100k  40k|  0    0  1.1k| 32    0   68 |  0    0   40k  31 | 35   22k 625
  0  101k  39k|  0    0  935 | 46    0   69 |  0    0   41k  43 | 31   22k 516
  0  101k  39k|  0    0  833 | 80    0   64 |  0    0   41k  75 | 32   23k 495
  0  101k  39k|  0    0  1.1k| 73    0   64 |  0    0   42k  73 | 33   24k 649
  0  100k  39k|  0    0  1.1k| 84    0   68 |  0    0   42k  79 | 31   22k 651
  0  100k  39k|  0    0  1.1k|100    0   67 |  0    0   42k 100 | 31   22k 695
  0  101k  33k|  0    0  1.1k| 38    0   69 |  0    0   43k  36 | 33   23k 607
  0  101k  33k|  0    0  1.1k| 72    0   68 |  0    0   44k  72 | 33   24k 668
  0  102k  33k|  0    0  1.2k| 64    0   68 |  0    0   44k  64 | 34   24k 666
  0  100k  33k|  0    0  1.0k|418    0  360 |  0    0   45k  33 | 35   25k 573
  0  100k  33k|  0    0  1.2k| 19    0  310 |  0    0   45k  19 | 36   25k 624
  0  101k  33k|  0    0  1.2k| 33    0  236 |  0    0   46k  31 | 37   26k 633
  0  102k  33k|  0    0  1.1k| 54    0  176 |  0    0   46k  54 | 37   27k 618
  0  102k  33k|  0    0  1.1k| 65    0  133 |  0    0   47k  63 | 39   27k 639
  0  100k  33k|  0    0  804 | 87    0   93 |  0    0   47k  79 | 39   28k 485
  0  100k  33k|  0    0  1.2k| 62    0   85 |  0    0   48k  62 | 40   28k 670
  0  101k  28k|  0    1  1.0k|109    0   65 |  0    0   48k 103 | 41   29k 617
  0  101k  28k|  0    0  1.1k| 92    0   65 |  0    0   49k  92 | 42   30k 690
  0  102k  28k|  0    0  1.1k| 80    0   65 |  0    0   49k  78 | 43   30k 672
  0  100k  28k|  0    0  1.0k|234    0  261 |  0    0   50k  35 | 34   24k 582
  0  100k  28k|  0    0  1.1k| 71    0  258 |  0    0   50k  71 | 35   25k 667
  0  101k  26k|  0    0  1.2k| 97    0  259 |  0    0   51k  95 | 36   26k 706
  0  102k  26k|  0    0  1.0k| 53    0  258 |  0    0   51k  53 | 37   26k 569

ceph tell mds.fast-test injectargs --mds-max-purge-files 1000:
2016-10-04 14:03:20.449961 7fd9e1012700  0 client.1522044 ms_handle_reset on  
XXX.XXX.XXX.XXX:6800/5261
2016-10-04 14:03:20.469952 7fd9e1012700  0 client.1522045 ms_handle_reset on  
XXX.XXX.XXX.XXX:6800/5261
mds_max_purge_files = '1000' (unchangeable)


-----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
  0   99k 1.0k|  0    0    0 |111    0  260 |  0    0   68k 110 | 39   29k 111
  0   99k 1.0k|  0    0    0 |198    0  260 |  0    0   68k 198 | 39   29k 198
  0   52k 1.0k|  0    0    0 |109    0  264 |  0    0   68k 102 | 39   23k 106
  0   52k 1.0k|  0    0    0 |130    0  265 |  0    0   68k 125 | 39   23k 125
  0   52k 1.0k|  0    1    0 |127    0  265 |  0    0   67k 127 | 39   23k 127
  0   52k 1.0k|  0    0    0 | 84    0  264 |  0    0   67k  84 | 39   24k  84
  0   52k 1.0k|  0    0    0 | 80    0  263 |  0    0   67k  80 | 39   24k  80
  0   52k 1.0k|  0    0    0 | 89    0  260 |  0    0   67k  87 | 32   24k  89
  0   52k 1.0k|  0    0    0 |134    0  259 |  0    0   67k 134 | 32   24k 134
  0   52k 1.0k|  0    0    0 |155    0  259 |  0    0   67k 152 | 33   24k 154
  0   52k 1.0k|  0    0    0 | 99    0  257 |  0    0   67k  99 | 33   24k  99
  0   52k 1.0k|  0    0    0 | 84    0  257 |  0    0   67k  84 | 33   24k  84
  0   52k 1.0k|  0    0    0 |117    0  257 |  0    0   67k 115 | 33   24k 115
  0   52k 1.0k|  0    0    0 |122    0  257 |  0    0   66k 122 | 33   24k 122
  0   52k 1.0k|  0    0    0 | 73    0  257 |  0    0   66k  73 | 33   24k  73
  0   52k 1.0k|  0    0    0 |123    0  257 |  0    0   66k 123 | 33   25k 123
  0   52k 1.0k|  0    0    0 | 87    0  257 |  0    0   66k  87 | 33   25k  87
  0   52k 1.0k|  0    0    0 | 85    0  257 |  0    0   66k  83 | 33   25k  83
  0   52k 1.0k|  0    0    0 | 55    0  257 |  0    0   66k  55 | 33   25k  55
  0   52k 1.0k|  0    0    0 | 34    0  257 |  0    0   66k  34 | 33   25k  34
  0   52k 1.0k|  0    0    0 | 58    0  257 |  0    0   66k  58 | 33   25k  58
  0   52k 1.0k|  0    0    0 | 35    0  257 |  0    0   66k  35 | 33   25k  35
  0   52k 1.0k|  0    0    0 | 65    0  259 |  0    0   66k  63 | 31   22k  64
  0   52k 1.0k|  0    0    0 | 52    0  258 |  0    0   66k  52 | 31   23k  52
Seems like the purge rate is virtually insensitive to mds_max_purge_files. BTW, 
the rm completed well before stry approached the ground state.
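
One possible explanation, sketched below. My (possibly wrong) reading of the 
Jewel stray purging is that in-flight purge ops are capped by the smaller of 
mds_max_purge_ops and mds_max_purge_ops_per_pg times the data pool's PG count; 
the 128-PG pool size here is a guess for illustration, not something from this 
thread:

```shell
# Hypothetical throttle math (first two values are the 'config show'
# defaults; the data pool PG count is an assumption for illustration).
mds_max_purge_ops=8192
mds_max_purge_ops_per_pg=2
data_pool_pgs=128

per_pg_limit=$(( mds_max_purge_ops_per_pg * data_pool_pgs ))
if [ "${per_pg_limit}" -lt "${mds_max_purge_ops}" ]; then
  effective=${per_pg_limit}
else
  effective=${mds_max_purge_ops}
fi
echo "effective in-flight purge op limit: ${effective}"
```

A limit of 256 would line up with the 'actv' column plateauing around 257 
above, and would explain why bumping mds_max_purge_files alone changes little; 
mds_max_purge_ops_per_pg would then be the knob to try.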
-Mykola



On 4 October 2016 at 09:16, John Spray <jsp...@redhat.com> wrote:
(Re-adding list)

The 7.5k stray dentries while idle is probably indicating that clients
are holding onto references to them (unless you unmount the clients
and they don't purge, in which case you may well have found a bug).
The other way you can end up with lots of dentries sitting in stray
dirs is if you had lots of hard links and unlinked the original
location but left the hard link in place.

The rate at which your files are purging seems to roughly correspond
to mds_max_purge_files, so I'd definitely try changing that to get
things purging faster.
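
Something like this, for example (illustrative values; substitute your own
MDS name):

```
# Bump the purge throttles at runtime (illustrative values):
ceph tell mds.fast-test injectargs '--mds_max_purge_files 2000'
ceph tell mds.fast-test injectargs '--mds_max_purge_ops_per_pg 4'

# Then watch the "stry" and "purg" columns for a change in drain rate:
ceph daemonperf mds.fast-test
```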

John

On Mon, Oct 3, 2016 at 3:21 PM, Mykola Dvornik <mykola.dvor...@gmail.com> wrote:
> Hi John,
>
> This is how the daemonperf looks like :
>
> background
>
> -----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
> rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
>   0   99k 177k|  0    0    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    0    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    0    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    5    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   1
>   0   99k 177k|  0    0    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    0    0 |  2    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    2    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    2    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    1    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    2    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    0    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    1    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>   0   99k 177k|  0    6    0 |  0    0    0 |  0    0  7.5k   0 | 31   22k   0
>
> with 4 rm instances
>
> -----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
> rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
>   0  172k 174k|  0    5  3.1k| 85    0   34 |  0    0   79k  83 | 45   31k 1.6k
>   0  174k 174k|  0    0  3.0k| 76    0   35 |  0    0   80k  72 | 48   32k 1.6k
>   0  175k 174k|  0    0  2.7k| 81    0   37 |  0    0   81k  69 | 42   28k 1.4k
>   3  175k 174k|  0    2  468 | 41    0   17 |  0    0   82k  35 | 42   28k  276
>   0  177k 174k|  0    2  2.2k|134    0   41 |  0    0   83k 118 | 44   29k 1.2k
>   0  178k 174k|  0    1  2.7k|123    0   33 |  0    0   84k 121 | 46   31k 1.5k
>   0  179k 162k|  0    2  2.6k|133    0   32 |  0    0   85k 131 | 48   32k 1.4k
>   0  181k 162k|  0    0  2.3k|113    0   36 |  0    0   86k 102 | 40   27k 1.2k
>   0  182k 162k|  0    1  2.7k| 83    0   36 |  0    0   87k  81 | 42   28k 1.4k
>   0  183k 162k|  0    6  2.6k| 22    0   35 |  0    0   89k  22 | 43   30k 1.3k
>   0  184k 162k|  0    1  2.5k|  9    0   35 |  0    0   90k   7 | 45   31k 1.2k
>   0  186k 155k|  0    3  2.5k|  2    0   36 |  0    0   91k   0 | 47   32k 1.2k
>   0  187k 155k|  0    3  1.9k| 18    0   49 |  0    0   92k   0 | 48   32k  970
>   0  188k 155k|  0    2  2.5k| 46    0   30 |  0    0   93k  32 | 48   33k 1.3k
>   0  189k 155k|  0    0  2.4k| 55    0   36 |  0    0   95k  50 | 50   34k 1.2k
>   0  190k 155k|  0    0  2.7k|  2    0   36 |  0    0   96k   0 | 52   36k 1.3k
>   0  192k 150k|  0    1  3.0k| 30    0   37 |  0    0   97k  28 | 54   37k 1.5k
>   0  183k 150k|  0    0  2.7k| 58    0   40 |  0    0   99k  50 | 56   39k 1.4k
>   0  184k 150k|  0    0  3.2k| 12    0   41 |  0    0  100k  10 | 59   40k 1.6k
>   0  185k 150k|  0    0  2.1k|  3    0   41 |  0    0  102k   0 | 60   41k 1.0k
>   0  186k 150k|  0    5  1.6k| 12    0   41 |  0    0  102k  10 | 62   42k  837
>   0  186k 148k|  0    0  1.0k| 62    0   32 |  0    0  103k  57 | 62   43k  575
>   0  170k 148k|  0    0  858 | 31    0   25 |  0    0  103k  27 | 40   27k  458
>   5  165k 148k|  0    2  865 | 77    2   28 |  0    0  104k  45 | 41   28k  495
>
> with all the rm instances killed
>
> -----mds------ --mds_server-- ---objecter--- -----mds_cache----- ---mds_log----
> rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts subm|
>   0  194k 147k|  0    0    0 | 64    0   32 |  0    0  117k  63 | 31   22k  63
>   0  194k 147k|  0    0    0 | 58    0   32 |  0    0  117k  58 | 31   22k  58
>   0  194k 147k|  0    0    0 | 49    0   32 |  0    0  117k  49 | 31   22k  50
>   0  194k 147k|  0    5    0 | 65    0   32 |  0    0  117k  65 | 31   22k  65
>   0  194k 147k|  0    0    0 | 42    0   32 |  0    0  117k  40 | 31   22k  40
>   0  194k 147k|  0    0    0 |  7    0   32 |  0    0  117k   7 | 31   22k   7
>   0  194k 147k|  0    2    0 | 23    0   32 |  0    0  117k  23 | 31   22k  23
>   0  194k 147k|  0    3    0 | 61    0   32 |  0    0  116k  61 | 31   23k  62
>   0  194k 147k|  0    0    0 | 59    0   32 |  0    0  116k  59 | 31   23k  59
>   0  194k 147k|  0    2    0 |107    0   32 |  0    0  116k 103 | 31   22k 103
>   0  194k 147k|  0    1    0 |126    0   32 |  0    0  116k 125 | 31   22k 125
>   0  194k 147k|  0    6    0 | 74    0   32 |  0    0  116k  74 | 31   22k  74
>   0  194k 147k|  0    1    0 | 37    0   32 |  0    0  116k  37 | 31   23k  37
>   0  194k 147k|  0    2    0 | 96    0   32 |  0    0  116k  96 | 31   23k  96
>   0  194k 147k|  0    2    0 |111    0   33 |  0    0  116k 110 | 31   23k 110
>   0  194k 147k|  0    3    0 |105    0   33 |  0    0  116k 105 | 31   23k 105
>   0  194k 147k|  0    1    0 | 79    0   33 |  0    0  116k  79 | 31   23k  79
>   0  194k 147k|  0    0    0 | 67    0   33 |  0    0  116k  67 | 31   23k  68
>   0  194k 147k|  0    0    0 | 75    0   33 |  0    0  116k  75 | 31   23k  75
>   0  194k 147k|  0    1    0 | 54    0   35 |  0    0  116k  51 | 31   23k  51
>   0  194k 147k|  0    0    0 | 40    0   35 |  0    0  115k  40 | 31   23k  40
>   0  194k 147k|  0    0    0 | 32    0   35 |  0    0  115k  32 | 31   23k  32
>   0  194k 147k|  0    5    0 | 43    0   35 |  0    0  115k  43 | 31   23k  43
>   0  194k 147k|  0    0    0 |  7    0   35 |  0    0  115k   7 | 31   23k   7
>
> So I guess the purge ops are extremely slow.
>
> The first question: is it OK to have 7.5k objects in stry when the cluster is
> idle for a while?
>
> The second question: who is to blame for the slow purges, i.e. MDS or OSDs?
>
> Regards,
>
> -Mykola
>
>
> On 2 October 2016 at 23:48, Mykola Dvornik <mykola.dvor...@gmail.com> wrote:
>>
>> Hi John,
>>
>> Many thanks for your reply. I will try to play with the MDS tunables and
>> report back to you ASAP.
>>
>> So far I see that the mds log contains a lot of errors of the following kind:
>>
>> 2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
>> _fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
>> ~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728 1=1+0)
>> (iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07 23:06:29.776298
>>
>> 2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log [ERR]
>> : loaded dup inode 10005729a77 [2,head] v68621 at
>> /users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/j002654.out/m_xrange192-320_yrange192-320_016232.dump,
>> but inode 10005729a77.head v67464942 already exists at
>> ~mds0/stray1/10005729a77
>>
>> Those folders within mds.0.cache.dir that got badness report a size of
>> 16EB on the clients. rm on them fails with 'Directory not empty'.
>>
>> As for the "Client failing to respond to cache pressure", I have 2 kernel
>> clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always running the most
>> recent release version of ceph-fuse. The funny thing is that every single
>> client misbehaves from time to time. I am aware of quite some discussion
>> about this issue on the ML, but cannot really figure out how to debug it.
>>
>> Regards,
>>
>> -Mykola
>>
>> On 2 October 2016 at 22:27, John Spray <jsp...@redhat.com> wrote:
>>>
>>> On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik <mykola.dvor...@gmail.com> wrote:
>>> > After upgrading to 10.2.3 we frequently see messages like
>>>
>>> From which version did you upgrade?
>>>
>>> > 'rm: cannot remove '...': No space left on device
>>> >
>>> > The folders we are trying to delete contain approx. 50K files 193 KB
>>> > each.
>>>
>>> My guess would be that you are hitting the new
>>> mds_bal_fragment_size_max check.  This limits the number of entries
>>> that the MDS will create in a single directory fragment, to avoid
>>> overwhelming the OSD with oversized objects.  It is 100000 by default.
>>> This limit also applies to "stray" directories where unlinked files
>>> are put while they wait to be purged, so you could get into this state
>>> while doing lots of deletions.  There are ten stray directories that
>>> get a roughly even share of files, so if you have more than about one
>>> million files waiting to be purged, you could see this condition.
>>>
>>> The "Client failing to respond to cache pressure" messages may play a
>>> part here -- if you have misbehaving clients then they may cause the
>>> MDS to delay purging stray files, leading to a backlog.  If your
>>> clients are by any chance older kernel clients, you should upgrade
>>> them.  You can also unmount/remount them to clear this state, although
>>> it will reoccur until the clients are updated (or until the bug is
>>> fixed, if you're running latest clients already).
>>>
>>> The high level counters for strays are part of the default output of
>>> "ceph daemonperf mds.<id>" when run on the MDS server (the "stry" and
>>> "purg" columns).  You can look at these to watch how fast the MDS is
>>> clearing out strays.  If your backlog is just because it's not doing
>>> it fast enough, then you can look at tuning mds_max_purge_files and
>>> mds_max_purge_ops to adjust the throttles on purging.  Those settings
>>> can be adjusted without restarting the MDS using the "injectargs"
>>> command
>>> (http://docs.ceph.com/docs/master/rados/operations/control/#mds-subsystem)
>>>
>>> Let us know how you get on.
>>>
>>> John
>>>
>>>
>>> > The cluster state and storage available are both OK:
>>> >
>>> >     cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
>>> >      health HEALTH_WARN
>>> >             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
>>> > pressure
>>> >             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
>>> > pressure
>>> >             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
>>> > pressure
>>> >             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
>>> > pressure
>>> >             mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
>>> > pressure
>>> >      monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
>>> >             election epoch 11, quorum 0 000-s-ragnarok
>>> >       fsmap e62643: 1/1/1 up {0=000-s-ragnarok=up:active}
>>> >      osdmap e20203: 16 osds: 16 up, 16 in
>>> >             flags sortbitwise
>>> >       pgmap v15284654: 1088 pgs, 2 pools, 11263 GB data, 40801 kobjects
>>> >             23048 GB used, 6745 GB / 29793 GB avail
>>> >                 1085 active+clean
>>> >                    2 active+clean+scrubbing
>>> >                    1 active+clean+scrubbing+deep
>>> >
>>> >
>>> > Has anybody experienced this issue so far?
>>> >
>>> > Regards,
>>> > --
>>> >  Mykola
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users@lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>
>>
>>
>> --
>>  Mykola
>
>
>
>
> --
>  Mykola



--
 Mykola