Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-26 Thread Konstantin Shalygin

On 1/26/19 10:24 PM, Kevin Olbrich wrote:

I just had time to check again: even after removing the broken
OSD, the mgr still crashes.
All OSDs are up and in.
If I run "ceph balancer on" on a HEALTH_OK cluster, an optimization
plan is generated and started. After a few minutes all MGRs die.

This is a major problem for me, as I still have that SSD OSD that is
imbalanced and limiting the whole pool's space.


Try to run the mgr with `debug mgr = 4/5` and look at the mgr log file.
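A minimal sketch of how that could be done (the daemon name and log path are
placeholders; the admin-socket variant assumes the socket is in its default
location):

    # persistent: add to /etc/ceph/ceph.conf on the mgr hosts, then restart the mgrs
    [mgr]
        debug mgr = 4/5

    # or, without a restart, via the admin socket on the active mgr host
    ceph daemon mgr.$(hostname -s) config set debug_mgr 4/5

    # then watch the log while re-enabling the balancer
    tail -f /var/log/ceph/ceph-mgr.*.log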



k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-26 Thread Kevin Olbrich
Hi!

I just had time to check again: even after removing the broken
OSD, the mgr still crashes.
All OSDs are up and in.
If I run "ceph balancer on" on a HEALTH_OK cluster, an optimization
plan is generated and started. After a few minutes all MGRs die.

This is a major problem for me, as I still have that SSD OSD that is
imbalanced and limiting the whole pool's space.
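One way to narrow this down is to drive the balancer by hand instead of turning
automatic mode on; a rough sketch (the plan name is arbitrary):

    ceph balancer status
    ceph balancer mode upmap
    ceph balancer eval                 # score the current distribution
    ceph balancer optimize my-plan     # the step that computes the upmap plan (calc_pg_upmaps in the trace below)
    ceph balancer show my-plan         # inspect the proposed changes before applying
    ceph balancer execute my-plan

If the mgr already dies during `optimize`, the failure is in plan computation
rather than in applying the upmap entries.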


root@adminnode:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                          STATUS REWEIGHT PRI-AFF
 -1       29.91933 root default
-16       29.91933     datacenter dc01
-19       29.91933         pod dc01-agg01
-10       16.52396             rack dc01-rack02
 -4        6.29695                 host node1001
  0   hdd  0.90999                     osd.0              up      1.0     1.0
  1   hdd  0.90999                     osd.1              up      1.0     1.0
  5   hdd  0.90999                     osd.5              up      1.0     1.0
 29   hdd  0.90970                     osd.29             up      1.0     1.0
 33   hdd  0.90970                     osd.33             up      1.0     1.0
  2   ssd  0.43700                     osd.2              up      1.0     1.0
  3   ssd  0.43700                     osd.3              up      1.0     1.0
  4   ssd  0.43700                     osd.4              up      1.0     1.0
 30   ssd  0.43660                     osd.30             up      1.0     1.0
 -7        6.29724                 host node1002
  9   hdd  0.90999                     osd.9              up      1.0     1.0
 10   hdd  0.90999                     osd.10             up      1.0     1.0
 11   hdd  0.90999                     osd.11             up      1.0     1.0
 12   hdd  0.90999                     osd.12             up      1.0     1.0
 35   hdd  0.90970                     osd.35             up      1.0     1.0
  6   ssd  0.43700                     osd.6              up      1.0     1.0
  7   ssd  0.43700                     osd.7              up      1.0     1.0
  8   ssd  0.43700                     osd.8              up      1.0     1.0
 31   ssd  0.43660                     osd.31             up      1.0     1.0
-28        2.18318                 host node1005
 34   ssd  0.43660                     osd.34             up      1.0     1.0
 36   ssd  0.87329                     osd.36             up      1.0     1.0
 37   ssd  0.87329                     osd.37             up      1.0     1.0
-29        1.74658                 host node1006
 42   ssd  0.87329                     osd.42             up      1.0     1.0
 43   ssd  0.87329                     osd.43             up      1.0     1.0
-11       13.39537             rack dc01-rack03
-22        5.38794                 host node1003
 17   hdd  0.90999                     osd.17             up      1.0     1.0
 18   hdd  0.90999                     osd.18             up      1.0     1.0
 24   hdd  0.90999                     osd.24             up      1.0     1.0
 26   hdd  0.90999                     osd.26             up      1.0     1.0
 13   ssd  0.43700                     osd.13             up      1.0     1.0
 14   ssd  0.43700                     osd.14             up      1.0     1.0
 15   ssd  0.43700                     osd.15             up      1.0     1.0
 16   ssd  0.43700                     osd.16             up      1.0     1.0
-25        5.38765                 host node1004
 23   hdd  0.90999                     osd.23             up      1.0     1.0
 25   hdd  0.90999                     osd.25             up      1.0     1.0
 27   hdd  0.90999                     osd.27             up      1.0     1.0
 28   hdd  0.90970                     osd.28             up      1.0     1.0
 19   ssd  0.43700                     osd.19             up      1.0     1.0
 20   ssd  0.43700                     osd.20             up      1.0     1.0
 21   ssd  0.43700                     osd.21             up      1.0     1.0
 22   ssd  0.43700                     osd.22             up      1.0     1.0
-30        2.61978                 host node1007
 38   ssd  0.43660                     osd.38             up      1.0     1.0
 39   ssd  0.43660                     osd.39             up      1.0     1.0
 40   ssd  0.87329                     osd.40             up      1.0     1.0
 41   ssd  0.87329                     osd.41             up      1.0     1.0



root@adminnode:~# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0   hdd 0.90999  1.0  932GiB  353GiB  579GiB 37.87 0.83  95
 1   hdd 0.90999  1.0  932GiB  400GiB  531GiB 42.98 0.94 108
 5   hdd 0.90999  1.0  932GiB  267GiB  664GiB 28.70 0.63  72
29   hdd 0.90970  1.0  932GiB  356GiB  576GiB 38.19 0.84  96
33   hdd 0.90970  1.0  932GiB  344GiB  587GiB 36.94 0.81  93
 2   ssd 0.43700  1.0  447GiB  273GiB  174GiB 61.09 1.34  52
 3   ssd 0.43700  1.0  447GiB  252GiB  195GiB 56.38 1.23  61
 4   ssd 0.43700  1.0  447GiB  308GiB  140GiB 68.78 1.51  59
30   ssd 0.43660  1.0  447GiB  231GiB  216GiB 51.77 1.13  48
 9   hdd 0.90999  1.0  932GiB  358GiB  573GiB 38.48 0.84  97
10   hdd 0.90999  1.0  932GiB  347GiB  585GiB 37.25 0.82  94
11   hdd 0.90999  

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-05 Thread Konstantin Shalygin

On 1/5/19 4:17 PM, Kevin Olbrich wrote:

root@adminnode:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                          STATUS REWEIGHT PRI-AFF
 -1       30.82903 root default
-16       30.82903     datacenter dc01
-19       30.82903         pod dc01-agg01
-10       17.43365             rack dc01-rack02
 -4        7.20665                 host node1001
  0   hdd  0.90999                     osd.0              up      1.0     1.0
  1   hdd  0.90999                     osd.1              up      1.0     1.0
  5   hdd  0.90999                     osd.5              up      1.0     1.0
 29   hdd  0.90970                     osd.29             up      1.0     1.0
 32   hdd  0.90970                     osd.32           down      0       1.0
 33   hdd  0.90970                     osd.33             up      1.0     1.0
  2   ssd  0.43700                     osd.2              up      1.0     1.0
  3   ssd  0.43700                     osd.3              up      1.0     1.0
  4   ssd  0.43700                     osd.4              up      1.0     1.0
 30   ssd  0.43660                     osd.30             up      1.0     1.0
 -7        6.29724                 host node1002
  9   hdd  0.90999                     osd.9              up      1.0     1.0
 10   hdd  0.90999                     osd.10             up      1.0     1.0
 11   hdd  0.90999                     osd.11             up      1.0     1.0
 12   hdd  0.90999                     osd.12             up      1.0     1.0
 35   hdd  0.90970                     osd.35             up      1.0     1.0
  6   ssd  0.43700                     osd.6              up      1.0     1.0
  7   ssd  0.43700                     osd.7              up      1.0     1.0
  8   ssd  0.43700                     osd.8              up      1.0     1.0
 31   ssd  0.43660                     osd.31             up      1.0     1.0
-28        2.18318                 host node1005
 34   ssd  0.43660                     osd.34             up      1.0     1.0
 36   ssd  0.87329                     osd.36             up      1.0     1.0
 37   ssd  0.87329                     osd.37             up      1.0     1.0
-29        1.74658                 host node1006
 42   ssd  0.87329                     osd.42             up      1.0     1.0
 43   ssd  0.87329                     osd.43             up      1.0     1.0
-11       13.39537             rack dc01-rack03
-22        5.38794                 host node1003
 17   hdd  0.90999                     osd.17             up      1.0     1.0
 18   hdd  0.90999                     osd.18             up      1.0     1.0
 24   hdd  0.90999                     osd.24             up      1.0     1.0
 26   hdd  0.90999                     osd.26             up      1.0     1.0
 13   ssd  0.43700                     osd.13             up      1.0     1.0
 14   ssd  0.43700                     osd.14             up      1.0     1.0
 15   ssd  0.43700                     osd.15             up      1.0     1.0
 16   ssd  0.43700                     osd.16             up      1.0     1.0
-25        5.38765                 host node1004
 23   hdd  0.90999                     osd.23             up      1.0     1.0
 25   hdd  0.90999                     osd.25             up      1.0     1.0
 27   hdd  0.90999                     osd.27             up      1.0     1.0
 28   hdd  0.90970                     osd.28             up      1.0     1.0
 19   ssd  0.43700                     osd.19             up      1.0     1.0
 20   ssd  0.43700                     osd.20             up      1.0     1.0
 21   ssd  0.43700                     osd.21             up      1.0     1.0
 22   ssd  0.43700                     osd.22             up      1.0     1.0
-30        2.61978                 host node1007
 38   ssd  0.43660                     osd.38             up      1.0     1.0
 39   ssd  0.43660                     osd.39             up      1.0     1.0
 40   ssd  0.87329                     osd.40             up      1.0     1.0
 41   ssd  0.87329                     osd.41             up      1.0     1.0


root@adminnode:~# ceph osd df tree
ID  CLASS WEIGHT   REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
 -1       30.82903        - 29.9TiB 14.0TiB 16.0TiB 46.65 1.00   - root default
-16       30.82903        - 29.9TiB 14.0TiB 16.0TiB 46.65 1.00   -     datacenter dc01
-19       30.82903        - 29.9TiB 14.0TiB 16.0TiB 46.65 1.00   -         pod dc01-agg01
-10       17.43365        - 16.5TiB 7.31TiB 9.21TiB 44.26 0.95   -             rack dc01-rack02
 -4        7.20665        - 6.29TiB 2.76TiB 3.54TiB 43.83 0.94   -                 host node1001
  0   hdd  0.90999  1.0      932GiB  356GiB  575GiB 38.22 0.82  95                     osd.0
  1   hdd  0.90999  1.0      932GiB  397GiB  534GiB 42.66 0.91 106                     osd.1
  5   hdd  0.90999  1.0      932GiB  284GiB  647GiB 30.50 0.65  76                     osd.5
 29   hdd  0.90970  1.0      932GiB  366GiB  566GiB 39.29 0.84  98                     osd.29
 32   hdd  0.90970  0             0B      0B      0B     0    0

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-05 Thread Kevin Olbrich
root@adminnode:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME                          STATUS REWEIGHT PRI-AFF
 -1       30.82903 root default
-16       30.82903     datacenter dc01
-19       30.82903         pod dc01-agg01
-10       17.43365             rack dc01-rack02
 -4        7.20665                 host node1001
  0   hdd  0.90999                     osd.0              up      1.0     1.0
  1   hdd  0.90999                     osd.1              up      1.0     1.0
  5   hdd  0.90999                     osd.5              up      1.0     1.0
 29   hdd  0.90970                     osd.29             up      1.0     1.0
 32   hdd  0.90970                     osd.32           down      0       1.0
 33   hdd  0.90970                     osd.33             up      1.0     1.0
  2   ssd  0.43700                     osd.2              up      1.0     1.0
  3   ssd  0.43700                     osd.3              up      1.0     1.0
  4   ssd  0.43700                     osd.4              up      1.0     1.0
 30   ssd  0.43660                     osd.30             up      1.0     1.0
 -7        6.29724                 host node1002
  9   hdd  0.90999                     osd.9              up      1.0     1.0
 10   hdd  0.90999                     osd.10             up      1.0     1.0
 11   hdd  0.90999                     osd.11             up      1.0     1.0
 12   hdd  0.90999                     osd.12             up      1.0     1.0
 35   hdd  0.90970                     osd.35             up      1.0     1.0
  6   ssd  0.43700                     osd.6              up      1.0     1.0
  7   ssd  0.43700                     osd.7              up      1.0     1.0
  8   ssd  0.43700                     osd.8              up      1.0     1.0
 31   ssd  0.43660                     osd.31             up      1.0     1.0
-28        2.18318                 host node1005
 34   ssd  0.43660                     osd.34             up      1.0     1.0
 36   ssd  0.87329                     osd.36             up      1.0     1.0
 37   ssd  0.87329                     osd.37             up      1.0     1.0
-29        1.74658                 host node1006
 42   ssd  0.87329                     osd.42             up      1.0     1.0
 43   ssd  0.87329                     osd.43             up      1.0     1.0
-11       13.39537             rack dc01-rack03
-22        5.38794                 host node1003
 17   hdd  0.90999                     osd.17             up      1.0     1.0
 18   hdd  0.90999                     osd.18             up      1.0     1.0
 24   hdd  0.90999                     osd.24             up      1.0     1.0
 26   hdd  0.90999                     osd.26             up      1.0     1.0
 13   ssd  0.43700                     osd.13             up      1.0     1.0
 14   ssd  0.43700                     osd.14             up      1.0     1.0
 15   ssd  0.43700                     osd.15             up      1.0     1.0
 16   ssd  0.43700                     osd.16             up      1.0     1.0
-25        5.38765                 host node1004
 23   hdd  0.90999                     osd.23             up      1.0     1.0
 25   hdd  0.90999                     osd.25             up      1.0     1.0
 27   hdd  0.90999                     osd.27             up      1.0     1.0
 28   hdd  0.90970                     osd.28             up      1.0     1.0
 19   ssd  0.43700                     osd.19             up      1.0     1.0
 20   ssd  0.43700                     osd.20             up      1.0     1.0
 21   ssd  0.43700                     osd.21             up      1.0     1.0
 22   ssd  0.43700                     osd.22             up      1.0     1.0
-30        2.61978                 host node1007
 38   ssd  0.43660                     osd.38             up      1.0     1.0
 39   ssd  0.43660                     osd.39             up      1.0     1.0
 40   ssd  0.87329                     osd.40             up      1.0     1.0
 41   ssd  0.87329                     osd.41             up      1.0     1.0


root@adminnode:~# ceph osd df tree
ID  CLASS WEIGHT   REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS TYPE NAME
 -1       30.82903        - 29.9TiB 14.0TiB 16.0TiB 46.65 1.00   - root default
-16       30.82903        - 29.9TiB 14.0TiB 16.0TiB 46.65 1.00   -     datacenter dc01
-19       30.82903        - 29.9TiB 14.0TiB 16.0TiB 46.65 1.00   -         pod dc01-agg01
-10       17.43365        - 16.5TiB 7.31TiB 9.21TiB 44.26 0.95   -             rack dc01-rack02
 -4        7.20665        - 6.29TiB 2.76TiB 3.54TiB 43.83 0.94   -                 host node1001
  0   hdd  0.90999  1.0      932GiB  356GiB  575GiB 38.22 0.82  95                     osd.0
  1   hdd  0.90999  1.0      932GiB  397GiB  534GiB 42.66 0.91 106                     osd.1
  5   hdd  0.90999  1.0      932GiB  284GiB  647GiB 30.50 0.65  76                     osd.5
 29   hdd  0.90970  1.0      932GiB  366GiB  566GiB 39.29 0.84  98                     osd.29
 32   hdd  0.90970  0             0B      0B      0B     0    0   0                     osd.32
 33   hdd  0.90970  1.0      932GiB  369GiB  563GiB 39.57 0.85  99

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-04 Thread Konstantin Shalygin

On 1/5/19 1:51 AM, Kevin Olbrich wrote:

PS: Could be http://tracker.ceph.com/issues/36361
There is one HDD OSD that is out (which will not be replaced because
the SSD pool will get the images and the hdd pool will be deleted).


Please paste your `ceph osd tree` and `ceph osd df tree` output.



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-04 Thread Kevin Olbrich
PS: Could be http://tracker.ceph.com/issues/36361
There is one HDD OSD that is out (which will not be replaced because
the SSD pool will get the images and the hdd pool will be deleted).
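If the crash really is tied to that out, zero-weight OSD, one possible way to
take it out of the equation before re-enabling the balancer is to remove it
from the CRUSH map entirely. This is only a sketch, not a tested fix; osd.32 is
used below because it is the down/out OSD shown in the trees:

    ceph osd out osd.32            # already out in this cluster, listed for completeness
    systemctl stop ceph-osd@32     # on the host that carried it
    ceph osd crush remove osd.32
    ceph auth del osd.32
    ceph osd rm osd.32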

Kevin

On Fri, 4 Jan 2019 at 19:46, Kevin Olbrich wrote:
>
> Hi!
>
> I did what you wrote but my MGRs started to crash again:
> root@adminnode:~# ceph -s
>   cluster:
> id: 086d9f80-6249-4594-92d0-e31b6a9c
> health: HEALTH_WARN
> no active mgr
> 105498/6277782 objects misplaced (1.680%)
>
>   services:
> mon: 3 daemons, quorum mon01,mon02,mon03
> mgr: no daemons active
> osd: 44 osds: 43 up, 43 in
>
>   data:
> pools:   4 pools, 1616 pgs
> objects: 1.88M objects, 7.07TiB
> usage:   13.2TiB used, 16.7TiB / 29.9TiB avail
> pgs: 105498/6277782 objects misplaced (1.680%)
>  1606 active+clean
>  8    active+remapped+backfill_wait
>  2    active+remapped+backfilling
>
>   io:
> client:   5.51MiB/s rd, 3.38MiB/s wr, 33 op/s rd, 317 op/s wr
> recovery: 60.3MiB/s, 15 objects/s
>
>
> MON 1 log:
>-13> 2019-01-04 14:05:04.432186 7fec56a93700  4 mgr ms_dispatch
> active mgrdigest v1
>-12> 2019-01-04 14:05:04.432194 7fec56a93700  4 mgr ms_dispatch mgrdigest 
> v1
>-11> 2019-01-04 14:05:04.822041 7fec434e1700  4 mgr[balancer]
> Optimize plan auto_2019-01-04_14:05:04
>-10> 2019-01-04 14:05:04.822170 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/mode
> -9> 2019-01-04 14:05:04.822231 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/max_misplaced
> -8> 2019-01-04 14:05:04.822268 7fec434e1700  4 ceph_config_get
> max_misplaced not found
> -7> 2019-01-04 14:05:04.822444 7fec434e1700  4 mgr[balancer] Mode
> upmap, max misplaced 0.05
> -6> 2019-01-04 14:05:04.822849 7fec434e1700  4 mgr[balancer] do_upmap
> -5> 2019-01-04 14:05:04.822923 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/upmap_max_iterations
> -4> 2019-01-04 14:05:04.822964 7fec434e1700  4 ceph_config_get
> upmap_max_iterations not found
> -3> 2019-01-04 14:05:04.823013 7fec434e1700  4 mgr get_config
> get_configkey: mgr/balancer/upmap_max_deviation
> -2> 2019-01-04 14:05:04.823048 7fec434e1700  4 ceph_config_get
> upmap_max_deviation not found
> -1> 2019-01-04 14:05:04.823265 7fec434e1700  4 mgr[balancer] pools
> ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec']
>  0> 2019-01-04 14:05:04.836124 7fec434e1700 -1
> /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int
> OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04
> 14:05:04.832885
> /build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0)
>
>  ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x558c3c0bb572]
>  2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&,
> OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
>  3: (()+0x2f3020) [0x558c3bf5d020]
>  4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
>  5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
>  6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
>  7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
>  8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
>  9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
>  10: (()+0x13e370) [0x7fec5e8be370]
>  11: (PyObject_Call()+0x43) [0x7fec5e891273]
>  12: (()+0x1853ac) [0x7fec5e9053ac]
>  13: (PyObject_Call()+0x43) [0x7fec5e891273]
>  14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
>  15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
>  16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
>  17: (()+0x76ba) [0x7fec5d74c6ba]
>  18: (clone()+0x6d) [0x7fec5c7b841d]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> --- logging levels ---
>0/ 5 none
>0/ 1 lockdep
>0/ 1 context
>1/ 1 crush
>1/ 5 mds
>1/ 5 mds_balancer
>1/ 5 mds_locker
>1/ 5 mds_log
>1/ 5 mds_log_expire
>1/ 5 mds_migrator
>0/ 1 buffer
>0/ 1 timer
>0/ 1 filer
>0/ 1 striper
>0/ 1 objecter
>0/ 5 rados
>0/ 5 rbd
>0/ 5 rbd_mirror
>0/ 5 rbd_replay
>0/ 5 journaler
>0/ 5 objectcacher
>0/ 5 client
>1/ 5 osd
>0/ 5 optracker
>0/ 5 objclass
>1/ 3 filestore
>1/ 3 journal
>0/ 5 ms
>1/ 5 mon
>0/10 monc
>1/ 5 paxos
>0/ 5 tp
>1/ 5 auth
>1/ 5 crypto
>1/ 1 finisher
>1/ 1 reserver
>1/ 5 heartbeatmap
>1/ 5 perfcounter
>1/ 5 rgw
>1/10 civetweb
>1/ 5 javaclient
>1/ 5 asok
>1/ 1 throttle
>0/ 0 refs
>1/ 5 xio
>1/ 5 compressor
>1/ 5 bluestore
>1/ 5 bluefs
>1/ 3 bdev
>1/ 5 kstore
>4/ 5 rocksdb
>4/ 5 leveldb
>4/ 5 memdb
>1/ 5 kinetic
>

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-04 Thread Kevin Olbrich
Hi!

I did what you wrote but my MGRs started to crash again:
root@adminnode:~# ceph -s
  cluster:
id: 086d9f80-6249-4594-92d0-e31b6a9c
health: HEALTH_WARN
no active mgr
105498/6277782 objects misplaced (1.680%)

  services:
mon: 3 daemons, quorum mon01,mon02,mon03
mgr: no daemons active
osd: 44 osds: 43 up, 43 in

  data:
pools:   4 pools, 1616 pgs
objects: 1.88M objects, 7.07TiB
usage:   13.2TiB used, 16.7TiB / 29.9TiB avail
pgs: 105498/6277782 objects misplaced (1.680%)
 1606 active+clean
 8    active+remapped+backfill_wait
 2    active+remapped+backfilling

  io:
client:   5.51MiB/s rd, 3.38MiB/s wr, 33 op/s rd, 317 op/s wr
recovery: 60.3MiB/s, 15 objects/s


MON 1 log:
   -13> 2019-01-04 14:05:04.432186 7fec56a93700  4 mgr ms_dispatch
active mgrdigest v1
   -12> 2019-01-04 14:05:04.432194 7fec56a93700  4 mgr ms_dispatch mgrdigest v1
   -11> 2019-01-04 14:05:04.822041 7fec434e1700  4 mgr[balancer]
Optimize plan auto_2019-01-04_14:05:04
   -10> 2019-01-04 14:05:04.822170 7fec434e1700  4 mgr get_config
get_configkey: mgr/balancer/mode
-9> 2019-01-04 14:05:04.822231 7fec434e1700  4 mgr get_config
get_configkey: mgr/balancer/max_misplaced
-8> 2019-01-04 14:05:04.822268 7fec434e1700  4 ceph_config_get
max_misplaced not found
-7> 2019-01-04 14:05:04.822444 7fec434e1700  4 mgr[balancer] Mode
upmap, max misplaced 0.05
-6> 2019-01-04 14:05:04.822849 7fec434e1700  4 mgr[balancer] do_upmap
-5> 2019-01-04 14:05:04.822923 7fec434e1700  4 mgr get_config
get_configkey: mgr/balancer/upmap_max_iterations
-4> 2019-01-04 14:05:04.822964 7fec434e1700  4 ceph_config_get
upmap_max_iterations not found
-3> 2019-01-04 14:05:04.823013 7fec434e1700  4 mgr get_config
get_configkey: mgr/balancer/upmap_max_deviation
-2> 2019-01-04 14:05:04.823048 7fec434e1700  4 ceph_config_get
upmap_max_deviation not found
-1> 2019-01-04 14:05:04.823265 7fec434e1700  4 mgr[balancer] pools
['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec']
 0> 2019-01-04 14:05:04.836124 7fec434e1700 -1
/build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int
OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04
14:05:04.832885
/build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0)

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
luminous (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x558c3c0bb572]
 2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&,
OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
 3: (()+0x2f3020) [0x558c3bf5d020]
 4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
 5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
 7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
 9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
 10: (()+0x13e370) [0x7fec5e8be370]
 11: (PyObject_Call()+0x43) [0x7fec5e891273]
 12: (()+0x1853ac) [0x7fec5e9053ac]
 13: (PyObject_Call()+0x43) [0x7fec5e891273]
 14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
 15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
 16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
 17: (()+0x76ba) [0x7fec5d74c6ba]
 18: (clone()+0x6d) [0x7fec5c7b841d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log
--- end dump of recent events ---
2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal (Aborted) **
 in thread 7fec434e1700 thread_name:balancer

 ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
luminous (stable)
 1: (()+0x4105b4) [0x558c3c07a5b4]
 2: (()+0x11390) [0x7fec5d756390]
 3: 
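The "... not found" lines in the balancer log above only mean the module is
falling back to its defaults. In Luminous these settings live in the
config-key store and can be set explicitly if needed; the values below are
examples, not recommendations:

    ceph config-key set mgr/balancer/max_misplaced .05
    ceph config-key set mgr/balancer/upmap_max_iterations 10
    ceph config-key set mgr/balancer/upmap_max_deviation .01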

Re: [ceph-users] Usage of devices in SSD pool vary very much

2019-01-02 Thread Konstantin Shalygin

On a medium sized cluster with device-classes, I am experiencing a
problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
  2   ssd 0.43700  1.0  447GiB  254GiB  193GiB 56.77 1.28  50
  3   ssd 0.43700  1.0  447GiB  208GiB  240GiB 46.41 1.04  58
  4   ssd 0.43700  1.0  447GiB  266GiB  181GiB 59.44 1.34  55
30   ssd 0.43660  1.0  447GiB  222GiB  225GiB 49.68 1.12  49
  6   ssd 0.43700  1.0  447GiB  238GiB  209GiB 53.28 1.20  59
  7   ssd 0.43700  1.0  447GiB  228GiB  220GiB 50.88 1.14  56
  8   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.16 1.35  57
31   ssd 0.43660  1.0  447GiB  231GiB  217GiB 51.58 1.16  56
34   ssd 0.43660  1.0  447GiB  186GiB  261GiB 41.65 0.94  49
36   ssd 0.87329  1.0  894GiB  364GiB  530GiB 40.68 0.92  91
37   ssd 0.87329  1.0  894GiB  321GiB  573GiB 35.95 0.81  78
42   ssd 0.87329  1.0  894GiB  375GiB  519GiB 41.91 0.94  92
43   ssd 0.87329  1.0  894GiB  438GiB  456GiB 49.00 1.10  92
13   ssd 0.43700  1.0  447GiB  249GiB  198GiB 55.78 1.25  72
14   ssd 0.43700  1.0  447GiB  290GiB  158GiB 64.76 1.46  71
15   ssd 0.43700  1.0  447GiB  368GiB 78.6GiB 82.41 1.85  78 <
16   ssd 0.43700  1.0  447GiB  253GiB  194GiB 56.66 1.27  70
19   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.21 1.35  70
20   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.81 1.57  77
21   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.77 1.57  77
22   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.10 1.35  67
38   ssd 0.43660  1.0  447GiB  153GiB  295GiB 34.11 0.77  46
39   ssd 0.43660  1.0  447GiB  127GiB  320GiB 28.37 0.64  38
40   ssd 0.87329  1.0  894GiB  386GiB  508GiB 43.17 0.97  97
41   ssd 0.87329  1.0  894GiB  375GiB  520GiB 41.88 0.94 113

This leaves just 1.2TB of free space (only a few GB away from a NEAR_FULL pool).
Currently, the balancer plugin is off because it immediately crashed
the MGR in the past (on 12.2.5).
Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I
am unable to find the bugtracker ID]

Would the balancer plugin correct this situation?
What happens if all MGRs die because of the plugin, like they did on 12.2.5?
Will the balancer move data off the most-imbalanced OSDs first?
Otherwise an OSD may fill up beyond FULL, which would cause the
whole pool to freeze (because the OSD with the least free space is taken
into account for the free-space calculation).
This would be the worst case, as over 100 VMs would freeze, causing a lot
of trouble. This is also the reason I have not tried to enable the
balancer again.

Please read this [1]; it covers everything about the balancer in upmap mode.

It has been stable since 12.2.8 with upmap mode.
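A minimal sketch of turning it on, assuming every client in the cluster is at
least Luminous (check with `ceph features` first, since older clients do not
understand upmap entries):

    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status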



k

[1] 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/032002.html


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Usage of devices in SSD pool vary very much

2019-01-02 Thread Kevin Olbrich
Hi!

On a medium sized cluster with device-classes, I am experiencing a
problem with the SSD pool:

root@adminnode:~# ceph osd df | grep ssd
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 2   ssd 0.43700  1.0  447GiB  254GiB  193GiB 56.77 1.28  50
 3   ssd 0.43700  1.0  447GiB  208GiB  240GiB 46.41 1.04  58
 4   ssd 0.43700  1.0  447GiB  266GiB  181GiB 59.44 1.34  55
30   ssd 0.43660  1.0  447GiB  222GiB  225GiB 49.68 1.12  49
 6   ssd 0.43700  1.0  447GiB  238GiB  209GiB 53.28 1.20  59
 7   ssd 0.43700  1.0  447GiB  228GiB  220GiB 50.88 1.14  56
 8   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.16 1.35  57
31   ssd 0.43660  1.0  447GiB  231GiB  217GiB 51.58 1.16  56
34   ssd 0.43660  1.0  447GiB  186GiB  261GiB 41.65 0.94  49
36   ssd 0.87329  1.0  894GiB  364GiB  530GiB 40.68 0.92  91
37   ssd 0.87329  1.0  894GiB  321GiB  573GiB 35.95 0.81  78
42   ssd 0.87329  1.0  894GiB  375GiB  519GiB 41.91 0.94  92
43   ssd 0.87329  1.0  894GiB  438GiB  456GiB 49.00 1.10  92
13   ssd 0.43700  1.0  447GiB  249GiB  198GiB 55.78 1.25  72
14   ssd 0.43700  1.0  447GiB  290GiB  158GiB 64.76 1.46  71
15   ssd 0.43700  1.0  447GiB  368GiB 78.6GiB 82.41 1.85  78 <
16   ssd 0.43700  1.0  447GiB  253GiB  194GiB 56.66 1.27  70
19   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.21 1.35  70
20   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.81 1.57  77
21   ssd 0.43700  1.0  447GiB  312GiB  135GiB 69.77 1.57  77
22   ssd 0.43700  1.0  447GiB  269GiB  178GiB 60.10 1.35  67
38   ssd 0.43660  1.0  447GiB  153GiB  295GiB 34.11 0.77  46
39   ssd 0.43660  1.0  447GiB  127GiB  320GiB 28.37 0.64  38
40   ssd 0.87329  1.0  894GiB  386GiB  508GiB 43.17 0.97  97
41   ssd 0.87329  1.0  894GiB  375GiB  520GiB 41.88 0.94 113

This leaves just 1.2TB of free space (only a few GB away from a NEAR_FULL pool).
Currently, the balancer plugin is off because it immediately crashed
the MGR in the past (on 12.2.5).
Since then I upgraded to 12.2.8 but did not re-enable the balancer. [I
am unable to find the bugtracker ID]

Would the balancer plugin correct this situation?
What happens if all MGRs die because of the plugin, like they did on 12.2.5?
Will the balancer move data off the most-imbalanced OSDs first?
Otherwise an OSD may fill up beyond FULL, which would cause the
whole pool to freeze (because the OSD with the least free space is taken
into account for the free-space calculation).
This would be the worst case, as over 100 VMs would freeze, causing a lot
of trouble. This is also the reason I have not tried to enable the
balancer again.
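For reference, a quick way to see which OSD is limiting the pool and how close
it is to the nearfull/full thresholds (a sketch; the column number assumes the
Luminous `ceph osd df` layout shown above):

    ceph df detail                     # per-pool MAX AVAIL is driven by the fullest OSD
    ceph osd df | sort -nk8 | tail -5  # the five fullest OSDs by %USE
    ceph osd dump | grep ratio         # nearfull_ratio / backfillfull_ratio / full_ratio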

Kind regards
Kevin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com