Hi!
I did what you suggested, but my MGRs started crashing again:
root@adminnode:~# ceph -s
  cluster:
    id:     086d9f80-6249-4594-92d0-e31b6aaaaa9c
    health: HEALTH_WARN
            no active mgr
            105498/6277782 objects misplaced (1.680%)

  services:
    mon: 3 daemons, quorum mon01,mon02,mon03
    mgr: no daemons active
    osd: 44 osds: 43 up, 43 in

  data:
    pools:   4 pools, 1616 pgs
    objects: 1.88M objects, 7.07TiB
    usage:   13.2TiB used, 16.7TiB / 29.9TiB avail
    pgs:     105498/6277782 objects misplaced (1.680%)
             1606 active+clean
             8    active+remapped+backfill_wait
             2    active+remapped+backfilling

  io:
    client:   5.51MiB/s rd, 3.38MiB/s wr, 33op/s rd, 317op/s wr
    recovery: 60.3MiB/s, 15objects/s
MGR log from mon01:
-13> 2019-01-04 14:05:04.432186 7fec56a93700 4 mgr ms_dispatch active mgrdigest v1
-12> 2019-01-04 14:05:04.432194 7fec56a93700 4 mgr ms_dispatch mgrdigest v1
-11> 2019-01-04 14:05:04.822041 7fec434e1700 4 mgr[balancer] Optimize plan auto_2019-01-04_14:05:04
-10> 2019-01-04 14:05:04.822170 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/mode
 -9> 2019-01-04 14:05:04.822231 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/max_misplaced
 -8> 2019-01-04 14:05:04.822268 7fec434e1700 4 ceph_config_get max_misplaced not found
 -7> 2019-01-04 14:05:04.822444 7fec434e1700 4 mgr[balancer] Mode upmap, max misplaced 0.050000
 -6> 2019-01-04 14:05:04.822849 7fec434e1700 4 mgr[balancer] do_upmap
 -5> 2019-01-04 14:05:04.822923 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_iterations
 -4> 2019-01-04 14:05:04.822964 7fec434e1700 4 ceph_config_get upmap_max_iterations not found
 -3> 2019-01-04 14:05:04.823013 7fec434e1700 4 mgr get_config get_configkey: mgr/balancer/upmap_max_deviation
 -2> 2019-01-04 14:05:04.823048 7fec434e1700 4 ceph_config_get upmap_max_deviation not found
 -1> 2019-01-04 14:05:04.823265 7fec434e1700 4 mgr[balancer] pools ['rbd_vms_hdd', 'rbd_vms_ssd', 'rbd_vms_ssd_01', 'rbd_vms_ssd_01_ec']
  0> 2019-01-04 14:05:04.836124 7fec434e1700 -1 /build/ceph-12.2.8/src/osd/OSDMap.cc: In function 'int OSDMap::calc_pg_upmaps(CephContext*, float, int, const std::set<long int>&, OSDMap::Incremental*)' thread 7fec434e1700 time 2019-01-04 14:05:04.832885
/build/ceph-12.2.8/src/osd/OSDMap.cc: 4102: FAILED assert(target > 0)
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x558c3c0bb572]
2: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
3: (()+0x2f3020) [0x558c3bf5d020]
4: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
5: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
6: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
7: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
8: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
10: (()+0x13e370) [0x7fec5e8be370]
11: (PyObject_Call()+0x43) [0x7fec5e891273]
12: (()+0x1853ac) [0x7fec5e9053ac]
13: (PyObject_Call()+0x43) [0x7fec5e891273]
14: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
15: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
16: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
17: (()+0x76ba) [0x7fec5d74c6ba]
18: (clone()+0x6d) [0x7fec5c7b841d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log
--- end dump of recent events ---
2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal (Aborted) **
in thread 7fec434e1700 thread_name:balancer
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (()+0x4105b4) [0x558c3c07a5b4]
2: (()+0x11390) [0x7fec5d756390]
3: (gsignal()+0x38) [0x7fec5c6e6428]
4: (abort()+0x16a) [0x7fec5c6e802a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558c3c0bb6fe]
6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
7: (()+0x2f3020) [0x558c3bf5d020]
8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
14: (()+0x13e370) [0x7fec5e8be370]
15: (PyObject_Call()+0x43) [0x7fec5e891273]
16: (()+0x1853ac) [0x7fec5e9053ac]
17: (PyObject_Call()+0x43) [0x7fec5e891273]
18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
21: (()+0x76ba) [0x7fec5d74c6ba]
22: (clone()+0x6d) [0x7fec5c7b841d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
0> 2019-01-04 14:05:05.032479 7fec434e1700 -1 *** Caught signal (Aborted) **
in thread 7fec434e1700 thread_name:balancer
ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)
1: (()+0x4105b4) [0x558c3c07a5b4]
2: (()+0x11390) [0x7fec5d756390]
3: (gsignal()+0x38) [0x7fec5c6e6428]
4: (abort()+0x16a) [0x7fec5c6e802a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x28e) [0x558c3c0bb6fe]
6: (OSDMap::calc_pg_upmaps(CephContext*, float, int, std::set<long, std::less<long>, std::allocator<long> > const&, OSDMap::Incremental*)+0x2801) [0x558c3c1c0ee1]
7: (()+0x2f3020) [0x558c3bf5d020]
8: (PyEval_EvalFrameEx()+0x8a51) [0x7fec5e832971]
9: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
10: (PyEval_EvalFrameEx()+0x6ffd) [0x7fec5e830f1d]
11: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
12: (PyEval_EvalFrameEx()+0x7124) [0x7fec5e831044]
13: (PyEval_EvalCodeEx()+0x85c) [0x7fec5e96805c]
14: (()+0x13e370) [0x7fec5e8be370]
15: (PyObject_Call()+0x43) [0x7fec5e891273]
16: (()+0x1853ac) [0x7fec5e9053ac]
17: (PyObject_Call()+0x43) [0x7fec5e891273]
18: (PyObject_CallMethod()+0xf4) [0x7fec5e892444]
19: (PyModuleRunner::serve()+0x5c) [0x558c3bf5a18c]
20: (PyModuleRunner::PyModuleRunnerThread::entry()+0x1b8) [0x558c3bf5a998]
21: (()+0x76ba) [0x7fec5d74c6ba]
22: (clone()+0x6d) [0x7fec5c7b841d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mgr.mon01.ceph01.srvfarm.net.log
--- end dump of recent events ---
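
Is switching the balancer off enough to keep the mgr alive until this is fixed? A minimal sketch of what I have in mind (untested; it assumes the balancer stores its on/off state under a mgr/balancer/active config-key next to the mgr/balancer/* keys visible in the log above, since `ceph balancer off` itself needs an active mgr):

# config-key commands go through the monitors, so they work with no active mgr
ceph config-key rm mgr/balancer/active
# then restart the mgr daemons on each mgr host
systemctl restart ceph-mgr.target
# once a mgr is active again, the supported commands would be
ceph balancer off
ceph balancer status
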
Kevin
On Wed, Jan 2, 2019 at 17:35, Konstantin Shalygin <[email protected]> wrote:
>
> On a medium-sized cluster with device classes, I am experiencing a
> problem with the SSD pool:
>
> root@adminnode:~# ceph osd df | grep ssd
> ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS
> 2 ssd 0.43700 1.00000 447GiB 254GiB 193GiB 56.77 1.28 50
> 3 ssd 0.43700 1.00000 447GiB 208GiB 240GiB 46.41 1.04 58
> 4 ssd 0.43700 1.00000 447GiB 266GiB 181GiB 59.44 1.34 55
> 30 ssd 0.43660 1.00000 447GiB 222GiB 225GiB 49.68 1.12 49
> 6 ssd 0.43700 1.00000 447GiB 238GiB 209GiB 53.28 1.20 59
> 7 ssd 0.43700 1.00000 447GiB 228GiB 220GiB 50.88 1.14 56
> 8 ssd 0.43700 1.00000 447GiB 269GiB 178GiB 60.16 1.35 57
> 31 ssd 0.43660 1.00000 447GiB 231GiB 217GiB 51.58 1.16 56
> 34 ssd 0.43660 1.00000 447GiB 186GiB 261GiB 41.65 0.94 49
> 36 ssd 0.87329 1.00000 894GiB 364GiB 530GiB 40.68 0.92 91
> 37 ssd 0.87329 1.00000 894GiB 321GiB 573GiB 35.95 0.81 78
> 42 ssd 0.87329 1.00000 894GiB 375GiB 519GiB 41.91 0.94 92
> 43 ssd 0.87329 1.00000 894GiB 438GiB 456GiB 49.00 1.10 92
> 13 ssd 0.43700 1.00000 447GiB 249GiB 198GiB 55.78 1.25 72
> 14 ssd 0.43700 1.00000 447GiB 290GiB 158GiB 64.76 1.46 71
> 15 ssd 0.43700 1.00000 447GiB 368GiB 78.6GiB 82.41 1.85 78 <----
> 16 ssd 0.43700 1.00000 447GiB 253GiB 194GiB 56.66 1.27 70
> 19 ssd 0.43700 1.00000 447GiB 269GiB 178GiB 60.21 1.35 70
> 20 ssd 0.43700 1.00000 447GiB 312GiB 135GiB 69.81 1.57 77
> 21 ssd 0.43700 1.00000 447GiB 312GiB 135GiB 69.77 1.57 77
> 22 ssd 0.43700 1.00000 447GiB 269GiB 178GiB 60.10 1.35 67
> 38 ssd 0.43660 1.00000 447GiB 153GiB 295GiB 34.11 0.77 46
> 39 ssd 0.43660 1.00000 447GiB 127GiB 320GiB 28.37 0.64 38
> 40 ssd 0.87329 1.00000 894GiB 386GiB 508GiB 43.17 0.97 97
> 41 ssd 0.87329 1.00000 894GiB 375GiB 520GiB 41.88 0.94 113
>
> This leaves just 1.2TB of free space in the pool (only some GB away from
> the NEAR_FULL threshold).
> Currently, the balancer plugin is off because it used to crash the MGR
> immediately in the past (on 12.2.5).
> Since then I have upgraded to 12.2.8 but have not re-enabled the balancer.
> [I am unable to find the bug tracker ID.]
>
> Would the balancer plugin correct this situation?
> What happens if all MGRs die like they did on 12.2.5 because of the plugin?
> Will the balancer move data off the most imbalanced OSDs first?
> Otherwise an OSD may fill up beyond FULL, which would cause the whole
> pool to freeze (because the OSD with the least free space is what the
> free-space calculation is based on).
> This would be the worst case, as over 100 VMs would freeze, causing a
> lot of trouble. This is also the reason I have not tried to enable the
> balancer again.
>
> Please read this [1]; it is all about the balancer with upmap mode.
>
> It has been stable since 12.2.8 with upmap mode.
>
> k
>
> [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-December/032002.html
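
For reference, enabling the balancer per [1] boils down to the standard Luminous sequence below (a sketch from memory; double-check against the docs before relying on it):

# upmap requires every client to speak at least the luminous protocol
ceph osd set-require-min-compat-client luminous
# enable the balancer module in upmap mode
ceph balancer mode upmap
ceph balancer on
# inspect what it plans and does
ceph balancer status
ceph balancer eval
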
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com