The first and hopefully easy one:

I have two pools that are rarely used (a third will come into use once I get 
through these issues), but they need to be available at the whims of our cloud 
team. Is there a way to turn off the ‘2 pools have many more objects per pg 
than average’ warning?

So far I have played with ‘mon_pg_warn_max_object_skew’, but that didn’t clear 
the warning. After googling and going through the docs, nothing stood out as a 
way to resolve this.

Technical info:

[[email protected] ~] # ceph health detail
HEALTH_WARN 2 pools have many more objects per pg than average
MANY_OBJECTS_PER_PG 2 pools have many more objects per pg than average
    pool images objects per pg (480) is more than 60 times cluster average (8)
    pool metrics objects per pg (336) is more than 42 times cluster average (8)
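For context on the two pools, the raw object and PG counts can be pulled like 
this (pool names taken from the health output above):

# ceph df detail
# ceph osd pool get images pg_num
# ceph osd pool get metrics pg_num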

The second:

I’m seeing OSDs randomly getting marked down, but they then log that they are 
still running. This issue wasn’t present before the upgrade. This is a 
multipath setup, but the paths appear healthy and the cluster isn’t really 
being utilized at the moment. Please let me know if you want more information; 
the relevant logs are below.
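(On the multipath point: I’m judging path health from device-mapper-multipath’s 
own status output; I can post the full listing if it would help.)

# multipath -ll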

Ceph.log:

2018-01-25 14:56:29.011831 mon.mon01 mon.0 10.20.57.10:6789/0 823 : cluster 
[INF] osd.12 marked down after no beacon for 300.775605 seconds
2018-01-25 14:56:29.013280 mon.mon01 mon.0 10.20.57.10:6789/0 824 : cluster 
[WRN] Health check failed: 1 osds down (OSD_DOWN)
2018-01-25 14:56:32.034002 mon.mon01 mon.0 10.20.57.10:6789/0 830 : cluster 
[INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2018-01-25 14:56:31.322228 osd.12 osd.12 10.20.57.14:6804/4163 1 : cluster 
[WRN] Monitor daemon marked osd.12 down, but it is still running

Ceph-osd.12.log:

2018-01-25 14:56:00.606493 7facfde03700  4 rocksdb: (Original Log Time 
2018/01/25-14:56:00.602100) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:360]
 [default] Level-0 commit table #213 started
2018-01-25 14:56:00.606498 7facfde03700  4 rocksdb: (Original Log Time 
2018/01/25-14:56:00.606406) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/memtable_list.cc:383]
 [default] Level-0 commit table #213: memtable #1 done
2018-01-25 14:56:00.606517 7facfde03700  4 rocksdb: (Original Log Time 
2018/01/25-14:56:00.606437) EVENT_LOG_v1 {"time_micros": 1516917360606429, 
"job": 29, "event": "flush_finished", "lsm_state": [2, 1, 1, 0, 0, 0, 0], 
"immutable_memtables": 0}
2018-01-25 14:56:00.606529 7facfde03700  4 rocksdb: (Original Log Time 
2018/01/25-14:56:00.606466) 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_compaction_flush.cc:132]
 [default] Level summary: base level 1 max bytes base 268435456 files[2 1 1 0 0 
0 0] max score 0.50

2018-01-25 14:56:00.606538 7facfde03700  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.2/rpm/el7/BUILD/ceph-12.2.2/src/rocksdb/db/db_impl_files.cc:388]
 [JOB 29] Try to delete WAL files size 252104127, prev total WAL file size 
253684537, number of live WAL files 2.

2018-01-25 14:56:31.322223 7fad1262c700  0 log_channel(cluster) log [WRN] : 
Monitor daemon marked osd.12 down, but it is still running
2018-01-25 14:56:31.322233 7fad1262c700  0 log_channel(cluster) log [DBG] : map 
e18531 wrongly marked me down at e18530
2018-01-25 14:56:31.322236 7fad1262c700  1 osd.12 18531 
start_waiting_for_healthy
2018-01-25 14:56:31.327816 7fad0c620700  1 osd.12 pg_epoch: 18530 pg[14.8f( v 
18432'17 (0'0,18432'17] local-lis/les=18521/18522 n=1 ec=18405/18405 lis/c 
18521/18521 les/c/f 18522/18522/0 18530/18530/18530) [3,19] r=-1 lpr=18530 
pi=[18521,18530)/1 luod=0'0 crt=18432'17 lcod 0'0 active] 
start_peering_interval up [12,3,19] -> [3,19], acting [12,3,19] -> [3,19], 
acting_primary 12 -> 3, up_primary 12 -> 3, role 0 -> -1, features acting 
2305244844532236283 upacting 2305244844532236283
2018-01-25 14:56:31.327851 7fad0be1f700  1 osd.12 pg_epoch: 18530 pg[14.9e( 
empty local-lis/les=18522/18523 n=0 ec=18405/18405 lis/c 18522/18522 les/c/f 
18523/18523/0 18530/18530/18530) [15,10] r=-1 lpr=18530 pi=[18522,18530)/1 
crt=0'0 active] start_peering_interval up [12,15,10] -> [15,10], acting 
[12,15,10] -> [15,10], acting_primary 12 -> 15, up_primary 12 -> 15, role 0 -> 
-1, features acting 2305244844532236283 upacting 2305244844532236283
2018-01-25 14:56:31.327918 7fad0c620700  1 osd.12 pg_epoch: 18531 pg[14.8f( v 
18432'17 (0'0,18432'17] local-lis/les=18521/18522 n=1 ec=18405/18405 lis/c 
18521/18521 les/c/f 18522/18522/0 18530/18530/18530) [3,19] r=-1 lpr=18530 
pi=[18521,18530)/1 crt=18432'17 lcod 0'0 unknown NOTIFY] state<Start>: 
transitioning to Stray

Ceph osd tree:

[[email protected] ceph-conf] # ceph osd tree
ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
-1       31.99658 root default
-2        7.99915     host osd01
 0   ssd  0.99989         osd.0      up  1.00000 1.00000
 1   ssd  0.99989         osd.1      up  1.00000 1.00000
 5   ssd  0.99989         osd.5      up  1.00000 1.00000
 6   ssd  0.99989         osd.6      up  1.00000 1.00000
 7   ssd  0.99989         osd.7      up  1.00000 1.00000
11   ssd  0.99989         osd.11     up  1.00000 1.00000
20   ssd  0.99989         osd.20     up  1.00000 1.00000
22   ssd  0.99989         osd.22     up  1.00000 1.00000
-3        7.99915     host osd02
12   ssd  0.99989         osd.12     up  1.00000 1.00000
18   ssd  0.99989         osd.18     up  1.00000 1.00000
23   ssd  0.99989         osd.23     up  1.00000 1.00000
26   ssd  0.99989         osd.26     up  1.00000 1.00000
27   ssd  0.99989         osd.27     up  1.00000 1.00000
28   ssd  0.99989         osd.28     up  1.00000 1.00000
29   ssd  0.99989         osd.29     up  1.00000 1.00000
30   ssd  0.99989         osd.30     up  1.00000 1.00000
-4        7.99915     host osd03
13   ssd  0.99989         osd.13     up  1.00000 1.00000
15   ssd  0.99989         osd.15     up  1.00000 1.00000
16   ssd  0.99989         osd.16     up  1.00000 1.00000
17   ssd  0.99989         osd.17     up  1.00000 1.00000
19   ssd  0.99989         osd.19     up  1.00000 1.00000
21   ssd  0.99989         osd.21     up  1.00000 1.00000
24   ssd  0.99989         osd.24     up  1.00000 1.00000
25   ssd  0.99989         osd.25     up  1.00000 1.00000
-5        7.99915     host osd04
 2   ssd  0.99989         osd.2      up  1.00000 1.00000
 3   ssd  0.99989         osd.3      up  1.00000 1.00000
 4   ssd  0.99989         osd.4      up  1.00000 1.00000
 8   ssd  0.99989         osd.8      up  1.00000 1.00000
 9   ssd  0.99989         osd.9      up  1.00000 1.00000
10   ssd  0.99989         osd.10     up  1.00000 1.00000
14   ssd  0.99989         osd.14     up  1.00000 1.00000
31   ssd  0.99989         osd.31     up  1.00000 1.00000

Mon settings for down:

[[email protected] ceph-conf] # ceph --admin-daemon 
/var/run/ceph/ceph-mon.mon01.asok config show | grep -i down
    "mds_mon_shutdown_timeout": "5.000000",
    "mds_shutdown_check": "0",
    "mon_osd_adjust_down_out_interval": "true",
    "mon_osd_down_out_interval": "30",
    "mon_osd_down_out_subtree_limit": "rack",
    "mon_osd_min_down_reporters": "2",
    "mon_pg_check_down_all_threshold": "0.500000",
    "mon_warn_on_osd_down_out_interval_zero": "true",
    "osd_backoff_on_down": "true",
    "osd_debug_shutdown": "false",
    "osd_journal_flush_on_shutdown": "true",
    "osd_max_markdown_count": "5",
    "osd_max_markdown_period": "600",
    "osd_mon_shutdown_timeout": "5.000000",
    "osd_shutdown_pgref_assert": "false",


