We are using replica 2, and min_size is 2. A small amount of data is
still sitting around from when we were running the default of 3.
It looks like the problem started around here:
2017-06-22 14:54:29.173982 7f3c39f6f700 0 log_channel(cluster) log
[INF] : 1.2c9 deep-scrub ok
2017-06-22 14:54:29.690401 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.8 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.690423 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.10 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.690429 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.11 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.690398)
2017-06-22 14:54:29.907210 7f3c3776a700 -1 osd.13 25313 heartbeat_check:
no reply from osd.8 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:29.907221 7f3c3776a700 -1 osd.13 25313 heartbeat_check:
no reply from osd.10 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:29.907227 7f3c3776a700 -1 osd.13 25313 heartbeat_check:
no reply from osd.11 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:09.907207)
2017-06-22 14:54:30.690551 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.8 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:30.690573 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.10 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:30.690579 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.11 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:10.690548)
2017-06-22 14:54:31.690708 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.8 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:31.690729 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.10 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:31.690735 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.11 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:11.690706)
2017-06-22 14:54:32.690862 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.8 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.690884 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.10 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.690890 7f3c6e03d700 -1 osd.13 25313 heartbeat_check:
no reply from osd.11 since back 2017-06-22 14:53:13.582897 front
2017-06-22 14:53:13.582897 (cutoff 2017-06-22 14:54:12.690860)
2017-06-22 14:54:32.955768 7f3c5675c700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.3:6804/54002870 pipe(0x7f3ca7475400 sd=116 :6805 s=2
pgs=15531 cs=1 l=0 c=0x7f3c935ee700).fault with nothing to send, going
to standby
2017-06-22 14:54:32.958675 7f3c2ea0e700 0 -- 172.16.31.7:0/2128624 >>
172.16.31.3:6808/54002870 pipe(0x7f3c9c150000 sd=189 :0 s=1 pgs=0 cs=0
l=1 c=0x7f3c97726880).fault
2017-06-22 14:54:32.958712 7f3c2c3e8700 0 -- 172.16.31.7:0/2128624 >>
172.16.31.3:6810/54002870 pipe(0x7f3ca1727400 sd=233 :0 s=1 pgs=0 cs=0
l=1 c=0x7f3c9cb16300).fault
2017-06-22 14:54:34.176427 7f3c33a5e700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.3:6800/55002870 pipe(0x7f3c99679400 sd=216 :6805 s=0 pgs=0
cs=0 l=0 c=0x7f3c9532d200).accept connect_seq 0 vs existing 0 state
connecting
2017-06-22 14:54:34.545873 7f3c3ef79700 0 log_channel(cluster) log
[INF] : 2.1b5 continuing backfill to osd.30 from
(25014'10407450,25294'10411861] MIN to 25294'10411861
2017-06-22 14:54:34.546531 7f3c3e778700 0 log_channel(cluster) log
[INF] : 2.145 continuing backfill to osd.30 from
(25014'10399385,25294'10404028] MIN to 25294'10404028
2017-06-22 14:54:34.546551 7f3c43782700 0 log_channel(cluster) log
[INF] : 1.2b3 continuing backfill to osd.30 from
(24856'173854,25294'177823] MIN to 25294'177823
2017-06-22 14:54:57.873097 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c95e0b400 sd=188 :6805 s=0 pgs=0
cs=0 l=0 c=0x7f3c9fc71f80).accept we reset (peer sent cseq 1), sending
RESETSESSION
2017-06-22 14:54:57.874965 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c95e0b400 sd=188 :6805 s=2
pgs=15769 cs=1 l=0 c=0x7f3c9fc71f80).reader missed message? skipped
from seq 0 to 1739054688
2017-06-22 14:54:57.875902 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c9af11400 sd=188 :6805 s=0 pgs=0
cs=0 l=0 c=0x7f3c9fc72e80).accept we reset (peer sent cseq 2), sending
RESETSESSION
2017-06-22 14:54:57.878969 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c9af11400 sd=188 :6805 s=2
pgs=15771 cs=1 l=0 c=0x7f3c9fc72e80).reader missed message? skipped
from seq 0 to 2095419103
2017-06-22 14:54:57.880075 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c9af10000 sd=188 :6805 s=0 pgs=0
cs=0 l=0 c=0x7f3c87f3b480).accept we reset (peer sent cseq 2), sending
RESETSESSION
2017-06-22 14:54:57.880781 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c9af10000 sd=188 :6805 s=2
pgs=15772 cs=1 l=0 c=0x7f3c87f3b480).reader missed message? skipped
from seq 0 to 1022945821
2017-06-22 14:54:57.881842 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c99679400 sd=188 :6805 s=0 pgs=0
cs=0 l=0 c=0x7f3c91431a80).accept we reset (peer sent cseq 2), sending
RESETSESSION
2017-06-22 14:54:57.902933 7f3c27763700 0 -- 172.16.31.7:6805/7128624
>> 172.16.31.4:6803/57002857 pipe(0x7f3c99679400 sd=188 :6805 s=2
pgs=15773 cs=1 l=0 c=0x7f3c91431a80).fault with nothing to send, going
to standby
2017-06-22 14:56:52.538631 7f3c6e03d700 0 log_channel(cluster) log
[WRN] : 2 slow requests, 2 included below; oldest blocked for >
31.862172 secs
2017-06-22 14:56:52.538641 7f3c6e03d700 0 log_channel(cluster) log
[WRN] : slow request 31.665672 seconds old, received at 2017-06-22
14:56:20.872915: osd_op(client.365488.1:6212453 2.debad545
10009cc83a6.00000666 [read 0~4194304 [1@-1]] snapc 0=[]
ack+read+known_if_redirected e25348) currently waiting for active
2017-06-22 14:56:52.538646 7f3c6e03d700 0 log_channel(cluster) log
[WRN] : slow request 31.862172 seconds old, received at 2017-06-22
14:56:20.676415: osd_op(client.365488.1:6212450 2.f781b45
10009cc83a6.00000664 [read 0~4194304 [1@-1]] snapc 0=[]
ack+read+known_if_redirected e25348) currently waiting for active
2017-06-22 14:57:18.140672 7f3c6e03d700 0 log_channel(cluster) log
[WRN] : 3 slow requests, 1 included below; oldest blocked for >
57.464203 secs
2017-06-22 14:57:18.140683 7f3c6e03d700 0 log_channel(cluster) log
[WRN] : slow request 30.554865 seconds old, received at 2017-06-22
14:56:47.585754: osd_op(client.364255.1:1681646 2.b387afb5
1000a234aea.00000136 [write 0~4194304 [1@-1]] snapc 1=[]
ondisk+write+known_if_redirected e25351) currently waiting for active
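For reference, one way to see exactly what those slow requests on osd.13 are
stuck on (a sketch, assuming the admin socket is reachable on the node that
hosts osd.13 and the daemon still answers on it):
# ceph daemon osd.13 dump_ops_in_flight
# ceph daemon osd.13 dump_historic_ops
Each entry should list the op, its age, and the point it is stuck at (the
same "currently waiting for active" state the log above is reporting).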
On 06/23/2017 03:28 PM, David Turner wrote:
Something about it is blocking the cluster. I would first try running
this command, and if that doesn't work, then I would restart the daemon.
# ceph osd down 13
Marking it down should force it to reassert itself to the cluster
without restarting the daemon or stopping any operations it's working
on. Also, while it's down, the secondary OSDs for the PGs should be
able to handle the requests that are blocked. Check its log to see
what it's doing.
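For example (a sketch; the systemd unit name and log path assume a standard
Jewel package install, so adjust for your deployment):
# ceph osd down 13
It should reassert itself and be marked up again within a few seconds. If the
blocked ops still don't clear, restart the daemon on the node hosting osd.13
and watch its log while it rejoins:
# systemctl restart ceph-osd@13
# tail -f /var/log/ceph/ceph-osd.13.log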
You didn't answer what your size and min_size are for your 2 pools.
On Fri, Jun 23, 2017 at 3:11 PM Daniel Davidson
<dani...@igb.illinois.edu> wrote:
Thanks for the response:
[root@ceph-control ~]# ceph health detail | grep 'ops are blocked'
100 ops are blocked > 134218 sec on osd.13
[root@ceph-control ~]# ceph osd blocked-by
osd num_blocked
A problem with osd.13?
Dan
On 06/23/2017 02:03 PM, David Turner wrote:
# ceph health detail | grep 'ops are blocked'
# ceph osd blocked-by
My guess is that you have an OSD in a funky state that is
blocking the requests and the peering. Let me know what the
output of those commands is.
Also what are the replica sizes of your 2 pools? It shows that
only 1 OSD was last active for the 2 inactive PGs. Not sure yet
if that is anything of concern, but didn't want to ignore it.
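If it helps, the sizes can be pulled per pool with something like the
following, where <pool> is a placeholder for each of your two pool names:
# ceph osd lspools
# ceph osd pool get <pool> size
# ceph osd pool get <pool> min_size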
On Fri, Jun 23, 2017 at 1:16 PM Daniel Davidson
<dani...@igb.illinois.edu> wrote:
Two of our OSD systems hit 75% disk utilization, so I added another
system to try to bring that back down. The cluster was usable for a
day while the data was being migrated, but now it is not responding
when I try to mount it:
mount -t ceph ceph-0,ceph-1,ceph-2,ceph-3:6789:/ /home -o
name=admin,secretfile=/etc/ceph/admin.secret
mount error 5 = Input/output error
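(For what it's worth, the kernel client usually logs the reason for the
EIO; on the client doing the mount, something like
# dmesg | grep -iE 'ceph|libceph' | tail
should show whether it is failing against the monitors, the OSDs, or the MDS.)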
Here is our ceph health:
[root@ceph-3 ~]# ceph -s
cluster 7bffce86-9d7b-4bdf-a9c9-67670e68ca77
health HEALTH_ERR
2 pgs are stuck inactive for more than 300 seconds
58 pgs backfill_wait
20 pgs backfilling
3 pgs degraded
2 pgs stuck inactive
76 pgs stuck unclean
2 pgs undersized
100 requests are blocked > 32 sec
recovery 1197145/653713908 objects degraded (0.183%)
recovery 47420551/653713908 objects misplaced
(7.254%)
mds0: Behind on trimming (180/30)
mds0: Client biologin-0 failing to respond to
capability
release
mds0: Many clients (20) failing to respond to
cache pressure
monmap e3: 4 mons at {ceph-0=172.16.31.1:6789/0,ceph-1=172.16.31.2:6789/0,ceph-2=172.16.31.3:6789/0,ceph-3=172.16.31.4:6789/0}
election epoch 542, quorum 0,1,2,3
ceph-0,ceph-1,ceph-2,ceph-3
fsmap e17666: 1/1/1 up {0=ceph-0=up:active}, 3 up:standby
osdmap e25535: 32 osds: 32 up, 32 in; 78 remapped pgs
flags sortbitwise,require_jewel_osds
pgmap v19199544: 1536 pgs, 2 pools, 786 TB data, 299
Mobjects
1595 TB used, 1024 TB / 2619 TB avail
1197145/653713908 objects degraded (0.183%)
47420551/653713908 objects misplaced (7.254%)
1448 active+clean
58 active+remapped+wait_backfill
17 active+remapped+backfilling
10 active+clean+scrubbing+deep
2
undersized+degraded+remapped+backfilling+peered
1 active+degraded+remapped+backfilling
recovery io 906 MB/s, 331 objects/s
Checking in on the inactive PGs:
[root@ceph-control ~]# ceph health detail |grep inactive
HEALTH_ERR 2 pgs are stuck inactive for more than 300
seconds; 58 pgs
backfill_wait; 20 pgs backfilling; 3 pgs degraded; 2 pgs
stuck inactive;
78 pgs stuck unclean; 2 pgs undersized; 100 requests are
blocked > 32
sec; 1 osds have slow requests; recovery 1197145/653713908
objects
degraded (0.183%); recovery 47390082/653713908 objects misplaced
(7.249%); mds0: Behind on trimming (180/30); mds0: Client
biologin-0
failing to respond to capability release; mds0: Many clients (20)
failing to respond to cache pressure
pg 2.1b5 is stuck inactive for 77215.112164, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]
pg 2.145 is stuck inactive for 76910.328647, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]
If I query one of them, I don't get a response:
[root@ceph-control ~]# ceph pg 2.1b5 query
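(As a non-hanging check that only asks the monitors rather than the OSD,
# ceph pg map 2.1b5
# ceph pg map 2.145
at least shows the current up and acting sets for the two stuck PGs.)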
Any ideas on what to do?
Dan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com