[ceph-users] PG Down+Incomplete but without a blocker

2016-11-19 Thread Bruno Silva
I have a lot of PGs stuck in down+incomplete and incomplete, but pg query
doesn't show where the failure is.

ceph health detail
HEALTH_WARN clock skew detected on mon.3; 3 pgs down; 6 pgs incomplete; 6
pgs stuck inactive; 6 pgs stuck unclean; 17 requests are blocked > 32 sec;
3 osds have slow requests; Monitor clock skew detected
pg 0.3 is stuck inactive since forever, current state down+incomplete, last
acting [5,4,8]
pg 0.38 is stuck inactive for 308757.882019, current state incomplete, last
acting [1,4,8]
pg 0.43 is stuck inactive for 308590.063291, current state incomplete, last
acting [2,1,4]
pg 0.78 is stuck inactive since forever, current state down+incomplete,
last acting [6,4,3]
pg 0.27 is stuck inactive for 308606.854986, current state down+incomplete,
last acting [2,7,5]
pg 0.67 is stuck inactive for 308606.854992, current state incomplete, last
acting [2,1,3]
pg 0.3 is stuck unclean since forever, current state down+incomplete, last
acting [5,4,8]
pg 0.38 is stuck unclean for 308757.882075, current state incomplete, last
acting [1,4,8]
pg 0.43 is stuck unclean for 308590.063345, current state incomplete, last
acting [2,1,4]
pg 0.78 is stuck unclean since forever, current state down+incomplete, last
acting [6,4,3]
pg 0.27 is stuck unclean for 308991.817516, current state down+incomplete,
last acting [2,7,5]
pg 0.67 is stuck unclean for 308991.817523, current state incomplete, last
acting [2,1,3]
pg 0.27 is down+incomplete, acting [2,7,5]
pg 0.3 is down+incomplete, acting [5,4,8]
pg 0.78 is down+incomplete, acting [6,4,3]
pg 0.67 is incomplete, acting [2,1,3]
pg 0.43 is incomplete, acting [2,1,4]
pg 0.38 is incomplete, acting [1,4,8]
3 ops are blocked > 2097.15 sec
14 ops are blocked > 131.072 sec
1 ops are blocked > 2097.15 sec on osd.1
1 ops are blocked > 2097.15 sec on osd.2
14 ops are blocked > 131.072 sec on osd.2
1 ops are blocked > 2097.15 sec on osd.6
3 osds have slow requests
mon.3 addr 172.20.20.13:6789/0 clock skew 0.0559069s > max 0.05s (latency
0.00118267s)
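
A minimal sketch of the usual follow-up checks for the clock skew and the blocked
requests (it assumes ntpd or chrony on the monitor hosts, and that the admin-socket
commands are run on the host that owns each OSD):

# confirm the monitor clocks actually agree (run on each mon host)
ntpq -p                  # or: chronyc sources
ceph time-sync-status    # Jewel and later

# inspect the requests blocked on a slow OSD (osd.2 here)
ceph daemon osd.2 dump_ops_in_flight
ceph daemon osd.2 dump_historic_ops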

#ceph pg 0.27 query
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8"
],
"down_osds_we_would_probe": [],
"peering_blocked_by": []
},


#ceph pg 0.3 query


 {
"first": 3318,
"last": 3320,
"maybe_went_rw": 1,
"up": [
5,
4
],
"acting": [
5,
4
],
"primary": 5,
"up_primary": 5
}
],
"probing_osds": [
"1",
"2",
"3",
"4",
"5",
"6",
"7",
"8"
],
"down_osds_we_would_probe": [],
"peering_blocked_by": []
},
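
Both queries end with empty down_osds_we_would_probe and peering_blocked_by, so
peering is not waiting on a dead OSD. A sketch of the usual next checks (the pool
name rbd for pool 0 and the min_size check are assumptions; adjust to your pools):

# list every stuck PG with its state and acting set
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean

# incomplete often means fewer complete copies than min_size survived
ceph osd pool get rbd size
ceph osd pool get rbd min_size

# capture the full peering history of one PG for closer inspection
ceph pg 0.27 query > pg-0.27.json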

What can I do to solve this?
By the way, the clocks are synchronized.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG down incomplete

2013-05-17 Thread John Wilkins
If you can follow the documentation here:
http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/  and
http://ceph.com/docs/master/rados/troubleshooting/  to provide some
additional information, we may be better able to help you.

For example, ceph osd tree would help us understand the status of
your cluster a bit better.
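
Roughly, the read-only commands we usually ask for (a sketch; nothing here changes
cluster state):

ceph -s                    # overall cluster status
ceph health detail         # which PGs and OSDs are affected
ceph osd tree              # OSD hierarchy, up/down state, weights
ceph osd dump | grep pool  # pool sizes and replica counts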

On Thu, May 16, 2013 at 10:32 PM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
 On Wednesday, May 15, 2013 at 00:15 +0200, Olivier Bonvalet wrote:
 Hi,

 I have some PGs in state down and/or incomplete on my cluster, because I
 lost 2 OSDs and a pool had only 2 replicas. So of course that
 data is lost.

 My problem now is that I can't retrieve a HEALTH_OK status: if I try
 to remove, read or overwrite the corresponding RBD images, nearly all OSDs
 hang (well... they don't do anything and requests pile up in a growing
 queue, until production effectively goes down).

 So, what can I do to remove those corrupt images?



 Bump. Can nobody help me with this problem?

 Thanks,

 Olivier




-- 
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599
http://inktank.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG down incomplete

2013-05-17 Thread Olivier Bonvalet
Hi,

thanks for your answer. In fact I have several different problems, which
I tried to solve separately:

1) I lost 2 OSDs, and some pools have only 2 replicas. So some data was
lost.
2) One monitor refuses the Cuttlefish upgrade, so I only have 4 of 5
monitors running.
3) I have 4 old inconsistent PGs that I can't repair (the usual repair
commands are sketched just below).
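
For reference, the repair attempts for inconsistent PGs (point 3) normally look
something like the sketch below; the PG id is a placeholder since the affected PGs
aren't listed here, and on releases of this vintage pg repair simply trusts the
primary's copy:

# find the inconsistent PGs behind the scrub errors
ceph health detail | grep inconsistent

# deep-scrub the PG again, then ask the primary to repair its replicas
ceph pg deep-scrub 3.1a
ceph pg repair 3.1a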


So the status :

   health HEALTH_ERR 15 pgs incomplete; 4 pgs inconsistent; 15 pgs stuck inactive; 15 pgs stuck unclean; 1 near full osd(s); 19 scrub errors; noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
   monmap e7: 5 mons at {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0}, election epoch 2584, quorum 0,1,2,3 a,b,c,e
   osdmap e82502: 50 osds: 48 up, 48 in
   pgmap v12807617: 7824 pgs: 7803 active+clean, 1 active+clean+scrubbing, 15 incomplete, 4 active+clean+inconsistent, 1 active+clean+scrubbing+deep; 5676 GB data, 18948 GB used, 18315 GB / 37263 GB avail; 137KB/s rd, 1852KB/s wr, 199op/s
   mdsmap e1: 0/0/1 up



The tree :

# id    weight  type name               up/down reweight
-8      14.26   root SSDroot
-27     8         datacenter SSDrbx2
-26     8           room SSDs25
-25     8             net SSD188-165-12
-24     8               rack SSD25B09
-23     8                 host lyll
46      2                   osd.46      up      1
47      2                   osd.47      up      1
48      2                   osd.48      up      1
49      2                   osd.49      up      1
-10     4.26      datacenter SSDrbx3
-12     2           room SSDs43
-13     2             net SSD178-33-122
-16     2               rack SSD43S01
-17     2                 host kaino
42      1                   osd.42      up      1
43      1                   osd.43      up      1
-22     2.26        room SSDs45
-21     2.26          net SSD5-135-138
-20     2.26            rack SSD45F01
-19     2.26              host taman
44      1.13                osd.44      up      1
45      1.13                osd.45      up      1
-9      2         datacenter SSDrbx4
-11     2           room SSDs52
-14     2             net SSD176-31-226
-15     2               rack SSD52B09
-18     2                 host dragan
40      1                   osd.40      up      1
41      1                   osd.41      up      1
-1      33.43   root SASroot
-100    15.9      datacenter SASrbx1
-90     15.9        room SASs15
-72     15.9          net SAS188-165-15
-40     8               rack SAS15B01
-3      8                 host brontes
0       1                   osd.0       up      1
1       1                   osd.1       up      1
2       1                   osd.2       up      1
3       1                   osd.3       up      1
4       1                   osd.4       up      1
5       1                   osd.5       up      1
6       1                   osd.6       up      1
7       1                   osd.7       up      1
-41     7.9             rack SAS15B02
-6      7.9               host alim
24      1                   osd.24      up      1
25      1                   osd.25      down    0
26      1                   osd.26      up      1
27      1                   osd.27      up      1
28      1                   osd.28      up      1
29      1                   osd.29      up      1
30      1                   osd.30      up      1
31      0.9                 osd.31      up      1
-101    17.53     datacenter SASrbx2
-91     17.53

Re: [ceph-users] PG down incomplete

2013-05-17 Thread John Wilkins
It looks like you have the noout flag set:

noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
   monmap e7: 5 mons at
{a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0},
election epoch 2584, quorum 0,1,2,3 a,b,c,e
   osdmap e82502: 50 osds: 48 up, 48 in

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing

If you have down OSDs that don't get marked out, that would certainly
cause problems. Have you tried restarting the failed OSDs?

What do the logs look like for osd.15 and osd.25?
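
For reference, checking and restarting a failed OSD looks roughly like this (a sketch
assuming the default log path and a sysvinit-style init on a Cuttlefish-era install;
adjust the OSD id and service commands to your setup):

# look at why the daemon stopped or keeps crashing
tail -n 200 /var/log/ceph/ceph-osd.25.log

# try to bring it back up (sysvinit style)
service ceph start osd.25

# watch whether it rejoins and stays up
ceph osd tree
ceph -w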

On Fri, May 17, 2013 at 1:31 AM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
 [snip]

Re: [ceph-users] PG down incomplete

2013-05-17 Thread John Wilkins
Another thing... since your osd.10 is near full, your cluster may be
fairly close to capacity for the purposes of rebalancing.  Have a look
at:

http://ceph.com/docs/master/rados/configuration/mon-config-ref/#storage-capacity
http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#no-free-drive-space

Maybe we can get some others to look at this.  It's not clear to me
why the other OSD crashes after you take osd.25 out. It could be
capacity, but that shouldn't make it crash. Have you tried adding more
OSDs to increase capacity?
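
To see how close the cluster really is to the full thresholds, something like the
following helps (a sketch; the mon admin-socket path is the default one and mon.a is
taken from the monmap above):

# per-pool and overall usage
rados df

# filesystem usage under each OSD's data directory (run on the OSD hosts)
df -h /var/lib/ceph/osd/*

# the ratios the monitors use for the near-full / full warnings
ceph --admin-daemon /var/run/ceph/ceph-mon.a.asok config show | grep full_ratio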



On Fri, May 17, 2013 at 11:27 AM, John Wilkins john.wilk...@inktank.com wrote:
 [snip]

Re: [ceph-users] PG down incomplete

2013-05-17 Thread Olivier Bonvalet
Yes, I set the noout flag to avoid the automatic rebalancing of osd.25's
data, which crashes all the OSDs of that host (I already tried several times).
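
For context, the flag is toggled like this; it is worth unsetting it once the flapping
host is dealt with, otherwise the data held by the down OSD never gets re-replicated:

ceph osd set noout     # keep down OSDs from being marked out (no rebalancing)
ceph osd unset noout   # allow the cluster to rebalance again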

On Friday, May 17, 2013 at 11:27 -0700, John Wilkins wrote:
 It looks like you have the noout flag set:
 
 noout flag(s) set; 1 mons down, quorum 0,1,2,3 a,b,c,e
monmap e7: 5 mons at
 {a=10.0.0.1:6789/0,b=10.0.0.2:6789/0,c=10.0.0.5:6789/0,d=10.0.0.6:6789/0,e=10.0.0.3:6789/0},
 election epoch 2584, quorum 0,1,2,3 a,b,c,e
osdmap e82502: 50 osds: 48 up, 48 in
 
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing
 
 If you have down OSDs that don't get marked out, that would certainly
 cause problems. Have you tried restarting the failed OSDs?
 
 What do the logs look like for osd.15 and osd.25?
 
  [snip]

Re: [ceph-users] PG down incomplete

2013-05-16 Thread Olivier Bonvalet
On Wednesday, May 15, 2013 at 00:15 +0200, Olivier Bonvalet wrote:
 Hi,
 
 I have some PGs in state down and/or incomplete on my cluster, because I
 lost 2 OSDs and a pool had only 2 replicas. So of course that
 data is lost.

 My problem now is that I can't retrieve a HEALTH_OK status: if I try
 to remove, read or overwrite the corresponding RBD images, nearly all OSDs
 hang (well... they don't do anything and requests pile up in a growing
 queue, until production effectively goes down).

 So, what can I do to remove those corrupt images?
 
 

Bump. Can nobody help me with this problem?

Thanks,

Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG down incomplete

2013-05-14 Thread Olivier Bonvalet
Hi,

I have some PGs in state down and/or incomplete on my cluster, because I
lost 2 OSDs and a pool had only 2 replicas. So of course that
data is lost.

My problem now is that I can't retrieve a HEALTH_OK status: if I try
to remove, read or overwrite the corresponding RBD images, nearly all OSDs
hang (well... they don't do anything and requests pile up in a growing
queue, until production effectively goes down).

So, what can I do to remove those corrupt images?
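
For what it's worth, when the surviving copies really are gone, the usual (and
destructive) sequence on releases of that era looks roughly like the sketch below;
these commands discard the missing data permanently, the OSD and PG ids are
placeholders, and the troubleshooting docs should be read first:

# tell the cluster a dead OSD will never come back
ceph osd lost 25 --yes-i-really-mean-it

# if a PG still cannot peer afterwards, it can be recreated empty
# (whatever it held is permanently lost)
ceph pg force_create_pg 3.2f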

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com