Re: [ceph-users] recovery process stops

2014-10-25 Thread Harald Rößler
Does anyone have an idea how to resolve this situation?
Thanks for any advice.

Kind Regards
Harald Rößler


 On 23.10.2014 at 18:56, Harald Rößler harald.roess...@btd.de wrote:

 @Wido: sorry, I don't understand 100% what you mean; I have generated some output
 which may help.


 Ok the pool:

 pool 3 'bcf' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins 
 pg_num 832 pgp_num 832 last_change 8000 owner 0


 all remapped PGs have a pg_temp entry:

 pg_temp 3.1 [14,20,0]
 pg_temp 3.c [1,7,23]
 pg_temp 3.22 [15,21,23]



 3.22429 0   2   0   1654296576  0   0   
 active+remapped 2014-10-23 03:25:03.180505  8608'363836897  
 8608'377970131  [15,21] [15,21,23]  3578'354650024  2014-10-16 
 04:06:39.133104  3578'354650024  2014-10-16 04:06:39.133104

 the crush rules.

 # rules
 rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
 }
 rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
 }
 rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
 }


 ceph pg 3.22 query




 { state: active+remapped,
  epoch: 8608,
  up: [
15,
21],
  acting: [
15,
21,
23],
  info: { pgid: 3.22,
  last_update: 8608'363845313,
  last_complete: 8608'363845313,
  log_tail: 8608'363842312,
  last_backfill: MAX,
  purged_snaps: [1~1,3~3,8~6,f~31,42~1,44~3,48~f,58~1,5a~2],
  history: { epoch_created: 140,
  last_epoch_started: 8576,
  last_epoch_clean: 8576,
  last_epoch_split: 0,
  same_up_since: 8340,
  same_interval_since: 8575,
  same_primary_since: 7446,
  last_scrub: 3578'354650024,
  last_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_deep_scrub: 3578'354650024,
  last_deep_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_clean_scrub_stamp: 2014-10-16 04:06:39.133104},
  stats: { version: 8608'363845313,
  reported: 8608'377978685,
  state: active+remapped,
  last_fresh: 2014-10-23 18:55:07.582844,
  last_change: 2014-10-23 03:25:03.180505,
  last_active: 2014-10-23 18:55:07.582844,
  last_clean: 2014-10-20 07:51:21.330669,
  last_became_active: 2013-07-14 07:20:30.173508,
  last_unstale: 2014-10-23 18:55:07.582844,
  mapping_epoch: 8370,
  log_start: 8608'363842312,
  ondisk_log_start: 8608'363842312,
  created: 140,
  last_epoch_clean: 8576,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 3578'354650024,
  last_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_deep_scrub: 3578'354650024,
  last_deep_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_clean_scrub_stamp: 2014-10-16 04:06:39.133104,
  log_size: 0,
  ondisk_log_size: 0,
  stats_invalid: 0,
  stat_sum: { num_bytes: 1654296576,
  num_objects: 429,
  num_object_clones: 28,
  num_object_copies: 0,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
  num_objects_unfound: 0,
  num_read: 8053865,
  num_read_kb: 124022900,
  num_write: 363844886,
  num_write_kb: 2083536824,
  num_scrub_errors: 0,
  num_shallow_scrub_errors: 0,
  num_deep_scrub_errors: 0,
  num_objects_recovered: 2777,
  num_bytes_recovered: 11138282496,
  num_keys_recovered: 0},
  stat_cat_sum: {},
  up: [
15,
21],
  acting: [
15,
21,
23]},
  empty: 0,
  dne: 0,
  incomplete: 0,
  last_epoch_started: 8576},
  recovery_state: [
{ name: Started\/Primary\/Active,
  enter_time: 2014-10-23 03:25:03.179759,
  might_have_unfound: [],
  recovery_progress: { backfill_target: -1,
  waiting_on_backfill: 0,
  backfill_pos: 0\/\/0\/\/-1,
  backfill_info: { begin: 0\/\/0\/\/-1,
  end: 0\/\/0\/\/-1,
  objects: []},
  peer_backfill_info: { begin: 0\/\/0\/\/-1,
  end: 0\/\/0\/\/-1,
  objects: []},
  backfills_in_flight: [],
  pull_from_peer: [],
  pushing: []},
  scrub: { scrubber.epoch_start: 0,
  scrubber.active: 0,
  scrubber.block_writes: 0,
  scrubber.finalizing: 0,
  scrubber.waiting_on: 0,
  

Re: [ceph-users] recovery process stops

2014-10-23 Thread Harald Rößler
Hi all

the procedure does not work for me; I still have 47 active+remapped PGs. Does anyone 
have an idea how to fix this issue?
@Wido: now my cluster has a usage of less than 80% - thanks for your advice.

Harry


On 21.10.2014 at 22:38, Craig Lewis cle...@centraldesktop.com wrote:

In that case, take a look at ceph pg dump | grep remapped.  In the up or active 
column, there should be one or two common OSDs between the stuck PGs.

Try restarting those OSD daemons.  I've had a few OSDs get stuck scheduling 
recovery, particularly around toofull situations.

I've also had Robert's experience of stuck operations becoming unstuck over 
night.


On Tue, Oct 21, 2014 at 12:02 PM, Harald Rößler 
harald.roess...@btd.demailto:harald.roess...@btd.de wrote:
After more than 10 hours the same situation, I don’t think it will fix self 
over time. How I can find out what is the problem.


Am 21.10.2014 um 17:28 schrieb Craig Lewis 
cle...@centraldesktop.commailto:cle...@centraldesktop.com:

That will fix itself over time.  remapped just means that Ceph is moving the 
data around.  It's normal to see PGs in the remapped and/or backfilling state 
after OSD restarts.

They should go down steadily over time.  How long depends on how much data is 
in the PGs, how fast your hardware is, how many OSDs are affected, and how much 
you allow recovery to impact cluster performance.  Mine currently take about 20 
minutes per PG.  If all 47 are on the same OSD, it'll be a while.  If they're 
evenly split between multiple OSDs, parallelism will speed that up.

On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler 
harald.roess...@btd.demailto:harald.roess...@btd.de wrote:
Hi all,

thank you for your support, now the file system is not degraded any more. Now I 
have a minus degrading :-)

2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 
active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 GB 
avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)

but ceph reports me a health HEALTH_WARN 47 pgs stuck unclean; recovery 
-1638/1329293 degraded (-0.123%)

I think this warning is reported because there are 47 active+remapped objects, 
some ideas how to fix that now?

Kind Regards
Harald Roessler


Am 21.10.2014 um 01:03 schrieb Craig Lewis 
cle...@centraldesktop.commailto:cle...@centraldesktop.com:

I've been in a state where reweight-by-utilization was deadlocked (not the 
daemons, but the remap scheduling).  After successive osd reweight commands, 
two OSDs wanted to swap PGs, but they were both toofull.  I ended up 
temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the 
impediment, and everything finished remapping.  Everything went smoothly, and I 
changed it back when all the remapping finished.

Just be careful if you need to get close to mon_osd_full_ratio.  Ceph does 
greater-than on these percentages, not greater-than-equal.  You really don't 
want the disks to get greater-than mon_osd_full_ratio, because all external IO 
will stop until you resolve that.
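
(For reference, a minimal sketch of how such a temporary bump might be applied and
later reverted on the releases of that era. Whether to use ceph pg set_nearfull_ratio
or an injectargs of mon_osd_nearfull_ratio depends on the version, so treat this as
an assumption and verify first.)

# Raise the near-full warning ratio to 87% (pre-Luminous style):
ceph pg set_nearfull_ratio 0.87

# ...let the remapping finish, then restore the default:
ceph pg set_nearfull_ratio 0.85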


On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master 
keks...@gmail.commailto:keks...@gmail.com wrote:
You can set a lower weight on the full OSDs, or try changing the osd_near_full_ratio 
parameter in your cluster from 85 to, for example, 89. But I don't know what can 
go wrong when you do that.


2014-10-20 17:12 GMT+02:00 Wido den Hollander 
w...@42on.commailto:w...@42on.com:
On 10/20/2014 05:10 PM, Harald Rößler wrote:
 yes, tomorrow I will get the replacement for the failed disk; getting a new 
 node with many disks will take a few days.
 No other idea?


If the disks are all full, then, no.

Sorry to say this, but it came down to poor capacity management. Never
let any disk in your cluster fill over 80% to prevent these situations.

Wido

 Harald Rößler


 Am 20.10.2014 um 16:45 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSD which was near full, after that I tried to fix the 
 problem with ceph osd reweight-by-utilization, but this does not help. 
 After that I set the near full ratio to 88% with the idea that the 
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At 
 the same time I had a hardware failure of on disk. :-(. After that failure 
 the recovery process start at degraded ~ 13%“ and stops at 7%.
 Honestly I am scared in the moment I am doing the wrong operation.


 Any chance of adding a new node with some fresh disks? Seems like you
 are operating on the storage capacity limit of the nodes and that your
 only remedy would be adding more spindles.

 Wido

 Regards
 Harald Rößler



 Am 20.10.2014 um 14:51 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All

 I have in them moment a issue with my cluster. The recovery process stops.


 See this: 2 active+degraded+remapped+backfill_toofull

 156 pgs 

Re: [ceph-users] recovery process stops

2014-10-23 Thread Wido den Hollander
On 10/23/2014 05:33 PM, Harald Rößler wrote:
 Hi all
 
 the procedure does not work for me, have still 47 active+remapped pg. Anyone 
 have an idea how to fix this issue.

If you look at those PGs using ceph pg dump, what is their prefix?

They should start with a number, and that number corresponds back to a
pool ID, which you can see with ceph osd dump | grep pool.

Could it be that that specific pool is using a special crush rule?
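
(A short sketch of that check: grep the stuck PGs' prefix out of the dump, map it
to a pool, and dump the rules. The pool ID and rule spellings below follow this
thread's output; verify the command forms against your release.)

# Which PGs are stuck, and what pool prefix do they carry?
ceph pg dump | grep remapped | awk '{print $1}'   # e.g. 3.22 -> prefix "3"

# Which pool has ID 3, and which crush_ruleset does it use?
ceph osd dump | grep "^pool 3 "

# Inspect the CRUSH rules themselves:
ceph osd crush rule dump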

Wido

 @Wido: now my cluster have a usage less than 80% - thanks for your advice.
 
 Harry
 
 
 Am 21.10.2014 um 22:38 schrieb Craig Lewis 
 cle...@centraldesktop.commailto:cle...@centraldesktop.com:
 
 In that case, take a look at ceph pg dump | grep remapped.  In the up or 
 active column, there should be one or two common OSDs between the stuck PGs.
 
 Try restarting those OSD daemons.  I've had a few OSDs get stuck scheduling 
 recovery, particularly around toofull situations.
 
 I've also had Robert's experience of stuck operations becoming unstuck over 
 night.
 
 
 On Tue, Oct 21, 2014 at 12:02 PM, Harald Rößler 
 harald.roess...@btd.demailto:harald.roess...@btd.de wrote:
 After more than 10 hours the same situation, I don’t think it will fix self 
 over time. How I can find out what is the problem.
 
 
 Am 21.10.2014 um 17:28 schrieb Craig Lewis 
 cle...@centraldesktop.commailto:cle...@centraldesktop.com:
 
 That will fix itself over time.  remapped just means that Ceph is moving the 
 data around.  It's normal to see PGs in the remapped and/or backfilling state 
 after OSD restarts.
 
 They should go down steadily over time.  How long depends on how much data is 
 in the PGs, how fast your hardware is, how many OSDs are affected, and how 
 much you allow recovery to impact cluster performance.  Mine currently take 
 about 20 minutes per PG.  If all 47 are on the same OSD, it'll be a while.  
 If they're evenly split between multiple OSDs, parallelism will speed that up.
 
 On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler 
 harald.roess...@btd.demailto:harald.roess...@btd.de wrote:
 Hi all,
 
 thank you for your support, now the file system is not degraded any more. Now 
 I have a minus degrading :-)
 
 2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 
 active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 
 GB avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)
 
 but ceph reports me a health HEALTH_WARN 47 pgs stuck unclean; recovery 
 -1638/1329293 degraded (-0.123%)
 
 I think this warning is reported because there are 47 active+remapped 
 objects, some ideas how to fix that now?
 
 Kind Regards
 Harald Roessler
 
 
 Am 21.10.2014 um 01:03 schrieb Craig Lewis 
 cle...@centraldesktop.commailto:cle...@centraldesktop.com:
 
 I've been in a state where reweight-by-utilization was deadlocked (not the 
 daemons, but the remap scheduling).  After successive osd reweight commands, 
 two OSDs wanted to swap PGs, but they were both toofull.  I ended up 
 temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the 
 impediment, and everything finished remapping.  Everything went smoothly, and 
 I changed it back when all the remapping finished.
 
 Just be careful if you need to get close to mon_osd_full_ratio.  Ceph does 
 greater-than on these percentages, not greater-than-equal.  You really don't 
 want the disks to get greater-than mon_osd_full_ratio, because all external 
 IO will stop until you resolve that.
 
 
 On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master 
 keks...@gmail.commailto:keks...@gmail.com wrote:
 You can set lower weight on full osds, or try changing the 
 osd_near_full_ratio parameter in your cluster from 85 to for example 89. But 
 i don't know what can go wrong when you do that.
 
 
 2014-10-20 17:12 GMT+02:00 Wido den Hollander 
 w...@42on.commailto:w...@42on.com:
 On 10/20/2014 05:10 PM, Harald Rößler wrote:
 yes, tomorrow I will get the replacement of the failed disk, to get a new 
 node with many disk will take a few days.
 No other idea?

 
 If the disks are all full, then, no.
 
 Sorry to say this, but it came down to poor capacity management. Never
 let any disk in your cluster fill over 80% to prevent these situations.
 
 Wido
 
 Harald Rößler


 Am 20.10.2014 um 16:45 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSD which was near full, after that I tried to fix the 
 problem with ceph osd reweight-by-utilization, but this does not help. 
 After that I set the near full ratio to 88% with the idea that the 
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At 
 the same time I had a hardware failure of on disk. :-(. After that failure 
 the recovery process start at degraded ~ 13%“ and stops at 7%.
 Honestly I am scared in the moment I am doing the wrong operation.


 Any chance of adding a new node with some fresh disks? Seems like you
 are operating on the 

Re: [ceph-users] recovery process stops

2014-10-23 Thread Harald Rößler
@Wido: sorry, I don't understand 100% what you mean; I have generated some output which
may help.


Ok the pool:

pool 3 'bcf' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 
832 pgp_num 832 last_change 8000 owner 0


all remapped PGs have a pg_temp entry:

pg_temp 3.1 [14,20,0]
pg_temp 3.c [1,7,23]
pg_temp 3.22 [15,21,23]



3.22429 0   2   0   1654296576  0   0   
active+remapped 2014-10-23 03:25:03.180505  8608'363836897  8608'377970131  
[15,21] [15,21,23]  3578'354650024  2014-10-16 04:06:39.133104  
3578'354650024  2014-10-16 04:06:39.133104

the crush rules.

# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
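
(As an aside, a hedged sketch of how one could test offline whether ruleset 0 can
actually return three distinct OSDs for every input of this pool; the file names
are only illustrative.)

# Export and decompile the current CRUSH map:
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Map inputs with 3 replicas through ruleset 0 and report any for which
# crushtool cannot find 3 OSDs:
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-bad-mappings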


ceph pg 3.22 query




{ state: active+remapped,
  epoch: 8608,
  up: [
15,
21],
  acting: [
15,
21,
23],
  info: { pgid: 3.22,
  last_update: 8608'363845313,
  last_complete: 8608'363845313,
  log_tail: 8608'363842312,
  last_backfill: MAX,
  purged_snaps: [1~1,3~3,8~6,f~31,42~1,44~3,48~f,58~1,5a~2],
  history: { epoch_created: 140,
  last_epoch_started: 8576,
  last_epoch_clean: 8576,
  last_epoch_split: 0,
  same_up_since: 8340,
  same_interval_since: 8575,
  same_primary_since: 7446,
  last_scrub: 3578'354650024,
  last_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_deep_scrub: 3578'354650024,
  last_deep_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_clean_scrub_stamp: 2014-10-16 04:06:39.133104},
  stats: { version: 8608'363845313,
  reported: 8608'377978685,
  state: active+remapped,
  last_fresh: 2014-10-23 18:55:07.582844,
  last_change: 2014-10-23 03:25:03.180505,
  last_active: 2014-10-23 18:55:07.582844,
  last_clean: 2014-10-20 07:51:21.330669,
  last_became_active: 2013-07-14 07:20:30.173508,
  last_unstale: 2014-10-23 18:55:07.582844,
  mapping_epoch: 8370,
  log_start: 8608'363842312,
  ondisk_log_start: 8608'363842312,
  created: 140,
  last_epoch_clean: 8576,
  parent: 0.0,
  parent_split_bits: 0,
  last_scrub: 3578'354650024,
  last_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_deep_scrub: 3578'354650024,
  last_deep_scrub_stamp: 2014-10-16 04:06:39.133104,
  last_clean_scrub_stamp: 2014-10-16 04:06:39.133104,
  log_size: 0,
  ondisk_log_size: 0,
  stats_invalid: 0,
  stat_sum: { num_bytes: 1654296576,
  num_objects: 429,
  num_object_clones: 28,
  num_object_copies: 0,
  num_objects_missing_on_primary: 0,
  num_objects_degraded: 0,
  num_objects_unfound: 0,
  num_read: 8053865,
  num_read_kb: 124022900,
  num_write: 363844886,
  num_write_kb: 2083536824,
  num_scrub_errors: 0,
  num_shallow_scrub_errors: 0,
  num_deep_scrub_errors: 0,
  num_objects_recovered: 2777,
  num_bytes_recovered: 11138282496,
  num_keys_recovered: 0},
  stat_cat_sum: {},
  up: [
15,
21],
  acting: [
15,
21,
23]},
  empty: 0,
  dne: 0,
  incomplete: 0,
  last_epoch_started: 8576},
  recovery_state: [
{ name: Started\/Primary\/Active,
  enter_time: 2014-10-23 03:25:03.179759,
  might_have_unfound: [],
  recovery_progress: { backfill_target: -1,
  waiting_on_backfill: 0,
  backfill_pos: 0\/\/0\/\/-1,
  backfill_info: { begin: 0\/\/0\/\/-1,
  end: 0\/\/0\/\/-1,
  objects: []},
  peer_backfill_info: { begin: 0\/\/0\/\/-1,
  end: 0\/\/0\/\/-1,
  objects: []},
  backfills_in_flight: [],
  pull_from_peer: [],
  pushing: []},
  scrub: { scrubber.epoch_start: 0,
  scrubber.active: 0,
  scrubber.block_writes: 0,
  scrubber.finalizing: 0,
  scrubber.waiting_on: 0,
  scrubber.waiting_on_whom: []}},
{ name: Started,
  enter_time: 2014-10-23 03:25:02.174216}]}
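
(The same up/acting discrepancy, CRUSH placing only [15,21] while the acting set
keeps [15,21,23] via the pg_temp entry above, can be seen more compactly with a
single command; the output line is paraphrased.)

ceph pg map 3.22
# osdmap e8608 pg 3.22 (3.22) -> up [15,21] acting [15,21,23]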


 On 23.10.2014 at 17:36, Wido den Hollander w...@42on.com wrote:

 On 10/23/2014 05:33 PM, 

Re: [ceph-users] recovery process stops

2014-10-21 Thread Harald Rößler
Hi all,

thank you for your support; the file system is not degraded any more. Now I 
have negative degradation :-)

2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 
active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 GB 
avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)

but ceph reports a health of HEALTH_WARN 47 pgs stuck unclean; recovery 
-1638/1329293 degraded (-0.123%)

I think this warning is reported because there are 47 active+remapped PGs; 
any ideas how to fix that now?

Kind Regards
Harald Roessler


On 21.10.2014 at 01:03, Craig Lewis cle...@centraldesktop.com wrote:

I've been in a state where reweight-by-utilization was deadlocked (not the 
daemons, but the remap scheduling).  After successive osd reweight commands, 
two OSDs wanted to swap PGs, but they were both toofull.  I ended up 
temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the 
impediment, and everything finished remapping.  Everything went smoothly, and I 
changed it back when all the remapping finished.

Just be careful if you need to get close to mon_osd_full_ratio.  Ceph does 
greater-than on these percentages, not greater-than-equal.  You really don't 
want the disks to get greater-than mon_osd_full_ratio, because all external IO 
will stop until you resolve that.


On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master 
keks...@gmail.commailto:keks...@gmail.com wrote:
You can set lower weight on full osds, or try changing the osd_near_full_ratio 
parameter in your cluster from 85 to for example 89. But i don't know what can 
go wrong when you do that.


2014-10-20 17:12 GMT+02:00 Wido den Hollander 
w...@42on.commailto:w...@42on.com:
On 10/20/2014 05:10 PM, Harald Rößler wrote:
 yes, tomorrow I will get the replacement of the failed disk, to get a new 
 node with many disk will take a few days.
 No other idea?


If the disks are all full, then, no.

Sorry to say this, but it came down to poor capacity management. Never
let any disk in your cluster fill over 80% to prevent these situations.

Wido

 Harald Rößler


 Am 20.10.2014 um 16:45 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSD which was near full, after that I tried to fix the 
 problem with ceph osd reweight-by-utilization, but this does not help. 
 After that I set the near full ratio to 88% with the idea that the 
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At 
 the same time I had a hardware failure of on disk. :-(. After that failure 
 the recovery process start at degraded ~ 13%“ and stops at 7%.
 Honestly I am scared in the moment I am doing the wrong operation.


 Any chance of adding a new node with some fresh disks? Seems like you
 are operating on the storage capacity limit of the nodes and that your
 only remedy would be adding more spindles.

 Wido

 Regards
 Harald Rößler



 Am 20.10.2014 um 14:51 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All

 I have in them moment a issue with my cluster. The recovery process stops.


 See this: 2 active+degraded+remapped+backfill_toofull

 156 pgs backfill_toofull

 You have one or more OSDs which are to full and that causes recovery to
 stop.

 If you add more capacity to the cluster recovery will continue and finish.

 ceph -s
  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck 
 unclean; recovery 111487/1488290 degraded (7.491%)
  monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0http://10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0},
  election epoch 332, quorum 0,1,2 0,12,6
  osdmap e6748: 24 osds: 23 up, 23 in
   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 
 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 
 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 
 degraded (7.491%)


 I have tried to restart all OSD in the cluster, but does not help to 
 finish the recovery of the cluster.

 Have someone any idea

 Kind Regards
 Harald Rößler



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.commailto:ceph-users@lists.ceph.com
 

Re: [ceph-users] recovery process stops

2014-10-21 Thread Craig Lewis
That will fix itself over time.  remapped just means that Ceph is moving
the data around.  It's normal to see PGs in the remapped and/or backfilling
state after OSD restarts.

They should go down steadily over time.  How long depends on how much data
is in the PGs, how fast your hardware is, how many OSDs are affected, and
how much you allow recovery to impact cluster performance.  Mine currently
take about 20 minutes per PG.  If all 47 are on the same OSD, it'll be a
while.  If they're evenly split between multiple OSDs, parallelism will
speed that up.

On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler harald.roess...@btd.de
wrote:

 Hi all,

 thank you for your support, now the file system is not degraded any more.
 Now I have a minus degrading :-)

 2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281
 active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB /
 6178 GB avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded
 (-0.123%)

 but ceph reports me a health HEALTH_WARN 47 pgs stuck unclean; recovery
 -1638/1329293 degraded (-0.123%)

 I think this warning is reported because there are 47 active+remapped
 objects, some ideas how to fix that now?

 Kind Regards
 Harald Roessler


 Am 21.10.2014 um 01:03 schrieb Craig Lewis cle...@centraldesktop.com:

 I've been in a state where reweight-by-utilization was deadlocked (not the
 daemons, but the remap scheduling).  After successive osd reweight
 commands, two OSDs wanted to swap PGs, but they were both toofull.  I ended
 up temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the
 impediment, and everything finished remapping.  Everything went smoothly,
 and I changed it back when all the remapping finished.

 Just be careful if you need to get close to mon_osd_full_ratio.  Ceph does
 greater-than on these percentages, not greater-than-equal.  You really
 don't want the disks to get greater-than mon_osd_full_ratio, because all
 external IO will stop until you resolve that.


 On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master keks...@gmail.com wrote:

 You can set lower weight on full osds, or try changing the
 osd_near_full_ratio parameter in your cluster from 85 to for example 89.
 But i don't know what can go wrong when you do that.


 2014-10-20 17:12 GMT+02:00 Wido den Hollander w...@42on.com:

 On 10/20/2014 05:10 PM, Harald Rößler wrote:
  yes, tomorrow I will get the replacement of the failed disk, to get a
 new node with many disk will take a few days.
  No other idea?
 

 If the disks are all full, then, no.

 Sorry to say this, but it came down to poor capacity management. Never
 let any disk in your cluster fill over 80% to prevent these situations.

 Wido

  Harald Rößler
 
 
  Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 04:43 PM, Harald Rößler wrote:
  Yes, I had some OSD which was near full, after that I tried to fix
 the problem with ceph osd reweight-by-utilization, but this does not
 help. After that I set the near full ratio to 88% with the idea that the
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At
 the same time I had a hardware failure of on disk. :-(. After that failure
 the recovery process start at degraded ~ 13%“ and stops at 7%.
  Honestly I am scared in the moment I am doing the wrong operation.
 
 
  Any chance of adding a new node with some fresh disks? Seems like you
  are operating on the storage capacity limit of the nodes and that your
  only remedy would be adding more spindles.
 
  Wido
 
  Regards
  Harald Rößler
 
 
 
  Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 02:45 PM, Harald Rößler wrote:
  Dear All
 
  I have in them moment a issue with my cluster. The recovery
 process stops.
 
 
  See this: 2 active+degraded+remapped+backfill_toofull
 
  156 pgs backfill_toofull
 
  You have one or more OSDs which are to full and that causes
 recovery to
  stop.
 
  If you add more capacity to the cluster recovery will continue and
 finish.
 
  ceph -s
   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4
 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck
 unclean; recovery 111487/1488290 degraded (7.491%)
   monmap e2: 3 mons at {0=
 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0},
 election epoch 332, quorum 0,1,2 0,12,6
   osdmap e6748: 24 osds: 23 up, 23 in
pgmap v43314672: 3328 pgs: 3031 active+clean, 43
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19
 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped,
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2
 

Re: [ceph-users] recovery process stops

2014-10-21 Thread Harald Rößler
After more than 10 hours it is still the same situation; I don't think it will fix 
itself over time. How can I find out what the problem is?


On 21.10.2014 at 17:28, Craig Lewis cle...@centraldesktop.com wrote:

That will fix itself over time.  remapped just means that Ceph is moving the 
data around.  It's normal to see PGs in the remapped and/or backfilling state 
after OSD restarts.

They should go down steadily over time.  How long depends on how much data is 
in the PGs, how fast your hardware is, how many OSDs are affected, and how much 
you allow recovery to impact cluster performance.  Mine currently take about 20 
minutes per PG.  If all 47 are on the same OSD, it'll be a while.  If they're 
evenly split between multiple OSDs, parallelism will speed that up.

On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler 
harald.roess...@btd.demailto:harald.roess...@btd.de wrote:
Hi all,

thank you for your support, now the file system is not degraded any more. Now I 
have a minus degrading :-)

2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 
active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 GB 
avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)

but ceph reports me a health HEALTH_WARN 47 pgs stuck unclean; recovery 
-1638/1329293 degraded (-0.123%)

I think this warning is reported because there are 47 active+remapped objects, 
some ideas how to fix that now?

Kind Regards
Harald Roessler


Am 21.10.2014 um 01:03 schrieb Craig Lewis 
cle...@centraldesktop.commailto:cle...@centraldesktop.com:

I've been in a state where reweight-by-utilization was deadlocked (not the 
daemons, but the remap scheduling).  After successive osd reweight commands, 
two OSDs wanted to swap PGs, but they were both toofull.  I ended up 
temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the 
impediment, and everything finished remapping.  Everything went smoothly, and I 
changed it back when all the remapping finished.

Just be careful if you need to get close to mon_osd_full_ratio.  Ceph does 
greater-than on these percentages, not greater-than-equal.  You really don't 
want the disks to get greater-than mon_osd_full_ratio, because all external IO 
will stop until you resolve that.


On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master 
keks...@gmail.commailto:keks...@gmail.com wrote:
You can set lower weight on full osds, or try changing the osd_near_full_ratio 
parameter in your cluster from 85 to for example 89. But i don't know what can 
go wrong when you do that.


2014-10-20 17:12 GMT+02:00 Wido den Hollander 
w...@42on.commailto:w...@42on.com:
On 10/20/2014 05:10 PM, Harald Rößler wrote:
 yes, tomorrow I will get the replacement of the failed disk, to get a new 
 node with many disk will take a few days.
 No other idea?


If the disks are all full, then, no.

Sorry to say this, but it came down to poor capacity management. Never
let any disk in your cluster fill over 80% to prevent these situations.

Wido

 Harald Rößler


 Am 20.10.2014 um 16:45 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSD which was near full, after that I tried to fix the 
 problem with ceph osd reweight-by-utilization, but this does not help. 
 After that I set the near full ratio to 88% with the idea that the 
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At 
 the same time I had a hardware failure of on disk. :-(. After that failure 
 the recovery process start at degraded ~ 13%“ and stops at 7%.
 Honestly I am scared in the moment I am doing the wrong operation.


 Any chance of adding a new node with some fresh disks? Seems like you
 are operating on the storage capacity limit of the nodes and that your
 only remedy would be adding more spindles.

 Wido

 Regards
 Harald Rößler



 Am 20.10.2014 um 14:51 schrieb Wido den Hollander 
 w...@42on.commailto:w...@42on.com:

 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All

 I have in them moment a issue with my cluster. The recovery process stops.


 See this: 2 active+degraded+remapped+backfill_toofull

 156 pgs backfill_toofull

 You have one or more OSDs which are to full and that causes recovery to
 stop.

 If you add more capacity to the cluster recovery will continue and finish.

 ceph -s
  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck 
 unclean; recovery 111487/1488290 degraded (7.491%)
  monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0http://10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0},
  election epoch 332, quorum 0,1,2 0,12,6
  osdmap e6748: 24 osds: 23 up, 23 in
   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 

Re: [ceph-users] recovery process stops

2014-10-21 Thread Robert LeBlanc
I've had issues magically fix themselves over night after waiting/trying
things for hours.

On Tue, Oct 21, 2014 at 1:02 PM, Harald Rößler harald.roess...@btd.de
wrote:

 After more than 10 hours the same situation, I don’t think it will fix
 self over time. How I can find out what is the problem.


 Am 21.10.2014 um 17:28 schrieb Craig Lewis cle...@centraldesktop.com:

 That will fix itself over time.  remapped just means that Ceph is moving
 the data around.  It's normal to see PGs in the remapped and/or backfilling
 state after OSD restarts.

 They should go down steadily over time.  How long depends on how much data
 is in the PGs, how fast your hardware is, how many OSDs are affected, and
 how much you allow recovery to impact cluster performance.  Mine currently
 take about 20 minutes per PG.  If all 47 are on the same OSD, it'll be a
 while.  If they're evenly split between multiple OSDs, parallelism will
 speed that up.

 On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler harald.roess...@btd.de
 wrote:

 Hi all,

 thank you for your support, now the file system is not degraded any more.
 Now I have a minus degrading :-)

 2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281
 active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB /
 6178 GB avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded
 (-0.123%)

 but ceph reports me a health HEALTH_WARN 47 pgs stuck unclean; recovery
 -1638/1329293 degraded (-0.123%)

 I think this warning is reported because there are 47 active+remapped
 objects, some ideas how to fix that now?

 Kind Regards
 Harald Roessler


 Am 21.10.2014 um 01:03 schrieb Craig Lewis cle...@centraldesktop.com:

 I've been in a state where reweight-by-utilization was deadlocked (not
 the daemons, but the remap scheduling).  After successive osd reweight
 commands, two OSDs wanted to swap PGs, but they were both toofull.  I ended
 up temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the
 impediment, and everything finished remapping.  Everything went smoothly,
 and I changed it back when all the remapping finished.

 Just be careful if you need to get close to mon_osd_full_ratio.  Ceph
 does greater-than on these percentages, not greater-than-equal.  You really
 don't want the disks to get greater-than mon_osd_full_ratio, because all
 external IO will stop until you resolve that.


 On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master keks...@gmail.com
 wrote:

 You can set lower weight on full osds, or try changing the
 osd_near_full_ratio parameter in your cluster from 85 to for example 89.
 But i don't know what can go wrong when you do that.


 2014-10-20 17:12 GMT+02:00 Wido den Hollander w...@42on.com:

 On 10/20/2014 05:10 PM, Harald Rößler wrote:
  yes, tomorrow I will get the replacement of the failed disk, to get a
 new node with many disk will take a few days.
  No other idea?
 

 If the disks are all full, then, no.

 Sorry to say this, but it came down to poor capacity management. Never
 let any disk in your cluster fill over 80% to prevent these situations.

 Wido

  Harald Rößler
 
 
  Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 04:43 PM, Harald Rößler wrote:
  Yes, I had some OSD which was near full, after that I tried to fix
 the problem with ceph osd reweight-by-utilization, but this does not
 help. After that I set the near full ratio to 88% with the idea that the
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At
 the same time I had a hardware failure of on disk. :-(. After that failure
 the recovery process start at degraded ~ 13%“ and stops at 7%.
  Honestly I am scared in the moment I am doing the wrong operation.
 
 
  Any chance of adding a new node with some fresh disks? Seems like you
  are operating on the storage capacity limit of the nodes and that
 your
  only remedy would be adding more spindles.
 
  Wido
 
  Regards
  Harald Rößler
 
 
 
  Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 02:45 PM, Harald Rößler wrote:
  Dear All
 
  I have in them moment a issue with my cluster. The recovery
 process stops.
 
 
  See this: 2 active+degraded+remapped+backfill_toofull
 
  156 pgs backfill_toofull
 
  You have one or more OSDs which are to full and that causes
 recovery to
  stop.
 
  If you add more capacity to the cluster recovery will continue and
 finish.
 
  ceph -s
   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4
 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck
 unclean; recovery 111487/1488290 degraded (7.491%)
   monmap e2: 3 mons at {0=
 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0},
 election epoch 332, quorum 0,1,2 0,12,6
   osdmap e6748: 24 osds: 23 up, 23 in
pgmap v43314672: 3328 pgs: 3031 active+clean, 43
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96
 active+remapped+wait_backfill+backfill_toofull, 31 

Re: [ceph-users] recovery process stops

2014-10-21 Thread Craig Lewis
In that case, take a look at ceph pg dump | grep remapped.  In the up or
active column, there should be one or two common OSDs between the stuck PGs.

Try restarting those OSD daemons.  I've had a few OSDs get stuck scheduling
recovery, particularly around toofull situations.

I've also had Robert's experience of stuck operations becoming unstuck over
night.


On Tue, Oct 21, 2014 at 12:02 PM, Harald Rößler harald.roess...@btd.de
wrote:

 After more than 10 hours the same situation, I don’t think it will fix
 self over time. How I can find out what is the problem.


 Am 21.10.2014 um 17:28 schrieb Craig Lewis cle...@centraldesktop.com:

 That will fix itself over time.  remapped just means that Ceph is moving
 the data around.  It's normal to see PGs in the remapped and/or backfilling
 state after OSD restarts.

 They should go down steadily over time.  How long depends on how much data
 is in the PGs, how fast your hardware is, how many OSDs are affected, and
 how much you allow recovery to impact cluster performance.  Mine currently
 take about 20 minutes per PG.  If all 47 are on the same OSD, it'll be a
 while.  If they're evenly split between multiple OSDs, parallelism will
 speed that up.

 On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler harald.roess...@btd.de
 wrote:

 Hi all,

 thank you for your support, now the file system is not degraded any more.
 Now I have a minus degrading :-)

 2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281
 active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB /
 6178 GB avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded
 (-0.123%)

 but ceph reports me a health HEALTH_WARN 47 pgs stuck unclean; recovery
 -1638/1329293 degraded (-0.123%)

 I think this warning is reported because there are 47 active+remapped
 objects, some ideas how to fix that now?

 Kind Regards
 Harald Roessler


 Am 21.10.2014 um 01:03 schrieb Craig Lewis cle...@centraldesktop.com:

 I've been in a state where reweight-by-utilization was deadlocked (not
 the daemons, but the remap scheduling).  After successive osd reweight
 commands, two OSDs wanted to swap PGs, but they were both toofull.  I ended
 up temporarily increasing mon_osd_nearfull_ratio to 0.87.  That removed the
 impediment, and everything finished remapping.  Everything went smoothly,
 and I changed it back when all the remapping finished.

 Just be careful if you need to get close to mon_osd_full_ratio.  Ceph
 does greater-than on these percentages, not greater-than-equal.  You really
 don't want the disks to get greater-than mon_osd_full_ratio, because all
 external IO will stop until you resolve that.


 On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master keks...@gmail.com
 wrote:

 You can set lower weight on full osds, or try changing the
 osd_near_full_ratio parameter in your cluster from 85 to for example 89.
 But i don't know what can go wrong when you do that.


 2014-10-20 17:12 GMT+02:00 Wido den Hollander w...@42on.com:

 On 10/20/2014 05:10 PM, Harald Rößler wrote:
  yes, tomorrow I will get the replacement of the failed disk, to get a
 new node with many disk will take a few days.
  No other idea?
 

 If the disks are all full, then, no.

 Sorry to say this, but it came down to poor capacity management. Never
 let any disk in your cluster fill over 80% to prevent these situations.

 Wido

  Harald Rößler
 
 
  Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 04:43 PM, Harald Rößler wrote:
  Yes, I had some OSD which was near full, after that I tried to fix
 the problem with ceph osd reweight-by-utilization, but this does not
 help. After that I set the near full ratio to 88% with the idea that the
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At
 the same time I had a hardware failure of on disk. :-(. After that failure
 the recovery process start at degraded ~ 13%“ and stops at 7%.
  Honestly I am scared in the moment I am doing the wrong operation.
 
 
  Any chance of adding a new node with some fresh disks? Seems like you
  are operating on the storage capacity limit of the nodes and that
 your
  only remedy would be adding more spindles.
 
  Wido
 
  Regards
  Harald Rößler
 
 
 
  Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 02:45 PM, Harald Rößler wrote:
  Dear All
 
  I have in them moment a issue with my cluster. The recovery
 process stops.
 
 
  See this: 2 active+degraded+remapped+backfill_toofull
 
  156 pgs backfill_toofull
 
  You have one or more OSDs which are to full and that causes
 recovery to
  stop.
 
  If you add more capacity to the cluster recovery will continue and
 finish.
 
  ceph -s
   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4
 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck
 unclean; recovery 111487/1488290 degraded (7.491%)
   monmap e2: 3 mons at {0=
 

Re: [ceph-users] recovery process stops

2014-10-20 Thread Wido den Hollander
On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All
 
 I have at the moment an issue with my cluster. The recovery process stops.
 

See this: 2 active+degraded+remapped+backfill_toofull

156 pgs backfill_toofull

You have one or more OSDs which are too full, and that causes recovery to
stop.

If you add more capacity to the cluster recovery will continue and finish.
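
(To see exactly which OSDs are tripping the toofull logic, a quick sketch; the
mount-point path is the usual default and only an assumption.)

# Which OSDs are flagged near-full / too-full right now?
ceph health detail | grep -i full

# Per-OSD disk usage on each storage node (default data path assumed):
df -h /var/lib/ceph/osd/ceph-*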

 ceph -s
health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; 
 recovery 111487/1488290 degraded (7.491%)
monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election 
 epoch 332, quorum 0,1,2 0,12,6
osdmap e6748: 24 osds: 23 up, 23 in
 pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 
 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 
 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 
 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded 
 (7.491%)
 
 
 I have tried to restart all OSD in the cluster, but does not help to finish 
 the recovery of the cluster.
 
 Have someone any idea
 
 Kind Regards
 Harald Rößler 
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovery process stops

2014-10-20 Thread Leszek Master
I think it's because you have too-full OSDs, as in the warning message. I had a
similar problem recently and I ran:

ceph osd reweight-by-utilization

But first read what this command does. It solved the problem for me.
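
(A hedged sketch of what that looks like in practice; the threshold value and the
manual per-OSD fallback are assumptions to verify against your release, and the
OSD id is only an example.)

# Reweight OSDs whose utilization is more than 20% above the cluster average
# (120 has long been the default threshold; pass a lower value to be stricter):
ceph osd reweight-by-utilization 120

# Or nudge a single over-full OSD down by hand (weight between 0.0 and 1.0):
ceph osd reweight 23 0.9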

2014-10-20 14:45 GMT+02:00 Harald Rößler harald.roess...@btd.de:

 Dear All

 I have in them moment a issue with my cluster. The recovery process stops.

 ceph -s
health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean;
 recovery 111487/1488290 degraded (7.491%)
monmap e2: 3 mons at {0=
 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election
 epoch 332, quorum 0,1,2 0,12,6
osdmap e6748: 24 osds: 23 up, 23 in
 pgmap v43314672: 3328 pgs: 3031 active+clean, 43
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19
 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped,
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2
 active+degraded+remapped+backfill_toofull, 2
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB
 / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290
 degraded (7.491%)


 I have tried to restart all OSD in the cluster, but does not help to
 finish the recovery of the cluster.

 Have someone any idea

 Kind Regards
 Harald Rößler



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovery process stops

2014-10-20 Thread Harald Rößler
Yes, I had some OSDs which were near full; after that I tried to fix the problem 
with ceph osd reweight-by-utilization, but this did not help. After that I 
set the near-full ratio to 88% with the idea that the remapping would fix the 
issue. A restart of the OSDs doesn't help either. At the same time I had a 
hardware failure of one disk. :-( After that failure the recovery process started 
at ~13% degraded and stopped at 7%.
Honestly, I am scared that I am doing the wrong operation at the moment.

Regards
Harald Rößler   
 


 On 20.10.2014 at 14:51, Wido den Hollander w...@42on.com wrote:
 
 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All
 
 I have in them moment a issue with my cluster. The recovery process stops.
 
 
 See this: 2 active+degraded+remapped+backfill_toofull
 
 156 pgs backfill_toofull
 
 You have one or more OSDs which are to full and that causes recovery to
 stop.
 
 If you add more capacity to the cluster recovery will continue and finish.
 
 ceph -s
   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; 
 recovery 111487/1488290 degraded (7.491%)
   monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election 
 epoch 332, quorum 0,1,2 0,12,6
   osdmap e6748: 24 osds: 23 up, 23 in
pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 
 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 
 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 
 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded 
 (7.491%)
 
 
 I have tried to restart all OSD in the cluster, but does not help to finish 
 the recovery of the cluster.
 
 Have someone any idea
 
 Kind Regards
 Harald Rößler
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovery process stops

2014-10-20 Thread Wido den Hollander
On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSD which was near full, after that I tried to fix the 
 problem with ceph osd reweight-by-utilization, but this does not help. 
 After that I set the near full ratio to 88% with the idea that the remapping 
 would fix the issue. Also a restart of the OSD doesn’t help. At the same time 
 I had a hardware failure of on disk. :-(. After that failure the recovery 
 process start at degraded ~ 13%“ and stops at 7%.
 Honestly I am scared in the moment I am doing the wrong operation.
 

Any chance of adding a new node with some fresh disks? Seems like you
are operating on the storage capacity limit of the nodes and that your
only remedy would be adding more spindles.

Wido

 Regards
 Harald Rößler 
  
 
 
 Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:

 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All

 I have in them moment a issue with my cluster. The recovery process stops.


 See this: 2 active+degraded+remapped+backfill_toofull

 156 pgs backfill_toofull

 You have one or more OSDs which are to full and that causes recovery to
 stop.

 If you add more capacity to the cluster recovery will continue and finish.

 ceph -s
   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; 
 recovery 111487/1488290 degraded (7.491%)
   monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election 
 epoch 332, quorum 0,1,2 0,12,6
   osdmap e6748: 24 osds: 23 up, 23 in
pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 
 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB 
 / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 
 degraded (7.491%)


 I have tried to restart all OSD in the cluster, but does not help to finish 
 the recovery of the cluster.

 Have someone any idea

 Kind Regards
 Harald Rößler   



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 


-- 
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovery process stops

2014-10-20 Thread Wido den Hollander
On 10/20/2014 05:10 PM, Harald Rößler wrote:
 yes, tomorrow I will get the replacement of the failed disk, to get a new 
 node with many disk will take a few days.
 No other idea? 
 

If the disks are all full, then, no.

Sorry to say this, but it came down to poor capacity management. Never
let any disk in your cluster fill over 80% to prevent these situations.

Wido

 Harald Rößler 
 
 
 Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:

 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSD which was near full, after that I tried to fix the 
 problem with ceph osd reweight-by-utilization, but this does not help. 
 After that I set the near full ratio to 88% with the idea that the 
 remapping would fix the issue. Also a restart of the OSD doesn’t help. At 
 the same time I had a hardware failure of on disk. :-(. After that failure 
 the recovery process start at degraded ~ 13%“ and stops at 7%.
 Honestly I am scared in the moment I am doing the wrong operation.


 Any chance of adding a new node with some fresh disks? Seems like you
 are operating on the storage capacity limit of the nodes and that your
 only remedy would be adding more spindles.

 Wido

 Regards
 Harald Rößler   



 Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:

 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All

 I have in them moment a issue with my cluster. The recovery process stops.


 See this: 2 active+degraded+remapped+backfill_toofull

 156 pgs backfill_toofull

 You have one or more OSDs which are to full and that causes recovery to
 stop.

 If you add more capacity to the cluster recovery will continue and finish.

 ceph -s
  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck 
 unclean; recovery 111487/1488290 degraded (7.491%)
  monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, 
 election epoch 332, quorum 0,1,2 0,12,6
  osdmap e6748: 24 osds: 23 up, 23 in
   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 
 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 
 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 
 degraded (7.491%)


 I have tried to restart all OSD in the cluster, but does not help to 
 finish the recovery of the cluster.

 Have someone any idea

 Kind Regards
 Harald Rößler 



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 


-- 
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovery process stops

2014-10-20 Thread Harald Rößler
Yes, I agree 100%, but actually every disk has a maximum of 86% usage, so there
should be a way to recover the cluster. Setting the near full ratio higher than
85% should only be a short-term solution. New disks with higher capacity are
already ordered; I just don't like having a degraded situation for a week or
more. Also, one of the VMs doesn't start because of a slow request warning.
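
From what I have read so far, the threshold that puts PGs into backfill_toofull
seems to be the OSD-side osd_backfill_full_ratio (default 0.85), not only the
mon near full ratio. Would a short-term workaround roughly like this be safe?
I have not verified the exact syntax on our release, so treat it as a sketch,
and it would be reverted once the cluster is clean again:

  # temporarily allow backfill onto OSDs up to ~90% used
  ceph tell osd.\* injectargs '--osd-backfill-full-ratio 0.90'

  # watch recovery progress and the slow request warnings
  ceph -w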

Thanks for your advice.
Harald Rößler   


 Am 20.10.2014 um 17:12 schrieb Wido den Hollander w...@42on.com:
 
 On 10/20/2014 05:10 PM, Harald Rößler wrote:
 Yes, tomorrow I will get the replacement for the failed disk; getting a new
 node with many disks will take a few days.
 No other idea?
 
 
 If the disks are all full, then, no.
 
 Sorry to say this, but it came down to poor capacity management. Never
 let any disk in your cluster fill over 80% to prevent these situations.
 
 Wido
 
 Harald Rößler
 
 
 Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:
 
 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSDs which were near full; after that I tried to fix the
 problem with ceph osd reweight-by-utilization, but this did not help.
 After that I set the near full ratio to 88% with the idea that the
 remapping would fix the issue. A restart of the OSDs didn't help either. At
 the same time I had a hardware failure of one disk. :-( After that failure
 the recovery process started at ~13% degraded and stops at 7%.
 Honestly, I am scared at the moment that I am doing the wrong operation.
 
 
 Any chance of adding a new node with some fresh disks? Seems like you
 are operating at the storage capacity limit of the nodes and your
 only remedy would be adding more spindles.
 
 Wido
 
 Regards
 Harald Rößler  
 
 
 
 Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:
 
 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All
 
 I have an issue with my cluster at the moment: the recovery process
 stops.
 
 
 See this: 2 active+degraded+remapped+backfill_toofull
 
 156 pgs backfill_toofull
 
 You have one or more OSDs which are too full and that causes recovery to
 stop.
 
 If you add more capacity to the cluster recovery will continue and finish.
 
 ceph -s
 health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck 
 unclean; recovery 111487/1488290 degraded (7.491%)
 monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, 
 election epoch 332, quorum 0,1,2 0,12,6
 osdmap e6748: 24 osds: 23 up, 23 in
  pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 
 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 
 active+recovery_wait+remapped, 21 
 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 
 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 
 degraded (7.491%)
 
 
 I have tried to restart all OSDs in the cluster, but it does not help to
 finish the recovery of the cluster.
 
 Does anyone have an idea?
 
 Kind Regards
 Harald Rößler
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 
 
 
 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] recovery process stops

2014-10-20 Thread Leszek Master
You can set a lower weight on the full OSDs, or try changing the
osd_near_full_ratio parameter in your cluster from 85 to, for example, 89.
But I don't know what can go wrong when you do that.
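
Roughly it would look like this; the OSD id and the values are only examples,
and I am not 100% sure the set_nearfull_ratio syntax matches your release:

  # push some data off one of the full OSDs (override reweight, range 0.0-1.0)
  ceph osd reweight <osd-id> 0.85

  # raise the near full warning threshold cluster-wide
  ceph pg set_nearfull_ratio 0.89

Keep in mind that reweighting a full OSD moves its data onto the other OSDs,
which are also close to full here, so watch ceph -s while doing it.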

2014-10-20 17:12 GMT+02:00 Wido den Hollander w...@42on.com:

 On 10/20/2014 05:10 PM, Harald Rößler wrote:
  Yes, tomorrow I will get the replacement for the failed disk; getting a
 new node with many disks will take a few days.
  No other idea?
 

 If the disks are all full, then, no.

 Sorry to say this, but it came down to poor capacity management. Never
 let any disk in your cluster fill over 80% to prevent these situations.

 Wido

  Harald Rößler
 
 
  Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 04:43 PM, Harald Rößler wrote:
  Yes, I had some OSDs which were near full; after that I tried to fix the
 problem with ceph osd reweight-by-utilization, but this did not help.
 After that I set the near full ratio to 88% with the idea that the
 remapping would fix the issue. A restart of the OSDs didn't help either. At
 the same time I had a hardware failure of one disk. :-( After that failure
 the recovery process started at ~13% degraded and stops at 7%.
  Honestly, I am scared at the moment that I am doing the wrong operation.
 
 
  Any chance of adding a new node with some fresh disks? Seems like you
  are operating at the storage capacity limit of the nodes and your
  only remedy would be adding more spindles.
 
  Wido
 
  Regards
  Harald Rößler
 
 
 
  Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:
 
  On 10/20/2014 02:45 PM, Harald Rößler wrote:
  Dear All
 
  I have an issue with my cluster at the moment: the recovery process
 stops.
 
 
  See this: 2 active+degraded+remapped+backfill_toofull
 
  156 pgs backfill_toofull
 
  You have one or more OSDs which are too full and that causes recovery to
  stop.
 
  If you add more capacity to the cluster recovery will continue and
 finish.
 
  ceph -s
   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4
 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck
 unclean; recovery 111487/1488290 degraded (7.491%)
   monmap e2: 3 mons at {0=
 10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election
 epoch 332, quorum 0,1,2 0,12,6
   osdmap e6748: 24 osds: 23 up, 23 in
pgmap v43314672: 3328 pgs: 3031 active+clean, 43
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19
 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped,
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2
 active+degraded+remapped+backfill_toofull, 2
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB
 / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290
 degraded (7.491%)
 
 
  I have tried to restart all OSDs in the cluster, but it does not help to
 finish the recovery of the cluster.
 
  Does anyone have an idea?
 
  Kind Regards
  Harald Rößler
 
 
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
  --
  Wido den Hollander
  Ceph consultant and trainer
  42on B.V.
 
  Phone: +31 (0)20 700 9902
  Skype: contact42on
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
  --
  Wido den Hollander
  Ceph consultant and trainer
  42on B.V.
 
  Phone: +31 (0)20 700 9902
  Skype: contact42on
 


 --
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.

 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] recovery process stops

2014-10-20 Thread Harald Rößler
Yes, tomorrow I will get the replacement for the failed disk; getting a new node
with many disks will take a few days.
No other idea?
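
For the disk swap itself I am planning the usual remove-and-recreate sequence,
roughly like below; please correct me if this is wrong for our version (<id> is
the OSD on the failed disk):

  # remove the dead OSD from the cluster
  ceph osd out <id>
  ceph osd crush remove osd.<id>
  ceph auth del osd.<id>
  ceph osd rm <id>

  # then prepare and activate the replacement disk on the node,
  # e.g. with ceph-disk prepare / ceph-disk activate as usual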

Harald Rößler   


 Am 20.10.2014 um 16:45 schrieb Wido den Hollander w...@42on.com:
 
 On 10/20/2014 04:43 PM, Harald Rößler wrote:
 Yes, I had some OSDs which were near full; after that I tried to fix the
 problem with ceph osd reweight-by-utilization, but this did not help.
 After that I set the near full ratio to 88% with the idea that the remapping
 would fix the issue. A restart of the OSDs didn't help either. At the same
 time I had a hardware failure of one disk. :-( After that failure the
 recovery process started at ~13% degraded and stops at 7%.
 Honestly, I am scared at the moment that I am doing the wrong operation.
 
 
 Any chance of adding a new node with some fresh disks? Seems like you
 are operating at the storage capacity limit of the nodes and your
 only remedy would be adding more spindles.
 
 Wido
 
 Regards
 Harald Rößler
 
 
 
 Am 20.10.2014 um 14:51 schrieb Wido den Hollander w...@42on.com:
 
 On 10/20/2014 02:45 PM, Harald Rößler wrote:
 Dear All
 
 I have an issue with my cluster at the moment: the recovery process stops.
 
 
 See this: 2 active+degraded+remapped+backfill_toofull
 
 156 pgs backfill_toofull
 
 You have one or more OSDs which are too full and that causes recovery to
 stop.
 
 If you add more capacity to the cluster recovery will continue and finish.
 
 ceph -s
  health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs 
 backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; 
 recovery 111487/1488290 degraded (7.491%)
  monmap e2: 3 mons at 
 {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, 
 election epoch 332, quorum 0,1,2 0,12,6
  osdmap e6748: 24 osds: 23 up, 23 in
   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 
 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 
 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 
 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 
 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 
 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 
 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 
 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 
 active+degraded+remapped+backfill_toofull, 2 
 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB 
 / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 
 degraded (7.491%)
 
 
 I have tried to restart all OSDs in the cluster, but it does not help to
 finish the recovery of the cluster.
 
 Does anyone have an idea?
 
 Kind Regards
 Harald Rößler  
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
 
 -- 
 Wido den Hollander
 Ceph consultant and trainer
 42on B.V.
 
 Phone: +31 (0)20 700 9902
 Skype: contact42on

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com