Re: [ceph-users] Negative number of objects degraded for extended period of time

2014-11-17 Thread Craig Lewis
Well, after 4 days, this is probably moot.  Hopefully it's finished
backfilling, and your problem is gone.

If not, I believe that if you fix those backfill_toofull PGs, the negative
numbers will start approaching zero.  I seem to recall that a negative
degraded count is a special case of degraded, but I don't remember exactly,
and can't find any references.  I have seen it before, and it went away
when my cluster became healthy.
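
If you want to see which PGs are backfill_toofull and which OSDs they're
waiting on, something like this should do it (from memory, so double-check
the syntax on your version):

    ceph health detail | grep backfill_toofull
    ceph pg dump_stuck unclean
    ceph pg <pgid> query

(<pgid> is a placeholder for one of the stuck PGs.)  The query output
should show which OSD the PG is trying to backfill to, which is usually
the one that's too full.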

As long as you still have OSDs completing their backfilling, I'd let it
run.
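
In the meantime you can keep an eye on progress with ceph -w, or just poll
it with something like:

    watch -n 10 'ceph status'

The backfilling PG counts and the degraded number should keep moving as
long as it's making progress.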

If you get to the point where all of the backfills are done, and you're
left with only wait_backfill+backfill_toofull, then you can bump
osd_backfill_full_ratio, mon_osd_nearfull_ratio, and maybe
osd_failsafe_nearfull_ratio.  If you do, be careful, and only bump them
just enough to let the backfills start.  If you set them to something like
0.99, you risk running the OSDs completely out of space, and bad things
will happen.
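
For reference, I think you can inject the new ratios on the fly with
something like the lines below.  The 0.88 / 0.92 values are just an
example -- pick numbers only slightly above your fullest OSD -- and I'm
not 100% sure the mon.* wildcard works on emperor, so you may have to do
it per daemon or via ceph.conf and a restart:

    ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.88'
    ceph tell osd.* injectargs '--osd_failsafe_nearfull_ratio 0.92'
    ceph tell mon.* injectargs '--mon_osd_nearfull_ratio 0.88'

Injected values don't survive a daemon restart, so put the same settings in
ceph.conf if you need them to stick, and drop them back to the defaults
(0.85 / 0.85 / 0.90, if I remember right) once the cluster is healthy.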




On Thu, Nov 13, 2014 at 7:57 AM, Fred Yang frederic.y...@gmail.com wrote:

 Hi,

 The Ceph cluster we are running had a few OSDs approaching 95% usage more
 than a week ago, so I ran a reweight to balance them out and, in the
 meantime, instructed the application to purge data that was no longer
 required. But after the large purge was issued from the application side
 (all OSDs' usage dropped below 20%), the cluster fell into this weird state
 for days: the objects degraded count has remained negative for more than 7
 days. I'm seeing some I/O going on on the OSDs consistently, but the
 (negative) objects degraded count does not change much:

 2014-11-13 10:43:07.237292 mon.0 [INF] pgmap v5935301: 44816 pgs: 44713
 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27
 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33
 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473
 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 30172 kB/s wr, 58 op/s;
 -13582/1468299 objects degraded (-0.925%)
 2014-11-13 10:43:08.248232 mon.0 [INF] pgmap v5935302: 44816 pgs: 44713
 active+clean, 1 active+backfilling, 20 active+remapped+wait_backfill, 27
 active+remapped+wait_backfill+backfill_toofull, 11 active+recovery_wait, 33
 active+remapped+backfilling, 11 active+wait_backfill+backfill_toofull; 1473
 GB data, 2985 GB used, 17123 GB / 20109 GB avail; 26459 kB/s wr, 51 op/s;
 -13582/1468303 objects degraded (-0.925%)

 Any idea what might be happening here? It seems the
 active+remapped+wait_backfill+backfill_toofull PGs are stuck?

  osdmap e43029: 36 osds: 36 up, 36 in
   pgmap v5935658: 44816 pgs, 32 pools, 1488 GB data, 714 kobjects
         3017 GB used, 17092 GB / 20109 GB avail
         -13438/1475773 objects degraded (-0.911%)
            44713 active+clean
                1 active+backfilling
               20 active+remapped+wait_backfill
               27 active+remapped+wait_backfill+backfill_toofull
               11 active+recovery_wait
               33 active+remapped+backfilling
               11 active+wait_backfill+backfill_toofull
   client io 478 B/s rd, 40170 kB/s wr, 80 op/s

 The cluster is running v0.72.2.  We are planning to upgrade to Firefly,
 but I would like to get the cluster state clean before the upgrade.

 Thanks,
 Fred



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

