Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

@Paul: Thanks! I know, I read the whole thread about size 2 some months ago, but this was not my decision; I had to set it up like that.

In the meantime I rebooted node1001 and node1002 with the "noout" flag set, and now peering has finished, only 0.0x% of objects are being rebalanced, and IO is flowing again. This happened as soon as the OSDs were down (but not out). This looks very much like a bug to me, doesn't it? Restarting an OSD to "repair" CRUSH?

I also queried the PG, but it did not show any error. It just lists stats and that the PG has been active since 8:40 this morning. There are rows with "blocked by" but no value; is that supposed to be filled with data?

Kind regards,
Kevin

2018-05-17 16:45 GMT+02:00 Paul Emmerich:
> Check ceph pg query, it will (usually) tell you why something is stuck
> inactive.
>
> Also: never do min_size 1.
>
> Paul
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Check ceph pg query, it will (usually) tell you why something is stuck inactive.

Also: never do min_size 1.

Paul

2018-05-17 15:48 GMT+02:00 Kevin Olbrich:
> I was able to obtain another NVMe to get the HDDs in node1004 into the
> cluster. The number of disks (all 1TB) is now balanced between racks,
> still some inactive PGs.
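Paul's suggestion can be scripted: pull the stuck PG ids out of 'ceph pg dump_stuck' style output and feed each one to 'ceph pg <pgid> query'. A minimal sketch, where the sample text stands in for live cluster output and the PG ids are invented:

```shell
# Pull stuck PG ids out of 'ceph pg dump_stuck' style output so each one
# can be inspected with 'ceph pg <pgid> query'. The sample text below
# stands in for live cluster output; the PG ids are invented.
dump_stuck='PG_STAT STATE
1.2a    activating+remapped
1.4f    activating+undersized+degraded+remapped
2.10    active+clean'
# Skip the header line and keep PGs whose state does not start with "active+"
pgids=$(printf '%s\n' "$dump_stuck" | awk 'NR > 1 && $2 !~ /^active\+/ {print $1}')
printf '%s\n' "$pgids"
# On a live cluster, each id would then be checked with e.g.:
#   ceph pg 1.2a query    # look at the recovery_state section
```

On a real cluster the here-string would be replaced by the actual command output, e.g. `pgids=$(ceph pg dump_stuck inactive 2>/dev/null | awk ...)`.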
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
I was able to obtain another NVMe to get the HDDs in node1004 into the cluster. The number of disks (all 1TB) is now balanced between racks, still some inactive PGs:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5167 GB used, 14133 GB / 19300 GB avail
    pgs:     1.562% pgs not active
             1183/1309952 objects degraded (0.090%)
             199660/1309952 objects misplaced (15.242%)
             1072 active+clean
              405 active+remapped+backfill_wait
               35 active+remapped+backfilling
               21 activating+remapped
                3 activating+undersized+degraded+remapped

ID  CLASS WEIGHT   TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       18.85289 root default
-16       18.85289     datacenter dc01
-19       18.85289         pod dc01-agg01
-10        8.98700             rack dc01-rack02
 -4        4.03899                 host node1001
  0   hdd  0.90999                     osd.0            up  1.00000 1.00000
  1   hdd  0.90999                     osd.1            up  1.00000 1.00000
  5   hdd  0.90999                     osd.5            up  1.00000 1.00000
  2   ssd  0.43700                     osd.2            up  1.00000 1.00000
  3   ssd  0.43700                     osd.3            up  1.00000 1.00000
  4   ssd  0.43700                     osd.4            up  1.00000 1.00000
 -7        4.94899                 host node1002
  9   hdd  0.90999                     osd.9            up  1.00000 1.00000
 10   hdd  0.90999                     osd.10           up  1.00000 1.00000
 11   hdd  0.90999                     osd.11           up  1.00000 1.00000
 12   hdd  0.90999                     osd.12           up  1.00000 1.00000
  6   ssd  0.43700                     osd.6            up  1.00000 1.00000
  7   ssd  0.43700                     osd.7            up  1.00000 1.00000
  8   ssd  0.43700                     osd.8            up  1.00000 1.00000
-11        9.86589             rack dc01-rack03
-22        5.38794                 host node1003
 17   hdd  0.90999                     osd.17           up  1.00000 1.00000
 18   hdd  0.90999                     osd.18           up  1.00000 1.00000
 24   hdd  0.90999                     osd.24           up  1.00000 1.00000
 26   hdd  0.90999                     osd.26           up  1.00000 1.00000
 13   ssd  0.43700                     osd.13           up  1.00000 1.00000
 14   ssd  0.43700                     osd.14           up  1.00000 1.00000
 15   ssd  0.43700                     osd.15           up  1.00000 1.00000
 16   ssd  0.43700                     osd.16           up  1.00000 1.00000
-25        4.47795                 host node1004
 23   hdd  0.90999                     osd.23           up  1.00000 1.00000
 25   hdd  0.90999                     osd.25           up  1.00000 1.00000
 27   hdd  0.90999                     osd.27           up  1.00000 1.00000
 19   ssd  0.43700                     osd.19           up  1.00000 1.00000
 20   ssd  0.43700                     osd.20           up  1.00000 1.00000
 21   ssd  0.43700                     osd.21           up  1.00000 1.00000
 22   ssd  0.43700                     osd.22           up  1.00000 1.00000

Pools are size 2, min_size 1 during setup.

The count of PGs in the activating state is related to the weight of the OSDs, but why do they fail to proceed to active+clean or active+remapped?

Kind regards,
Kevin

2018-05-17 14:05 GMT+02:00 Kevin Olbrich:
> Ok, I just waited some time but I still got some "activating" issues.
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Ok, I just waited some time but I still got some "activating" issues:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5194 GB used, 11312 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             195386/1309948 objects misplaced (14.916%)
             1147 active+clean
              235 active+remapped+backfill_wait
              107 activating+remapped
               32 active+remapped+backfilling
               15 activating+undersized+degraded+remapped

I set these settings at runtime:

ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
ceph tell 'mon.*' injectargs '--mon_max_pg_per_osd 800'
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Sure, mon_max_pg_per_osd is oversized, but this is just temporary. The calculated PG count per OSD is 200.

I searched the net and the bug tracker; most posts suggest osd_max_pg_per_osd_hard_ratio = 32 to fix this issue, but this time I got more stuck PGs.

Any more hints?

Kind regards,
Kevin

2018-05-17 13:37 GMT+02:00 Kevin Olbrich:
> PS: Cluster currently is size 2. I used PGCalc on the Ceph website which,
> by default, will place 200 PGs on each OSD.
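One thing to keep in mind with the injectargs calls above: they change values only at runtime and revert when a daemon restarts. If the tuning should survive restarts, the same options would also go into ceph.conf, roughly like this (a sketch for a luminous-era cluster; the section placement is an assumption, check the documentation for your release):

```ini
# Hypothetical ceph.conf fragment mirroring the injectargs calls above;
# injected values revert on daemon restart unless persisted like this.
[osd]
osd max backfills = 16
osd recovery max active = 4
osd max pg per osd hard ratio = 32

[mon]
mon max pg per osd = 800
```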
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
PS: Cluster currently is size 2. I used PGCalc on the Ceph website which, by default, will place 200 PGs on each OSD. I read about the protection in the docs and later noticed that I had better placed only 100 PGs.

2018-05-17 13:35 GMT+02:00 Kevin Olbrich:
> Hi!
>
> Thanks for your quick reply.
> Before I read your mail, I applied the following conf to my OSDs:
> ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'
Re: [ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

Thanks for your quick reply.
Before I read your mail, I applied the following conf to my OSDs:
ceph tell 'osd.*' injectargs '--osd_max_pg_per_osd_hard_ratio 32'

Status is now:

  data:
    pools:   2 pools, 1536 pgs
    objects: 639k objects, 2554 GB
    usage:   5211 GB used, 11295 GB / 16506 GB avail
    pgs:     7.943% pgs not active
             5567/1309948 objects degraded (0.425%)
             252327/1309948 objects misplaced (19.262%)
             1030 active+clean
              351 active+remapped+backfill_wait
              107 activating+remapped
               33 active+remapped+backfilling
               15 activating+undersized+degraded+remapped

A little bit better, but still some non-active PGs. I will investigate your other hints!

Thanks,
Kevin

2018-05-17 13:30 GMT+02:00 Burkhard Linke <burkhard.li...@computational.bio.uni-giessen.de>:
> Hi,
>
> On 05/17/2018 01:09 PM, Kevin Olbrich wrote:
>> Today I added some new OSDs (nearly doubled) to my luminous cluster.
>> I then changed pg(p)_num from 256 to 1024 for that pool because it was
>> complaining about too few PGs.
>
> You need to resolve the unknown/peering/activating pgs first. You have
> 1536 PGs; assuming replication size 3, this makes 4608 PG copies. Given
> 25 OSDs and the heterogeneous host sizes, I assume that some OSDs hold
> more than 200 PGs. There's a threshold for the number of PGs; reaching
> this threshold keeps the OSDs from accepting new PGs.
>
> Try to increase the threshold (mon_max_pg_per_osd /
> max_pg_per_osd_hard_ratio / osd_max_pg_per_osd_hard_ratio, not sure
> about the exact one, consult the documentation) to allow more PGs on
> the OSDs. If this is the cause of the problem, the peering and
> activating states should be resolved within a short time.
>
> You can also check the number of PGs per OSD with 'ceph osd df'; the
> last column is the current number of PGs.
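Burkhard's 'ceph osd df' check can be automated: the last column is the PG count per OSD, so a one-liner can flag OSDs over the limit. A sketch using sample output (the weights and PG counts below are made up, not from this cluster):

```shell
# Flag OSDs whose PG count exceeds a threshold, from 'ceph osd df' style
# output (last column = current number of PGs). Sample data, not live output.
osd_df='ID CLASS WEIGHT  REWEIGHT SIZE  USE   AVAIL %USE  VAR  PGS
0  hdd   0.90999 1.00000 931G  250G  681G  26.9  1.01 214
2  ssd   0.43700 1.00000 447G  120G  327G  26.8  1.00 187'
limit=200   # e.g. the default mon_max_pg_per_osd
over=$(printf '%s\n' "$osd_df" | awk -v lim="$limit" \
    'NR > 1 && $NF+0 > lim+0 {print "osd." $1 " holds " $NF " PGs (limit " lim ")"}')
printf '%s\n' "$over"
# On a live cluster: ceph osd df | awk ... with the same filter.
```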
[ceph-users] Blocked requests activating+remapped after extending pg(p)_num
Hi!

Today I added some new OSDs (nearly doubled) to my luminous cluster. I then changed pg(p)_num from 256 to 1024 for that pool because it was complaining about too few PGs. (I have since noticed that this should better have been done in small steps.)

This is the current status:

  health: HEALTH_ERR
          336568/1307562 objects misplaced (25.740%)
          Reduced data availability: 128 pgs inactive, 3 pgs peering, 1 pg stale
          Degraded data redundancy: 6985/1307562 objects degraded (0.534%), 19 pgs degraded, 19 pgs undersized
          107 slow requests are blocked > 32 sec
          218 stuck requests are blocked > 4096 sec

  data:
    pools:   2 pools, 1536 pgs
    objects: 638k objects, 2549 GB
    usage:   5210 GB used, 11295 GB / 16506 GB avail
    pgs:     0.195% pgs unknown
             8.138% pgs not active
             6985/1307562 objects degraded (0.534%)
             336568/1307562 objects misplaced (25.740%)
             855 active+clean
             517 active+remapped+backfill_wait
             107 activating+remapped
              31 active+remapped+backfilling
              15 activating+undersized+degraded+remapped
               4 active+undersized+degraded+remapped+backfilling
               3 unknown
               3 peering
               1 stale+active+clean

OSD tree:

ID  CLASS WEIGHT   TYPE NAME                        STATUS REWEIGHT PRI-AFF
 -1       16.12177 root default
-16       16.12177     datacenter dc01
-19       16.12177         pod dc01-agg01
-10        8.98700             rack dc01-rack02
 -4        4.03899                 host node1001
  0   hdd  0.90999                     osd.0            up  1.00000 1.00000
  1   hdd  0.90999                     osd.1            up  1.00000 1.00000
  5   hdd  0.90999                     osd.5            up  1.00000 1.00000
  2   ssd  0.43700                     osd.2            up  1.00000 1.00000
  3   ssd  0.43700                     osd.3            up  1.00000 1.00000
  4   ssd  0.43700                     osd.4            up  1.00000 1.00000
 -7        4.94899                 host node1002
  9   hdd  0.90999                     osd.9            up  1.00000 1.00000
 10   hdd  0.90999                     osd.10           up  1.00000 1.00000
 11   hdd  0.90999                     osd.11           up  1.00000 1.00000
 12   hdd  0.90999                     osd.12           up  1.00000 1.00000
  6   ssd  0.43700                     osd.6            up  1.00000 1.00000
  7   ssd  0.43700                     osd.7            up  1.00000 1.00000
  8   ssd  0.43700                     osd.8            up  1.00000 1.00000
-11        7.13477             rack dc01-rack03
-22        5.38678                 host node1003
 17   hdd  0.90970                     osd.17           up  1.00000 1.00000
 18   hdd  0.90970                     osd.18           up  1.00000 1.00000
 24   hdd  0.90970                     osd.24           up  1.00000 1.00000
 26   hdd  0.90970                     osd.26           up  1.00000 1.00000
 13   ssd  0.43700                     osd.13           up  1.00000 1.00000
 14   ssd  0.43700                     osd.14           up  1.00000 1.00000
 15   ssd  0.43700                     osd.15           up  1.00000 1.00000
 16   ssd  0.43700                     osd.16           up  1.00000 1.00000
-25        1.74799                 host node1004
 19   ssd  0.43700                     osd.19           up  1.00000 1.00000
 20   ssd  0.43700                     osd.20           up  1.00000 1.00000
 21   ssd  0.43700                     osd.21           up  1.00000 1.00000
 22   ssd  0.43700                     osd.22           up  1.00000 1.00000

The crush rule is set to chooseleaf rack, and size is (temporarily!) 2.

Why are PGs stuck in peering and activating? "ceph df" shows that only 1.5 TB are used on the pool, residing on the HDDs, which would perfectly fit the crush rule(?)

Is this only a problem during recovery, with the cluster moving to OK after rebalancing, or can I take any action to unblock IO on the HDD pool? This is a pre-prod cluster; it does not have the highest priority, but I would appreciate it if we could use it before rebalancing is completed.

Kind regards,
Kevin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
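As noted above, the 256 to 1024 jump should better have been done in small steps. A sketch of such a stepped increase follows; the pool name and step size are assumptions, and this version only prints the commands it would run:

```shell
# Sketch: raise pg_num in small increments instead of one 256 -> 1024 jump,
# waiting for the cluster to settle between steps. Pool name is made up.
pool=hdd-pool          # assumption: your pool's name
cur=256 target=1024 step=128 steps=""
while [ "$cur" -lt "$target" ]; do
  next=$((cur + step))
  if [ "$next" -gt "$target" ]; then next=$target; fi
  echo "ceph osd pool set $pool pg_num $next"
  # In practice: run that command, wait until 'ceph status' shows the new
  # PGs active+clean, then raise pgp_num to the same value before the
  # next step.
  steps="$steps $next"
  cur=$next
done
```

Printing instead of executing makes the plan reviewable first; dropping the `echo` would apply it for real.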
Re: [ceph-users] Blocked Requests
Hi all,

So, using ceph-ansible, I built the below-mentioned cluster with 2 OSD nodes and 3 mons. Just after creating the OSDs I started benchmarking performance using "rbd bench" and "rados bench" and saw the performance drop. Checking the status shows slow requests.

[root@storage-28-1 ~]# ceph -s
  cluster:
    id:     009cbed0-e5a8-4b18-a313-098e55742e85
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            1264 slow requests are blocked > 32 sec

  services:
    mon:         3 daemons, quorum storage-30,storage-29,storage-28-1
    mgr:         storage-30(active), standbys: storage-28-1, storage-29
    mds:         cephfs-3/3/3 up {0=storage-30=up:active,1=storage-28-1=up:active,2=storage-29=up:active}
    osd:         33 osds: 33 up, 33 in
    tcmu-runner: 2 daemons active

  data:
    pools:   3 pools, 1536 pgs
    objects: 13289 objects, 42881 MB
    usage:   102 GB used, 55229 GB / 55331 GB avail
    pgs:     1536 active+clean

  io:
    client: 1694 B/s rd, 1 op/s rd, 0 op/s wr

[root@storage-28-1 ~]# ceph health detail
HEALTH_WARN insufficient standby MDS daemons available; 904 slow requests are blocked > 32 sec
MDS_INSUFFICIENT_STANDBY insufficient standby MDS daemons available
    have 0; want 1 more
REQUEST_SLOW 904 slow requests are blocked > 32 sec
    364 ops are blocked > 1048.58 sec
    212 ops are blocked > 524.288 sec
    164 ops are blocked > 262.144 sec
    100 ops are blocked > 131.072 sec
    64 ops are blocked > 65.536 sec
    osd.11 has blocked requests > 524.288 sec
    osds 9,32 have blocked requests > 1048.58 sec

osd 9 log: https://pastebin.com/ex41cFww

I see that from time to time different OSDs report blocked requests. I am not sure what the cause could be. Can anyone help me fix this, please?

[root@storage-28-1 ~]# ceph osd tree
ID CLASS WEIGHT   TYPE NAME           STATUS REWEIGHT PRI-AFF
-1       54.03387 root default
-3       27.83563     host storage-29
 2   hdd  1.63739         osd.2           up  1.00000 1.00000
 3   hdd  1.63739         osd.3           up  1.00000 1.00000
 4   hdd  1.63739         osd.4           up  1.00000 1.00000
 5   hdd  1.63739         osd.5           up  1.00000 1.00000
 6   hdd  1.63739         osd.6           up  1.00000 1.00000
 7   hdd  1.63739         osd.7           up  1.00000 1.00000
 8   hdd  1.63739         osd.8           up  1.00000 1.00000
 9   hdd  1.63739         osd.9           up  1.00000 1.00000
10   hdd  1.63739         osd.10          up  1.00000 1.00000
11   hdd  1.63739         osd.11          up  1.00000 1.00000
12   hdd  1.63739         osd.12          up  1.00000 1.00000
13   hdd  1.63739         osd.13          up  1.00000 1.00000
14   hdd  1.63739         osd.14          up  1.00000 1.00000
15   hdd  1.63739         osd.15          up  1.00000 1.00000
16   hdd  1.63739         osd.16          up  1.00000 1.00000
17   hdd  1.63739         osd.17          up  1.00000 1.00000
18   hdd  1.63739         osd.18          up  1.00000 1.00000
-5       26.19824     host storage-30
 0   hdd  1.63739         osd.0           up  1.00000 1.00000
 1   hdd  1.63739         osd.1           up  1.00000 1.00000
19   hdd  1.63739         osd.19          up  1.00000 1.00000
20   hdd  1.63739         osd.20          up  1.00000 1.00000
21   hdd  1.63739         osd.21          up  1.00000 1.00000
22   hdd  1.63739         osd.22          up  1.00000 1.00000
23   hdd  1.63739         osd.23          up  1.00000 1.00000
24   hdd  1.63739         osd.24          up  1.00000 1.00000
25   hdd  1.63739         osd.25          up  1.00000 1.00000
26   hdd  1.63739         osd.26          up  1.00000 1.00000
27   hdd  1.63739         osd.27          up  1.00000 1.00000
28   hdd  1.63739         osd.28          up  1.00000 1.00000
29   hdd  1.63739         osd.29          up  1.00000 1.00000
30   hdd  1.63739         osd.30          up  1.00000 1.00000
31   hdd  1.63739         osd.31          up  1.00000 1.00000
32   hdd  1.63739         osd.32          up  1.00000 1.00000

thanks

On Fri, Apr 20, 2018 at 10:24 AM, Shantur Rathore wrote:
> Thanks Alfredo. I will use ceph-volume.
>
> On Thu, Apr 19, 2018 at 4:24 PM, Alfredo Deza wrote:
>> On Thu, Apr 19, 2018 at 11:10 AM, Shantur Rathore wrote:
>>> Hi,
>>>
>>> I am building my first Ceph cluster from hardware left over from a
>>> previous project. I have been reading a lot of Ceph documentation but
>>> need some help to make sure I am going the right way.
>>>
>>> To set the stage, below is what I have:
>>>
>>> Rack-1
>>>
>>> 1 x HP DL360 G9 with
>>>   - 256 GB memory
>>>   - 5 x 300GB HDD
>>>   - 2 x SAS HBA
>>>   - 4 x 10GbE networking card
>>>
>>> 1 x SuperMicro chassis with 17 x HP Enterprise 400GB SSD and 17 x HP
>>> Enterprise 1.7TB HDD
>>> Chassis and HP server are connected with 2 x SAS HBA for redundancy.
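As a quick sanity check on output like the 'ceph health detail' above, the per-bucket "ops are blocked" counts can be summed and compared against the REQUEST_SLOW total. A sketch over sample text matching the counts quoted above:

```shell
# Sum the 'N ops are blocked > T sec' buckets from 'ceph health detail'
# style output. The sample text mirrors the health detail quoted above.
health='REQUEST_SLOW 904 slow requests are blocked > 32 sec
364 ops are blocked > 1048.58 sec
212 ops are blocked > 524.288 sec
164 ops are blocked > 262.144 sec
100 ops are blocked > 131.072 sec
64 ops are blocked > 65.536 sec'
# Only lines that start with a count, so the summary line is not double-counted
total=$(printf '%s\n' "$health" | awk '/^[0-9]+ ops are blocked/ {sum += $1} END {print sum}')
echo "total blocked ops: $total"
```

Here the buckets sum to 904, matching the REQUEST_SLOW total, which confirms the buckets cover all slow requests rather than a subset.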
Re: [ceph-users] Blocked requests
Hello Matthew,

thanks for your feedback! Please clarify one point: do you mean that you recreated the pool as an erasure-coded one, or that you recreated it as a regular replicated one? In other words, do you now have an erasure-coded pool in production as a gnocchi backend?

In any case, given the instability you mention, experimenting with BlueStore looks like a better alternative.

Thanks again,
Fulvio

Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud <mattstr...@overstock.com>
To: Fulvio Galeazzi <fulvio.galea...@garr.it>, Brian Andrus <brian.and...@dreamhost.com>
CC: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Date: 12/13/2017 5:05 PM

> We fixed it by destroying the pool and recreating it, though this isn't
> really a fix. Come to find out, Ceph has a weakness for small objects
> with a high change rate (the behavior that gnocchi displays). The
> cluster will keep going fine until an event (a reboot, OSD failure,
> etc.) happens. I haven't been able to find another solution. I have
> heard that BlueStore handles this better, but that wasn't stable on the
> release we are on.
>
> Thanks,
> Matthew Stroud
Re: [ceph-users] Blocked requests
We fixed it by destroying the pool and recreating it, though this isn't really a fix. Come to find out, Ceph has a weakness for small, high-change-rate objects (the behavior that gnocchi displays). The cluster will keep going fine until an event (a reboot, OSD failure, etc.) happens. I haven't been able to find another solution. I have heard that BlueStore handles this better, but that wasn't stable on the release we are on.

Thanks,
Matthew Stroud

On 12/13/17, 3:56 AM, "Fulvio Galeazzi" <fulvio.galea...@garr.it> wrote:
<snip>
Re: [ceph-users] Blocked requests
Hallo Matthew, I am now facing the same issue and found this message of yours. Were you eventually able to figure out what the problem is with erasure-coded pools? At first sight, the bugzilla page linked by Brian does not seem to specifically mention erasure-coded pools...
Thanks for your help
Fulvio

Original Message
Subject: Re: [ceph-users] Blocked requests
From: Matthew Stroud <mattstr...@overstock.com>
To: Brian Andrus <brian.and...@dreamhost.com>
CC: "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Date: 09/07/2017 11:01 PM
<snip>
Re: [ceph-users] Blocked requests
Is it this? https://bugzilla.redhat.com/show_bug.cgi?id=1430588

On Fri, Sep 8, 2017 at 7:01 AM, Matthew Stroud <mattstr...@overstock.com> wrote:
<snip>
Re: [ceph-users] Blocked requests
After some troubleshooting, the issues appear to be caused by gnocchi using rados. I'm trying to figure out why.

Thanks,
Matthew Stroud

From: Brian Andrus <brian.and...@dreamhost.com>
Date: Thursday, September 7, 2017 at 1:53 PM
To: Matthew Stroud <mattstr...@overstock.com>
Cc: David Turner <drakonst...@gmail.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests

"ceph osd blocked-by" can do the same thing as that provided script.

Can you post relevant osd.10 logs and a pg dump of an affected placement group? Specifically interested in the recovery_state section.

Hopefully you were careful in how you were rebooting OSDs, and did not reboot multiple in the same failure domain before recovery was able to occur.

On Thu, Sep 7, 2017 at 12:30 PM, Matthew Stroud <mattstr...@overstock.com> wrote:

Here is the output of your snippet:

[root@mon01 ceph-conf]# bash /tmp/ceph_foo.sh
6 osd.10
52 ops are blocked > 4194.3 sec on osd.17
9 ops are blocked > 2097.15 sec on osd.10
4 ops are blocked > 1048.58 sec on osd.10
39 ops are blocked > 262.144 sec on osd.10
19 ops are blocked > 131.072 sec on osd.10
6 ops are blocked > 65.536 sec on osd.10
2 ops are blocked > 32.768 sec on osd.10

Here is some backfilling info:

[root@mon01 ceph-conf]# ceph status
    cluster 55ebbc2d-c5b7-4beb-9688-0926cefee155
     health HEALTH_WARN
            5 pgs backfilling
            5 pgs degraded
            5 pgs stuck degraded
            5 pgs stuck unclean
            5 pgs stuck undersized
            5 pgs undersized
            122 requests are blocked > 32 sec
            recovery 2361/1097929 objects degraded (0.215%)
            recovery 5578/1097929 objects misplaced (0.508%)
     monmap e1: 3 mons at {mon01=10.20.57.10:6789/0,mon02=10.20.57.11:6789/0,mon03=10.20.57.12:6789/0}
            election epoch 58, quorum 0,1,2 mon01,mon02,mon03
     osdmap e6511: 24 osds: 21 up, 21 in; 5 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v6474659: 2592 pgs, 5 pools, 333 GB data, 356 kobjects
            1005 GB used, 20283 GB / 21288 GB avail
            2361/1097929 objects degraded (0.215%)
            5578/1097929 objects misplaced (0.508%)
                2587 active+clean
                   5 active+undersized+degraded+remapped+backfilling

[root@mon01 ceph-conf]# ceph pg dump_stuck unclean
ok
pg_stat  state                                            up         up_primary  acting   acting_primary
3.5c2    active+undersized+degraded+remapped+backfilling  [17,2,10]  17          [17,2]   17
3.54a    active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
5.3b     active+undersized+degraded+remapped+backfilling  [3,19,0]   3           [10,17]  10
5.b3     active+undersized+degraded+remapped+backfilling  [10,19,2]  10          [10,17]  10
3.180    active+undersized+degraded+remapped+backfilling  [17,10,6]  17          [22,19]  22

Most of the backfilling was caused by restarting OSDs to clear blocked IO.

Here are some of the blocked IOs:

/var/log/ceph/ceph.log:2017-09-07 13:29:36.978559 osd.10 10.20.57.15:6806/7029 9362 : cluster [WRN] slow request 60.834494 seconds old, received at 2017-09-07 13:28:36.143920: osd_op(client.114947.0:2039090 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978565 osd.10 10.20.57.15:6806/7029 9363 : cluster [WRN] slow request 240.661052 seconds old, received at 2017-09-07 13:25:36.317363: osd_op(client.246934107.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978571 osd.10 10.20.57.15:6806/7029 9364 : cluster [WRN] slow request 240.660763 seconds old, received at 2017-09-07 13:25:36.317651: osd_op(client.246944377.0:2 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:36.978576 osd.10 10.20.57.15:6806/7029 9365 : cluster [WRN] slow request 240.660675 seconds old, received at 2017-09-07 13:25:36.317740: osd_op(client.246944377.0:3 5.f69addd6 (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979367 osd.10 10.20.57.15:6806/7029 9366 : cluster [WRN] 72 slow requests, 3 included below; oldest blocked for > 1820.342287 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979373 osd.10 10.20.57.15:6806/7029 9367 : cluster [WRN] slow request 30.606290 seconds old, received at 2017-
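Brian's request for the recovery_state section can also be inspected offline from a saved `ceph pg query` dump. A minimal sketch, using a trimmed, hypothetical sample of the query JSON (`state`, `blocked_by`, and `recovery_state` are fields of the real output; the values here are made up):

```shell
# Sketch: inspect the peering details of a saved `ceph pg query` dump
# offline. On a live cluster the dump would come from something like:
#   ceph pg 3.5c2 query > pg_query.json
# The JSON below is a trimmed, hypothetical sample.
cat > pg_query.json <<'EOF'
{
  "state": "active+undersized+degraded+remapped+backfilling",
  "up": [17, 2, 10],
  "acting": [17, 2],
  "blocked_by": [],
  "recovery_state": [
    { "name": "Started/Primary/Active",
      "enter_time": "2017-09-07 13:25:36.317363" }
  ]
}
EOF

# Pull out the PG state and any OSDs blocking it.
grep -E '"(state|blocked_by)"' pg_query.json
```

An empty `blocked_by` list here would mean the query offers no hint about which peer the PG is waiting on.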
Re: [ceph-users] Blocked requests
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979377 osd.10 10.20.57.15:6806/7029 9368 : cluster [WRN] slow request 30.554317 seconds old, received at 2017-09-07 13:29:12.424972: osd_op(client.115020.0:1831942 5.39f2d3b (undecoded) ack+read+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:42.979383 osd.10 10.20.57.15:6806/7029 9369 : cluster [WRN] slow request 30.368086 seconds old, received at 2017-09-07 13:29:12.611204: osd_op(client.115014.0:73392774 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg
/var/log/ceph/ceph.log:2017-09-07 13:29:43.979553 osd.10 10.20.57.15:6806/7029 9370 : cluster [WRN] 73 slow requests, 1 included below; oldest blocked for > 1821.342499 secs
/var/log/ceph/ceph.log:2017-09-07 13:29:43.979559 osd.10 10.20.57.15:6806/7029 9371 : cluster [WRN] slow request 30.452344 seconds old, received at 2017-09-07 13:29:13.527157: osd_op(client.115011.0:483954528 5.e637a4b3 (undecoded) ack+read+balance_reads+skiprwlocks+known_if_redirected e6511) currently queued_for_pg

From: David Turner <drakonst...@gmail.com>
Date: Thursday, September 7, 2017 at 1:17 PM
To: Matthew Stroud <mattstr...@overstock.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests
<snip>
Re: [ceph-users] Blocked requests
From: David Turner <drakonst...@gmail.com>
Date: Thursday, September 7, 2017 at 1:17 PM
To: Matthew Stroud <mattstr...@overstock.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests

I would recommend pushing forward with the update instead of rolling back. Ceph doesn't have a track record of rolling back to a previous version.

I don't have enough information to really make sense of the ceph health detail output. For example: are the OSDs listed all on the same host? Watching this output over time, are some of the requests clearing up? Are there any other patterns? I put the following in a script and run it in a watch command to try to follow patterns when I'm plagued with blocked requests.

output=$(ceph --cluster $cluster health detail | grep 'ops are blocked' | sort -nrk6 | sed 's/ ops/+ops/' | sed 's/ sec/+sec/' | column -t -s'+')
echo "$output" | grep -v 'on osd'
echo "$output" | grep -Eo osd.[0-9]+ | sort -n | uniq -c | grep -v ' 1 '
echo "$output" | grep 'on osd'

Why do you have backfilling? You haven't mentioned that you have any backfilling yet. Installing an update shouldn't cause backfilling, but it's likely related to your blocked requests.

On Thu, Sep 7, 2017 at 2:24 PM Matthew Stroud <mattstr...@overstock.com> wrote:
<snip>
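David's snippet can be dry-run without a live cluster by feeding its counting stage a canned `ceph health detail` excerpt. A sketch, with sample lines that are hypothetical but modeled on the output earlier in the thread:

```shell
# Sketch: the counting stage of the script above, applied to canned
# `ceph health detail` lines instead of a live `ceph health detail` call.
output=$(cat <<'EOF'
52 ops are blocked > 4194.3 sec on osd.17
9 ops are blocked > 2097.15 sec on osd.10
39 ops are blocked > 262.144 sec on osd.10
EOF
)

# Sort by blocked duration (field 6), longest first.
sorted=$(echo "$output" | grep 'ops are blocked' | sort -nrk6)

# OSDs appearing in more than one duration bucket are the likely suspects.
echo "$sorted" | grep -Eo 'osd\.[0-9]+' | sort | uniq -c | grep -v ' 1 '
```

With the sample above, only osd.10 survives the `grep -v ' 1 '` filter, which is exactly the "find the repeat offender" pattern David describes.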
Re: [ceph-users] Blocked requests
Well, in the meantime things have gone from bad to worse: now the cluster isn't rebuilding and clients are unable to pass IO to the cluster. When this first took place, we started rolling back to 10.2.7; though that was successful, it didn't help with the issue. Here is the command output:

HEALTH_WARN 39 pgs backfill_wait; 5 pgs backfilling; 43 pgs degraded; 43 pgs stuck degraded; 44 pgs stuck unclean; 43 pgs stuck undersized; 43 pgs undersized; 367 requests are blocked > 32 sec; 14 osds have slow requests; recovery 4678/1097738 objects degraded (0.426%); recovery 10364/1097738 objects misplaced (0.944%)
pg 3.624 is stuck unclean for 1402.022837, current state active+undersized+degraded+remapped+wait_backfill, last acting [12,9]
pg 3.587 is stuck unclean for 2536.693566, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,13]
pg 3.45f is stuck unclean for 1421.178244, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,10]
pg 3.41a is stuck unclean for 1505.091187, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,23]
pg 3.4cc is stuck unclean for 1560.824332, current state active+undersized+degraded+remapped+wait_backfill, last acting [18,10]
<snip>
pg 3.188 is stuck degraded for 1207.118130, current state active+undersized+degraded+remapped+wait_backfill, last acting [14,17]
pg 3.768 is stuck degraded for 1123.722910, current state active+undersized+degraded+remapped+wait_backfill, last acting [11,18]
pg 3.77c is stuck degraded for 1211.981606, current state active+undersized+degraded+remapped+wait_backfill, last acting [9,2]
pg 3.7d1 is stuck degraded for 1074.422756, current state active+undersized+degraded+remapped+wait_backfill, last acting [10,12]
pg 3.7d1 is active+undersized+degraded+remapped+wait_backfill, acting [10,12]
pg 3.77c is active+undersized+degraded+remapped+wait_backfill, acting [9,2]
pg 3.768 is active+undersized+degraded+remapped+wait_backfill, acting [11,18]
pg 3.709 is active+undersized+degraded+remapped+wait_backfill, acting [10,4]
pg 3.5d8 is active+undersized+degraded+remapped+wait_backfill, acting [2,10]
pg 3.5dc is active+undersized+degraded+remapped+wait_backfill, acting [8,19]
pg 3.5f8 is active+undersized+degraded+remapped+wait_backfill, acting [2,21]
pg 3.624 is active+undersized+degraded+remapped+wait_backfill, acting [12,9]
2 ops are blocked > 1048.58 sec on osd.9
3 ops are blocked > 65.536 sec on osd.9
7 ops are blocked > 1048.58 sec on osd.8
1 ops are blocked > 524.288 sec on osd.8
1 ops are blocked > 131.072 sec on osd.8
1 ops are blocked > 524.288 sec on osd.2
1 ops are blocked > 262.144 sec on osd.2
2 ops are blocked > 65.536 sec on osd.21
9 ops are blocked > 1048.58 sec on osd.5
9 ops are blocked > 524.288 sec on osd.5
71 ops are blocked > 131.072 sec on osd.5
19 ops are blocked > 65.536 sec on osd.5
35 ops are blocked > 32.768 sec on osd.5
14 osds have slow requests
recovery 4678/1097738 objects degraded (0.426%)
recovery 10364/1097738 objects misplaced (0.944%)

From: David Turner <drakonst...@gmail.com>
Date: Thursday, September 7, 2017 at 11:33 AM
To: Matthew Stroud <mattstr...@overstock.com>, "ceph-users@lists.ceph.com" <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Blocked requests
<snip>
Re: [ceph-users] Blocked requests
To be fair, other times I have to go in and tweak configuration settings and timings to resolve chronic blocked requests.

On Thu, Sep 7, 2017 at 1:32 PM David Turner wrote:
<snip>
Re: [ceph-users] Blocked requests
`ceph health detail` will give a little more information into the blocked requests. Specifically which OSDs are the requests blocked on and how long have they actually been blocked (as opposed to '> 32 sec'). I usually find a pattern after watching that for a time and narrow things down to an OSD, journal, etc. Sometimes I just need to restart a specific OSD and all is well. On Thu, Sep 7, 2017 at 10:33 AM Matthew Stroud wrote: > After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests > for ‘currently waiting for missing object’. I have tried bouncing the osds > and rebooting the osd nodes, but that just moves the problems around. > Previous to this upgrade we had no issues. Any ideas of what to look at? > > > > Thanks, > > Matthew Stroud > > -- > > CONFIDENTIALITY NOTICE: This message is intended only for the use and > review of the individual or entity to which it is addressed and may contain > information that is privileged and confidential. If the reader of this > message is not the intended recipient, or the employee or agent responsible > for delivering the message solely to the intended recipient, you are hereby > notified that any dissemination, distribution or copying of this > communication is strictly prohibited. If you have received this > communication in error, please notify sender immediately by telephone or > return email. Thank you. > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
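David's narrowing-down step can be sketched as a small shell helper. This is an illustrative script of our own, not a Ceph tool; on a live cluster you would pipe real `ceph health detail` output into it instead of the sample quoted later in this thread.

```shell
# Summarize blocked ops per OSD from `ceph health detail` output.
# On a live cluster: ceph health detail | blocked_by_osd
blocked_by_osd() {
    # Lines look like: "N ops are blocked > T sec on osd.X"
    awk '/ops are blocked .* on osd\./ { n[$NF] += $1 } END { for (o in n) print o, n[o] }' | sort
}

# Sample taken from the health output quoted later in this thread:
sample='1 ops are blocked > 134218 sec on osd.31
1 ops are blocked > 134218 sec on osd.3
16 ops are blocked > 134218 sec on osd.29
11 ops are blocked > 67108.9 sec on osd.29
2 ops are blocked > 16777.2 sec on osd.29
1 ops are blocked > 8388.61 sec on osd.29'

printf '%s\n' "$sample" | blocked_by_osd
```

With this sample, osd.29 accumulates 30 blocked ops and immediately stands out as the one to investigate or restart.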
[ceph-users] Blocked requests
After updating from 10.2.7 to 10.2.9 I have a bunch of blocked requests for ‘currently waiting for missing object’. I have tried bouncing the osds and rebooting the osd nodes, but that just moves the problems around. Previous to this upgrade we had no issues. Any ideas of what to look at? Thanks, Matthew Stroud CONFIDENTIALITY NOTICE: This message is intended only for the use and review of the individual or entity to which it is addressed and may contain information that is privileged and confidential. If the reader of this message is not the intended recipient, or the employee or agent responsible for delivering the message solely to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify sender immediately by telephone or return email. Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests problem
Finally, problem solved. First, I set the noscrub, nodeep-scrub, norebalance, nobackfill, norecover, noup and nodown flags. Then I restarted the OSD which had the problem. When the OSD daemon started, blocked requests increased (up to 100) and some misplaced PGs appeared. Then I unset the flags in this order: noup, nodown, norecover, nobackfill, norebalance. In a little while, all misplaced PGs were repaired. Then I unset the noscrub and nodeep-scrub flags. And finally: HEALTH_OK. Thanks for your help, Ramazan > On 22 Aug 2017, at 20:46, Ranjan Ghosh wrote: > > Hm. That's quite weird. On our cluster, when I set "noscrub", "nodeep-scrub", > scrubbing will always stop pretty quickly (a few minutes). I wonder why this > doesn't happen on your cluster. When exactly did you set the flag? Perhaps it > just needs some more time... Or there might be a disk problem why the > scrubbing never finishes. Perhaps it's really a good idea, just like you > proposed, to shut down the corresponding OSDs. But that's just my thoughts. > Perhaps some Ceph pro can shed some light on the possible reasons why a > scrubbing might get stuck and how to resolve this. > > > Am 22.08.2017 um 18:58 schrieb Ramazan Terzi: >> Hi Ranjan, >> >> Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the active >> scrubbing operation isn't working properly. The scrubbing operation is always on the >> same PG (20.1e). 
>> >> $ ceph pg dump | grep scrub >> dumped all in format plain >> pg_stat objects mip degr misp unf bytes log disklog >> state state_stamp v reported up up_primary >> acting acting_primary last_scrub scrub_stamp last_deep_scrub >> deep_scrub_stamp >> 20.1e 25189 0 0 0 0 98359116362 3048 3048 >> active+clean+scrubbing 2017-08-21 04:55:13.354379 >> 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 >> 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 >> 2017-08-20 04:46:59.208792 >> >> >> $ ceph -s >> cluster >> health HEALTH_WARN >> 33 requests are blocked > 32 sec >> noscrub,nodeep-scrub flag(s) set >> monmap e9: 3 mons at >> {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0} >> election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 >> osdmap e6930: 36 osds: 36 up, 36 in >> flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds >> pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects >> 70497 GB used, 127 TB / 196 TB avail >> 1407 active+clean >> 1 active+clean+scrubbing >> >> >> Thanks, >> Ramazan >> >> >>> On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: >>> >>> Hi Ramazan, >>> >>> I'm no Ceph expert, but what I can say from my experience using Ceph is: >>> >>> 1) During "Scrubbing", Ceph can be extremely slow. This is probably where >>> your "blocked requests" are coming from. BTW: Perhaps you can even find out >>> which processes are currently blocking with: ps aux | grep "D". You might >>> even want to kill some of those and/or shutdown services in order to >>> relieve some stress from the machine until it recovers. >>> >>> 2) I usually have the following in my ceph.conf. This lets the scrubbing >>> only run between midnight and 6 AM (hopefully the time of least demand; >>> adjust as necessary) - and with the lowest priority. >>> >>> #Reduce impact of scrub. >>> osd_disk_thread_ioprio_priority = 7 >>> osd_disk_thread_ioprio_class = "idle" >>> osd_scrub_end_hour = 6 >>> >>> 3) The Scrubbing begin and end hour will always work. 
The low priority >>> mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your >>> current scheduler like this (replace sda with your device): >>> >>> cat /sys/block/sda/queue/scheduler >>> >>> You can also echo to this file to set a different scheduler. >>> >>> >>> With these settings you can perhaps alleviate the problem so that the >>> scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't >>> have to finish in one night. It will continue the next night and so on. >>> >>> The Ceph experts say scrubbing is important. Don't know why, but I just >>> believe them. They've built this complex stuff after all :-) >>> >>> Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back >>> to work, but you should not let it run like this forever and a day. >>> >>> Hope this helps at least a bit. >>> >>> BR, >>> >>> Ranjan >>> >>> >>> Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default
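Ramazan's recovery sequence above can be written out as a small shell runbook. The `run` wrapper and `DRYRUN` guard are our own scaffolding (printed here instead of executed, since a real run needs a cluster and an admin keyring); the flag order is the one he reported.

```shell
# Sketch of the recovery sequence above. DRYRUN=1 only prints the
# commands; unset it to run the real `ceph` CLI on an actual cluster.
DRYRUN=1
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "ceph $*"; else ceph "$@"; fi; }

set_flags="noscrub nodeep-scrub norebalance nobackfill norecover noup nodown"
unset_flags="noup nodown norecover nobackfill norebalance"   # order as reported

for f in $set_flags; do run osd set "$f"; done
# ... restart the problem OSD daemon here, e.g. via systemctl ...
for f in $unset_flags; do run osd unset "$f"; done
# Once the misplaced PGs have recovered:
run osd unset noscrub
run osd unset nodeep-scrub
```

The point of the ordering is that recovery-related flags come off before the scrub flags, so backfill can repair the misplaced PGs before scrubbing is allowed to resume.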
Re: [ceph-users] Blocked requests problem
Hi, Sometimes we have the same issue on our 10.2.9 cluster (24 nodes with 60 OSDs each). I think there is some race condition or something like that which results in this state. The blocked requests start exactly at the time the PG begins to scrub. You can try the following; the OSD will automatically recover and the blocked requests will disappear: ceph osd down 31 In my opinion this is a bug, but I have not investigated so far. Maybe some developer can say something about this issue. Regards, Manuel Am Tue, 22 Aug 2017 16:20:14 +0300 schrieb Ramazan Terzi: > Hello, > > I have a Ceph Cluster with specifications below: > 3 x Monitor node > 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks > have SSD journals) Distributed public and private networks. All NICs > are 10Gbit/s osd pool default size = 3 > osd pool default min size = 2 > > Ceph version is Jewel 10.2.6. > > My cluster is active and a lot of virtual machines are running on it > (Linux and Windows VM's, database clusters, web servers etc). > > During normal use, the cluster slowly went into a state of blocked > requests. Blocked requests periodically incrementing. All OSDs seem > healthy. Benchmark, iowait, network tests, all of them succeed. 
> > Yesterday, 08:00: > $ ceph health detail > HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests > 1 ops are blocked > 134218 sec on osd.31 > 1 ops are blocked > 134218 sec on osd.3 > 1 ops are blocked > 8388.61 sec on osd.29 > 3 osds have slow requests > > Today, 16:05: > $ ceph health detail > HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow > requests 1 ops are blocked > 134218 sec on osd.31 > 1 ops are blocked > 134218 sec on osd.3 > 16 ops are blocked > 134218 sec on osd.29 > 11 ops are blocked > 67108.9 sec on osd.29 > 2 ops are blocked > 16777.2 sec on osd.29 > 1 ops are blocked > 8388.61 sec on osd.29 > 3 osds have slow requests > > $ ceph pg dump | grep scrub > dumped all in format plain > pg_stat objects mip degr misp > unf bytes log disklog state > state_stamp v reported up > up_primary acting acting_primary > last_scrub scrub_stamp last_deep_scrub > deep_scrub_stamp > 20.1e 25183 0 0 0 > 0 98332537930 3066 3066 > active+clean+scrubbing 2017-08-21 04:55:13.354379 > 6930'23908781 6930:20905696 [29,31,3] 29 > [29,31,3] 29 6712'22950171 2017-08-20 > 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 > > Active scrub does not finish (about 24 hours). I did not restart any > OSD meanwhile. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, > nobackfill, and norecover flags and restarting OSDs 3, 29 and 31. Will this > solve my problem? Or does anyone have a suggestion about this problem? > > Thanks, > Ramazan > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Manuel Lausch Systemadministrator Cloud Services 1&1 Mail & Media Development & Technology GmbH | Brauerstraße 48 | 76135 Karlsruhe | Germany Phone: +49 721 91374-1847 E-Mail: manuel.lau...@1und1.de | Web: www.1und1.de Amtsgericht Montabaur, HRB 5452 Geschäftsführer: Thomas Ludwig, Jan Oetjen Member of United Internet This e-mail may contain confidential and/or privileged information. If you are not the intended recipient of this e-mail, you are hereby notified that saving, distribution or use of the content of this e-mail in any way is prohibited. If you have received this e-mail in error, please notify the sender and delete the e-mail. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests problem
Hm. That's quite weird. On our cluster, when I set "noscrub", "nodeep-scrub", scrubbing will always stop pretty quickly (a few minutes). I wonder why this doesn't happen on your cluster. When exactly did you set the flag? Perhaps it just needs some more time... Or there might be a disk problem why the scrubbing never finishes. Perhaps it's really a good idea, just like you proposed, to shut down the corresponding OSDs. But that's just my thoughts. Perhaps some Ceph pro can shed some light on the possible reasons why a scrubbing might get stuck and how to resolve this. Am 22.08.2017 um 18:58 schrieb Ramazan Terzi: Hi Ranjan, Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the active scrubbing operation isn't working properly. The scrubbing operation is always on the same PG (20.1e). $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25189 0 0 0 0 98359116362 3048 3048 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 $ ceph -s cluster health HEALTH_WARN 33 requests are blocked > 32 sec noscrub,nodeep-scrub flag(s) set monmap e9: 3 mons at {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0} election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e6930: 36 osds: 36 up, 36 in flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects 70497 GB used, 127 TB / 196 TB avail 1407 active+clean 1 active+clean+scrubbing Thanks, Ramazan On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: Hi Ramazan, I'm no Ceph expert, but what I can say from my experience using Ceph is: 1) During "Scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from. 
BTW: Perhaps you can even find out which processes are currently blocking with: ps aux | grep "D". You might even want to kill some of those and/or shut down services in order to relieve some stress from the machine until it recovers. 2) I usually have the following in my ceph.conf. This lets the scrubbing only run between midnight and 6 AM (hopefully the time of least demand; adjust as necessary) - and with the lowest priority. #Reduce impact of scrub. osd_disk_thread_ioprio_priority = 7 osd_disk_thread_ioprio_class = "idle" osd_scrub_end_hour = 6 3) The Scrubbing begin and end hour will always work. The low priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device): cat /sys/block/sda/queue/scheduler You can also echo to this file to set a different scheduler. With these settings you can perhaps alleviate the problem so that the scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't have to finish in one night. It will continue the next night and so on. The Ceph experts say scrubbing is important. Don't know why, but I just believe them. They've built this complex stuff after all :-) Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back to work, but you should not let it run like this forever and a day. Hope this helps at least a bit. BR, Ranjan Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default min size = 2 Ceph version is Jewel 10.2.6. My cluster is active and a lot of virtual machines are running on it (Linux and Windows VM's, database clusters, web servers etc). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests periodically incrementing. 
All OSDs seem healthy. Benchmark, iowait, network tests, all of them succeed. Yesterday, 08:00: $ ceph health detail HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests Today, 16:05: $ ceph health detail HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 16 ops are blocked > 134218 sec on osd.29 11 ops are blocked > 67108.9 sec on osd.29 2 ops are blocked > 16777.2
Re: [ceph-users] Blocked requests problem
Hi Ranjan, Thanks for your reply. I did set the noscrub and nodeep-scrub flags. But the active scrubbing operation isn't working properly. The scrubbing operation is always on the same PG (20.1e). $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25189 0 0 0 0 98359116362 3048 3048 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'2393 6930:20949058 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 $ ceph -s cluster health HEALTH_WARN 33 requests are blocked > 32 sec noscrub,nodeep-scrub flag(s) set monmap e9: 3 mons at {ceph-mon01=**:6789/0,ceph-mon02=**:6789/0,ceph-mon03=**:6789/0} election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03 osdmap e6930: 36 osds: 36 up, 36 in flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds pgmap v17667617: 1408 pgs, 5 pools, 24779 GB data, 6494 kobjects 70497 GB used, 127 TB / 196 TB avail 1407 active+clean 1 active+clean+scrubbing Thanks, Ramazan > On 22 Aug 2017, at 18:52, Ranjan Ghosh wrote: > > Hi Ramazan, > > I'm no Ceph expert, but what I can say from my experience using Ceph is: > > 1) During "Scrubbing", Ceph can be extremely slow. This is probably where > your "blocked requests" are coming from. BTW: Perhaps you can even find out > which processes are currently blocking with: ps aux | grep "D". You might > even want to kill some of those and/or shut down services in order to relieve > some stress from the machine until it recovers. > > 2) I usually have the following in my ceph.conf. This lets the scrubbing only > run between midnight and 6 AM (hopefully the time of least demand; adjust as > necessary) - and with the lowest priority. > > #Reduce impact of scrub. 
> osd_disk_thread_ioprio_priority = 7 > osd_disk_thread_ioprio_class = "idle" > osd_scrub_end_hour = 6 > > 3) The Scrubbing begin and end hour will always work. The low priority mode, > however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current > scheduler like this (replace sda with your device): > > cat /sys/block/sda/queue/scheduler > > You can also echo to this file to set a different scheduler. > > > With these settings you can perhaps alleviate the problem so that the > scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't > have to finish in one night. It will continue the next night and so on. > > The Ceph experts say scrubbing is important. Don't know why, but I just > believe them. They've built this complex stuff after all :-) > > Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back > to work, but you should not let it run like this forever and a day. > > Hope this helps at least a bit. > > BR, > > Ranjan > > > Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: >> Hello, >> >> I have a Ceph Cluster with specifications below: >> 3 x Monitor node >> 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have >> SSD journals) >> Distributed public and private networks. All NICs are 10Gbit/s >> osd pool default size = 3 >> osd pool default min size = 2 >> >> Ceph version is Jewel 10.2.6. >> >> My cluster is active and a lot of virtual machines are running on it (Linux and >> Windows VM's, database clusters, web servers etc). >> >> During normal use, the cluster slowly went into a state of blocked requests. >> Blocked requests periodically incrementing. All OSDs seem healthy. >> Benchmark, iowait, network tests, all of them succeed. 
>> >> Yesterday, 08:00: >> $ ceph health detail >> HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests >> 1 ops are blocked > 134218 sec on osd.31 >> 1 ops are blocked > 134218 sec on osd.3 >> 1 ops are blocked > 8388.61 sec on osd.29 >> 3 osds have slow requests >> >> Today, 16:05: >> $ ceph health detail >> HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests >> 1 ops are blocked > 134218 sec on osd.31 >> 1 ops are blocked > 134218 sec on osd.3 >> 16 ops are blocked > 134218 sec on osd.29 >> 11 ops are blocked > 67108.9 sec on osd.29 >> 2 ops are blocked > 16777.2 sec on osd.29 >> 1 ops are blocked > 8388.61 sec on osd.29 >> 3 osds have slow requests >> >> $ ceph pg dump | grep scrub >> dumped all in format plain >> pg_stat objects mip degr misp unf bytes log disklog >> state state_stamp v reported up up_primary >> acting acting_primary last_scrub scrub_stamp last_deep_scrub >> deep_scrub_stamp >> 20.1e 25183 0 0
Re: [ceph-users] Blocked requests problem
Hi Ramazan, I'm no Ceph expert, but what I can say from my experience using Ceph is: 1) During "Scrubbing", Ceph can be extremely slow. This is probably where your "blocked requests" are coming from. BTW: Perhaps you can even find out which processes are currently blocking with: ps aux | grep "D". You might even want to kill some of those and/or shut down services in order to relieve some stress from the machine until it recovers. 2) I usually have the following in my ceph.conf. This lets the scrubbing only run between midnight and 6 AM (hopefully the time of least demand; adjust as necessary) - and with the lowest priority. #Reduce impact of scrub. osd_disk_thread_ioprio_priority = 7 osd_disk_thread_ioprio_class = "idle" osd_scrub_end_hour = 6 3) The Scrubbing begin and end hour will always work. The low priority mode, however, works (AFAIK!) only with the CFQ I/O scheduler. Show your current scheduler like this (replace sda with your device): cat /sys/block/sda/queue/scheduler You can also echo to this file to set a different scheduler. With these settings you can perhaps alleviate the problem so that the scrubbing runs over many nights until it finishes. Again, AFAIK, it doesn't have to finish in one night. It will continue the next night and so on. The Ceph experts say scrubbing is important. Don't know why, but I just believe them. They've built this complex stuff after all :-) Thus, you can use "noscrub"/"nodeepscrub" to quickly get a hung server back to work, but you should not let it run like this forever and a day. Hope this helps at least a bit. BR, Ranjan Am 22.08.2017 um 15:20 schrieb Ramazan Terzi: Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default min size = 2 Ceph version is Jewel 10.2.6. 
My cluster is active and a lot of virtual machines are running on it (Linux and Windows VM's, database clusters, web servers etc). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests periodically incrementing. All OSDs seem healthy. Benchmark, iowait, network tests, all of them succeed. Yesterday, 08:00: $ ceph health detail HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests Today, 16:05: $ ceph health detail HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 16 ops are blocked > 134218 sec on osd.29 11 ops are blocked > 67108.9 sec on osd.29 2 ops are blocked > 16777.2 sec on osd.29 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25183 0 0 0 0 98332537930 3066 3066 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'23908781 6930:20905696 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 Active scrub does not finish (about 24 hours). I did not restart any OSD meanwhile. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill, and norecover flags and restarting OSDs 3, 29 and 31. Will this solve my problem? Or does anyone have a suggestion about this problem? Thanks, Ramazan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
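Ranjan's CFQ caveat can be checked mechanically. The `active_scheduler` helper below is our own (not a Ceph or kernel tool); the sysfs format it parses, with the active scheduler shown in brackets, is standard Linux block-layer behaviour, simulated here with a temporary file so the sketch runs anywhere.

```shell
# Extract the active I/O scheduler from a /sys/block/<dev>/queue/scheduler
# style file, e.g. "noop deadline [cfq]" -> "cfq".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Simulated sysfs file for illustration; on a real node you would pass
# /sys/block/sda/queue/scheduler (replace sda with your device).
tmp=$(mktemp)
echo 'noop deadline [cfq]' > "$tmp"
if [ "$(active_scheduler "$tmp")" = "cfq" ]; then
    echo "cfq active: osd_disk_thread_ioprio_* settings will apply"
fi
rm -f "$tmp"
```

If the bracketed entry is anything other than cfq, the ioprio settings quoted above are silently ignored, which is worth ruling out before blaming the scrub window.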
[ceph-users] Blocked requests problem
Hello, I have a Ceph Cluster with specifications below: 3 x Monitor node 6 x Storage Node (6 disks per Storage Node, 6TB SATA Disks, all disks have SSD journals) Distributed public and private networks. All NICs are 10Gbit/s osd pool default size = 3 osd pool default min size = 2 Ceph version is Jewel 10.2.6. My cluster is active and a lot of virtual machines are running on it (Linux and Windows VM's, database clusters, web servers etc). During normal use, the cluster slowly went into a state of blocked requests. Blocked requests periodically incrementing. All OSDs seem healthy. Benchmark, iowait, network tests, all of them succeed. Yesterday, 08:00: $ ceph health detail HEALTH_WARN 3 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests Today, 16:05: $ ceph health detail HEALTH_WARN 32 requests are blocked > 32 sec; 3 osds have slow requests 1 ops are blocked > 134218 sec on osd.31 1 ops are blocked > 134218 sec on osd.3 16 ops are blocked > 134218 sec on osd.29 11 ops are blocked > 67108.9 sec on osd.29 2 ops are blocked > 16777.2 sec on osd.29 1 ops are blocked > 8388.61 sec on osd.29 3 osds have slow requests $ ceph pg dump | grep scrub dumped all in format plain pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp 20.1e 25183 0 0 0 0 98332537930 3066 3066 active+clean+scrubbing 2017-08-21 04:55:13.354379 6930'23908781 6930:20905696 [29,31,3] 29 [29,31,3] 29 6712'22950171 2017-08-20 04:46:59.208792 6712'22950171 2017-08-20 04:46:59.208792 Active scrub does not finish (about 24 hours). I did not restart any OSD meanwhile. I'm thinking of setting the noscrub, nodeep-scrub, norebalance, nobackfill, and norecover flags and restarting OSDs 3, 29 and 31. Will this solve my problem? Or does anyone have a suggestion about this problem? 
Thanks, Ramazan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Am 10.12.2015 um 06:38 schrieb Robert LeBlanc: > Since I'm very interested in > reducing this problem, I'm willing to try and submit a fix after I'm > done with the new OP queue I'm working on. I don't know the best > course of action at the moment, but I hope I can get some input for > when I do try and tackle the problem next year. Is there already a ticket present for this issue in the bug tracker? I think this is an important issue. Regards Christian -- Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Am 10.12.2015 um 06:38 schrieb Robert LeBlanc: > I noticed this a while back and did some tracing. As soon as the PGs > are read in by the OSD (very limited amount of housekeeping done), the > OSD is set to the "in" state so that peering with other OSDs can > happen and the recovery process can begin. The problem is that when > the OSD is "in", the clients also see that and start sending requests > to the OSDs before it has had a chance to actually get its bearings > and is able to even service the requests. After discussion with some > of the developers, there is no easy way around this other than let the > PGs recover to other OSDs and then bring in the OSDs after recovery (a > ton of data movement). Many thanks for your detailed analysis. It's a bit disappointing that there seems to be no easy way around. Any work to improve the situation is much appreciated. In the meantime, I'll be experimenting with pre-seeding the VFS cache to speed things up at least a little bit. Regards Christian -- Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Are you seeing "peering" PGs when the blocked requests are happening? That's what we see regularly when starting OSDs. I'm not sure this can be solved completely (and whether there are major improvements in newer Ceph versions), but it can be sped up by 1) making sure you have free (and not dirtied or fragmented) memory on the node where you are starting the OSD - that means dropping caches before starting the OSD if you have lots of "free" RAM that is used for VFS cache 2) starting the OSDs one by one instead of booting several of them 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found it to be best to pin the OSD to a cgroup limited to one NUMA node and then limit it to a subset of cores after it has run a bit. OSD tends to use hundreds of % of CPU when booting 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd... It's unclear to me whether MONs influence this somehow (the peering stage) but I have observed their CPU usage and IO also spikes when OSDs are started, so make sure they are not under load. Jan > On 09 Dec 2015, at 11:03, Christian Kauhaus wrote: > > Hi, > > I'm getting blocked requests (>30s) every time when an OSD is set to "in" in > our clusters. Once this has happened, backfills run smoothly. > > I have currently no idea where to start debugging. Has anyone a hint what to > examine first in order to narrow this issue? > > TIA > > Christian > > -- > Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 > Flying Circus Internet Operations GmbH · http://flyingcircus.io > Forsterstraße 29 · 06112 Halle (Saale) · Deutschland > HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
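Jan's points 1) and 2) can be sketched as a staged start script. Everything here is illustrative scaffolding of ours: `PG_STATE_CMD` is an injection point so the wait loop can be exercised without a cluster (on a real node it would simply be `ceph pg stat`), and the systemctl command is only echoed rather than run.

```shell
# Staged OSD start: drop the VFS cache first, then start OSDs one at a
# time, waiting until no PGs report peering/activating before the next.
wait_for_peering() {
    while ${PG_STATE_CMD:-ceph pg stat} | grep -Eq 'peering|activating'; do
        sleep 5
    done
}

start_osds_one_by_one() {
    # As root, before the first OSD (Jan's point 1):
    #   echo 3 > /proc/sys/vm/drop_caches
    for id in "$@"; do
        echo "would run: systemctl start ceph-osd@$id"
        wait_for_peering
    done
}

# Demonstration against a faked, already-clean cluster state:
fake_pg_stat() { echo '1800 pgs: 1800 active+clean'; }
PG_STATE_CMD=fake_pg_stat
start_osds_one_by_one 0 1 2
```

Gating each start on the peering/activating count going to zero is what keeps the window of blocked client requests short, rather than stacking several peering storms on top of each other.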
[ceph-users] Blocked requests after "osd in"
Hi, I'm getting blocked requests (>30s) every time when an OSD is set to "in" in our clusters. Once this has happened, backfills run smoothly. I have currently no idea where to start debugging. Has anyone a hint what to examine first in order to narrow this issue? TIA Christian -- Dipl-Inf. Christian Kauhaus <>< · k...@flyingcircus.io · +49 345 219401-0 Flying Circus Internet Operations GmbH · http://flyingcircus.io Forsterstraße 29 · 06112 Halle (Saale) · Deutschland HR Stendal 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests after "osd in"
Am 09.12.2015 um 11:21 schrieb Jan Schermer: > Are you seeing "peering" PGs when the blocked requests are happening? That's > what we see regularly when starting OSDs. Mostly "peering" and "activating". > I'm not sure this can be solved completely (and whether there are major > improvements in newer Ceph versions), but it can be sped up by > 1) making sure you have free (and not dirtied or fragmented) memory on the > node where you are starting the OSD > - that means dropping caches before starting the OSD if you have lots > of "free" RAM that is used for VFS cache > 2) starting the OSDs one by one instead of booting several of them > 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found > it to be best to pin the OSD to a cgroup limited to one NUMA node and then > limit it to a subset of cores after it has run a bit. OSD tends to use > hundreds of % of CPU when booting > 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd... Thank you for your advice. The use case is not so much after rebooting a server, but more when we take OSDs in/out for maintenance. During boot, we already start them one after another with 10s pause between each pair. I've done a bit of tracing. I've kept a small cluster running with 2 "in" OSDs out of 3 and put the third one "in" at 15:06:22. From ceph.log: | 2015-12-09 15:06:22.827030 mon.0 172.20.4.6:6789/0 54964 : cluster [INF] osdmap e264345: 3 osds: 3 up, 3 in | 2015-12-09 15:06:22.828693 mon.0 172.20.4.6:6789/0 54965 : cluster [INF] pgmap v39871295: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 4515 GB / 5421 GB avail; 6406 B/s rd, 889 kB/s wr, 67 op/s | [...] 
| 2015-12-09 15:06:29.163793 mon.0 172.20.4.6:6789/0 54972 : cluster [INF] pgmap v39871299: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 7700 GB / 8607 GB avail After a few seconds, backfills start as expected: | 2015-12-09 15:06:24.853507 osd.3 172.20.4.40:6800/5072 778 : cluster [INF] 410.c9 restarting backfill on osd.2 from (0'0,0'0] MAX to 264336'502426 | [...] | 2015-12-09 15:06:29.874092 osd.3 172.20.4.40:6800/5072 1308 : cluster [INF] 410.d1 restarting backfill on osd.2 from (0'0,0'0] MAX to 264344'1202983 | 2015-12-09 15:06:32.584907 mon.0 172.20.4.6:6789/0 54973 : cluster [INF] pgmap v39871300: 1800 pgs: 3 active+remapped+wait_backfill, 191 active+remapped, 1169 active+clean, 437 activating+remapped; 439 GB data, 906 GB used, 7700 GB / 8607 GB avail; 1725 kB/s rd, 2486 kB/s wr, 605 op/s; 23058/278796 objects misplaced (8.271%); 56612 kB/s, 14 objects/s recovering | 2015-12-09 15:06:24.851307 osd.0 172.20.4.51:6800/4919 2662 : cluster [INF] 410.c8 restarting backfill on osd.2 from (0'0,0'0] MAX to 264344'1017219 | 2015-12-09 15:06:38.555243 mon.0 172.20.4.6:6789/0 54976 : cluster [INF] pgmap v39871303: 1800 pgs: 22 active+remapped+wait_backfill, 520 active+remapped, 638 active+clean, 620 activating+remapped; 439 GB data, 906 GB used, 7700 GB / 8607 | GB avail; 45289 B/s wr, 4 op/s; 64014/313904 objects misplaced (20.393%) | 2015-12-09 15:06:38.133376 osd.3 172.20.4.40:6800/5072 1309 : cluster [WRN] 9 slow requests, 9 included below; oldest blocked for > 15.306541 secs | 2015-12-09 15:06:38.133385 osd.3 172.20.4.40:6800/5072 1310 : cluster [WRN] slow request 15.305213 seconds old, received at 2015-12-09 15:06:22.828061: osd_op(client.15205073.0:35726 rbd_header.13998a74b0dc51 [watch reconnect cookie 139897352489152 gen 37] 410.937870ca ondisk+write+known_if_redirected e264345) currently reached_pg It seems that PGs in "activating" state are causing blocked requests. 
After a half minute or so, slow requests disappear and backfill proceeds normally: | 2015-12-09 15:06:54.139948 osd.3 172.20.4.40:6800/5072 1396 : cluster [WRN] 42 slow requests, 9 included below; oldest blocked for > 31.188267 secs | 2015-12-09 15:06:54.139957 osd.3 172.20.4.40:6800/5072 1397 : cluster [WRN] slow request 15.566440 seconds old, received at 2015-12-09 15:06:38.573403: osd_op(client.15165527.0:5878994 rbd_data.129a42ae8944a.0f2b [set-alloc-hint object_size 4194304 write_size 4194304,write 1728512~4096] 410.de3ce70d snapc 3fd2=[3fd2] ack+ondisk+write+known_if_redirected e264348) currently waiting for subops from 0,2 | 2015-12-09 15:06:54.139977 osd.3 172.20.4.40:6800/5072 1401 : cluster [WRN] slow request 15.356852 seconds old, received at 2015-12-09 15:06:38.782990: osd_op(client.15165527.0:5878997 rbd_data.129a42ae8944a.0f2b [set-alloc-hint object_size 4194304 write_size 4194304,write 1880064~4096] 410.de3ce70d snapc 3fd2=[3fd2] ack+ondisk+write+known_if_redirected e264348) currently waiting for subops from 0,2 | [...] | 2015-12-09 15:07:00.072403 mon.0 172.20.4.6:6789/0 54989 : cluster [INF] osdmap e264351: 3 osds: 3 up, 3 in | 2015-12-09 15:07:00.074536 mon.0 172.20.4.6:6789/0 54990 : cluster [INF] pgmap v39871313: 1800 pgs: 277 active+remapped+wait_backfill, 881 active+remapped, 4 active+remapped+backfilling, 638 active+clean;
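The slow-request log lines quoted above can be tallied mechanically to confirm which stage is blocking (here, `reached_pg` while PGs are activating vs. `waiting for subops` once backfill is underway). A minimal sketch, assuming the standard ceph.log "[WRN] slow request ... currently <state>" line format shown in this thread; the function name is made up for illustration:

```python
import re
from collections import Counter

def slow_request_states(log_lines):
    """Tally the trailing 'currently <state>' field of [WRN] slow request
    lines, to see which stage of request processing is blocking."""
    states = Counter()
    for line in log_lines:
        if "slow request" not in line:
            continue
        m = re.search(r"currently (.+?)\s*$", line)
        if m:
            states[m.group(1)] += 1
    return states

# Sample lines abbreviated from the log excerpts in this message
sample = [
    "2015-12-09 15:06:38.133385 osd.3 ... [WRN] slow request 15.305213 seconds old, "
    "received at 2015-12-09 15:06:22.828061: osd_op(...) currently reached_pg",
    "2015-12-09 15:06:54.139957 osd.3 ... [WRN] slow request 15.566440 seconds old, "
    "received at 2015-12-09 15:06:38.573403: osd_op(...) currently waiting for subops from 0,2",
]
print(slow_request_states(sample))
```

If most entries cluster on `reached_pg` right after the osdmap change, that supports the reading that PGs stuck in "activating" are what blocks the client ops.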
Re: [ceph-users] Blocked requests after "osd in"
I noticed this a while back and did some tracing. As soon as the PGs are read in by the OSD (very limited amount of housekeeping done), the OSD is set to the "in" state so that peering with other OSDs can happen and the recovery process can begin. The problem is that when the OSD is "in", the clients also see that and start sending requests to the OSDs before it has had a chance to actually get its bearings and is able to even service the requests. After discussion with some of the developers, there is no easy way around this other than letting the PGs recover to other OSDs and then bringing in the OSDs after recovery (a ton of data movement). I've suggested some options on how to work around this issue, but they all require a large amount of rework. Since I'm very interested in reducing this problem, I'm willing to try and submit a fix after I'm done with the new OP queue I'm working on. I don't know the best course of action at the moment, but I hope I can get some input for when I do try and tackle the problem next year.

1. Add a new state that allows OSDs to peer without client requests coming in (up -> in -> active). I'm not sure if other OSDs are seen as clients; I don't think so. I'm not sure if there would have to be some trickery to make the booting OSDs not be primary until all the PGs are read and ready for I/O (not necessarily recovered yet).

2. When a request comes in for a PG that is not ready, send the client a redirect message to use the primary in a previous map. I have a feeling this could be very messy and not very safe.

3. Proxy the OP on behalf of the client until the PGs are ready. The "other" OSD would have to understand that it is OK to do that write/read OP even though it is not the primary; this can be difficult to do safely.

Right now I'm leaning toward option #1.
When the new OSD boots, keep the previous primary running and the PG is in degraded mode until the new OSD has done all of its housekeeping and can service the IO effectively, then make a change to the CRUSH map to swap the primaries where needed. Any input and ideas from the devs would be helpful. Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Dec 9, 2015 at 7:33 AM, Christian Kauhaus wrote: > Am 09.12.2015 um 11:21 schrieb Jan Schermer: >> Are you seeing "peering" PGs when the blocked requests are happening? That's >> what we see regularly when starting OSDs. > > Mostly "peering" and "activating". 
> >> I'm not sure this can be solved completely (and whether there are major >> improvements in newer Ceph versions), but it can be sped up by >> 1) making sure you have free (and not dirtied or fragmented) memory on the >> node where you are starting the OSD >> - that means dropping caches before starting the OSD if you have lots >> of "free" RAM that is used for VFS cache >> 2) starting the OSDs one by one instead of booting several of them >> 3) if you pin the OSDs to CPUs/cores, do that after the OSD is in - I found >> it to be best to pin the OSD to a cgroup limited to one NUMA node and then >> limit it to a subset of cores after it has run a bit. OSD tends to use >> hundreds of % of CPU when booting >> 4) you could possibly prewarm cache for the OSD in /var/lib/ceph/osd... > > Thank you for your advice. The use case is not so much after rebooting a > server, but more when we take OSDs in/out for maintenance. During boot, we > already start them one after another with 10s pause between each pair. > > I've done a bit of tracing. I've kept a small cluster running with 2 "in" OSDs > out of 3 and put the third one "in" at 15:06:22. From ceph.log: > > | 2015-12-09 15:06:22.827030 mon.0 172.20.4.6:6789/0 54964 : cluster [INF] > osdmap e264345: 3 osds: 3 up, 3 in > | 2015-12-09 15:06:22.828693 mon.0 172.20.4.6:6789/0 54965 : cluster [INF] > pgmap v39871295: 1800 pgs: 1800 active+clean; 439 GB data, 906 GB used, 4515 > GB / 5421 GB avail; 6406 B/s rd, 889 kB/s wr, 67 op/s > | [...] > | 2015-12-09 15:06:29.163793 mon.0 172.20.4.6:6789/0 54972 :
Re: [ceph-users] Blocked requests/ops?
Hello, On Thu, 28 May 2015 12:05:03 +0200 Xavier Serrano wrote: On Thu May 28 11:22:52 2015, Christian Balzer wrote: We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and reading this ML however I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 OSD HDDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy is of course improving. I agree: in our particular environment, our tests also conclude that SSD journaling performs far better than cache-tiering, especially when cache becomes close to its capacity and data movement between cache and backing storage occurs frequently. Precisely. We also want to test if it is possible to use SSD disks as a transparent cache for the HDDs at system (Linux kernel) level, and how reliable/good it is. There are quite a number of threads about this here, some quite recent/current. They range from not worth it (i.e. about the same performance as journal SSDs) to xyz-cache destroyed my data, ate my babies and set the house on fire (i.e. massive reliability problems). Which is a pity, as in theory they look like a nice fit/addition to Ceph. Dedicated SSD pools may be a good fit depending on your use case. However I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD journals/HDD OSDs systems. And you already have 20 OSDs in that box. Good point! We did not consider that, thanks for pointing it out. What CPUs do you have in those storage nodes anyway? Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo. We have only 1 CPU per osd node, so I'm afraid we have another potential bottleneck here. 
Oh dear, about 10GHz (that CPU is supposedly 2.4, but you may see the 2.5 because it already is in turbo mode) for 20 OSDs. Where the recommendation for HDD only OSDs is 1GHz. Fire up atop (large window so you can see all the details and devices) on one of your storage nodes. Then from a client (VM) run this: --- fio --size=8G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4M --iodepth=32 --- This should result in your disks (OSDs) getting busy to the point of 100% utilization, but your CPU to still have some idle (that's idle AND wait combined). If you change the blocksize to 4K (and just ctrl-c fio after 30 or so seconds) you should see a very different picture, with the CPU being much busier and the HDDs seeing less than 100% usage. That will become even more pronounced with faster HDDs and/or journal SSDs. And pure SSD clusters/pools are way above that in terms of CPU hunger. If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools and switch over things when you're comfortable with this and happy with the performance. However with 20 OSDs per node, you're likely to go from a being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? (disk, CPU, RAM, etc.). Tools like munin? Munin might work, I use collectd to gather all those values (and even more importantly all Ceph counters) and graphite to visualize it. For ad-hoc, on the spot analysis I really like atop (in a huge window), which will make it very clear what is going on. 
In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? I'm no expert for large (not even medium) clusters, so you'll have to research the archives and net (the CERN Ceph slide is nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point with your dense storage nodes: http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations Christian All you say is really interesting. Thanks for your valuable advice. We surely still have plenty of things to learn and test before going to production. As long as you have the time to test out things, you'll be fine. ^_^ Christian Thanks again for your
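The two kernel tunables that came up in this exchange could be captured in a sysctl.d fragment like the following. The file name is illustrative; fs.aio-max-nr = 262144 is the value Xavier reported needing, while the pid_max value is just a common "raise it far above the default" choice, not a tested recommendation:

```
# /etc/sysctl.d/90-ceph.conf -- illustrative values, adjust per cluster
fs.aio-max-nr = 262144      # needed to run ~20 AIO-backed OSDs per host
kernel.pid_max = 4194303    # dense OSD nodes spawn a great many threads
```

Apply with `sysctl --system` (or reboot) and verify with `sysctl fs.aio-max-nr kernel.pid_max`.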
Re: [ceph-users] Blocked requests/ops?
On Thu May 28 11:22:52 2015, Christian Balzer wrote: We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and reading this ML however I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 OSD HDDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy is of course improving. I agree: in our particular environment, our tests also conclude that SSD journaling performs far better than cache-tiering, especially when cache becomes close to its capacity and data movement between cache and backing storage occurs frequently. We also want to test if it is possible to use SSD disks as a transparent cache for the HDDs at system (Linux kernel) level, and how reliable/good it is. Dedicated SSD pools may be a good fit depending on your use case. However I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD journals/HDD OSDs systems. And you already have 20 OSDs in that box. Good point! We did not consider that, thanks for pointing it out. What CPUs do you have in those storage nodes anyway? Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz, according to /proc/cpuinfo. We have only 1 CPU per osd node, so I'm afraid we have another potential bottleneck here. If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools and switch over things when you're comfortable with this and happy with the performance. 
However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? (disk, CPU, RAM, etc.). Tools like munin? Munin might work, I use collectd to gather all those values (and even more importantly all Ceph counters) and graphite to visualize it. For ad-hoc, on the spot analysis I really like atop (in a huge window), which will make it very clear what is going on. In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? I'm no expert for large (not even medium) clusters, so you'll have to research the archives and net (the CERN Ceph slide is nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point with your dense storage nodes: http://ceph.com/docs/master/start/hardware-recommendations/#additional-considerations Christian All you say is really interesting. Thanks for your valuable advice. We surely still have plenty of things to learn and test before going to production. Thanks again for your time and help. Best regards, - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC
Re: [ceph-users] Blocked requests/ops?
Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Reducing I/O pressure caused by recovery and backfill undoubtedly helped improve cluster performance during recovery, that was expected. But we did not expect that recovery time stayed the same... The only explanation for this is that, during recovery, there are lots of operations that fail due to a timeout, are retried several times, etc. So if disks are the bottleneck, reducing such values may help as well in normal cluster operation (when propagating the replicas, for instance). And slow/blocked requests/ops do not occur (or at least, occur less frequently). Does this make sense to you? Any other thoughts? Thank you very much again for your time. 
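The throttling described above can be made persistent in ceph.conf; a sketch using the values from this message (these options can also be changed at runtime, e.g. with `ceph tell osd.* injectargs`, though defaults and exact behavior vary between Ceph releases):

```
[osd]
# reduce recovery/backfill concurrency so client IO is not starved;
# values taken from this thread, verify against your release's defaults
osd max backfills = 1
osd recovery max active = 1
```

The trade-off is longer nominal recovery time for much better client latency, although as reported above, wall-clock recovery time may barely change when the disks were the bottleneck anyway.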
- Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC
Re: [ceph-users] Blocked requests/ops?
Hello, On Wed May 27 21:20:49 2015, Christian Balzer wrote: Hello, On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote: Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. You should open a bug with that description and a way to reproduce things, even if only sometimes. Having slow disks instead of an overloaded network causing permanently blocked requests definitely shouldn't happen. I totally agree. I'll try to reproduce and definitely open a bug. I'll let you know. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). There are some sleep values for recovery and scrub as well, these help a LOT with loaded clusters, too. Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Ah yes, I was going to comment on your HDDs earlier. 
As Dan van der Ster at CERN will happily admit, using green, slow HDDs with Ceph (and no SSD journals) is a bad idea. You're likely to see a VAST improvement with even just 1 journal SSD (of sufficient speed and durability) for 10 of your HDDs, a 1:5 ratio would of course be better. We do have SSDs, but we are not using them right now. We have 4 SSD per osd host (24 SSD at the moment). SSD model is Intel DC S3700 (400 GB). We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? (disk, CPU, RAM, etc.). Tools like munin? In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? Thank you very much again for your time. Best regards, - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC BTW, if your monitors are just used for that function, 128GB is total and utter overkill. They will be fine with 16-32GB, your storage nodes will be much better served (pagecache for hot read objects) with more RAM. And with 20 OSDs per node 32GB is pretty close to the minimum I'd recommend anyway. Reducing I/O pressure caused by recovery and backfill undoubtedly helped improve cluster performance during recovery, that was expected. But we did not expect that recovery time stayed the same... 
The only explanation for this is that, during recovery, there are lots of operations that fail due to a timeout, are retried several times, etc. So if disks are the bottleneck, reducing such values may help as well in normal cluster operation (when propagating the replicas, for instance). And slow/blocked requests/ops do not occur (or at least, occur less frequently). Does this make sense to you? Any other thoughts? Very much so, see above for more thoughts. Christian
Re: [ceph-users] Blocked requests/ops?
Hello, On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote: Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. You should open a bug with that description and a way to reproduce things, even if only sometimes. Having slow disks instead of an overloaded network causing permanently blocked requests definitely shouldn't happen. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). There are some sleep values for recovery and scrub as well, these help a LOT with loaded clusters, too. Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Ah yes, I was going to comment on your HDDs earlier. As Dan van der Ster at CERN will happily admit, using green, slow HDDs with Ceph (and no SSD journals) is a bad idea. 
You're likely to see a VAST improvement with even just 1 journal SSD (of sufficient speed and durability) for 10 of your HDDs, a 1:5 ratio would of course be better. However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. BTW, if your monitors are just used for that function, 128GB is total and utter overkill. They will be fine with 16-32GB, your storage nodes will be much better served (pagecache for hot read objects) with more RAM. And with 20 OSDs per node 32GB is pretty close to the minimum I'd recommend anyway. Reducing I/O pressure caused by recovery and backfill undoubtedly helped improve cluster performance during recovery, that was expected. But we did not expect that recovery time stayed the same... The only explanation for this is that, during recovery, there are lots of operations that fail due to a timeout, are retried several times, etc. So if disks are the bottleneck, reducing such values may help as well in normal cluster operation (when propagating the replicas, for instance). And slow/blocked requests/ops do not occur (or at least, occur less frequently). Does this make sense to you? Any other thoughts? Very much so, see above for more thoughts. Christian Thank you very much again for your time. - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/
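The "sleep values for recovery and scrub" mentioned in this thread are presumably the osd recovery/scrub sleep options; a sketch of what such a ceph.conf fragment might look like. Treat the option names and the 0.1s values as placeholders to verify against your release's documentation, since availability and defaults changed across Ceph versions:

```
[osd]
# insert a small pause between recovery/scrub work items so client
# IO can interleave -- illustrative values only, not recommendations
osd recovery sleep = 0.1
osd scrub sleep = 0.1
```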
Re: [ceph-users] Blocked requests/ops?
On Wed, 27 May 2015 15:38:26 +0200 Xavier Serrano wrote: Hello, On Wed May 27 21:20:49 2015, Christian Balzer wrote: Hello, On Wed, 27 May 2015 12:54:04 +0200 Xavier Serrano wrote: Hello, Slow requests, blocked requests and blocked ops occur quite often in our cluster; too often, I'd say: several times during one day. I must say we are running some tests, but we are far from pushing the cluster to the limit (or at least, that's what I believe). Every time a blocked request/operation happened, restarting the affected OSD solved the problem. You should open a bug with that description and a way to reproduce things, even if only sometimes. Having slow disks instead of an overloaded network causing permanently blocked requests definitely shouldn't happen. I totally agree. I'll try to reproduce and definitely open a bug. I'll let you know. Yesterday, we wanted to see if it was possible to minimize the impact that backfills and recovery have over normal cluster performance. In our case, performance dropped from 1000 cluster IOPS (approx) to 10 IOPS (approx) when doing some kind of recovery. Thus, we reduced the parameters osd max backfills and osd recovery max active to 1 (defaults are 10 and 15, respectively). Cluster performance during recovery improved to 500-600 IOPS (approx), and overall recovery time stayed approximately the same (surprisingly). There are some sleep values for recovery and scrub as well, these help a LOT with loaded clusters, too. Since then, we have had no more slow/blocked requests/ops (and our tests are still running). It is too soon to say this, but my guess is that osds/disks in our cluster cannot cope with all I/O: network bandwidth is not an issue (10 GbE interconnection, graphs show network usage is under control all the time), but spindles are not high-performance (WD Green). Eventually, this might lead to slow/blocked requests/ops (which shouldn't occur that often). Ah yes, I was going to comment on your HDDs earlier. 
As Dan van der Ster at CERN will happily admit, using green, slow HDDs with Ceph (and no SSD journals) is a bad idea. You're likely to see a VAST improvement with even just 1 journal SSD (of sufficient speed and durability) for 10 of your HDDs, a 1:5 ratio would of course be better. We do have SSDs, but we are not using them right now. We have 4 SSD per osd host (24 SSD at the moment). SSD model is Intel DC S3700 (400 GB). That's a nice one. ^^ We are testing different scenarios before making our final decision (cache-tiering, journaling, separate pool,...). Definitely a good idea to test things out and get an idea what Ceph and your hardware can do. From my experience and reading this ML however I think your best bet (overall performance) is to use those 4 SSDs as 1:5 journal SSDs for your 20 OSD HDDs. Currently cache-tiering is probably the worst use for those SSD resources, though the code and strategy is of course improving. Dedicated SSD pools may be a good fit depending on your use case. However I'd advise against mixing SSD and HDD OSDs on the same node. To fully utilize those SSDs you'll need a LOT more CPU power than required by HDD OSDs or SSD journals/HDD OSDs systems. And you already have 20 OSDs in that box. What CPUs do you have in those storage nodes anyway? If you have the budget, I'd deploy the current storage nodes in classic (SSDs for journals) mode and add a small (2x 8-12 SSDs) pair of pure SSD nodes, optimized for their task (more CPU power, faster network). Then use those SSD nodes to experiment with cache-tiers and pure SSD pools and switch over things when you're comfortable with this and happy with the performance. However with 20 OSDs per node, you're likely to go from being bottlenecked by your HDDs to being CPU limited (when dealing with lots of small IOPS at least). Still, better than now for sure. This is very interesting, thanks for pointing it out! What would you suggest to use in order to identify the actual bottleneck? 
(disk, CPU, RAM, etc.). Tools like munin? Munin might work, I use collectd to gather all those values (and even more importantly all Ceph counters) and graphite to visualize it. For ad-hoc, on the spot analysis I really like atop (in a huge window), which will make it very clear what is going on. In addition, there are some kernel tunables that may be helpful to improve overall performance. Maybe we are filling some kernel internals and that limits our results (for instance, we had to increase fs.aio-max-nr in sysctl.d to 262144 to be able to use 20 disks per host). Which tunables should we observe? I'm no expert for large (not even medium) clusters, so you'll have to research the archives and net (the CERN Ceph slide is nice). One thing I remember is kernel.pid_max, which is something you're likely to run into at some point
Re: [ceph-users] Blocked requests/ops?
Hello, On Tue, 26 May 2015 10:00:13 -0600 Robert LeBlanc wrote: I've seen I/O become stuck after we have done network torture tests. It seems that after so many retries the OSD peering just gives up and doesn't retry any more. An OSD restart kicks off another round of retries and the I/O completes. It seems like there was some discussion about this on the devel list recently. While that sounds certainly plausible, the Ceph network of my cluster wasn't particularly busy or tortured at that time at all. I suppose other factors might cause a similar behavior, so a good way forward would probably be to ensure that retries will happen with no limitation and in a reasonable interval. As for Xavier, no I never filed a bug, that thread was all there is. Since I didn't have anything other to report than it happened and neither do you really, it is doubtful the devs can figure out what exactly caused it. So as I wrote above, probably best to make sure it keeps retrying no matter what. Christian - Robert LeBlanc GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Tue, May 26, 2015 at 4:06 AM, Xavier Serrano wrote: Hello, Thanks for your detailed explanation, and for the pointer to the Unexplainable slow request thread. After investigating osd logs, disk SMART status, etc., the disk under osd.71 seems OK, so we restarted the osd... And voilà, problem seems to be solved! (or at least, the slow request message disappeared). But this really does not make me happy (and neither are you, Christian, I'm afraid). I understand that it is not acceptable that sometimes, apparently randomly, slow requests do happen and they remain stuck until an operator manually restarts the affected osd. My question now is: did you file a bug with the Ceph developers? What did they say? Could you provide me the links? I would like to reopen the issue if possible, and see if we can find a solution for this. 
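Before restarting a blamed OSD, it helps to extract mechanically which daemons `ceph health detail` is pointing at; a minimal, hypothetical helper (sample output taken from the health detail excerpt quoted in this thread):

```python
import re

def blocked_osds(health_detail):
    """Pull the OSD ids named in 'ops are blocked ... on osd.N' lines of
    `ceph health detail` output, so you know which daemons to inspect first."""
    return sorted({int(m) for m in
                   re.findall(r"blocked [\d.]+ sec on osd\.(\d+)", health_detail)})

sample = """HEALTH_WARN 1 requests are blocked 32 sec; 1 osds have slow requests
1 ops are blocked 67108.9 sec
1 ops are blocked 67108.9 sec on osd.71
1 osds have slow requests"""
print(blocked_osds(sample))  # -> [71]
```

From there, the admin socket (`ceph daemon osd.N dump_ops_in_flight` and `dump_historic_ops`) is the usual next step for gathering the internal counters and per-op timing mentioned in this thread, before resorting to a restart.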
About our cluster (testing, not production): - ceph version 0.94.1 - all hosts running Ubuntu 14.04 LTS 64-bits, kernel 3.16 - 5 monitors, 128GB RAM each - 6 osd hosts, 32GB RAM each, 20 osds per host, 1 HDD WD Green 2TB per osd - (and 6 more osds host to arrive soon) - 10 GbE interconnection Thank you very much indeed. Best regards, - Xavier Serrano - LCAC, Laboratori de Càlcul - Departament d'Arquitectura de Computadors, UPC On Tue May 26 14:19:22 2015, Christian Balzer wrote: Hello, Firstly, find my Unexplainable slow request thread in the ML archives and read all of it. On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote: Hello, We have observed that our cluster is often moving back and forth from HEALTH_OK to HEALTH_WARN states due to blocked requests. We have also observed blocked ops. For instance: As always SW versions and a detailed HW description (down to the model of HDDs used) will be helpful and educational. # ceph status cluster 905a1185-b4f0-4664-b881-f0ad2d8be964 health HEALTH_WARN 1 requests are blocked 32 sec monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0} election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5 osdmap e5091: 120 osds: 100 up, 100 in pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects 13164 GB used, 168 TB / 181 TB avail 2048 active+clean client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s # ceph health detail HEALTH_WARN 1 requests are blocked 32 sec; 1 osds have slow requests 1 ops are blocked 67108.9 sec 1 ops are blocked 67108.9 sec on osd.71 1 osds have slow requests You will want to have a very close look at osd.71 (logs, internal counters, cranking up debugging), but might find it just as mysterious as my case in the thread mentioned above. My questions are: (1) Is it normal to have slow requests in a cluster? 
Not really, though the Ceph developers clearly think those just merits a WARNING level, whereas I would consider those a clear sign of brokenness, as VMs or other clients with those requests pending are likely to be unusable at that point. (2) Or is it a symptom that indicates that something is wrong? (for example, a disk is about to fail) That. Of course your cluster could be just at the edge of its performance and nothing but improving that (most likely by adding more nodes/OSDs) would fix that. (3) How can we fix the slow requests? Depends on cause of course. AFTER you exhausted all means and gotten all relevant log/performance data from osd.71 restarting the osd might be all that's needed. (4) What's the meaning of blocked ops, and how can they be
Re: [ceph-users] Blocked requests/ops?
Hello,

Thanks for your detailed explanation, and for the pointer to the "Unexplainable slow request" thread.

After investigating OSD logs, disk SMART status, etc., the disk under osd.71 seems OK, so we restarted the OSD... and voilà, the problem seems to be solved! (Or at least, the slow-request message disappeared.)

But this really does not make me happy (and neither are you, Christian, I'm afraid). I understand that it is not acceptable that sometimes, apparently at random, slow requests happen and remain stuck until an operator manually restarts the affected OSD.

My question now is: did you file a bug with the Ceph developers? What did they say? Could you give me the links? I would like to reopen the issue if possible and see if we can find a solution.

About our cluster (testing, not production):
- ceph version 0.94.1
- all hosts running Ubuntu 14.04 LTS 64-bit, kernel 3.16
- 5 monitors, 128 GB RAM each
- 6 OSD hosts, 32 GB RAM each, 20 OSDs per host, 1 WD Green 2 TB HDD per OSD
- (and 6 more OSD hosts to arrive soon)
- 10 GbE interconnect

Thank you very much indeed.
Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC

On Tue May 26 14:19:22 2015, Christian Balzer wrote:
[snip]
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Blocked requests/ops?
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I've seen I/O become stuck after we have done network torture tests. It seems that after so many retries the OSD peering just gives up and doesn't retry any more. An OSD restart kicks off another round of retries and the I/O completes. It seems like there was some discussion about this on the devel list recently.

- Robert LeBlanc
GPG Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, May 26, 2015 at 4:06 AM, Xavier Serrano wrote:
[snip quoted thread]
[ceph-users] Blocked requests/ops?
Hello,

We have observed that our cluster often moves back and forth between the HEALTH_OK and HEALTH_WARN states due to blocked requests. We have also observed blocked ops. For instance:

# ceph status
    cluster 905a1185-b4f0-4664-b881-f0ad2d8be964
     health HEALTH_WARN
            1 requests are blocked > 32 sec
     monmap e5: 5 mons at {ceph-host-1=192.168.0.65:6789/0,ceph-host-2=192.168.0.66:6789/0,ceph-host-3=192.168.0.67:6789/0,ceph-host-4=192.168.0.68:6789/0,ceph-host-5=192.168.0.69:6789/0}
            election epoch 44, quorum 0,1,2,3,4 ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-4,ceph-host-5
     osdmap e5091: 120 osds: 100 up, 100 in
      pgmap v473436: 2048 pgs, 2 pools, 4373 GB data, 1093 kobjects
            13164 GB used, 168 TB / 181 TB avail
                2048 active+clean
  client io 10574 kB/s rd, 33883 kB/s wr, 655 op/s

# ceph health detail
HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
1 ops are blocked > 67108.9 sec
1 ops are blocked > 67108.9 sec on osd.71
1 osds have slow requests

My questions are:
(1) Is it normal to have slow requests in a cluster?
(2) Or is it a symptom that something is wrong (for example, a disk about to fail)?
(3) How can we fix the slow requests?
(4) What is the meaning of blocked ops, and how can they stay blocked so long? (67000 seconds is more than 18 hours!)
(5) How can we fix the blocked ops?

Thank you very much for your help.
Best regards,
- Xavier Serrano
- LCAC, Laboratori de Càlcul
- Departament d'Arquitectura de Computadors, UPC
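A quick sanity check of the "more than 18 hours" arithmetic in question (4), done by pulling the age out of the `ceph health detail` line (pure text processing, runnable anywhere):

```shell
# Convert the blocked-op age reported by `ceph health detail` into hours.
# The awk program finds the numeric field preceding "sec".
echo "1 ops are blocked > 67108.9 sec on osd.71" |
  awk '{ for (i = 2; i <= NF; i++) if ($i == "sec") printf "%.1f hours\n", $(i-1)/3600 }'
# prints: 18.6 hours
```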
Re: [ceph-users] Blocked requests/ops?
Hello,

Firstly, find my "Unexplainable slow request" thread in the ML archives and read all of it.

On Tue, 26 May 2015 07:05:36 +0200 Xavier Serrano wrote:

> Hello,
> We have observed that our cluster often moves back and forth between the
> HEALTH_OK and HEALTH_WARN states due to blocked requests. We have also
> observed blocked ops. For instance:

As always, SW versions and a detailed HW description (down to the model of HDDs used) will be helpful and educational.

> # ceph status
> [snip status output]
> # ceph health detail
> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests
> 1 ops are blocked > 67108.9 sec
> 1 ops are blocked > 67108.9 sec on osd.71
> 1 osds have slow requests

You will want to have a very close look at osd.71 (logs, internal counters, cranking up debugging), but you might find it just as mysterious as my case in the thread mentioned above.

> My questions are:
> (1) Is it normal to have slow requests in a cluster?

Not really. The Ceph developers clearly think these merit only a WARNING level, whereas I would consider them a clear sign of brokenness, as VMs or other clients with such requests pending are likely unusable at that point.

> (2) Or is it a symptom that indicates that something is wrong? (for
> example, a disk is about to fail)

That. Of course your cluster could simply be at the edge of its performance, and nothing but improving that (most likely by adding more nodes/OSDs) would fix it.

> (3) How can we fix the slow requests?

Depends on the cause, of course. AFTER you have exhausted all means and gotten all relevant log/performance data from osd.71, restarting the OSD might be all that's needed.

> (4) What's the meaning of blocked ops, and how can they be blocked so
> long? (67000 seconds is more than 18 hours!)

Precisely; this shouldn't happen.

> (5) How can we fix the blocked ops?

AFTER you have exhausted all means and gotten all relevant log/performance data from osd.71, restarting the OSD might be all that's needed.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
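Christian's "get all the relevant data from osd.71 first" step maps onto a handful of admin-socket commands. A sketch of that checklist (the script only prints the commands, since they must be run on the host where the OSD actually lives; command names are from the Hammer-era CLI):

```shell
# Print the inspection commands to run on osd.71's host before restarting it.
# dump_ops_in_flight / dump_historic_ops show current and recent slow ops;
# "perf dump" exposes the internal OSD counters Christian refers to.
osd=71
for cmd in \
  "ceph daemon osd.$osd dump_ops_in_flight" \
  "ceph daemon osd.$osd dump_historic_ops" \
  "ceph daemon osd.$osd perf dump"; do
  echo "$cmd"
done
```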
[ceph-users] blocked requests question
hello,

I am running a Ceph (RBD) cluster in a production environment hosting 200 VMs. Under normal circumstances Ceph's performance is quite good, but when I delete a snapshot or an image, the cluster shows a lot of blocked requests (generally more than 1000). The whole cluster then slows down and many VMs become very slow. Any ideas? Thank you.

The hardware of my cluster: 3 nodes, every node with 10 x 2 TB SATA disks and 1 x 120 GB SSD.
Re: [ceph-users] blocked requests question
Hello,

On Mon, 4 Aug 2014 11:03:37 +0800 飞 wrote:

> I am running a Ceph (RBD) cluster in a production environment hosting 200
> VMs. Under normal circumstances Ceph's performance is quite good, but when
> I delete a snapshot or an image, the cluster shows a lot of blocked
> requests (generally more than 1000). The whole cluster then slows down and
> many VMs become very slow. Any ideas? Thank you.
>
> The hardware of my cluster: 3 nodes, every node with 10 x 2 TB SATA disks
> and 1 x 120 GB SSD.

I suspect your cluster is pretty close to full capacity when operating normally and overwhelmed when something very intensive, like an image deletion (which has to touch every last object of the image), comes along. It would be nice if operations like these had (more and better) configuration options, as with scrub (load) and recovery operations.

Monitor your cluster with atop on all 3 nodes in parallel and observe the utilization of your HDDs and SSDs, CPU and network during a time of normal usage. Compare that to what you see when you delete an image (use a small one ^o^).

About your cluster: what OS, Ceph version, replication factor? What CPU, memory and network configuration? A single 120 GB SSD (which model?) as journal for 10 HDDs will definitely be the limiting factor when it comes to write speed, but hopefully it handles the IOPS well enough.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
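Some of the throttles Christian wishes for do exist as OSD options. A hypothetical ceph.conf fragment along those lines (option names as documented around the Firefly/Hammer releases; availability and sensible values vary by version, so verify each against your own release before use):

```ini
[osd]
; sleep between snapshot-trim operations, smoothing out snapshot deletes
osd_snap_trim_sleep = 0.1
; limit concurrent recovery/backfill work per OSD
osd_recovery_max_active = 1
osd_max_backfills = 1
; skip scheduled scrubs when the host load is above this threshold
osd_scrub_load_threshold = 0.5
```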
Re: [ceph-users] Blocked requests during and after CephFS delete
[ Re-added the list since I don't have log files. ;) ]

On Mon, Dec 9, 2013 at 5:52 AM, Oliver Schulz osch...@mpp.mpg.de wrote:

> Hi Greg,
>
> I'll send this privately, maybe better not to post log files etc. to the
> list. :-)
>
>> Nobody's reported it before, but I think the CephFS MDS is sending out
>> too many delete requests. [...] That's all speculation on my part though;
>> can you go sample the slow requests and see what their makeup looked
>> like? Do you have logs from the MDS or OSDs during that time period?
>
> Uh - how do I sample the requests?

I believe the slow requests should have been logged in the monitor's central log. That's a file sitting in the mon directory, and is probably accessible via other means I can't think of off-hand. Go see if it describes what the slow OSD requests are (e.g., are they a bunch of MDS deletes with some other stuff sprinkled in, all other stuff, or whatever).

> Concerning logs - you mean the regular ceph daemon log files? Sure - I'm
> attaching a tarball of all daemon logs from the relevant time interval
> (please don't publish them ;-) ). It's 13.2 MB, I hope it goes through by
> email. I also dumped ceph health every minute during the test.
>
> * 15:34:34 to 15:48:37 is the effect of my first mass delete. I aborted
>   that one before it could finish, to see if emperor would do better

By "aborted", do you mean you stopped deleting all the things you intended to?

[snip]

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
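The sampling Greg describes boils down to classifying the "slow request" lines in the monitor's central log by operation type. A sketch using made-up log lines (the osd_op payloads are invented for illustration; on a real cluster, point the grep at the ceph.log file in the mon directory instead of the sample file):

```shell
# Illustrative ceph.log excerpt: two MDS deletes and one client write.
cat > /tmp/ceph.log.sample <<'EOF'
2013-12-09 15:35:01 osd.12 [WRN] slow request 31.2 seconds old ... osd_op(mds.0.1:4242 [delete])
2013-12-09 15:35:03 osd.7 [WRN] slow request 33.0 seconds old ... osd_op(client.4112.0:99 [write 0~4194304])
2013-12-09 15:35:09 osd.12 [WRN] slow request 45.8 seconds old ... osd_op(mds.0.1:4250 [delete])
EOF
# Count slow requests per operation type ("[delete", "[write", ...)
grep 'slow request' /tmp/ceph.log.sample |
  grep -oE '\[[a-z]+' | sort | uniq -c | sort -rn
```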
[ceph-users] Blocked requests during and after CephFS delete
Hello Ceph-Gurus,

a short while ago I reported some trouble we had with our cluster suddenly going into a state of blocked requests. We did a few tests, and we can reproduce the problem: during / after deleting a substantial chunk of data on CephFS (a few TB), ceph health shows blocked requests like

HEALTH_WARN 222 requests are blocked > 32 sec

This goes on for a couple of minutes, during which the cluster is pretty much unusable. The number of blocked requests jumps around (but seems to go down on average), until finally (after about 15 minutes in my last test) health is back to OK. I upgraded the cluster to Ceph emperor (0.72.1) and repeated the test, but the problem persists.

Is this normal - and if not, what might be the reason? Obviously, having the cluster go on strike for a while after data deletion is a bit of a problem, especially with a mixed application load. The VMs running on RBDs aren't too happy about it, for example. ;-)

Our cluster structure: 6 nodes, 6 x 3 TB disks plus 1 system/journal SSD per node, one OSD per disk. We're running ceph version 0.72.1-1precise on Ubuntu 12.04.3 with kernel 3.8.0-33-generic (x86_64). All active pools use replication factor 3.

Any ideas?

Cheers, Oliver
Re: [ceph-users] Blocked requests during and after CephFS delete
On Sun, Dec 8, 2013 at 7:16 AM, Oliver Schulz osch...@mpp.mpg.de wrote:

> a short while ago I reported some trouble we had with our cluster suddenly
> going into a state of blocked requests. We did a few tests, and we can
> reproduce the problem: during / after deleting a substantial chunk of data
> on CephFS (a few TB), ceph health shows blocked requests like
> HEALTH_WARN 222 requests are blocked > 32 sec
> [snip]
> Is this normal - and if not, what might be the reason?

Nobody's reported it before, but I think the CephFS MDS is sending out too many delete requests. When you delete something in CephFS, it's just marked as deleted, and the MDS is supposed to do the actual deletion asynchronously in the background, but I'm not sure if there are any throttles on how quickly it does so. If you remove several terabytes worth of data, and the MDS is sending out RADOS object deletes for each 4 MB as fast as it can, that's a lot of unfiltered traffic on the OSDs.

That's all speculation on my part though; can you go sample the slow requests and see what their makeup looked like? Do you have logs from the MDS or OSDs during that time period?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
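Greg's speculation is easy to put numbers on: at CephFS's default 4 MB object size, a few TB of deleted file data turns into hundreds of thousands of RADOS deletes, each of which touches every replica. A back-of-envelope sketch (3 TB and 3 replicas are taken from Oliver's cluster description above):

```shell
# One RADOS object per 4 MB of file data; each delete touches every replica.
data_tb=3
object_mb=4
replicas=3
objects=$(( data_tb * 1024 * 1024 / object_mb ))
echo "objects to delete:     $objects"
echo "replica-level deletes: $(( objects * replicas ))"
# prints:
# objects to delete:     786432
# replica-level deletes: 2359296
```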