Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Dan van der Ster
On Wed, Oct 19, 2016 at 3:22 PM, Yoann Moulin  wrote:
> Hello,
>
>>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
>>> composed of 12 nodes; each node has 10 OSDs with the journal on disk.
>>>
>>> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
>>> one EC (8+2).
>>>
>>> A few details about our cluster are attached.
>>>
>>> Currently, our cluster is not usable at all due to too much OSD
>>> instability. OSD daemons die randomly with "hit suicide timeout".
>>> Yesterday, all of the 120 OSDs died at least 12 times each (max 74 times),
>>> with an average of around 40 times.
>>>
>>> Here are logs from the ceph mon and from one OSD:
>>>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
>>
>> Do you have an older log showing the start of the incident? The
>> cluster was already down when this log started.
>
> Here are the logs from Saturday; OSD 134 was the first to show errors:
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.134.log.4.bz2
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.4.bz2
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.4.bz2


Do you have osd.86's log? I think it was the first to fail:

2016-10-15 14:42:32.109025 mon.0 10.90.37.3:6789/0 5240160 : cluster
[INF] osd.86 10.90.37.15:6823/11625 failed (2 reporters from different
host after 20.000215 >= grace 20.00)

Then these OSDs failed a couple of seconds later:

2016-10-15 14:42:34.900989 mon.0 10.90.37.3:6789/0 5240180 : cluster
[INF] osd.27 10.90.37.5:6802/5426 failed (2 reporters from different
host after 20.000417 >= grace 20.00)
2016-10-15 14:42:34.902105 mon.0 10.90.37.3:6789/0 5240183 : cluster
[INF] osd.95 10.90.37.12:6822/12403 failed (2 reporters from different
host after 20.001862 >= grace 20.00)
2016-10-15 14:42:34.902653 mon.0 10.90.37.3:6789/0 5240185 : cluster
[INF] osd.131 10.90.37.25:6820/195317 failed (2 reporters from
different host after 20.002387 >= grace 20.00)
2016-10-15 14:42:34.903205 mon.0 10.90.37.3:6789/0 5240187 : cluster
[INF] osd.136 10.90.37.23:6803/5148 failed (2 reporters from different
host after 20.002898 >= grace 20.00)
2016-10-15 14:42:35.576139 mon.0 10.90.37.3:6789/0 5240191 : cluster
[INF] osd.24 10.90.37.3:6800/4587 failed (2 reporters from different
host after 21.384669 >= grace 20.094412)
2016-10-15 14:42:35.580217 mon.0 10.90.37.3:6789/0 5240193 : cluster
[INF] osd.37 10.90.37.11:6838/179566 failed (3 reporters from
different host after 20.680190 >= grace 20.243928)
2016-10-15 14:42:35.581550 mon.0 10.90.37.3:6789/0 5240195 : cluster
[INF] osd.46 10.90.37.9:6800/4811 failed (2 reporters from different
host after 21.389655 >= grace 20.00)
2016-10-15 14:42:35.582286 mon.0 10.90.37.3:6789/0 5240197 : cluster
[INF] osd.64 10.90.37.21:6810/7658 failed (2 reporters from different
host after 21.390167 >= grace 20.409388)
2016-10-15 14:42:35.582823 mon.0 10.90.37.3:6789/0 5240199 : cluster
[INF] osd.107 10.90.37.19:6820/10260 failed (2 reporters from
different host after 21.390516 >= grace 20.074818)


Just a hunch, but do osds 86, 27, 95, etc... all share the same PG?
Use 'ceph pg dump' to check.
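For example, something like this should do it (just a sketch: the column
layout of 'ceph pg dump' differs between releases, so adjust the pattern to
your output):

  # list PGs whose up/acting sets mention osd.86, then check whether
  # 27, 95, 131, 136, ... keep appearing on the same lines
  ceph pg dump pgs_brief | grep -E '(\[|,)86(,|\])'

If the same few PGs show up across the failed OSDs, that points at a
problematic PG or object rather than at the hardware.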

>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
>>
>> In this log the thread which is hanging is doing deep-scrub:
>>
>> 2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
>> [INF] : 39.54 deep-scrub starts
>> 2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
>> 2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
>>
>> So you can disable scrubbing completely with
>>
>>   ceph osd set noscrub
>>   ceph osd set nodeep-scrub
>>
>> in case you are hitting some corner case with the scrubbing code.
>
> Now the cluster seems to be healthy, but as soon as I re-enable scrubbing and
> rebalancing, OSDs start to flap and the cluster switches to HEALTH_ERR.
>

Looks like recovery/backfill are enabled and you otherwise have all
clean PGs. Don't be afraid to leave scrubbing disabled until you
understand exactly what is going wrong.

Do you see any SCSI / IO errors on the disks failing to scrub?
Though, it seems unlikely that so many disks are all failing at the
same time. More likely there's at least one object that's giving the
scrubber problems and hanging the related OSDs.
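If you want to rule that out, a quick pass on each OSD host along these lines
should be enough (device names are placeholders; smartctl comes from the
smartmontools package):

  # kernel-level disk errors around the time of the failures
  dmesg -T | grep -iE 'i/o error|medium error|blk_update_request'
  # SMART counters for one disk at a time (replace sdX)
  smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrect'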


> cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
>   health HEALTH_WARN
>  noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>   monmap e1: 3 mons at
> {iccluster002.iccluster.epfl.ch=10.90.37.3:6789/0,iccluster010.iccluster.epfl.ch=10.90.37.11:6789/0,iccluster018.iccluster.epfl.ch=10.90.37.19:6789/0}
>  election epoch 64, quorum 0,1,2 
> iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
>fsmap e131: 1/1/1 up {0=iccluster022.iccluster.epfl.ch=up:active}, 2 
> up:standby
>   

Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Hello,

>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
>> composed of 12 nodes; each node has 10 OSDs with the journal on disk.
>>
>> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
>> one EC (8+2).
>>
>> A few details about our cluster are attached.
>>
>> Currently, our cluster is not usable at all due to too much OSD
>> instability. OSD daemons die randomly with "hit suicide timeout".
>> Yesterday, all of the 120 OSDs died at least 12 times each (max 74 times),
>> with an average of around 40 times.
>>
>> Here are logs from the ceph mon and from one OSD:
>>
>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
> 
> Do you have an older log showing the start of the incident? The
> cluster was already down when this log started.

Here are the logs from Saturday; OSD 134 was the first to show errors:

http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.134.log.4.bz2
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.4.bz2
http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.4.bz2

>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
> 
> In this log the thread which is hanging is doing deep-scrub:
> 
> 2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
> [INF] : 39.54 deep-scrub starts
> 2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
> 2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150
> 
> So you can disable scrubbing completely with
> 
>   ceph osd set noscrub
>   ceph osd set nodeep-scrub
> 
> in case you are hitting some corner case with the scrubbing code.

Now the cluster seems to be healthy, but as soon as I re-enable scrubbing and
rebalancing, OSDs start to flap and the cluster switches to HEALTH_ERR.

cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
  health HEALTH_WARN
 noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
  monmap e1: 3 mons at
{iccluster002.iccluster.epfl.ch=10.90.37.3:6789/0,iccluster010.iccluster.epfl.ch=10.90.37.11:6789/0,iccluster018.iccluster.epfl.ch=10.90.37.19:6789/0}
 election epoch 64, quorum 0,1,2 
iccluster002.iccluster.epfl.ch,iccluster010.iccluster.epfl.ch,iccluster018.iccluster.epfl.ch
   fsmap e131: 1/1/1 up {0=iccluster022.iccluster.epfl.ch=up:active}, 2 
up:standby
  osdmap e72932: 144 osds: 144 up, 120 in
 flags noout,noscrub,nodeep-scrub,sortbitwise
   pgmap v4834810: 9408 pgs, 28 pools, 153 TB data, 75849 kobjects
 449 TB used, 203 TB / 653 TB avail
 9408 active+clean


>> We have stopped all client I/O to see if the cluster would stabilize, without
>> success; to avoid endless rebalancing with OSDs flapping, we had to
>> "set noout" on the cluster. For now we have no idea what's going on.
>>
>> Can anyone help us understand what's happening?
> 
> Is your network OK?

We have one 10G NIC for the private network and one 10G NIC for the public
network. The network is far from loaded right now and there are no
errors. We don't use jumbo frames.

> It will be useful to see the start of the incident to better
> understand what caused this situation.
>
> Also, maybe useful for you... you can increase the suicide timeout, e.g.:
> 
>osd op thread suicide timeout: 
> 
> If the cluster is just *slow* somehow, then increasing that might
> help. If there is something systematically broken, increasing would
> just postpone the inevitable.

OK, I'm going to study this option with my colleagues.

thanks

-- 
Yoann Moulin
EPFL IC-IT


Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Christian Balzer

Hello,

no specific ideas, but this somewhat sounds familiar.

One thing first: you already stopped client traffic, but to make sure your
cluster really becomes quiescent, stop all scrubs as well.
That's always a good idea in any recovery or overload situation.

Have you verified CPU load (are those OSD processes busy), memory status,
etc.?
How busy are the actual disks?
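For example (iostat is in the sysstat package; these are just quick spot
checks, nothing exhaustive):

  uptime                        # load average vs. number of cores
  free -m                       # memory and swap pressure
  iostat -x 5 3                 # per-disk %util and await; a saturated journal disk stands out
  top -b -n 1 | grep ceph-osd   # which OSD daemons are burning CPU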

Sudden deaths like this are often the result of network changes, like a
switch rebooting and losing its jumbo frame configuration, or whatnot.

Christian
 
On Wed, 19 Oct 2016 09:44:01 +0200 Yoann Moulin wrote:

> Dear List,
> 
> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
> composed of 12 nodes; each node has 10 OSDs with the journal on disk.
> 
> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
> one EC (8+2).
> 
> A few details about our cluster are attached.
> 
> Currently, our cluster is not usable at all due to too much OSD instability.
> OSD daemons die randomly with "hit suicide timeout". Yesterday, all of the
> 120 OSDs died at least 12 times each (max 74 times), with an average of
> around 40 times.
> 
> Here are logs from the ceph mon and from one OSD:
> 
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
> 
> We have stopped all client I/O to see if the cluster would stabilize, without
> success; to avoid endless rebalancing with OSDs flapping, we had to
> "set noout" on the cluster. For now we have no idea what's going on.
> 
> Can anyone help us understand what's happening?
> 
> thanks for your help
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Dan van der Ster
Hi Yoann,


On Wed, Oct 19, 2016 at 9:44 AM, Yoann Moulin  wrote:
> Dear List,
>
> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
> composed of 12 nodes; each node has 10 OSDs with the journal on disk.
>
> We have one RBD partition and a RadosGW with 2 data pools, one replicated,
> one EC (8+2).
>
> A few details about our cluster are attached.
>
> Currently, our cluster is not usable at all due to too much OSD instability.
> OSD daemons die randomly with "hit suicide timeout". Yesterday, all of the
> 120 OSDs died at least 12 times each (max 74 times), with an average of
> around 40 times.
>
> Here are logs from the ceph mon and from one OSD:
>
> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)

Do you have an older log showing the start of the incident? The
cluster was already down when this log started.

> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

In this log the thread which is hanging is doing deep-scrub:

2016-10-18 22:16:23.985462 7f12da4af700  0 log_channel(cluster) log
[INF] : 39.54 deep-scrub starts
2016-10-18 22:16:39.008961 7f12e4cc4700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f12da4af700' had timed out after 15
2016-10-18 22:18:54.175912 7f12e34c1700  1 heartbeat_map is_healthy
'OSD::osd_op_tp thread 0x7f12da4af700' had suicide timed out after 150

So you can disable scrubbing completely with

  ceph osd set noscrub
  ceph osd set nodeep-scrub

in case you are hitting some corner case with the scrubbing code.
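You can confirm the flags took effect with 'ceph osd dump | grep flags' (or in
'ceph -s'), and undo them later once the root cause is understood:

  ceph osd unset noscrub
  ceph osd unset nodeep-scrub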

> We have stopped all client I/O to see if the cluster would stabilize, without
> success; to avoid endless rebalancing with OSDs flapping, we had to
> "set noout" on the cluster. For now we have no idea what's going on.
>
> Can anyone help us understand what's happening?

Is your network OK?

It will be useful to see the start of the incident to better
understand what caused this situation.

Also, maybe useful for you... you can increase the suicide timeout, e.g.:

   osd op thread suicide timeout: 

If the cluster is just *slow* somehow, then increasing that might
help. If there is something systematically broken, increasing would
just postpone the inevitable.
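For example (300s below is only an illustration, double the 150s default that
your log shows, not a recommendation):

  # ceph.conf on the OSD hosts
  [osd]
      osd op thread suicide timeout = 300

The OSDs need a restart to pick that up; injecting it at runtime with
'ceph tell osd.* injectargs' may also work, but not every timeout is re-read
on the fly.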

-- Dan




>
> thanks for your help
>
> --
> Yoann Moulin
> EPFL IC-IT
>


[ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Dear List,

We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is
composed of 12 nodes; each node has 10 OSDs with the journal on disk.

We have one RBD partition and a RadosGW with 2 data pools, one replicated,
one EC (8+2).

A few details about our cluster are attached.

Currently, our cluster is not usable at all due to too much OSD instability.
OSD daemons die randomly with "hit suicide timeout". Yesterday, all of the
120 OSDs died at least 12 times each (max 74 times), with an average of
around 40 times.

Here are logs from the ceph mon and from one OSD:

http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

We have stopped all client I/O to see if the cluster would stabilize, without
success; to avoid endless rebalancing with OSDs flapping, we had to
"set noout" on the cluster. For now we have no idea what's going on.

Can anyone help us understand what's happening?

thanks for your help

-- 
Yoann Moulin
EPFL IC-IT
$ ceph --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

$ uname -a
Linux icadmin004 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4927 flags hashpspool stripe_width 0
	removed_snaps [1~3]
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 258 flags hashpspool stripe_width 0
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 259 flags hashpspool stripe_width 0
pool 5 'default.rgw.data.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 260 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 6 'default.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 261 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 262 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 8 'erasure.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 271 flags hashpspool stripe_width 0
pool 9 'erasure.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 272 flags hashpspool stripe_width 0
pool 11 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 276 flags hashpspool stripe_width 0
pool 12 'default.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 277 flags hashpspool stripe_width 0
pool 14 'default.rgw.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 311 flags hashpspool stripe_width 0
pool 15 'default.rgw.users.keys' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 313 flags hashpspool stripe_width 0
pool 16 'default.rgw.meta' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 315 flags hashpspool stripe_width 0
pool 17 'default.rgw.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 320 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 18 'default.rgw.users.email' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 322 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 19 'default.rgw.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 353 flags hashpspool stripe_width 0
pool 20 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4918 flags hashpspool stripe_width 0
pool 26 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3549 flags hashpspool stripe_width 0
pool 27 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3551 flags hashpspool stripe_width 0
pool 28 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3552 flags hashpspool stripe_width 0
pool 29 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3553 flags hashpspool stripe_width 0
pool 30 'test' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 4910 flags hashpspool stripe_width 0
pool 31 'data' replicated size 3 min_size 2 crush_ruleset 0