Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Yoann Moulin
Hello,

>>> We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is 
>>> composed of 12 nodes; each node has 10 OSDs with the journal on disk.
>>>
>>> We have one rbd partition and a RadosGW with 2 data pools, one replicated, 
>>> one EC (8+2).
>>>
>>> A few details about our cluster are attached.
>>>
>>> Currently, our cluster is not usable at all because of too much OSD 
>>> instability. OSD daemons die randomly with "hit suicide timeout". 
>>> Yesterday, all 120 OSDs died at least 12 times each (max 74 times), around 
>>> 40 times on average.
>>>
>>> Here are the logs from the ceph mon and from one OSD:
>>>
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
>>> http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)
>>>
>>> We stopped all client I/O to see whether the cluster would become stable, 
>>> without success. To avoid endless rebalancing with OSD flapping, we had to 
>>> set "noout" on the cluster. For now we have no idea what's going on.
>>>
>>> Can anyone help us understand what's happening?
>>>
>>> Thanks for your help.
>>>
>> no specific ideas, but this somewhat sounds familiar.
>>
>> One thing first: you already stopped client traffic, but to make sure your
>> cluster really becomes quiescent, stop all scrubs as well.
>> That's always a good idea in any recovery or overload situation.

This is what we did; the exact flags we set are below for reference.
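A sketch of the standard ceph CLI flags for that, to be unset again once the
cluster is stable:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # once things have calmed down:
  #   ceph osd unset noscrub
  #   ceph osd unset nodeep-scrub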

>> Have you verified CPU load (are those OSD processes busy), memory status,
>> etc?
>> How busy are the actual disks?

The CPU and memory do not seem to be overloaded; with the journals on the same 
disks, the disks themselves may be a little busy.
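For anyone following along, this is roughly how we are checking that on the
OSD nodes (standard Linux tools, nothing Ceph-specific):

  # per-disk utilisation and latency, refreshed every 5 seconds
  iostat -x 5
  # one snapshot of CPU/memory usage of the ceph-osd processes
  top -c -b -n 1 | grep '[c]eph-osd'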

>> Sudden deaths like this are often the result of network changes, like a
>> switch rebooting and losing its jumbo frame configuration or whatnot.

We manage all the equipment of the cluster, and none of it has rebooted. We 
decided to reboot the nodes one by one yesterday, but the switch is healthy.
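Just in case it is useful to someone else: a quick way to verify that jumbo
frames still pass end-to-end between two OSD nodes, assuming a 9000-byte MTU
on the cluster network (8972 = 9000 minus 28 bytes of IP/ICMP headers;
<other-osd-node> is a placeholder):

  # -M do forbids fragmentation, so an MTU mismatch shows up immediately
  ping -M do -s 8972 -c 3 <other-osd-node>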

In the logs I found that the problem started after I began copying data into 
the RadosGW EC pool (8+2).

At the same time, we had 6 processes reading from the rbd partition: three of 
those processes were writing to a replicated pool through the cluster's own 
RadosGW S3, one was also writing to an EC pool through the RadosGW S3, and the 
other 2 were not writing to the cluster at all.
Could that pressure have slowed the disks down enough to trigger the suicide 
timeout on the OSDs?
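If it helps to check that theory, the timeouts involved can be read from a
running OSD via its admin socket (osd.10 here only because that is the OSD
whose log is linked above). I would expect options such as
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout to show
up, but please take the exact option names as an assumption on my side:

  # run on the host carrying osd.10
  ceph daemon osd.10 config show | grep -i suicide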

But now we have no more I/O on the cluster at all, and as soon as I re-enable 
scrubbing and rebalancing, the OSDs start to fail again...
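Before re-enabling anything we double-check which flags are still set, e.g.:

  ceph -s
  ceph osd dump | grep flags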

> just an additional comment:
> 
> you can disable backfilling and recovery temporarily by setting the 
> 'nobackfill' and 'norecover' flags. This will reduce the backfill traffic
> and may help the cluster and its OSDs to recover. Afterwards you should set 
> the backfill traffic settings to the minimum (e.g. max_backfills = 1)
> and unset the flags to allow the cluster to perform the outstanding recovery 
> operations.
>
> As the others already pointed out, these actions might help to get the 
> cluster up and running again, but you need to find the actual reason for
> the problems.

This is exactly what I want to do.
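For the archive, my reading of that sequence as ceph CLI commands (the
injectargs change is runtime-only, so osd_max_backfills = 1 should also go
into ceph.conf if it is meant to persist across restarts):

  ceph osd set nobackfill
  ceph osd set norecover
  # throttle backfill to the minimum at runtime
  ceph tell osd.* injectargs '--osd-max-backfills 1'
  # then let the cluster work through the outstanding recovery
  ceph osd unset nobackfill
  ceph osd unset norecover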

Thanks for the help !

-- 
Yoann Moulin
EPFL IC-IT
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HELP ! Cluster unusable with lots of "hit suicide timeout"

2016-10-19 Thread Burkhard Linke

Hi,

just an additional comment:

you can disable backfilling and recovery temporarily by setting the 
'nobackfill' and 'norecover' flags. This will reduce the backfill 
traffic and may help the cluster and its OSDs to recover. Afterwards you 
should set the backfill traffic settings to the minimum (e.g. 
max_backfills = 1) and unset the flags to allow the cluster to perform 
the outstanding recovery operations.


As the others already pointed out, these actions might help to get the 
cluster up and running again, but you need to find the actual reason for 
the problems.


Regards,
Burkhard

On 19.10.2016 10:04, Christian Balzer wrote:

Hello,

no specific ideas, but this somewhat sounds familiar.

One thing first: you already stopped client traffic, but to make sure your
cluster really becomes quiescent, stop all scrubs as well.
That's always a good idea in any recovery or overload situation.

Have you verified CPU load (are those OSD processes busy), memory status,
etc?
How busy are the actual disks?

Sudden deaths like this are often the result of network changes, like a
switch rebooting and losing its jumbo frame configuration or whatnot.

Christian
  
On Wed, 19 Oct 2016 09:44:01 +0200 Yoann Moulin wrote:



Dear List,

We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed of 
12 nodes; each node has 10 OSDs with the journal on disk.

We have one rbd partition and a RadosGW with 2 data pools, one replicated, one 
EC (8+2).

A few details about our cluster are attached.

Currently, our cluster is not usable at all because of too much OSD instability. OSD 
daemons die randomly with "hit suicide timeout". Yesterday, all 120 OSDs died at 
least 12 times each (max 74 times), around 40 times on average.

Here are the logs from the ceph mon and from one OSD:

http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

We stopped all client I/O to see whether the cluster would become stable, 
without success. To avoid endless rebalancing with OSD flapping, we had to set 
"noout" on the cluster. For now we have no idea what's going on.

Can anyone help us understand what's happening?

Thanks for your help.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com