Erratum: sorry for the bad screenshot links:

1st : https://supervision.pci-conseil.net/screenshot_LOAD.png

2nd : https://supervision.pci-conseil.net/screenshot_OSD_IO.png

:)

On 02/03/2017 at 15:34, [email protected] wrote:

Hello,

So, I may need some advice: one week ago (on 19 February), I upgraded my stable Ceph Jewel cluster from 10.2.3 to 10.2.5 (yes, that was maybe a bad idea).

I never had a problem with Ceph 10.2.3 since the previous upgrade, last 23 September.

Since the upgrade to 10.2.5, every 2 days the first OSD server totally freezes. The load goes above 500 and comes back down after a few minutes… I lose all the OSDs on this server (12/36) during the incident.

It's very strange. So, some information:

_Infrastructure_:

3 OSD servers with 12 OSD disks each plus SSD journals, 3 mon servers, and 3 Ceph clients (RBD).

A dedicated 10G network for clients and a dedicated 10G network for the OSDs.

So 36 OSDs in total. Each server has 16 CPU cores (2x E5-2630v3) and 32 GB RAM. No problem with resources.

Performance is good for 36 NL-SAS 4 TB disks plus 1 write-intensive SSD per OSD server.

_Issue:_

This morning (the previous incident was 2 days ago):

See screenshot: http://www.performance-conseil-informatique.net/wp-content/uploads/2017/03/screenshot_LOAD-1.png

As you can see, there is little IO (just 2 clients, sometimes writing 150 MB/s for a few minutes). It's a big NAS for cold data.

And during the incident there was no IO, which is strange. Same for the previous incidents.

See screenshot: http://www.performance-conseil-informatique.net/wp-content/uploads/2017/03/screenshot_OSD_IO.png

Before the incident: no activity. You can see all OSD reads, OSD writes, journal (SSD), and IO wait.

07:07 => 07:09: 2 minutes with 12/36 OSDs totally lost. They came back afterwards, but I need to fix this.

During the incident, scrubbing was stopped as well, and the nightly trim had already finished… no IO.

No other cron jobs on the server, nothing. All servers have the same configuration.
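One way to narrow the 07:07-07:09 window (a sketch, assuming default log paths; the cluster log ceph.log lives on the mons):

# On a mon: which OSDs were reported failed, by whom, and when they booted again
grep -E 'osd\.[0-9]+ .*(failed|boot)' /var/log/ceph/ceph.log | grep '2017-03-02 07:0'

# On the frozen OSD server: kernel-side stalls around the same time
dmesg -T | grep -iE 'hung task|blocked for more than|i/o error'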

*LOGS:*

A lot of these:

ceph-osd.3.log:2017-03-02 07:09:32.061754 7f6d501e4700 -1 osd.3 14557 heartbeat_check: no reply from 0x7f6dadb48c10 osd.19 since back 2017-03-02 07:07:53.286880 front 2017-03-02 07:07:53.286880 (cutoff 2017-03-02 07:09:12.061690)
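That line means osd.3 stopped receiving heartbeat replies from osd.19 on both the front and back networks at 07:07:53, and flagged it once the grace period expired (the cutoff is now minus osd_heartbeat_grace, 20 s by default, which matches the timestamps above). To check whether the missing replies all point at OSDs on the frozen host, a sketch (the extraction assumes the log format shown above):

# Tally which peer OSDs the heartbeat failures point at
grep -h 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.*.log \
  | grep -oE 'osd\.[0-9]+ since' | sort | uniq -c | sort -rn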

Sometimes:

common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7fc38a5e9425]

2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char const*, long)+0x2e1) [0x7fc38a528de1]

3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fc38a52963e]

4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7fc38a529e1c]

5: (CephContextServiceThread::entry()+0x15b) [0x7fc38a6011ab]

6: (()+0x7dc5) [0x7fc388304dc5]

7: (clone()+0x6d) [0x7fc38698f73d]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
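That assert means one of the OSD's internal worker threads was stuck for longer than its suicide timeout, so the daemon deliberately aborted itself; it is usually a symptom of something blocking the thread (a disk IO stall, memory pressure) rather than the root cause. To get headroom to capture data before the OSDs kill themselves, the grace values can be raised temporarily. A sketch only, assuming the Jewel defaults (osd_heartbeat_grace = 20, osd_op_thread_suicide_timeout = 150, filestore_op_thread_suicide_timeout = 180); this hides the stall, it does not fix it:

[osd]
# temporary diagnostic headroom only
osd_heartbeat_grace = 40
osd_op_thread_suicide_timeout = 300
filestore_op_thread_suicide_timeout = 300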


_*Questions:*_ Why does only the first OSD server freeze? The 3 servers are strictly identical. What is freezing the server and driving the load up…?

Already 4 freezes since the upgrade. Today I will raise the log level and restart everything to get more logs.
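The debug levels can also be injected into the running daemons without a restart (a sketch, using the standard injectargs mechanism):

# raise OSD and messenger logging on all OSDs
ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'
# restore the defaults after capturing the next freeze
ceph tell osd.* injectargs '--debug_osd 0/5 --debug_ms 0/5'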

Any ideas for troubleshooting? (I already use sar statistics to look for something…)
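For the sar side, the views I would focus on during the window (a sketch, assuming sysstat's daily files in /var/log/sa):

# live: run queue length and D-state (blocked) task count, 5 s samples
sar -q 5
# live: per-disk await/util, to spot one OSD disk or the journal SSD stalling
sar -d -p 5
# replay this morning's incident from the daily file (sa02 = 2nd of the month)
sar -q -f /var/log/sa/sa02 -s 07:00:00 -e 07:15:00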

Maybe some change in the heartbeat code between 10.2.3 and 10.2.5?
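To at least confirm what the running daemons use now, the admin socket of any one OSD can be queried (a sketch, assuming the default socket path):

# run on the host where osd.3 lives: dump the live heartbeat/suicide settings
ceph daemon osd.3 config show | egrep 'heartbeat|suicide'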

Should I consider downgrading to 10.2.3? Or upgrading to Kraken?

Thanks for your help,

Regards,

*Other things:*

rpm -qa|grep ceph

libcephfs1-10.2.5-0.el7.x86_64

ceph-common-10.2.5-0.el7.x86_64

ceph-mon-10.2.5-0.el7.x86_64

ceph-release-1-1.el7.noarch

ceph-10.2.5-0.el7.x86_64

ceph-radosgw-10.2.5-0.el7.x86_64

ceph-selinux-10.2.5-0.el7.x86_64

ceph-mds-10.2.5-0.el7.x86_64

python-cephfs-10.2.5-0.el7.x86_64

ceph-base-10.2.5-0.el7.x86_64

ceph-osd-10.2.5-0.el7.x86_64

uname -a

Linux ceph-osd-03 3.10.0-514.6.2.el7.x86_64 #1 SMP Thu Feb 23 03:04:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ceph conf:

[global]

fsid = d26f269b-852f-4181-821d-756f213ae155

mon_initial_members = ceph-mon-01, ceph-mon-02, ceph-mon-03

mon_host = 192.168.43.147,192.168.43.148,192.168.43.149

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

max_open_files = 131072

public_network = 192.168.43.0/24

cluster_network = 192.168.44.0/24

osd_journal_size = 13000

osd_pool_default_size = 2 # Write an object n times.

osd_pool_default_min_size = 2 # Allow writing n copy in a degraded state.

osd_pool_default_pg_num = 512

osd_pool_default_pgp_num = 512

osd_crush_chooseleaf_type = 8

cephx_cluster_require_signatures = true

cephx_service_require_signatures = false

mon_pg_warn_max_object_skew = 0

mon_pg_warn_max_per_osd = 0

[mon]

[osd]

osd_max_backfills = 1

osd_recovery_priority = 3

osd_recovery_max_active = 3

osd_recovery_max_start = 3

filestore merge threshold = 40

filestore split multiple = 8

filestore xattr use omap = true

osd op threads = 8

osd disk threads = 4

osd op num threads per shard = 3

osd op num shards = 10

osd map cache size = 1024

osd_enable_op_tracker = false

osd_scrub_begin_hour = 20

osd_scrub_end_hour = 6

[client]

rbd_cache = true

rbd cache size = 67108864

rbd cache max dirty = 50331648

rbd cache target dirty = 33554432

rbd cache max dirty age = 2

rbd cache writethrough until flush = true

rbd readahead trigger requests = 10 # number of sequential requests necessary to trigger readahead.

rbd readahead max bytes = 524288 # maximum size of a readahead request, in bytes.

rbd readahead disable after bytes = 52428800

--
*Performance Conseil Informatique*
Pascal Pucci
Consultant Infrastructure
[email protected]
Mobile: 06 51 47 84 98
Office: 02 85 52 41 81
http://www.performance-conseil-informatique.net


