Erratum: sorry for the bad screenshot links. Here are the correct ones:
1st: https://supervision.pci-conseil.net/screenshot_LOAD.png
2nd: https://supervision.pci-conseil.net/screenshot_OSD_IO.png
:)
On 02/03/2017 at 15:34, [email protected] wrote:
Hello,
So, I could use some advice: a week ago (on 19 February), I upgraded my
stable Ceph Jewel cluster from 10.2.3 to 10.2.5 (yes, that was maybe a bad idea).
I never had a problem with Ceph 10.2.3 since the previous upgrade, on
23 September.
Since the upgrade to 10.2.5, every 2 days the first OSD server freezes
completely: the load goes above 500 and comes back down after a few
minutes… I lose all the OSDs on this server (12 of 36) during the incident.
It's very strange. So, some information:
_Infrastructure_:
3 OSD servers with 12 OSD disks each plus SSD journals, 3 mon servers,
and 3 Ceph RBD clients.
A dedicated 10G network for clients and a dedicated 10G network for the OSDs.
So 36 OSDs in total. Each server has 16 CPU cores (2x E5-2630v3) and 32 GB
RAM. No problem with resources.
Performance is good for 36 x 4 TB NL-SAS disks plus 1 write-intensive SSD
per OSD server.
_Issue:_
This morning (the previous incident was 2 days ago):
See screenshot:
http://www.performance-conseil-informatique.net/wp-content/uploads/2017/03/screenshot_LOAD-1.png
As you can see, there is little I/O (just 2 clients, sometimes writing
150 MB/s for a few minutes); it's a big NAS for cold data.
So during the incident there was no I/O, which is strange. Same for the
previous incidents.
See screenshot:
http://www.performance-conseil-informatique.net/wp-content/uploads/2017/03/screenshot_OSD_IO.png
Before the incident: no activity. You can see OSD reads, OSD writes,
the SSD journal, and I/O wait for all OSDs.
07:07 => 07:09: 2 minutes with 12 of 36 OSDs completely lost. They come
back afterwards, but I need to fix this.
During the incident window, scrubbing was stopped and the nightly trim
had already finished… no I/O.
No other cron jobs on the server, nothing. All the servers have the same
configuration.
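One more thing I want to check on the frozen host is the kernel log: a
load above 500 with no I/O usually means hundreds of threads stuck in
uninterruptible sleep, and the kernel logs "blocked for more than 120
seconds" warnings for those. Something like:

dmesg -T | grep -i 'blocked for more'
grep -i 'blocked for more' /var/log/messages

(Just an idea on my side; if those warnings show up, the stall is below
Ceph, in the kernel or storage stack.)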
*LOGS:*
A lot of lines like this:
ceph-osd.3.log:2017-03-02 07:09:32.061754 7f6d501e4700 -1 osd.3 14557
heartbeat_check: no reply from 0x7f6dadb48c10 osd.19 since back
2017-03-02 07:07:53.286880 front 2017-03-02 07:07:53.286880 (cutoff
2017-03-02 07:09:12.061690)
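Reading the timestamps: the cutoff 07:09:12.06 is exactly 20 s before the
check at 07:09:32.06, which matches the default osd_heartbeat_grace of
20 s (assuming I have not overridden it anywhere). The last reply from
osd.19 was at 07:07:53, about 99 s before the check, and the front and
back timestamps are identical, so the peer was unreachable on both the
public and cluster networks at once; that points to the whole host
stalling rather than one NIC or switch.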
Sometimes:
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7fc38a5e9425]
2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x2e1) [0x7fc38a528de1]
3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7fc38a52963e]
4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7fc38a529e1c]
5: (CephContextServiceThread::entry()+0x15b) [0x7fc38a6011ab]
6: (()+0x7dc5) [0x7fc388304dc5]
7: (clone()+0x6d) [0x7fc38698f73d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
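My understanding is that this assert fires when an internal OSD worker
thread has not checked in with the HeartbeatMap for longer than its
suicide timeout, so the daemon aborts itself. If I remember the Jewel
defaults correctly, osd_op_thread_suicide_timeout is 150 s and
filestore_op_thread_suicide_timeout is 180 s, so a thread must have been
blocked for minutes. The values actually in effect can be read from the
admin socket on the OSD host:

ceph daemon osd.3 config show | grep -E 'suicide|heartbeat_grace'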
_*Questions:*_ Why does only the first OSD server freeze? The three
servers are strictly identical. What could be freezing the server and
driving the load up?
There have already been 4 freezes since the upgrade. Today I will raise
the log level and restart everything to get more logs, probably along
these lines (injectargs applies at runtime, so the daemons pick it up
immediately):
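ceph tell osd.* injectargs '--debug_osd 10/10 --debug_ms 1/1'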
Any idea how to troubleshoot? I am already digging through the sar
statistics; for example, to replay this morning's window from the daily
sysstat file (default EL7 collection in /var/log/sa, where sa02 is
2 March):
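sar -d -p -f /var/log/sa/sa02 -s 07:00:00 -e 07:15:00   # per-disk I/O
sar -u ALL -f /var/log/sa/sa02 -s 07:00:00 -e 07:15:00  # CPU and iowait
sar -n DEV -f /var/log/sa/sa02 -s 07:00:00 -e 07:15:00  # network interfaces

One caveat: the default collector samples only every 10 minutes, which
can easily miss a 2-minute stall, so I may also run "sar -d -p 1" in a
loop during the risk window.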
Maybe something changed with the heartbeat between 10.2.3 and 10.2.5? As
a test I could temporarily raise the grace period so that slow replies
do not mark OSDs down so aggressively:
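ceph tell osd.* injectargs '--osd_heartbeat_grace 60'

(That only changes the running daemons; to persist it I would set it
under [osd] in ceph.conf. And it would only mask the symptom; the real
question is why the host stalls.)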
Should I think about downgrading to 10.2.3? Or upgrading to Kraken?
Thanks for your help,
Regards,
*Other things:*
rpm -qa|grep ceph
libcephfs1-10.2.5-0.el7.x86_64
ceph-common-10.2.5-0.el7.x86_64
ceph-mon-10.2.5-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-10.2.5-0.el7.x86_64
ceph-radosgw-10.2.5-0.el7.x86_64
ceph-selinux-10.2.5-0.el7.x86_64
ceph-mds-10.2.5-0.el7.x86_64
python-cephfs-10.2.5-0.el7.x86_64
ceph-base-10.2.5-0.el7.x86_64
ceph-osd-10.2.5-0.el7.x86_64
uname -a
Linux ceph-osd-03 3.10.0-514.6.2.el7.x86_64 #1 SMP Thu Feb 23 03:04:39
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ceph conf:
[global]
fsid = d26f269b-852f-4181-821d-756f213ae155
mon_initial_members = ceph-mon-01, ceph-mon-02, ceph-mon-03
mon_host = 192.168.43.147,192.168.43.148,192.168.43.149
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
max_open_files = 131072
public_network = 192.168.43.0/24
cluster_network = 192.168.44.0/24
osd_journal_size = 13000
osd_pool_default_size = 2 # Write an object n times.
osd_pool_default_min_size = 2 # Allow writing n copy in a degraded state.
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_crush_chooseleaf_type = 8
cephx_cluster_require_signatures = true
cephx_service_require_signatures = false
mon_pg_warn_max_object_skew = 0
mon_pg_warn_max_per_osd = 0
[mon]
[osd]
osd_max_backfills = 1
osd_recovery_priority = 3
osd_recovery_max_active = 3
osd_recovery_max_start = 3
filestore merge threshold = 40
filestore split multiple = 8
filestore xattr use omap = true
osd op threads = 8
osd disk threads = 4
osd op num threads per shard = 3
osd op num shards = 10
osd map cache size = 1024
osd_enable_op_tracker = false
osd_scrub_begin_hour = 20
osd_scrub_end_hour = 6
[client]
rbd_cache = true
rbd cache size = 67108864
rbd cache max dirty = 50331648
rbd cache target dirty = 33554432
rbd cache max dirty age = 2
rbd cache writethrough until flush = true
rbd readahead trigger requests = 10 # number of sequential requests necessary to trigger readahead
rbd readahead max bytes = 524288 # maximum size of a readahead request, in bytes
rbd readahead disable after bytes = 52428800
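(For reference, those byte values work out to: 67108864 = 64 MiB cache
size, 50331648 = 48 MiB max dirty, 33554432 = 32 MiB target dirty,
524288 = 512 KiB max readahead, and 52428800 = 50 MiB readahead disable
threshold.)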
--
*Performance Conseil Informatique*
Pascal Pucci
Consultant Infrastructure
[email protected] <mailto:[email protected]>
Mobile : 06 51 47 84 98
Bureau : 02 85 52 41 81
http://www.performance-conseil-informatique.net
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com