Just reporting back on my findings.

After making these changes the flapping occurred just once during the night. To fix it further I changed the heartbeat grace to 120 seconds. I also matched osd_op_threads and filestore_op_threads to the core count. (A consolidated sketch of the resulting settings is appended at the end of this thread.)

Br,
T

From: ceph-users [mailto:[email protected]] On Behalf Of Tuomas Juntunen
Sent: July 2, 2015 16:23
To: 'Somnath Roy'; 'ceph-users'
Subject: Re: [ceph-users] One of our nodes has logs saying: wrongly marked me down

Thanks, I'll test these values, and also raise the OSD heartbeat grace to 60 seconds instead of 20; hopefully that will help with the latency during deep scrub.

I changed shards to 6 and shard threads to 2, so it matches the physical cores on the server, not including hyperthreading.

Br,
T

From: Somnath Roy [mailto:[email protected]]
Sent: July 2, 2015 6:29
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down

Yeah, this can happen during deep_scrub and also during rebalancing; I forgot to mention that. Generally, it is a good idea to throttle those. For deep scrub, you can try using (got it from an old post, I never used it):

osd_scrub_chunk_min = 1
osd_scrub_chunk_max = 1
osd_scrub_sleep = 0.1

For rebalancing I think you are already using proper values. I don't think this will eliminate the scenario altogether, but it should alleviate it a bit.

Also, why are you using so many shards? How many OSDs are you running in a box? 25 shards should be good if you are running a single OSD; if you have a lot of OSDs in a box, try to reduce it to ~5 or so.

Thanks & Regards
Somnath

From: Tuomas Juntunen [mailto:[email protected]]
Sent: Wednesday, July 01, 2015 8:18 PM
To: Somnath Roy; 'ceph-users'
Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down

I've checked the network: we use IPoIB, all nodes are connected to the same switch, and there are no breaks in connectivity while this happens. My constant ping shows 0.03-0.1 ms, which I would say is fine.

This happens almost every time deep scrubbing is running. The load on this particular server goes to 300+ and OSDs are marked down. Any suggestions on settings? I now have the following settings that might affect this:

[global]
osd_op_threads = 6
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
#osd_op_num_sharded_pool_threads = 25
filestore_op_threads = 6
ms_nocrc = true
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

[osd]
osd scrub load threshold = 0.1
osd max backfills = 1
osd recovery max active = 1
osd scrub sleep = .1
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
osd scrub chunk max = 5
osd deep scrub stride = 1048576
filestore queue max ops = 10000
filestore max sync interval = 30
filestore min sync interval = 29
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false

Br,
T

From: Somnath Roy [mailto:[email protected]]
Sent: July 2, 2015 0:30
To: Tuomas Juntunen; 'ceph-users'
Subject: RE: [ceph-users] One of our nodes has logs saying: wrongly marked me down

This can happen if your OSDs are flapping. Hope your network is stable.

Thanks & Regards
Somnath

From: ceph-users [mailto:[email protected]] On Behalf Of Tuomas Juntunen
Sent: Wednesday, July 01, 2015 2:24 PM
To: 'ceph-users'
Subject: [ceph-users] One of our nodes has logs saying: wrongly marked me down

Hi

One of our nodes has OSD logs that say "wrongly marked me down" for every OSD at some point. What could be the reason for this?
Anyone have any similar experiences? The other nodes work totally fine and they are all identical.

Br,
T
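
For anyone hitting the same "wrongly marked me down" flapping: here is a minimal sketch of how the combined changes from this thread might look in ceph.conf, assuming a 6-core OSD node as described above. The values are the ones quoted in the thread; treat them as a starting point to test on your own hardware, not a general recommendation.

[osd]
# give lagging OSDs more time to answer heartbeats before peers report them down
osd heartbeat grace = 120
# match worker threads to physical cores (6 on this node)
osd op threads = 6
filestore op threads = 6
# throttle deep scrub so it does not starve client I/O and heartbeats
osd scrub chunk min = 1
osd scrub chunk max = 1
osd scrub sleep = 0.1
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7
# keep recovery/backfill gentle
osd max backfills = 1
osd recovery max active = 1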
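If restarting the OSDs is not convenient, the same options can usually be injected at runtime with ceph tell; this was not discussed in the thread, and the exact syntax varies by release, so double-check against your version:

ceph tell osd.* injectargs '--osd_heartbeat_grace 120 --osd_scrub_sleep 0.1'

Injected values do not survive a daemon restart, so they still need to go into ceph.conf afterwards.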
