Hi,

> On 24 May 2016, at 09:16, Christian Balzer <ch...@gol.com> wrote:
> 
> 
> Hello,
> 
> On Tue, 24 May 2016 07:03:25 +0000 Josef Johansson wrote:
> 
>> Hi,
>> 
>> You need to monitor latency instead of peak points. Since Ceph writes to
>> two other nodes when you have 3 replicas, that adds up to 4x the latency of
>> a single round trip from the client to the first OSD. So smaller and more
>> numerous IOs mean more latency pain.
>> 
> While very true, I don't think latency (as in network/Ceph code related)
> is causing his problems.
> 
> 30+ second slow requests tend to be nearly exclusively the domain of the
> I/O system being unable to keep up.
> And I'm pretty sure it's the HDD-based EC pool.
> 
> Monitoring not just the Ceph counters but also the actual disk stats can be
> helpful, keeping in mind that it takes only one slow OSD, node, or wonky
> link to bring everything to a standstill. 
> 

I agree, more counters are needed to identify the problem. Is there any way 
nowadays to automatically find out which OSD is causing the slow requests? 
Like, a graphical way of doing it :)
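
Not that I know of out of the box, so here is a minimal sketch of what I have 
in mind: poll 'ceph osd perf' and surface the OSDs with the worst latency. The 
JSON key names (osd_perf_infos, perf_stats, commit/apply_latency_ms) are what 
I recall from Jewel, so treat them as assumptions and adjust for your release.

    #!/usr/bin/env python
    # Rough sketch: poll 'ceph osd perf' and print the slowest OSDs.
    # Assumes the Jewel JSON layout (osd_perf_infos -> perf_stats); adjust
    # the key names if your release formats the output differently.
    import json
    import subprocess
    import time

    def worst_osds(count=5):
        out = subprocess.check_output(['ceph', 'osd', 'perf', '-f', 'json'])
        infos = json.loads(out).get('osd_perf_infos', [])
        # Highest commit latency first.
        infos.sort(key=lambda o: o['perf_stats']['commit_latency_ms'],
                   reverse=True)
        return infos[:count]

    while True:
        print(time.strftime('%H:%M:%S'))
        for o in worst_osds():
            stats = o['perf_stats']
            print('  osd.%-3d commit %4d ms  apply %4d ms'
                  % (o['id'], stats['commit_latency_ms'],
                     stats['apply_latency_ms']))
        time.sleep(10)

Feeding the same numbers into graphite per OSD would give the graphical view; 
an OSD that consistently tops that list is usually the one behind the slow 
requests.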

>> And the worst thing is that there is nothing that actually shows this
>> AFAIK; ceph osd perf shows some latency. A slow CPU could hamper
>> performance even if it shows no sign of it.
>> 
>> I believe you can see which operations are running right now and what
>> they are waiting on; I think there's a thread on this ML regarding
>> deadlocks and slow requests that could be interesting.
>> 
>> I did not see Christian's responses either, so maybe it's not a problem
>> with your client.
>> 
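
(The "see which operations are running and what they are waiting on" part 
above would be the OSD admin socket, I assume: ceph daemon osd.N 
dump_ops_in_flight / dump_historic_ops. A minimal sketch, to be run on the 
OSD node itself; the field names are what I recall from Jewel, so treat them 
as assumptions.)

    #!/usr/bin/env python
    # Rough sketch: dump in-flight ops on a local OSD and show the oldest.
    # Must run on the OSD node itself (talks to the admin socket via
    # 'ceph daemon'). Field names ('ops', 'age', 'description') are assumed
    # from Jewel output; adjust for your release.
    import json
    import subprocess
    import sys

    osd_id = sys.argv[1] if len(sys.argv) > 1 else '0'
    out = subprocess.check_output(['ceph', 'daemon', 'osd.%s' % osd_id,
                                   'dump_ops_in_flight'])
    ops = json.loads(out).get('ops', [])
    # Oldest ops first - these are the ones about to become slow requests.
    ops.sort(key=lambda op: op.get('age', 0), reverse=True)
    for op in ops[:10]:
        print('%8.2fs  %s' % (op.get('age', 0), op.get('description', '?')))
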
> The list (archive) and I sure did:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009464.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009744.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009747.html
> 

So it was under another Subject line. That explains it.

Regards,
Josef

> Christian
> 
>> Regards,
>> Josef
>> 
>> On Tue, 24 May 2016, 08:49 Peter Kerdisle, <peter.kerdi...@gmail.com>
>> wrote:
>> 
>>> Hey Christian,
>>> 
>>> I honestly haven't seen any replies to my earlier message. I will go
>>> through my email and make sure I find them; my apologies.
>>> 
>>> I am graphing everything with collectd and graphite, which is what makes
>>> it so frustrating, since I am not seeing any obvious pain points.
>>> 
>>> I am basically using the pool in read-forward mode now, so there should
>>> be almost no promotion from the EC pool to the SSD pool. I will see what
>>> options I have for adding some SSD journals to the OSD nodes to help
>>> speed things along.
>>> 
>>> Thanks, and apologies again for missing your earlier replies.
>>> 
>>> Peter
>>> 
>>> On Tue, May 24, 2016 at 4:25 AM, Christian Balzer <ch...@gol.com>
>>> wrote:
>>> 
>>>> 
>>>> Hello,
>>>> 
>>>> On Mon, 23 May 2016 10:45:41 +0200 Peter Kerdisle wrote:
>>>> 
>>>>> Hey,
>>>>> 
>>>>> Sadly I'm still battling this issue. I did notice one interesting
>>>>> thing.
>>>>> 
>>>>> I changed the cache settings for my cache tier to add redundancy to
>>>>> the pool, which means a lot of recovery activity on the cache. During
>>>>> all this there were absolutely no slow requests reported. Is there
>>>>> anything I can conclude from that information? Is it possible that
>>>>> not having any SSDs for journals could be the bottleneck on my
>>>>> erasure pool and that's generating slow requests?
>>>>> 
>>>> That's what I suggested in the first of my 3 replies to your original
>>>> thread, which seemingly got ignored judging by the lack of a reply.
>>>> 
>>>>> I simply can't imagine why a request can be blocked for 30 or even
>>>>> 60 seconds. It's getting really frustrating not being able to fix this,
>>>>> and I simply don't know what else I can do at this point.
>>>>> 
>>>> Are you monitoring your cluster, especially the HDD nodes?
>>>> Permanently with collectd and graphite (or similar) and topically
>>>> (especially during tests) with atop (or iostat, etc)?
>>>> 
>>>> And one more time, your cache tier can only help you if it is fast and
>>>> large enough to sufficiently disengage you from your slow backing
>>>> storage.
>>>> And let's face it, an EC pool AND no journal SSDs will be quite slow.
>>>> 
>>>> So if your cache is dirty all the time and has to flush (causing IO
>>>> on the backing storage) while there is also promotion from the
>>>> backing storage to the cache going on, you're basically down to the
>>>> base speed of your EC pool, at least for some of your ops.
>>>> 
>>>> Christian
>>>> 
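
(For reference, the flush/promotion interplay Christian describes above is 
driven by a handful of cache-pool settings. A hedged sketch of setting them; 
the pool name 'ssd-cache' and the values are made-up placeholders, size them 
for your own cluster.)

    #!/usr/bin/env python
    # Rough sketch of the cache-tier knobs that control when flushing and
    # eviction start, applied via the CLI. Pool name and values are
    # placeholders - size them for your own cluster.
    import subprocess

    CACHE_POOL = 'ssd-cache'  # hypothetical cache pool name

    settings = {
        'target_max_bytes': str(2 * 1024 ** 4),  # ~2 TiB of usable cache
        'cache_target_dirty_ratio': '0.4',       # start flushing at 40% dirty
        'cache_target_full_ratio': '0.8',        # start evicting at 80% full
    }

    for key, value in settings.items():
        subprocess.check_call(['ceph', 'osd', 'pool', 'set',
                               CACHE_POOL, key, value])
        print('set %s = %s' % (key, value))

(Keeping cache_target_dirty_ratio well below cache_target_full_ratio gives 
the tier room to flush gradually instead of in one burst once it fills.)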
>>>>> If anybody has anything I haven't tried before please let me know.
>>>>> 
>>>>> Peter
>>>>> 
>>>>> On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle
>>>>> <peter.kerdi...@gmail.com> wrote:
>>>>> 
>>>>>> Hey guys,
>>>>>> 
>>>>>> I'm running into an issue with my cluster during high activity.
>>>>>> 
>>>>>> I have two SSD cache servers (2 SSDs for journals, 7 SSDs for
>>>>>> data) with 2x10Gbit bonded each, and six OSD nodes with a 10Gbit
>>>>>> public and 10Gbit cluster network for the erasure pool (10x3TB
>>>>>> without separate journal). This is all on Jewel.
>>>>>> 
>>>>>> It's working fine under normal load. However, when I force increased
>>>>>> activity by lowering the cache_target_dirty_ratio to make sure my
>>>>>> files are promoted, things start to go amiss.
>>>>>> 
>>>>>> To give an example:
>>>>>> http://pastie.org/private/5k5ml6a8gqkivjshgjcedg
>>>>>> 
>>>>>> This is especially concerning: pgs: 9 activating+undersized+degraded,
>>>>>> 48 active+undersized+degraded, 1 stale+active+clean, 27 peering.
>>>>>> 
>>>>>> Here is another minute or so where I grepped for warnings:
>>>>>> http://pastie.org/private/bfv3kxl63cfcduafoaurog
>>>>>> 
>>>>>> These warnings are generated all over the OSD nodes, not
>>>>>> specifically one OSD or even one node.
>>>>>> 
>>>>>> During this time the different OSD logs show various warnings:
>>>>>> 
>>>>>> 2016-05-05 10:04:10.873603 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.873605 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.905997 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.906000 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.906022 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.906027 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.949894 7f1af3f2d700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.956801 7f1afc73e700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.956833 7f1afaf3b700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.957816 7f1af5f31700  1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 
>>>>>> or
>>>>>> 
>>>>>> 2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
>>>>>> 2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
>>>>>> 2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
>>>>>> 2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
>>>>>> 2016-05-05 10:04:06.275237 7f98cca35700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
>>>>>> 2016-05-05 10:04:06.275252 7f98cca35700  0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent
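
(Those "no reply from osd.25" lines are worth tallying across all the OSD 
logs; if one peer keeps showing up, that points at that OSD, its node, or the 
link in between, per Christian's earlier point. A quick sketch, assuming the 
default log location.)

    #!/usr/bin/env python
    # Rough sketch: tally 'heartbeat_check: no reply from osd.N' lines in
    # the local OSD logs to see whether one peer OSD keeps showing up.
    # Log path is the usual default; adjust as needed.
    import collections
    import glob
    import re

    pattern = re.compile(r'heartbeat_check: no reply from (osd\.\d+)')
    counts = collections.Counter()

    for path in glob.glob('/var/log/ceph/ceph-osd.*.log'):
        with open(path) as log:
            for line in log:
                match = pattern.search(line)
                if match:
                    counts[match.group(1)] += 1

    for peer, hits in counts.most_common(10):
        print('%-8s %d missed heartbeat replies' % (peer, hits))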
>>>>>> 
>>>>>> 
>>>>>> I've followed the instructions on
>>>>>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>>>>> to hopefully find out what's happening, but as far as the hardware is
>>>>>> concerned everything looks fine. No SMART errors logged, iostat shows
>>>>>> some activity but nothing pegged at 100%, no messages in dmesg, and
>>>>>> the CPUs are only at around 25% usage at most.
>>>>>> 
>>>>>> This makes me think it might be network related. I have a 10Gbit
>>>>>> public and 10Gbit cluster network, neither of which seems to hit any
>>>>>> limits. There is one thing that might be a problem, and that is
>>>>>> that one of the cache nodes has a bonded interface and no access
>>>>>> to the cluster network, while the other cache node has separate
>>>>>> public and cluster interfaces.
>>>>>> 
>>>>>> Could anybody give me some more steps I can take to further
>>>>>> discover where this bottleneck lies?
>>>>>> 
>>>>>> Thanks in advance,
>>>>>> 
>>>>>> Peter
>>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>>> 
>>> 
>>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
