Hi Jan,

On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer <j...@schermer.cz> wrote:
> I never actually set up iSCSI with VMware, I just had to research various
> VMware storage options when we had a SAN problem at a former job... But I
> can take a look at it again if you want me to.

Thank you - I don't want to waste your time, as I have asked VMware TAP to
research this. I will communicate back whatever they respond with.

> Is it really deadlocked when this issue occurs?
> What I think is partly responsible for this situation is that the iSCSI
> LUN queues fill up, and that's what actually kills your IO - VMware lowers
> the queue depth to 1 in that situation, and it can take a really long time
> to recover (especially if one of the LUNs on the target constantly has
> problems, or when heavy IO hammers the adapter) - you should never fill
> this queue, ever. iSCSI will likely be an innocent victim in the chain,
> not the cause of the issues.

Completely agreed. iSCSI's job, then, is to properly communicate to the
initiator that it cannot do what it is asked to do, and to abort the IO.

> Ceph should gracefully handle all those situations, you just need to set
> the timeouts right. I have it set so that whatever happens, the OSD can
> only delay work for 40s before it is marked down - at that moment all IO
> starts flowing again.

Which Ceph setting do you use for that? Is it mon_osd_down_out_interval?
I think stopping slow OSDs is the answer to the root of the problem - so
far I only know to run "ceph osd perf" and look at latencies.
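
From my reading of the docs, mon_osd_down_out_interval only seems to
control how long a down OSD waits before it is also marked out; the down
marking itself appears to be driven by the heartbeat settings. If I
understand it right, a setup like yours would look something like the
following in ceph.conf (the values are my guesses, not tested):

[osd]
# peers report this OSD down to the monitors once it misses
# heartbeats for this long (default is 20s, I believe)
osd heartbeat grace = 40

[mon]
# a down OSD is additionally marked out after this long,
# which triggers rebalancing (default 300s, I believe)
mon osd down out interval = 300

Is that roughly what you have?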

> You should take this to VMware support, they should be able to tell
> whether the problem is in the iSCSI target (then you can take a look at
> how that behaves) or in the initiator settings. Though in my experience,
> after two visits from their "foremost experts" I had to google everything
> myself because they were clueless - YMMV.

I am hoping the TAP Elite team can do better... but we'll see.

> The root cause is, however, slow ops in Ceph, and I have no idea why
> you'd have them if the OSDs come back up - maybe one of them is really
> deadlocked or backlogged in some way? I found that when OSDs are "dead
> but up" they don't respond to "ceph tell osd.xxx ..." so try whether they
> all respond in a timely manner, that should help pinpoint the bugger.

I think I know the cause in this case - there are PCIe AER/bus errors and
TLP header messages strewn across the console of one OSD machine, and
"ceph osd perf" shows latencies above a second on its OSDs, but only while
IO is directed at them. I am thankful this is not production storage, but
I worry about this situation in production - the OSDs stay up and in, yet
their latencies slow clusterwide IO to a crawl. I am trying to envision
this situation in production: how would one find out what is slowing
everything down without guessing?
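
Following your "ceph tell" suggestion, here is a rough sketch of what I am
thinking of running periodically to flag "dead but up" OSDs (untested, and
the 5-second timeout is an arbitrary pick):

#!/bin/sh
# Probe every OSD with a harmless "ceph tell" command and flag any
# that do not answer within 5 seconds - a dead-but-up OSD should
# fail this even though "ceph osd tree" still shows it up and in.
for id in $(ceph osd ls); do
    if ! timeout 5 ceph tell osd.$id version > /dev/null 2>&1; then
        echo "osd.$id did not respond within 5s"
    fi
done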

Regards,
Alex

> Jan
>
>
>> On 24 Aug 2015, at 18:26, Alex Gorbachev <a...@iss-integration.com> wrote:
>>
>>> This can be tuned in the iSCSI initiator on VMware - look in the
>>> advanced settings on your ESX hosts (at least if you use the software
>>> initiator).
>>
>> Thanks, Jan. I asked this question of VMware as well. I think the
>> problem is specific to a given iSCSI session, so I wonder whether that
>> is strictly the job of the target. Do you know of any specific SCSI
>> settings that mitigate this kind of issue? Basically, give up on a
>> session, terminate it and start a new one should an RBD not respond?
>>
>> As I understand it, RBD simply never gives up. If an OSD does not
>> respond but is still technically up and in, Ceph will retry IOs
>> forever. I think RBD and Ceph need a timeout mechanism for this.
>>
>> Best regards,
>> Alex
>>
>>> Jan
>>>
>>>
>>>> On 23 Aug 2015, at 21:28, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>
>>>> Hi Alex,
>>>>
>>>> Currently RBD+LIO+ESX is broken.
>>>>
>>>> The problem is caused by the RBD device not handling device aborts
>>>> properly, causing LIO and ESXi to enter a death spiral together.
>>>>
>>>> If something in the Ceph cluster causes an IO to take longer than 10
>>>> seconds (I think!!!), ESXi submits an iSCSI abort message. Once this
>>>> happens, as you have seen, it never recovers.
>>>>
>>>> Mike Christie from Red Hat is doing a lot of work on this currently,
>>>> so hopefully in the future there will be a direct RBD interface into
>>>> LIO and it will all work much better.
>>>>
>>>> Either tgt or SCST seems to be pretty stable in testing.
>>>>
>>>> Nick
>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>> Of Alex Gorbachev
>>>>> Sent: 23 August 2015 02:17
>>>>> To: ceph-users <ceph-users@lists.ceph.com>
>>>>> Subject: [ceph-users] Slow responding OSDs are not OUTed and cause
>>>>> RBD client IO hangs
>>>>>
>>>>> Hello, this is an issue we have been suffering from and researching,
>>>>> along with a good number of other Ceph users, as evidenced by recent
>>>>> posts. In our specific case, these issues manifest themselves in an
>>>>> RBD -> iSCSI LIO -> ESXi configuration, but the problem is more
>>>>> general.
>>>>>
>>>>> When there is an issue on OSD nodes (examples: network hangs/blips,
>>>>> disk HBAs failing, driver issues, page cache/XFS issues), some OSDs
>>>>> respond slowly or with significant delays. ceph osd perf does not
>>>>> show this, and neither does ceph osd tree or ceph -s / ceph -w.
>>>>> Instead, the RBD IO hangs to the point where the client times out,
>>>>> crashes or displays other unsavory behavior - operationally this
>>>>> crashes production processes.
>>>>>
>>>>> Today in our lab we had a disk controller issue, which brought an
>>>>> OSD node down. Upon restart, the OSDs started up and rejoined the
>>>>> cluster. However, immediately all IOs started hanging for a long
>>>>> time, and aborts from ESXi -> LIO were not succeeding in canceling
>>>>> these IOs. The only warning I could see was:
>>>>>
>>>>> root@lab2-mon1:/var/log/ceph# ceph health detail
>>>>> HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
>>>>> 30 ops are blocked > 2097.15 sec
>>>>> 30 ops are blocked > 2097.15 sec on osd.4
>>>>> 1 osds have slow requests
>>>>>
>>>>> However, ceph osd perf is not showing high latency on osd.4:
>>>>>
>>>>> root@lab2-mon1:/var/log/ceph# ceph osd perf
>>>>> osd fs_commit_latency(ms) fs_apply_latency(ms)
>>>>>   0                    0                   13
>>>>>   1                    0                    0
>>>>>   2                    0                    0
>>>>>   3                  172                  208
>>>>>   4                    0                    0
>>>>>   5                    0                    0
>>>>>   6                    0                    1
>>>>>   7                    0                    0
>>>>>   8                  174                  819
>>>>>   9                    6                   10
>>>>>  10                    0                    1
>>>>>  11                    0                    1
>>>>>  12                    3                    5
>>>>>  13                    0                    1
>>>>>  14                    7                   23
>>>>>  15                    0                    1
>>>>>  16                    0                    0
>>>>>  17                    5                    9
>>>>>  18                    0                    1
>>>>>  19                   10                   18
>>>>>  20                    0                    0
>>>>>  21                    0                    0
>>>>>  22                    0                    1
>>>>>  23                    5                   10
>>>>>
>>>>> SMART state for the osd.4 disk is OK.
>>>>> The OSD is up and in:
>>>>>
>>>>> root@lab2-mon1:/var/log/ceph# ceph osd tree
>>>>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>> -8        0 root ssd
>>>>> -7 14.71997 root platter
>>>>> -3  7.12000     host croc3
>>>>> 22  0.89000         osd.22      up  1.00000          1.00000
>>>>> 15  0.89000         osd.15      up  1.00000          1.00000
>>>>> 16  0.89000         osd.16      up  1.00000          1.00000
>>>>> 13  0.89000         osd.13      up  1.00000          1.00000
>>>>> 18  0.89000         osd.18      up  1.00000          1.00000
>>>>>  8  0.89000         osd.8       up  1.00000          1.00000
>>>>> 11  0.89000         osd.11      up  1.00000          1.00000
>>>>> 20  0.89000         osd.20      up  1.00000          1.00000
>>>>> -4  0.47998     host croc2
>>>>> 10  0.06000         osd.10      up  1.00000          1.00000
>>>>> 12  0.06000         osd.12      up  1.00000          1.00000
>>>>> 14  0.06000         osd.14      up  1.00000          1.00000
>>>>> 17  0.06000         osd.17      up  1.00000          1.00000
>>>>> 19  0.06000         osd.19      up  1.00000          1.00000
>>>>> 21  0.06000         osd.21      up  1.00000          1.00000
>>>>>  9  0.06000         osd.9       up  1.00000          1.00000
>>>>> 23  0.06000         osd.23      up  1.00000          1.00000
>>>>> -2  7.12000     host croc1
>>>>>  7  0.89000         osd.7       up  1.00000          1.00000
>>>>>  2  0.89000         osd.2       up  1.00000          1.00000
>>>>>  6  0.89000         osd.6       up  1.00000          1.00000
>>>>>  1  0.89000         osd.1       up  1.00000          1.00000
>>>>>  5  0.89000         osd.5       up  1.00000          1.00000
>>>>>  0  0.89000         osd.0       up  1.00000          1.00000
>>>>>  4  0.89000         osd.4       up  1.00000          1.00000
>>>>>  3  0.89000         osd.3       up  1.00000          1.00000
>>>>>
>>>>> How can we proactively detect this condition? Is there anything I
>>>>> can run that will output all slow OSDs?
>>>>>
>>>>> Regards,
>>>>> Alex
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com