Hi Jan,

On Mon, Aug 24, 2015 at 12:40 PM, Jan Schermer <j...@schermer.cz> wrote:
> I never actually set up iSCSI with VMware, I just had to research various
> VMware storage options when we had a SAN problem at a former job... But I
> can take a look at it again if you want me to.

Thank you - I don't want to waste your time, as I have asked VMware TAP to
research this. I will communicate back whatever they respond with.

> Is it really deadlocked when this issue occurs?
> What I think is partly responsible for this situation is that the iSCSI
> LUN queues fill up, and that's what actually kills your IO - VMware lowers
> the queue depth to 1 in that situation, and it can take a really long time
> to recover (especially if one of the LUNs on the target constantly has
> problems, or when heavy IO hammers the adapter) - you should never fill
> this queue, ever. iSCSI will likely be an innocent victim in the chain,
> not the cause of the issues.

Completely agreed. iSCSI's job, then, is to properly communicate to the
initiator that it cannot do what it is asked to do, and to abort the IO.

> Ceph should gracefully handle all those situations, you just need to set
> the timeouts right. I have it set so that whatever happens, the OSD can
> only delay work for 40s before it is marked down - at that moment all IO
> starts flowing again.

Which Ceph setting do you use for that? Is it mon_osd_down_out_interval?
I think stopping slow OSDs is the answer to the root of the problem - so
far I only know to run "ceph osd perf" and look at latencies.
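
From my reading of the docs, mon_osd_down_out_interval only seems to
control how long a down OSD waits before it is also marked out; the down
marking itself appears to be driven by the heartbeat settings. If I
understand it right, a setup like yours would look something like the
following in ceph.conf (the values are my guesses, not tested):

[osd]
# peers report this OSD down to the monitors once it misses
# heartbeats for this long (default is 20s, I believe)
osd heartbeat grace = 40

[mon]
# a down OSD is additionally marked out after this long,
# which triggers rebalancing (default 300s, I believe)
mon osd down out interval = 300

Is that roughly what you have?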

> You should take this to VMware support, they should be able to tell
> whether the problem is in the iSCSI target (then you can take a look at
> how that behaves) or in the initiator settings. Though in my experience,
> after two visits from their "foremost experts" I had to google everything
> myself because they were clueless - YMMV.

I am hoping the TAP Elite team can do better... but we'll see.

> The root cause is, however, slow ops in Ceph, and I have no idea why
> you'd have them if the OSDs come back up - maybe one of them is really
> deadlocked or backlogged in some way? I found that when OSDs are "dead
> but up" they don't respond to "ceph tell osd.xxx ..." so try whether they
> all respond in a timely manner, that should help pinpoint the bugger.

I think I know the cause in this case - there are PCIe AER/bus errors and
TLP header messages strewn across the console of one OSD machine, and
"ceph osd perf" shows latencies above a second on its OSDs, but only while
IO is directed at them. I am thankful this is not production storage, but
I worry about this situation in production - the OSDs stay up and in, yet
their latencies slow clusterwide IO to a crawl. I am trying to envision
this situation in production: how would one find out what is slowing
everything down without guessing?
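
Following your "ceph tell" suggestion, here is a rough sketch of what I am
thinking of running periodically to flag "dead but up" OSDs (untested, and
the 5-second timeout is an arbitrary pick):

#!/bin/sh
# Probe every OSD with a harmless "ceph tell" command and flag any
# that do not answer within 5 seconds - a dead-but-up OSD should
# fail this even though "ceph osd tree" still shows it up and in.
for id in $(ceph osd ls); do
    if ! timeout 5 ceph tell osd.$id version > /dev/null 2>&1; then
        echo "osd.$id did not respond within 5s"
    fi
done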

Regards,
Alex

> Jan
>
>
>> On 24 Aug 2015, at 18:26, Alex Gorbachev <a...@iss-integration.com> wrote:
>>
>>> This can be tuned in the iSCSI initiator on VMware - look in the
>>> advanced settings on your ESX hosts (at least if you use the software
>>> initiator).
>>
>> Thanks, Jan. I asked this question of VMware as well. I think the
>> problem is specific to a given iSCSI session, so I wonder whether that
>> is strictly the job of the target. Do you know of any specific SCSI
>> settings that mitigate this kind of issue? Basically, give up on a
>> session, terminate it and start a new one should an RBD not respond?
>>
>> As I understand it, RBD simply never gives up. If an OSD does not
>> respond but is still technically up and in, Ceph will retry IOs
>> forever. I think RBD and Ceph need a timeout mechanism for this.
>>
>> Best regards,
>> Alex
>>
>>> Jan
>>>
>>>
>>>> On 23 Aug 2015, at 21:28, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>
>>>> Hi Alex,
>>>>
>>>> Currently RBD+LIO+ESX is broken.
>>>>
>>>> The problem is caused by the RBD device not handling device aborts
>>>> properly, causing LIO and ESXi to enter a death spiral together.
>>>>
>>>> If something in the Ceph cluster causes an IO to take longer than 10
>>>> seconds (I think!!!), ESXi submits an iSCSI abort message. Once this
>>>> happens, as you have seen, it never recovers.
>>>>
>>>> Mike Christie from Red Hat is doing a lot of work on this currently,
>>>> so hopefully in the future there will be a direct RBD interface into
>>>> LIO and it will all work much better.
>>>>
>>>> Either tgt or SCST seems to be pretty stable in testing.
>>>>
>>>> Nick
>>>>
>>>>> -----Original Message-----
>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>>>>> Of Alex Gorbachev
>>>>> Sent: 23 August 2015 02:17
>>>>> To: ceph-users <ceph-users@lists.ceph.com>
>>>>> Subject: [ceph-users] Slow responding OSDs are not OUTed and cause
>>>>> RBD client IO hangs
>>>>>
>>>>> Hello, this is an issue we have been suffering from and researching,
>>>>> along with a good number of other Ceph users, as evidenced by recent
>>>>> posts. In our specific case, these issues manifest themselves in an
>>>>> RBD -> iSCSI LIO -> ESXi configuration, but the problem is more
>>>>> general.
>>>>>
>>>>> When there is an issue on OSD nodes (examples: network hangs/blips,
>>>>> disk HBAs failing, driver issues, page cache/XFS issues), some OSDs
>>>>> respond slowly or with significant delays. ceph osd perf does not
>>>>> show this, and neither does ceph osd tree or ceph -s / ceph -w.
>>>>> Instead, the RBD IO hangs to the point where the client times out,
>>>>> crashes or displays other unsavory behavior - operationally this
>>>>> crashes production processes.
>>>>>
>>>>> Today in our lab we had a disk controller issue, which brought an
>>>>> OSD node down. Upon restart, the OSDs started up and rejoined the
>>>>> cluster. However, immediately all IOs started hanging for a long
>>>>> time, and aborts from ESXi -> LIO were not succeeding in canceling
>>>>> these IOs. The only warning I could see was:
>>>>>
>>>>> root@lab2-mon1:/var/log/ceph# ceph health detail
>>>>> HEALTH_WARN 30 requests are blocked > 32 sec; 1 osds have slow requests
>>>>> 30 ops are blocked > 2097.15 sec
>>>>> 30 ops are blocked > 2097.15 sec on osd.4
>>>>> 1 osds have slow requests
>>>>>
>>>>> However, ceph osd perf is not showing high latency on osd.4:
>>>>>
>>>>> root@lab2-mon1:/var/log/ceph# ceph osd perf
>>>>> osd fs_commit_latency(ms) fs_apply_latency(ms)
>>>>>   0                    0                   13
>>>>>   1                    0                    0
>>>>>   2                    0                    0
>>>>>   3                  172                  208
>>>>>   4                    0                    0
>>>>>   5                    0                    0
>>>>>   6                    0                    1
>>>>>   7                    0                    0
>>>>>   8                  174                  819
>>>>>   9                    6                   10
>>>>>  10                    0                    1
>>>>>  11                    0                    1
>>>>>  12                    3                    5
>>>>>  13                    0                    1
>>>>>  14                    7                   23
>>>>>  15                    0                    1
>>>>>  16                    0                    0
>>>>>  17                    5                    9
>>>>>  18                    0                    1
>>>>>  19                   10                   18
>>>>>  20                    0                    0
>>>>>  21                    0                    0
>>>>>  22                    0                    1
>>>>>  23                    5                   10
>>>>>
>>>>> SMART state for the osd.4 disk is OK.
>>>>> The OSD is up and in:
>>>>>
>>>>> root@lab2-mon1:/var/log/ceph# ceph osd tree
>>>>> ID WEIGHT   TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>>>> -8        0 root ssd
>>>>> -7 14.71997 root platter
>>>>> -3  7.12000     host croc3
>>>>> 22  0.89000         osd.22      up  1.00000          1.00000
>>>>> 15  0.89000         osd.15      up  1.00000          1.00000
>>>>> 16  0.89000         osd.16      up  1.00000          1.00000
>>>>> 13  0.89000         osd.13      up  1.00000          1.00000
>>>>> 18  0.89000         osd.18      up  1.00000          1.00000
>>>>>  8  0.89000         osd.8       up  1.00000          1.00000
>>>>> 11  0.89000         osd.11      up  1.00000          1.00000
>>>>> 20  0.89000         osd.20      up  1.00000          1.00000
>>>>> -4  0.47998     host croc2
>>>>> 10  0.06000         osd.10      up  1.00000          1.00000
>>>>> 12  0.06000         osd.12      up  1.00000          1.00000
>>>>> 14  0.06000         osd.14      up  1.00000          1.00000
>>>>> 17  0.06000         osd.17      up  1.00000          1.00000
>>>>> 19  0.06000         osd.19      up  1.00000          1.00000
>>>>> 21  0.06000         osd.21      up  1.00000          1.00000
>>>>>  9  0.06000         osd.9       up  1.00000          1.00000
>>>>> 23  0.06000         osd.23      up  1.00000          1.00000
>>>>> -2  7.12000     host croc1
>>>>>  7  0.89000         osd.7       up  1.00000          1.00000
>>>>>  2  0.89000         osd.2       up  1.00000          1.00000
>>>>>  6  0.89000         osd.6       up  1.00000          1.00000
>>>>>  1  0.89000         osd.1       up  1.00000          1.00000
>>>>>  5  0.89000         osd.5       up  1.00000          1.00000
>>>>>  0  0.89000         osd.0       up  1.00000          1.00000
>>>>>  4  0.89000         osd.4       up  1.00000          1.00000
>>>>>  3  0.89000         osd.3       up  1.00000          1.00000
>>>>>
>>>>> How can we proactively detect this condition? Is there anything I
>>>>> can run that will output all slow OSDs?
>>>>>
>>>>> Regards,
>>>>> Alex
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com