After introducing Veeam into our environment (the SAN has been the backend for both Hyper-V and VMware hosts for 2+ years with minimal issues), I've been trying to deal with a new intermittent issue. SAN traffic is via IB/SRP presented by COMSTAR, and I am currently running r151012.
What happens is that when my Veeam backup jobs run (at this point I had one big job of ~60 VMs), latency on the datastore holding the repository deteriorates. When I first saw this, it seemed to affect only the VM running the Veeam proxy/repository (esxtop was showing high QAVG only for that VM). I would vMotion that VM to another host and everything would work normally again (I would then put the original host into maintenance mode and reboot it to get it out of this "state"). This suggested it was not a SAN issue, but something isolated to the VM and/or host.

After updating to the latest Veeam B&R version (Update 2), the issue persisted. Then I updated to vSphere 5.5 Update 2e, and the issue went away for about two weeks (and my Veeam incident was closed). At that point my backup job tried to merge the incremental backup back into the full, and the issue reappeared. I was able to vMotion the VM to another host and get the job completed. I also noticed after Update 2e that the latency appeared on any datastore the VM had access to (had hot-added), and subsequently noticed via CrystalDiskMark benchmarks that it affected all VMs on the host.

I reopened the case with Veeam and took their recommendation to split the job up into several smaller jobs. This seemed to work for a few days, until I caught it doing it again. This time I opened a case with VMware - while I was on hold, the server got out of the condition by itself (no more latency; I was actually in the middle of running CrystalDiskMark benchmarks on a few VMs, on and off the affected host).
The VMware engineer took a look at the vmkernel logs and highlighted the following entries:

2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276395 req=7a done=0 tsk_status=255 req_lim=73 tx_head=106811312 tx_tail=106811184
2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x927639a req=65 on lun 0 state 0
2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x927639a req=65 done=0 tsk_status=255 req_lim=72 tx_head=106811313 tx_tail=106811185
2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276396 req=4f on lun 11 state 0
2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276396 req=4f done=0 tsk_status=255 req_lim=71 tx_head=106811314 tx_tail=106811186
2015-06-23T16:11:07.942Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276397 req=36 on lun 10 state 0
2015-06-23T16:11:08.016Z cpu19:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[2a] SN 0x9276399 req=31 on lun 6 state 0
2015-06-23T16:05:16.110Z cpu26:32839)ScsiDeviceIO: 2325: Cmd(0x4136c28516c0) 0x88, CmdSN 0x8000001a from world 35997 to dev "naa.600144f09084e251000052919b460003" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2015-06-23T16:05:18.253Z cpu20:36138)ScsiDeviceIO: 2325: Cmd(0x4136c0428d40) 0x2a, CmdSN 0xa389000 from world 35970 to dev "naa.600144f09084e251000052919b560004" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2015-06-23T16:05:19.382Z cpu8:32821)ScsiDeviceIO: 2325: Cmd(0x412ece968480) 0x28, CmdSN 0x80000038 from world 36101 to dev "naa.600144f09084e251000052919b460003" failed H:0x0 D:0x8 P:0x0 Possible sense data: 0x0 0x0 0x0.
2015-06-23T16:15:06.963Z cpu24:33485)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] New target id_ext=0x0002c903000bc0bc ioc_guid=0x0002c903000bc0bc pkey=0xffff service_id=0x0002c903000bc0bc dgid=fe80:0000:0000:0000:0002:c903:000b:c0bd - req_lim 128
2015-06-23T16:15:06.963Z cpu23:33475)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] ALIVE!!!
2015-06-23T16:15:06.972Z cpu15:33730)NMP: nmp_ThrottleLogForDevice:2322: Cmd 0x16 (0x412ecb3737c0, 0) to dev "naa.600144f09084e25100005286cf240003" on path "vmhba_mlx4_0.1.1:C0:T1:L2" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE
2015-06-23T16:15:06.972Z cpu15:33730)ScsiDeviceIO: 2325: Cmd(0x412ecb3737c0) 0x16, CmdSN 0x9f1f from world 0 to dev "naa.600144f09084e25100005286cf240003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
2015-06-23T16:15:06.972Z cpu15:33730)ScsiCore: 1486: Power-on Reset occurred on naa.600144f09084e25100005286cf240003
2015-06-23T16:15:05.029Z cpu6:34170)World: 14302: VC opID hostd-8fa8 maps to vmkernel opID ba341ef0
2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[12] SN 0x92766c9 req=63 on lun 6 state 0
2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] active_abort 1 lim 1 used_req 126 tag=0000000000000163 tsk_tag=0000000000000063
2015-06-23T16:15:06.941Z cpu13:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[12] SN 0x92766c9 req=63 done=0 tsk_status=255 req_lim=0 tx_head=106811505 tx_tail=106811377
2015-06-23T16:15:06.941Z cpu13:33135)WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver ib_srp, for vmhba_mlx4_0.1.1
2015-06-23T16:15:06.959Z cpu4:33135)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Abort cmd[1a] SN 0x92766f2 req=5a on lun 10 state 0
2015-06-23T16:15:06.959Z cpu4:33135)<3>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] tsk_mgmt no TX_IU head 106811505 tail 106811378 lim 0
2015-06-23T16:15:06.959Z cpu4:33135)WARNING: LinScsi: SCSILinuxAbortCommands:1843: Failed, Driver ib_srp, for vmhba_mlx4_0.1.1
2015-06-23T16:15:06.959Z cpu21:33475)<7>ib_srp: TGT[vmhba_mlx4_0.1.1 :: 1] Sending DREQ cm_id 0x410b54a06d00 qp_num 0x55
2015-06-23T16:15:06.960Z cpu25:36234)NMP: nmp_ThrottleLogForDevice:2322: Cmd 0x12 (0x4136c4245a40, 0) to dev "naa.600144f09084e251000052919b340002" on path "vmhba_mlx4_0.1.1:C0:T1:L6" Failed: H:0x8 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL

Of particular interest is the 0x28 SCSI status returned by the SAN:

VMK_SCSI_DEVICE_QUEUE_FULL (TASK SET FULL) = 0x28

This status is returned when the LUN stops accepting SCSI commands from initiators due to a lack of resources, namely the queue depth on the array (outlined in this KB: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1030381). The engineer said that under load the SAN cannot service the requests and gets into this state. He asked me to check with the storage vendor (hah!) and to correlate the timestamps on the storage side to see what is happening.

What doesn't make sense to me is that it got into this state at around 3am (that's when performance dropped, with no load other than the backup running, and the backup had been running for 6+ hours at that point), and then magically recovered at around noon (9+ hours later). It's isolated to the host, since the other VMs on it experience the same thing (although you can't really see it in esxtop until you do sequential reads/writes).

I've lowered the max concurrent tasks for the Veeam proxy (from 4 to 3) for now, and will try changing the transport mode as the Veeam tech suggested (to see if it's a hot-add thing). Is there a tunable I can check/change, or some other troubleshooting I can do? Or is this one of those "Oh, it's COMSTAR... YMMV" answers? :)

What is also strange: after updating vSphere, my initial "big" job ran to completion, whereas previously it would consistently fail after so many hours. Could it be a driver thing (I'm running Mellanox 1.8.2, the latest that supports SRP)?
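In case it helps anyone digging through similar logs: here's the quick Python sketch I use to pull the failed commands out of vmkernel.log for timestamp correlation. The regex and field names are my own, matched against the lines above, not any official parser:

```python
import re

# Matches vmkernel ScsiDeviceIO failure lines of the form seen above, e.g.:
# 2015-06-23T16:05:18.253Z cpu20:36138)ScsiDeviceIO: 2325: Cmd(0x...) 0x2a,
#   CmdSN 0x... from world 35970 to dev "naa...." failed H:0x0 D:0x8 P:0x0 ...
PATTERN = re.compile(
    r'(?P<ts>\S+Z) .*ScsiDeviceIO: \d+: Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+), '
    r'CmdSN 0x[0-9a-f]+ from world \d+ to dev "(?P<dev>[^"]+)" '
    r'failed H:(?P<host>0x[0-9a-f]+) D:(?P<dev_status>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+)'
)

def parse_scsi_failures(lines):
    """Yield (timestamp, opcode, device, device_status) for each failure line."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            yield m.group('ts'), m.group('opcode'), m.group('dev'), m.group('dev_status')

# One of the lines from the dump above, as a smoke test:
sample = ('2015-06-23T16:05:18.253Z cpu20:36138)ScsiDeviceIO: 2325: '
          'Cmd(0x4136c0428d40) 0x2a, CmdSN 0xa389000 from world 35970 to dev '
          '"naa.600144f09084e251000052919b560004" failed H:0x0 D:0x8 P:0x0 '
          'Possible sense data: 0x0 0x0 0x0.')
for ts, opcode, dev, status in parse_scsi_failures([sample]):
    print(ts, opcode, dev, status)
```

Feed it the whole log and you can line the failures up against the array-side stats by timestamp.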
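For decoding the status triples: the D: field is the raw SCSI status byte (values per the SPC standard), and the H: field appears to follow the usual Linux-style host status numbering (that part is an assumption on my side). A minimal lookup:

```python
# SCSI status byte values per the SPC standard; this is the D: field
# in the vmkernel log lines above.
SCSI_STATUS = {
    0x00: 'GOOD',
    0x02: 'CHECK CONDITION',
    0x04: 'CONDITION MET',
    0x08: 'BUSY',
    0x18: 'RESERVATION CONFLICT',
    0x28: 'TASK SET FULL',   # VMK_SCSI_DEVICE_QUEUE_FULL: array queue exhausted
    0x30: 'ACA ACTIVE',
    0x40: 'TASK ABORTED',
}

# Host status (H: field), assuming the usual Linux DID_* numbering.
HOST_STATUS = {
    0x00: 'OK',
    0x01: 'NO_CONNECT',
    0x03: 'TIME_OUT',
    0x05: 'ABORT',
    0x08: 'RESET',
}

def decode(h, d):
    """Turn an H:/D: pair from a vmkernel line into readable names."""
    return (HOST_STATUS.get(h, f'unknown(0x{h:x})'),
            SCSI_STATUS.get(d, f'unknown(0x{d:x})'))

# H:0x0 D:0x8 from the dump above: transport OK, device answered BUSY.
print(decode(0x0, 0x8))   # ('OK', 'BUSY')
```

Reading the dump with this in hand, the D:0x8 failures are the device pushing back, and the later H:0x8 line lines up with the resets/aborts from the SRP driver.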
_______________________________________________ OmniOS-discuss mailing list [email protected] http://lists.omniti.com/mailman/listinfo/omnios-discuss
