Hello,

On Wed, 2023-02-15 at 19:21 +0100, Milan Oravec wrote:
> > > I've got a lot of these error messages in dmesg:
> > >
> > > Feb 09 10:56:05 virt2n kernel: connection2:0: detected conn
> > > error (1015)
> > > Feb 09 10:56:05 virt2n iscsid[2790]: connection2:0 is operational
> > > after recovery (1 attempts)
> > > Feb 09 10:56:05 virt2n iscsid[2790]: connection1:0 is operational
> > > after recovery (1 attempts)
> >
> > Connection dropped and re-established. Looks normal.
> >
> Yes, I understand that message, but WHY does this happen? What can
> be the cause? The link is stable and no errors are reported on the
> interfaces connected to the SAN switch:

That is mostly to be analyzed on the target side or on the network, to
start with. Usually, in a multi-node clustered target setup, when a
takeover happens, connections from node1 will be dropped for a moment
while it is taken over by node2. I don't know anything about the
Fujitsu target, so I cannot generalize.
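Since error 1015 expands to ISCSI_ERR_DATA_DGST (a data digest
mismatch, as the log lines further down spell out), one quick check on
the initiator side is what digest settings were actually negotiated for
the sessions. A diagnostic sketch (session numbering and exact output
will differ on your host):

    # Show session details, including the negotiated HeaderDigest and
    # DataDigest values and the current connection state:
    iscsiadm -m session -P 2 | grep -iE 'digest|state'

If DataDigest is negotiated to CRC32C, every data PDU is checksummed
end-to-end. A mismatch then points at the target, or at something in
between corrupting PDUs (for example around a controller takeover),
rather than at the Ethernet layer, which has its own CRC and shows no
errors in your counters below.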
> ens1f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>         inet 10.11.12.17  netmask 255.255.255.0  broadcast 10.11.12.255
>         inet6 fe80::9640:c9ff:fe5f:9172  prefixlen 64  scopeid 0x20<link>
>         ether 94:40:c9:5f:91:72  txqueuelen 1000  (Ethernet)
>         RX packets 184466309231  bytes 146589180968929 (133.3 TiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 86121730517  bytes 284905539017306 (259.1 TiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>         device interrupt 102  memory 0xa5820000-a583ffff
>
> ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>         inet 10.11.14.17  netmask 255.255.255.0  broadcast 10.11.14.255
>         inet6 fe80::9640:c9ff:fe5f:9173  prefixlen 64  scopeid 0x20<link>
>         ether 94:40:c9:5f:91:73  txqueuelen 1000  (Ethernet)
>         RX packets 114453605386  bytes 85545283479332 (77.8 TiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 236069263519  bytes 283187128321034 (257.5 TiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>         device interrupt 132  memory 0xa5800000-a581ffff

Thanks for sharing this. So, as you mentioned, there is no issue on the
network side.

> > > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI
> > > connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest
> > > mismatch) state (3)
> > > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI
> > > connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest
> > > mismatch) state (3)
> > > Feb 09 10:56:10 virt2n iscsid[2790]: connection1:0 is operational
> > > after recovery (1 attempts)
> > > Feb 09 10:56:10 virt2n iscsid[2790]: connection2:0 is operational
> > > after recovery (1 attempts)
> >
> > Conn 2:0 recovered, but the earlier error it gave looks misleading
> > to me. What target controller do you have? Was there a failover
> > across the nodes? How does the target handle the transition period
> > to initiator queries?
> >
> Yes, two paths with failover.

So did you trigger a cluster takeover? My guess is that it is your
target initiating the connection drops while taking over to the other
node. How a target behaves during the transition is left to the target.
The initiator will keep querying for recovery until it either times out
or recovers.

> > > When this happens, iSCSI operation is interrupted for a few
> > > seconds. multipath -ll reports (this is only an example; pending
> > > I/O is reported for all 30 dm devices):
> >
> > This all looks normal to me. If you want the errors to be
> > suppressed, you can increase the timeouts. Otherwise, you could
> > also use dm-multipath on top, like you have shown below.
> >
> What can cause these errors? I have a multipath setup, but I want to
> find the root cause of these error messages.

I don't know what is triggering those errors on your setup. Please do
note that you have several different layers stacked to build that
storage solution:

* TCP/IP
* iSCSI
* SCSI
* Device Mapper Multipath

Every layer has errors that it treats differently. For example, a SCSI
error can be completely hidden from the upper layers by the
DM-Multipath layer, depending on how it is set up.
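To make the "increase the timeouts" suggestion above concrete: on the
initiator side those knobs live in /etc/iscsi/iscsid.conf. A minimal
sketch of the relevant settings (the values shown are the usual
shipped defaults, not a recommendation for your setup; with
dm-multipath on top, a lower replacement_timeout is often preferred so
that path failover kicks in quickly):

    # /etc/iscsi/iscsid.conf (excerpt)

    # Seconds to wait for a dropped session to re-establish before
    # failing queued SCSI commands up to the SCSI/multipath layer:
    node.session.timeo.replacement_timeout = 120

    # NOP-Out pings detect a dead connection: send one every
    # noop_out_interval seconds and declare the connection bad if no
    # reply arrives within noop_out_timeout seconds:
    node.conn[0].timeo.noop_out_interval = 5
    node.conn[0].timeo.noop_out_timeout = 5

Note that edits to this file only apply to newly created node records;
existing records have to be updated with iscsiadm (-m node ... -o
update) followed by a re-login.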
> > > web3v (3600000e00d2c0000002c177200120000) dm-52
> > > FUJITSU,ETERNUS_DXL
> > > size=500G features='3 queue_if_no_path queue_mode mq'
> > > hwhandler='1 alua' wp=rw
> > > `-+- policy='service-time 0' prio=10 status=active
> > >   |- 1:0:0:16 sdah 66:16 active i/o pending running
> > >   `- 2:0:0:16 sdai 66:32 active ready running
> > >
> > > ais_app (3600000e00d2c0000002c1772001f0000) dm-99
> > > FUJITSU,ETERNUS_DXL
> > > size=500G features='3 queue_if_no_path queue_mode mq'
> > > hwhandler='1 alua' wp=rw
> > > `-+- policy='service-time 0' prio=10 status=active
> > >   |- 1:0:0:28 sdbh 67:176 active i/o pending running
> > >   `- 2:0:0:28 sdbf 67:144 active ready running
> >
> > Looks all correct to me. You are already using the
> > queue_if_no_path feature.
> >
> > > The SAN consists of one Fujitsu DX100 S4 storage array with two
> > > controllers connected to a LAN switch with two 10Gb fibre links
> > > (one from each controller); each link has its own VLAN
> > > configured. The reported errors occur on virtualization hosts
> > > that are connected via multipath, with two 10Gb fibre links to
> > > the respective VLANs. Jumbo frames are enabled along the whole
> > > path.
> >
> > Good. As I expected, you do have a 2-node target controller setup.
> >
> > > I'll add any needed info upon request.
> > >
> > > I've discussed this issue with a Fujitsu representative, and it
> > > seems we have everything configured correctly; he advised me to
> > > contact Debian support. So here I am, and I would kindly ask you
> > > to point me in the right direction.
> >
> > Okay!! What behavior do you expect? What anomaly do you see with
> > the iSCSI initiator in Debian?
> >
> I expect no errors in the logs and dropout-free communication with
> the target. I think this is not normal/standard behaviour.

There will be errors in your system journal for this particular setup.
Errors like:

* connection drops
* iSCSI session drops/terminations
* SCSI errors
* multipath path checker errors

All of these are errors that will eventually be recovered. That is why
we need close integration between these layers when building a storage
solution on top of them.

Note: These days I only have a software LIO target to test/play with,
where I have not seen any real issues/errors. How each SAN target
behaves is highly specific to the target, in your case the Fujitsu
target.
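On the dm-multipath layer specifically: the queue_if_no_path feature
visible in your multipath -ll output above is what absorbs these
transient path failures, by queueing I/O instead of failing it while
paths recover. For reference, a minimal multipath.conf excerpt that
yields this behaviour (a sketch only; your distribution or the
Fujitsu-recommended settings may configure it differently):

    # /etc/multipath.conf (excerpt)
    defaults {
        # "queue" means: if all paths are lost, keep queueing I/O
        # indefinitely instead of returning an error to the upper
        # layers. This corresponds to the queue_if_no_path feature
        # shown by 'multipath -ll'.
        no_path_retry    queue
    }

A numeric value (e.g. "no_path_retry 10") would instead queue for that
many path-checker intervals and then fail the I/O, which bounds how
long applications can hang if the target never comes back.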
-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System