Hello,

On Wed, 2023-02-15 at 19:21 +0100, Milan Oravec wrote:
> > > I've got a lot of these error messages in dmesg:
> > >
> > > Feb 09 10:56:05 virt2n kernel: connection2:0: detected conn
> > > error (1015)
> > > Feb 09 10:56:05 virt2n iscsid[2790]: connection2:0 is operational
> > > after recovery (1 attempts)
> > > Feb 09 10:56:05 virt2n iscsid[2790]: connection1:0 is operational
> > > after recovery (1 attempts)
> >
> > Connection dropped and re-established. Looks normal.
> >
> Yes, I understand that message, but WHY does this happen? What can
> be the cause? The link is stable and no errors are reported on the
> interfaces connected to the SAN switch:

That is mostly to be analyzed on the target side or on the network, to
start with. Usually, in a multi-node clustered target setup, when a
takeover happens, connections from node1 will be dropped for a moment
while it is taken over by node2. I don't know anything about the
Fujitsu target, so I cannot generalize.
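Since error 1015 expands to ISCSI_ERR_DATA_DGST (a data digest
mismatch, as the log lines further down spell out), one quick check on
the initiator side is what digest settings were actually negotiated for
the sessions. A diagnostic sketch (session numbering and exact output
will differ on your host):

    # Show session details, including the negotiated HeaderDigest and
    # DataDigest values and the current connection state:
    iscsiadm -m session -P 2 | grep -iE 'digest|state'

If DataDigest is negotiated to CRC32C, every data PDU is checksummed
end-to-end. A mismatch then points at the target, or at something in
between corrupting PDUs (for example around a controller takeover),
rather than at the Ethernet layer, which has its own CRC and shows no
errors in your counters below.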
> ens1f0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>         inet 10.11.12.17  netmask 255.255.255.0  broadcast 10.11.12.255
>         inet6 fe80::9640:c9ff:fe5f:9172  prefixlen 64  scopeid 0x20<link>
>         ether 94:40:c9:5f:91:72  txqueuelen 1000  (Ethernet)
>         RX packets 184466309231  bytes 146589180968929 (133.3 TiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 86121730517  bytes 284905539017306 (259.1 TiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>         device interrupt 102  memory 0xa5820000-a583ffff
>
> ens1f1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
>         inet 10.11.14.17  netmask 255.255.255.0  broadcast 10.11.14.255
>         inet6 fe80::9640:c9ff:fe5f:9173  prefixlen 64  scopeid 0x20<link>
>         ether 94:40:c9:5f:91:73  txqueuelen 1000  (Ethernet)
>         RX packets 114453605386  bytes 85545283479332 (77.8 TiB)
>         RX errors 0  dropped 0  overruns 0  frame 0
>         TX packets 236069263519  bytes 283187128321034 (257.5 TiB)
>         TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
>         device interrupt 132  memory 0xa5800000-a581ffff

Thanks for sharing this. So, as you mentioned, there is no issue on the
network side.

> > > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI
> > > connection 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest
> > > mismatch) state (3)
> > > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI
> > > connection 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest
> > > mismatch) state (3)
> > > Feb 09 10:56:10 virt2n iscsid[2790]: connection1:0 is operational
> > > after recovery (1 attempts)
> > > Feb 09 10:56:10 virt2n iscsid[2790]: connection2:0 is operational
> > > after recovery (1 attempts)
> >
> > Conn 2:0 recovered, but the earlier error it gave looks misleading
> > to me. What target controller do you have? Was there a failover
> > across the nodes? How does the target handle the transition period
> > to initiator queries?
> >
> Yes, two paths with failover.

So did you trigger a cluster takeover? My guess is that it is your
target initiating the connection drops while taking over to the other
node. How a target behaves during the transition is left to the target.
The initiator will keep querying for recovery until it either times out
or recovers.

> > > When this happens, iSCSI operation is interrupted for a few
> > > seconds. multipath -ll reports (this is only an example; pending
> > > I/O is reported for all 30 dm devices):
> >
> > This all looks normal to me. If you want the errors to be
> > suppressed, you can increase the timeouts. Otherwise, you could
> > also use dm-multipath on top, like you have shown below.
> >
> What can cause these errors? I have a multipath setup, but I want to
> find the root cause of these error messages.

I don't know what is triggering those errors on your setup. Please do
note that you have several different layers stacked to build that
storage solution:

* TCP/IP
* iSCSI
* SCSI
* Device Mapper Multipath

Every layer has errors that it treats differently. For example, a SCSI
error can be completely hidden from the upper layers by the
DM-Multipath layer, depending on how it is set up.
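To make the "increase the timeouts" suggestion above concrete: on the
initiator side those knobs live in /etc/iscsi/iscsid.conf. A minimal
sketch of the relevant settings (the values shown are the usual
shipped defaults, not a recommendation for your setup; with
dm-multipath on top, a lower replacement_timeout is often preferred so
that path failover kicks in quickly):

    # /etc/iscsi/iscsid.conf (excerpt)

    # Seconds to wait for a dropped session to re-establish before
    # failing queued SCSI commands up to the SCSI/multipath layer:
    node.session.timeo.replacement_timeout = 120

    # NOP-Out pings detect a dead connection: send one every
    # noop_out_interval seconds and declare the connection bad if no
    # reply arrives within noop_out_timeout seconds:
    node.conn[0].timeo.noop_out_interval = 5
    node.conn[0].timeo.noop_out_timeout = 5

Note that edits to this file only apply to newly created node records;
existing records have to be updated with iscsiadm (-m node ... -o
update) followed by a re-login.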
> > > web3v (3600000e00d2c0000002c177200120000) dm-52
> > > FUJITSU,ETERNUS_DXL
> > > size=500G features='3 queue_if_no_path queue_mode mq'
> > > hwhandler='1 alua' wp=rw
> > > `-+- policy='service-time 0' prio=10 status=active
> > >   |- 1:0:0:16 sdah 66:16 active i/o pending running
> > >   `- 2:0:0:16 sdai 66:32 active ready running
> > >
> > > ais_app (3600000e00d2c0000002c1772001f0000) dm-99
> > > FUJITSU,ETERNUS_DXL
> > > size=500G features='3 queue_if_no_path queue_mode mq'
> > > hwhandler='1 alua' wp=rw
> > > `-+- policy='service-time 0' prio=10 status=active
> > >   |- 1:0:0:28 sdbh 67:176 active i/o pending running
> > >   `- 2:0:0:28 sdbf 67:144 active ready running
> >
> > Looks all correct to me. You are already using the
> > queue_if_no_path feature.
> >
> > > The SAN consists of one Fujitsu DX100 S4 storage array with two
> > > controllers connected to a LAN switch with two 10Gb fibre links
> > > (one from each controller); each link has its own VLAN
> > > configured. The reported errors occur on virtualization hosts
> > > that are connected via multipath, with two 10Gb fibre links to
> > > the respective VLANs. Jumbo frames are enabled along the whole
> > > path.
> >
> > Good. As I expected, you do have a 2-node target controller setup.
> >
> > > I'll add any needed info upon request.
> > >
> > > I've discussed this issue with a Fujitsu representative, and it
> > > seems we have everything configured correctly; he advised me to
> > > contact Debian support. So here I am, and I would kindly ask you
> > > to point me in the right direction.
> >
> > Okay!! What behavior do you expect? What anomaly do you see with
> > the iSCSI initiator in Debian?
> >
> I expect no errors in the logs and dropout-free communication with
> the target. I think this is not normal/standard behaviour.

There will be errors in your system journal for this particular setup.
Errors like:

* connection drops
* iSCSI session drops/terminations
* SCSI errors
* multipath path checker errors

All of these are errors that will eventually be recovered. That is why
we need close integration between these layers when building a storage
solution on top of them.

Note: These days I only have a software LIO target to test/play with,
where I have not seen any real issues/errors. How each SAN target
behaves is highly specific to the target, in your case the Fujitsu
target.
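On the dm-multipath layer specifically: the queue_if_no_path feature
visible in your multipath -ll output above is what absorbs these
transient path failures, by queueing I/O instead of failing it while
paths recover. For reference, a minimal multipath.conf excerpt that
yields this behaviour (a sketch only; your distribution or the
Fujitsu-recommended settings may configure it differently):

    # /etc/multipath.conf (excerpt)
    defaults {
        # "queue" means: if all paths are lost, keep queueing I/O
        # indefinitely instead of returning an error to the upper
        # layers. This corresponds to the queue_if_no_path feature
        # shown by 'multipath -ll'.
        no_path_retry    queue
    }

A numeric value (e.g. "no_path_retry 10") would instead queue for that
many path-checker intervals and then fail the I/O, which bounds how
long applications can hang if the target never comes back.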
-- 
Ritesh Raj Sarraf | http://people.debian.org/~rrs
Debian - The Universal Operating System