Bug#1031131: open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd.

2023-02-27 Thread Milan Oravec
Hello Sarraf,

On Thu, Feb 23, 2023 at 2:44 PM Ritesh Raj Sarraf  wrote:

> On Thu, 2023-02-23 at 14:19 +0100, Milan Oravec wrote:
> > > >
> > >
> > > By KVM, do you mean just pure KVM ? Or the management suite too,
> > > libvirt ?
> > >
> >
> >
> > Yes, libvirt is used to manage KVM, and virt-manager on my desktop to
> > connect to it. It is a simple setup without clustering.
> >
>
> Thanks for confirming.
>
> You should start with the hypervisor where libvirtd is running. Check
> whether that hypervisor host is running normally.
>

No unusual errors reported.


>
> From all the symptoms you've shared, my wild arrow in the dark is that
> your hypervisor/guest network may be running into issues.
>
>
Everything is OK with the network.


> >
> >
> > Do you know someone who can help with this? Here is an example of a
> > running KVM guest configuration:
> >
> > libvirt+ 244533  1  5 Feb03 ?1-03:54:37 qemu-system-
> > x86_64 -enable-kvm -name guest=me_test,process=qemu:me_test,debug-
> > threads=on -S -object
> > secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-
> > 108-me_test/master-key.aes -machine pc-1.1,accel=kvm,usb=off,dump-
> > guest-core=off -cpu host -m 8096 -realtime mlock=off -smp
> > 4,sockets=1,cores=4,threads=1 -uuid 1591f345-96b5-4077-9d32-
> > b2991003753d -no-user-config -nodefaults -chardev
> > socket,id=charmonitor,fd=57,server,nowait -mon
> > chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-
> > shutdown -boot strict=on -device piix3-usb-
> > uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-
> > pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive if=none,id=drive-
> > ide0-1-0,readonly=on -device ide-cd,bus=ide.1,unit=0,drive=drive-
> > ide0-1-0,id=ide0-1-0,bootindex=1 -drive
> > file=/dev/mapper/me_test,format=raw,if=none,id=drive-virtio-
> > disk0,cache=none,discard=unmap,aio=native -device virtio-blk-
> > pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-
> > disk0,bootindex=2,write-cache=on -netdev
> > tap,fd=60,id=hostnet0,vhost=on,vhostfd=61 -device virtio-net-
> > pci,netdev=hostnet0,id=net0,mac=aa:bb:cc:00:10:31,bus=pci.0,addr=0x3
> > -chardev pty,id=charserial0 -device isa-
> > serial,chardev=charserial0,id=serial0 -chardev
> > spicevmc,id=charchannel0,name=vdagent -device
> > virtserialport,bus=virtio-
> > serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice
> > .0 -spice port=5929,addr=0.0.0.0,disable-ticketing,seamless-
> > migration=on -device qxl-
> > vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,v
> > gamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2 -device virtio-balloon-
> > pci,id=balloon0,bus=pci.0,addr=0x5 -sandbox
> > on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=de
> > ny -msg timestamp=on
> >
> > Maybe there is something undesirable for the iscsi target.
> >
>
> Seems like you are using host networking ? Have you validated that that
> part is working reliably ?
>

Yes, this is working fine too.

I've made some experiments with the storage parameters passed to KVM; a
standard disk created within virt-manager runs with the following parameters:

format=raw,cache=none,discard=unmap,aio=native.
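
(As far as I can tell, these map to the disk's <driver> element in the
libvirt domain XML; a rough sketch of the relevant part, with the values
taken from the qemu command line above, would be:

<disk type='block' device='disk'>
  <!-- cache/io/discard correspond to the -drive options above -->
  <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
  <source dev='/dev/mapper/me_test'/>
  <target dev='vda' bus='virtio'/>
</disk>

so changing or removing cache there is what the experiment below amounts to.)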

I've removed the cache parameter (letting the hypervisor use its default)
and am now running the disk with:
format=raw,discard=unmap.

With this setup, there are no errors (connection1:0: pdu (op 0x5 itt 0x48)
rejected. Reason code 0x9) in the logs. The virtual host was stressed with:

ns2:/home/migo# stress-ng -v -d 1
stress-ng: debug: [858] 2 processors online, 2 processors configured
stress-ng: info:  [858] defaulting to a 86400 second (1 day, 0.00 secs) run
per stressor
stress-ng: info:  [858] dispatching hogs: 1 hdd
stress-ng: debug: [858] cache allocate: default cache size: 262144K
stress-ng: debug: [858] starting stressors
stress-ng: debug: [858] 1 stressor started
stress-ng: debug: [859] stress-ng-hdd: started [859] (instance 0)
^Cstress-ng: debug: [859] stress-ng-hdd: exited [859] (instance 0)
stress-ng: debug: [858] process [859] terminated
stress-ng: info:  [858] successful run completed in 30377.75s (8 hours, 26
mins, 17.75 secs)
stress-ng: debug: [858] metrics-check: all stressor metrics validated and
sane
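
(For reproducing this without a manual ^C, the same load can be bounded in
time with something like "stress-ng -v -d 1 -t 8h"; -d 1 starts a single
hdd stressor in the current working directory, if I read the man page right.)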

It seems that using the hypervisor's cache filters out those bogus commands
(the ones causing trouble for the target) before they reach the target. What
do you think about it?

But running guests on an iSCSI target with host-side caching is not a good
idea, so this is no real solution. :(
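
(For anyone comparing the two runs: the effective disk options can be
checked on the hypervisor with something along these lines, assuming virsh
is available and the domain is named me_test:

virsh dumpxml me_test | grep -A1 '<driver'   # shows cache/io/discard in use
virsh domblklist me_test                     # maps guest disks to host paths

Both are read-only, so they are safe to run while the guest is up.)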



>
> > >
> > > > > There will be errors in your system journal for this particular
> > > > > setup.
> > > 

Bug#1031131: open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd.

2023-02-23 Thread Milan Oravec
Hello,

On Thu, Feb 23, 2023 at 1:35 PM Ritesh Raj Sarraf  wrote:

> Hello Milan,
>
> On Mon, 2023-02-20 at 09:51 +0100, Milan Oravec wrote:
> > > >
> > >
> > > So did you trigger a cluster takeover ? My guess is it is your
> > > target
> > > initiating the connection drops, while taking over to the other
> > > node.
> > >
> > >
> >
> >
> > There was no intentional cluster takeover. This happens during normal
> > operation.
> >
> > > How a target behaves during the transition is left to the target.
> > > The
> > > initiator will keep querying for recovery, until either it times
> > > out or
> > > recovers.
> > >
> >
> >
> > Recovery seems to work fine. I've tested it by disconnecting one path
> > to the target and all was OK: the second path was used, and when the
> > first one recovered, multipath switched back over to it.
> >
>
>
> > >
> > > > > >
> >
> > I can understand this is a very complex problem; what do you suggest I
> > do to debug this issue? We have 5 hosts connected to the Fujitsu
> > target, but errors occur only on the two which are running KVM guests.
> > The other servers, running mail hosts and camera surveillance, do not
> > seem to be affected. They are Debian systems but are not running
> > multipath, so I've made a test:
> >
> > So far I've disabled all KVM guests on one of the two KVM hosts used
> > for virtualization, mounted one KVM guest file system locally (/mnt),
> > and performed a stress test on that mount:
> >
>
> By KVM, do you mean just pure KVM ? Or the management suite too,
> libvirt ?
>

Yes, libvirt is used to manage KVM, and virt-manager on my desktop to
connect to it. It is a simple setup without clustering.


>
> >
> > When I let one KVM guest (Debian) run on this host system, the
> > following errors appear in dmesg:
> >
> > [Mon Feb 20 09:49:39 2023]  connection1:0: pdu (op 0x5 itt 0x53)
> > rejected. Reason code 0x9
> > [Mon Feb 20 09:49:41 2023]  connection1:0: pdu (op 0x5 itt 0x7c)
> > rejected. Reason code 0x9
> > [Mon Feb 20 09:49:41 2023]  connection1:0: pdu (op 0x5 itt 0x51)
> > rejected. Reason code 0x9
> > [Mon Feb 20 09:50:07 2023]  connection1:0: pdu (op 0x5 itt 0x33)
> > rejected. Reason code 0x9
> >
> > The guest is using virtio disk drivers (vda), so I've switched this
> > guest to generic SATA (sda), but the errors in the log remain.
> >
> > It seems that KVM is triggering these errors and making our NAS
> > unstable.
> >
> > What can KVM do differently? It is using /dev/mapper/dm-XX as disk
> > devices, so there is no direct iSCSI access.
> >
> > Any ideas what should I try next?
> >
> >
>
> I'm afraid I do not have many direct hints for you here. Given that
> this issue does not happen when KVM is not involved, it would imply that
> the erratic behavior originates from that software's integration.
>
>
Do you know someone who can help with this? Here is an example of a running
KVM guest configuration:

libvirt+ 244533  1  5 Feb03 ?1-03:54:37 qemu-system-x86_64
-enable-kvm -name guest=me_test,process=qemu:me_test,debug-threads=on -S
-object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-108-me_test/master-key.aes
-machine pc-1.1,accel=kvm,usb=off,dump-guest-core=off -cpu host -m 8096
-realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid
1591f345-96b5-4077-9d32-b2991003753d -no-user-config -nodefaults -chardev
socket,id=charmonitor,fd=57,server,nowait -mon
chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
-boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive
if=none,id=drive-ide0-1-0,readonly=on -device
ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -drive
file=/dev/mapper/me_test,format=raw,if=none,id=drive-virtio-disk0,cache=none,discard=unmap,aio=native
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2,write-cache=on
-netdev tap,fd=60,id=hostnet0,vhost=on,vhostfd=61 -device
virtio-net-pci,netdev=hostnet0,id=net0,mac=aa:bb:cc:00:10:31,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -chardev
spicevmc,id=charchannel0,name=vdagent -device
virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0
-spice port=5929,addr=0.0.0.0,disable-ticketing,seamless-migration=on
-device
qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,vram64_size_mb=0,vgamem_mb=16,max_outputs=1,bus=pci.0,addr=0x2
-device virtio-balloon-pci,id=balloon0,bus=p

Bug#1031131: open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd.

2023-02-20 Thread Milan Oravec
Hello,

On Thu, Feb 16, 2023 at 2:16 PM Ritesh Raj Sarraf  wrote:

> Hello,
>
> On Wed, 2023-02-15 at 19:21 +0100, Milan Oravec wrote:
> > > > I've got a lot of these error messages in dmesg:
> > > >
> > > > Feb 09 10:56:05 virt2n kernel:  connection2:0: detected conn
> > > > error
> > > > (1015)
> > > > Feb 09 10:56:05 virt2n iscsid[2790]: connection2:0 is operational
> > > > after recovery (1 attempts)
> > > > Feb 09 10:56:05 virt2n iscsid[2790]: connection1:0 is operational
> > > > after recovery (1 attempts)
> > >
> > > Connection dropped and re-established. Looks normal.
> > >
> >
> >
> > Yes, I understand that message, but WHY does this happen; what can be
> > the cause? The link is stable and no errors are reported on the
> > interfaces connected to the SAN switch:
> >
>
> That is mostly to analyze on the target side or on the network, to
> start with.
>
> Usually, in a multinode clustered target setup, when a takeover
> happens, connections from node1 will be dropped for a moment while it
> is taken over by node2.
>
> I don't know anything about the Fujitsu target, so I cannot generalize.
>
> > ens1f0: flags=4163  mtu 9000
> > inet 10.11.12.17  netmask 255.255.255.0  broadcast
> > 10.11.12.255
> > inet6 fe80::9640:c9ff:fe5f:9172  prefixlen 64  scopeid
> > 0x20
> > ether 94:40:c9:5f:91:72  txqueuelen 1000  (Ethernet)
> > RX packets 184466309231  bytes 146589180968929 (133.3 TiB)
> > RX errors 0  dropped 0  overruns 0  frame 0
> > TX packets 86121730517  bytes 284905539017306 (259.1 TiB)
> > TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> > device interrupt 102  memory 0xa582-a583
> >
> > ens1f1: flags=4163  mtu 9000
> > inet 10.11.14.17  netmask 255.255.255.0  broadcast
> > 10.11.14.255
> > inet6 fe80::9640:c9ff:fe5f:9173  prefixlen 64  scopeid
> > 0x20
> > ether 94:40:c9:5f:91:73  txqueuelen 1000  (Ethernet)
> > RX packets 114453605386  bytes 85545283479332 (77.8 TiB)
> > RX errors 0  dropped 0  overruns 0  frame 0
> > TX packets 236069263519  bytes 283187128321034 (257.5 TiB)
> > TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> > device interrupt 132  memory 0xa580-a581
> >
> >
>
> Thanks for sharing this. So as you mentioned there's no issue on the
> network side.
>
> > >
> > > > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI
> > > > connection
> > > > 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch)
> > > > state
> > > > (3)
> > > > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI
> > > > connection
> > > > 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch)
> > > > state
> > > > (3)
> > > > Feb 09 10:56:10 virt2n iscsid[2790]: connection1:0 is operational
> > > > after recovery (1 attempts)
> > > > Feb 09 10:56:10 virt2n iscsid[2790]: connection2:0 is operational
> > > > after recovery (1 attempts)
> > >
> > > Conn 2:0 recovered, but the earlier error it gave looks misleading
> > > to me. What target controller do you have? Was there a failover
> > > across the nodes? How does the target handle the transition period
> > > for initiator queries?
> > >
> >
> >
> > yes, two paths with failover
> >
>
> So did you trigger a cluster takeover ? My guess is it is your target
> initiating the connection drops, while taking over to the other node.
>
>
There was no intentional cluster takeover. This happens during normal
operation.


> How a target behaves during the transition is left to the target. The
> initiator will keep querying for recovery, until either it times out or
> recovers.
>

Recovery seems to work fine. I've tested it by disconnecting one path to the
target and all was OK: the second path was used, and when the first one
recovered, multipath switched back over to it.
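
(Roughly, the test amounted to something along these lines on the
initiator, using the interface names from the ifconfig output quoted above;
this is just a sketch, not my exact command history:

ip link set ens1f0 down    # drop one iSCSI path
watch multipath -ll        # path goes faulty, I/O continues on the other
ip link set ens1f0 up      # path comes back and is reinstated
)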


>
> > > >
> > > > When this happens, iSCSI operation is interrupted for a few
> > > > seconds; multipath -ll reports (this is only an example, pending
> > > > i/o is reported for all 30 dm-s):
> > > >
> > >
> > > This all looks normal to me. If you want the errors to be
> > > suppressed,
> > > you can increase the timeouts. On the other hand, otherwise, you

Bug#1031131: open-iscsi: Lot of iscsi/kernel errors in dmesg with Fujitsu Eternus DX100S4 connected with 2 10Gb ethernet paths with multipathd.

2023-02-15 Thread Milan Oravec
On Tue, Feb 14, 2023 at 1:04 PM Ritesh Raj Sarraf  wrote:

> Hello Migo,
>
>
Hello Ritesh Raj Sarraf,

On Sun, 2023-02-12 at 11:29 +0100, migo wrote:
> > Package: open-iscsi
> > Version: 2.1.3-5
> > Severity: normal
> > X-Debbugs-Cc: migo.ora...@gmail.com
> >
> > Dear Maintainer,
> >
> > I want to apologize if this is not the right place to report this
> > issue, but I need somewhere to start.
> >
> > I've got a lot of these error messages in dmesg:
> >
> > Feb 09 10:56:05 virt2n kernel:  connection2:0: detected conn error
> > (1015)
> > Feb 09 10:56:05 virt2n iscsid[2790]: connection2:0 is operational
> > after recovery (1 attempts)
> > Feb 09 10:56:05 virt2n iscsid[2790]: connection1:0 is operational
> > after recovery (1 attempts)
>
> Connection dropped and re-established. Looks normal.
>

Yes, I understand that message, but WHY does this happen; what can be the
cause? The link is stable and no errors are reported on the interfaces
connected to the SAN switch:

ens1f0: flags=4163  mtu 9000
inet 10.11.12.17  netmask 255.255.255.0  broadcast 10.11.12.255
inet6 fe80::9640:c9ff:fe5f:9172  prefixlen 64  scopeid 0x20
ether 94:40:c9:5f:91:72  txqueuelen 1000  (Ethernet)
RX packets 184466309231  bytes 146589180968929 (133.3 TiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 86121730517  bytes 284905539017306 (259.1 TiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device interrupt 102  memory 0xa582-a583

ens1f1: flags=4163  mtu 9000
inet 10.11.14.17  netmask 255.255.255.0  broadcast 10.11.14.255
inet6 fe80::9640:c9ff:fe5f:9173  prefixlen 64  scopeid 0x20
ether 94:40:c9:5f:91:73  txqueuelen 1000  (Ethernet)
RX packets 114453605386  bytes 85545283479332 (77.8 TiB)
RX errors 0  dropped 0  overruns 0  frame 0
TX packets 236069263519  bytes 283187128321034 (257.5 TiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
device interrupt 132  memory 0xa580-a581
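
(Since the kernel complains about a data digest mismatch, the negotiated
digest settings might be relevant; they can be dumped with something like
the following, assuming the stock Debian open-iscsi tools:

iscsiadm -m session -P 2 | grep -i digest   # negotiated Header/DataDigest
grep -i digest /etc/iscsi/iscsid.conf       # configured defaults

I can paste that output if it helps.)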



>
> > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI connection
> > 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state
> > (3)
> > Feb 09 10:56:05 virt2n iscsid[2790]: Kernel reported iSCSI connection
> > 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state
> > (3)
> > Feb 09 10:56:10 virt2n iscsid[2790]: connection1:0 is operational
> > after recovery (1 attempts)
> > Feb 09 10:56:10 virt2n iscsid[2790]: connection2:0 is operational
> > after recovery (1 attempts)
>
> Conn 2:0 recovered, but the earlier error it gave looks misleading to
> me. What target controller do you have? Was there a failover across the
> nodes? How does the target handle the transition period for initiator
> queries?
>

yes, two paths with failover


>
> > Feb 09 10:56:28 virt2n kernel:  connection1:0: detected conn error
> > (1015)
> > Feb 09 10:56:28 virt2n kernel:  connection2:0: detected conn error
> > (1015)
> > Feb 09 10:56:28 virt2n iscsid[2790]: Kernel reported iSCSI connection
> > 1:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state
> > (3)
> > Feb 09 10:56:28 virt2n iscsid[2790]: Kernel reported iSCSI connection
> > 2:0 error (1015 - ISCSI_ERR_DATA_DGST: Data digest mismatch) state
> > (3)
> > Feb 09 10:56:33 virt2n kernel:  connection1:0: detected conn error
> > (1015)
> > Feb 09 10:56:33 virt2n kernel:  connection2:0: detected conn error
> > (1015)
> > Feb 09 10:56:33 virt2n iscsid[2790]: connection2:0 is operational
> > after recovery (1 attempts)
> > Feb 09 10:56:33 virt2n iscsid[2790]: connection1:0 is operational
> > after recovery (1 attempts)
>
> Same logs again.
>
> > ... snipped ...
>
> > or eventually this:
> >
> > Feb 10 20:39:39 virt2n kernel:  connection1:0: ping timeout of 5 secs
> > expired, recv timeout 5, last rx 19077132455, last ping 19077133712,
> > now 19077134976
> > Feb 10 20:39:39 virt2n kernel:  connection1:0: detected conn error
> > (1022)
> > Feb 10 20:39:39 virt2n kernel: scsi_io_completion_action: 11
> > callbacks suppressed
> > Feb 10 20:39:39 virt2n kernel: sd 1:0:0:5: [sdl] tag#122 FAILED
> > Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
> > Feb 10 20:39:39 virt2n kernel: sd 1:0:0:5: [sdl] tag#122 CDB: Test
> > Unit Ready 00 00 00 00 00 00
> > Feb 10 20:39:39 virt2n kernel: sd 1:0:0:16: [sdah] tag#103 FAILED
> > Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
> > Feb 10 20:39:39 virt2n kernel: sd 1:0:0:16: [sdah] tag#103 CDB: Test
> > Unit Ready 00 00 00 00 00 00
> > Feb 10 20:39:39 virt2n kernel: sd 1:0:0:24: [sdax] tag#115 FAILED
> > Result: hostbyte=DID_TRANSPORT_DISRUPTED driverbyte=DRIVER_OK
> > Feb 10 20:39:39 virt2n kernel: sd 1:0:0:24: [sdax] tag#115 CDB: Test
> > Unit Ready 00 00 00 00 00 00
> >
> > When this happens, iSCSI operation is interrupted for a few seconds;
> > multipath -ll reports (this is only an example, pending i/o is
> > reported for all 30 dm-s):
> >
>
> This all looks normal to me. If