On Dec 3, 2012, at 1:31 AM, Nico Visser <nicolaas.vis...@gmail.com> wrote:

> 
> Environment:
> NETAPP SAN storage
> Hyper-V Cluster
> OS: Cloud Linux 6.3
> 
> I'm in the process of moving individual VMs off to a separate Hyper-V cluster, 
> as we are having some stability issues.
> The VM connects fine over iSCSI, but once I've copied the VHD over to the new 
> cluster and powered it on, I can't reconnect to the iSCSI targets anymore.
> 
> Even when attempting to connect to a new LUN with a different initiator 
> name, the issue persists.
> 
> I've attempted to:
> 
> stop iscsi
> log out of the session
> delete the session
> and remove everything under /var/lib/iscsi
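> 
> Roughly these commands (reconstructed from memory, so the exact flags may 
> differ):
> 
> service iscsi stop
> iscsiadm -m node -u            # log out of all sessions
> iscsiadm -m node -o delete     # delete the node records
> rm -rf /var/lib/iscsi/*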
> 
> When attempting to reconnect, I can see the session and even the /dev/sdc and 
> /dev/sdd disks that the session provides.
> 
> 
> iscsiadm --mode session
> tcp: [1] x.x.x.x:3260,7 iqn.1992-08.com.netapp:xxxxxxx
> 
> 
> root@lnxwebr02 [~]# iscsiadm --mode session -P 3
> [session output truncated]


Don't cut the part of the output that shows the session/connection state.
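If you rerun it, grep for the state fields, e.g.:

iscsiadm -m session -P 3 | grep -i state

For a healthy session that prints lines like "iSCSI Connection State: LOGGED IN", 
"iSCSI Session State: LOGGED_IN", and "Internal iscsid Session State: NO CHANGE"; 
anything else points at where the login is getting stuck.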



> 
>         ************************
>         Attached SCSI devices:
>         ************************
>         Host Number: 4    State: running
>         scsi4 Channel 00 Id 0 Lun: 0
>             Attached scsi disk sdc        State: running
>         scsi4 Channel 00 Id 0 Lun: 1
>             Attached scsi disk sdd        State: running
> 
> However, the logs show buffer I/O errors:
> 
> Dec  3 09:23:21 lnxwebr02 kernel: [ 1094.018336] Buffer I/O error on device 
> sdc, logical block 0
> Dec  3 09:24:18 lnxwebr02 kernel: [ 1151.062556] Buffer I/O error on device 
> sdd, logical block 0
> 
> and connection errors:
> 
> Dec  3 09:12:05 lnxwebr02 kernel: [  418.382452] scsi4 : iSCSI Initiator over 
> TCP/IP
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.698600] scsi 4:0:0:0: Direct-Access  
>    NETAPP   LUN              8020 PQ: 0 ANSI: 5
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.700035] sd 4:0:0:0: Attached scsi 
> generic sg2 type 0
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.706815] scsi 4:0:0:1: Direct-Access  
>    NETAPP   LUN              8020 PQ: 0 ANSI: 5
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.706842] sd 4:0:0:0: [sdc] 1572990976 
> 512-byte logical blocks: (805 GB/750 GiB)
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.708974] sd 4:0:0:1: Attached scsi 
> generic sg3 type 0
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.712961] sd 4:0:0:1: [sdd] 419430400 
> 512-byte logical blocks: (214 GB/200 GiB)
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.713334] sd 4:0:0:0: [sdc] Write 
> Protect is off
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.714268] sd 4:0:0:0: [sdc] Write 
> cache: disabled, read cache: enabled, doesn't support DPO or FUA
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.715730] sd 4:0:0:1: [sdd] Write 
> Protect is off
> Dec  3 09:12:06 lnxwebr02 kernel: [  418.716605] sd 4:0:0:1: [sdd] Write 
> cache: disabled, read cache: enabled, doesn't support DPO or FUA
> Dec  3 09:12:06 lnxwebr02 iscsid: Connection1:0 to [target: 
> iqn.1992-08.com.netapp:wnlsfas3240b, portal: 10.11.52.12,3260] through 
> [iface: default] is operational now
> Dec  3 09:12:29 lnxwebr02 PAM-hulk[2881]: failed to connect stream socket
> Dec  3 09:12:52 lnxwebr02 kernel: [  418.719682]  sdc:
> Dec  3 09:12:52 lnxwebr02 kernel: [  464.704180]  connection1:0: detected 
> conn error (1021)
> Dec  3 09:12:52 lnxwebr02 iscsid: Kernel reported iSCSI connection 1:0 error 
> (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of 
> SCSI error recovery) state (3)
> Dec  3 09:13:00 lnxwebr02 iscsid: connection1:0 is operational after recovery 
> (1 attempts)
> Dec  3 09:13:45 lnxwebr02 kernel: [  517.704102]  connection1:0: detected 
> conn error (1021)
> Dec  3 09:13:45 lnxwebr02 iscsid: Kernel reported iSCSI connection 1:0 error 
> (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of 
> SCSI error recovery) state (3)
> Dec  3 09:13:51 lnxwebr02 iscsid: connection1:0 is operational after recovery 
> (1 attempts)
> 
> 


It looks like you connected fine, but for some unknown reason the target stops 
responding to I/O, or we experience a slowdown. The I/O then takes longer than 
the SCSI command timeout (30 or 60 seconds, depending on your kernel and udev 
rules), so the SCSI error handler runs. It ends up dropping the connection, 
and we reconnect OK afterwards.
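
A quick way to check what timeout udev set (assuming sdc/sdd are the iSCSI 
disks, as in your output above):

cat /sys/block/sdc/device/timeout
cat /sys/block/sdd/device/timeout

You can raise it temporarily, e.g. echo 60 > /sys/block/sdc/device/timeout, 
but that only hides whatever is stalling the target.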

It could be that, since you just copied the VHD file, you now have two hosts 
with the same /etc/iscsi/initiatorname.iscsi, in which case each host will 
force logouts of the other. The value in that file must be unique across hosts.
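
To verify, compare that file on the two hosts. If they match, generate a fresh 
name on the copied VM with iscsi-iname (it ships with open-iscsi and produces 
a random, unique IQN):

cat /etc/iscsi/initiatorname.iscsi                    # run this on each host
echo "InitiatorName=$(iscsi-iname)" > /etc/iscsi/initiatorname.iscsi
service iscsi restart                                 # pick up the new name

If you do change it, remember to add the new IQN to the initiator group on 
the NetApp so the LUNs stay mapped.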



The hung task error below just means the I/O is taking longer than your hung 
task timeout setting, which looks like the default of 2 minutes.
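
You can confirm the current value, or silence the warning as the message 
itself suggests:

cat /proc/sys/kernel/hung_task_timeout_secs        # 120 by default
echo 0 > /proc/sys/kernel/hung_task_timeout_secs   # disables the warning

The warning is just a symptom of the stalled I/O above, not a separate problem.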



> I'm also seeing the following kernel message
> 
> 
> Dec  3 09:14:36 lnxwebr02 kernel: [  568.704067]  connection1:0: detected 
> conn error (1021)
> Dec  3 09:14:36 lnxwebr02 iscsid: Kernel reported iSCSI connection 1:0 error 
> (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of 
> SCSI error recovery) state (3)
> Dec  3 09:14:42 lnxwebr02 iscsid: connection1:0 is operational after recovery 
> (1 attempts)
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.729107] INFO: task async/0:2830 
> blocked for more than 120 seconds.
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.733430] "echo 0 > 
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737702] async/0       D 
> ffff88020556c440     0  2830      2    0 0x00000080
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737718]  ffff880205595950 
> 0000000000000046 0000000000000000 ffff8802055959b8
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737729]  ffff8802023eb938 
> ffff8802023eb848 ffff880205590f28 ffff880205590f28
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737738]  ffff880028036fe8 
> ffff88020556c9f8 ffff880205595fd8 ffff880205595fd8
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737748] Call Trace:
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737761]  [<ffffffff811231b0>] ? 
> sync_page+0x0/0x50
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737772]  [<ffffffff814e9fe3>] 
> io_schedule+0x73/0xc0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737780]  [<ffffffff811231ed>] 
> sync_page+0x3d/0x50
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737786]  [<ffffffff814ea84a>] 
> __wait_on_bit_lock+0x5a/0xc0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737796]  [<ffffffff811cd2c0>] ? 
> blkdev_get_block+0x0/0x70
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737802]  [<ffffffff81123187>] 
> __lock_page+0x67/0x70
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737808]  [<ffffffff81095ac0>] ? 
> wake_bit_function+0x0/0x50
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737815]  [<ffffffff8112465a>] 
> do_read_cache_page+0xfa/0x1e0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737821]  [<ffffffff811ce270>] ? 
> blkdev_readpage+0x0/0x20
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737827]  [<ffffffff81124789>] 
> read_cache_page_async+0x19/0x20
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737833]  [<ffffffff8112479e>] 
> read_cache_page+0xe/0x20
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737841]  [<ffffffff81207c40>] 
> read_dev_sector+0x30/0x90
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737847]  [<ffffffff8120a9d1>] 
> read_lba+0x101/0x110
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737852]  [<ffffffff8120aec5>] 
> find_valid_gpt+0xd5/0x6b0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737861]  [<ffffffff8106c831>] ? 
> release_console_sem+0x1e1/0x230
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737866]  [<ffffffff8120b51f>] 
> efi_partition+0x7f/0x370
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737872]  [<ffffffff814e9155>] ? 
> printk+0x41/0x44
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737877]  [<ffffffff812089b7>] 
> rescan_partitions+0x1a7/0x470
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737899]  [<ffffffffa0020351>] ? 
> sd_open+0x81/0x1f0 [sd_mod]
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737905]  [<ffffffff811ce9d6>] 
> __blkdev_get+0x1b6/0x3c0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737910]  [<ffffffff811cebf0>] 
> blkdev_get+0x10/0x20
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737916]  [<ffffffff81207dee>] 
> register_disk+0x14e/0x1b0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737925]  [<ffffffff81258646>] 
> add_disk+0xa6/0x160
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737933]  [<ffffffffa00239cb>] 
> sd_probe_async+0x13b/0x210 [sd_mod]
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737938]  [<ffffffff81095de6>] ? 
> add_wait_queue+0x46/0x60
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737947]  [<ffffffff8109dd32>] 
> async_thread+0x102/0x250
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737955]  [<ffffffff81059ec0>] ? 
> default_wake_function+0x0/0x20
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737962]  [<ffffffff8109dc30>] ? 
> async_thread+0x0/0x250
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.737971]  [<ffffffff810954a6>] 
> kthread+0x96/0xa0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.738044]  [<ffffffff8100c20a>] 
> child_rip+0xa/0x20
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.738074]  [<ffffffff81095410>] ? 
> kthread+0x0/0xa0
> Dec  3 09:15:08 lnxwebr02 kernel: [  600.738079]  [<ffffffff8100c200>] ? 
> child_rip+0x0/0x20
> 
> 
> I'm trying to understand why, after copying the VHD file, I can't reconnect 
> via iSCSI to the NetApp, even after logging out and deleting the session. 
> Are there old references I need to remove, or am I missing something else?
