Hi,

We are running KVM guests whose root disks are backed by iSCSI LUNs mapped 
from NetApp. Occasionally the guest's disk gets write errors and the 
filesystem is remounted read-only:

Mar 15 10:16:59 rb-vertica-hds2-devel dhclient[5053]: DHCPACK from 172.30.40.175 (xid=0x47a97e90)
Mar 15 10:17:00 rb-vertica-hds2-devel dhclient[5053]: bound to 172.30.40.92 -- renewal in 47 seconds.
Mar 15 10:17:03 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 708624
Mar 15 10:17:03 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
..
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 903881
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 1705084
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: JBD2: Detected IO errors while flushing file data on vda1-8
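
To line the guest errors up against host-side logs or a packet capture, it may help to pull the timestamps and block numbers out of the guest's syslog. The following is only a rough sketch: it assumes Python 3 is available, that each log entry sits on a single line in the log file, and that the message format is exactly the one quoted above. The function name failed_blocks is made up for illustration.

```python
import re

# Matches the guest kernel message quoted above (one log entry per line).
BUFFER_ERR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"Buffer I/O error on device (?P<dev>\w+), logical block (?P<block>\d+)"
)


def failed_blocks(lines):
    """Return (timestamp, device, logical block) for each buffer I/O error line."""
    hits = []
    for line in lines:
        match = BUFFER_ERR.match(line)
        if match:
            hits.append(
                (match.group("stamp"), match.group("dev"), int(match.group("block")))
            )
    return hits
```

Feeding it the lines of the guest's syslog (for example an open file object) yields one tuple per failed block, which can then be compared against timestamps collected on the host.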

When the problem happens, there are NO errors in the logs on the compute 
node. I'm running 'iscsiadm -m session -P3' every 5s; it shows no change in 
session or LUN state. I also ran 'iscsid' with -d8, which likewise shows 
nothing around that time.
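
The 5-second polling can be sketched roughly as below, so that any change in the output gets a timestamp that can be matched against the guest log. This is an illustration under assumptions, not what is actually running here: it assumes Python 3.7+ on the compute node, and the helper names (snapshot, changed_lines, watch) are invented for the example. Only the 'iscsiadm -m session -P3' command itself comes from this mail.

```python
import subprocess
import time
from datetime import datetime


def snapshot(cmd=("iscsiadm", "-m", "session", "-P3")):
    """Run the command and return its stdout as a list of lines."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout.splitlines()


def changed_lines(before, after):
    """Return lines found in only one snapshot, prefixed with '- ' or '+ '."""
    old, new = set(before), set(after)
    removed = ["- " + line for line in before if line not in new]
    added = ["+ " + line for line in after if line not in old]
    return removed + added


def watch(interval=5.0, take=snapshot, rounds=None, sleep=time.sleep, emit=print):
    """Poll every `interval` seconds and emit one timestamped line per change."""
    previous = take()
    done = 0
    while rounds is None or done < rounds:
        sleep(interval)
        current = take()
        for line in changed_lines(previous, current):
            emit(datetime.now().isoformat(timespec="seconds") + " " + line)
        previous = current
        done += 1
```

Calling watch() with no arguments polls forever; the take, sleep and emit parameters exist only so the loop can be exercised without a live iSCSI session.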

How do I identify where these write errors are coming from?
* a problem in virtio-blk? Not likely.
* an iSCSI client problem connecting to the target
* an actual write problem on the target

Example KVM device definition

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-path/ip-172.30.128.3:3260-iscsi-iqn.1992-08.com.netapp:node.netapp02-lun-17'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

The iSCSI session has the default configuration:

iscsiadm -m node -T iqn.1992-08.com.netapp:node.netapp02
...
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5

iscsiadm -m session -P3
...
Recovery Timeout: 120
Target Reset Timeout: 30
LUN Reset Timeout: 30
Abort Timeout: 15

This is the device that had I/O errors a few hours ago:
grep . /sys/block/sdk/device/*
grep: /sys/block/sdk/device/delete: Permission denied
/sys/block/sdk/device/device_blocked:0
/sys/block/sdk/device/dh_state:detached
/sys/block/sdk/device/evt_media_change:0
/sys/block/sdk/device/iocounterbits:32
/sys/block/sdk/device/iodone_cnt:0x29a
/sys/block/sdk/device/ioerr_cnt:0x0                             <-- error count ?
/sys/block/sdk/device/iorequest_cnt:0x29a
/sys/block/sdk/device/modalias:scsi:t-0x00
/sys/block/sdk/device/model:LUN             
/sys/block/sdk/device/queue_depth:32
/sys/block/sdk/device/queue_ramp_up_period:120000
/sys/block/sdk/device/queue_type:none
grep: /sys/block/sdk/device/rescan: Permission denied
/sys/block/sdk/device/rev:7360
/sys/block/sdk/device/scsi_level:5
/sys/block/sdk/device/state:running
/sys/block/sdk/device/timeout:30
/sys/block/sdk/device/type:0
/sys/block/sdk/device/uevent:DEVTYPE=scsi_device
/sys/block/sdk/device/uevent:DRIVER=sd
/sys/block/sdk/device/uevent:MODALIAS=scsi:t-0x00
/sys/block/sdk/device/vendor:NETAPP  

I assume I'm not hitting any of those timeouts; otherwise I would see 
something in the debug output. Do any of those values affect the kernel 
part of the iSCSI client?

Do the counts (iorequest_cnt, iodone_cnt, ioerr_cnt) mean that every 
request sent out was successfully completed (iorequest_cnt == iodone_cnt)?
I would like to know whether the failed write is an actual error response 
sent by the target or a (network) problem on the client side.
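
For reference, the comparison of the three counters can be done mechanically, as in the sketch below. It only does the arithmetic: it assumes the values are hexadecimal strings as in the sysfs listing above and that Python 3 is available, and it does not settle what the kernel actually counts in each file. The function names are hypothetical.

```python
import os

COUNTERS = ("iorequest_cnt", "iodone_cnt", "ioerr_cnt")


def parse_counters(raw):
    """Convert sysfs strings such as '0x29a' into integers, keyed by counter name."""
    return {name: int(raw[name].strip(), 16) for name in COUNTERS}


def read_counters(disk, sysfs="/sys/block"):
    """Read the three counters for a disk such as 'sdk' from sysfs."""
    raw = {}
    for name in COUNTERS:
        with open(os.path.join(sysfs, disk, "device", name)) as handle:
            raw[name] = handle.read()
    return parse_counters(raw)


def summarize(counters):
    """Return (requests minus completions, value of ioerr_cnt)."""
    outstanding = counters["iorequest_cnt"] - counters["iodone_cnt"]
    return outstanding, counters["ioerr_cnt"]
```

With the values listed above (0x29a, 0x29a, 0x0), summarize returns (0, 0): 666 requests, 666 completions, and an error counter of zero. Sampling read_counters('sdk') periodically and logging any change would show whether the host-side counters move at the moment the guest reports errors.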

Thanks in advance for any tips. I'm desperate enough to start tcpdump-ing 
the whole thing.. ;)

Regards,

Brano Zarnovican

Host/Guest OS: Scientific Linux release 6.3
Host kernel: 2.6.32-358.0.1.el6.x86_64
Guest kernel: 2.6.32-279.9.1.el6.x86_64
iSCSI client: iscsi-initiator-utils-6.2.0.872-41.el6.x86_64

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.