Re: troubleshooting sporadic write problems from KVM

Fred Smit Fri, 22 Mar 2013 20:03:50 -0700


On Friday, March 15, 2013 6:02:05 AM UTC-5, [email protected] wrote:
>
> Hi,
>
> we are using KVMs with root backed by iSCSI LUNs mapped from 
> Netapp. Occasionally, the device get write errors and it is remounted 
> read-only
>
> Mar 15 10:16:59 rb-vertica-hds2-devel dhclient[5053]: DHCPACK from 
> 172.30.40.175 (xid=0x47a97e90)
> Mar 15 10:17:00 rb-vertica-hds2-devel dhclient[5053]: bound to 
> 172.30.40.92 -- renewal in 47 seconds.
> Mar 15 10:17:03 rb-vertica-hds2-devel kernel: Buffer I/O error on device 
> vda1, logical block 708624
> Mar 15 10:17:03 rb-vertica-hds2-devel kernel: lost page write due to I/O 
> error on vda1
> ..
> Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device 
> vda1, logical block 903881
> Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O 
> error on vda1
> Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device 
> vda1, logical block 1705084
> Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O 
> error on vda1
> Mar 15 10:17:32 rb-vertica-hds2-devel kernel: JBD2: Detected IO errors 
> while flushing file data on vda1-8
>
> When the problem happens there are NO errors in logs on compute node. I'm 
> running 'iscsiadm -m session -P3' every 5s.
> It shows no change in session or LUN state. I did also 'iscsid' with -d8 
> which also shows nothing around the time.
>
> How do I identify where are these write errors coming from ?
> * problem on virtio-blk ?? Not likely.
> * iscsi client problem connecting to target
> * actual write problem on target
>
> Example KVM device definition
>
>     <disk type='block' device='disk'>
>       <driver name='qemu' type='raw' cache='none'/>
>       <source 
> dev='/dev/disk/by-path/ip-172.30.128.3:3260-iscsi-iqn.1992-08.com.netapp:node.netapp02-lun-17'/>
>       <target dev='vda' bus='virtio'/>
>       <alias name='virtio-disk0'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x04' 
> function='0x0'/>
>     </disk>
>
> iSCSI session has default configuration
>
> iscsiadm -m node -T iqn.1992-08.com.netapp:node.netapp02
> ...
> node.session.cmds_max = 128
> node.session.queue_depth = 32
> node.session.timeo.replacement_timeout = 120
> node.session.err_timeo.abort_timeout = 15
> node.session.err_timeo.lu_reset_timeout = 30
> node.session.err_timeo.tgt_reset_timeout = 30
> node.session.err_timeo.host_reset_timeout = 60
> node.conn[0].timeo.logout_timeout = 15
> node.conn[0].timeo.login_timeout = 15
> node.conn[0].timeo.auth_timeout = 45
> node.conn[0].timeo.noop_out_interval = 5
> node.conn[0].timeo.noop_out_timeout = 5
>
> iscsiadm -m session -P3
> ...
> Recovery Timeout: 120
> Target Reset Timeout: 30
> LUN Reset Timeout: 30
> Abort Timeout: 15
>
> This is the device which had IO errors few hours ago..
> grep . /sys/block/sdk/device/*
> grep: /sys/block/sdk/device/delete: Permission denied
> /sys/block/sdk/device/device_blocked:0
> /sys/block/sdk/device/dh_state:detached
> /sys/block/sdk/device/evt_media_change:0
> /sys/block/sdk/device/iocounterbits:32
> /sys/block/sdk/device/iodone_cnt:0x29a
> /sys/block/sdk/device/ioerr_cnt:0x0                             <-- error 
> count ?
> /sys/block/sdk/device/iorequest_cnt:0x29a
> /sys/block/sdk/device/modalias:scsi:t-0x00
> /sys/block/sdk/device/model:LUN             
> /sys/block/sdk/device/queue_depth:32
> /sys/block/sdk/device/queue_ramp_up_period:120000
> /sys/block/sdk/device/queue_type:none
> grep: /sys/block/sdk/device/rescan: Permission denied
> /sys/block/sdk/device/rev:7360
> /sys/block/sdk/device/scsi_level:5
> /sys/block/sdk/device/state:running
> /sys/block/sdk/device/timeout:30
> /sys/block/sdk/device/type:0
> /sys/block/sdk/device/uevent:DEVTYPE=scsi_device
> /sys/block/sdk/device/uevent:DRIVER=sd
> /sys/block/sdk/device/uevent:MODALIAS=scsi:t-0x00
> /sys/block/sdk/device/vendor:NETAPP  
>
> I assume that I'm not hitting any of those timeouts, otherwise I should 
> see something in debug output. Do any of those values affect the kernel 
> part of iSCSI client ?
>
> Do the counts (iorequest_cnt, iodone_cnt, ioerr_cnt) mean that every 
> request sent out was successfully completed ? ( iorequest_cnt == iodone_cnt 
> )
> I would like to know if the response to write is an actual error response 
> sent by the target or is it a (network) problem related to client.
>
> Thanks in advance for any tips. I'm desperate enough to start tcpdump-ing 
> the whole thing.. ;)
>
> Regards,
>
> Brano Zarnovican
>
> Host/Guest OS: Scientific Linux release 6.3
> Host kernel: 2.6.32-358.0.1.el6.x86_64
> Guest kernel: 2.6.32-279.9.1.el6.x86_64
> iSCSI client: iscsi-initiator-utils-6.2.0.872-41.el6.x86_64
>
>


-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/open-iscsi?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

Re: troubleshooting sporadic write problems from KVM

Reply via email to