Re: Win2K3 SCSI errors

Ryan Harper Thu, 08 Jan 2009 07:29:46 -0800

* Anssi Kolehmainen <[email protected]> [2009-01-08 08:23]:
> Hi,

Hey,


Btw, thanks for all of the testing of the scsi bits in windows.  This
has been a great help in flushing out where we've got further work to
do.

> 
> Just to recap my problems with Win2K3 and SCSI...
> 
> Host: 2.6.28 x86_64 Intel Core2Duo E8500 (3.16GHz), 6gb ram
> Guest: Win2003 server, 32-bit, SP2
> KVM: 82 (both userspace and kernel modules)
> 
> qemu-system-x86_64 -name $name -smp 1 -m 1024 -vnc :$id -k fi -serial
> mon:telnet::1000$id,server,nowait -localtime -vga std -usb -usbdevice
> tablet -net nic,macaddr=00:16:3e:00:00:$id,model=e1000 -net
> tap,ifname=tap-$name -pidfile /var/run/kvm/$name.pid -boot c -drive
> index=0,media=disk,if=scsi,boot=on,file=/dev/mapper/vg0-$name
> 
> - Copying file from network drive (100mbps network) to local drive seems
>   to cause 'lsi_scsi: error: Bad Status move' 100% in one system,
>   usually somewhere after 300mb transferred.
> - Also sometimes installing Oracle DB, Bea Weblogic and/or MS SQL Server
>   seems to cause it.
> - Under normal loads it happens about once a week.
> - When running without kvm modules no errors occur.
> 
> - Writing null to local drive works just fine. Unbuffered and
>   write-through write is about 50mb/s (whereas host can do about 95mb/s).
> - Once after such write I got 'Bad status move' error but only after
>   guest had been idling for about 30s. Resulted in Windows
>   KERNEL_INPAGE_ERROR (ntfs.sys) BSOD. Hasn't occurred since.
> 
> - Not all 'Bad status moves' cause BSOD but if they do it is
>   KERNEL_INPAGE_ERROR... Feels like memory corruption to me.
> 
> - Windows event log contains a few dozen "The driver detected a
>   controller error on \Device\Scsi\sym_hu1." These come mostly in boot.
> - Another event log error 'The device did not respond within the timeout
>   period.' occurs at somewhat same time as 'Bad Status move' error.
> 
> - Bad Status move looks like it comes when scsi (write) command is
>   completed after adapter is reset. Executed SCRIPTS doesn't know
>   adapter has been reset and does bad things.
> 
> Ryan, have you been able to duplicate this? I can provide you access to
> my test system where you could try to debug this.

I've been able to duplicate the Bad Status moves on 2k8, but I've never
seen any of these issues on 2k3 R2, 32 or 64-bit.  Using 2k8 32-bit with
-smp 2, I've recreated the Bad Status moves and I have a pretty good
idea of what's wrong.  We queue a read operation and proceed to some
other task.  When the io layer finishes the read, the completion
function is invoked, but since we've already started another task,
lsi_command_complete gets confused: it sees waiting=1 and dbc!=0 (the
state of the current task) and raises a phase mismatch, thinking it's a
short transfer.  It then switches to the STATUS phase, and this is where
things go bad: the scripts engine decodes the next instruction, the
byte count is wrong (!=1), and that trips the Bad Status BADF().
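
To make that sequence concrete, here is a rough toy model in C.  It is
not the lsi53c895a code: struct toy_lsi, toy_complete_naive() and
toy_status_move_ok() are names made up for illustration, and the only
things taken from the description above are the waiting/dbc check, the
switch to the STATUS phase, and the one-byte status move.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified controller state; not QEMU's real state struct. */
enum toy_phase { TOY_PHASE_DATA_IN, TOY_PHASE_STATUS };

struct toy_lsi {
    uint32_t current_tag;   /* tag the scripts engine is working on now */
    uint32_t dbc;           /* bytes left in the current block move */
    bool waiting;           /* scripts are paused waiting for data */
    enum toy_phase phase;
    bool phase_mismatch;    /* set when a short transfer is signalled */
};

/* Naive completion handler: it never compares the completed tag with
 * the current one, so it judges the completion by the state of
 * whatever task happens to be running. */
static void toy_complete_naive(struct toy_lsi *s, uint32_t tag)
{
    (void)tag;  /* the problem being modelled: the tag is ignored */
    if (s->waiting && s->dbc != 0) {
        s->phase_mismatch = true;   /* looks like a short transfer */
    }
    s->phase = TOY_PHASE_STATUS;
}

/* A status move transfers exactly one byte; any other count is the
 * "Bad Status move" condition. */
static bool toy_status_move_ok(const struct toy_lsi *s)
{
    assert(s->phase == TOY_PHASE_STATUS);
    return s->dbc == 1;
}
```

With the current task (say tag 2) in the middle of a 512-byte move, a
completion for tag 1 sets the phase mismatch, switches to STATUS, and
the status move then fails its byte count check.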

So, to fix this, first, when we complete a command and the tag we get in
the callback doesn't match the tag we're currently executing, we have
to disconnect the current task from the scsi system.  Then, we need to
have the lsi device reselect the completed tag so we can send a proper
command complete signal to the guest.  This change requires some
reworking of the lsi_queue so we don't remove tasks from the queue until
we've signaled to the driver that we've completed the task.
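
A minimal sketch of that ordering, assuming a small fixed-size queue.
This is only a model of the idea, not a patch: struct toy_hba and the
toy_* functions are hypothetical and do not correspond to the existing
lsi_queue code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_QUEUE_MAX 8

enum toy_state { TOY_FREE, TOY_IN_FLIGHT, TOY_DONE_PENDING };

struct toy_task {
    uint32_t tag;
    enum toy_state state;
};

struct toy_hba {
    struct toy_task queue[TOY_QUEUE_MAX];
    int current;    /* slot of the connected task, -1 if none */
};

static void toy_init(struct toy_hba *h)
{
    for (int i = 0; i < TOY_QUEUE_MAX; i++) {
        h->queue[i].tag = 0;
        h->queue[i].state = TOY_FREE;
    }
    h->current = -1;
}

/* Return the slot holding 'tag', or -1 if it is not queued. */
static int toy_find(const struct toy_hba *h, uint32_t tag)
{
    for (int i = 0; i < TOY_QUEUE_MAX; i++) {
        if (h->queue[i].state != TOY_FREE && h->queue[i].tag == tag) {
            return i;
        }
    }
    return -1;
}

/* Queue a new command and make it the connected task. */
static int toy_submit(struct toy_hba *h, uint32_t tag)
{
    for (int i = 0; i < TOY_QUEUE_MAX; i++) {
        if (h->queue[i].state == TOY_FREE) {
            h->queue[i].tag = tag;
            h->queue[i].state = TOY_IN_FLIGHT;
            h->current = i;
            return i;
        }
    }
    return -1;  /* queue full */
}

/* Completion callback from the io layer. */
static bool toy_io_complete(struct toy_hba *h, uint32_t tag)
{
    int idx = toy_find(h, tag);
    if (idx < 0) {
        return false;
    }
    if (h->current != idx) {
        h->current = -1;    /* disconnect the current task; it stays queued */
        h->current = idx;   /* reselect the task that actually completed */
    }
    h->queue[idx].state = TOY_DONE_PENDING;  /* still queued, not yet reported */
    return true;
}

/* Report command complete to the driver; only now does the task
 * leave the queue. */
static bool toy_signal_driver(struct toy_hba *h, uint32_t *tag_out)
{
    if (h->current < 0 || h->queue[h->current].state != TOY_DONE_PENDING) {
        return false;
    }
    *tag_out = h->queue[h->current].tag;
    h->queue[h->current].state = TOY_FREE;
    h->current = -1;
    return true;
}
```

The point of the sketch is the ordering: the completed tag stays in the
queue through the disconnect and reselect, and its slot is freed only
in toy_signal_driver(), while the interrupted task remains in flight.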

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
[email protected]
