On 08/30/2013 06:21 AM, Timo Veith wrote:
> On 27.08.2013 06:49, Mike Christie wrote:
>> The scsi layer sets a timeout for each command. I think the default is
>> 30 or 60 secs in SLES 11. If a command does not complete within that
>> timeout, the scsi error handler runs. The scsi eh basically calls the
>> iscsi eh callouts to try and abort commands then restart them. If it
>> cannot abort them it tries LUN and target resets, and if those fail we
>> end up dropping the session and logging back in. So that is what is
>> happening here.
>>
>> You are probably sending too many commands to the device. Either the
>> storage cannot handle them or the connection is too slow or some combo
>> of both. Since you have 10 gig ethernet, it is probably the device that
>> is too slow. You would want to check your target's logs and see if there
>> are any errors during this time. If not, then lower the queue depth on
>> the initiator side (see the iscsi node.session.queue_depth and
>> node.session.cmds_max params) or increase the scsi command timeout via
>> udev or sysfs (whichever way SUSE recommends).
>>
> 
> Hello Mike,
> 
> thank you for your reply.
> 
> I've decreased
> 
> node.session.cmds_max = 128
> and
> node.session.queue_depth = 32
> 
> by a factor of 8 from the defaults down to
> 
> node.session.cmds_max = 16
> and
> node.session.queue_depth = 4

4 commands? Yeah, your target should be able to handle that. Could you
run iscsiadm -m node -T yourtarget and send the output so I can see all
your settings.
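For reference, the two queue settings can be changed in the node record with iscsiadm's update operation, and the relevant lines of the node record can be pulled out for a reply like this one. This is only a minimal sketch, assuming a single node record identified by target name and portal; the helper names (run, set_node_param, lower_queue, filter_queue_settings) and the IQN/portal in the usage lines are hypothetical. As far as I know, the new values are only used after the session logs out and back in.

```shell
#!/bin/sh
# Sketch: change the session queue limits of one iSCSI node record.
# The target IQN and portal are placeholders; use your own.
# With DRY_RUN=1 the iscsiadm commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_node_param TARGET PORTAL NAME VALUE
# Writes one setting into the node record (iscsiadm's update operation).
set_node_param() {
    run iscsiadm -m node -T "$1" -p "$2" -o update -n "$3" -v "$4"
}

# lower_queue TARGET PORTAL CMDS_MAX QUEUE_DEPTH
lower_queue() {
    set_node_param "$1" "$2" node.session.cmds_max "$3" &&
        set_node_param "$1" "$2" node.session.queue_depth "$4"
}

# filter_queue_settings: reads `iscsiadm -m node -T TARGET` output on stdin
# and keeps only the queue and timeout related lines.
filter_queue_settings() {
    grep -E 'cmds_max|queue_depth|timeo'
}
```

Usage, with made-up names:

DRY_RUN=1 lower_queue iqn.2013-08.com.example:storage 192.0.2.10:3260 16 4
iscsiadm -m node -T iqn.2013-08.com.example:storage | filter_queue_settings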

> 
> And I increased the timeout of the block device from 60 to 180 seconds
> by issuing the following command, after checking that it was the right
> block device, of course:
> 
> echo 180 > /sys/block/sda/device/timeout
> 
> The error still appears.
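The timeout change above can be wrapped so the current value is checked before and after writing it. A minimal sketch; SYSFS_ROOT is a hypothetical override (default /sys) that only exists so the functions can be pointed at a scratch directory, and get_scsi_timeout/set_scsi_timeout are made-up helper names. A value written this way does not persist across reboots, which is why a udev rule is the usual way to make it stick.

```shell
#!/bin/sh
# Sketch: read and raise the SCSI command timeout (in seconds) of a disk.
# SYSFS_ROOT is a hypothetical override (default /sys); on a real system
# leave it unset and run as root. The setting is lost on reboot.

# get_scsi_timeout DEV -> prints the current timeout in seconds
get_scsi_timeout() {
    cat "${SYSFS_ROOT:-/sys}/block/$1/device/timeout"
}

# set_scsi_timeout DEV SECONDS
set_scsi_timeout() {
    f="${SYSFS_ROOT:-/sys}/block/$1/device/timeout"
    if [ ! -w "$f" ]; then
        echo "set_scsi_timeout: cannot write $f" >&2
        return 1
    fi
    echo "$2" > "$f"
}
```

For example, set_scsi_timeout sda 180 followed by get_scsi_timeout sda should print 180.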
> 
> Meanwhile we have been testing a lot more. We also tried newer firmware
> and driver versions that are marked beta, but only to get an idea
What target is this with? What vendor and model?


> of where the root cause lies. Beta versions are a no-go for production here.
> We also tried other Linux distributions, Red Hat 6.4 and Arch Linux.
> Red Hat with the latest stable firmware and the Red Hat stock drivers: no
> error. Arch Linux doesn't show the error either.
> We also tried different file systems on SLES: xfs, ext3 and btrfs. All
> show the same error. The nobarrier mount option with xfs: same error.
> 
> We noticed that the ISCSI_ERR_SCSI_EH_SESSION_RST error only appears
> with fio's random read test, and then only in the phase where fio
> lays out the files it will read from later in the test, not in the
> read phase itself. So it is actually writing at that moment!
> In contrast, fio's random write test doesn't produce the error. I can
> hammer on the target with 96 jobs each writing 1 GB and I get no error.
> This is very curious in my eyes.
> 
> I also reduced the number of jobs that the fio benchmark runs to a
> single job, with the file size staying at 8 GB. The error still comes.
> 
> I reduced the file size to 4 GB -> error, then again to 2 GB and behold,
> the error didn't appear! I raised it to 3 GB and got the error again. Then
> I went back to 2 GB and got the error again, too.
> So there seems to be no direct connection between the file size and the
> error. It feels like some buffers are getting filled, and when they are
> full, it happens. This is puzzling. :(
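Since the error only shows up while fio lays out its files, it may help to reproduce just that phase without the read workload. A sketch assuming fio's create_only option, which runs only the setup (file layout) phase and skips the actual I/O; the directory, size and job name are hypothetical placeholders, not the job file used in these tests, and fio_layout_only is a made-up helper name. Delete previously laid-out files first, otherwise fio may reuse them and skip the layout.

```shell
#!/bin/sh
# Sketch: run only fio's file layout phase (create_only=1 skips the I/O).
# Directory, size and job name are placeholders.
# With DRY_RUN=1 the fio command line is printed instead of executed.

# fio_layout_only DIR SIZE [NUMJOBS]
fio_layout_only() {
    # Arguments are expanded before `set --` replaces the positional params.
    set -- fio --name=layout-only --directory="$1" --rw=randread \
        --size="$2" --numjobs="${3:-1}" --create_only=1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, fio_layout_only /mnt/test 3g lays out a single 3 GB file, matching one of the sizes that triggered the error above.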
> 
> Sometimes I think fio is the culprit, but our database import (which we
> will need regularly in production) triggers the error too. So we should
> be glad that fio triggers it as well. But we aren't, because we don't know
> where it comes from.
> 
> We have no access to the iscsi target's logs yet, so we cannot take a
> look at them. :(
> 

Does the problem occur soon after the test starts?

Let's enable all IO logging on the initiator side. Run:

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_session
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_conn
echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp

That will write lots and lots of info to /var/log/messages. Send all of it.
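The five echo commands above can be put in a loop so that debugging is just as easy to switch off again once the problem has been captured, given how much it logs. A minimal sketch; iscsi_debug and ISCSI_DEBUG_PARAMS are made-up names, and SYSFS_ROOT is a hypothetical override (default /sys) used only for testing. Parameters of modules that are not loaded are reported and skipped.

```shell
#!/bin/sh
# Sketch: switch the libiscsi / libiscsi_tcp / iscsi_tcp debug parameters
# on (1) or off (0). SYSFS_ROOT is a hypothetical override (default /sys);
# on a real initiator leave it unset and run as root.

ISCSI_DEBUG_PARAMS="libiscsi/parameters/debug_libiscsi_session
libiscsi/parameters/debug_libiscsi_eh
libiscsi/parameters/debug_libiscsi_conn
libiscsi_tcp/parameters/debug_libiscsi_tcp
iscsi_tcp/parameters/debug_iscsi_tcp"

# iscsi_debug 1|0 -> returns non-zero if any parameter could not be written
iscsi_debug() {
    rc=0
    for p in $ISCSI_DEBUG_PARAMS; do
        f="${SYSFS_ROOT:-/sys}/module/$p"
        if [ -w "$f" ]; then
            echo "$1" > "$f"
        else
            echo "iscsi_debug: skipping $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

Run iscsi_debug 1 before starting fio and iscsi_debug 0 after the session reset has been logged.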

At the same time, would it be possible to take a Wireshark/tcpdump
trace? Send that file too.
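A trace like that could be taken with a plain tcpdump run on the initiator while fio is laying out its files. A sketch, assuming the target listens on the default iSCSI port 3260; the interface name and output file are placeholders and iscsi_capture is a made-up helper name. With -s 0 whole packets are captured, so on 10 GbE the file grows quickly.

```shell
#!/bin/sh
# Sketch: capture iSCSI traffic into a pcap file that Wireshark can open.
# Interface and output file are placeholders; 3260 is the default iSCSI port.
# Run as root and stop with Ctrl-C. DRY_RUN=1 prints the command instead.

# iscsi_capture IFACE OUTFILE [PORT]
iscsi_capture() {
    # Arguments are expanded before `set --` replaces the positional params.
    set -- tcpdump -i "$1" -s 0 -w "$2" tcp port "${3:-3260}"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, iscsi_capture eth0 /tmp/iscsi.pcap, started just before the fio run.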

-- 
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com.
Visit this group at http://groups.google.com/group/open-iscsi.