On 08/30/2013 06:21 AM, Timo Veith wrote:
> On 27.08.2013 06:49, Mike Christie wrote:
>> The scsi layer sets a timeout for each command. I think the default is
>> 30 or 60 secs in SLES 11. If a command does not complete within that
>> timeout, the scsi error handler runs. The scsi eh basically calls the
>> iscsi eh callouts to try and abort commands then restart them. If it
>> cannot abort them it tries lun and target resets and if those fail we
>> end up dropping the session and relogging in. So that is what is
>> happening here.
>>
>> You are probably sending too many commands to the device. Either the
>> storage cannot handle them or the connection is too slow or some combo
>> of both. Since you have 10 gig ethernet it is probably that the device
>> is too slow. You would want to check your target's logs and see if there
>> are any errors during this time. If not then lower the queue depth on
>> the initiator side (see the iscsi node.session.queue_depth and
>> node.session.cmds_max params) or increase the scsi command timeout via
>> udev or sysfs (however SUSE recommends).
>>
>
> Hello Mike,
>
> thank you for your reply.
>
> I've decreased
>
> node.session.cmds_max = 128
> and
> node.session.queue_depth = 32
>
> by a factor of 8 from the defaults down to
>
> node.session.cmds_max = 16
> and
> node.session.queue_depth = 4
4 commands? Yeah, your target should be able to handle that. Could you do

iscsiadm -m node -T yourtarget

and send the output so I can see all your settings.

>
> And I increased the timeout of the block device from 60 to 180 by
> issuing the command, after I checked for the right block device of course
>
> echo 180 > /sys/block/sda/device/timeout
>
> The error still appears.
>
> Meanwhile we have been testing a lot more. We also tried newer firmware
> and driver versions which are marked beta. But that only to get an idea

What target is this with? What vendor and model?

> where the root cause lies. Beta versions are a no go for production here.
> We also tried different Linux Distributions, Red Hat 6.4 and Arch Linux.
> Red Hat with latest stable firmware and Red Hat stock drivers -> no
> error. Also Arch Linux doesn't show the error.
> We also tried different file systems on SLES: xfs, ext3 and btrfs. All
> the same error. nobarrier mount option with xfs: same error.
>
> We noticed that the ISCSI_ERR_SCSI_EH_SESSION_RST error only appears
> with fio's random read test and with that in the phase where the program
> lays out the files from which it will read later on for its test. Not in
> the read phase itself. So actually it is writing in that moment!
> In contrast fio's random write test doesn't produce that error. I can
> hammer on the target with 96 jobs each writing 1 GB and I get no error.
> This is very curious in my eyes.
>
> I also reduced the number of jobs that the fio benchmark runs to only
> one job. File size staying at 8 GB. Error still comes.
>
> I reduced the file size to 4 GB -> error, then again to 2 GB and behold
> the error didn't appear! I raised to 3 GB and got the error again. Then
> back to 2 GB and got the error again, too.
> So there seems to be no direct connection between the file size and the
> error. Feels like some buffers getting filled, and when they are full,
> it happens. This is puzzling.
> :(
>
> Sometimes I think fio is the culprit but our database import (which we
> will need regularly in production) triggers the error also. So we should
> be glad that fio triggers it too. But we aren't, because we don't know
> where it comes from.
>
> We have no access to the iscsi target's logs yet, so we cannot take a
> look at them. :(
>

Does the problem occur quickly into the test?

Let's enable all IO logging on the initiator side. Do

echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_session
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_eh
echo 1 > /sys/module/libiscsi/parameters/debug_libiscsi_conn
echo 1 > /sys/module/libiscsi_tcp/parameters/debug_libiscsi_tcp
echo 1 > /sys/module/iscsi_tcp/parameters/debug_iscsi_tcp

That will log lots and lots of info to /var/log/messages. Send it all.

At the same time would it be possible to take a wireshark/tcpdump trace?
Send that file too.
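
The logging switches listed above can be wrapped in a small shell helper, which also makes them easy to turn off again once the test run is over. This is only a sketch: the function name set_iscsi_debug and its optional second argument (a sysfs root, defaulting to /sys) are made up for illustration. It assumes the five module parameters named above exist on the running kernel, and skips any that are missing or not writable.

```shell
#!/bin/sh
# Hypothetical helper: set_iscsi_debug VALUE [SYSFS_ROOT]
#   VALUE       1 to enable debug logging, 0 to disable it
#   SYSFS_ROOT  defaults to /sys; it is a parameter only so the function
#               can be tried against a scratch directory first
set_iscsi_debug() {
    value=$1
    root=${2:-/sys}
    # The five parameters named in this mail, relative to $root/module.
    for p in \
        libiscsi/parameters/debug_libiscsi_session \
        libiscsi/parameters/debug_libiscsi_eh \
        libiscsi/parameters/debug_libiscsi_conn \
        libiscsi_tcp/parameters/debug_libiscsi_tcp \
        iscsi_tcp/parameters/debug_iscsi_tcp
    do
        f="$root/module/$p"
        if [ -w "$f" ]; then
            echo "$value" > "$f"
        else
            # Module not loaded, parameter absent, or not running as root.
            echo "skipping $f (not writable)" >&2
        fi
    done
}
```

Run as root, calling set_iscsi_debug 1 before starting fio and set_iscsi_debug 0 afterwards should keep /var/log/messages from filling up longer than needed.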