[ANNOUNCE]: SCST 3.3 pre-release freeze

2017-08-31 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce the SCST 3.3 pre-release code freeze in the SCST SVN branch 3.3.x.

You can get it with the following command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.3.x

It is going to be released after a few weeks of testing, if no significant issues are found.

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. Many recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially to Bart Van Assche and 
SanDisk/Western Digital!

Vlad



[ANNOUNCE]: SCST 3.2 released

2016-12-15 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce that SCST 3.2 has just been released.

You can download it from http://scst.sourceforge.net/downloads.html

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. Many modern SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk/WDC for the great 
support!

Vlad



[ANNOUNCE]: SCST 3.2 pre-release freeze

2016-08-02 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce the SCST 3.2 pre-release code freeze in the SCST SVN branch 3.2.x.

You can get it with the following command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.2.x

It is going to be released after a few weeks of testing, if no significant issues are found.

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk/WDC for the great 
support!

Vlad



Re: [LSF/MM TOPIC] LIO/SCST Merger

2016-01-28 Thread Vladislav Bolkhovitin
Nicholas A. Bellinger wrote on 01/27/2016 10:36 PM:
> On Wed, 2016-01-27 at 09:54 -0800, Bart Van Assche wrote:
>> Last year, during the 2015 LSF/MM summit, it has been decided that the 
>> LIO/SCST merger project should proceed by sending the functionality 
>> upstream that is present in SCST but not yet in LIO. This will help to 
>> reduce the workload of target driver maintainers that maintain a version 
>> of their target driver for both LIO and SCST (QLogic FC and FCoE target 
>> drivers, Emulex FC and FCoE target drivers, RDMA iSER target driver, 
>> RDMA SRP target driver, ...). My proposal is to organize a session 
>> during which the following is discussed:
>> * Which patches are already upstream in the context of the LIO/SCST 
>> merger project.
>> * About which patches there is agreement but that are not yet upstream.
>> * To discuss how to proceed from here and what to address first.
> 
> No, just no.  If you've not been able to articulate the specifics of
> what you're talking about to the list by now, it's never going to
> happen.
> 
> You'll recall last year how things unfolded at LSF.  You started
> comparing data structure TMR member names of no consequence to a larger
> LSF audience, and quickly tried to pivot into a discussion about adding
> hooks to LIO fabric drivers for your own out-of-tree nastiness.
> 
> I really fail to see how that helps LIO or upstream.  To repeat.  I'll
> not allow SCST's out-of-tree legacy requirements to limit LIO's future
> in upstream, and if you or your employer is still trying to get
> enterprise distros to listen to that nonsense behind the scenes, then
> please stop wasting everybody's time.
> 
> Bart, I really want to believe you and your employer have good
> intentions for LIO.  However, being one of its largest detractors in
> the past means that you have to really put your best foot forward in
> your interaction with the LIO community.
> 
> However, your inability to ask questions before acting, refusing to
> answer to all feedback on reviews for changes of substance, and not
> following the expected patch review process without repeatedly leading
> yourself and others down the wrong path really makes me start to
> question your intentions, or at least your abilities as a kernel
> contributor.
> 
> Also, you've not managed to merge any of the outstanding ib_srpt fixes
> from the last year, which brings us to a grand total of 6 small patches
> since the original merge of ib_srpt in Oct 2011.
> 
> # git log --author=Bart --oneline -- drivers/infiniband/ulp/srpt/
> 19f5729 IB/srpt: Fix the RDMA completion handlers
> ba92999 target: Minimize SCSI header #include directives
> 2fe6e72 ib_srpt: Remove set-but-not-used variables
> 649ee05 target: Move task tag into struct se_cmd + support 64-bit tags
> afc1660 target: Remove first argument of target_{get,put}_sess_cmd()
> ab477c1 srp-target: Retry when QP creation fails with ENOMEM
> 
> That's really a terrible record.
> 
> So until you're able to demonstrate publicly to me and the LIO community
> that you do have good intentions, and not trying to rehash the same
> tired old nonsense and willful ignorance, please stop throwing out these
> generic topics as a branding exercise.
> 
> There are much more interesting and important topics at LSF to discuss.

While I'm generally refraining from feeding trolls, don't you think that a person who has contributed one of your major drivers and continues making such important contributions (for free!), trying to bring LIO reliability (eventually, after how many years?) to something you can compare to SCST, deserves a little more respect?

Vlad



[ANNOUNCE]: SCST 3.1 release

2016-01-21 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce that SCST version 3.1 has just been released and is available for download from http://scst.sourceforge.net/downloads.html.

Highlights for this release:

 - Cluster support for SCSI reservations. This feature is essential for 
initiator-side
clustering approaches based on persistent reservations, e.g. the quorum disk
implementation in Windows Clustering.

 - Full support for VAAI (vStorage APIs for Array Integration): Extended Copy command support has been added, and the performance of WRITE SAME and of Atomic Test & Set, also known as COMPARE AND WRITE, has been improved.

 - T10-PI support has been added.

 - ALUA support has been improved: explicit ALUA (SET TARGET PORT GROUPS 
command) has
been added and DRBD compatibility has been improved.

 - SCST events user space infrastructure has been added, so now SCST can notify 
a user
space agent about important internal and fabric events.

 - QLogic target driver has been significantly improved.

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk for the great support! Development of all the above highlights was supported by SanDisk.

Vlad



Re: [ANNOUNCE]: SCST 3.1 pre-release freeze

2015-11-06 Thread Vladislav Bolkhovitin
Hi,

Bike & Snow wrote on 11/06/2015 10:55 AM:
> Hello Vlad
> 
> Excellent news on all the updates.
> 
> Regarding this:
> - QLogic target driver has been significantly improved.
> 
> Does that mean I should stop building the QLogic target driver from here?
> git://git.qlogic.com/scst-qla2xxx.git
>
> Or are you saying the git.qlogic.com driver has been
> improved?

It is saying that qla2x00t was improved.

The ultimate goal is for the mainstream (git) QLogic target driver to be the main and only QLogic target driver, but, unfortunately, that driver has not yet reached the level of quality and maturity of qla2x00t. We are working with QLogic toward that.

> If I stop building the one from git.qlogic.com, does the 3.2.0
> one support NPIV?

Yes, it has full NPIV support.

Vlad



[ANNOUNCE]: SCST 3.1 pre-release freeze

2015-11-05 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce the SCST 3.1 pre-release code freeze in the SCST SVN branch 3.1.x.

You can get it with the following command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.1.x

It is going to be released after a few weeks of testing, if no significant issues are found.

Highlights for this release:

 - Cluster support for SCSI reservations. This feature is essential for 
initiator-side
clustering approaches based on persistent reservations, e.g. the quorum disk
implementation in Windows Clustering.

 - Full support for VAAI (vStorage APIs for Array Integration): Extended Copy command support has been added, and the performance of WRITE SAME and of Atomic Test & Set, also known as COMPARE AND WRITE, has been improved.

 - T10-PI support has been added.

 - ALUA support has been improved: explicit ALUA (SET TARGET PORT GROUPS 
command) has
been added and DRBD compatibility has been improved.

 - SCST events user space infrastructure has been added, so now SCST can notify 
a user
space agent about important internal and fabric events.

 - QLogic target driver has been significantly improved.

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially to SanDisk for the great support! Development of all the above highlights was supported by SanDisk.

Vlad


[ANNOUNCE]: SCST 3.0.1 released

2015-02-24 Thread Vladislav Bolkhovitin
I'm glad to announce that the 3.0.1 maintenance update for SCST and its drivers has just been released and is ready for download from http://scst.sourceforge.net/downloads.html. All SCST users are encouraged to update.

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen, especially Bart Van Assche!

Vlad


Re: [LSF/MM TOPIC] iSCSI MQ adoption via MCS discussion

2015-01-13 Thread Vladislav Bolkhovitin
Sagi Grimberg wrote on 01/08/2015 05:45 AM:
>> RFC 3720 namely requires that iSCSI numbering is
>> session-wide. This means maintaining a single counter for all MC/S
>> sessions. Such a counter would be a contention point. I'm afraid that
>> because of that counter performance on a multi-socket initiator system
>> with a scsi-mq implementation based on MC/S could be worse than with the
>> approach with multiple iSER targets. Hence my preference for an approach
>> based on multiple independent iSER connections instead of MC/S.
>
> So this comment is spot on the pros/cons of the discussion (we might want to
> leave something for LSF ;)).
> MCS would not allow a completely lockless data-path due to command
> ordering. On the other hand implementing some kind of multiple sessions
> solution feels somewhat like a mis-fit (at least in my view).
>
> One of my thoughts about how to overcome the contention on commands
> sequence numbering was to suggest some kind of negotiable relaxed
> ordering mode but of course I don't have anything figured out yet.

The Linux SCSI/block stack neither uses nor guarantees any command ordering. Applications requiring ordering enforce it by queue draining (i.e. waiting until all previous commands have finished). Hence the command ordering enforced by MC/S is overkill, and it additionally comes with a non-zero performance cost.
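
To make the queue-draining point concrete, here is a minimal user-space sketch using libaio (the file descriptor, buffers and offsets are placeholders, error handling trimmed): a dependent write is submitted only after everything previously submitted has completed.

#include <libaio.h>
#include <stddef.h>

/* Minimal sketch: enforce ordering of write B after write A by draining
 * the queue (waiting for A to complete) before submitting B. */
static int ordered_writes(io_context_t ctx, int fd,
                          void *buf_a, void *buf_b, size_t len)
{
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        io_prep_pwrite(&cb, fd, buf_a, len, 0);              /* write A */
        if (io_submit(ctx, 1, cbs) != 1)
                return -1;

        if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)         /* drain: wait for A */
                return -1;

        io_prep_pwrite(&cb, fd, buf_b, len, (long long)len); /* dependent write B */
        return io_submit(ctx, 1, cbs) == 1 ? 0 : -1;
}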

Don't do MC/S; do independent connections. You know the KISS principle. The memory overhead of setting up the extra iSCSI sessions should be negligible.

Vlad



Re: T10-PI: Getting failed tag info

2014-12-12 Thread Vladislav Bolkhovitin
Martin K. Petersen wrote on 12/11/2014 07:12 PM:
> Vlad == Vladislav Bolkhovitin v...@vlnb.net writes:
>
> Vlad We are currently developing a SCSI target system with T10-PI. We
> Vlad are using block integrity interface and found a problem that this
> Vlad interface fundamentally can not pass Oracle T10-PI certification
> Vlad tests. Those tests require to receive on the initiator side
> Vlad information about which particular tag failed the target checks,
> Vlad but the block integrity interface does not preserve this
> Vlad information, hence the target can not deliver it to the initiator
> Vlad => certification failure. The storage provides the right sense,
> Vlad but then in scsi_io_completion() it is dropped and replaced by a
> Vlad single EILSEQ.
>
> Vlad What would be the best way to fix that? By making a patch
> Vlad introducing new -EXX error codes for the PI errors?
>
> I posted such a patch a while back. We use that in our qualification
> tooling to ensure that the right things are reported when a PI error is
> injected at various places in the stack.

Thanks, this is exactly what is needed.

Reviewed-by: Vladislav Bolkhovitin v...@vlnb.net

> One thing that needs to be done is to make returning these new errors to
> userland conditional on !BIP_BLOCK_INTEGRITY. I'll put that on my list.

Even without it, this patch is quite valuable.

Vlad




T10-PI: Getting failed tag info

2014-12-10 Thread Vladislav Bolkhovitin
Hi,

We are currently developing a SCSI target system with T10-PI. We are using the block integrity interface and found a problem: this interface fundamentally cannot pass the Oracle T10-PI certification tests. Those tests require the initiator side to receive information about which particular tag failed the target's checks, but the block integrity interface does not preserve this information, hence the target cannot deliver it to the initiator => certification failure. The storage provides the right sense data, but then in scsi_io_completion() it is dropped and replaced by a single EILSEQ.

What would be the best way to fix that? By making a patch introducing new 
-EXX error codes for the PI errors?
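
For illustration only, the kind of mapping being asked about could look roughly like the sketch below. EGUARD/EAPPTAG/EREFTAG are made-up placeholder names (not existing kernel error codes); the point is that the ASC/ASCQ of the PI sense selects a distinct error instead of collapsing everything to EILSEQ. Assumes the usual SCSI sense definitions (struct scsi_sense_hdr, ABORTED_COMMAND, ILLEGAL_REQUEST).

/* Hypothetical illustration only: these codes do not exist in the mainline
 * kernel; they just stand for "new -EXX codes, one per PI check". */
#define EGUARD   0x200   /* LOGICAL BLOCK GUARD CHECK FAILED */
#define EAPPTAG  0x201   /* LOGICAL BLOCK APPLICATION TAG CHECK FAILED */
#define EREFTAG  0x202   /* LOGICAL BLOCK REFERENCE TAG CHECK FAILED */

static int pi_sense_to_errno(const struct scsi_sense_hdr *sshdr)
{
	/* T10-PI check failures are reported with ASC 0x10, ASCQ 1/2/3 */
	if ((sshdr->sense_key == ABORTED_COMMAND ||
	     sshdr->sense_key == ILLEGAL_REQUEST) && sshdr->asc == 0x10) {
		switch (sshdr->ascq) {
		case 0x01: return -EGUARD;
		case 0x02: return -EAPPTAG;
		case 0x03: return -EREFTAG;
		}
	}
	return -EILSEQ;		/* current catch-all behaviour */
}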

Thanks,
Vlad


Re: [Scst-devel] New qla2x00tgt Driver Question

2014-12-04 Thread Vladislav Bolkhovitin
Dr. Greg Wettstein wrote on 12/03/2014 11:42 PM:
> On Dec 3,  8:59pm, Vladislav Bolkhovitin wrote:
> } Subject: Re: [Scst-devel] New qla2x00tgt Driver Question
>
>> Dr. Greg Wettstein wrote on 12/03/2014 12:46 PM:
>>> Secondly, Vlad, we have been running additional testing for the last
>>> two days and we have logs from the SCST core which I am including
>>> below which suggests that the SCST core target code excessively stalls
>>> or mishandles an ABORT while processing a NEXUS_LOSS_SESS TMF.
>>> Regardless of your feelings about the target driver code in the kernel
>>> we need to make sure there is not some subtle regression in the core
>>> SCST code paths during TMF processing.
>>
>> I don't see any problem on the SCST core level in the logs.
>
> Fair enough, thanks for taking a look.
>
> I thought it was somewhat strange to see deferred ABORTs on I/O being
> done to RAM based block devices as there is little or no I/O latency.
> In our testing, this regression always occurs on TMF function 6
> processing and this was also the case in Marc's report. The comment
> that one of the other posters made that this was secondary to slow
> backstorage didn't match the characteristics of our test environment.

For TM processing, the backend and frontend (target) sides are equal, so the processing time of the slower side is what defines your TM processing time.

Vlad


Re: [Scst-devel] New qla2x00tgt Driver Question

2014-12-03 Thread Vladislav Bolkhovitin
Dr. Greg Wettstein wrote on 12/03/2014 12:46 PM:
> Secondly, Vlad, we have been running additional testing for the last
> two days and we have logs from the SCST core which I am including
> below which suggests that the SCST core target code excessively stalls
> or mishandles an ABORT while processing a NEXUS_LOSS_SESS TMF.
> Regardless of your feelings about the target driver code in the kernel
> we need to make sure there is not some subtle regression in the core
> SCST code paths during TMF processing.

I don't see any problem on the SCST core level in the logs.

Vlad


Re: [ANNOUNCE]: SCST 3.0 released

2014-09-21 Thread Vladislav Bolkhovitin
No, because it's too new, but you can always get it from git. Or you can use the stable Emulex driver for 16Gb connectivity. It's not in the bundle only because of Emulex policy.


Thanks,
Vlad

On 9/19/2014 23:59, scst.n...@gmail.com wrote:

Is the 16Gb qla2x00t included?

Sent from my Xiaomi phone

Vladislav Bolkhovitin v...@vlnb.net wrote on 2014-9-20 at 2:39 PM:

Hi All,

I'm glad to announce that SCST 3.0 has just been released. This
release includes SCST
core, target drivers iSCSI-SCST for iSCSI, including iSER support
(thanks to
Mellanox!), qla2x00t for QLogic Fibre Channel adapters, ib_srpt for
InfiniBand SRP,
fcst for FCoE and scst_local for local loopback-like access as well
as SCST management
utility scstadmin. Also separately you can download from Emulex
development portal
stable and fully functional target driver for the current generation
of Emulex Fibre
Channel adapters.

SCST is alternative SCSI target stack for Linux. SCST allows
creation of sophisticated
storage devices, which provide advanced functionality, like
replication, thin
provisioning, deduplication, high availability, automatic backup,
etc. Majority of
recently developed SAN appliances, especially higher end ones, are
SCST based. It might
well be that your favorite storage appliance running SCST in the
firmware.

More info about SCST and its modules you can find on:
http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad



[ANNOUNCE]: SCST 3.0 released

2014-09-20 Thread Vladislav Bolkhovitin

Hi All,

I'm glad to announce that SCST 3.0 has just been released. This release includes the SCST core, the target drivers iSCSI-SCST for iSCSI (including iSER support, thanks to Mellanox!), qla2x00t for QLogic Fibre Channel adapters, ib_srpt for InfiniBand SRP, fcst for FCoE and scst_local for local loopback-like access, as well as the SCST management utility scstadmin. Additionally, a stable and fully functional target driver for the current generation of Emulex Fibre Channel adapters can be downloaded separately from the Emulex development portal.


SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.


More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad


Re: [PATCH v2 1/3] scsi_cmnd: Introduce scsi_transfer_length helper

2014-06-24 Thread Vladislav Bolkhovitin

Martin K. Petersen, on 06/23/2014 06:58 PM wrote:

Mike == Mike Christie micha...@cs.wisc.edu writes:

+ unsigned int xfer_len = blk_rq_bytes(scmd->request);


Mike Can you do bidi and dif/dix?

Nope.


Correction: at the moment.

There is a proposal for a READ GATHERED command, which is bidirectional and potentially DIF/DIX-capable.


Vlad




[ANNOUNCE]: SCST 3.0 pre-release freeze

2014-05-21 Thread Vladislav Bolkhovitin
Hi All,

I'm glad to announce the SCST 3.0 pre-release code freeze in the SCST SVN branch 3.0.x.

You can get it with the following command:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst/branches/3.0.x

It is going to be released after a few weeks of testing, if nothing bad is found.

SCST is an alternative SCSI target stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Thanks to all who made it happen!

Vlad


[ANNOUNCE]: SCST iSER target driver is available for testing

2014-01-29 Thread Vladislav Bolkhovitin
I'm glad to announce that the SCST iSER target driver is available for testing from the SCST SVN iser branch. You can download it either with the command:

$ svn checkout svn://svn.code.sf.net/p/scst/svn/branches/iser iser-scst-branch

or by clicking the Download Snapshot button on the
http://sourceforge.net/p/scst/svn/HEAD/tree/branches/iser page.

Big thanks to Yan Burman and Mellanox Technologies who developed it!

SCST is a SCSI target mode stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Vlad


Re: Is there any plan to support 64bit lun in mainline?

2013-10-15 Thread Vladislav Bolkhovitin
Hannes Reinecke, on 10/14/2013 11:01 PM wrote:
> And HBAs like lpfc or qla2xxx even have a fast command abort built
> into the firmware, where the firmware will not even wait for a
> command abort to hit the wire but rather just disable the exchange
> internally and return.

Doing so is asking for data corruption. Aborts are intended to clean up commands on the target so that they cannot interact badly with future commands. Otherwise it is possible that an old WRITE command, stuck in some deep corner inside the target, is bypassed by such a pseudo-abort and then gets released AFTER its LBAs have been written with newer data, hence overwriting new data with old data.

Vlad


Re: Bypass block layer and Fill SCSI lower layer driver queue

2013-09-27 Thread Vladislav Bolkhovitin
Douglas Gilbert, on 09/18/2013 07:07 AM wrote:
 On 13-09-18 03:58 AM, Jack Wang wrote:
 On 09/18/2013 08:41 AM, Alireza Haghdoost wrote:
 Hi

 I am working on a high throughput and low latency application which
 does not tolerate block layer overhead to send IO request directly to
 fiber channel lower layer SCSI driver. I used to work with libaio but
 currently I am looking for a way to by pass the block layer and send
 SCSI commands from the application layer directly to the SCSI driver
 using /dev/sgX device and ioctl() system call.

 I have noticed that sending IO request through sg device even with
 nonblocking and direct IO flags is quite slow and does not fill up
 lower layer SCSI driver TCQ queue. i.e IO depth or
 /sys/block/sdX/in_flight is always ZERO. Therefore the application
 throughput is even lower that sending IO request through block layer
 with libaio and io_submit() system call. In both cases I used only one
 IO context (or fd) and single threaded.

 Hi Alireza,

 I think what you want is in_flight command scsi dispatch to low level
 device.
 I submit a simple patch to export device_busy

 http://www.spinics.net/lists/linux-scsi/msg68697.html

 I also notice fio sg engine will not fill queue properly, but haven't
 look into deeper.

 Cheers
 Jack

 I have noticed that some well known benchmarking tools like fio does
 not support IO depth for sg devices as well. Therefore, I was
 wondering if it is feasible to bypass block layer and achieve higher
 throughput and lower latency (for sending IO request only).


 Any comment on my issue is highly appreciated.
 
 I'm not sure if this is relevant to your problem but by
 default both the bsg and sg drivers queue at head
 when they inject SCSI commands into the block layer.
 
 The bsg driver has a BSG_FLAG_Q_AT_TAIL flag to change
 that queueing to what may be preferable for your purposes.
 The sg driver could, but does not, support that flag.

Just curious: for how long is this counterproductive insert-at-head going to stay? I guess by now (almost) nobody can recall why it is so. This behavior makes the sg interface basically unusable for anything bigger than sg-utils.
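
For reference, the per-command flag Doug mentions is set in struct sg_io_v4; a minimal user-space sketch (the bsg file descriptor and CDB are placeholders) might look like this:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>        /* SG_IO */
#include <linux/bsg.h>      /* struct sg_io_v4, BSG_FLAG_Q_AT_TAIL */

/* Issue a TEST UNIT READY through an already opened bsg fd, asking the
 * block layer to queue it at the tail instead of the head. */
static int tur_at_tail(int bsg_fd)
{
	unsigned char cdb[6] = { 0 };           /* TEST UNIT READY */
	unsigned char sense[32];
	struct sg_io_v4 hdr;

	memset(&hdr, 0, sizeof(hdr));
	hdr.guard = 'Q';
	hdr.protocol = BSG_PROTOCOL_SCSI;
	hdr.subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
	hdr.request = (uintptr_t)cdb;
	hdr.request_len = sizeof(cdb);
	hdr.response = (uintptr_t)sense;
	hdr.max_response_len = sizeof(sense);
	hdr.timeout = 30000;                    /* ms */
	hdr.flags = BSG_FLAG_Q_AT_TAIL;         /* do not insert at head */

	return ioctl(bsg_fd, SG_IO, &hdr);
}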

Vlad


[ANNOUNCE]: Emulex SCST support for 16Gb/s FC and FCoE CNAs

2013-09-04 Thread Vladislav Bolkhovitin
I'm glad to announce that SCST support for 16Gb/s FC and FCoE Emulex CNAs is now
available as part of the Emulex OneCore Storage SDK tool set based on the 
Emulex SLI-4
API. Support for 16Gb/s Fibre Channel LPe16000 series and FCoE hardware using 
target
mode versions of the OneConnect FCoE CNAs is included. Documented for use with
RHEL/CentOS 6.x based distributions, ocs_fc_scst works with the stable SCST 
2.2.1 as
well as the development versions of 2.2.x and 3.0.x. The driver code and 
documentation
are available on the Emulex web site at:
http://www.emulex.com/products/onecore-storage-software-development-kit/overview.html

Registration is required on the Developer Portal, but this is free.

Questions regarding this driver are better asked via the Developer Portal.

SCST is a SCSI target mode stack for Linux. SCST allows the creation of sophisticated storage devices that provide advanced functionality such as replication, thin provisioning, deduplication, high availability, automatic backup, etc. The majority of recently developed SAN appliances, especially higher-end ones, are based on SCST. It might well be that your favorite storage appliance is running SCST in its firmware.

More info about SCST and its modules can be found at:
http://scst.sourceforge.net

Vlad


Re: atomic write T10 standards

2013-07-03 Thread Vladislav Bolkhovitin
Ric Wheeler, on 07/03/2013 11:31 AM wrote:
 Journals are normally big (128MB or so?) - I don't think that this is 
 unique to xfs.
 We're mixing a bunch of concepts here.  The filesystems have a lot of
 different requirements, and atomics are just one small part.

 Creating a new file often uses resources freed by past files.  So
 deleting the old must be ordered against allocating the new.  They are
 really separate atomic units but you can't handle them completely
 independently.

 If our existing journal commit is:

 * write the data blocks for a transaction
 * flush
 * write the commit block for the transaction
 * flush

 Which part of this does and atomic write help?

 We would still need at least:

 * atomic write of data blocks  commit blocks
 * flush

Not necessary.

Consider a case where you are creating many small files in a big directory. Every such operation needs 3 actions: add a new directory entry, get free space, and write data there. If one atomic (scattered) write command is used for each operation, and you order the commands between each other where needed, e.g. by using the ORDERED SCSI attribute or queue draining, you don't need any intermediate flushes. One final flush would be sufficient. In case of a crash some of the new files would simply disappear, but everything would remain fully consistent, so the only recovery needed would be to recreate them.
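
A purely hypothetical sketch of that pattern; submit_atomic_scattered_write() and synchronize_cache() are made-up stand-ins for an atomic scattered-write facility and a SYNCHRONIZE CACHE, not an existing API:

#include <stdint.h>
#include <stdbool.h>

struct extent { uint64_t lba; const void *data; uint32_t blocks; };

struct file_create {
	uint64_t dirent_lba, bitmap_lba, data_lba;
	const void *dirent_block, *bitmap_block, *data_block;
	uint32_t data_blocks;
};

/* Hypothetical helpers: one atomic scattered write covering several extents,
 * and a cache flush to the backing device. */
extern int submit_atomic_scattered_write(const struct extent *ext, int n, bool ordered);
extern int synchronize_cache(void);

static int create_files(const struct file_create *f, int nfiles)
{
	for (int i = 0; i < nfiles; i++) {
		struct extent ext[3] = {
			{ f[i].dirent_lba, f[i].dirent_block, 1 },
			{ f[i].bitmap_lba, f[i].bitmap_block, 1 },
			{ f[i].data_lba,   f[i].data_block,   f[i].data_blocks },
		};
		/* Each create is one atomic unit; no intermediate flush needed. */
		if (submit_atomic_scattered_write(ext, 3, false))
			return -1;
	}
	return synchronize_cache();	/* one final flush for the whole batch */
}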

 The catch is that our current flush mechanisms are still pretty brute force 
 and 
 act across either the whole device or in a temporal (everything flushed 
 before 
 this is acked) way.
 
 I still see it would be useful to have the atomic write really be atomic and 
 durable just for that IO - no flush needed.
 
 Can you give a sequence for the use case for the non-durable atomic write 
 that 
 would not need a sync?

See above.

 Can we really trust all devices to make something atomic 
 that is not durable :) ?

Sure, if the application allows it and the atomicity property itself is durable, why not?

Vlad

P.S. With atomic writes there's no need for a journal, no?


Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))

2013-05-29 Thread Vladislav Bolkhovitin
Martin K. Petersen, on 05/28/2013 01:25 PM wrote:
 Vladislav Linux block layer is purely artificial creature slowly
 Vladislav reinventing wheel creating more problems, than solving.
 
 On the contrary. I do think we solve a whole bunch of problems.
 
 
 Vladislav It enforces approach, where often impossible means
 Vladislav impossible in this interface.
 
 I agree we have limitations. I do not agree that all limitations are
 bad. Sometimes it's OK to say no.
 
 
 Vladislav For instance, how about copy offload?  How about atomic
 Vladislav writes?
 
 I'm actively working on copy offload. Nobody appears to be interested in
 atomic writes. Otherwise I'd work on those as well.
 
 
 Vladislav Why was it needed to create special blk integrity interface
 Vladislav with the only end user - SCSI?
 
 Simple. Because we did not want to interleave data and PI 512+8+512+8
 neither in memory, nor at DMA time.

It can similarly be done in a SCSI-like interface without the need for any middleman.

 Furthermore, the ATA EPP proposal
 was still on the table so I also needed to support ATA.
 
 And finally, NVM Express uses the blk_integrity interface as well.
 
 
 Vladislav The block layer keeps repeating SCSI. So, maybe, after all,
 Vladislav it's better to acknowledge that direct usage of SCSI without
 Vladislav any intermediate layers and translations is more productive?
 Vladislav And for those minors not using SCSI internally, translate
 Vladislav from SCSI to their internal commands? Creating and filling
 Vladislav CDB fields for most cases isn't anyhow harder, than creating
 Vladislav and feeling bio fields.
 
 This is quite possibly the worst idea I have heard all week.
 
 As it stands it's a headache for the disk ULD driver to figure out which
 of the bazillion READ/WRITE variants to send to a SCSI/ATA device. What
 makes you think that an application or filesystem would be better
 equipped to make that call?
 
 See also: WRITE SAME w/ zeroes vs. WRITE SAME w/ UNMAP vs. UNMAP 
 
 See also: EXTENDED COPY vs. the PROXY command set
 
 See also: USB-ATA bridge chips
 
 You make it sound like all the block layer does is filling out
 CDBs. Which it doesn't in fact have anything to do with at all.
 
 When you are talking about CDBs we're down in the SBC/SSC territory.
 Which is such a tiny bit of what's going on. We have transports, we have
 SAM, we have HBA controller DMA constraints, system DMA constraints,
 buffer bouncing, etc. There's a ton of stuff that needs to happen before
 the CDB and the data physically reach the storage.
 
 You seem to be advocating that everything up to the point where the
 device receives the command is in the way. Well, by all means. Why limit
 ourselves to the confines of SCSI? Why not get rid of POSIX
 read()/write(), page cache, filesystems and let applications speak
 ST-506 directly?
 
 I know we're doing different things. My job is to make a general purpose
 operating system with interfaces that make sense to normal applications.
 That does not preclude special cases where it may make sense to poke at
 the device directly. For testing purposes, for instance. But I consider
 it a failure when we start having applications that know about hardware
 intricacies, cylinders/heads/sectors, etc. That road leads straight to
 the 1980s...

What you mean is true, but my point is that this abstraction is better done in a SCSI, i.e. SAM, manner. No need to write fields inside CDBs, that would be pretty inconvenient ;). But CDB fields can be fields in some scsi_io structure, and the exact opcodes can easily be abstracted and filled in at the last stage, where the final CDB is constructed from those fields.
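
Purely as an illustration of that argument, such a scsi_io descriptor might look roughly like this; none of these names exist in the kernel, and the concrete opcode would be chosen only at the final, transport-facing stage:

#include <linux/types.h>
#include <linux/scatterlist.h>

/* Hypothetical SAM-level descriptor, illustration only. */
enum scsi_io_op { SIO_READ, SIO_WRITE, SIO_WRITE_SAME, SIO_UNMAP, SIO_XCOPY };

struct scsi_io {
	enum scsi_io_op op;		/* abstract operation, not a CDB opcode */
	u64 lba;
	u32 len;			/* in logical blocks */
	unsigned int fua:1;		/* force unit access */
	unsigned int ordered:1;		/* SAM ORDERED task attribute */
	struct sg_table *sgt;		/* data buffer */
	/* protection info, priority, ... */
};

/* A final stage would pick READ(10)/(16)/(32), WRITE SAME with or without
 * UNMAP, etc., and build the actual CDB from these fields. */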

The problem with the block abstraction is that it is the least common denominator of all block devices' capabilities, hence advanced capabilities available only on some classes of devices automatically become impossible. It would be more productive instead to use the most capable abstraction, which is SAM. With that abstraction there's no need to reinvent complex interfaces and write complex middleman code for every advanced capability. All advanced capabilities are available by definition, if supported by the underlying hardware. That's my point.

POSIX is for simple applications, for which read()/write() calls are sufficient. They are outside of our discussion. But advanced applications need more. I know plenty of applications issuing direct SCSI commands, but how many applications can you name that use the block interface (bsg)? I can recall only one relatively widely used Linux-specific library. That's all. This interface is not in demand by applications.
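
For comparison, this is roughly what issuing a direct SCSI command from an application looks like through the long-standing sg SG_IO interface (the /dev/sg0 path is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
	unsigned char cdb[6] = { 0x12, 0, 0, 0, 96, 0 };   /* INQUIRY, 96 bytes */
	unsigned char buf[96], sense[32];
	struct sg_io_hdr io;
	int fd = open("/dev/sg0", O_RDWR);                 /* placeholder device */

	if (fd < 0)
		return 1;
	memset(&io, 0, sizeof(io));
	io.interface_id = 'S';
	io.cmdp = cdb;
	io.cmd_len = sizeof(cdb);
	io.dxfer_direction = SG_DXFER_FROM_DEV;
	io.dxferp = buf;
	io.dxfer_len = sizeof(buf);
	io.sbp = sense;
	io.mx_sb_len = sizeof(sense);
	io.timeout = 5000;                                 /* ms */
	if (ioctl(fd, SG_IO, &io) < 0)
		return 1;
	printf("Vendor: %.8s  Product: %.16s\n",
	       (char *)(buf + 8), (char *)(buf + 16));
	return 0;
}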

Vlad


Re: PING^7 (was Re: [PATCH v2 00/14] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542))

2013-05-24 Thread Vladislav Bolkhovitin
Martin K. Petersen, on 05/22/2013 09:32 AM wrote:
 Paolo First of all, I'll note that SG_IO and block-device-specific
 Paolo ioctls both have their place.  My usecase for SG_IO is
 Paolo virtualization, where I need to pass information from the LUN to
 Paolo the virtual machine with as much fidelity as possible if I choose
 Paolo to virtualize at the SCSI level.  
 
 Now there's your problem! Several people told you way back that the SCSI
 virt approach was a really poor choice. The SG_IO permissions problem is
 a classic Doctor, it hurts when I do this.
 
 The kernel's fundamental task is to provide abstraction between
 applications and intricacies of hardware. The right way to solve the
 problem would have been to provide a better device abstraction built on
 top of the block/SCSI infrastructure we already have in place. If you
 need more fidelity, add fidelity to the block layer instead of punching
 a giant hole through it.
 
 I seem to recall that reservations were part of your motivation for
 going the SCSI route in the first place. A better approach would have
 been to create a generic reservations mechanism that could be exposed to
 the guest. And then let the baremetal kernel worry about the appropriate
 way to communicate with the physical hardware. Just like we've done with
 reads and writes, discard, write same, etc.

Well, any abstraction is good only if it isn't artificial, i.e. if it solves more problems than it creates.

The reality is that, de facto, _SCSI_ is the industry abstraction for block/direct access to data. Look around: how many of the systems around you, after all the layers, end up issuing SCSI commands to their storage devices?

The Linux block layer is a purely artificial creature slowly reinventing the wheel, creating more problems than it solves. It enforces an approach where "often impossible" means "impossible in this interface". For instance, how about copy offload? How about reservations? How about atomic writes? Look at the history of barriers and compare it with what can be done in SCSI. It's still worse, because it doesn't allow usage of all devices' capabilities. Why was it necessary to create a special blk integrity interface with the only end user being SCSI? An artificial task was created, then well solved. Etc., etc.

The block layer keeps repeating SCSI. So maybe, after all, it's better to acknowledge that direct usage of SCSI without any intermediate layers and translations is more productive? And for the minority not using SCSI internally, translate from SCSI to their internal commands? Creating and filling CDB fields in most cases isn't any harder than creating and filling bio fields.

So, I appreciate the work Paolo is doing in this direction. At least the right thing will be done at the virtualization level.

I do understand that, with all the existing baggage, replacing the block layer with SCSI isn't practical, and I am not proposing it, but let's at least acknowledge the limitations of the academic block abstraction. Let's not make those limitations global walls. Many things are better done using direct SCSI, so let's do them the better way.

Vlad


Re: [PATCH 1/6] target/file: Re-enable optional fd_buffered_io=1 operation

2012-10-02 Thread Vladislav Bolkhovitin

Christoph Hellwig, on 10/01/2012 04:46 AM wrote:

On Sun, Sep 30, 2012 at 05:58:11AM +, Nicholas A. Bellinger wrote:

From: Nicholas Bellinger n...@linux-iscsi.org

This patch re-adds the ability to optionally run in buffered FILEIO mode
(eg: w/o O_DSYNC) for device backends in order to once again use the
Linux buffered cache as a write-back storage mechanism.

This difference with this patch is that fd_create_virtdevice() now
forces the explicit setting of emulate_write_cache=1 when buffered FILEIO
operation has been enabled.


What this lacks is a clear reason why you would enable this inherently
unsafe mode.  While there is some clear precedence to allow people doing
stupid thing I'd least like a rationale for it, and it being documented
as unsafe.


Nowadays nearly all serious applications are transactional and know how to flush the storage cache between transactions. That means that write-back caching is absolutely safe for them. No data can be lost under any circumstances.


Welcome to the 21st century, Christoph!
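
A minimal sketch of the flush-between-transactions pattern referred to above (file descriptors and buffers are placeholders; real applications do this through their journal/WAL layer):

#include <stddef.h>
#include <unistd.h>

/* Write the transaction data, flush, then write the commit record, flush. */
static int commit_transaction(int data_fd, const void *data, size_t len,
			      int log_fd, const void *rec, size_t rec_len)
{
	if (pwrite(data_fd, data, len, 0) != (ssize_t)len)
		return -1;
	if (fdatasync(data_fd))		/* flush the write-back cache for the data */
		return -1;
	if (pwrite(log_fd, rec, rec_len, 0) != (ssize_t)rec_len)
		return -1;
	return fdatasync(log_fd);	/* make the commit record durable */
}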

Vlad


Re: Integration of SCST in the mainstream Linux kernel

2008-02-11 Thread Vladislav Bolkhovitin

Luben Tuikov wrote:

>>> Is there an open iSCSI Target implementation which does NOT
>>> issue commands to sub-target devices via the SCSI mid-layer, but
>>> bypasses it completely?
>>
>> What do you mean? To call directly low level backstorage SCSI drivers'
>> queuecommand() routine? What are the advantages of it?
>
> Yes, that's what I meant.  Just curious.

What's the advantage of it?

> Thanks,
>    Luben



Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

[EMAIL PROTECTED] wrote:

On Thu, 7 Feb 2008, Vladislav Bolkhovitin wrote:


Bart Van Assche wrote:


- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?



I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated 
for SCST framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work 
over some standard SCSI target framework. Hence the choice gets 
narrower: SCST vs STGT. I don't think there's a way for a dedicated 
iSCSI target (i.e. PyX/LIO) in the mainline, because of a lot of code 
duplication. Nicholas could decide to move to either existing 
framework (although, frankly, I don't think there's a possibility for 
in-kernel iSCSI target and user space SCSI target framework) and if he 
decide to go with SCST, I'll be glad to offer my help and support and 
wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The better 
one should win.



why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


the most common use of this sort of thing that I would see is to load up 
a bunch of 1TB SATA drives in a commodity PC, run software RAID, and 
then export the resulting volume to other servers via iSCSI. not a 
'real' SCSI device in sight.


As far as how good a standard iSCSI is, at this point I don't think it 
really matters. There are too many devices and manufacturers out there 
that implement iSCSI as their storage protocol (from both sides, 
offering storage to other systems, and using external storage). 
Sometimes the best technology doesn't win, but Linux should be 
interoperable with as much as possible and be ready to support the 
winners and the loosers in technology options, for as long as anyone 
chooses to use the old equipment (after all, we support things like 
Arcnet networking, which lost to Ethernet many years ago)


David, your question surprises me a lot. Where did you get the idea that SCST supports only pass-through backstorage? Does the RAM disk, which Bart has been using for performance tests, look like a SCSI device?

SCST supports all backstorage types you can imagine and the Linux kernel supports.



David Lang


Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Luben Tuikov wrote:

> Is there an open iSCSI Target implementation which does NOT
> issue commands to sub-target devices via the SCSI mid-layer, but
> bypasses it completely?

What do you mean? To call directly low level backstorage SCSI drivers'
queuecommand() routine? What are the advantages of it?

>    Luben



Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

On Thu, 2008-02-07 at 12:37 -0800, Luben Tuikov wrote:


Is there an open iSCSI Target implementation which does NOT
issue commands to sub-target devices via the SCSI mid-layer, but
bypasses it completely?

  Luben




Hi Luben,

I am guessing you mean futher down the stack, which I don't know this to
be the case.  Going futher up the layers is the design of v2.9 LIO-SE.
There is a diagram explaining the basic concepts from a 10,000 foot
level.

http://linux-iscsi.org/builds/user/nab/storage-engine-concept.pdf

Note that only traditional iSCSI target is currently implemented in v2.9
LIO-SE codebase in the list of target mode fabrics on left side of the
layout.  The API between the protocol headers that does
encoding/decoding target mode storage packets is probably the least
mature area of the LIO stack (because it has always been iSCSI looking
towards iSER :).  I don't know who has the most mature API between the
storage engine and target storage protocol for doing this between SCST
and STGT, I am guessing SCST because of the difference in age of the
projects.  Could someone be so kind to fill me in on this..?


SCST uses the scsi_execute_async_fifo() function to submit commands to SCSI 
devices in pass-through mode. This function is a slightly modified version of 
scsi_execute_async() that submits requests in FIFO order instead of LIFO as 
scsi_execute_async() does (so with scsi_execute_async() they are executed in 
the reverse order). scsi_execute_async_fifo() is added to the kernel as a 
separate patch.



Also note, the storage engine plugin for doing userspace passthrough on
the right is also currently not implemented.  Userspace passthrough in
this context is an target engine I/O that is enforcing max_sector and
sector_size limitiations, and encodes/decodes target storage protocol
packets all out of view of userspace.  The addressing will be completely
different if we are pointing SE target packets at non SCSI target ports
in userspace.

--nab



Re: Integration of SCST in the mainstream Linux kernel

2008-02-08 Thread Vladislav Bolkhovitin

Nicholas A. Bellinger wrote:

- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree with some important notes:

1. IET should be excluded from this list, iSCSI-SCST is IET updated for SCST 
framework with a lot of bugfixes and improvements.


2. I think, everybody will agree that Linux iSCSI target should work over 
some standard SCSI target framework. Hence the choice gets narrower: SCST vs 
STGT. I don't think there's a way for a dedicated iSCSI target (i.e. PyX/LIO) 
in the mainline, because of a lot of code duplication. Nicholas could decide 
to move to either existing framework (although, frankly, I don't think 
there's a possibility for in-kernel iSCSI target and user space SCSI target 
framework) and if he decide to go with SCST, I'll be glad to offer my help 
and support and wouldn't care if LIO-SCST eventually replaced iSCSI-SCST. The 
better one should win.


why should linux as an iSCSI target be limited to passthrough to a SCSI 
device.


nod

I don't think anyone is saying it should be.  It makes sense that the
more mature SCSI engines that have working code will be providing alot
of the foundation as we talk about options..


From comparing the designs of SCST and LIO-SE, we know that SCST has

supports very SCSI specific target mode hardware, including software
target mode forks of other kernel code.  This code for the target mode
pSCSI, FC and SAS control paths (more for the state machines, that CDB
emulation) that will most likely never need to be emulated on non SCSI
target engine.


...but required for SCSI. So it must be there anyway.


SCST has support for the most SCSI fabric protocols of
the group (although it is lacking iSER) while the LIO-SE only supports
traditional iSCSI using Linux/IP (this means TCP, SCTP and IPv6).  The
design of LIO-SE was to make every iSCSI initiator that sends SCSI CDBs
and data to talk to every potential device in the Linux storage stack on
the largest amount of hardware architectures possible.

Most of the iSCSI Initiators I know (including non Linux) do not rely on
heavy SCSI task management, and I think this would be a lower priority
item to get real SCSI specific recovery in the traditional iSCSI target
for users.  Espically things like SCSI target mode queue locking
(affectionally called Auto Contingent Allegiance) make no sense for
traditional iSCSI or iSER, because CmdSN rules are doing this for us.


Sorry, that isn't correct. ACA provides the possibility to lock the command 
queue in case of a CHECK CONDITION, so it allows keeping the command execution 
order in case of errors. CmdSN keeps the command execution order only in the 
success case; in case of an error the next queued command will be executed 
immediately after the failed one, although the application might require all 
commands subsequent to the failed one to be aborted. Think about journaled 
file systems, for instance. ACA also allows retrying the failed command and 
then resuming the queue.


Vlad


Re: Integration of SCST in the mainstream Linux kernel

2008-02-07 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

Since the focus of this thread shifted somewhat in the last few
messages, I'll try to summarize what has been discussed so far:
- There was a number of participants who joined this discussion
spontaneously. This suggests that there is considerable interest in
networked storage and iSCSI.
- It has been motivated why iSCSI makes sense as a storage protocol
(compared to ATA over Ethernet and Fibre Channel over Ethernet).
- The direct I/O performance results for block transfer sizes below 64
KB are a meaningful benchmark for storage target implementations.
- It has been discussed whether an iSCSI target should be implemented
in user space or in kernel space. It is clear now that an
implementation in the kernel can be made faster than a user space
implementation (http://kerneltrap.org/mailarchive/linux-kernel/2008/2/4/714804).
Regarding existing implementations, measurements have a.o. shown that
SCST is faster than STGT (30% with the following setup: iSCSI via
IPoIB and direct I/O block transfers with a size of 512 bytes).
- It has been discussed which iSCSI target implementation should be in
the mainstream Linux kernel. There is no agreement on this subject
yet. The short-term options are as follows:
1) Do not integrate any new iSCSI target implementation in the
mainstream Linux kernel.
2) Add one of the existing in-kernel iSCSI target implementations to
the kernel, e.g. SCST or PyX/LIO.
3) Create a new in-kernel iSCSI target implementation that combines
the advantages of the existing iSCSI kernel target implementations
(iETD, STGT, SCST and PyX/LIO).

As an iSCSI user, I prefer option (3). The big question is whether the
various storage target authors agree with this ?


I tend to agree, with some important notes:

1. IET should be excluded from this list; iSCSI-SCST is IET updated for the 
SCST framework with a lot of bugfixes and improvements.

2. I think everybody will agree that a Linux iSCSI target should work over 
some standard SCSI target framework. Hence the choice gets narrower: SCST vs. 
STGT. I don't think there's a place for a dedicated iSCSI target (i.e. PyX/LIO) 
in the mainline, because of a lot of code duplication. Nicholas could decide 
to move to either existing framework (although, frankly, I don't think there's 
a possibility of combining an in-kernel iSCSI target with a user space SCSI 
target framework), and if he decides to go with SCST, I'll be glad to offer my 
help and support, and wouldn't care if LIO-SCST eventually replaced 
iSCSI-SCST. The better one should win.


Vlad


Re: Integration of SCST in the mainstream Linux kernel

2008-02-06 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Tue, 2008-02-05 at 21:59 +0300, Vladislav Bolkhovitin wrote:


Hmm, how can one write to an mmapped page and not touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, the mmap() approach was agreed to be impractical, but could you 
elaborate on this anyway, please? I'm just curious. Are you thinking about 
implementing a new syscall that would put pages with data into the mmap'ed 
area?


No, it has to do with the way invalidation occurs.  When you mmap a
region from a device or file, the kernel places page translations for
that region into your vm_area.  The regions themselves aren't backed
until faulted.  For write (i.e. incoming command to target) you specify
the write flag and send the area off to receive the data.  The gather,
expecting the pages to be overwritten, backs them with pages marked
dirty but doesn't fault in the contents (unless it already exists in the
page cache).  The kernel writes the data to the pages and the dirty
pages go back to the user.  msync() flushes them to the device.

The disadvantage of all this is that the handle for the I/O if you will
is a virtual address in a user process that doesn't actually care to see
the data. non-x86 architectures will do flushes/invalidates on this
address space as the I/O occurs.


I more or less see, thanks. But (1) the pages still need to be mmapped into 
the user space process before the data transmission, i.e. they must be zeroed 
before being mmapped, which isn't much faster than a data copy, and (2) I 
suspect it would be hard to make it race free, e.g. if another process wanted 
to write to the same area simultaneously.



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just proved that there is no good way to 
implement zero-copy cached I/O for STGT. I see only one practical way for 
that, proposed by FUJITA Tomonori some time ago: duplicating the Linux page 
cache in user space. But would you like it?


Well, there's no real evidence that zero copy or lack of it is a problem
yet.


The performance improvement from zero copy can be easily estimated, knowing 
the link throughput and the data copy throughput, which are about the same for 
20 Gbps links (I did that a few e-mails ago).
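
For a rough estimate (illustrative numbers only, assuming the copy and the 
wire transfer of a buffer are serialized):

#include <stdio.h>

int main(void)
{
	double link_mb_s = 2500.0;	/* ~20 Gbps link payload rate, assumed */
	double copy_mb_s = 2500.0;	/* assumed memcpy() throughput */

	/* per-byte cost = 1/link + 1/copy => effective = link*copy/(link+copy) */
	double with_copy = link_mb_s * copy_mb_s / (link_mb_s + copy_mb_s);

	printf("zero copy: %.0f MB/s\n", link_mb_s);
	printf("one copy:  %.0f MB/s (%.0f%% of the link)\n",
	       with_copy, 100.0 * with_copy / link_mb_s);
	return 0;
}

With copy throughput about equal to link throughput, the single copy roughly 
halves the effective rate, i.e. zero copy roughly doubles it.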


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Erez Zilber wrote:

Bart Van Assche wrote:


As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
pieces of software are necessary to make networked storage possible:
initiator software and target software. As far as I know there exist
three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
InfiniBand network. See also the following thread for the details:
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.com&forum_name=scst-devel.

 


Sorry for the late response (but better late than never).

One may claim that STGT should have lower performance than SCST because
its data path is from userspace. However, your results show that for
non-IB transports, they both show the same numbers. Furthermore, with IB
there shouldn't be any additional difference between the 2 targets
because data transfer from userspace is as efficient as data transfer
from kernel space.


And now consider if one target has zero-copy cached I/O. How much will that 
improve its performance?



The only explanation that I see is that fine tuning for iSCSI & iSER is
required. As was already mentioned in this thread, with SDR you can get
~900 MB/sec with iSER (on STGT).

Erez

-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Jeff Garzik wrote:
iSCSI is way, way too complicated. 


I fully agree. On the one hand, all that complexity is unavoidable for the 
case of multiple connections per session, but for the regular case of one 
connection per session it could be a lot simpler.


Actually, think about those multiple connections...  we already had to 
implement fast-failover (and load bal) SCSI multi-pathing at a higher 
level.  IMO that portion of the protocol is redundant:   You need the 
same capability elsewhere in the OS _anyway_, if you are to support 
multi-pathing.


I'm thinking about MC/S as a way to improve performance using several 
physical links. There's no way other than MC/S to keep the command processing 
order in that case. So it's a really valuable property of iSCSI, although one 
with limited application.


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Jeff Garzik wrote:

Alan Cox wrote:

better. So for example, I personally suspect that ATA-over-ethernet is way 
better than some crazy SCSI-over-TCP crap, but I'm biased for simple and 
low-level, and against those crazy SCSI people to begin with.


Current ATAoE isn't. It can't support NCQ. A variant that did NCQ and IP
would probably trash iSCSI for latency if nothing else.



AoE is truly a thing of beauty.  It has a two/three page RFC (say no more!).

But quite so...  AoE is limited to MTU size, which really hurts.  Can't 
really do tagged queueing, etc.



iSCSI is way, way too complicated. 


I fully agree. On the one hand, all that complexity is unavoidable for the 
case of multiple connections per session, but for the regular case of one 
connection per session it could be a lot simpler.


And now think about iSER, which brings iSCSI to a whole new complexity level ;)


It's an Internet protocol designed 
by storage designers, what do you expect?


For years I have been hoping that someone will invent a simple protocol 
(w/ strong auth) that can transit ATA and SCSI commands and responses. 
Heck, it would be almost trivial if the kernel had a TLS/SSL implementation.


Jeff

-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Linus Torvalds wrote:

I'd assumed the move was primarily because of the difficulty of getting
correct semantics on a shared filesystem



.. not even shared. It was hard to get correct semantics full stop. 

Which is a traditional problem. The thing is, the kernel always has some 
internal state, and it's hard to expose all the semantics that the kernel 
knows about to user space.


So no, performance is not the only reason to move to kernel space. It can 
easily be things like needing direct access to internal data queues (for a 
iSCSI target, this could be things like barriers or just tagged commands - 
yes, you can probably emulate things like that without access to the 
actual IO queues, but are you sure the semantics will be entirely right?


The kernel/userland boundary is not just a performance boundary, it's an 
abstraction boundary too, and these kinds of protocols tend to break 
abstractions. NFS broke it by having file handles (which is not 
something that really exists in user space, and is almost impossible to 
emulate correctly), and I bet the same thing happens when emulating a SCSI 
target in user space.


Yes, there is something like that for a SCSI target as well. It's a local 
initiator, or local nexus; see 
http://thread.gmane.org/gmane.linux.scsi/31288 and 
http://news.gmane.org/find-root.php?message_id=%3c463F36AC.3010207%40vlnb.net%3e 
for more info about that.


In fact, the existence of the local nexus is one more point in favor of SCST 
over STGT, because it is pretty hard for STGT to support it (all locally 
generated commands would have to be passed through its daemon, which would be 
a total disaster for performance), while in SCST it can be done relatively 
simply.


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Mon, 2008-02-04 at 21:38 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:



James Bottomley wrote:



On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:




James Bottomley wrote:



So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look that way to me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one part of 
it: simplicity. In particular, what do you think about duplicating the Linux 
page cache in user space to get zero-copy cached I/O? Or can you suggest 
another architectural solution for that problem within STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of the 
advantages of the kernel space implementation. The user space implementation 
has to copy data between the cache and a user space buffer, while the kernel 
space one can use the pages in the cache directly, without an extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O from the 
backing store and then send this mmapped region off for target I/O.  For 
reads, the page gather will ensure that the pages are up to date from the 
backing store to the cache before sending the I/O out. For writes, you 
actually have to do an msync on the region to get the data secured to the 
backing store. 


James, have you checked how fast mmapped I/O is if work size > size of RAM? 
It's several times slower compared to buffered I/O. It was discussed many 
times on LKML and, it seems, the VM people consider it unavoidable. 



Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.


James, just check and you will see, buffered I/O is a lot faster.


So in an out of memory situation the buffers you don't have are a lot
faster than the pages I don't have?


There isn't an OOM situation in either case. It's just that page reclamation 
and readahead work much better in the buffered case.


So, using mmapped I/O isn't an option for high performance. Plus, mmapped I/O 
isn't an option where there are high reliability requirements, since it 
doesn't provide a practical way to handle I/O errors.


I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. 


Err, returned to whom? If you try to read from an mmapped page which can't be 
populated due to an I/O error, you will get SIGBUS or SIGSEGV, I don't 
remember exactly. It's quite tricky to get back to the faulted command from 
the signal handler.


Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you 
think that such mapping/unmapping is good for performance?




msync does something
similar if there's a write failure.



You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more exact and specify what kind of tricks should be done for 
that?


Actually, just avoid touching it seems to do the trick with a recent
kernel.


Hmm, how can one write to an mmapped page and not touch it?


I meant from user space ... the writes are done inside the kernel.


Sure, the mmap() approach was agreed to be impractical, but could you 
elaborate on this anyway, please? I'm just curious. Are you thinking about 
implementing a new syscall that would put pages with data into the mmap'ed 
area?



However, as Linus has pointed out, this discussion is getting a bit off
topic. 


No, that isn't off topic. We've just proved that there is no good way to 
implement zero-copy cached I/O for STGT. I see only one practical way for 
that, proposed by FUJITA Tomonori some time ago: duplicating the Linux page 
cache in user space. But would you like it?



There's no actual evidence that copy problems are causing any
performance issues for STGT.  In fact, there's evidence that
they're not for everything except IB networks.


Zero-copy cached I/O has not been implemented in SCST yet; I simply have not 
had time for it so far. Currently SCST performs better than STGT because of a 
simpler processing path and fewer context switches per command. Memcpy() speed 
on modern systems is about

Re: Integration of SCST in the mainstream Linux kernel

2008-02-05 Thread Vladislav Bolkhovitin

Linus Torvalds wrote:
So just going by what has happened in the past, I'd assume that iSCSI 
would eventually turn into connecting/authentication in user space with 
data transfers in kernel space.


This is exactly how iSCSI-SCST (the iSCSI target driver for SCST) is 
implemented; credit to the IET and Ardis target developers.


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Feb 4, 2008 1:27 PM, Vladislav Bolkhovitin [EMAIL PROTECTED] wrote:


So, James, what is your opinion on the above? Or the overall SCSI target
project simplicity doesn't matter much for you and you think it's fine
to duplicate Linux page cache in the user space to keep the in-kernel
part of the project as small as possible?



It's too early to draw conclusions about performance. I'm currently
performing more measurements, and the results are not easy to
interpret. My plan is to measure the following:
* Setup: target with RAM disk of 2 GB as backing storage.
* Throughput reported by dd and xdd (direct I/O).
* Transfers with dd/xdd in units of 1 KB to 1 GB (the smallest
transfer size that can be specified to xdd is 1 KB).
* Target SCSI software to be tested: IETD iSCSI via IPoIB, STGT iSCSI
via IPoIB, STGT iSER, SCST iSCSI via IPoIB, SCST SRP, LIO iSCSI via
IPoIB.

The reason I chose dd/xdd for these tests is that I want to measure
the performance of the communication protocols, and that I am assuming
that this performance can be modeled by the following formula:
(transfer time in s) = (transfer setup latency in s) + (transfer size
in MB) / (bandwidth in MB/s).


It isn't fully correct: you forgot about link latency. A more correct one is:

(transfer time) = (transfer setup latency on both initiator and target, 
consisting of software processing time, including a memory copy if necessary, 
and PCI setup/transfer time) + (transfer size)/(bandwidth) + (link latency to 
deliver the request for READs or the status for WRITEs) + (2*(link latency) to 
deliver the R2T/XFER_READY request in case of WRITEs, if necessary (e.g. iSER 
for small transfers might not need it, but SRP most likely always needs it)).

Also, note that this is correct only in the case of single-threaded workloads 
with one outstanding command at a time. For other workloads it depends on how 
well they manage to keep the link full, somewhere in the interval from 
(transfer size)/(transfer time) up to the full bandwidth.
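
Written out as code, so the terms are explicit (a sketch only; the parameter 
names are mine, units are seconds, bytes and bytes/s):

/* Sketch only: the model above as a function. Valid for a single-threaded
 * workload with one outstanding command at a time. */
struct xfer_params {
	double setup_latency;	/* initiator + target software and PCI setup time */
	double bandwidth;	/* link payload bandwidth */
	double link_latency;	/* one-way link latency */
	int    needs_r2t;	/* 1 if an R2T/XFER_READY round trip is needed (WRITEs) */
};

double transfer_time(double size, const struct xfer_params *p)
{
	double t = p->setup_latency + size / p->bandwidth;

	t += p->link_latency;		/* deliver the request (READ) or status (WRITE) */
	if (p->needs_r2t)
		t += 2.0 * p->link_latency;	/* extra round trip for R2T/XFER_READY */
	return t;
}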



Measuring the time needed for transfers
with varying block size allows to compute the constants in the above
formula via linear regression.


Unfortunately, it isn't so easy, see above.


One difficulty I already encountered is that the performance of the
Linux IPoIB implementation varies a lot under high load
(http://bugzilla.kernel.org/show_bug.cgi?id=9883).

Another issue I have to look further into is that dd and xdd report
different results for very large block sizes ( 1 MB).


Look at /proc/scsi_tgt/sgv (for SCST) and you will see which transfer sizes 
are actually used. Initiators don't like sending big requests and often split 
them into smaller ones.


Look at this message as well, it might be helpful: 
http://lkml.org/lkml/2007/5/16/223



Bart Van Assche.



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Vladislav Bolkhovitin

James Bottomley wrote:
So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look that way to me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one part of 
it: simplicity. In particular, what do you think about duplicating the Linux 
page cache in user space to get zero-copy cached I/O? Or can you suggest 
another architectural solution for that problem within STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of the 
advantages of the kernel space implementation. The user space implementation 
has to copy data between the cache and a user space buffer, while the kernel 
space one can use the pages in the cache directly, without an extra copy.


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:

So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look that way to me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one part of 
it: simplicity. In particular, what do you think about duplicating the Linux 
page cache in user space to get zero-copy cached I/O? Or can you suggest 
another architectural solution for that problem within STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of the 
advantages of the kernel space implementation. The user space implementation 
has to copy data between the cache and a user space buffer, while the kernel 
space one can use the pages in the cache directly, without an extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store. 


James, have you checked how fast mmapped I/O is if work size > size of RAM? 
It's several times slower compared to buffered I/O. It was discussed many 
times on LKML and, it seems, the VM people consider it unavoidable. So, using 
mmapped I/O isn't an option for high performance. Plus, mmapped I/O isn't an 
option where there are high reliability requirements, since it doesn't provide 
a practical way to handle I/O errors.
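
For reference, here is roughly the path you are describing, as I understand it 
(a sketch with made-up transport helpers, offset and length assumed page 
aligned, error handling trimmed; this is not STGT code):

/* Sketch only; transfer_to_initiator()/transfer_from_initiator() are
 * hypothetical transport hooks. */
#include <sys/types.h>
#include <sys/mman.h>

extern int transfer_to_initiator(void *buf, size_t len);	/* hypothetical */
extern int transfer_from_initiator(void *buf, size_t len);	/* hypothetical */

static int handle_read(int backing_fd, off_t offset, size_t len)
{
	/* faulting the pages in ("page gather") pulls the data through the
	 * page cache before it goes out on the wire */
	void *buf = mmap(NULL, len, PROT_READ, MAP_SHARED, backing_fd, offset);

	if (buf == MAP_FAILED)
		return -1;
	int ret = transfer_to_initiator(buf, len);
	munmap(buf, len);
	return ret;
}

static int handle_write(int backing_fd, off_t offset, size_t len)
{
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 backing_fd, offset);

	if (buf == MAP_FAILED)
		return -1;
	int ret = transfer_from_initiator(buf, len);
	if (ret == 0)
		ret = msync(buf, len, MS_SYNC);	/* secure data to the backing store */
	munmap(buf, len);
	return ret;
}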



You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more exact and specify what kind of tricks should be done for 
that?



 However, none of this involves data copies.

James





-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin wrote:

James Bottomley wrote:


The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel.  Right at the moment,
it's STGT.  Problems in STGT come from the user-kernel boundary which
can be mitigated in a variety of ways.  The fact that the figures are
pretty much comparable on non IB networks shows this.

I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another.  Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.



James,

Although the performance difference between STGT and SCST is apparent, this 
isn't the only reason why SCST is better. I've already written about it many 
times on various mailing lists, but let me summarize it one more time here.


As you know, almost all kernel parts can be done in user space, including all 
the drivers, networking, and I/O management with the block/SCSI initiator 
subsystem and the disk cache manager. But does that mean the current Linux 
kernel is bad and all of the above should be (re)done in user space instead? I 
believe not. Linux isn't a microkernel for very pragmatic reasons: simplicity 
and performance. So an additional important point in SCST's favor is 
simplicity.


For a SCSI target, especially with a hardware target card, data come from the 
kernel and are eventually served by the kernel, which does the actual I/O or 
gets/puts data from/to the cache. Dividing request processing between user and 
kernel space creates unnecessary interface layer(s) and effectively turns 
request processing into a distributed job, with all its complexity and 
reliability problems. From my point of view, such a division, where user space 
is the master side and the kernel is the slave, is rather wrong, because:


1. It makes the kernel depend on a user program, which services it and 
provides routines for it, while the regular paradigm is the opposite: the 
kernel services user space applications. A direct consequence is that there is 
no real protection for the kernel from faults in the STGT core code without 
excessive effort, which, no surprise, hasn't been done so far and, it seems, 
is never going to be done. So, in practice, debugging and developing under 
STGT isn't easier than if the whole code were in kernel space, but actually 
harder (see below why).


2. It requires a new, complicated interface between kernel and user space that 
creates additional maintenance and debugging headaches, which don't exist for 
kernel-only code. Linus Torvalds some time ago perfectly described why this is 
bad, see http://lkml.org/lkml/2007/4/24/451, 
http://lkml.org/lkml/2006/7/1/41 and http://lkml.org/lkml/2007/4/24/364.


3. It makes it impossible for the SCSI target to use (at least in a simple and 
sane way) many effective optimizations: zero-copy cached I/O, more control 
over read-ahead, device queue plugging/unplugging, etc. One example of such a 
feature, already implemented, is zero-copy network data transmission, done in 
a simple 260-line put_page_callback patch. This optimization is especially 
important for the user space gate (the scst_user module), see below for 
details.


The whole notion that development for the kernel is harder than for user space 
is total nonsense nowadays. It's different, yes; in some ways more limited, 
yes; but not harder. For those who need gdb (I haven't for many years), the 
kernel has kgdb, plus many debug facilities that are either not available in 
user space or are more limited there, like lockdep, lockup detection, 
oprofile, etc. (not to mention a wider choice of more effectively implemented 
synchronization primitives, and more).


For people who need complicated target device emulation, like, e.g., in the 
case of a VTL (Virtual Tape Library), where there is a need to operate on 
large mmap'ed memory areas, SCST provides a gateway to user space (the 
scst_user module), but, in contrast with STGT, it's done in the regular 
kernel-is-master, user-application-is-slave paradigm, so it's reliable and no 
fault in a user space device emulator can break the kernel or other user space 
applications. Plus, since the SCSI target state machine and memory management 
are in the kernel, it's very efficient and needs only one kernel-user space 
switch per SCSI command.
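
For illustration, the control flow of such a user space emulator looks roughly 
like this (hypothetical ioctl and structure names, not the real scst_user 
interface; the point is only the single combined "reply + get next command" 
switch):

#include <sys/ioctl.h>

#define EMU_IOC_REPLY_AND_GET	0xC0DE	/* made-up ioctl number */

struct emu_cmd {
	unsigned int  cmd_id;
	unsigned char cdb[16];
	void         *buf;
	unsigned int  bufflen;
	int           status;	/* filled in by the emulator before the next call */
};

static int emulator_loop(int dev_fd)
{
	struct emu_cmd cmd = { 0 };

	for (;;) {
		/* one syscall per command: deliver the previous reply and
		 * receive the next command from the kernel */
		if (ioctl(dev_fd, EMU_IOC_REPLY_AND_GET, &cmd) < 0)
			return -1;

		/* emulate the device here (e.g. a VTL working on a large
		 * mmap'ed area), then set cmd.status for the next iteration */
		cmd.status = 0;
	}
}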


Also, I should note here that in its current state STGT in many aspects 
doesn't fully conform to the SCSI specifications, especially in the area of 
management events, like Unit Attention generation and processing, and it 
doesn't look like anybody cares about that. At the same time, SCST pays close 
attention to fully conforming to the SCSI specifications, because the price of 
non-conformance is possible corruption of the user's data.


Returning to performance: modern SCSI transports, e.g. InfiniBand, have a link 
latency as low as 1(!) microsecond. For comparison, the inter-thread context 
switch time on a modern system is about the same, syscall time

Re: Integration of SCST in the mainstream Linux kernel

2008-02-04 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Mon, 2008-02-04 at 20:56 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


On Mon, 2008-02-04 at 20:16 +0300, Vladislav Bolkhovitin wrote:



James Bottomley wrote:


So, James, what is your opinion on the above? Or the overall SCSI target 
project simplicity doesn't matter much for you and you think it's fine 
to duplicate Linux page cache in the user space to keep the in-kernel 
part of the project as small as possible?



The answers were pretty much contained here

http://marc.info/?l=linux-scsi&m=120164008302435

and here:

http://marc.info/?l=linux-scsi&m=120171067107293

Weren't they?


No, sorry, it doesn't look that way to me. They are about performance, but 
I'm asking about the overall project's architecture, namely about one part of 
it: simplicity. In particular, what do you think about duplicating the Linux 
page cache in user space to get zero-copy cached I/O? Or can you suggest 
another architectural solution for that problem within STGT's approach?



Isn't that an advantage of a user space solution?  It simply uses the
backing store of whatever device supplies the data.  That means it takes
advantage of the existing mechanisms for caching.


No, please reread this thread, especially this message: 
http://marc.info/?l=linux-kernel&m=120169189504361&w=2. This is one of the 
advantages of the kernel space implementation. The user space implementation 
has to copy data between the cache and a user space buffer, while the kernel 
space one can use the pages in the cache directly, without an extra copy.



Well, you've said it thrice (the bellman cried) but that doesn't make it
true.

The way a user space solution should work is to schedule mmapped I/O
from the backing store and then send this mmapped region off for target
I/O.  For reads, the page gather will ensure that the pages are up to
date from the backing store to the cache before sending the I/O out.
For writes, You actually have to do a msync on the region to get the
data secured to the backing store. 


James, have you checked how fast mmapped I/O is if work size > size of RAM? 
It's several times slower compared to buffered I/O. It was discussed many 
times on LKML and, it seems, the VM people consider it unavoidable. 



Erm, but if you're using the case of work size > size of RAM, you'll
find buffered I/O won't help because you don't have the memory for
buffers either.


James, just check and you will see, buffered I/O is a lot faster.

So, using mmapped I/O isn't an option for high performance. Plus, mmapped I/O 
isn't an option where there are high reliability requirements, since it 
doesn't provide a practical way to handle I/O errors.


I think you'll find it does ... the page gather returns -EFAULT if
there's an I/O error in the gathered region. 


Err, returned to whom? If you try to read from an mmapped page which can't be 
populated due to an I/O error, you will get SIGBUS or SIGSEGV, I don't 
remember exactly. It's quite tricky to get back to the faulted command from 
the signal handler.


Or do you mean mmap(MAP_POPULATE)/munmap() for each command? Do you 
think that such mapping/unmapping is good for performance?



msync does something
similar if there's a write failure.


You also have to pull tricks with
the mmap region in the case of writes to prevent useless data being read
in from the backing store.


Can you be more exact and specify what kind of tricks should be done for 
that?


Actually, just avoid touching it seems to do the trick with a recent
kernel.


Hmm, how can one write to an mmapped page and not touch it?


James





-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-01 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Jan 31, 2008 5:25 PM, Joe Landman [EMAIL PROTECTED] wrote:


Vladislav Bolkhovitin wrote:


Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or slower
with increasing number of threads?), it's a good stress test tool, but
not more.


Unfortunately, I agree.  Bonnie++, dd tests, and a few others seem to
bear far closer to real world tests than disktest and iozone, the
latter of which does more to test the speed of RAM cache and system call
performance than actual IO.



I have run some tests with Bonnie++, but found out that on a fast
network like IB the filesystem used for the test has a really big
impact on the test results.

If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let me know.


I would suggest you try something from real life, like:

 - Copying a large file tree over a single or multiple IB links

 - Measuring some DB engine's TPC result

 - etc.


Bart Van Assche.




-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-01 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin wrote:

Bart Van Assche wrote:

On Jan 31, 2008 5:25 PM, Joe Landman [EMAIL PROTECTED] 
wrote:



Vladislav Bolkhovitin wrote:


Actually, I don't know what kind of conclusions it is possible to make
from disktest's results (maybe only how throughput gets bigger or 
slower

with increasing number of threads?), it's a good stress test tool, but
not more.



Unfortunately, I agree.  Bonnie++, dd tests, and a few others seem to
bear far closer to real world tests than disktest and iozone, the
latter of which does more to test the speed of RAM cache and system call
performance than actual IO.




I have ran some tests with Bonnie++, but found out that on a fast
network like IB the filesystem used for the test has a really big
impact on the test results.

If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.



I would suggest you to try something from real life, like:

 - Copying large file tree over a single or multiple IB links

 - Measure of some DB engine's TPC

 - etc.


I forgot to mention: during those tests make sure that the devices imported 
from both SCST and STGT report the same write cache and FUA capabilities in 
the kernel log, since these significantly affect the initiator's behavior. 
Like:


sd 4:0:0:5: [sdf] Write cache: enabled, read cache: enabled, supports 
DPO and FUA


For SCST the fastest mode is NV_CACHE; refer to its README file for details.


Bart Van Assche.




-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Scst-devel] Integration of SCST in the mainstream Linux kernel

2008-02-01 Thread Vladislav Bolkhovitin

David Dillow wrote:

On Thu, 2008-01-31 at 18:08 +0100, Bart Van Assche wrote:


If anyone has a suggestion for a better test than dd to compare the
performance of SCSI storage protocols, please let it know.



xdd on /dev/sda, sdb, etc. using -dio to do direct IO seems to work
decently, though it is hard (ie, impossible) to get a repeatable
sequence of IO when using higher queue depths, as it uses threads to
generate multiple requests.


This utility seems to be a good one, but it's basically the same as 
disktest, although much more advanced.



You may also look at sgpdd_survey from Lustre's iokit, but I've not done
much with that -- it uses the sg devices to send lowlevel SCSI commands.


Yes, it might be worth trying. Since fundamentally it's the same as O_DIRECT 
dd, but with a bit less overhead on the initiator side (hence less 
initiator-side latency), most likely it will show an even bigger difference 
than dd does.



I've been playing around with some benchmark code using libaio, but it's
not in generally usable shape.

xdd:
http://www.ioperformance.com/products.htm

Lustre IO Kit:
http://manual.lustre.org/manual/LustreManual16_HTML/DynamicHTML-20-1.html


-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-01-31 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Jan 31, 2008 2:25 PM, Nicholas A. Bellinger [EMAIL PROTECTED] wrote:


Since this particular code is located in a non-data path critical
section, the kernel vs. user discussion is a wash.  If we are talking
about data path, yes, the relevance of DD tests in kernel designs are
suspect :p.  For those IB testers who are interested, perhaps having a
look with disktest from the Linux Test Project would give a better
comparison between the two implementations on an RDMA capable fabric
like IB for best case performance.  I think everyone is interested in
seeing just how much data path overhead exists between userspace and
kernel space in typical and heavy workloads, and if this overhead can be
minimized to make userspace a better option for some of this very
complex code.


I can run disktest on the same setups I ran dd on. This will take some
time however.


Disktest was already referenced at the beginning of the performance 
comparison thread, but its results are not very interesting if we want to find 
out which implementation is more effective, because in the modes in which 
people usually run this utility it produces a latency-insensitive workload 
(multiple threads working in parallel). So, such multithreaded disktest 
results will differ between STGT and SCST only if STGT's implementation 
becomes CPU bound on the target. If the CPU on the target is powerful enough, 
even extra busy loops in the STGT or SCST hot path code will change nothing.


Additionally, multithreaded disktest over a RAM disk is a good example of a 
synthetic benchmark which has almost no relation to real-life workloads. But 
people like it, because it produces nice-looking results.


Actually, I don't know what kind of conclusions can be drawn from disktest's 
results (maybe only how throughput scales with an increasing number of 
threads?); it's a good stress test tool, but not more.



Disktest is new to me -- any hints with regard to suitable
combinations of command line parameters are welcome. The most recent
version I could find on http://ltp.sourceforge.net/ is ltp-20071231.

Bart Van Assche.



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-01-30 Thread Vladislav Bolkhovitin

James Bottomley wrote:

The two target architectures perform essentially identical functions, so
there's only really room for one in the kernel.  Right at the moment,
it's STGT.  Problems in STGT come from the user-kernel boundary which
can be mitigated in a variety of ways.  The fact that the figures are
pretty much comparable on non IB networks shows this.

I really need a whole lot more evidence than at worst a 20% performance
difference on IB to pull one implementation out and replace it with
another.  Particularly as there's no real evidence that STGT can't be
tweaked to recover the 20% even on IB.


James,

Although the performance difference between STGT and SCST is apparent, this 
isn't the only reason why SCST is better. I've already written about it many 
times on various mailing lists, but let me summarize it one more time here.


As you know, almost all kernel parts can be done in user space, including all 
the drivers, networking, and I/O management with the block/SCSI initiator 
subsystem and the disk cache manager. But does that mean the current Linux 
kernel is bad and all of the above should be (re)done in user space instead? I 
believe not. Linux isn't a microkernel for very pragmatic reasons: simplicity 
and performance. So an additional important point in SCST's favor is 
simplicity.


For a SCSI target, especially with a hardware target card, data come from the 
kernel and are eventually served by the kernel, which does the actual I/O or 
gets/puts data from/to the cache. Dividing request processing between user and 
kernel space creates unnecessary interface layer(s) and effectively turns 
request processing into a distributed job, with all its complexity and 
reliability problems. From my point of view, such a division, where user space 
is the master side and the kernel is the slave, is rather wrong, because:


1. It makes the kernel depend on a user program, which services it and 
provides routines for it, while the regular paradigm is the opposite: the 
kernel services user space applications. A direct consequence is that there is 
no real protection for the kernel from faults in the STGT core code without 
excessive effort, which, no surprise, hasn't been done so far and, it seems, 
is never going to be done. So, in practice, debugging and developing under 
STGT isn't easier than if the whole code were in kernel space, but actually 
harder (see below why).


2. It requires a new, complicated interface between kernel and user space that 
creates additional maintenance and debugging headaches, which don't exist for 
kernel-only code. Linus Torvalds some time ago perfectly described why this is 
bad, see http://lkml.org/lkml/2007/4/24/451, 
http://lkml.org/lkml/2006/7/1/41 and http://lkml.org/lkml/2007/4/24/364.


3. It makes it impossible for the SCSI target to use (at least in a simple and 
sane way) many effective optimizations: zero-copy cached I/O, more control 
over read-ahead, device queue plugging/unplugging, etc. One example of such a 
feature, already implemented, is zero-copy network data transmission, done in 
a simple 260-line put_page_callback patch. This optimization is especially 
important for the user space gate (the scst_user module), see below for 
details.


The whole notion that development for the kernel is harder than for user space 
is total nonsense nowadays. It's different, yes; in some ways more limited, 
yes; but not harder. For those who need gdb (I haven't for many years), the 
kernel has kgdb, plus many debug facilities that are either not available in 
user space or are more limited there, like lockdep, lockup detection, 
oprofile, etc. (not to mention a wider choice of more effectively implemented 
synchronization primitives, and more).


For people who need complicated target device emulation, like, e.g., in the 
case of a VTL (Virtual Tape Library), where there is a need to operate on 
large mmap'ed memory areas, SCST provides a gateway to user space (the 
scst_user module), but, in contrast with STGT, it's done in the regular 
kernel-is-master, user-application-is-slave paradigm, so it's reliable and no 
fault in a user space device emulator can break the kernel or other user space 
applications. Plus, since the SCSI target state machine and memory management 
are in the kernel, it's very efficient and needs only one kernel-user space 
switch per SCSI command.


Also, I should note here that in its current state STGT in many aspects 
doesn't fully conform to the SCSI specifications, especially in the area of 
management events, like Unit Attention generation and processing, and it 
doesn't look like anybody cares about that. At the same time, SCST pays close 
attention to fully conforming to the SCSI specifications, because the price of 
non-conformance is possible corruption of the user's data.


Returning to performance: modern SCSI transports, e.g. InfiniBand, have a link 
latency as low as 1(!) microsecond. For comparison, the inter-thread context 
switch time on a modern system is about the same, syscall time - about 0.1 
microsecond. So, 

Re: Integration of SCST in the mainstream Linux kernel

2008-01-30 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

On Tue, 29 Jan 2008 13:31:52 -0800
Roland Dreier [EMAIL PROTECTED] wrote:



 .                          .  STGT read     SCST read     .  STGT read     SCST read     .
 .                          .  performance   performance   .  performance   performance   .
 .                          .  (0.5K, MB/s)  (0.5K, MB/s)  .  (1 MB, MB/s)  (1 MB, MB/s)  .
 . iSER (8 Gb/s network)    .  250           N/A           .  360           N/A           .
 . SRP  (8 Gb/s network)    .  N/A           421           .  N/A           683           .

 On the comparable figures, which only seem to be IPoIB they're showing a
 13-18% variance, aren't they?  Which isn't an incredible difference.

Maybe I'm all wet, but I think iSER vs. SRP should be roughly
comparable.  The exact formatting of various messages etc. is
different but the data path using RDMA is pretty much identical.  So
the big difference between STGT iSER and SCST SRP hints at some big
difference in the efficiency of the two implementations.



iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?


Anyway, here's the results from Robin Humble:

iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd,
initiator end booted with mem=512M, target with 8G ram

 direct i/o dd
  write/read  800/751 MB/s
dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct

http://www.mail-archive.com/linux-scsi@vger.kernel.org/msg13502.html

I think that STGT is pretty fast with the fast backing storage. 


How fast will SCST be on the same hardware?


I don't think that there is a notable performance difference between
kernel-space and user-space SRP (or iSER) implementations in moving
data between hosts. IB is expected to enable user-space applications
to move data between hosts quickly (if not, what can IB provide us?).

I think that the question is how fast user-space applications can do
I/Os compared with I/Os in kernel space. STGT is eager for the advent
of good asynchronous I/O and event notification interfaces.

One more possible optimization for STGT is zero-copy data
transfer. STGT uses pre-registered buffers and moves data between the page
cache and these buffers, and then does the RDMA transfer. If we implement
our own caching mechanism to use pre-registered buffers directly (with AIO
and O_DIRECT), then STGT can move data without data copies.


Great! So, you are going to duplicate the Linux page cache in user space. You 
will keep the in-kernel code as small as possible and its maintainership 
effort as low as possible, at the cost that the user space part's code size 
and complexity (and, hence, its maintainership effort) will skyrocket. This 
doesn't look like a good design decision to me.






-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Integration of SCST in the mainstream Linux kernel

2008-01-30 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

On Wed, 30 Jan 2008 09:38:04 +0100
Bart Van Assche [EMAIL PROTECTED] wrote:



On Jan 30, 2008 12:32 AM, FUJITA Tomonori [EMAIL PROTECTED] wrote:


iSER has parameters to limit the maximum size of RDMA (it needs to
repeat RDMA with a poor configuration)?


Please specify which parameters you are referring to. As you know I



Sorry, I can't say. I don't know much about iSER. But it seems that Pete
and Robin can get a better I/O performance to line speed ratio with
STGT.

The version of OpenIB might matter too. For example, Pete said that
STGT reads lose about 100 MB/s for some transfer sizes due to the
OpenIB version difference or other unclear reasons.

http://article.gmane.org/gmane.linux.iscsi.tgt.devel/135

It's fair to say that it takes a long time and needs lots of knowledge to
get the maximum performance out of a SAN, I think.

I think that it would be easier to convince James with the detailed
analysis (e.g. where does it take so long, like Pete did), not just
'dd' performance results.

Pushing iSCSI target code into mainline has failed four times: IET, SCST,
STGT (which did I/O in the kernel in the past), and PyX's one (*1). iSCSI
target code is huge. You said SCST comprises 14,000 lines, but that's
not iSCSI target code. The SCSI engine code comprises 14,000
lines. You need another 10,000 lines for the iSCSI driver. Note that
SCST's iSCSI driver provides only basic iSCSI features. PyX's iSCSI
target code implements more iSCSI features (like MC/S, ERL2, etc.),
comprises about 60,000 lines, and still lacks some features like
iSER, bidi, etc.

I think it's reasonable to say that we need more than 'dd'
results before pushing possibly more than 60,000 lines into
mainline.


Tomo, please stop counting only in-kernel lines (see 
http://lkml.org/lkml/2007/4/24/364). The overall number of project lines for 
the same feature set is a lot more important.



(*1) http://linux-iscsi.org/



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance of SCST versus STGT

2008-01-24 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Jan 24, 2008 8:06 AM, Robin Humble [EMAIL PROTECTED] wrote:


On Tue, Jan 22, 2008 at 01:32:08PM +0100, Bart Van Assche wrote:


.............................................................................................
.                           .  STGT read     SCST read     .  STGT read     SCST read     .
.                           .  performance   performance   .  performance   performance   .
.                           .  (0.5K, MB/s)  (0.5K, MB/s)  .  (1 MB, MB/s)  (1 MB, MB/s)  .
.............................................................................................
. Ethernet (1 Gb/s network) .   77            78           .   77            89           .
. IPoIB    (8 Gb/s network) .  163           185           .  201           239           .
. iSER     (8 Gb/s network) .  250           N/A           .  360           N/A           .
. SRP      (8 Gb/s network) .  N/A           421           .  N/A           683           .



how are write speeds with SCST SRP?
for some kernels and tests tgt writes at 2x the read speed.


Robin,

There is a fundamental difference between regular dd-like reads and writes: 
reads are synchronous, i.e. latency sensitive, while writes are asynchronous, 
i.e. latency insensitive. You should use O_DIRECT dd writes for a fair 
comparison.


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Stgt-devel] Performance of SCST versus STGT

2008-01-24 Thread Vladislav Bolkhovitin

Robin Humble wrote:

On Thu, Jan 24, 2008 at 11:36:45AM +0100, Bart Van Assche wrote:


On Jan 24, 2008 8:06 AM, Robin Humble [EMAIL PROTECTED] wrote:


On Tue, Jan 22, 2008 at 01:32:08PM +0100, Bart Van Assche wrote:


.............................................................................................
.                           .  STGT read     SCST read     .  STGT read     SCST read     .
.                           .  performance   performance   .  performance   performance   .
.                           .  (0.5K, MB/s)  (0.5K, MB/s)  .  (1 MB, MB/s)  (1 MB, MB/s)  .
.............................................................................................
. Ethernet (1 Gb/s network) .   77            78           .   77            89           .
. IPoIB    (8 Gb/s network) .  163           185           .  201           239           .
. iSER     (8 Gb/s network) .  250           N/A           .  360           N/A           .
. SRP      (8 Gb/s network) .  N/A           421           .  N/A           683           .



how are write speeds with SCST SRP?
for some kernels and tests tgt writes at 2x the read speed.

also I see much higher speeds than what you report in my DDR 4x IB tgt
testing... which could be taken as inferring that tgt is scaling quite
nicely on the faster fabric?
 ib_write_bw of 1473 MB/s
 ib_read_bw  of 1378 MB/s

iSER to 7G ramfs, x86_64, centos4.6, 2.6.22 kernels, git tgtd,
initiator end booted with mem=512M, target with 8G ram

direct i/o dd
 write/read  800/751 MB/s
   dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
   dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct

buffered i/o dd
 write/read 1109/350 MB/s
   dd if=/dev/zero of=/dev/sdc bs=1M count=5000
   dd of=/dev/null if=/dev/sdc bs=1M count=5000

buffered i/o lmdd
write/read  682/438 MB/s
  lmdd if=internal of=/dev/sdc bs=1M count=5000
  lmdd of=internal if=/dev/sdc bs=1M count=5000




The tests I performed were read performance tests with dd and with
buffered I/O. For this test you obtained 350 MB/s with STGT on a DDR



... and 1.1GB/s writes :)
presumably because buffer aggregation works well.



4x InfiniBand network, while I obtained 360 MB/s on a SDR 4x
InfiniBand network. I don't think that we can call this scaling up
...



the direct i/o read speed being twice the buffered i/o speed would seem
to imply that Linux's page cache is being slow and confused with this
particular set of kernel + OS + OFED versions.
I doubt that this result actually says that much about tgt really.


Buffered dd read is, actually, one of the best benchmarks if you want to 
compare STGT vs. SCST, because it's single threaded with one outstanding 
command most of the time, i.e. it's a latency-bound workload. Plus, most 
applications reading files do exactly what dd does.


Both SCST and STGT suffer equally from possible problems on the initiator, 
but SCST bears them much better, because it has much lower processing latency 
(e.g., because there are no extra user-kernel space switches and other 
related overhead).



Regarding write performance: the write tests were performed with a
real target (three disks in RAID-0, write bandwidth about 100 MB/s). I



I'd be interested to see ramdisk writes.

cheers,
robin



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance of SCST versus STGT

2008-01-24 Thread Vladislav Bolkhovitin

Robin Humble wrote:

On Thu, Jan 24, 2008 at 02:10:06PM +0300, Vladislav Bolkhovitin wrote:


On Jan 24, 2008 8:06 AM, Robin Humble [EMAIL PROTECTED] wrote:


how are write speeds with SCST SRP?
for some kernels and tests tgt writes at 2x the read speed.


There is a fundamental difference between regular dd-like reads and writes: 
reads are sync, i.e. latency sensitive, but writes are async, i.e. latency 
insensitive. You should use O_DIRECT dd writes for the fair comparison.


I agree, although the vast majority of applications don't use O_DIRECT.


Sorry, it isn't about O_DIRECT usage. It's about whether the workload is 
latency bound or not.



anyway, the direct i/o results were in the email:

  direct i/o dd
   write/read  800/751 MB/s
 dd if=/dev/zero of=/dev/sdc bs=1M count=5000 oflag=direct
 dd of=/dev/null if=/dev/sdc bs=1M count=5000 iflag=direct

I couldn't find a direct i/o option for lmdd.

cheers,
robin



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance of SCST versus STGT

2008-01-24 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Jan 24, 2008 8:06 AM, Robin Humble [EMAIL PROTECTED] wrote:


On Tue, Jan 22, 2008 at 01:32:08PM +0100, Bart Van Assche wrote:


.............................................................................................
.                           .  STGT read     SCST read     .  STGT read     SCST read     .
.                           .  performance   performance   .  performance   performance   .
.                           .  (0.5K, MB/s)  (0.5K, MB/s)  .  (1 MB, MB/s)  (1 MB, MB/s)  .
.............................................................................................
. Ethernet (1 Gb/s network) .   77            78           .   77            89           .
. IPoIB    (8 Gb/s network) .  163           185           .  201           239           .
. iSER     (8 Gb/s network) .  250           N/A           .  360           N/A           .
. SRP      (8 Gb/s network) .  N/A           421           .  N/A           683           .





Results with /dev/ram0 configured as backing store on the target (buffered I/O):
                 Read           Write          Read           Write
                 performance    performance    performance    performance
                 (0.5K, MB/s)   (0.5K, MB/s)   (1 MB, MB/s)   (1 MB, MB/s)
STGT + iSER          250            48             349            781
SCST + SRP           411            66             659            746


Ib_rdma_bw now reports 933 MB/s on the same system, correct? That 
~250 MB/s difference is what you will gain once zero-copy I/O is implemented, 
and what STGT with its current architecture has no chance to achieve.



Results with /dev/ram0 configured as backing store on the target (direct I/O):
                 Read           Write          Read           Write
                 performance    performance    performance    performance
                 (0.5K, MB/s)   (0.5K, MB/s)   (1 MB, MB/s)   (1 MB, MB/s)
STGT + iSER          7.9            9.8            589            647
SCST + SRP          12.3            9.7            811            794

Bart.





Re: Integration of SCST in the mainstream Linux kernel

2008-01-23 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

As you probably know there is a trend in enterprise computing towards
networked storage. This is illustrated by the emergence during the
past few years of standards like SRP (SCSI RDMA Protocol), iSCSI
(Internet SCSI) and iSER (iSCSI Extensions for RDMA). Two different
pieces of software are necessary to make networked storage possible:
initiator software and target software. As far as I know there exist
three different SCSI target implementations for Linux:
- The iSCSI Enterprise Target Daemon (IETD,
http://iscsitarget.sourceforge.net/);
- The Linux SCSI Target Framework (STGT, http://stgt.berlios.de/);
- The Generic SCSI Target Middle Level for Linux project (SCST,
http://scst.sourceforge.net/).
Since I was wondering which SCSI target software would be best suited
for an InfiniBand network, I started evaluating the STGT and SCST SCSI
target implementations. Apparently the performance difference between
STGT and SCST is small on 100 Mbit/s and 1 Gbit/s Ethernet networks,
but the SCST target software outperforms the STGT software on an
InfiniBand network. See also the following thread for the details:
http://sourceforge.net/mailarchive/forum.php?thread_name=e2e108260801170127w2937b2afg9bef324efa945e43%40mail.gmail.comforum_name=scst-devel.

About the design of the SCST software: while one of the goals of the
STGT project was to keep the in-kernel code minimal, the SCST project
implements the whole SCSI target in kernel space. SCST is implemented
as a set of new kernel modules, only minimal changes to the existing
kernel are necessary before the SCST kernel modules can be used. This
is the same approach that will be followed in the very near future in
the OpenSolaris kernel (see also
http://opensolaris.org/os/project/comstar/). More information about
the design of SCST can be found here:
http://scst.sourceforge.net/doc/scst_pg.html.

My impression is that both the STGT and SCST projects are well
designed, well maintained and have a considerable user base. According
to the SCST maintainer (Vladislav Bolkhovitin), SCST is superior to
STGT with respect to features, performance, maturity, stability, and
number of existing target drivers. Unfortunately the SCST kernel code
lives outside the kernel tree, which makes SCST harder to use than
STGT.

As an SCST user, I would like to see the SCST kernel code integrated
in the mainstream kernel because of its excellent performance on an
InfiniBand network. Since the SCST project comprises about 14 KLOC,
reviewing the SCST code will take considerable time. Who will do this
reviewing work ? And with regard to the comments made by the
reviewers: Vladislav, do you have the time to carry out the
modifications requested by the reviewers ? I expect a.o. that
reviewers will ask to move SCST's configuration pseudofiles from
procfs to sysfs.


Sure, I do, although I personally don't see much sense in such a move.


Bart Van Assche.


Re: Performance of SCST versus STGT

2008-01-22 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

The big problem of stgt iSER is disk I/Os (move data between disk and
page cache). We need a proper asynchronous I/O mechanism, however,
Linux doesn't provide such and we use a workaround, which incurs large
latency. I guess, we cannot solve this until syslets is merged into
mainline.


Hmm, SCST also doesn't have the ability to use asynchronous I/O, but that 
doesn't prevent it from showing good performance.


Vlad


Re: Performance of SCST versus STGT

2008-01-22 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Jan 17, 2008 6:45 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:


There's nothing particularly stunning here.  Suspect Bart has
configuration issues if not even IPoIB will do > 100 MB/s.



By this time I found out that the BIOS of the test systems (Intel
Server Board S5000PAL) set the PCI-e parameter MaxReadReq to 128
bytes, which explains the low InfiniBand performance. After changing
this parameter to 4096 bytes the InfiniBand throughput was as
expected: ib_rdma_bw now reports a
bandwidth of 933 MB/s.
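
For reference, both things can be checked with generic tools (the PCI address 
and hostname below are just examples, not the exact ones from this setup):

  # PCIe device control settings of the HCA, including MaxReadReq
  lspci -vv -s 0a:00.0 | grep -E 'MaxPayload|MaxReadReq'

  # raw RDMA bandwidth check with the perftest tool mentioned above
  ib_rdma_bw              # on the target
  ib_rdma_bw target-host  # on the initiator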


What are the new SRPT/iSER numbers?


Bart.





Re: Performance of SCST versus STGT

2008-01-22 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

On Tue, 22 Jan 2008 14:33:13 +0300
Vladislav Bolkhovitin [EMAIL PROTECTED] wrote:



FUJITA Tomonori wrote:


The big problem of stgt iSER is disk I/Os (move data between disk and
page cache). We need a proper asynchronous I/O mechanism, however,
Linux doesn't provide such and we use a workaround, which incurs large
latency. I guess, we cannot solve this until syslets is merged into
mainline.


Hmm, SCST also doesn't have ability to use asynchronous I/O, but that 
doesn't prevent it from showing good performance.



I don't know how SCST performs I/Os, but surely, in kernel space, you
can perform I/Os asynchronously.


Sure, but currently it is all synchronous.


Or you use an event notification
mechanism with multiple kernel threads performing I/Os synchronously.

Xen blktap has the same problem as stgt. IIRC, Xen mainline uses a
kernel patch to add a proper event notification to AIO though redhat
uses the same workaround as stgt instead of applying the kernel patch.


Re: Performance of SCST versus STGT

2008-01-22 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:
On Jan 22, 2008 12:33 PM, Vladislav Bolkhovitin [EMAIL PROTECTED] 
mailto:[EMAIL PROTECTED] wrote:


 What are the new SRPT/iSER numbers?


You can find the new performance numbers below. These are all numbers 
for reading from the remote buffer cache, no actual disk reads were 
performed. The read tests have been performed with dd, both for a block 
size of 512 bytes and of 1 MB. The tests with a small block size say 
more about latency, while the tests with a large block size say more 
about the maximum possible throughput.


If you want to compare the performance of 512-byte vs 1 MB blocks, your 
experiment isn't fully correct. You should use dd's iflag=direct option 
for that.
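Something like the following, with /dev/sde standing in for the remote LUN:

  # latency-dominated case: small blocks, initiator cache bypassed
  dd if=/dev/sde of=/dev/null bs=512 count=100000 iflag=direct

  # throughput-dominated case: large blocks, initiator cache bypassed
  dd if=/dev/sde of=/dev/null bs=1M count=2000 iflag=direct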


                            STGT read      SCST read       STGT read      SCST read
                            performance    performance     performance    performance
                            (0.5K, MB/s)   (0.5K, MB/s)    (1 MB, MB/s)   (1 MB, MB/s)
Ethernet (1 Gb/s network)        77             78              77             89
IPoIB    (8 Gb/s network)       163            185             201            239
iSER     (8 Gb/s network)       250            N/A             360            N/A
SRP      (8 Gb/s network)       N/A            421             N/A            683

My conclusion from the above numbers: the performance difference between 
STGT and SCST is small for a Gigabit Ethernet network. The faster the 
network technology, the larger the difference between SCST and STGT.


This is what I expected


Bart.




Re: Performance of SCST versus STGT

2008-01-21 Thread Vladislav Bolkhovitin

Bart Van Assche wrote:

On Jan 18, 2008 1:08 PM, Vladislav Bolkhovitin [EMAIL PROTECTED] wrote:


[ ... ]
So, seems I understood your slides correctly: the more valuable data for
our SCST SRP vs STGT iSER comparison should be on page 26 for 1 command
read (~480MB/s, i.e. ~60% from Bart's result on the equivalent hardware).



At least in my tests SCST performed significantly better than STGT.
These tests were performed with the currently available
implementations of SCST and STGT. Which performance improvements are
possible for these projects (e.g. zero-copying), and by how much is it
expected that these performance improvements will increase throughput
and will decrease latency ?


Sure, zero-copy cache support is quite possible for SCST and hopefully 
will be available soon. The performance (throughput) improvement will 
depend on the hardware used and the data access pattern, but an upper-bound 
estimate can be made knowing the memory copy throughput of your system 
(1.6 GB/s according to your measurements). For a 10 Gbps link with 0.9 GB/s 
wire speed it should be up to 30%; for a 20 Gbps link with 1.5 GB/s wire speed 
(the PCI-E 8x limitation) - something up to 70-80%.


Vlad


Re: Performance of SCST versus STGT

2008-01-18 Thread Vladislav Bolkhovitin

Pete Wyckoff wrote:

I have performed a test to compare the performance of SCST and STGT.
Apparently the SCST target implementation performed far better than
the STGT target implementation. This makes me wonder whether this is
due to the design of SCST or whether STGT's performance can be
improved to the level of SCST ?

Test performed: read 2 GB of data in blocks of 1 MB from a target (hot
cache -- no disk reads were performed, all reads were from the cache).
Test command: time dd if=/dev/sde of=/dev/null bs=1M count=2000

                            STGT read            SCST read
                            performance (MB/s)   performance (MB/s)
Ethernet (1 Gb/s network)        77                   89
IPoIB    (8 Gb/s network)        82                  229
SRP      (8 Gb/s network)       N/A                  600
iSER     (8 Gb/s network)        80                  N/A

These results show that SCST uses the InfiniBand network very well
(effectivity of about 88% via SRP), but that the current STGT version
is unable to transfer data faster than 82 MB/s. Does this mean that
there is a severe bottleneck  present in the current STGT
implementation ?



I don't know about the details but Pete said that he can achieve more
than 900MB/s read performance with tgt iSER target using ramdisk.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg4.html


Please don't confuse a multithreaded, latency-insensitive workload with 
a single-threaded, hence latency-sensitive, one.


Seems that he can get good performance with single threaded workload:

http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf

But I don't know about the details so let's wait for Pete to comment
on this.


Page 16 is pretty straight forward.  One command outstanding from
the client.  It is an OSD read command.  Data on tmpfs. 


Hmm, I wouldn't say it's pretty straightforward. It has data for 
InfiniBand and it's unclear whether it's using iSER or some IB performance 
test tool. I would rather interpret those data as being for IB, not iSER.



500 MB/s is
pretty easy to get on IB.

The other graph on page 23 is for block commands.  600 MB/s ish.
Still single command; so essentially a latency test.  Dominated by
the memcpy time from tmpfs to pinned IB buffer, as per page 24.

Erez said:



We didn't run any real performance test with tgt, so I don't have
numbers yet. I know that Pete got ~900 MB/sec by hacking sgp_dd, so all
data was read/written to the same block (so it was all done in the
cache). Pete - am I right?


Yes (actually just 1 thread in sg_dd).  This is obviously cheating.
Take the pread time to zero in SCSI Read analysis on page 24 to show
max theoretical.  It's IB theoretical minus some initiator and stgt
overheads.


Yes, that's obviously cheating and its result can't be compared with 
what Bart had. The full data footprint on the target fits in the CPU cache, 
so you effectively got results for NULLIO (in SCST terms).


So, it seems I understood your slides correctly: the more valuable data for 
our SCST SRP vs STGT iSER comparison should be on page 26 for the 1-command 
read (~480 MB/s, i.e. ~60% of Bart's result on equivalent hardware).



The other way to get more read throughput is to throw multiple
simultaneous commands at the server.

There's nothing particularly stunning here.  Suspect Bart has
configuration issues if not even IPoIB will do > 100 MB/s.

-- Pete






Re: Performance of SCST versus STGT

2008-01-17 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

On Thu, 17 Jan 2008 10:27:08 +0100
Bart Van Assche [EMAIL PROTECTED] wrote:



Hello,

I have performed a test to compare the performance of SCST and STGT.
Apparently the SCST target implementation performed far better than
the STGT target implementation. This makes me wonder whether this is
due to the design of SCST or whether STGT's performance can be
improved to the level of SCST ?

Test performed: read 2 GB of data in blocks of 1 MB from a target (hot
cache -- no disk reads were performed, all reads were from the cache).
Test command: time dd if=/dev/sde of=/dev/null bs=1M count=2000

                            STGT read            SCST read
                            performance (MB/s)   performance (MB/s)
Ethernet (1 Gb/s network)        77                   89
IPoIB    (8 Gb/s network)        82                  229
SRP      (8 Gb/s network)       N/A                  600
iSER     (8 Gb/s network)        80                  N/A

These results show that SCST uses the InfiniBand network very well
(effectivity of about 88% via SRP), but that the current STGT version
is unable to transfer data faster than 82 MB/s. Does this mean that
there is a severe bottleneck  present in the current STGT
implementation ?



I don't know about the details but Pete said that he can achieve more
than 900MB/s read performance with tgt iSER target using ramdisk.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg4.html


Please don't confuse a multithreaded, latency-insensitive workload with 
a single-threaded, hence latency-sensitive, one.





Re: Performance of SCST versus STGT

2008-01-17 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

On Thu, 17 Jan 2008 12:48:28 +0300
Vladislav Bolkhovitin [EMAIL PROTECTED] wrote:



FUJITA Tomonori wrote:


On Thu, 17 Jan 2008 10:27:08 +0100
Bart Van Assche [EMAIL PROTECTED] wrote:




Hello,

I have performed a test to compare the performance of SCST and STGT.
Apparently the SCST target implementation performed far better than
the STGT target implementation. This makes me wonder whether this is
due to the design of SCST or whether STGT's performance can be
improved to the level of SCST ?

Test performed: read 2 GB of data in blocks of 1 MB from a target (hot
cache -- no disk reads were performed, all reads were from the cache).
Test command: time dd if=/dev/sde of=/dev/null bs=1M count=2000

                            STGT read            SCST read
                            performance (MB/s)   performance (MB/s)
Ethernet (1 Gb/s network)        77                   89
IPoIB    (8 Gb/s network)        82                  229
SRP      (8 Gb/s network)       N/A                  600
iSER     (8 Gb/s network)        80                  N/A

These results show that SCST uses the InfiniBand network very well
(effectivity of about 88% via SRP), but that the current STGT version
is unable to transfer data faster than 82 MB/s. Does this mean that
there is a severe bottleneck  present in the current STGT
implementation ?



I don't know about the details but Pete said that he can achieve more
than 900MB/s read performance with tgt iSER target using ramdisk.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg4.html


Please don't confuse multithreaded latency insensitive workload with 
single threaded, hence latency sensitive one.



Seems that he can get good performance with single threaded workload:

http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf


Hmm, I can't find which IB hardware he used or its declared Gbps 
speed. He mentioned only Mellanox 4X SDR, switch. What does that mean?



But I don't know about the details so let's wait for Pete to comment
on this.


I added him on CC


Perhaps Voltaire people could comment on the tgt iSER performances.





Re: [Scst-devel] [Stgt-devel] Performance of SCST versus STGT

2008-01-17 Thread Vladislav Bolkhovitin

Robin Humble wrote:

On Thu, Jan 17, 2008 at 01:34:46PM +0300, Vladislav Bolkhovitin wrote:

Hmm, I can't find which IB hardware did he use and it's declared Gbps 
speed. He declared only Mellanox 4X SDR, switch. What does it mean?



SDR is 10Gbit carrier, at most about  ~900MB/s data rate.
DDR is 20Gbit carrier, at most about ~1400MB/s data rate.
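
Back-of-the-envelope derivation of those figures:

  4X SDR: 4 lanes x 2.5 Gbit/s = 10 Gbit/s signalling
          x 8/10 (8b/10b encoding) = 8 Gbit/s = 1 GB/s payload
          => ~900 MB/s after protocol overhead
  4X DDR: 4 lanes x 5 Gbit/s = 20 Gbit/s signalling
          x 8/10 = 16 Gbit/s = 2 GB/s payload
          => ~1400 MB/s in practice, limited by PCIe and protocol overhead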


Thanks. Then the single-threaded rate with one outstanding command, 
SCST SRP on an 8 Gbps link vs STGT iSER on a 10 Gbps link (according to 
that paper), is 600 MB/s vs ~480 MB/s (page 26). The SCST-based target is 
still about 60% faster.



On Thu, 17 Jan 2008 10:27:08 +0100 Bart Van Assche [EMAIL PROTECTED] wrote:

Test performed: read 2 GB of data in blocks of 1 MB from a target (hot   
cache -- no disk reads were performed, all reads were from the cache).   
Test command: time dd if=/dev/sde of=/dev/null bs=1M count=2000  

                            STGT read            SCST read
                            performance (MB/s)   performance (MB/s)
Ethernet (1 Gb/s network)        77                   89
IPoIB    (8 Gb/s network)        82                  229
SRP      (8 Gb/s network)       N/A                  600
iSER     (8 Gb/s network)        80                  N/A



it kinda looks to me like the tgt iSER tests were waaay too slow to be
using RDMA :-/
I use tgt to get 500MB/s writes over iSER DDR IB to real files (not
ramdisk). Reads are a little slower, but that changes a bit with distro
vs. mainline kernels.



Re: Performance of SCST versus STGT

2008-01-17 Thread Vladislav Bolkhovitin

Erez Zilber wrote:

FUJITA Tomonori wrote:


On Thu, 17 Jan 2008 12:48:28 +0300
Vladislav Bolkhovitin [EMAIL PROTECTED] wrote:

 


FUJITA Tomonori wrote:
   


On Thu, 17 Jan 2008 10:27:08 +0100
Bart Van Assche [EMAIL PROTECTED] wrote:


 


Hello,

I have performed a test to compare the performance of SCST and STGT.
Apparently the SCST target implementation performed far better than
the STGT target implementation. This makes me wonder whether this is
due to the design of SCST or whether STGT's performance can be
improved to the level of SCST ?

Test performed: read 2 GB of data in blocks of 1 MB from a target (hot
cache -- no disk reads were performed, all reads were from the cache).
Test command: time dd if=/dev/sde of=/dev/null bs=1M count=2000

                            STGT read            SCST read
                            performance (MB/s)   performance (MB/s)
Ethernet (1 Gb/s network)        77                   89
IPoIB    (8 Gb/s network)        82                  229
SRP      (8 Gb/s network)       N/A                  600
iSER     (8 Gb/s network)        80                  N/A

These results show that SCST uses the InfiniBand network very well
(effectivity of about 88% via SRP), but that the current STGT version
is unable to transfer data faster than 82 MB/s. Does this mean that
there is a severe bottleneck  present in the current STGT
implementation ?
   


I don't know about the details but Pete said that he can achieve more
than 900MB/s read performance with tgt iSER target using ramdisk.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg4.html
 


Please don't confuse multithreaded latency insensitive workload with 
single threaded, hence latency sensitive one.
   


Seems that he can get good performance with single threaded workload:

http://www.osc.edu/~pw/papers/wyckoff-iser-snapi07-talk.pdf


But I don't know about the details so let's wait for Pete to comment
on this.

Perhaps Voltaire people could comment on the tgt iSER performances.


We didn't run any real performance test with tgt, so I don't have
numbers yet. I know that Pete got ~900 MB/sec by hacking sgp_dd, so all
data was read/written to the same block (so it was all done in the
cache). Pete - am I right?

As already mentioned, he got that with IB SDR cards that are 10 Gb/sec
cards in theory (actual speed is ~900 MB/sec). With DDR cards (20
Gb/sec), you can get even more. I plan to test that in the near future.


Are you writing about the maximum possible speed he got, including 
multithreaded tests with many outstanding commands, or about the speed he 
got on single-threaded reads with one outstanding command? This thread is 
about the latter.



Erez


Re: Open-FCoE on linux-scsi

2008-01-05 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

What's the general opinion on this? Duplicate code vs. more kernel code?
I can see that you're already starting to clean up the code that you
ported. Does that mean the duplicate code isn't an issue to you? When we
fix bugs in the initiator they're not going to make it into your tree
unless you're diligent about watching the list.


It's hard to convince the kernel maintainers to merge something into
mainline that can be implemented in user space. I failed twice
(with two iSCSI target implementations).


Tomonori and the kernel maintainers,

In fact, almost all of the kernel could be done in user space, including 
all the drivers, networking, I/O management with the block/SCSI initiator 
subsystem, and the disk cache manager. But does that mean the current kernel 
is bad and all of the above should be (re)done in user space instead? I 
think not. Linux isn't a microkernel for very pragmatic reasons: 
simplicity and performance.


1. Simplicity.

For a SCSI target, especially with a hardware target card, data come 
from the kernel and are eventually served by the kernel, which does the 
actual I/O or gets/puts data from/to the cache. Dividing the request 
processing job between user and kernel space creates unnecessary interface 
layer(s) and effectively makes request processing distributed, with all the 
complexity and reliability problems that brings. As an example, what 
currently happens in STGT if the user space part suddenly dies? Will the 
kernel part gracefully recover from it? How much effort will be needed to 
implement that?


Another example is the code duplication mentioned above. Is it good? 
What will it bring? Or do you care only about the amount of kernel code 
and not about the overall amount of code? If so, you should 
(re)read what Linus Torvalds thinks about that: 
http://lkml.org/lkml/2007/4/24/364 (I don't consider myself an 
authority on this question).


I agree that some of the processing, where it can be clearly separated, can 
and should be done in user space. A good example of such an approach is 
connection negotiation and management as it's done in open-iscsi. But I 
don't agree that this idea should be taken to the extreme. It might look 
good, but it's impractical; it will only make things more complicated and 
harder to maintain.


2. Performance.

Modern SCSI transports, e.g. InfiniBand, have link latency as low as 
1(!) microsecond. For comparison, the inter-thread context switch time 
on a modern system is about the same, and syscall time is about 0.1 
microsecond. So just ten empty syscalls or one context switch add the 
same latency as the link itself. Even 1 Gbps Ethernet has less than 100 
microseconds of round-trip latency.
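
A quick way to see the latency side of this for yourself (the hostname is a 
placeholder):

  # IP round-trip latency between initiator and target (or the IPoIB address)
  ping -c 100 target-host

  # for raw IB latency the OFED perftest tools, e.g. ib_send_lat, report
  # microsecond-level latencies directly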


You most likely know that the QLogic target driver for SCST allows 
commands to be executed either directly from soft IRQ or from the 
corresponding thread. There is a steady 5% difference in IOPS between 
those modes for 512-byte reads on nullio over a 4 Gbps link. So a single 
additional inter-kernel-thread context switch costs 5% of IOPS.


Another source of latency, unavoidable with the user space approach, is 
the data copy to/from the cache. With a fully kernel-space approach the 
cache can be used directly, so no extra copy is needed.


So, by putting code in user space you have to accept the extra latency 
it adds. Many, if not most, real-life workloads are more or less latency 
bound, not throughput bound, so you shouldn't be surprised that a 
single-stream dd if=/dev/sdX of=/dev/null on the initiator gives low 
values. Such a benchmark is no less important or practical than all the 
multithreaded, latency-insensitive benchmarks people like running.


You may object that the backstorage latency is a lot more than 1 
microsecond, but that is true only if data are read from or written to the 
actual backstorage media, not served from cache (even the backstorage 
device's own cache). Nothing prevents a target from having 8 or even 64 GB 
of cache, so most accesses, even random ones, could be served from it. This 
is especially important for sync writes.


Thus, I believe that a partially user-space, partially kernel-space 
approach to building SCSI targets is a move in the wrong direction, because 
it brings practically nothing but costs a lot.


Vlad


Re: [Bugme-new] [Bug 9405] New: iSCSI does not implement ordering guarantees required by e.g. journaling filesystems

2007-11-21 Thread Vladislav Bolkhovitin

James Bottomley wrote:

if you specifically set TAS=1 you're giving up the right to know what
caused the command termination.  With insufficient information, it's
really unsafe to simply retry, which is why the mid layer just returns
TASK ABORTED as an error.  If you set TAS=0 we'll get a check
condition/unit attention explaining what happened (usually commands
cleared by another initiator) and we'll explicitly do the right thing
based on the sense data.


Actually, having TAS=1 is a considerable advantage over TAS=0 from 
the error recovery point of view. With TAS=1 all aborted commands are 
supposed to be returned immediately to all affected initiators. With 
TAS=0 the affected initiators will not receive any notification about 
aborted commands; only the COMMANDS CLEARED BY ANOTHER INITIATOR UA will 
be established. So they will learn about it only after their commands 
time out.


Thus, with TAS=1 almost immediate error recovery is possible, but with 
TAS=0 error recovery is possible only after a timeout, which for SSC devices 
can be hours.
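
For reference, the TAS bit lives in the control mode page; a generic way to 
inspect it on Linux, assuming sdparm is installed and /dev/sdb is the device 
in question:

  # dump the control mode page; TAS is one of its fields
  sdparm --page=co /dev/sdb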


But having TAS=1 is legal, right? So it should be handled well. If 
TAS=0, TASK ABORTED can't be returned, it would be illegal. So, TASK 
ABORTED status can only be returned with TAS=1.


Driving with your handbrake on is legal too ... that doesn't mean you
should do it ... and it certainly doesn't give you a legitimate
complaint against the manufacturer of your car for excessive brake pad
wear.

We handle TASK ABORTED as well as we can (by failing it).  For better
handling set TAS=0 and we'll handle the individual cases according to
the sense codes.


After some digging in SAM/SPC I've figured out that TASK ABORTED status 
can be returned in exactly the same circumstances as the COMMANDS CLEARED BY 
ANOTHER INITIATOR UA; the TAS bit only selects which form of notification 
is used. So TASK ABORTED status carries the same information as the 
COMMANDS CLEARED BY ANOTHER INITIATOR UA and should be handled the same 
way. I.e., if the affected commands are restarted on COMMANDS CLEARED BY 
ANOTHER INITIATOR UA, they should be restarted on TASK ABORTED status as 
well.


Vlad


Re: [Bugme-new] [Bug 9405] New: iSCSI does not implement ordering guarantees required by e.g. journaling filesystems

2007-11-20 Thread Vladislav Bolkhovitin

James Bottomley wrote:

On Tue, 2007-11-20 at 19:15 +0300, Vladislav Bolkhovitin wrote:


James Bottomley wrote:


I'm not sure your conclusions necessarily follow your data.  What was
the reason for the TASK ABORTED (I'd guess QErr settings, right)?


It came from my curiosity during tests of SCST (http://scst.sf.net), 
when it was working with several initiators over different transports 
against the same set of devices, each of them with the TAS bit set in the 
control mode page. According to SAM, in this case TASK ABORTED status can 
be returned at any time, similarly to QUEUE FULL, i.e. IMHO such a command 
should simply be retried. QUEUE FULL status is handled well, but TASK 
ABORTED leads to filesystem corruption.


So this is with a soft target implementation ... so it could be an
ordering issue inside the target that's causing the filesystem
corruption on error.


The target offers no ordering guarantees for SIMPLE commands and frankly 
says so to the initiator via QUEUE ALGORITHM MODIFIER value 1 in the 
control mode page. As we know, the initiator doesn't use ORDERED tags (and 
it really doesn't use them according to the logs), so if it's an 
ordering issue, it's on the initiator's side.



if you specifically set TAS=1 you're giving up the right to know what
caused the command termination.  With insufficient information, it's
really unsafe to simply retry, which is why the mid layer just returns
TASK ABORTED as an error.  If you set TAS=0 we'll get a check
condition/unit attention explaining what happened (usually commands
cleared by another initiator) and we'll explicitly do the right thing
based on the sense data.


But having TAS=1 is legal, right? So it should be handled well. If 
TAS=0, TASK ABORTED can't be returned, it would be illegal. So, TASK 
ABORTED status can only be returned with TAS=1.



One of my test suites has an initiator which randomly spits errors.
I've yet to see it cause an error that an ext3 journal can't recover
from.  So, if there's a genuine problem we need a nice test case to pass
to the filesystem people.


If you need a clear testcase (IMHO, in this case it isn't needed, 
because it's clear without it), I can prepare a patch for SCST to 
randomly return TASK ABORTED status.


You can get the latest version of SCST and the target drivers using SVN:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst


James




Re: [Bugme-new] [Bug 9405] New: iSCSI does not implement ordering guarantees required by e.g. journaling filesystems

2007-11-20 Thread Vladislav Bolkhovitin

James Bottomley wrote:

I'm not sure your conclusions necessarily follow your data.  What was
the reason for the TASK ABORTED (I'd guess QErr settings, right)?


It came from my curiosity during tests of SCST (http://scst.sf.net), 
when it was working with several initiators over different transports 
against the same set of devices, each of them with the TAS bit set in the 
control mode page. According to SAM, in this case TASK ABORTED status can 
be returned at any time, similarly to QUEUE FULL, i.e. IMHO such a command 
should simply be retried. QUEUE FULL status is handled well, but TASK 
ABORTED leads to filesystem corruption.


So this is with a soft target implementation ... so it could be an
ordering issue inside the target that's causing the filesystem
corruption on error.


The target offers no ordering guarantees for SIMPLE commands and frankly 
says so to the initiator via QUEUE ALGORITHM MODIFIER value 1 in the 
control mode page. As we know, the initiator doesn't use ORDERED tags (and 
it really doesn't use them according to the logs), so if it's an 
ordering issue, it's on the initiator's side.




if you specifically set TAS=1 you're giving up the right to know what
caused the command termination.  With insufficient information, it's
really unsafe to simply retry, which is why the mid layer just returns
TASK ABORTED as an error.  If you set TAS=0 we'll get a check
condition/unit attention explaining what happened (usually commands
cleared by another initiator) and we'll explicitly do the right thing
based on the sense data.


But having TAS=1 is legal, right? So it should be handled well. If 
TAS=0, TASK ABORTED can't be returned, it would be illegal. So, TASK 
ABORTED status can only be returned with TAS=1.


Driving with your handbrake on is legal too ... that doesn't mean you
should do it ... and it certainly doesn't give you a legitimate
complaint against the manufacturer of your car for excessive brake pad
wear.

We handle TASK ABORTED as well as we can (by failing it).  For better
handling set TAS=0 and we'll handle the individual cases according to
the sense codes.


So, should I take your words to mean that you think it's perfectly fine 
to corrupt the file system for devices with TAS=1? Absolutely legal devices, 
I repeat. Hence, in your opinion, no further investigation should be done?



One of my test suites has an initiator which randomly spits errors.
I've yet to see it cause an error that an ext3 journal can't recover
from.  So, if there's a genuine problem we need a nice test case to pass
to the filesystem people.


If you need a clear testcase (IMHO, in this case it isn't needed, 
because it's clear without it), I can prepare a patch for SCST to 
randomly return TASK ABORTED status.


You can get the latest version of SCST and the target drivers using SVN:

$ svn co https://scst.svn.sourceforge.net/svnroot/scst


There's no real need to bother with setting all this up ... a simple
initiator modification randomly to return TASK ABORTED should suffice.


Yes, you're right. Then, I suppose, Mike Christie should be the best 
person to do it?


Vlad


Re: [Bugme-new] [Bug 9405] New: iSCSI does not implement ordering guarantees required by e.g. journaling filesystems

2007-11-20 Thread Vladislav Bolkhovitin

James Bottomley wrote:

I'm not sure your conclusions necessarily follow your data.  What was
the reason for the TASK ABORTED (I'd guess QErr settings, right)?


It came from my curiosity during tests of SCST (http://scst.sf.net), 
when it was working with several initiators over different transports 
against the same set of devices, each of them with the TAS bit set in the 
control mode page. According to SAM, in this case TASK ABORTED status can 
be returned at any time, similarly to QUEUE FULL, i.e. IMHO such a command 
should simply be retried. QUEUE FULL status is handled well, but TASK 
ABORTED leads to filesystem corruption.


So this is with a soft target implementation ... so it could be an
ordering issue inside the target that's causing the filesystem
corruption on error.


The target offers no ordering guarantees for SIMPLE commands and frankly 
says so to the initiator via QUEUE ALGORITHM MODIFIER value 1 in the 
control mode page. As we know, the initiator doesn't use ORDERED tags (and 
it really doesn't use them according to the logs), so if it's an 
ordering issue, it's on the initiator's side.



if you specifically set TAS=1 you're giving up the right to know what
caused the command termination.  With insufficient information, it's
really unsafe to simply retry, which is why the mid layer just returns
TASK ABORTED as an error.  If you set TAS=0 we'll get a check
condition/unit attention explaining what happened (usually commands
cleared by another initiator) and we'll explicitly do the right thing
based on the sense data.


But having TAS=1 is legal, right? So it should be handled well. If 
TAS=0, TASK ABORTED can't be returned, it would be illegal. So, TASK 
ABORTED status can only be returned with TAS=1.


Driving with your handbrake on is legal too ... that doesn't mean you
should do it ... and it certainly doesn't give you a legitimate
complaint against the manufacturer of your car for excessive brake pad
wear.

We handle TASK ABORTED as well as we can (by failing it).  For better
handling set TAS=0 and we'll handle the individual cases according to
the sense codes.


So, should I take your words to mean that you think it's perfectly fine 
to corrupt the file system for devices with TAS=1? Absolutely legal devices, 
I repeat. Hence, in your opinion, no further investigation should be done?


Logic wouldn't support such a conclusion.


Sorry, lately I've been getting too many "I won't bother, this is your 
problem" style answers.



You have intertwined two issues

 1. How should the mid layer handle TASK ABORTED.  I think we've
reached the point where returning I/O error is the best we can
do, but if TAS=0 we could have used the sense data to do better.
 2. Should a request I/O error cause corruption in ext3 that can't
be recovered by a journal replay. I think the answer here is
no, so there needs to be an easily reproducible test case to
pass to the filesystem people.


OK, I see your point. As I already wrote, I can only assist with testing here.


James




Re: Target mode support for qlogic chipsets isp2422/2432/5422/5432

2007-10-23 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

On Tue, 23 Oct 2007 13:47:20 +0530
Thayumanavar Sachithanantham [EMAIL PROTECTED] wrote:



Hi All,

Does the recent target mode support added for tgt support target mode
for qla chipset (qla24xx series)?



We've been trying:

http://marc.info/?t=11885798674r=1w=2

But I heard that the qla24xx firmware doesn't support target mode (I
use QLA2340).


Standard QLogic QLA24xx firmware supports target mode.

Vlad


Re: qla2xxx behavior with changing volumes

2007-09-21 Thread Vladislav Bolkhovitin

Sean Bruno wrote:

What is the expected behavior when volumes on a SAN change size and LUN
ID order?

I've noticed that if a volume changes size, leaves the SAN or changes
target ID it isn't auto-magically picked up by a 2.6.18 based
system(running CentOS 5).

If a new target appears on the SAN however, it is noticed and assigned a
new drive letter.


For changes in the volume size the target (SAN) should generate 
CAPACITY DATA HAS CHANGED Unit Attention.


For changes in the LUN ID order the target should generate REPORTED 
LUNS DATA HAS CHANGED Unit Attention.


On these notifications the initiator is supposed to take the appropriate 
actions, like rescanning the SAN in the case of REPORTED LUNS DATA HAS 
CHANGED. Unfortunately, Linux just ignores them, as well as the majority 
of other Unit Attentions, hence you have to restart the system or, at 
least, reload the corresponding driver to see the changes.


Vlad


Re: qla2xxx behavior with changing volumes

2007-09-21 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin wrote:

Sean Bruno wrote:


What is the expected behavior when volumes on a SAN change size and LUN
ID order?

I've noticed that if a volume changes size, leaves the SAN or changes
target ID it isn't auto-magically picked up by a 2.6.18 based
system(running CentOS 5).

If a new target appears on the SAN however, it is noticed and assigned a
new drive letter.



For changes in the volume size the target (SAN) should generate 
CAPACITY DATA HAS CHANGED Unit Attention.


For changes in the LUN ID order the target should generate REPORTED 
LUNS DATA HAS CHANGED Unit Attention.


On these notifications initiator is supposed to make the appropriate 
actions, like rescan the SAN in case of REPORTED LUNS DATA HAS 
CHANGED. Unfortunately, Linux just ignores them as well as the majority 
of other Unit Attentions, hence you have to restart the system or, at 
least, the corresponding driver to see the changes.


Or, I forgot: you can also do a manual rescan via the sysfs rescan 
interface. Sometimes that helps too.
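
For completeness, the usual sysfs knobs (the host number and device address 
are just examples):

  # scan a host for new targets/LUNs
  echo "- - -" > /sys/class/scsi_host/host0/scan

  # re-read the capacity/parameters of an already known device
  echo 1 > /sys/class/scsi_device/0:0:0:0/device/rescan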



Vlad





Re: [Stgt-devel] Question for pass-through target design

2007-06-01 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin wrote:

So, if you need in-kernel pass-through I would suggest you look at the
SCST project (http://scst.sf.net), which is currently stable and mature,
although also not fully finished yet. It was designed from the very
beginning for full-featured in-kernel pass-through, not only for
stateless SCSI devices, like disks, but also for stateful SCSI devices
(like SSC ones, a.k.a. tapes), where the correct handling of all of the
above is essential. In addition to considerably better performance, the
complete in-kernel approach makes the code simpler, smaller and cleaner,
and allows such things as zero-copy buffered file I/O, i.e. data sent to
remote initiators or received from them directly from/to the page cache
(currently under development). For those who need to implement SCSI
devices in user space, the scst_user module is about to be added. Since
the SCSI state machine is in the kernel, the interface provided by
scst_user is very simple: it essentially consists of only a single IOCTL
and allows overhead as low as a single syscall per SCSI command without
any additional context switches. It is already implemented and works.
For some legal reasons I can't publish it at the moment, but you can see
its full description in the project's SVN docs (you can get them using
the command svn co https://svn.sourceforge.net/svnroot/scst/trunk/doc).


Now I have released the scst_user module and it is available from the SCST 
SVN, so you can check how simple it makes writing SCSI devices, like a VTL, 
in user space.


Vlad


Re: [Stgt-devel] Question for pass-through target design

2007-05-25 Thread Vladislav Bolkhovitin

Robert Jennings wrote:

* Vladislav Bolkhovitin ([EMAIL PROTECTED]) wrote:


Robert Jennings wrote:


What I meant that is that the kernel tgt code (scsi_tgt*) receives
SCSI commands from one lld and send them to another lld instead of
sending them to user space.


Although the approach of passing SCSI commands from a target LLD to an
initiator one without any significant interventions from the target
software looks to be nice and simple, you should realize how limited,
unsafe and illegal it is, since it badly violates SCSI specs.


I think that 'implemented cleanly' means that one scsi_host is assigned
to only one initiator.


Vladislav listed a number of issues that are inherent in an implementation
that does not have a 1:1 relationship of initiators to targets.  The vscsi
architecture defines the 1:1 relationship; it's impossible to have more
than one initiator per target.


Just a few small notes:

1. As I already wrote, a complete 1:1 relationship isn't possible in 
practice, because there is always local access on the target (i.e. one 
more initiator) and you can't disable it in practice.


I was proposing a 1:1 relationship of initiator to target within the
target framework for in-kernel pass-through.  We would still have the
case that local access on the target is possible; an administrator with
privileges necessary to create a target would have the responsibility
to not then access the device locally.  


This is no different than if I create my root file system on /dev/sda1,
I should not also 'dd' data to /dev/sda1 while the system is running.
It's a bad idea, but nothing stops me; however this is something that
only a root level user can do.  This would be the same, these targets in
pass-through have permissions by default that do not allow local access
by non-root users.


In principle, yes, but, as usual, in practice it's not so easy. In 
your file system example the device is accessed via the FS, which 
provides a shared mode, and nobody needs to do anything directly with 
the device. But non-disk devices are always accessed directly, so to 
explain your limitation you would have to write it in HUGE letters 
everywhere. Once an SCST user cleared a Unit Attention on his exported 
tape device using the st driver and then asked me why it wasn't 
delivered to his remote initiator.


2. A 1:1 relationship is a serious limitation for use cases like an SPI 
tape library serving backups for several servers on an FC network.


Restricting the relationship to 1:1 would be for pass-through devices
only, this would not necessarily dictate other target types which could
be used for such cases.


The tape library from my example is the pass-through device. You can't 
access a parallel SCSI (SPI) device over Fibre Channel (FC) in any 
other mode, right?


Vlad


Re: [Scst-devel] Problems with SCST and QLA 2432 FC Cards

2007-05-18 Thread Vladislav Bolkhovitin

sandip shete wrote:

Hi,

I am working with the SCST 0.9.4 version on linux-2.6.15 with the
linux-2.6-qla2xxx-target.patch patch applied.
I was using a QLA2312 card on this setup and things were just fine when
i used this system as a Target.

Now I have switched to a qla2432 card and even though i do enable
Target Mode (echo 1 > /sys/class/scsi_host/host../target_mode_enabled) on the corresponding
port, this port fails to work as a target, and none of the Fileio Luns
are exported to the initiator.
Also, on the initiator side the /sys/class/fc_remote_port//role
file should show as FC Target, which it used to with QLA 2312, but
with QLA 2432 the initiator side shows the role of the remote port as
FC Initiator

The initiator has 2312 cards and 2.6.15 kernel compiled on it.

Also note that, i have the corresponding ql2400_fw.bin firmware binary
at the right location and it gets loaded when i load the modules.
To check if the qla2432 card was working fine, i connected this to a
different 2312 based target system and had it work as a Initiator, this
worked fine and i could see all the luns exported on this box.

Now the only problem that i can think of in target mode is, maybe, scst
doesn't support the qla 24xx series.


Yes, that's correct. Unfortunately, 24xx+ series are not supported yet.


But i fail to see any part of the code pointing towards that.


You can see in the README for the driver that only 22xx and 23xx series 
are currently supported.



When i  did some debugging on the initiator side i see that the
qla2x00_get_port_database does return the status
of the remote port as FCT_INITIATOR, i couldn't actually figure out the
code wherein the target returns the response to these mbox_commands. I
was wondering if SCST plays a part here and sends a different response
when 24xx cards are used.


Unfortunately, 24xx+ cards have a very different interface, so adding 
support for them is almost the same as writing another driver.



I saw some posts regarding the problems that people were facing with qla24xx
series. If this has been fixed in a different verison of Linux/SCST 
that what i am using, please let me know.


Thanks and Regards.
Sandip S



Re: [Scst-devel] Problems with SCST and QLA 2432 FC Cards

2007-05-18 Thread Vladislav Bolkhovitin

sandip shete wrote:

Hi,

I wish to develop support for QLA 24xx series. If you already have a 
partial implementaion of the same, i would like to take it forward.
And if there isn't, i would appreciate if you could give me some 
pointers in that direction.


Most probably, the driver at the link sent by Matthew Jacob will be a good 
starting point, where you can see examples of how to work with the card, so 
you can add that to the qla2x00t driver. You will also need the firmware 
interface specification manual for the 2400 series of cards. It is 
under NDA, but you may be lucky enough to get one from QLogic. Feel free to 
ask me any SCST or qla2x00t driver related questions.


I have adequate experience of programming in the SCSI domain, however i 
am not much conversant with the QLA driver code.


Thanks and Regards,
Sandip S



Re: [Stgt-devel] Question for pass-through target design

2007-05-07 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

It looks like the pass-through target support is currently broken, at
least as I've checked for ibmvstgt, but I think it's a general problem.
I wanted to check my assumptions and get ideas.


Yeah, unfortunately, it works only with the iSCSI target driver (which
runs in user space).




The code isn't allocating any memory to pass along to the sg code to store
the result of a read or data for a write.  Currently, dxferp for sg_io_hdr
or dout_xferp/din_xferp for sg_io_v4 are assigned to the value of uaddr,
which is set to 0 in kern_queue_cmd.  With the pointer set to NULL,
the pass-through target isn't going to function.  Even if we had memory
allocated, there isn't a means of getting data to be written via sg down
this code path.

What ideas are there as to how the data will get to user-space so that
we can use sg?


For kernel-space drivers, we don't need to go to user-space. We can do
the pass-through in kernel space. I talked with James about this last
year and he said that if the code is implemented cleanly, he would
merges it into mainline.


We already have a pass-through in the kernel space for
kernel space drivers. It is the scsi_tgt* code.



Could you elaborate more?

What I meant that is that the kernel tgt code (scsi_tgt*) receives
SCSI commands from one lld and send them to another lld instead of
sending them to user space.


Although the approach of passing SCSI commands from a target LLD to an
initiator one without any significant intervention from the target
software looks nice and simple, you should realize how limited,
unsafe and illegal it is, since it badly violates the SCSI specs.

Before I elaborate, let's establish the following terminology in addition to
the one described in SAM:

 - Target system - the overall system containing target and initiator
devices (and their LLDs). The target system exports one or more initiator
devices via the target device(s).

 - Target device - a SCSI device on the target system in target mode.

 - Initiator device - a SCSI device on the target system in
initiator mode. It actually serves commands that come from remote
initiators via the target device(s).

 - Remote initiator - a SCSI initiator device connected to a target
device on the target system, which uses (i.e. sends SCSI commands to) the
devices exported by it.

 - Target software - software that runs on the target system and
implements the necessary pass-through functionality.

Let's consider the simplest case, when a target system has one target
device and one initiator device, and it exports the initiator device via the
target device as pass-through. The problem is that the target
system then creates a new SCSI target device, which is not the same as the
exported initiator device. In particular, the new device could have more
than one nexus with remote initiators connected to it, while the initiator
device has no clue about them; it sees a single nexus with the target
system and only that one.

And so? All the event notifications which should be seen by all remote
initiators will be delivered to only one of them, or not generated at
all, since some events are generated only for I_T nexuses other than the
one on which the command causing the event was received. The most common
example of such events is Unit Attentions. For example, after a MODE
SELECT command, all remote initiators except the one who sent the command
shall receive the MODE PARAMETERS CHANGED Unit Attention. Otherwise, bad
and quiet data corruption could happen.

A more complicated example is SCSI reservations, whether persistent
or SPC-2 ones. Since the initiator device knows only about one nexus,
instead of the actual many of them, the reservation commands have to be
completely handled by the target software on the target system. Delivery
of Unit Attentions to all remote initiators is especially important for
reservations, since they could mean that a reservation was revoked by
another initiator via, e.g., some task management function.

Things get even worse if we realize that (1) the initiator device could
report capabilities (like ACA support) which aren't supported by the
target software, hence misinforming the remote initiators and again
possibly provoking quiet data corruption, and (2) accesses to the initiator
devices from local programs on the target system create another I_T
nexus, which needs to be handled as well.

(I suppose it is obvious that if the target system exports more than one
initiator device via a single target device, then, since the initiator
devices don't know about each other, the target software in any case needs
to implement its own LUN addressing as well as its own REPORT LUNS command
handler.)

Thus, such an in-kernel pass-through mode could be used only for a limited
set of SCSI commands and SCSI device types, with great caution and a
complete comprehension of what's going on and how it should work. The latter
isn't true for the absolute majority of uses and users, so such an approach
would give users a perfect weapon to shoot themselves with.

If you 

Re: [Stgt-devel] Question for pass-through target design

2007-05-07 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

From: Vladislav Bolkhovitin [EMAIL PROTECTED]
Subject: Re: [Stgt-devel] Question for pass-through target design
Date: Mon, 07 May 2007 18:24:44 +0400



FUJITA Tomonori wrote:


It looks like the pass-through target support is currently broken, at
least as I've checked for ibmvstgt, but I think it's a general problem.
I wanted to check my assumptions and get ideas.


Yeah, unfortunately, it works only with the iSCSI target driver (which
runs in user space).





The code isn't allocating any memory to pass along to the sg code to store
the result of a read or data for a write.  Currently, dxferp for sg_io_hdr
or dout_xferp/din_xferp for sg_io_v4 are assigned to the value of uaddr,
which is set to 0 in kern_queue_cmd.  With the pointer set to NULL,
the pass-through target isn't going to function.  Even if we had memory
allocated, there isn't a means of getting data to be written via sg down
this code path.

What ideas are there as to how the data will get to user-space so that
we can use sg?


For kernel-space drivers, we don't need to go to user-space. We can do
the pass-through in kernel space. I talked with James about this last
year and he said that if the code is implemented cleanly, he would
merge it into mainline.


We already have a pass-through in the kernel space for
kernel space drivers. It is the scsi_tgt* code.



Could you elaborate more?

What I meant is that the kernel tgt code (scsi_tgt*) receives
SCSI commands from one lld and sends them to another lld instead of
sending them to user space.


Although the approach of passing SCSI commands from a target LLD to an
initiator one without any significant intervention from the target
software looks nice and simple, you should realize how limited, unsafe
and illegal it is, since it badly violates the SCSI specs.



I think that 'implemented cleanly' means that one scsi_host is assigned
to only one initiator.


Sorry, I don't fully understand you. If you mean you are going to limit
it to only one remote initiator per target device, then, well, isn't it
even more limited (and limiting)?



___
Stgt-devel mailing list
[EMAIL PROTECTED]
https://lists.berlios.de/mailman/listinfo/stgt-devel



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Stgt-devel] Question for pass-through target design

2007-05-07 Thread Vladislav Bolkhovitin

FUJITA Tomonori wrote:

From: Vladislav Bolkhovitin [EMAIL PROTECTED]
Subject: Re: [Stgt-devel] Question for pass-through target design
Date: Mon, 07 May 2007 19:27:23 +0400



FUJITA Tomonori wrote:


From: Vladislav Bolkhovitin [EMAIL PROTECTED]
Subject: Re: [Stgt-devel] Question for pass-through target design
Date: Mon, 07 May 2007 18:24:44 +0400




FUJITA Tomonori wrote:



It looks like the pass-through target support is currently broken, at
least as I've checked for ibmvstgt, but I think it's a general problem.
I wanted to check my assumptions and get ideas.


Yeah, unfortunately, it works only with the iSCSI target driver (which
runs in user space).






The code isn't allocating any memory to pass along to the sg code to store
the result of a read or data for a write.  Currently, dxferp for sg_io_hdr
or dout_xferp/din_xferp for sg_io_v4 are assigned to the value of uaddr,
which is set to 0 in kern_queue_cmd.  With the pointer set to NULL,
the pass-through target isn't going to function.  Even if we had memory
allocated, there isn't a means of getting data to be written via sg down
this code path.

What ideas are there as to how the data will get to user-space so that
we can use sg?


For kernel-space drivers, we don't need to go to user-space. We can do
the pass-through in kernel space. I talked with James about this last
year and he said that if the code is implemented cleanly, he would
merge it into mainline.


We already have a pass-through in the kernel space for
kernel space drivers. It is the scsi_tgt* code.



Could you elaborate more?

What I meant is that the kernel tgt code (scsi_tgt*) receives
SCSI commands from one lld and sends them to another lld instead of
sending them to user space.


Although the approach of passing SCSI commands from a target LLD to an
initiator one without any significant intervention from the target
software looks nice and simple, you should realize how limited, unsafe
and illegal it is, since it badly violates the SCSI specs.



I think that 'implemented cleanly' means that one scsi_host is assigned
to only one initiator.


Sorry, I don't fully understand you. If you mean you are going to limit
it to only one remote initiator per target device, then, well, isn't it
even more limited (and limiting)?



The target software assigns one scsi_host to only one remote
initiator. For FC, NPIV works nicely.


OK, if such a limitation is OK for your users, then I'm happy for you.


___
Stgt-devel mailing list
[EMAIL PROTECTED]
https://lists.berlios.de/mailman/listinfo/stgt-devel



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Stgt-devel] Question for pass-through target design

2007-05-07 Thread Vladislav Bolkhovitin

Vladislav Bolkhovitin wrote:

FUJITA Tomonori wrote:


From: Vladislav Bolkhovitin [EMAIL PROTECTED]
Subject: Re: [Stgt-devel] Question for pass-through target design
Date: Mon, 07 May 2007 19:27:23 +0400



FUJITA Tomonori wrote:


From: Vladislav Bolkhovitin [EMAIL PROTECTED]
Subject: Re: [Stgt-devel] Question for pass-through target design
Date: Mon, 07 May 2007 18:24:44 +0400




FUJITA Tomonori wrote:


It looks like the pass-through target support is currently 
broken, at
least as I've checked for ibmvstgt, but I think it's a general 
problem.

I wanted to check my assumptions and get ideas.



Yeah, unfortunately, it works only with the iSCSI target driver 
(which

runs in user space).





The code isn't allocating any memory to pass along to the sg 
code to store
the result of a read or data for a write.  Currently, dxferp 
for sg_io_hdr
or dout_xferp/din_xferp for sg_io_v4 are assigned to the value 
of uaddr,
which is set to 0 in kern_queue_cmd.  With the pointer set to 
NULL,
the pass-through target isn't going to function.  Even if we 
had memory
allocated, there isn't a means of getting data to be written 
via sg down

this code path.

What ideas are there as to how the data will get to user-space 
so that

we can use sg?



For kernel-space drivers, we don't need to go to user-space. We 
can do
the pass-through in kernel space. I talked with James about this 
last

year and he said that if the code is implemented cleanly, he would
merge it into mainline.



We already have a pass-through in the kernel space for
kernel space drivers. It is the scsi_tgt* code.




Could you elaborate more?

What I meant is that the kernel tgt code (scsi_tgt*) receives
SCSI commands from one lld and sends them to another lld instead of
sending them to user space.



Although the approach of passing SCSI commands from a target LLD to an
initiator one without any significant intervention from the target
software looks nice and simple, you should realize how limited, unsafe
and illegal it is, since it badly violates the SCSI specs.




I think that 'implemented cleanly' means that one scsi_host is assigned
to only one initiator.



Sorry, I don't fully understand you. If you mean you are going to limit
it to only one remote initiator per target device, then, well, isn't it
even more limited (and limiting)?




The target software assigns one scsi_host to only one remote
initiator. For FC, NPIV works nicely.



OK, if such a limitation is OK for your users, then I'm happy for you.


And don't forget to tell them that they must not touch the exported 
devices locally ;)



___
Stgt-devel mailing list
[EMAIL PROTECTED]
https://lists.berlios.de/mailman/listinfo/stgt-devel






-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] sd: implement START/STOP management

2007-03-22 Thread Vladislav Bolkhovitin

Tejun Heo wrote:

Hello, Douglas.

Douglas Gilbert wrote:


Tejun,
I note at this point that the IMMED bit in the
START STOP UNIT cdb is clear. [The code might
note that as well.] All SCSI disks that I have
seen, implement the IMMED bit and according to
the SAT standard, so should SAT layers like the
one in libata.

With the IMMED bit clear:
 - on spin up, it will wait until disk is ready.
   Okay unless there are a lot of disks, in
   which case we could ask Matthew Wilcox for help
 - on spin down, will wait until media is
   stopped. That could be 20 seconds, and if there
   were multiple disks 

I guess the question is do we need to wait until a
disk is spun down before dropping power to it
and suspending.



I think we do.  As we're issuing SYNCHRONIZE CACHE prior to spinning
down disks, it's probably okay to drop power early data-integrity-wise
but still...

We can definitely use IMMED=1 during resume (needs to be throttled
somehow tho).  This helps even when there is only one disk.  We can let
the disk spin up in the background and proceed with the rest of resuming
process.  Unfortunately, libata SAT layer doesn't do IMMED and even if
it does (I've tried and have a patch available) it doesn't really work
because during host resume each port enters EH and resets and
revalidates each device.  Many if not most ATA harddisks don't respond
to reset or IDENTIFY till it's fully spun up meaning libata EH has to
wait for all drives to spin up.  libata EH runs inside SCSI EH thread
meaning SCSI command issue blocks till libata EH finishes resetting the
port.  So, IMMED or not, sd gotta wait for libata disks.

If we want to do parallel spin down, PM core needs to be updated such
that there are two events - issue and done - somewhat similar to what
SCSI is doing to probe devices parallelly.  If we're gonna do that, we
maybe can apply the same mechanism to resume path so that we can do
things parallelly IMMED or not.


It seems there is another way of doing a bank spin up / spin down: doing
it in two passes. On the first pass, START_STOP is issued with IMMED=1 on
all devices; then, on the second pass, START_STOP is issued with IMMED=0.
So the devices will spin up / spin down in parallel, but synchronously,
hence the needed result will be achieved with minimal code changes,
although it will indeed need upper-layer changes in the callers of struct
device_driver's suspend(), resume(), etc.
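
To make the two-pass idea concrete, here is a user-space style sketch
using SG_IO (purely illustrative; the real change would of course live in
sd/libata and the PM core, not in an ioctl loop):

#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* START STOP UNIT: opcode 0x1B, IMMED is bit 0 of byte 1,
 * START is bit 0 of byte 4. */
static int start_stop(int fd, int start, int immed)
{
        unsigned char cdb[6] = { 0x1B, immed ? 1 : 0, 0, 0,
                                 start ? 1 : 0, 0 };
        unsigned char sense[32];
        struct sg_io_hdr io;

        memset(&io, 0, sizeof(io));
        io.interface_id = 'S';
        io.cmd_len = sizeof(cdb);
        io.cmdp = cdb;
        io.dxfer_direction = SG_DXFER_NONE;
        io.sbp = sense;
        io.mx_sb_len = sizeof(sense);
        io.timeout = 60000;             /* ms */

        return ioctl(fd, SG_IO, &io);
}

/* Two-pass spin down of a set of already opened disks:
 *   pass 1: start_stop(fd[i], 0, 1) for all i -- returns immediately,
 *   pass 2: start_stop(fd[i], 0, 0) for all i -- returns when stopped,
 * so the drives stop in parallel, but we still wait for all of them. */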


Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] sd: implement START/STOP management

2007-03-22 Thread Vladislav Bolkhovitin

Henrique de Moraes Holschuh wrote:

On Thu, 22 Mar 2007, Vladislav Bolkhovitin wrote:

It seems there is another way of doing a bank spin up / spin down: doing
it in two passes. On the first pass, START_STOP is issued with IMMED=1 on
all devices; then, on the second pass, START_STOP is issued with IMMED=0.
So the devices will spin up / spin down in parallel, but synchronously,
hence the needed result will be achieved



And maybe trip the PSU's overcurrent defenses?  There is a reason to default
to sequential spin-up for disks... 


But on spin down there is no such problem.


Of course, it can be user-selectable. But should it be the default?



-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] SCSI target for IBM Power5 LPAR

2005-09-07 Thread Vladislav Bolkhovitin

Dave C Boutcher wrote:

On Wed, Sep 07, 2005 at 12:49:32PM +0200, Christoph Hellwig wrote:


On Tue, Sep 06, 2005 at 04:28:01PM -0500, Dave C Boutcher wrote:


This device driver provides the SCSI target side of the virtual
SCSI on IBM Power5 systems.  The initiator side has been in mainline
for a while now (drivers/scsi/ibmvscsi/ibmvscsi.c.)  Targets already
exist for AIX and OS/400.


Please try to integrate that with the generic scsi target framework at
http://developer.berlios.de/projects/stgt/.



There hasn't been a lot of forward progress on stgt in over a year, and
there were some issues (lack of scatterlist support, synchronous and
serial command execution) the last time I looked.

Vlad, can you comment on the state of stgt and whether you see it
being ready for mainline any time soon?


Sorry, on the stgt page I can see only the mailing list archive, and not
from the start (only from Aug 22). Mike, can I see the stgt code and some
design description, please? You can send it directly to my e-mail
address, if necessary.


Vlad


-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] SCSI target for IBM Power5 LPAR/SCST 0.9.3-pre1 published

2005-09-07 Thread Vladislav Bolkhovitin

Mike Christie wrote:

Vladislav Bolkhovitin wrote:
Sorry, on the stgt page I can see only the mailing list archive, and not
from the start (only from Aug 22). Mike, can I see the stgt code and some
design description, please? You can send it directly to my e-mail
address, if necessary.


goto the svn page for the code
http://developer.berlios.de/svn/?group_id=4492

As for design desc, I do not have anything. It is the evolving source :) 
We are slowly merging lessons we learned from open-iscsi, your SCST 
code, the available software and HW targets, and the SCSI ULD's 
scatterlist code which needs redoing so it is a bit of a mess.


OK, thanks, will try tomorrow.

I put SCST 0.9.3-pre1 on its page
(http://sourceforge.net/projects/scst/). This is not the latest, but it
is the one that is working. At the end of this week I'll try to put the
latest one there as well. Hope you will learn some more lessons from
it :).


Any comments are welcome.

Vlad

-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] iSCSI enterprise target software

2005-03-02 Thread Vladislav Bolkhovitin

Bryan Henderson wrote:

You want to *use* the kernel pagecache as much as you can.

No, I really don't.  Not always.  I can think of only 2 reasons to 
maximize my use of the kernel pagecache: 1) saves me duplicating code; 2) 
allows me to share resources (memory and disk bandwidth come to mind) with 
others in the same Linux system fairly.  There are many cases where those 
two benefits are outweighed by the benefits of using some other cache.  If 
you're thinking of other benefits of using the pagecache, let's hear them.

You forgot the third reason (benefit), though it isn't directly related
to the page cache: read-ahead. It greatly influences performance, and a
direct I/O application has to reimplement this logic, which generally
isn't a straightforward task.
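
To give an idea of what that reimplementation means even in the simplest
case, a direct I/O application ends up carrying something like the
following (a hand-written sketch and nothing more: no O_DIRECT alignment
handling, no asynchronous submission, no read-ahead window scaling, all
of which a real implementation needs):

#include <sys/types.h>
#include <unistd.h>

#define RA_CHUNK (256 * 1024)

struct ra_state {
        off_t next_expected;    /* where a sequential stream would continue */
        int   streak;           /* consecutive sequential reads seen */
        void  *buf;             /* RA_CHUNK prefetch buffer */
        off_t buf_off;          /* file offset cached in buf, or -1 */
};

/* Call after every application read to detect sequential access and
 * prefetch the next chunk. */
static void ra_after_read(struct ra_state *ra, int fd, off_t off, size_t len)
{
        if (off == ra->next_expected)
                ra->streak++;
        else
                ra->streak = 0;
        ra->next_expected = off + (off_t)len;

        if (ra->streak >= 2 && ra->buf_off != ra->next_expected) {
                /* synchronous pread() for brevity; real code would submit
                 * this asynchronously so it overlaps with processing */
                if (pread(fd, ra->buf, RA_CHUNK, ra->next_expected) > 0)
                        ra->buf_off = ra->next_expected;
        }
}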

Vlad
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html