[EMAIL PROTECTED] wrote on Sun, 19 Aug 2007 16:26 -0700:
> Your proposal for separating the I/O path from the metadata path using
> existing iSCSI kind of protocols sounds quite interesting and
> intriguing. Just to clarify my understanding and also to spark a
> discussion along these lines, I have jotted down my thoughts;
> please let me know if I have understood your proposal correctly.

It is still crazy talk.  Nobody need take this too seriously, but we
can think about it for fun.

> - This proposal calls for a split fast-path I/O that will start out
> optional (and possibly remain optional), since we don't know what the
> performance implications of this path are at scale.
> Presumably this can be tweaked to be a per-file/per-open option..?

Yes, no problem to do that.

> - mount & all metadata operations of non-opened files remain the same
> using either the existing client-core model or possibly the fuse
> alternative.

Right, perhaps letting us use the simpler fuse interface.  And we
control the data path and avoid the page cache if we want.  The
major point is we use existing, in-kernel supported data paths to
talk to PVFS servers instead of this painful
data-through-userspace-helper approach we have now.

> - iSCSI target mode code should be added to the servers so that they
> can service iSCSI PDUs. This will need a fair amount of tweaking
> and could possibly leverage Pete's recent OSD work.

Hopefully not a big deal.  OSD SCSI commands include:

    read(handle, offset, length)
    write(handle, offset, length)

and other fun stuff, but these are hopefully all we will care about
for the data path.  Today we map read/write to files, but it is
conceptually simple to use the handle to find the bstream.  And
offset/length should be the physical ones on the server, as
calculated by the client.  Or we could run this through trove if
there are data caching issues to worry about.
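On the server side, servicing those two commands could be little more than a lookup plus pread().  Here's a userspace sketch of the idea; the helper name bstream_path_for_handle() and the /pvfs-storage path are invented for illustration, not the actual server layout:

```c
/* Sketch: how a server might service OSD READ(handle, offset, length)
   by mapping the object handle to its on-disk bstream file.
   Names and paths here are hypothetical. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

/* Map an OSD object id (the datafile handle) to a bstream path;
   a real server keeps bstreams under its storage-space directory. */
static void bstream_path_for_handle(uint64_t handle, char *buf, size_t len)
{
    snprintf(buf, len, "/pvfs-storage/bstreams/%016llx.bstream",
             (unsigned long long)handle);
}

/* The offset/length are already physical, as computed by the client,
   so the read itself is a plain pread of the bstream. */
static ssize_t osd_read_bstream(uint64_t handle, uint64_t offset,
                                void *data, uint32_t length)
{
    char path[128];
    int fd;
    ssize_t n;

    bstream_path_for_handle(handle, path, sizeof(path));
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    n = pread(fd, data, length, (off_t)offset);
    close(fd);
    return n;
}
```

Going through trove instead would just replace the open/pread pair with the corresponding trove bstream read call.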

> - on an open, we upcall to the client-core and fetch into the kernel
> a list of the data file handles & BMI addresses of the corresponding
> servers. Assume for simplicity that we will only handle a simple
> stripe distribution (round-robin) across all the servers.
> When we return to the kernel module, we send an iSCSI login request to
> each of the data servers that back the striped file.
> Once that is done, the call returns to the caller. Consequently,
> every open of a file on PVFS will result in the creation of "n" SCSI
> initiator end-points, where "n" is the # of data servers. (Don't know
> what impact this will have on the scalability of the Linux SCSI
> stack/iSCSI initiator??)
> - Do we need to log in each time? I think login can/should be made a
> one-time operation to each server.

You could also login to the datafile servers at mount time, and
logout at unmount.  Might be quite a bit simpler that way.  Get the
config, wire up communication with the data servers.  iSCSI handles
re-login if they go down/up or migrate.

> - any operation involving an already opened file, such as fstat,
> read, write, etc., should be mapped to a SCSI command, packetized,
> and sent over the previously created iSCSI session's connection.
> Some of the offset calculations, etc., would therefore need to move
> into the kernel module.

Suggest you just think about read and write for now.  Send fstat to
the metadata servers.  There is some complexity in finding file sizes
and such, but we can use the normal server path to query bstream sizes
on the existing data servers, even though we read the bytes through
the SCSI path.  To get fancier later you can send more commands
through the SCSI path, but that's not the point of this exercise;
that's my research.  :)
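The offset calculation the kernel module would inherit is just the usual round-robin striping arithmetic.  A sketch, with stripe size and server count as plain parameters (a real PVFS client gets these from the file's distribution metadata):

```c
/* Sketch: turn a logical file offset into (server index, physical
   offset within that server's bstream) under simple round-robin
   striping.  This is the calculation the client/kernel module must
   do before issuing an OSD READ/WRITE to the right data server. */
#include <stdint.h>

struct stripe_loc {
    uint32_t server;     /* which data server holds this byte */
    uint64_t phys_off;   /* offset within that server's bstream */
};

static struct stripe_loc logical_to_physical(uint64_t logical_off,
                                             uint64_t stripe_size,
                                             uint32_t nservers)
{
    uint64_t stripe = logical_off / stripe_size; /* global stripe number */
    struct stripe_loc loc;

    loc.server = (uint32_t)(stripe % nservers);
    /* each full round across nservers stripes adds one stripe_size
       to every server's bstream */
    loc.phys_off = (stripe / nservers) * stripe_size
                 + logical_off % stripe_size;
    return loc;
}
```

A request that spans a stripe boundary would simply be split into per-server (offset, length) pieces by iterating this calculation.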

> - Essentially, the bulk of the work involved is in presenting each
> data file handle on the server as a LUN. Since we do this only for
> opened files, this shouldn't be that big a scalability issue..?

Not as a LUN each.  Instead, present all the datafiles as objects in
a single LUN.  There's no concept of open in OSD; you just access the
handle when you want to.  I'm pushing the OSD stuff rather than a
block interface since it makes these issues so much easier.  Note
you don't need a kernel patch for this.  It's all just SCSI.  There
just happens not to be a cute in-kernel interface to generate the
right sequence of bytes that make up OSD commands, so we'll do that
in the pvfs module.  It's easy.
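To give a feel for "generating the right sequence of bytes": OSD uses the SCSI variable-length CDB (opcode 0x7F) with a two-byte service action.  The field offsets and the READ service action value below are from memory of the T10 OSD draft and should be treated as assumptions to check against the spec, not a working encoder:

```c
/* Sketch: build an OSD READ CDB in the pvfs module.  Opcode 0x7F
   (variable-length CDB) and service action 0x8805 (READ) follow my
   recollection of the T10 OSD draft; the byte offsets of the object
   id, length, and starting byte address fields are assumptions. */
#include <stdint.h>
#include <string.h>

#define OSD_CDB_LEN       200
#define VARLEN_CDB_OPCODE 0x7F
#define OSD_SA_READ       0x8805

/* store a 64-bit value big-endian, as SCSI CDB fields are */
static void put_be64(uint8_t *p, uint64_t v)
{
    int i;
    for (i = 0; i < 8; i++)
        p[i] = (uint8_t)(v >> (56 - 8 * i));
}

static void build_osd_read_cdb(uint8_t *cdb, uint64_t object_id,
                               uint64_t offset, uint64_t length)
{
    memset(cdb, 0, OSD_CDB_LEN);
    cdb[0] = VARLEN_CDB_OPCODE;
    cdb[7] = OSD_CDB_LEN - 8;            /* additional CDB length */
    cdb[8] = (OSD_SA_READ >> 8) & 0xff;  /* service action, big-endian */
    cdb[9] = OSD_SA_READ & 0xff;
    /* field offsets below are assumed from the draft spec */
    put_be64(&cdb[24], object_id);  /* user object id = datafile handle */
    put_be64(&cdb[36], length);
    put_be64(&cdb[44], offset);     /* starting byte address */
}
```

The write variant would be identical apart from the service action, which is why none of this needs kernel changes: it's just bytes handed to the SCSI midlayer.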

> - What do we respond to a REPORT LUNS command? See below for one possibility.
>   If, as part of the open system call implementation, we send an
> out-of-band PVFS message (scalability...?) to inform the servers
> to add the corresponding data file handles as eligible LUNs, then we
> could report all those LUN ids.
> When does the Linux iSCSI initiator stack send a REPORT LUNS, btw?

That magic is all pretty clean.  We can fake a single LUN of type
OSD, as we do now.  This all still has a full PVFS server in the same
process that can do the heavy lifting for create, remove, stat, etc.

> At the end of the day, it looks like we will incur a heavy cost on
> open() to improve the cost of I/O, which is OK if we can do
> openg()/openfh() types of calls.
> 
> Did I understand your proposal correctly? Will this work?
> thoughts?

Yeah, except open isn't a problem as I see it.

We have the full stack to do this in userspace from a PVFS client.
What is missing are the kernel bits to generate SCSI commands, and
the integration of the OSD target with the PVFS server.

If you want to think about the kernel side, you can pretend you have
two kernel functions:

    struct request *issue_osd_read_to_user(
        struct request_queue *q,  /* from login, opaque to pvfs */
        uint64_t handle,          /* datafile handle */
        void __user *buf,         /* user buf, len */
        uint32_t len);

and similar for write_from_user.  These issue requests to SCSI (in
user context) and return asynchronously.  Then there is a callback
function that hands you back the struct request *, in bottom-half
context, in which you can look at the actual read length,
success/failure, etc.  And a request_free function that cleans up.

For the login, we'll probably want a user-space helper, say
pvfs2-client, to invoke existing Linux tools (iscsiadm).  All you
need is an IP address and port to get a connection up.
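Roughly, the helper would run something like the following at mount time, using the standard open-iscsi tools.  The portal address and target IQN here are made-up examples; in practice pvfs2-client would take them from the fs config:

```shell
# Discover and log in to one data server's OSD target (example values).
iscsiadm -m discovery -t sendtargets -p 10.0.0.5:3260
iscsiadm -m node -T iqn.2007-08.org.pvfs:datafiles -p 10.0.0.5:3260 --login

# ...and the matching logout at unmount time:
iscsiadm -m node -T iqn.2007-08.org.pvfs:datafiles -p 10.0.0.5:3260 --logout
```

One login pair per data server, repeated from the server list in the config.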

Let me know how I can help you with this.

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
