On 9/14/05, Roland Kuhn <[EMAIL PROTECTED]> wrote:
> Hi Tom!
>
> On 14 Sep 2005, at 11:15, Tom Keiser wrote:
>
> > On 9/14/05, Roland Kuhn <[EMAIL PROTECTED]> wrote:
> >
> >> Dear experts!
> >>
> >> Having just strace'd the fileserver (non-LWP, single-threaded) on
> >> Linux, I noticed that the data are read from disk using readv in
> >> packets of 1396 bytes, 16kB per syscall. In the face of chunksize=1MB
> >> from the client side this does not seem terribly efficient to me, but
> >> of course I see the benefit of reading chunks which can readily be
> >> transferred. If my interpretation is wrong or this is an artifact of
> >> not using tviced, please say so (if possible with a short reference
> >> to the source), otherwise it would be nice to know why the fileserver
> >> cannot read(fd, buf, 1048576), as that would give at least one order
> >> of magnitude better performance from the RAID and (journalled)
> >> filesystem.
> >>
> >
> > This is an artifact of the bad decisions that were made when
> > implementing the rx jumbogram protocol many years ago. Unfortunately,
> > jumbogram extension headers are interspersed between each data
> > continuation vector. Thus, we need a separate system iovec for each
> > rx packet continuation buffer. The end result is that storedata_rxstyle
> > and fetchdata_rxstyle end up doing two vector I/O syscalls
> > (recvmsg+writev or readv+sendmsg) per ~16kB of data. The jumbogram
> > protocol needs to be replaced.
>
> Thanks for the explanation. Wouldn't it be possible to keep the
> network protocol (including the sendmsg) as it is, but still to read
> bigger chunks? The outgoing messages are constructed using iovecs
> anyway, so why not intersperse the extension headers at sendmsg time?
>
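To make the current behavior concrete, the disk read for one jumbogram on
the fetch path looks roughly like this (a simplified sketch with made-up
names and constants, not the actual fetchdata_rxstyle code):

/* sketch only: illustrative names, not the real fileserver code */
#include <sys/uio.h>
#include <unistd.h>

#define CONT_BUF_SIZE 1396   /* rx continuation buffer payload size */
#define MAX_IOVECS    16     /* iovec limit discussed in this thread */

static ssize_t
read_one_jumbogram(int fd, char *cont_bufs[MAX_IOVECS], int nbufs)
{
    struct iovec iov[MAX_IOVECS];
    int i;

    /* one iovec per rx continuation buffer, because the jumbogram
     * extension headers sit between the data buffers in memory */
    for (i = 0; i < nbufs && i < MAX_IOVECS; i++) {
        iov[i].iov_base = cont_bufs[i];
        iov[i].iov_len  = CONT_BUF_SIZE;
    }

    /* => roughly 16kB moved per syscall, regardless of the 1MB
     * chunksize the client asked for */
    return readv(fd, iov, i);
}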
There are some "workarounds" to this problem.

First, we could abandon the current zero-copy semantics and just do very
large reads and writes to the disk, and then do memcpy's in userspace.
For fast machines, this will almost certainly beat the current algorithm
for raw throughput, but it's certainly not what I'd call an elegant
solution.

Second, we could use iovecs for the extension headers. Unfortunately,
most OSes limit us to 16 iovecs per call, so this would cut our max
jumbogram size nearly in half.

There is a third alternative, however: using POSIX async I/O's
lio_listio() call to perform read-ahead / async write-behind. For
storedata_rxstyle, we could queue as much I/O as possible, and only block
on disk I/O once all the data is queued in the kernel (or when the async
queue fills). Implementing fetchdata_rxstyle this way would be more
involved, as we would probably want some form of adaptive read-ahead
scheduler. (A rough sketch of the read-ahead side is in the P.S. below.)

--
Tom Keiser
[EMAIL PROTECTED]
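P.S. To make the lio_listio() idea a bit more concrete, here is a rough
sketch of what the fetch-side read-ahead queueing might look like; the
queue depth, buffer size, and function name are made up for illustration,
not taken from the fileserver code:

/* sketch only: queue depth and buffer size are hypothetical */
#include <sys/types.h>
#include <aio.h>
#include <string.h>

#define RA_DEPTH   8            /* hypothetical read-ahead depth  */
#define RA_BUFSIZE (128 * 1024) /* hypothetical per-request size  */

static int
queue_readahead(int fd, off_t offset,
                struct aiocb cbs[RA_DEPTH],
                char bufs[RA_DEPTH][RA_BUFSIZE])
{
    struct aiocb *list[RA_DEPTH];
    int i;

    for (i = 0; i < RA_DEPTH; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = RA_BUFSIZE;
        cbs[i].aio_offset     = offset + (off_t)i * RA_BUFSIZE;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* LIO_NOWAIT: hand the whole batch to the kernel and return;
     * the caller overlaps rx sends with the disk reads and picks
     * up completions via aio_error()/aio_return() later. */
    return lio_listio(LIO_NOWAIT, list, RA_DEPTH, NULL);
}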
