On 9/17/05, Marcus Watts <[EMAIL PROTECTED]> wrote:
> Various wrote:
> > Hi Chas!
> >
> > On 16 Sep 2005, at 14:42, chas williams - CONTRACTOR wrote:
> >
> > > In message <[EMAIL PROTECTED]
> > > muenchen.de>, Roland Kuhn writes:
> > >
> > >> Why can't this be replaced by read(big segment)->buffer->sendmsg
> > >> (small segments). AFAIK readv() is implemented in terms of read()
> > >> in the kernel for almost all filesystems, so it should really only
> > >> have the effect of making the disk transfer more efficient. The msg
> > >> headers interspersed with the data have to come from userspace in
> > >> any case, right?
> > >>
> > >
> > > no reason you couldn't do this i suppose.  you would need twice the
> > > number of entries in the iovec though.  you would need a special
> > > version of rx_AllocWritev() that only allocated packet headers and
> > > chops up a buffer you pass in.
> > >
> > > curious, i rewrote rx_FetchData() to read into a single buffer and
> > > then memcpy() into the already allocated rx packets.  this had no
> > > impact on performance as far as i could tell (my typical test read
> > > was a 16k read split across 12/13 rx packets).  the big problem with
> > > iovec is not iovec really, but rather that you only get 1k for each
> > > rx packet you process.  it's quite a bit of work to handle an rx
> > > packet.  (although if your lower level disk driver didn't support
> > > scatter/gather you might see some benefit from this.)
> >
> > I know already that 16k-reads are non-optimal ;-) What I meant was
> > doing chunksize (1MB in my case) reads. But what I gather from this
> > discussion is that this would really be some work, as this read-ahead
> > would have to be managed across several rx jumbograms, wouldn't it?
> >
> > Ciao,
> > Roland
>
> I'm not surprised you're not seeing a difference changing things.
>
> There are several potential bottlenecks in this process:
>
> /1/ reading off of disk
> /2/ cpu handling - kernel & user mode
> /3/ network output
>
> All the work you're doing with iovecs and such is mainly manipulating
> the cpu handling.  If you were doing very small disk reads (say, < 4096
> bytes), there's a win here.  The difference between 16k and 1m is much
> less extreme.
>
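For concreteness, the single-buffer read plus memcpy() experiment described
above would look something like the sketch below.  This is purely
illustrative C, not the actual rx_FetchData() change: demo_pkt, PKT_PAYLOAD,
and fill_packets() are made-up stand-ins for the real rx packet structures
and allocation routines.

/* Illustrative sketch only -- not rx code.  Assumes a 1k per-packet
 * payload area, which mirrors the "1k per rx packet" point above. */
#include <string.h>
#include <unistd.h>

#define PKT_PAYLOAD 1024                /* assumed per-packet data area */

struct demo_pkt {                       /* hypothetical rx packet stand-in */
    char   data[PKT_PAYLOAD];
    size_t len;
};

/* Do one large read() from 'fd' into a staging buffer, then scatter the
 * data into pre-allocated packets with memcpy().  Returns the number of
 * packets filled, or -1 on read error. */
ssize_t fill_packets(int fd, char *staging, size_t staging_len,
                     struct demo_pkt *pkts, size_t npkts)
{
    ssize_t nread = read(fd, staging, staging_len);   /* one big syscall */
    if (nread < 0)
        return -1;

    size_t off = 0, i = 0;
    while (off < (size_t)nread && i < npkts) {
        size_t chunk = (size_t)nread - off;
        if (chunk > PKT_PAYLOAD)
            chunk = PKT_PAYLOAD;
        memcpy(pkts[i].data, staging + off, chunk);   /* extra copy vs. readv() */
        pkts[i].len = chunk;
        off += chunk;
        i++;
    }
    return (ssize_t)i;
}

As the experiment showed, trading the readv() scatter for one big read()
plus a memcpy() pass doesn't change much on its own; the per-packet
processing and the syscall count are where the time goes.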
Syscall overhead on many of our supported platforms is still the dominant
player.  For instance, statistical profiling on Solaris points to about
75% of fileserver time being spent in syscalls.  Mode switches are very
expensive operations.  I'd have to say aggregating more data into each
syscall will have to be one of the main goals of any fileserver
improvement project.

> Regardless of whether you do small reads or big, the net result of all
> this is that the system somehow has to schedule disk reads on a regular
> basis, and can't do anything with the data until it has it.
> Once you are doing block aligned reads, there's little further win
> to doing larger reads.  The system probably already has read-ahead
> logic engaged - it will read the next block into the buffer cache while
> your logic is crunching on the previous one.  That's already giving you
> all the advantages "async I/O" would have given you.
>

Read-ahead is hardly equivalent to true async i/o.  The fileserver
performs synchronous reads.  This means we can only parallelize disk i/o
transactions _across_ several RPC calls, and then only where the kernel
happens to guess correctly with a read-ahead.  However, the kernel has far
less information at its disposal regarding future i/o patterns than does
the fileserver itself.  Thus, the read-ahead decisions it makes are far
from optimal.  Async i/o gives the kernel i/o scheduler (and the SCSI TCQ
scheduler) more atomic ops to deal with.  This added level of asynchrony
can dramatically improve performance by allowing the lower levels to do
elevator seeking more optimally.  After all, I think the real goal here is
to improve throughput, not necessarily latency.  Obviously, this would be
a bigger win for storedata_rxstyle, where there's less of a head-of-line
blocking effect, and thus the order of iop completion would not negatively
affect QoS.  Let's face it, the stock configuration fileserver has around
a dozen worker threads.  With synchronous i/o, that's nowhere near enough
independent i/o ops to keep even one moderately fast SCSI disk's TCQ
filled to the point where it can do appropriate seek optimization.

> It's difficult to do much to improve the network overhead,
> without making incompatible changes to things.  This is your
> most likely bottleneck though.  If you try this with repeated reads
> from the same file (so it doesn't have to go out to disk), this
> will be the dominant factor.
>

Yes, network syscalls are taking up over 50% of the fileserver kernel time
in such cases, but the readv()/writev() syscalls are taking up a
nontrivial amount of time too.  The mode switch and associated icache
flush are big hits to performance.  We need to aggregate as much data as
possible into each mode switch if we ever hope to substantially improve
throughput.

> If you have a real load, with lots of people accessing parts of your
> filesystem, the next most serious bottleneck after network is
> your disk.  If you have more than one person accessing your
> fileserver, seek time is probably a dominant factor.
> The disk driver is probably better at minimum response time
> rather than maximum throughput.  So if somebody else requests
> 1k while you are in the midst of your 1mb transfer, chances are
> there will be a head seek to satisfy their request at the
> expense of yours.  Also, there may not be much you can do from
> the application layer to alter this behavior.
>

Which is why async read-ahead under the control of userspace would
dramatically improve QoS.
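To make that concrete, here is a rough sketch of what handing the kernel a
batch of block-aligned reads through the POSIX AIO interfaces looks like.
This is illustrative only, not fileserver code: BLOCK, NREQ, and
queue_reads() are arbitrary stand-ins, and a real implementation would use
LIO_NOWAIT with completion notification rather than blocking in LIO_WAIT.

/* Sketch: submit NREQ reads in one lio_listio() call so the i/o
 * scheduler and the drive's TCQ see them all at once.
 * On Linux/glibc this links with -lrt. */
#include <sys/types.h>
#include <aio.h>
#include <string.h>

#define BLOCK 65536     /* illustrative block-aligned request size */
#define NREQ  8         /* illustrative queue depth */

int queue_reads(int fd, off_t start)
{
    static char   bufs[NREQ][BLOCK];
    struct aiocb  cbs[NREQ];
    struct aiocb *list[NREQ];
    int i;

    memset(cbs, 0, sizeof(cbs));
    for (i = 0; i < NREQ; i++) {
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = bufs[i];
        cbs[i].aio_nbytes     = BLOCK;
        cbs[i].aio_offset     = start + (off_t)i * BLOCK;
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* All NREQ requests are handed to the kernel in a single call.
     * LIO_WAIT blocks until every request finishes, but the lower
     * layers are free to complete them in whatever order minimizes
     * seeks. */
    if (lio_listio(LIO_WAIT, list, NREQ, NULL) != 0)
        return -1;

    for (i = 0; i < NREQ; i++)
        if (aio_error(&cbs[i]) != 0)
            return -1;

    return 0;
}

The point is simply that all NREQ requests are visible to the elevator and
the TCQ at once, instead of the one-outstanding-read-per-worker-thread
pattern we have today.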
Since we have a 1:1 call-to-thread mapping, we need to do as much in the
background as possible.  The best way to optimize this is to give the
kernel sufficient information about expected future i/o patterns.  As it
stands right now, the POSIX async i/o interfaces are one of the best ways
to do that.

Regards,

--
Tom Keiser
[EMAIL PROTECTED]
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
