On 9/20/05, Marcus Watts <[EMAIL PROTECTED]> wrote:
> > Received: by 10.70.78.1 with HTTP; Sat, 17 Sep 2005 16:11:38 -0700 (PDT)
> > Message-ID: <[EMAIL PROTECTED]>
> > From: Tom Keiser <[EMAIL PROTECTED]>
> > Reply-To: [EMAIL PROTECTED]
> > To: Marcus Watts <[EMAIL PROTECTED]>
> > Cc: Roland Kuhn <[EMAIL PROTECTED]>,
> >     chas williams - CONTRACTOR <[EMAIL PROTECTED]>,
> >     Harald Barth <[EMAIL PROTECTED]>, [email protected]
> > In-Reply-To: <[EMAIL PROTECTED]>
> > Mime-Version: 1.0
> > References: <[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
> > Subject: Re: [OpenAFS-devel] how does fileserver read from disk?
> > Sender: [EMAIL PROTECTED]
> > Errors-To: [EMAIL PROTECTED]
> > Date: Sat, 17 Sep 2005 19:11:38 -0400
> > Status: U
> >
> > On 9/17/05, Marcus Watts <[EMAIL PROTECTED]> wrote:
> > > Various wrote:
> > > > Hi Chas!
> > > >
> > > > On 16 Sep 2005, at 14:42, chas williams - CONTRACTOR wrote:
> > > >
> > > > > In message <[EMAIL PROTECTED]
> > > > > muenchen.de>, Roland Kuhn writes:
> > > > >
> > > > >> Why can't this be replaced by read(big segment)->buffer->sendmsg
> > > > >> (small segments). AFAIK readv() is implemented in terms of read()
> > > > >> in the kernel for almost all filesystems, so it should really only
> > > > >> have the effect of making the disk transfer more efficient. The
> > > > >> msg headers interspersed with the data have to come from userspace
> > > > >> in any case, right?
> > > > >
> > > > > no reason you couldnt do this i suppose. you would need twice the
> > > > > number of entries in the iovec though. you would need a special
> > > > > version of rx_AllocWritev() that only allocated packet headers and
> > > > > chops up a buffer you pass in.
> > > > >
> > > > > curious, i rewrote rx_FetchData() to read into a single buffer and
> > > > > then memcpy() into the already allocated rx packets.
> > > > > this had no impact on
> > > > > performance as far as i could tell (my typical test read was a 16k
> > > > > read split across 12/13 rx packets). the big problem with iovec is
> > > > > not iovec really but rather that you only get 1k for each rx packet
> > > > > you process. it is quite a bit of work to handle an rx packet.
> > > > > (although if your lower level disk driver didnt support
> > > > > scatter/gather you might see some benefit from this).
> > > >
> > > > I know already that 16k-reads are non-optimal ;-) What I meant was
> > > > doing chunksize (1MB in my case) reads. But what I gather from this
> > > > discussion is that this would really be some work as this read-ahead
> > > > would have to be managed across several rx jumbograms, wouldn't it?
> > > >
> > > > Ciao,
> > > > Roland
> > >
> > > I'm not surprised you're not seeing a difference changing things.
> > >
> > > There are several potential bottlenecks in this process:
> > >
> > > /1/ reading off of disk
> > > /2/ cpu handling - kernel & user mode
> > > /3/ network output
> > >
> > > All the work you're doing with iovecs and such is mainly manipulating
> > > the cpu handling. If you were doing very small disk reads (say, < 4096
> > > bytes), there's a win here. The difference between 16k and 1m is much
> > > less extreme.
> >
> > Syscall overhead on many of our supported platforms is still the
> > dominant player. For instance, statistical profiling on Solaris points
> > to about 75% of fileserver time being spent in syscalls. Mode switches
> > are very expensive operations. I'd have to say aggregating more data
> > into each syscall will have to be one of the main goals of any
> > fileserver improvement project.
>
> I don't think you have enough information to conclude that using
> aio will improve anything. Mode switches are expensive, but that's
> not necessarily the most expensive part of whatever's happening.
> Validating parameters, managing address space, handling device interrupts
> and doing kernel/user space memory copies are also expensive, and some of
> that is unavoidable. You also need to check your results on several
> different platforms.
>
I have data from several platforms. I'm using Solaris as a canonical
example. Obviously, mode switches are just part of the CPU time usage.
However, the other time sinks you mention are strongly correlated with the
number of IO-related syscalls being processed, so I'm not sure what your
point is.

Forget the CPU time for a minute. My arguments regarding aio and cpu
utilization are totally orthogonal. AIO is about improving disk utilization,
not about reducing CPU utilization. Not to mention, it's a fundamental
design principle that when it comes to disk i/o scheduling, wasting some CPU
time is ok, since the ratio of instructions retired per completed disk IO op
is so large.

Fundamentally, we're dealing with a disk throughput optimization problem.
Disk IO schedulers are designed to balance the tradeoff between the QoS of
each individual IO transaction and seek optimization over the pool of
outstanding transactions at any given point in time. Obviously, maintaining
a large queue of independent IO transactions at all times is essential to
making that tradeoff come out efficiently. The current fileserver
implementation cannot do this. If we can improve CPU utilization, that's
great. But the bigger problem right now is that we're not utilizing our
disks efficiently.
Since I'm not sure why you believe parallelizing i/o will not help, let me
concisely re-iterate why you are wrong:

1) the fileserver uses sync i/o over about a dozen worker threads (by
   default)
2) each rx worker thread can only have one outstanding i/o at a time
3) a single scsi disk can handle an order of magnitude more outstanding
   atomic IO ops at a time in its TCQ than the fileserver can presently
   provide
4) fileservers generally have more than one disk
5) despite what you say below, many platforms have very robust aio
   implementations that do not use pthreads, and in some cases do not use
   kernel threads either, and are instead purely event-driven
6) even for those platforms that are stuck in the early 90's using pthreads
   to emulate kernel aio, there is a distinct advantage: dedicating a thread
   pool to aio greatly reduces the strain on icache and stack-related dcache
   for these threads compared to turning them into full-blown rx worker
   threads (not to mention the rx lock contention benefits, the ability to
   have multiple outstanding IOs for a single rpc call, less register window
   thrashing on SPARC, etc...)
7) as I've pointed out before, the fileserver is in a much better position
   to do optimal read-aheads than an adaptive algorithm in the kernel i/o
   scheduler
8) fileserver-controlled read-aheads either need aio, or you have to
   implement your own equivalent of junky userspace aio with a thread pool
9) on robust platforms, aio along with a redesign of the rx server api would
   allow the number of threads to come close to the number of cpus, which
   would dramatically reduce context switch rates

I'm not a steadfast supporter of the posix async spec. But code reuse beats
writing our own pthreaded i/o pool. Plus, it would let us leverage the more
advanced kernel aio implementations available on some platforms.

In case you still don't believe me, here's a quick intuitionistic proof: if
you issue a sequence of IO transactions one at a time, execution time will
be dominated by seek time.
However, if you perform dependency analysis on the sequence, issue all of
the independent IO transactions in parallel, and continue to do so as
transactions are retired, the seek time per transaction drops, since seek
costs are amortized over all the transactions performed during each elevator
sweep.

The current fileserver design simply can't keep disks busy. Sure, they may
appear "busy" in tools such as iostat. But if you examine the output more
closely, you'll quickly realize that there aren't enough concurrent
transactions in flight at any one time to make the disks busy AND efficient.
Utilization can be high, but it is NOT efficient utilization.

> > > Regardless of whether you do small reads or big, the net result of all
> > > this is that the system somehow has to schedule disk reads on a regular
> > > basis, and can't do anything with the data until it has it.
> > > Once you are doing block aligned reads, there's little further win
> > > to doing larger reads. The system probably already has read-ahead
> > > logic engaged - it will read the next block into the buffer cache while
> > > your logic is crunching on the previous one. That's already giving you
> > > all the advantages "async I/O" would have given you.
> >
> > Read-ahead is hardly equivalent to true async i/o. The fileserver
> > performs synchronous reads. This means we can only parallelize disk
> > i/o transactions _across_ several RPC calls, and where the kernel
> > happens to guess correctly with a read-ahead. However, the kernel has
> > far less information at its disposal regarding future i/o patterns
> > than the fileserver itself. Thus, the read-ahead decisions it makes
> > are far from optimal. Async i/o gives the kernel i/o scheduler (and
> > SCSI TCQ scheduler) more atomic ops to deal with. This added level of
> > asynchrony can dramatically improve performance by allowing the
> > lower levels to elevator seek more optimally.
> > After all, I think the real goal here is to improve throughput, not
> > necessarily latency. Obviously, this would be a bigger win for
> > storedata_rxstyle, where there's less of a head-of-line blocking effect,
> > and thus order of iop completion would not negatively affect QoS.
> >
> > Let's face it, the stock configuration fileserver has around a dozen
> > worker threads. With synchronous i/o that's nowhere near enough
> > independent i/o ops to keep even one moderately fast SCSI disk's TCQ
> > filled to the point where it can do appropriate seek optimization.
>
> On some if not most systems, aio is done using pthreads.
> The syscall overhead you fear is still there, it's just hidden.

For many platforms, this hasn't been true in a long time. Yes, many
platforms fall back on a pthreads implementation when a specific filesystem
doesn't support kernel async io, or when the flags passed to open aren't
supported. But more fundamentally, who cares that it's backed by pthreads in
some cases? Your assertion that this leads to the same performance problems
is patently false. For the purposes of this argument, I don't care about CPU
utilization. Faster CPUs are cheap, whereas storage draws a lot of power and
costs a considerable amount.

As I've stated many times, disk i/o subsystems are designed to deal with
high degrees of asynchrony. They don't scale well with low levels of
concurrency because physics and disk geometry just don't make that feasible.
Thus, having a large pool of userspace threads performing blocking i/o on
behalf of a small collection of concurrently executing RPC calls beats the
heck out of the current i/o model. Sure, it will use the CPUs in a
suboptimal manner. But AFS is a filesystem -- the goal should be to increase
disk throughput, not to keep CPU utilization low on the fileserver.

> On linux, the documentation sucks, which is hardly confidence
> inspiring.

Well, Linux documentation is bad in general.
SGI donated a fairly robust aio implementation for 2.4 a long time ago. It
used a special syscall, and then had threads wait for i/o completion. The
2.6 aio implementation was developed as part of the LSE, and it is fully
event-driven for unbuffered i/o. With patches contributed by the IBM LTC,
that support is extended to buffered i/o as well.

> On solaris, the implementation of aio depends on which solaris rev
> and also on the filesystem type.

True, but what does this have to do with aio being useful? That just points
to a software engineering issue, not a fundamental problem.

> Aio doesn't handle file opens, creates or unlinks, which are liable
> to be particularly expensive operations in terms of filesystem overhead.
> Generally these operations involve scattered disk I/O, and most
> filesystems take extra pains for reliability at the expense of performance.

Indeed. But how does the premise that aio cannot solve every corner-case i/o
problem lead to the conclusion that aio is not useful? This thread is about
optimizing reading and writing of data files, not dealing with corner-case
metadata operations. Optimizing metadata ops is an entirely orthogonal
topic, and unfortunately pthreads is the best answer to that problem for the
moment.

> > > It's difficult to do much to improve the network overhead,
> > > without making incompatible changes to things. This is your
> > > most likely bottleneck though. If you try this with repeated reads
> > > from the same file (so it doesn't have to go out to disk), this
> > > will be the dominant factor.
> >
> > Yes, network syscalls are taking up over 50% of the fileserver kernel
> > time in such cases, but the readv()/writev() syscalls are taking up a
> > nontrivial amount of time too. The mode switch and associated icache
> > flush are big hits to performance. We need to aggregate as much data
> > as possible into each mode switch if we ever hope to substantially
> > improve throughput.
>
> Too bad there isn't any portable way to send or receive multiple messages
> at once.

I'm aware of at least one OS that's working on a new syscall to mitigate
this bottleneck.

> > > If you have a real load, with lots of people accessing parts of your
> > > filesystem, the next most serious bottleneck after network is
> > > your disk. If you have more than one person accessing your
> > > fileserver, seek time is probably a dominant factor.
> > > The disk driver is probably better at minimum response time
> > > rather than maximum throughput. So if somebody else requests
> > > 1k while you are in the midst of your 1mb transfer, chances are
> > > there will be a head seek to satisfy their request at the
> > > expense of yours. Also, there may not be much you can do from
> > > the application layer to alter this behavior.
> >
> > Which is why async read-ahead under the control of userspace would
> > dramatically improve QoS. Since we have a 1:1 call-to-thread mapping,
> > we need to do as much in the background as possible. The best way to
> > optimize this is to give the kernel sufficient information about
> > expected future i/o patterns. As it stands right now, the posix async
> > interfaces are one of the best ways to do that.
>
> The posix async interface is a library interface standard, not a kernel
> interface. I believe you're making far too many assumptions regarding
> its implementation.

Huh? What are you reading in that paragraph that conflates userspace posix
compliance libraries with kernel apis? What do the implementation details of
aio on arbitrary platform X have to do with my claim that in order for the
kernel to make optimal i/o scheduling decisions, it needs to have many more
in-flight IO transactions? For argument's sake, assume that platform X
emulates aio with pthreads. Even the most simplistic implementation gives
the kernel orders of magnitude more concurrent atomic IO transactions to
schedule.
The only difference between pthread emulation and a real kernel aio layer is
the amount of overhead. Even with such a sub-par aio implementation, we're
still able to give the kernel orders of magnitude more atomic IO ops to play
with. And we can do this all without scaling the number of rx threads, so we
simultaneously lift the limitation of one pending i/o per rpc call.
Obviously, this improves the kernel io scheduler's ability to optimize
seeks.

Furthermore, your argument regarding two independent IOs from userspace
reducing each other's QoS is totally mitigated by read-ahead. So long as you
have adequate buffering somewhere between the disk and the network
interface, this will have absolutely no effect on QoS, assuming the disk's
bandwidth can sustain both rpcs at wire speed. And, as I've mentioned
numerous times, this type of read-ahead is best handled by the fileserver
itself, since it knows the expected size of all running io transfers a
priori. As it stands right now, the only read-ahead we get happens past a
mode-switch boundary, and it is subject to predictive algorithms outside of
our control. That's far from optimal.

The operating systems I deal with on a daily basis have entire kernel
subsystems dedicated to aio, aio-specific system calls, and posix compliance
libraries wrapping the syscalls. The days of aio being a joke are over
(well, except for sockets... aio support for sockets is still a tad rough
even on the better commercial unices).

Any way you slice it, increasing i/o parallelism is the only way to make
disks busy AND efficient. In the worst-case aio implementation, you're
simply looking at a configuration where the number of threads is orders of
magnitude higher than the number of cpus. Sure, this is unwanted, but on
those sub-par platforms you're either going to increase parallelism this
way, or by increasing the number of rx worker threads.
And it's pretty obvious that a bunch of dedicated io worker threads is going
to be faster, for reasons I mentioned above. Not to mention, it's also a
more flexible i/o architecture.

-- 
Tom Keiser
[EMAIL PROTECTED]
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
