On 6/16/05, Roland Kuhn <[EMAIL PROTECTED]> wrote:
> Dear experts!
>
> We are fighting with the fileserver performance since a long time.
> Once I got the advice to use the single threaded fileserver, which
> helped, but didn't get me more than 10MB/s. Now we upgraded to Debian
> sarge (openafs 1.3.81), which comes again with the threaded server.
> With default settings we get 1MB/s (the underlying RAID can easily
> deliver >200MB/s, which shows that the VM settings are okay). Now I
> tried with -L -vc 10000 -cb 100000 -udpsize 12800, which brings it
> back to about 6MB/s (all numbers with >>1 simultanous clients
> reading). This is still a factor 30 below the capabilities of the
> RAID (okay, we only have 1GB/s ethernet ;-) ). I've seen excessive
> context switch rates (>>100000/s), which obviously don't happen with
> the single threaded fileserver.
>
> So, can anybody comment on these numbers? Those are dual Opteron
> boxes with enough RAM, so please make some suggestions what options I
> should try to get more like the real performance of a fileserver...
This is very interesting. On much, much older hardware (2x 300MHz Sun E450 running Solaris 10) I can get >15MB/s aggregate off a single FC-AL disk with >>1 clients over gigE with absolutely no tweaking of fileserver parameters. Of course, there are many performance bottlenecks in multithreading that are actually exacerbated by faster CPUs, so the results you're seeing are plausible.

I'd be interested in seeing a comparison of 1.3.81 and 1.3.84 performance. Several threading patches were integrated between those revisions, and it would be interesting to see how they affect your problem. I know they are making a difference on sparc, but that doesn't necessarily correlate to amd64.

If you upgrade to 1.3.84, there is another fileserver option you will want to experiment with: -rxpck. Sometime after 1.3.81, thread-local packet queues were integrated, and they may reduce your context switch rate by reducing contention for the global packet queue lock. The default value for -rxpck gives you approximately 500 rx_packet structures; I recommend trying several values in the range 1000-5000. At some point you will reach an optimal tradeoff between a small value that fits within your cache hierarchy and a large value that reduces the number of transfers between the thread-local and global rx_packet queues (a toy sketch of this pattern follows below). Before submitting the thread-local patch to RT, I was only able to test on a few architectures, and I'd like to get feedback for amd64.

Another option you might care to experiment with is -p. IIRC, the default gives you 12 worker threads. It sounds like many of your worker threads are busy handling calls, but are constantly contending over locks and blocking on i/o. You will need to experiment with this, but you may find that reducing the number of worker threads actually improves performance by forcing new calls to queue up, thereby allowing your active calls to complete with less contention. Of course, this won't alleviate the problems caused by blocking i/o, and reducing this value too far is dangerous because some calls have high latencies (e.g. some calls make calls to the ptserver).

Have you looked at the xstat results from your servers? afsmonitor is a great little tool, and it can even dump these results periodically to a log. This data could help us understand your workload, and seeing those numbers would also help us suggest changes to parameters in the volume package.
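To put the -rxpck tradeoff in more concrete terms, here is a toy sketch of the thread-local free list pattern. This is not the actual rx code; the struct names, the batch size, and the pool size are all made up for illustration. The point is simply that a bigger pool lets each worker cache more packets locally and make fewer trips through the global lock, while the downside is a working set that may no longer fit in cache:

/*
 * Toy model of the thread-local packet queue idea behind -rxpck.
 * NOT the rx implementation -- names, batch size, and pool size
 * below are invented for illustration only.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define REFILL_BATCH 16     /* packets moved per global-lock acquisition */

struct pkt {
    struct pkt *next;
    char payload[1412];     /* roughly one ethernet MTU, like an rx cbuf */
};

struct pkt_pool {
    pthread_mutex_t lock;   /* stand-in for the contended global lock */
    struct pkt *free_list;
    unsigned long lock_acquisitions;
};

static struct pkt_pool global_pool = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

/* Per-thread cache: allocations from here take no lock at all. */
static __thread struct pkt *local_free = NULL;

static struct pkt *
pkt_alloc(void)
{
    struct pkt *p;
    int i;

    if (local_free == NULL) {
        /* Local cache is empty: take the global lock once, move a batch. */
        pthread_mutex_lock(&global_pool.lock);
        global_pool.lock_acquisitions++;
        for (i = 0; i < REFILL_BATCH && global_pool.free_list != NULL; i++) {
            p = global_pool.free_list;
            global_pool.free_list = p->next;
            p->next = local_free;
            local_free = p;
        }
        pthread_mutex_unlock(&global_pool.lock);
    }
    if ((p = local_free) != NULL)
        local_free = p->next;
    return p;               /* NULL means the whole pool is exhausted */
}

int
main(void)
{
    int i, got = 0;
    int npackets = 5000;    /* think: the value you pass to -rxpck */

    for (i = 0; i < npackets; i++) {
        struct pkt *p = malloc(sizeof(*p));
        p->next = global_pool.free_list;
        global_pool.free_list = p;
    }
    while (pkt_alloc() != NULL)
        got++;
    printf("%d packets handed out, %lu trips through the global lock\n",
           got, global_pool.lock_acquisitions);
    return 0;
}

Compile with -pthread. On these made-up numbers, draining 5000 packets in batches of 16 costs a few hundred global lock round trips instead of 5000, which is the kind of contention reduction the thread-local queues are after.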
And now I'll digress and talk about the more fundamental issues. After spending a lot of time with dtrace, here's my list of 5 major bottlenecks in rx and the fileserver:

1) 1:1 mapping of calls to threads

The 1:1 mapping of calls to threads was a fine model for LWP, but with kernel threads we're doing a lot of unnecessary context switches. Fixing this will involve a major rewrite of rx: basically, we have to move call state off the stack and implement an asynchronous, event-driven model for assigning calls to threads as they become ready to proceed. This would also require switching to asynchronous i/o.

2) single-threaded on udp receive

At the moment, one thread at a time calls recvmsg(), so we're essentially limited to one CPU on the receive side. After recvmsg() returns, this thread parses the packet header, possibly calls into the security object associated with the conn, and then routes the packet through the rx call mux. If it's a new call and hot threads is enabled, we need to signal a waiting worker thread so it can become the new listener, and we then begin processing the new call. Otherwise, we drop the packet into the appropriate call struct's receive queue, signal the waiting worker, and go back to blocking on recvmsg().

Depending on your fileserver's workload, hot threads can be a good thing or a bad thing. They reduce the latency of new call creation, but the tradeoff is increased latency between recvmsg() syscalls following a new call. Whether this is an improvement for you is very dependent on your workload: if you want new call latency to be low, hot threads should be on, but if you want to maximize server throughput, I would turn hot threads off. On a related note, we incur a lot of extra mode switches to handle ACKs. I'm working on a patch to scale the number of concurrent listeners up to the number of CPUs; so far it just involves adding a few mutex enter/exits and changing the serverproc logic a little bit.

3) can only receive one datagram per syscall

Well, I have to blame the standards bodies here. For POSIX asynch i/o, nobody bothered to put a sockaddr field in the aiocb struct. Oh well. There is one possible way to handle this right now: I don't know if others would approve, but I've been looking into the possibility of using libafs to handle the fileserver's RX endpoint in the kernel, and only returning to userspace when we have a call that's ready to proceed. Of course, this would only work on platforms where we can build libafs (not sure about Linux 2.6), but that's quite a few.

4) blocking i/o

Ten years ago, asynch i/o was new and not exactly ready for prime time. Well, times have changed. I think moving to an event-driven, asynchronous i/o model will allow us to keep the number of threads close to the number of CPUs, which should drastically reduce context switches.

5) storedata_rxstyle() / fetchdata_rxstyle()

There is a lot of room for improvement here. As others have pointed out, we make way too many readv/writev syscalls per MB of data. Part of the problem is that readv/writev only take 16 iovecs, and each cbuffer is a little less than the size of an ethernet mtu, so we're moving very little data per mode switch (a back-of-the-envelope sketch follows at the end of this mail). I'd like to hack together a way to store multiple cbuffers contiguously in an expanded rx_packet struct, which would let us move a lot more data in 16 iovecs, but the security trailer is in the way. I've only been able to come up with two ways of handling this while preserving the current zero-copy behavior. The first is to separate the trailer and make the payload contiguous when copying into the process's virtual address space during a syscall. The second is to avoid the mode switches altogether by adding two new syscalls to the afs_syscall mux that are basically rx equivalents of the sendfile() and recvfile() syscalls. Of course, both of these methods assume libafs. I guess if we wanted to be as crazy as the nfs guys, we could just port the whole fileserver over to the osi api and dump it into the kernel...

While I'm talking about storedata, copyonwrite() is going to become a headache as largefile becomes more heavily used. I guess fixing that will involve a new vice format :(

All of this is rather theoretical, and I only have a little bit of code thus far. I hope to find the time to make some of these changes happen, but these are ambitious suggestions, and I can't promise anything unless other people with lots of time volunteer... ;)

Comments? Criticisms? Volunteers?
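P.S. -- here is the back-of-the-envelope sketch I promised for point 5. The 16-iovec limit comes straight from the discussion above; the 1412-byte cbuf size and the "4 contiguous cbufs" figure are just my assumptions, so treat the output as an order-of-magnitude argument rather than a measurement:

/*
 * Rough arithmetic for point 5: how many writev() calls does it take
 * to push 1MB when each iovec covers one MTU-sized cbuf and writev()
 * only gets 16 iovecs at a time?  The 1412-byte cbuf size is an
 * assumption, not a number pulled from the tree.
 */
#include <stdio.h>

int
main(void)
{
    long total = 1L << 20;      /* 1MB of file data                        */
    int iovmax = 16;            /* iovecs handed to writev() today         */
    int cbuf = 1412;            /* assumed payload per cbuf/iovec          */
    int contig = 4;             /* hypothetical: cbufs stored contiguously */
    long per_call, per_call_new;

    per_call = (long)iovmax * cbuf;
    per_call_new = (long)iovmax * cbuf * contig;

    printf("today:      %ld bytes/writev -> %ld syscalls per MB\n",
           per_call, (total + per_call - 1) / per_call);
    printf("contiguous: %ld bytes/writev -> %ld syscalls per MB\n",
           per_call_new, (total + per_call_new - 1) / per_call_new);
    return 0;
}

On those assumptions it works out to roughly 47 writev() calls per MB today versus about 12 if four cbufs were stored contiguously, which is why I think the expanded rx_packet layout is worth the trouble.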
Regards,

--
Tom Keiser
[EMAIL PROTECTED]

_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
