On 12 May 2011, at 12:48, Anton Lundin wrote:

> The smp-scaling in the fileserver is really bad. Have anyone done any
> profiling on what is causing this? Is any work getting done on this?
In general, recent work on the fileserver has been focussing on correctness rather than on performance. We do have a number of results that point at poor SMP scaling of both the 1.4.x and (sadly) the 1.6.x fileservers. In particular, many workloads seem to benefit from having a lower number of threads than the maximum permitted, which is obviously not ideal.

As Derrick noted, the first thing would be to try the 1.6.0 prerelease fileserver. There are substantial changes in various parts of the fileserver in 1.6.x, even if you don't end up running demand attach. As far as I'm aware, little benchmarking of these changes has been performed, so it would be very interesting to see how both the demand attach and normal fileservers perform in your tests.

What has received substantial performance attention in the 1.6.x series is the RX transport protocol. We know that the RX that will ship in 1.6.x is substantially faster than that in 1.4.x. If you are on an i?86 platform, some of these performance improvements will only be apparent if you build for an i586 (or i686) architecture.

There are also a couple of RX "features" that will cause single-user workloads to scale particularly badly.

Firstly, hot threads. In a typical dispatcher architecture, one thread reads from the network and hands incoming packets off to worker threads to handle. This obviously entails a context switch, and the data has to be passed between threads. To avoid this, RX has "hot threads": the thread which receives an incoming packet is the one which handles it, and the next free thread then starts listening on the network. So, the thread handling a given packet is constantly switching. Where there is a substantial amount of context associated with a packet (connection data, volume data, inode data, etc.), and the two threads involved are scheduled on different cores, a lot of data is constantly being swapped around. You might find, therefore, that disabling hot threads actually improves your performance. (The first sketch at the end of this message illustrates the hand-off.)

Secondly, the way we round-robin threads. In effect, we use an LRU queue to schedule idle threads. If we have five threads A, B, C, D and E, then packet 1 will be handled by A, whilst B becomes the listener; packet 2 goes to B, and C starts listening; packet 3 to C, packet 4 to D, packet 5 to E, and packet 6 back to A again. On a machine with 128 threads and only 4 cores, there's a lot of churn here. Pulling the last entry, rather than the first entry, from the idle queue would solve this problem (the second sketch below shows the difference). I have a patch for this change, but don't currently have access to any machines to test it on.

It's worth noting that both of these are likely to be particular issues in the single-user case. On busier fileservers (where the number of connections is more than the number of cores) there will inevitably be churn anyway, so I suspect that the performance degradation as cores come online will be much less marked.

Cheers,

Simon.
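P.S. To make the two behaviours above concrete, here are a couple of toy sketches. Neither is the actual RX code, and every name in them (read_packet, service_packet, listener_needed and so on) is invented for illustration. The first models the hot threads hand-off with pthreads: the thread that reads a packet keeps it and services it, waking another idle thread to take over the listener role. Build with "cc -pthread".

/*
 * Toy model of "hot threads": the receiving thread services the
 * packet itself and promotes an idle thread to be the new listener.
 * Not the RX implementation; all names are invented.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle = PTHREAD_COND_INITIALIZER;
static int listener_needed = 0;   /* 1 => some idle thread should listen */

/* Stand-in for blocking on the RX socket. */
static int read_packet(void) { usleep(1000); return 42; }

/* Stand-in for the expensive per-call work (connection, volume and
 * inode context, and so on). */
static void service_packet(int pkt) { (void)pkt; usleep(5000); }

static void *worker(void *arg)
{
    long id = (long)arg;

    for (;;) {
        /* Sit in the idle pool until asked to become the listener. */
        pthread_mutex_lock(&lock);
        while (!listener_needed)
            pthread_cond_wait(&idle, &lock);
        listener_needed = 0;
        pthread_mutex_unlock(&lock);

        /* We are now the listener: block on the network. */
        int pkt = read_packet();

        /* Hot threads: hand over the *listener role*, not the packet,
         * then service the packet ourselves.  No context switch on the
         * data path, but the servicing thread changes on every packet. */
        pthread_mutex_lock(&lock);
        listener_needed = 1;
        pthread_cond_signal(&idle);
        pthread_mutex_unlock(&lock);

        printf("thread %ld servicing a packet\n", id);
        service_packet(pkt);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    long i;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    /* Appoint the first listener, let the demo run briefly, then exit. */
    pthread_mutex_lock(&lock);
    listener_needed = 1;
    pthread_cond_signal(&idle);
    pthread_mutex_unlock(&lock);

    sleep(1);
    return 0;
}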
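The second sketch contrasts the two idle-queue policies. Popping the head of the queue (the current LRU behaviour) spreads consecutive packets across every thread; popping the tail (what my patch does) keeps re-using the thread that has just gone idle, which is the one most likely to still be warm in cache. It assumes the light, single-user case where each packet is serviced before the next one arrives.

/*
 * Toy simulation of dispatching packets to idle threads, comparing a
 * FIFO idle queue ("pull the first entry") with a LIFO one ("pull the
 * last entry").  Invented code, not the OpenAFS queue implementation.
 */
#include <stdio.h>

#define NTHREADS 5   /* threads A..E */
#define NPACKETS 8

static int dq[NTHREADS];
static int dq_len;

static void push_tail(int t)
{
    dq[dq_len++] = t;
}

static int pop_head(void)
{
    int t = dq[0];
    int i;

    for (i = 1; i < dq_len; i++)
        dq[i - 1] = dq[i];
    dq_len--;
    return t;
}

static int pop_tail(void)
{
    return dq[--dq_len];
}

static void simulate(const char *label, int lifo)
{
    int i;

    dq_len = 0;
    for (i = 0; i < NTHREADS; i++)
        push_tail(i);                            /* all threads start idle */

    printf("%s:", label);
    for (i = 0; i < NPACKETS; i++) {
        int t = lifo ? pop_tail() : pop_head();  /* pick an idle thread */
        printf(" %c", 'A' + t);                  /* it services the packet */
        push_tail(t);                            /* then goes idle again */
    }
    printf("\n");
}

int main(void)
{
    simulate("FIFO (current)", 0);   /* prints: A B C D E A B C */
    simulate("LIFO (patched)", 1);   /* prints: E E E E E E E E */
    return 0;
}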
