On Fri, Mar 8, 2019 at 11:39 PM Ciprian Dorin Craciun <[email protected]> wrote:
> On Fri, Mar 8, 2019 at 11:11 PM Jeffrey Altman <[email protected]> wrote:
> > The performance issues could be anywhere and everywhere between the
> > application being used for testing and the disk backing the vice partition.
OK, so first of all I want to thank Jeffrey for the support via IRC, as we've solved the issue. Basically it boils down to:

* lower the number of `fileserver` threads to a proper value based on the available CPUs / cores; (in my case `-p 4` or `-p 8`;)
* properly configure jumbo frames on the network cards: `ip link set dev eth0 mtu 9000`; (this configuration has to be made in the "proper" place, else it will be lost after restart;)
* (after changing the MTU, restart both the server and the clients;)
* disable encryption: `fs setcrypt -crypt off`; (in the end, based on what I understood, it's not too strong anyway, and given that I'll use it mostly on a LAN this is not an issue; moreover, for WAN I don't need to saturate a gigabit network;)
* (after changing it, re-authenticate, i.e. `unlog && klog`;)

In order to check the correct configuration one has to:

* run `cmdebug -server 192.168.0.2 -addrs` (on the client) to see if the MTU is correctly picked up; (else restart the cache manager;)
* run `rxdebug -server 192.168.0.1 -peer -long` (on the server) to see if the `ifMTU / natMTU / maxMTU` for the client connection have proper values; (in my case they were `8524 / 7108 / 7108`;)
* use `top -H` and check that neither the kernel thread `afs_rxlistener` (on the client) nor any of the `fileserver` threads (on the server) is maxed out (i.e. > ~90%); if one is, that is the bottleneck (after encryption is disabled and jumbo frames are enabled);

A note about the benchmark: in order to saturate the link I've tested only with large files (i.e. ~20 MiB each), else I'd end up "thrashing" the disk, and that would become the bottleneck.

BTW, I've taken the liberty of copy-pasting the log from the IRC channel (I've kept only the relevant lines, and also grouped / reordered some of them), because it is very insightful with regard to OpenAFS performance tuning.

So once more, thanks Jeffrey for the help,
Ciprian.
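To pull the steps above together, here is a rough shell sketch. The interface name (`eth0`), the addresses (`192.168.0.1` / `192.168.0.2`), and the 9000-byte threshold are just the values from my setup, not defaults — substitute your own; and remember the `ip link` change is not persistent across reboots.

```shell
#!/bin/sh
# Sketch of the tuning / verification steps above. Nothing runs until
# you call afs_tune / afs_verify; run the tuning steps as root on the
# relevant hosts. eth0 and 192.168.0.x are examples from my setup.

# Helper: sanity-check that a reported MTU is jumbo-sized (>= 9000),
# which is what the fileserver / cache manager should end up using.
mtu_is_jumbo() {
    [ "${1:-0}" -ge 9000 ]
}

afs_tune() {
    # Jumbo frames on both hosts (also set this in your distro's
    # network configuration, else it is lost after restart).
    ip link set dev eth0 mtu 9000

    # Disable wire encryption on the client (trusted LAN only),
    # then re-authenticate so the change takes effect.
    fs setcrypt -crypt off
    unlog && klog
}

afs_verify() {
    # On the client: did the cache manager pick up the new MTU?
    cmdebug -server 192.168.0.2 -addrs

    # On the server: check ifMTU / natMTU / maxMTU for the client peer.
    rxdebug -server 192.168.0.1 -peer -long

    # Per-thread CPU: afs_rxlistener (client) and the fileserver
    # threads (server) should stay well below ~90%.
    top -H -b -n 1 | grep -E 'afs_rxlistener|fileserver' || true
}
```

The fileserver thread count itself (`-p 4`) still has to be set on the `fileserver` command line (e.g. via `bos` / `BosConfig`), followed by a fileserver restart.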
~~~~
23:43 < auristor> first question, when you are writing to the fileserver, does "top -H" show a fileserver thread at or near 100% cpu?
23:45 < auristor> -H will break them out by process thread instead of providing one value for the fileserver as a whole
23:46 < auristor> I ask because one thread is the RX listener thread and that thread is the data pump. If that thread reaches 100% then you are out of capacity to receive and transmit packets
00:00 < auristor> Since you have a single client and 8 processor threads on the fileserver, I would recommend lowering the -p configuration value to reduce lock contention.
23:55 < auristor> there are two major bottlenecks in OpenAFS. First, the rx listener thread, which does all of the work associated with packet allocation, population, transmission, retransmission, and freeing on the sender, and packet allocation, population, application queuing, acknowledging, and freeing on the receiver.
23:56 < auristor> In OpenAFS this process is not as efficient as it could be and its architecture limits it to using a single processor thread, which means that its ability to scale correlates to the processor clock speed
23:58 < auristor> Second, there are many global locks in play. On the fileserver, there is one global lock for each fileserver subsystem required to process an RPC. For directories there are 8 global locks that must be acquired, and 7 for non-directories.
23:59 < auristor> These global locks in the fileserver result in serialization of calls received in parallel.
00:00 < ciprian_craciun> (Even if they are for different directories / files?)
00:00 < ciprian_craciun> (I.e. is there some sort of actual "global lock" that basically serializes all requests from all clients?)
00:01 < auristor> The global locks I mentioned do serialize the startup and shutdown of calls even when the calls touch different objects.
00:02 < auristor> Note that an afs family fileserver is really an object store. unlike a nfs or cifs fileserver, an afs fileserver does not perform path evaluation. path evaluation to object id is performed by the cache managers.
00:04 < auristor> The Linux cache manager also has a single global lock that protects all other locks and data structures. This lock is dropped frequently to permit parallel processing, but it does severely limit the amount of parallel execution
00:09 < ciprian_craciun> Trying now with `-p 4` seems to yield ~35 MiB/s of `cat` throughput.
00:11 < auristor> that would imply that the fileserver is not releasing worker threads from the call channel fast enough to permit the thread to be available for the next incoming call from the client.
00:12 < auristor> are your tests using authentication?
00:14 < auristor> So the fcrypt encrypt and decrypt is probably the culprit
00:15 < auristor> fcrypt is weaker than des and very inefficient.
00:16 < auristor> The delays the encryption introduces in the sender can lead to network stalls
00:24 < auristor> "aklog -force" is equivalent to that
00:24 < ciprian_craciun> And yes, now it seems I reach ~100 MiB/s.
00:25 < ciprian_craciun> Now the RX listener is < 50%.
00:25 < auristor> and what is the afs_rxlistener thread utilization and the fileserver threads usage?
00:26 < auristor> I suspect you have now moved the bottleneck from the client's rx listener thread to the fileserver
00:27 < ciprian_craciun> The fileserver threads are < ~15%
00:29 < auristor> no jumbo grams
00:32 < auristor> cmdebug client -addr
00:45 < auristor> establish a call from the client to the fileserver and what does "rxdebug <fileserver> 7000 -peer -long" report for the client?
00:49 < auristor> remove the -rxmaxmtu from the fileserver config
~~~~
_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
