Hi all,

I've just updated a patch that implements a new thread pool based on a wait-free queue provided by the userspace-rcu library. The patch also includes an auto-scaling mechanism that keeps only as many threads running as the current workload needs.
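To give a rough idea of the kind of queue involved, here is a minimal sketch written with plain C11 atomics. It is not the userspace-rcu API and it is not code from the patch; every name in it (struct job, struct job_queue, job_enqueue, job_dequeue) is made up for illustration. It only shows the general shape: producers add a job with a single atomic exchange, and consumers serialize among themselves with a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types: not the userspace-rcu API, not the patch's code. */
struct job {
    _Atomic(struct job *) next;
    int id;
};

struct job_queue {
    struct job head;               /* dummy node; head.next is the oldest job */
    _Atomic(struct job *) tail;    /* most recently enqueued node */
    pthread_mutex_t dequeue_lock;  /* taken by consumers only */
};

void job_queue_init(struct job_queue *q)
{
    atomic_init(&q->head.next, NULL);
    atomic_init(&q->tail, &q->head);
    pthread_mutex_init(&q->dequeue_lock, NULL);
}

/* Producer side: one atomic exchange and one store, no lock, no retry loop. */
void job_enqueue(struct job_queue *q, struct job *j)
{
    assert(j != NULL);
    atomic_store(&j->next, NULL);
    struct job *prev = atomic_exchange(&q->tail, j);
    atomic_store(&prev->next, j);  /* link the previous tail to the new job */
}

/* Consumer side: returns the oldest job, or NULL if the queue looks empty. */
struct job *job_dequeue(struct job_queue *q)
{
    pthread_mutex_lock(&q->dequeue_lock);
    struct job *j = atomic_load(&q->head.next);
    if (j != NULL) {
        struct job *next = atomic_load(&j->next);
        if (next == NULL) {
            /* j looks like the last job: try to move tail back to the dummy. */
            atomic_store(&q->head.next, NULL);
            struct job *expected = j;
            if (!atomic_compare_exchange_strong(&q->tail, &expected, &q->head)) {
                /* A producer already took j as its predecessor but has not
                 * linked its job yet; wait for that link to appear. */
                while ((next = atomic_load(&j->next)) == NULL)
                    ;
                atomic_store(&q->head.next, next);
            }
        } else {
            atomic_store(&q->head.next, next);
        }
    }
    pthread_mutex_unlock(&q->dequeue_lock);
    return j;
}
```

In this sketch a worker thread would simply call job_dequeue() in a loop and go to sleep when it returns NULL. How idle workers are parked and woken, which is where the scaling logic lives, is left out here.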
This new approach has some advantages:

- It's provided globally inside libglusterfs instead of inside an xlator

  This makes it possible for the fuse thread and the epoll threads to hand off a received request to another thread sooner, wasting less CPU and reacting sooner to other incoming requests.

- Adding jobs to the queue used by the thread pool only requires an atomic operation

  This makes the producer side of the queue really fast, with almost no delay.

- Contention is reduced

  The producer side has negligible contention thanks to the wait-free enqueue operation based on an atomic access. The consumer side requires a mutex, but it is held for a very short time, and the scaling mechanism makes sure that there are no more threads than needed contending for the mutex.

This change disables io-threads, since it replaces part of its functionality. However, there are two things that could still be needed from io-threads:

- Prioritization of fops

  Currently, io-threads assigns a priority to each fop, so that some fops are handled before others.

- Fair distribution of execution slots between clients

  Currently, io-threads processes requests from each client in round-robin order.

These features are not implemented right now. If they are needed, probably the best thing to do would be to keep them inside io-threads, but change its implementation so that it uses the global threads from the thread pool instead of its own threads.

If this change proves to perform better and is merged, I have some more ideas to improve other areas of gluster:

- Integrate synctask threads into the new thread pool

  I think there is some contention in these threads, because during some tests I've seen them consuming most of the CPU. They probably suffer from the same problem as io-threads, so replacing them could improve things.

- Integrate timers into the new thread pool

  My idea is to create a per-thread timer, where code executed in one thread creates timer events in that same thread. This makes it possible to use structures that can be modified without any mutex. Since the thread pool is basically executing computing tasks, which are fast, I think it's feasible to implement a timer in the main loop of each worker thread with a resolution of a few milliseconds, which I think is good enough for gluster's needs.

- Integrate with the userspace-rcu library in QSBR mode

  This will make it possible to use some RCU-based structures for anything gluster uses (inodes, fd's, ...). These structures have very fast read operations, which should reduce contention and improve performance in many places.

- Integrate I/O threads into the thread pool and reduce context switches

  The idea here is a bit more complex. Basically, I would like to have a function that does an I/O operation on some device (for example reading fuse requests or waiting for epoll events). We could send a request to the thread pool to execute that function, so it would be executed inside one of the worker threads. When the I/O terminates (i.e. it has received a request), a call to the same function is added to the thread pool, so that another thread can continue waiting for requests, while the current thread starts processing the received request without a context switch.

Note that with all these changes, all the dedicated threads that we currently have in gluster could be replaced by the features provided by this new thread pool, so these would be the only threads present in gluster. This is especially important when brick-multiplex is used.

I've done some simple tests using a replica 3 volume and a disperse 4+2 volume. These tests were executed on a single machine using an HDD for each brick (not the best scenario, but it should be fine for comparison). The machine is quite powerful (dual Intel Xeon Silver 4114 @ 2.2 GHz, with 128 GiB RAM).

These tests have shown that the limiting factor has been the disk in most cases, so it's hard to tell if the change has really improved things.
There is only one clear exception: self-heal on a dispersed volume completes 12.7% faster. The utilization of CPU has also dropped drastically:

    Old implementation: 12.30 user, 41.78 sys, 43.16 idle, 0.73 wait
    New implementation:  4.91 user,  5.52 sys, 81.60 idle, 5.91 wait

Now I'm running some more tests on NVMe to try to see the effects of the change when the disk is not limiting performance. I'll post an update once I have more data.

Xavi

https://review.gluster.org/c/glusterfs/+/20636
_______________________________________________
Gluster-devel mailing list
https://lists.gluster.org/mailman/listinfo/gluster-devel