Hi all,

I've now committed the current state of my work to master. The 'fq' policy should still be considered experimental and hence isn't active by default. The rest of this mail is basically some notes put together from my previous two or three mails, so that they are all in one place. If there's something missing, let me know.
Just to emphasize this: the noop policy, which remains the default, should have no visible effect on performance compared to before these commits. If you feel more adventurous, you can enable the *EXPERIMENTAL* fq (fair queuing) policy as outlined below. It should give you major improvements for concurrent I/O, especially in read latency. Let me know how it works out for you, and send any suggestions on how to improve it. Over the next few weeks I'll have somewhat more limited time again due to exams coming up, but I'll try to be responsive.

The fq policy has a bunch of XXXs in the code marking things that could improve performance and fairness; I hope to address them over time, but most of them are not a priority. You could also take this opportunity to write your own scheduling policy, either from scratch or using dsched_fq as a base, as it quite nicely tracks all processes/threads/bufs in the system, which is required for most modern I/O scheduling policies.

--

The work basically consists of 4 parts:
- General system interfacing
- I/O scheduler framework (dsched)
- I/O scheduler fair queuing policy (dsched_fq or fq)
- userland tools (dschedctl and ionice)

--

By default you still won't notice any difference, as the default scheduler is the so-called noop (no operation) scheduler, which emulates our previous behaviour. This can be confirmed with dschedctl -l:

# dschedctl -l
cd0   =>  noop
da0   =>  noop
acd0  =>  noop
ad0   =>  noop
fd0   =>  noop
md0   =>  noop

--

To enable the fq policy on a disk you have two options:

1) Set dsched_pol_{diskname}="fq" in /boot/loader.conf; e.g. if it should be enabled for da0, then dsched_pol_da0="fq". You can also apply the fq policy to all disks of a certain type (dsched_pol_da) or to all disks (dsched_pol). Note that sernos are not supported (yet).

2) Use dschedctl:

# dschedctl -s fq -d da0
Switched scheduler policy of da0 successfully to fq.

After this, dschedctl -l should list the scheduler of da0 as fq.

Another use of dschedctl is to list the available scheduling policies, which is of limited use right now, but I'll show its use anyway:

# dschedctl -p
> noop
> fq

--

The ionice priority is similar to nice, but the priority levels range from 0 to 10, and unlike the usual nice, 10 is the highest priority and 0 the lowest. Usage is exactly the same as nice:

# ionice -n 10 sh read_big

--

A brief description of the inner workings of the fq policy follows:
- All requests (bios) are let through by default, without any queuing.
- For each process/thread in the system, the average latency of its I/O and its tps are calculated.
- A thread that runs every few hundred milliseconds checks whether the disk's bandwidth is fully used; if so, it allocates a fair share of the maximum number of transactions to each process/thread in the system that is doing I/O, taking the latency and the tps into account. Processes/threads exceeding their share are rate limited to a certain number of tps (see the first sketch after this list).
- Processes/threads sharing an ioprio each get an equal slice of the pie.
- Once a process/thread is rate limited, only the allotted number of bios goes through. All bios exceeding the fair share of the process/thread in the scheduling time quantum are queued in a per-process/thread queue. Reads are queued at the front while writes are added to the back.
- Before dispatching a new bio for a process/thread, its queue is checked; if it is non-empty, the queued bios are dispatched before the new one.
- A dispatcher thread runs every ~20 ms and dispatches bios for all processes/threads that have queued bios, up to the maximum number allowed (see the second sketch after this list).
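To make the balancing step a bit more concrete, here is a minimal, self-contained C sketch of the idea. This is NOT the actual dsched_fq code: all names (fq_thread_ctx, fq_balance), the 90% busy threshold and the ioprio weighting are invented for illustration, it runs in userspace rather than in the kernel, and it only models the tps side of the calculation, not the latency feedback.

/*
 * Illustrative userspace model of the fq balancing step -- NOT the real
 * dsched_fq code.  All names, the 90% busy threshold and the ioprio
 * weighting are assumptions for illustration only; the real policy also
 * takes the measured latency into account and runs inside the kernel.
 */
#include <stdio.h>
#include <stddef.h>

struct fq_thread_ctx {
	int	ioprio;		/* ionice level, 0 (lowest) .. 10 (highest) */
	int	issued_tps;	/* transactions issued in the last period */
	int	max_tps;	/* -1 = not rate limited */
};

/*
 * Run once per balancing period.  If the disk was (nearly) saturated,
 * give every thread doing I/O a share of the observed transaction rate
 * proportional to its ioprio; threads sharing an ioprio get equal shares.
 */
static void
fq_balance(struct fq_thread_ctx *tds, size_t ntds, int disk_busy_pct)
{
	int total_tps = 0, total_weight = 0;
	size_t i;

	for (i = 0; i < ntds; i++) {
		total_tps += tds[i].issued_tps;
		total_weight += tds[i].ioprio + 1;	/* +1 avoids a zero weight */
	}

	for (i = 0; i < ntds; i++) {
		if (disk_busy_pct < 90 || total_weight == 0)
			tds[i].max_tps = -1;	/* disk not full: no limit */
		else
			tds[i].max_tps = total_tps *
			    (tds[i].ioprio + 1) / total_weight;
	}
}

int
main(void)
{
	struct fq_thread_ctx tds[2] = {
		{ .ioprio = 5, .issued_tps = 400 },	/* default priority */
		{ .ioprio = 10, .issued_tps = 100 },	/* ionice -n 10 */
	};

	fq_balance(tds, 2, 100);	/* pretend the disk was 100% busy */
	printf("rate limits: %d tps and %d tps\n",
	    tds[0].max_tps, tds[1].max_tps);
	return (0);
}

In this toy example the ionice -n 10 thread ends up with roughly twice the limit of the default-priority one, which is the "equal slice per ioprio" rule from the list above in miniature.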
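In the same spirit, here is a rough sketch of the per-thread queueing and the periodic dispatcher. Again, the names (fake_bio, fq_queue, fq_dispatch) are invented, a trivial struct stands in for the kernel's struct bio, and the "hand the bio to the disk" step is elided; it only illustrates queueing reads at the front and writes at the back, draining queued bios before new ones, and dispatching up to the current limit.

/*
 * Illustrative model of the per-thread bio queue and the ~20 ms
 * dispatcher -- NOT the real dsched_fq code.
 */
#include <sys/queue.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_bio {
	bool			is_read;
	TAILQ_ENTRY(fake_bio)	link;
};

struct fq_thread_ctx {
	int			max_tps;	/* -1 = not rate limited */
	int			issued_tps;	/* issued in this period */
	TAILQ_HEAD(, fake_bio)	queue;		/* bios held back */
};

/* Called every ~20 ms by the dispatcher thread for each thread context. */
static void
fq_dispatch(struct fq_thread_ctx *tdctx)
{
	struct fake_bio *bio;

	while ((bio = TAILQ_FIRST(&tdctx->queue)) != NULL) {
		if (tdctx->max_tps >= 0 && tdctx->issued_tps >= tdctx->max_tps)
			break;		/* limit reached, try again next tick */
		TAILQ_REMOVE(&tdctx->queue, bio, link);
		tdctx->issued_tps++;
		/* ... here the real code would pass the bio on to the disk ... */
	}
}

/* Called when a thread issues a new bio. */
static void
fq_queue(struct fq_thread_ctx *tdctx, struct fake_bio *bio)
{
	/* Bios queued earlier go out before any new bio. */
	if (!TAILQ_EMPTY(&tdctx->queue))
		fq_dispatch(tdctx);

	if (tdctx->max_tps >= 0 && tdctx->issued_tps >= tdctx->max_tps) {
		/* Over the rate limit: hold the bio back, reads up front. */
		if (bio->is_read)
			TAILQ_INSERT_HEAD(&tdctx->queue, bio, link);
		else
			TAILQ_INSERT_TAIL(&tdctx->queue, bio, link);
		return;
	}
	tdctx->issued_tps++;
	/* ... here the real code would pass the bio on to the disk ... */
}

int
main(void)
{
	struct fq_thread_ctx tdctx = { .max_tps = 2 };
	struct fake_bio bios[4] = {
		{ .is_read = true }, { .is_read = false },
		{ .is_read = true }, { .is_read = false },
	};
	int i;

	TAILQ_INIT(&tdctx.queue);
	for (i = 0; i < 4; i++)
		fq_queue(&tdctx, &bios[i]);	/* two go through, two are queued */
	tdctx.issued_tps = 0;			/* new period begins */
	fq_dispatch(&tdctx);			/* drains the queued bios */
	printf("queue empty: %d\n", TAILQ_EMPTY(&tdctx.queue));
	return (0);
}

Resetting issued_tps in main() stands in for the start of a new balancing period; in the real policy the limits themselves are also recomputed at that point, as described above.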
--

Recent changes:
- Changed the algorithm that estimates the disk usage percentage. It is now done properly, by measuring the time the disk spends idle in one balancing period.
- Thanks to the previous change, I have also been able to add a feedback mechanism that tries to dispatch more requests if the disk becomes idle, even if all processes have already reached their rate limit, by increasing the limit if needed.
- Moved the heavier balancing calculations out of the fq_balance thread and into the context of the processes/threads that do the I/O, as far as possible. Some of the heavy balancing calculations will still occur in the dispatch thread instead of the issuing context. (Thanks to Aggelos for the idea.)

--

Recent bugfixes:
- The issue that existed before with a panic after a few live policy switches has been fixed.
- Write-only performance has also been improved since the previous version; when only writes are occurring, the full disk bandwidth is now used.
- Several other panics, mostly related to int64 overflows :)

--

There are some other interesting tools/settings, mainly:

sysctl kern.dsched_debug: the higher the level, the more debug output you'll get. By default no debug output is printed. At level 4 only the disk busy-% is printed, and at 7 all details about the balancing are shown. If you hit a bug, it would almost certainly be helpful if you could provide the relevant dsched debug output at level 7.

test/dsched_fq: if you build fqstats (just run 'make' in this directory), you'll be able to read some of the statistics that dsched_fq keeps track of, such as the number of allocated structures of each type, the number of processes/threads that were rate limited, and the number of issued, completed and cancelled transactions.

Cheers,
Alex Hornung