>> From: Stephen Bates <[email protected]>
>> 
>> Hybrid polling currently uses half the average completion time as an
>> estimate of how long to poll for. We can improve upon this by noting
>> that polling before the minimum completion time makes no sense. Add a
>> sysfs entry to use this fact to improve CPU utilization in certain
>> cases.
>> 
>> At the same time the minimum is a bit too long to sleep for since we
>> must factor in OS wake time for the thread. For now allow the user to
>> set this via a second sysfs entry (in nanoseconds).
>> 
>> Testing this patch on Intel Optane SSDs showed that using the minimum
>> rather than half reduced CPU utilization from 59% to 38%. Tuning
>> this via the wake time adjustment allowed us to trade CPU load for
>> latency. For example
>> 
>> io_poll  delay  hyb_use_min  adjust  latency  CPU load
>> 1        -1     N/A          N/A     8.4      100%
>> 1        0      0            N/A     8.4      57%
>> 1        0      1            0       10.3     34%
>> 1        9      1            1000    9.9      37%
>> 1        0      1            2000    8.4      47%
>> 1        0      1            10000   8.4      100%
>> 
>> Ideally we will extend this to auto-calculate the wake time rather
>> than have it set by the user.
>
> I don't like this, it's another weird knob that will exist but that
> no one will know how to use. For most of the testing I've done
> recently, hybrid is a win over busy polling - hence I think we should
> make that the default. 60% of mean has also, in testing, been shown
> to be a win. So that's an easy fix/change we can consider.

I do agree that this is a hard knob to tune. I am, however, not happy that
the current hybrid default may mean we start polling well before the minimum
completion time. That just seems like a waste of CPU resources to me. I also
agree that turning on hybrid as the default, and perhaps bumping the default
sleep fraction up (e.g. to 60% of the mean), is a good idea.
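
To make the difference between the two estimates concrete, here is a rough
sketch of the sleep-time choice. It is an illustration only, not the patch's
code: the function and parameter names are made up, and clamping at zero is
an assumption about how an over-large adjustment would be handled.

```python
def hybrid_sleep_ns(mean_ns, min_ns, use_min=False, adjust_ns=0):
    """Return how long to sleep (in ns) before starting to poll.

    mean_ns   -- average observed completion time (hypothetical input)
    min_ns    -- minimum observed completion time (hypothetical input)
    use_min   -- stand-in for the proposed "use minimum" sysfs switch
    adjust_ns -- stand-in for the proposed wake-time adjustment, in ns
    """
    if use_min:
        # Sleep up to the fastest completion seen, minus an allowance for
        # the time it takes the OS to wake the thread again.
        # Assumption: never return a negative sleep time.
        return max(min_ns - adjust_ns, 0)
    # Current behaviour as described above: half the average completion time.
    return mean_ns // 2


if __name__ == "__main__":
    print(hybrid_sleep_ns(8000, 6000))               # half of the mean
    print(hybrid_sleep_ns(8000, 6000, True, 2000))   # minimum minus adjust
    print(hybrid_sleep_ns(8000, 6000, True, 10000))  # clamped to zero
```

With an adjustment larger than the minimum the sleep time collapses to zero,
which amounts to plain busy polling.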

> To go beyond that, I'd much rather see us tracking the time waste.
> If we consider the total completion time of an IO to be A+B+C, where:
>
> A     Time needed to go to sleep
> B     Sleep time
> C     Time needed to wake up
>
> then we could feasibly track A+C. We already know how long the IO
> will take to complete, as we track that. At that point we'd have
> a full picture of how long we should sleep.

Yes, this is where I was thinking of taking this functionality in the long 
term. It seems like tracking C is something other parts of the kernel might 
need. Does anyone know of any existing code in this space?
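
As a rough sketch of the A+B+C accounting, one could measure how much longer
each sleep took than requested and subtract a running average of that from
the expected completion time. Everything here is hypothetical: the class
name, the moving-average weight, and the idea of lumping A and C into one
number are assumptions for illustration, not existing kernel code.

```python
class SleepOverheadTracker:
    """Track A+C (time to go to sleep plus time to wake up) as one value."""

    def __init__(self, weight=0.125):
        # Assumption: an exponentially weighted moving average is good
        # enough to smooth the per-sleep overhead samples.
        self.weight = weight
        self.overhead_ns = 0.0

    def record(self, requested_sleep_ns, actual_elapsed_ns):
        """Record one sleep: B was requested, A+B+C actually elapsed."""
        sample = max(actual_elapsed_ns - requested_sleep_ns, 0)
        self.overhead_ns += self.weight * (sample - self.overhead_ns)

    def sleep_for(self, expected_completion_ns):
        """Return B such that A+B+C lands near the expected completion."""
        return max(expected_completion_ns - self.overhead_ns, 0)


if __name__ == "__main__":
    tracker = SleepOverheadTracker(weight=0.5)
    tracker.record(requested_sleep_ns=4000, actual_elapsed_ns=6000)
    tracker.record(requested_sleep_ns=4000, actual_elapsed_ns=6000)
    print(tracker.overhead_ns)
    print(tracker.sleep_for(8000))
```

The same tracked value is what one would want to hand to the scheduler or
idle code, since it separates the requested sleep from the cost of sleeping.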

> Bonus points for informing the lower level scheduler of this as
> well. If the CPU is going idle, we'll enter some sort of power
> state in the processor. If we were able to pass in how long we
> expect to sleep, we could be making better decisions here.

Yup. Again, this seems like something more general than just the block layer. I
will do some digging and see if anything is available to leverage here.

Cheers
Stephen


